Ceph – Technical Deep Dive
Ceph is a distributed, software-defined storage system designed to provide scalable, fault-tolerant storage without a centralized metadata bottleneck. It is built around the RADOS object store, which serves as the foundation for all storage interfaces.
Core Architecture
RADOS (Reliable Autonomic Distributed Object Store)
RADOS is the lowest layer of Ceph and is responsible for:
- storing data as objects,
- replicating or erasure-coding data,
- ensuring strong consistency,
- handling recovery and rebalancing automatically.
Each object is stored on multiple OSDs (Object Storage Daemons) according to the configured replication or erasure coding rules.
OSDs
An OSD is a daemon responsible for:
- managing a physical disk or logical volume,
- serving read/write requests,
- replicating data to peer OSDs,
- performing data recovery and backfilling,
- maintaining object metadata and OMAP key-value data.
OSDs communicate directly with each other and clients using an internal Ceph protocol over TCP.
MON (Monitor)
Monitors maintain the cluster maps, including:
- monitor map (MONs),
- OSD map,
- CRUSH map,
- MDS map.
MONs use the Paxos consensus algorithm to maintain strong consistency of cluster state. A quorum of MONs is required for the cluster to operate.
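The quorum requirement above is simple majority arithmetic, sketched below (a minimal illustration, not Ceph code; the helper names are invented for this example):

```python
# Sketch: a Paxos-style quorum requires a strict majority of monitors
# to agree before cluster state (the maps) can change.
def quorum_size(num_mons: int) -> int:
    """Smallest number of MONs that forms a majority."""
    return num_mons // 2 + 1

def has_quorum(num_mons: int, reachable: int) -> bool:
    return reachable >= quorum_size(num_mons)

# With 5 MONs the cluster survives 2 monitor failures but not 3:
assert has_quorum(5, 3)      # 3 of 5 reachable: quorum holds
assert not has_quorum(5, 2)  # 2 of 5 reachable: no quorum
```

This is why MON counts are odd in practice: 5 MONs tolerate two failures, while 4 MONs still tolerate only one.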
MGR (Manager)
Managers provide:
- cluster telemetry and statistics,
- background tasks and orchestration hooks,
- REST APIs and dashboards,
- support for modules such as Prometheus, alerts, and balancer.
MGRs do not participate in the data path but are critical for observability and automation.
Data Placement and CRUSH
Ceph uses the CRUSH (Controlled Replication Under Scalable Hashing) algorithm to calculate object placement deterministically. Clients compute object locations locally, eliminating centralized lookup services.
CRUSH takes into account:
- failure domains (disk, host, rack, row, datacenter),
- weights and rules,
- replication or erasure coding profiles.
This design enables linear scalability and predictable failure behavior.
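The key property can be shown with a toy stand-in for CRUSH. Real CRUSH walks a weighted hierarchy with straw2 buckets; the sketch below substitutes rendezvous hashing (an assumption made for brevity, with invented host names) but preserves the essential idea: placement is a pure function of the object ID and the cluster membership, so every client computes the same answer with no lookup service.

```python
import hashlib

# Toy deterministic placement: rank hosts by a per-object hash and take
# the top N as replica locations. One replica per host, so "host" is the
# failure domain here. NOT real CRUSH, just the same determinism.
def place(object_id: str, hosts: list[str], replicas: int = 3) -> list[str]:
    def score(host: str) -> int:
        digest = hashlib.sha256(f"{object_id}:{host}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return sorted(hosts, key=score, reverse=True)[:replicas]

hosts = ["host-a", "host-b", "host-c", "host-d", "host-e"]
mapping = place("rbd_data.1234", hosts)

# Deterministic: any client with the same map computes the same result.
assert mapping == place("rbd_data.1234", hosts)
assert len(set(mapping)) == 3  # three distinct failure domains
```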
Placement Groups (PGs)
Objects are grouped into placement groups (PGs), which act as logical containers that map to a set of OSDs.
PGs:
- distribute data evenly across OSDs,
- reduce metadata overhead,
- simplify recovery and rebalancing operations.
The number of PGs directly affects performance, recovery speed, and cluster stability.
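The two-step mapping can be sketched as follows. Real Ceph hashes object names with rjenkins and applies a "stable mod"; this simplification uses CRC32 (an assumption for the sake of a runnable example, with an invented pool ID):

```python
import zlib

# Step 1: object name hashes to a PG within its pool.
# Step 2 (not shown): CRUSH maps the PG, not the object, to OSDs.
def object_to_pg(pool_id: int, object_name: str, pg_num: int) -> str:
    h = zlib.crc32(object_name.encode())  # stand-in for Ceph's rjenkins hash
    return f"{pool_id}.{h % pg_num:x}"

# Many objects collapse onto pg_num placement units, so peering and
# recovery track thousands of PGs instead of billions of objects.
pgs = {object_to_pg(2, f"obj-{i}", pg_num=128) for i in range(10_000)}
assert len(pgs) <= 128
```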
Consistency and Replication
Ceph provides strong consistency:
- the primary OSD of a PG coordinates each write and forwards it to the replica OSDs,
- writes are acknowledged to the client only after every OSD in the acting set has persisted the operation,
- write-ahead logging (FileStore's journal, BlueStore's internal WAL) provides crash consistency.
For erasure-coded pools, Ceph splits objects into data and parity chunks and reconstructs missing data on failure.
Recovery and Self-Healing
When an OSD or node fails:
- affected PGs are marked degraded,
- data is automatically re-replicated or reconstructed,
- CRUSH recalculates placement dynamically,
- recovery traffic is throttled to avoid impacting client I/O.
Once the failed component returns, Ceph rebalances data back to optimal placement.
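The recovery behavior can be illustrated with the same rendezvous-style placement sketch (again an invented stand-in for CRUSH, with hypothetical host names): dropping a failed host from the candidate set moves only the replicas that lived on it, while the surviving replicas stay put.

```python
import hashlib

# Toy deterministic placement (see CRUSH section): top-N hosts by
# per-object hash. Removing a host does not change the scores of the
# others, so only the failed host's replica is re-placed.
def place(object_id: str, hosts: list[str], replicas: int = 3) -> list[str]:
    def score(host: str) -> int:
        d = hashlib.sha256(f"{object_id}:{host}".encode()).digest()
        return int.from_bytes(d[:8], "big")
    return sorted(hosts, key=score, reverse=True)[:replicas]

hosts = ["host-a", "host-b", "host-c", "host-d", "host-e"]
before = place("pg.2.7f", hosts)
failed = before[0]
after = place("pg.2.7f", [h for h in hosts if h != failed])

assert failed not in after                   # lost replica re-placed elsewhere
assert set(before) - {failed} <= set(after)  # surviving replicas stay put
```

This minimal-movement property is what keeps recovery traffic proportional to the failed capacity rather than to total cluster size.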
Storage Interfaces
Ceph exposes multiple storage interfaces built on RADOS:
RADOS Block Device (RBD)
- distributed block storage,
- supports snapshots, clones, and live migration,
- commonly used with virtualization platforms.
CephFS
- POSIX-compliant distributed file system,
- metadata handled by MDS (Metadata Servers),
- supports snapshots and quotas.
RADOS Gateway (RGW)
- object storage compatible with S3 and Swift APIs,
- uses RADOS objects and OMAPs for metadata,
- supports multisite replication and lifecycle policies.
Scalability and Failure Model
Ceph is designed to:
- scale horizontally by adding OSDs and nodes,
- tolerate multiple simultaneous failures,
- avoid single points of failure,
- operate on commodity hardware.
In CAP terms the system favors consistency over availability during failures, while maintaining high durability through replication or erasure coding.
Summary
Ceph is a fully distributed storage platform that combines deterministic data placement, strong consistency, and self-healing mechanisms. Its architecture enables it to scale from small clusters to multi-petabyte deployments while maintaining predictable performance and fault tolerance.