Understanding Ceph and Its Place in the Market
Last month, the Ceph community released version 10.2.1, its first set of bug fixes for the 10.2 Jewel release. This is good news, but what is Ceph? Ceph is a software-defined distributed object storage solution. Created by Sage Weil and later developed commercially by Inktank (founded in 2012), the project came under Red Hat's wing when Red Hat, Inc., acquired Inktank in 2014. It is open source and licensed under the GNU Lesser General Public License (LGPL).
In the world of Ceph, data is treated and stored as objects. This is unlike traditional (and legacy) data storage solutions, where data is written to and read from storage volumes via sectors at sector offsets (often referred to as blocks). When dealing with large amounts of data, treating it as objects is far easier to manage and scale. In fact, this is how the cloud functions: with objects. This object-driven model allows Ceph to scale out easily to meet consumer demand. These objects are replicated across an entire cluster of nodes, giving Ceph its fault tolerance and further reducing single points of failure.
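To make the object model concrete, here is a minimal sketch using the python-rados bindings that ship with Ceph. It assumes a running cluster, a readable /etc/ceph/ceph.conf with a client keyring, and an existing pool; the pool and object names below are only examples:

    import rados

    # Connect to the cluster using the local Ceph configuration and keyring.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Open an I/O context on an existing pool (assumed here to be named 'data').
    ioctx = cluster.open_ioctx('data')

    # Store and retrieve a named object -- no sectors or block offsets involved.
    ioctx.write_full('greeting', b'Hello, Ceph!')
    print(ioctx.read('greeting'))

    ioctx.close()
    cluster.shutdown()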
With the Firefly release, the Ceph community added support for erasure-coded pools. This means that instead of keeping full replicas of every object and consuming two, three or more times the original capacity, the object is run through an erasure-coding algorithm that splits it into data chunks plus a smaller number of computed coding chunks; the original can be rebuilt from a subset of those chunks, so the pool tolerates failures while consuming less raw capacity. This does come at a bit of a performance cost. Ceph has also been designed to self-heal and self-manage. All of this happens at a lower level and is transparent to the user or a client application.
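As a rough illustration of the trade-off (the k and m values below are example parameters, not a recommendation), a short calculation shows how erasure coding compares to straight replication:

    # Rough storage-overhead comparison; k and m are example parameters only.
    k, m = 4, 2              # data chunks and coding chunks in an EC profile
    replicas = 3             # typical replicated-pool size

    ec_overhead = (k + m) / k        # raw capacity consumed per byte stored
    print(f"erasure coding (k={k}, m={m}): {ec_overhead:.2f}x raw capacity, "
          f"tolerates {m} lost chunks")
    print(f"replication (size={replicas}): {replicas:.2f}x raw capacity, "
          f"tolerates {replicas - 1} lost copies")

In this example, the erasure-coded pool consumes half the raw capacity of three-way replication while still surviving two simultaneous failures, at the cost of extra computation on every write and recovery.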
For accessibility, Ceph exposes three interfaces into user space. The first is an object store, accessible via a RESTful interface that supports both the OpenStack Swift and Amazon Simple Storage Service (S3) APIs. Through this method, a web application can send PUT, GET and DELETE requests directly to the object store without having to rewrite application code or worry about where the object gets stored.
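For example, an application that already speaks S3 can simply point its client at the Ceph RADOS Gateway. The sketch below uses the boto3 library; the endpoint URL, credentials and bucket name are placeholders you would replace with your own:

    import boto3

    # Endpoint, credentials and bucket name below are placeholders for illustration.
    s3 = boto3.client('s3',
                      endpoint_url='http://rgw.example.com:7480',
                      aws_access_key_id='ACCESS_KEY',
                      aws_secret_access_key='SECRET_KEY')

    s3.create_bucket(Bucket='demo-bucket')

    # Standard PUT, GET and DELETE calls -- no Ceph-specific code in the application.
    s3.put_object(Bucket='demo-bucket', Key='hello.txt', Body=b'Hello from Ceph!')
    print(s3.get_object(Bucket='demo-bucket', Key='hello.txt')['Body'].read())
    s3.delete_object(Bucket='demo-bucket', Key='hello.txt')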
The second interface is a thinly provisioned block device. The goal here is to let Ceph slide right into existing computing environments: applications and virtual environments accessing file/block volumes do not need to be re-architected, yet they can still leverage most of the features, functionality and resiliency that Ceph has to offer. Thanks to Ceph's object-based model, both the block device and the filesystem interface (see below) layered above it are well equipped for snapshots, cloning and better load balancing.
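Here is a minimal sketch of the block interface using the python-rbd bindings, assuming a pool named 'rbd' already exists; the pool, image and snapshot names are just examples:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')   # assumes a pool named 'rbd' exists

    # Create a thinly provisioned 4 GiB image; space is consumed only as it is written.
    rbd.RBD().create(ioctx, 'demo-image', 4 * 1024**3)

    image = rbd.Image(ioctx, 'demo-image')
    image.write(b'hello, block device', 0)   # write at offset 0
    image.create_snap('first-snapshot')      # point-in-time snapshot of the image
    image.close()

    ioctx.close()
    cluster.shutdown()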
The third and final interface is a filesystem. Although, at the end of the day, the filesystem provides much of the same accessibility and functionality as the block device, in the Ceph implementation the built-in filesystem (CephFS) removes the block device layer entirely (reducing the number of stacked layers) and connects directly to the object store back end. This simplifies maintenance and debugging.
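A rough sketch of that direct path using the python-cephfs (libcephfs) bindings, which talk to the metadata servers and object store without a block device in between. Treat the exact method signatures as an assumption and check them against the bindings installed on your system:

    import os
    import cephfs

    # Requires a running metadata server (MDS) and an existing CephFS filesystem.
    fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
    fs.mount()

    # Create and write a small file directly through libcephfs.
    fd = fs.open(b'/hello.txt', os.O_CREAT | os.O_WRONLY, 0o644)
    fs.write(fd, b'Hello, CephFS!', 0)
    fs.close(fd)

    fs.shutdown()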
As it is released today, Ceph is managed entirely from the command line. Red Hat redistributes Ceph with a web-based management user interface called Calamari, which simplifies general Ceph administration. Calamari ships with server and client components; the client component provides the web-based dashboard and communicates with the server via a RESTful API.
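Because the dashboard talks to the Calamari server over REST, the same API can be scripted directly. The sketch below uses the requests library; the host, credentials and the /api/v2/... endpoint paths are assumptions based on the Calamari API layout and may differ in your deployment:

    import requests

    # Host, credentials and endpoint paths below are assumptions for illustration.
    calamari = 'http://calamari.example.com'
    session = requests.Session()

    # Authenticate against the Calamari server, then list the clusters it knows about.
    session.post(calamari + '/api/v2/auth/login',
                 data={'username': 'admin', 'password': 'admin'})
    clusters = session.get(calamari + '/api/v2/cluster').json()
    print(clusters)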
Now, while Ceph itself solves many industry problems, especially around how data is managed and scaled, it is only a piece of a much larger puzzle. Ceph is designed to handle two things: 1) it provides fault tolerance by distributing data (replicated or erasure-coded) across a cluster of nodes, and 2) it provides user access to that same data. What happens above and below this is entirely up to the storage administrator. For instance, below the Ceph framework, how is the hardware monitored? How are drive failures detected and corrected? Above the framework, how are block and filesystem volumes exported? How is high availability to those same volumes achieved?
This is where software redistributors come into the picture. Vendors such as Red Hat, SUSE, Canonical (Ubuntu) and others glue all of these pieces together and unify them under a single management umbrella. Some are further along at this than others. Adding further credibility, many big names in the data storage industry have jumped on the Ceph train, including SanDisk, SolidFire (now part of NetApp) and more, each using Ceph in some form or another.
Ceph also has some pretty strong commercial competition. To name a few, in the on-premises cloud or hybrid space there are Cleversafe (an IBM company), Amplidata (now HGST, a Western Digital company), Scality, Amazon (AWS), Microsoft (Azure) and Google (Google Cloud Storage).
I see a strong and promising future for Ceph. Sure, like any other storage solution, it doesn't address every data storage need, but it's here, and it's yet another contender in the software-defined storage arena.