Hands-On With Ceph

Object Storage, Block Storage, Filesystem & More

LinuxCon Europe 2012

Barcelona, Catalunya, España

Nov 7, 2012

Who the $?#% am I?

Florian Haas

florian@hastexo.com


www.hastexo.com

Native Object Storage

Block Storage

ReSTful Storage

Distributed Filesystem

Ceph is based on a distributed, autonomic, redundant native object store named RADOS.

Reliable
Autonomic
Distributed
Object
Store

RADOS is a flat namespace.

Each object has a name, any number of attributes, and a payload of (almost) arbitrary size.

Objects are assigned to Placement Groups (PGs).

Each PG has an ordered list of Object Storage Devices (OSDs) where its contents are stored in a redundant fashion.

Object placement is entirely algorithmic.

There is no central lookup or distributed hashtable.
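
As a greatly simplified sketch of the idea (this is not the actual CRUSH algorithm, and the PG count is an assumption), any client can compute an object's placement group from its name alone:

```python
import hashlib

PG_COUNT = 128  # assumed number of placement groups in the pool


def pg_for_object(name):
    """Map an object name to a PG purely by hashing -- no lookup table needed."""
    digest = hashlib.md5(name.encode()).hexdigest()
    return int(digest, 16) % PG_COUNT


# Every client independently arrives at the same, deterministic answer.
print(pg_for_object("greeting"))
```

Nothing has to be looked up centrally; every client with the current cluster map computes the same placement.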

Now what is so special about that?

Think of how you checked into your hotel when you arrived here.

Photo credit: NCinDC CC-BY 2.0

You probably went here:

Photo credit: prayitno CC-BY 2.0

And got one of these:

Photo credit: Braden Kowitz (kowitz) CC-BY-SA 2.0

Front desk, key card: central data lookup (with caching)

Works just fine for a small hotel.

What if our hotel is huge?

We could add more front desks, and hire more people.

Doesn't work too well.

We could also build several identical buildings, and assign guests on a pseudo-random basis.

Several buildings, random assignment: distributed, partitioned hashtable

But what if our hotel was

gigantic?

Like, a billion rooms?

Room numbers? Meaningless.

# 156,398,481

Is your room even still there?

What are the odds that it's on fire, right now?

Remember:
at scale, something always fails.

What we really need is:

(1) Something you already know about yourself

Photo credit: Sean Hagen (rebelcan) CC-BY 2.0

(2) A system that takes that information and automatically guides you to your room.

So you no longer care where your room is.

Photo credit: schnaars CC-BY 2.0

(3) Robots that automatically move all your stuff when you can't enter your room.

Think housekeeping.

(4) Magic replicating minions that duplicate all your things and store them safely elsewhere, as soon as you've entered your room.

Because you don't want to lose them in a fire.

And Ceph does exactly that for you.

All of it.

Controlled
Replication
Under
Scalable
Hashing

All OSDs know about and can propagate the current map describing object placement.

Monitor servers (MONs) arbitrate the cluster status and act as authorities for the placement map.

They use a distributed consensus protocol based on Paxos.

Both MONs and OSDs operate entirely in userspace.
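
A minimal sketch of what that means for a client (assuming the Python rados bindings and a readable /etc/ceph/ceph.conf): the client contacts a MON, obtains the current maps, and from then on talks to the OSDs directly.

```python
import rados

# Connect to the cluster: the client asks a MON for the current maps.
# The config file path is an assumption; adjust to your deployment.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Once connected, cluster-wide information is available to any client.
print("fsid:", cluster.get_fsid())
print("stats:", cluster.get_cluster_stats())

cluster.shutdown()
```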

Enough talk.

Let's take a look.

daisy

eric

frank

Applications can interact with RADOS using a number of APIs.
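
The lowest-level of these is librados. A minimal sketch using its Python binding (the pool name 'data' and the config path are assumptions):

```python
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# An I/O context is bound to one pool -- 'data' is assumed to exist.
ioctx = cluster.open_ioctx('data')

# An object is just a name, a payload, and optional attributes.
ioctx.write_full('greeting', b'Hello, RADOS!')
ioctx.set_xattr('greeting', 'lang', b'en')
print(ioctx.read('greeting'))

ioctx.close()
cluster.shutdown()
```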

Ceph also ships with several high-level client layers built on top of RADOS.

RADOS Block Device (RBD) is a thin-provisioned block device interface that stripes data across multiple RADOS objects.

It supports cheap, read-only redirect-on-write snapshots.

RBD also supports efficient cloning.

This makes it very well suited for maintaining template-based virtual machines.

RBD comes in two flavors.

rbd is a kernel-level block device driver merged upstream in Linux 2.6.37.

qemu-rbd is a userspace storage driver for Qemu and KVM.

It is built on the librados C API.
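
For a rough idea of how an RBD image can be driven from userspace, here is a sketch using the Python rbd bindings (pool and image names are assumptions):

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')  # 'rbd' is the assumed pool name

# Create a 1 GiB image; it is thin-provisioned, so no space is used yet.
rbd.RBD().create(ioctx, 'demo-image', 1024 * 1024 * 1024)

# Open the image and write at an arbitrary offset, as a block device would.
image = rbd.Image(ioctx, 'demo-image')
image.write(b'boot sector goes here', 0)
print("image size:", image.size())
image.close()

ioctx.close()
cluster.shutdown()
```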

Again, let's take a peek.

alice

daisy

Ceph provides ReSTful HTTP(S) access to the object store.

It does so through a FastCGI application, radosgw.

radosgw uses the libradospp C++ API.

radosgw runs in any web server that supports FastCGI.

The canonical deployment approach is with Apache and mod_fastcgi.

radosgw currently understands the Amazon S3 and OpenStack Swift APIs.
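
A quick sketch of talking to radosgw over its S3 API with boto (the endpoint and credentials are assumptions; a real key pair comes from creating a radosgw user):

```python
import boto
import boto.s3.connection

# Endpoint and credentials are assumptions -- substitute your own
# radosgw host and a user's access/secret keys.
conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
    host='radosgw.example.com',
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

bucket = conn.create_bucket('demo-bucket')
key = bucket.new_key('hello.txt')
key.set_contents_from_string('Hello from radosgw')

for obj in bucket.list():
    print(obj.name, obj.size)
```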

radosgw supports native load balancing and scaleout.

Aaaaaaand... you know what's next.

alice

bob

charlie

And now.

Yeah, you've been waiting for this. I know.

ceph is a distributed filesystem built on top of RADOS.

It's been in the mainline kernel since 2.6.34.

Its goal:

An HPC filesystem similar to Lustre, without its shortcomings.

ceph layers POSIX semantics on top of RADOS.

It introduces directories, attributes, permission bits and everything else that a POSIX filesystem needs.

All filesystem metadata itself lives in RADOS objects.

Ceph uses another type of daemon, a metadata server (MDS), to manage this metadata.

Just like all other Ceph daemons, the MDS runs entirely in userspace.

Only the filesystem client runs in the kernel.

ceph mounts are writable from any client and play nicely with flock() and fcntl() locking.
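
A tiny sketch of cooperative locking across clients (the mount point /mnt/ceph is an assumption):

```python
import fcntl

# Assumes the ceph filesystem is mounted at /mnt/ceph on this client.
with open('/mnt/ceph/shared.lock', 'w') as f:
    fcntl.flock(f, fcntl.LOCK_EX)   # clients on other hosts block here
    f.write('held by this client\n')
    fcntl.flock(f, fcntl.LOCK_UN)
```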

ceph supports snapshots of arbitrary directories.

reflink() is currently unsupported.

ceph also has spiffy accounting and statistics support through virtual extended attributes.
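
For example, recursive space accounting for a directory can be read straight from its virtual extended attributes. A sketch, assuming the filesystem is mounted at /mnt/ceph and Python 3.3+ for os.getxattr:

```python
import os

path = '/mnt/ceph/some/directory'  # assumed mount point and directory

# Virtual xattrs are computed by the MDS; no recursive walk is needed.
rbytes = os.getxattr(path, 'ceph.dir.rbytes').decode().strip()
rfiles = os.getxattr(path, 'ceph.dir.rfiles').decode().strip()
print('recursive size: %s bytes in %s files' % (rbytes, rfiles))
```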

And now for my final demo of the day...

alice

bob

charlie

That's it!

Admit it, you're geeked out now.

If you're not, you have no soul.

Thanks to:

Sage Weil @liewegas & crew for Ceph

Bartek Szopka @bartaz for impress.js

Markus Gutschke for shellinabox (and his recipe collection!)

Inktank @inktank for the Ceph logo

This talk:

http://www.hastexo.com/lceu2012

https://github.com/fghaas/lceu2012