More recommendations for Ceph and OpenStack
Posted on Thu 03 August 2017 in hints-and-kinks • 5 min read
A few months ago, we shared our Dos and Don’ts, as they relate to Ceph and OpenStack. Since that post has proved quite popular, here are a few additional considerations for your Ceph-backed OpenStack cluster.
Do configure your images for VirtIO-SCSI
By default, RBD-backed Nova instances use the virtio-blk driver to expose RBD images to the guest, either as ephemeral drives or as persistent volumes. In this default configuration, VirtIO presents a virtual PCI device to the guest that represents the paravirtual I/O bus, and devices are named /dev/vda, /dev/vdb, and so forth. VirtIO block devices are lightweight and efficient, but they come with a drawback: they don't support the discard operation.
Not being able to use discard means the guest cannot mount a filesystem with mount -o discard, and it also cannot clean up freed blocks on a filesystem with fstrim. This can be a security concern for your users, who might want to be able to really, actually delete data from within the guest (after overwriting it, presumably). It can also be an operational concern for you as the cluster operator.
This is because not supporting discard also means that RADOS objects owned by the corresponding RBD image are never removed during the image's lifetime – they persist until the whole image is deleted. So your cluster may carry the overhead of perhaps tens of thousands of RADOS objects that no-one actually cares about.
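If you want to gauge that overhead, rbd du compares an image's nominal size with the space its objects actually occupy. The pool and image names below are placeholders; substitute your own:

# compare provisioned size with actually consumed space
rbd du <pool>/<image>
# PROVISIONED is the image's nominal size; USED is the space taken up by
# RADOS objects that have been written to (and, without discard, never freed)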
Thankfully, there is an alternative VirtIO disk driver that does support discard: the paravirtualized VirtIO SCSI controller, virtio-scsi.
You enable the VirtIO SCSI controller by setting a couple of Glance image properties, namely hw_scsi_model and hw_disk_bus. You do so by running the following openstack command on your image:
openstack image set \
  --property hw_scsi_model=virtio-scsi \
  --property hw_disk_bus=scsi \
  <name or ID of your image>
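If you want to double-check that the properties stuck, you can inspect the image afterwards; the -c option simply limits the output to the properties column:

# verify the image properties
openstack image show \
  -c properties \
  <name or ID of your image>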
Then, if you boot an instance from this image, you'll see that its block device names switch from /dev/vdX to /dev/sdX, and you also get everything else you expect from a SCSI stack. For example, there's /proc/scsi/scsi, you can extract information about your bus, controller, and LUs with the lsscsi command, and so on.
It's important to note that these image properties are inherited by the instance booted from that image, which also passes them on to all volumes that you may subsequently attach to that instance. Thus, openstack server add volume will now add /dev/sdb, not /dev/vdb, and you automatically get the benefits of discard on your volumes as well.
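To confirm that discard really works end to end, you might attach a volume and then exercise discard from inside the guest. A minimal sketch (device name and mountpoint are illustrative):

# on the OpenStack client side
openstack server add volume <server> <volume>

# inside the guest: the volume shows up as /dev/sdb
mkfs.xfs /dev/sdb
mount -o discard /dev/sdb /mnt
fstrim -v /mnt             # reports how many bytes were discarded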
Do set disk I/O limits on your Nova flavors
In a Ceph cluster that acts as backing storage for OpenStack, many OpenStack VMs naturally share the bandwidth and IOPS of your whole cluster. Occasionally one of those VMs will be very busy (meaning it produces a lot of I/O), and the Ceph cluster will attempt to process that I/O to the best of its ability. Since RBD has no built-in QoS guarantees (yet), this can cause other VMs to suffer from reduced throughput, increased latency, or both.
The trouble with this is that it's almost impossible for your users to predict and reckon with. They'll see a VM that sustains, say, 10,000 IOPS at times, and then drops to 2,000 with no warning or explanation. It is much smarter to pre-emptively limit Ceph RBD performance from the hypervisor, and luckily, OpenStack Nova absolutely allows you to do that. This concept is known as instance resource quotas, and you set them via flavor properties. For example, you may want to limit a specific flavor to 1,500 IOPS and a maximum throughput of 100 MB/s:
openstack flavor set \
  --property quota:disk_total_bytes_sec=$((100<<20)) \
  --property quota:disk_total_iops_sec=1500 \
  m1.medium
In the background, these settings are handed through to libvirt and applied to the Qemu/KVM instance as per-device I/O throttles (libvirt's iotune settings) when a VM with this flavor spins up. So these limits aren't specific to RBD, but they come in particularly handy when dealing with RBD.
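You can see the result on the compute node once such an instance is running; a virsh dumpxml of the domain should contain an iotune block on the instance's disk, roughly along these lines (values correspond to the flavor above):

# on the compute node, once the instance is running
virsh dumpxml <instance UUID> | grep -A 4 '<iotune>'

# expected output, roughly:
#   <iotune>
#     <total_bytes_sec>104857600</total_bytes_sec>
#     <total_iops_sec>1500</total_iops_sec>
#   </iotune>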
Obviously, since flavors can be public, but can also be limited to specific tenants, you can set relatively low instance resource quotas in public flavors, and then make flavors with higher resource quotas available to select tenants only.
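A minimal sketch of that pattern might look like this (the flavor name, sizing, and limits are made up for illustration):

# a non-public flavor with more generous I/O limits...
openstack flavor create \
  --private \
  --vcpus 4 --ram 8192 --disk 40 \
  --property quota:disk_total_bytes_sec=$((200<<20)) \
  --property quota:disk_total_iops_sec=5000 \
  m1.highio

# ...made available to one selected project only
openstack flavor set --project <project name or ID> m1.highio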
Do differentiate Cinder volume types by disk I/O limits
In addition to setting I/O limits on flavors for VMs, you can also influence the I/O characteristics of volumes. You do so by specifying distinct Cinder volume types. Volume types are frequently used to enable users to select a specific Cinder backend (say, to stick volumes either on a NetApp box or on RBD), but it's perfectly OK to define multiple volume types using the same backend. You can then set characteristics like maximum IOPS or maximum throughput via Cinder QoS specifications. A QoS specification akin to the Nova flavor described above, limiting throughput to 100 MB/s and 1,500 IOPS, would be created like this:
openstack volume qos create \
  --consumer front-end \
  --property total_bytes_sec=$((100<<20)) \
  --property total_iops_sec=1500 \
  "100MB/s-1500iops"
You would then create a corresponding volume type, and associate the QoS spec with it:
openstack volume type create \
  --public \
  "100MB/s-1500iops"

openstack volume qos associate \
  "100MB/s-1500iops" \
  "100MB/s-1500iops"
Again, as with Nova flavors, you can make volume types public, but you can also limit them to specific tenants.
Don’t forget about suspend files
When you suspend a Nova/libvirt/KVM instance, what really happens is what libvirt calls a managed save: the instance's entire memory is written to a file, and then the KVM process shuts down. This is actually quite neat, because it means that the VM consumes no CPU cycles or memory until it is resumed, and it will continue right where it left off, even if the compute node is rebooted in the interim.
You should understand that these savefiles are not compressed in any way: if your instance has 16 GB of RAM, that's a 16 GB file that instance suspension drops into /var/lib/nova/save. This can add up pretty quickly: if a single compute node hosts something like 10 suspended instances, their combined savefile size can easily exceed 100 GB. Obviously, this can put you in a really bad spot if it fills up your /var (or worse, /) filesystem.
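A quick check of how much space suspended instances are already occupying on a compute node might look like this (assuming the save path mentioned above):

# on the compute node
du -sh /var/lib/nova/save
ls -lh /var/lib/nova/save    # one savefile per suspended instance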
Of course, if you already have a Ceph cluster, you can put it to good use here too: just mount a CephFS file system into that spot. Here's an Ansible playbook snippet that you may use as inspiration:
---
- hosts:
    - compute-nodes
  vars:
    ceph_mons:
      - ceph-mon01
      - ceph-mon02
      - ceph-mon03
    cephfs_client: cephfs
    cephfs_secret: "{{ vaulted_cephfs_secret }}"
  tasks:
    - name: "install ceph-fs-common package"
      apt:
        name: ceph-fs-common
        state: installed
    - name: "create ceph directory"
      file:
        dest: /etc/ceph
        owner: root
        group: root
        mode: '0755'
        state: directory
    - name: "create cephfs secretfile"
      copy:
        dest: /etc/ceph/cephfs.secret
        owner: root
        group: root
        mode: '0600'
        content: '{{ cephfs_secret }}'
    - name: "mount savefile directory"
      mount:
        fstype: ceph
        path: /var/lib/nova/save
        src: "{{ ceph_mons | join(',') }}:/nova/save/{{ ansible_hostname }}"
        opts: "name={{ cephfs_client }},secretfile=/etc/ceph/cephfs.secret"
        state: mounted
    - name: "fix savefile directory ownership"
      file:
        path: /var/lib/nova/save
        owner: libvirt-qemu
        group: kvm
        state: directory
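You would run this like any other playbook; note that the per-host directory within CephFS (/nova/save/<hostname> in this example) needs to exist before the mount can succeed. For instance:

# inventory and playbook file names are hypothetical
ansible-playbook -i inventory/compute.ini mount-nova-save.yml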
Got more?
Do you have Ceph/OpenStack hints of your own? Leave them in the comments below and we’ll include them in the next installment.
This article originally appeared on the hastexo.com
website (now defunct).