Ceph RBD Mirroring

From Proxmox VE

There are two possible ways to set up mirroring of RBD images to other Ceph clusters. One is using journaling, the other is using snapshots.

The journal based approach causes more load on your cluster, because each write operation is written twice: once to the journal and once to the actual image data. The target cluster reads the journal and replays it.

With snapshot mirroring, the source image is snapshotted according to a set schedule and the target cluster fetches the new snapshots.

With journal based mirroring, the target cluster may be unable to replay the journal fast enough, either because the network between the two clusters is too slow or because the target cluster itself is too slow. In that case, journal objects pile up on the source cluster. If this happens, consider switching to snapshot based mirroring.

This guide is based on the official Ceph RBD mirror documentation with specifics in mind for a hyperconverged Proxmox VE + Ceph setup.

Overview

We assume two clusters, site A and site B. The goal of this guide is to set up mirroring from A to B. To add two-way mirroring, repeat most of the steps in the other direction.

The pools on both clusters need to be named the same.

To follow this guide, use any node on site A. On site B, run the commands on the node on which the RBD mirror daemon should be running.

Note: Nodes with the RBD mirror daemon must be able to access all Ceph nodes in both clusters!
Note: KRBD does not support journal based mirroring! This means that for LXC containers you need to use snapshot mirroring. For VMs you can disable KRBD in the Proxmox VE storage configuration.
┌────────────────┐     ┌────────────────┐
│     Site A     │     │     Site B     │
│ ┌────────────┐ │     │ ┌────────────┐ │
│ │   Node 1   │ │     │ │   Node 1   │ │
│ │            │ │     │ │            │ │
│ │            │>──>──>┼─┼─RBD Mirror │ │
│ └────────────┘ │     │ └────────────┘ │
│ ┌────────────┐ │     │ ┌────────────┐ │
│ │   Node 2   │ │     │ │   Node 2   │ │
│ └────────────┘ │     │ └────────────┘ │
│ ┌────────────┐ │     │ ┌────────────┐ │
│ │   Node 3   │ │     │ │   Node 3   │ │
│ └────────────┘ │     │ └────────────┘ │
└────────────────┘     └────────────────┘

The RBD Mirror daemon is responsible for fetching the journal or snapshots from the source and applying them in the target cluster. It runs on the target cluster.

Set up users

Site A

First, we need to set up the required users. There will be two of them: one on the source cluster (site A), with which the rbd-mirror daemon on the target cluster (site B) authenticates against site A, and one with which the rbd-mirror daemon authenticates against the target cluster (site B) itself.

Let's first create the user in the source cluster (site A):

root@site-a $ ceph auth get-or-create client.rbd-mirror-peer-a mon 'profile rbd' osd 'profile rbd' -o /etc/pve/priv/site-a.client.rbd-mirror-peer-a.keyring

We need to make this file available over at the other cluster, site B. Either use SCP or copy the contents of the file manually to the following location on site B:

/etc/pve/priv/site-a.client.rbd-mirror-peer-a.keyring

The `site-a` part at the beginning defines the name by which the target cluster refers to the source cluster! If you use something else, make sure to use the same name throughout the guide!

Site B

We need to create a local user for the rbd-mirror daemon on the target cluster.

root@site-b $ ceph auth get-or-create client.rbd-mirror.$(hostname) mon 'profile rbd-mirror' osd 'profile rbd' -o /etc/pve/priv/ceph.client.rbd-mirror.$(hostname).keyring

Note: We use `$(hostname)` to match the unique ID to what is used for other Ceph services such as monitors.
Note: You can restrict the permissions to a specific pool if you write 'profile rbd pool=mypool'.

Copy Ceph config of site-a to site-b

In order for the rbd-mirror to access the Ceph cluster on site A, we need to copy over the `ceph.conf` file from site A to site B and name it correctly.

We place it in the `/etc/pve` directory to make it available on all nodes and symlink it into the `/etc/ceph` directory.

For example:

root@site-a $ scp /etc/pve/ceph.conf root@<rbd_mirror_host_in_site_B>:/etc/pve/site-a.conf

Switch to the other cluster:

root@site-b $ ln -s /etc/pve/site-a.conf /etc/ceph/site-a.conf

Make sure that the name of the config file matches the name used in the keyring file that stores the authentication information.
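
The naming rule above can be checked before the peer is added. The following is a minimal sketch, assuming the file locations used in this guide; the function name `check_peer_files` and the overridable variables `CONF_DIR` and `KEYRING_DIR` are illustrative and not part of Ceph or Proxmox VE.

```shell
#!/bin/sh
# Sketch: verify that <cluster>.conf and <cluster>.<client>.keyring both exist.
# CONF_DIR and KEYRING_DIR are hypothetical knobs that default to the
# locations used in this guide.
CONF_DIR="${CONF_DIR:-/etc/ceph}"
KEYRING_DIR="${KEYRING_DIR:-/etc/pve/priv}"

check_peer_files() {
    cluster="$1"   # e.g. site-a
    client="$2"    # e.g. client.rbd-mirror-peer-a
    rc=0
    for f in "$CONF_DIR/$cluster.conf" "$KEYRING_DIR/$cluster.$client.keyring"; do
        # -e follows symlinks, so a dangling symlink is reported as missing
        if [ -e "$f" ]; then
            echo "ok: $f"
        else
            echo "missing: $f"
            rc=1
        fi
    done
    return "$rc"
}
```

On site B, `check_peer_files site-a client.rbd-mirror-peer-a` should report both files as `ok` if the names used so far are consistent.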

Enable mirroring on pools

Run the following command on both clusters to enable mirroring:

$ rbd mirror pool enable <pool> <mode>

If you want to use journal based mirroring, you can set <mode> to pool. This will mirror all images that have the `journaling` feature enabled.

For snapshot based mirroring or if you want to manually enable mirroring in journal based mirroring, set <mode> to image.

For example if you want image based mirroring that allows you to choose between snapshot or journal based mirroring for each image:

$ rbd mirror pool enable <pool> image

Configure peers

Next we need to tell the pool on site B which keyring and Ceph config file it should use to connect to the peer (site A).

root@site-b $ rbd mirror pool peer add <pool> client.rbd-mirror-peer-a@site-a

You can check the settings by running

root@site-b $ rbd mirror pool info <pool>
Mode: image
Site Name: 44d5aca2-d47c-4f1f-bfa8-2c52281619ee

Peer Sites: 

UUID: aa08d6ab-a8a4-4cb4-ba92-6a03c738b8ca
Name: site-a
Mirror UUID: 
Direction: rx-tx
Client: client.rbd-mirror-peer-a

The direction should be `rx-tx` and the client should be set correctly to match the keyring file. The name should also be shown correctly (site A). Should you need to change any of these settings, you can do so with:

rbd mirror pool peer set <pool> <uuid> <property> <value>

The source cluster (site A) does not yet know about the peer, as it hasn't connected yet.
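
If you want to check the direction from a script, the value can be extracted from the output. This is a small sketch that assumes the line-oriented `Key: value` output shown above; the helper name `peer_direction` is made up for this example.

```shell
#!/bin/sh
# Sketch: print the "Direction:" value(s) found in `rbd mirror pool info`
# output that is passed on stdin. Leading whitespace before the key is ignored.
peer_direction() {
    awk -F': *' '{ key = $1; sub(/^[ \t]+/, "", key) }
                 key == "Direction" { print $2 }'
}

# Usage on site B (needs a reachable cluster):
#   rbd mirror pool info <pool> | peer_direction
```

If several peers are configured, one line is printed per peer.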

Set up the rbd-mirror daemon

We need to install the `rbd-mirror` package first:

root@site-b $ apt install rbd-mirror

Since we have our keyring files stored in the `/etc/pve/priv` directory which can only be read by the user `root`, we need to enable and modify the systemd unit file for the rbd-mirror.

root@site-b $ systemctl enable ceph-rbd-mirror.target
root@site-b $ cp /usr/lib/systemd/system/ceph-rbd-mirror@.service /etc/systemd/system/ceph-rbd-mirror@.service
root@site-b $ sed -i -e 's/setuser ceph.*/setuser root --setgroup root/' /etc/systemd/system/ceph-rbd-mirror@.service

With this change, the rbd-mirror daemon runs as root. Next we need to create and start the service. Its instance name must match the name of the local user we created earlier for the target cluster; otherwise the daemon won't be able to authenticate against the target cluster (site B).

root@site-b $ systemctl enable --now ceph-rbd-mirror@rbd-mirror.$(hostname).service

If we check the status and logs of the `ceph-rbd-mirror@rbd-mirror.<hostname>.service` service, we should see that it comes up and does not log any authentication errors.
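
A possible way to look at the status and logs is sketched below. The helper `mirror_unit` is a hypothetical convenience function that only builds the unit name used in this guide; `systemctl status` and `journalctl -u` are the standard systemd tools.

```shell
#!/bin/sh
# Sketch: build the systemd unit name used in this guide.
# Without an argument, the local hostname is used.
mirror_unit() {
    host="${1:-$(hostname)}"
    printf 'ceph-rbd-mirror@rbd-mirror.%s.service\n' "$host"
}

# Usage on site B:
#   systemctl status "$(mirror_unit)"
#   journalctl -u "$(mirror_unit)" --since "10 minutes ago"
```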

The source cluster (site A) should now have a peer configured and direction will be `tx-only`:

root@site-a $ rbd mirror pool info <pool>

Configure images

Before we can start mirroring the images, we need to define which images should be mirrored.

The `mode` defines if the image is mirrored using snapshots or a journal.

To enable the mirroring of an image, run

rbd mirror image enable <pool>/<image> <mode>

This needs to be done on the source, site A.

Snapshot based mirror

To use snapshots, configure the image with `mode` `snapshot`, for example:

root@site-a $ rbd mirror image enable rbd/vm-100-disk-0 snapshot

This command can take a moment or two.
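
If many images in a pool should be mirrored, the command can be wrapped in a loop. The following is a rough sketch, not an official tool: the function name `enable_snapshot_mirroring` is made up, and it assumes that every image listed by `rbd ls` in the pool should be mirrored.

```shell
#!/bin/sh
# Sketch: enable snapshot based mirroring for every image in a pool.
# Run on the source cluster (site A).
enable_snapshot_mirroring() {
    pool="$1"
    rbd ls --pool "$pool" | while IFS= read -r image; do
        echo "enabling snapshot mirroring for $pool/$image"
        # </dev/null keeps the command from consuming the image list on stdin
        rbd mirror image enable "$pool/$image" snapshot </dev/null ||
            echo "warning: could not enable mirroring for $pool/$image" >&2
    done
}

# Usage:
#   enable_snapshot_mirroring <pool>
```

A failure for one image only prints a warning, so the loop continues with the remaining images.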

Now, every time we want the current state to be mirrored to the target cluster (site B) we need a snapshot. We can create them manually with:

rbd mirror image snapshot <pool>/<image>

Snapshot schedule

Since it would be cumbersome to always create mirror snapshots manually, we can define a snapshot schedule so they will be taken automatically.

rbd mirror snapshot schedule add --pool <pool> <interval>

For example, every 5 minutes:

root@site-a $ rbd mirror snapshot schedule add 5m

You can also use other suffixes for days (d) or hours (h) and specify it more explicitly for a single pool with the `--pool <pool>` parameter.

To verify the schedule run:

root@site-a $ rbd mirror snapshot schedule status

It can take a few moments for the newly created schedule to show up!

Journal based mirror

To enable journal based mirroring for an image, run the command with the `journal` mode. For example:

root@site-a $ rbd mirror image enable rbd/vm-100-disk-0 journal

This will automatically enable the `journaling` feature for the image. Compare the output of

root@site-a $ rbd info <pool>/<image>

before and after you enable journal based mirroring for the first time.

Journal based mirroring also needs the `exclusive-lock` feature enabled for the images, which should be the default.

Note: KRBD does not support journal based mirroring!

Last steps

Once the rbd-mirror is up and running, you should see a peer configured in the source cluster (site A):

root@site-a $ rbd mirror pool info <pool>
Mode: image
Site Name: ce99d398-91ab-4667-b4f2-307ba0bec358

Peer Sites: 

UUID: 87441fdf-3a61-4840-a869-34b25b47a964
Name: 44d5aca2-d47c-4f1f-bfa8-2c52281619ee
Mirror UUID: 1abf773b-6c95-420c-8ceb-35ee346521db
Direction: tx-only

On the target cluster (site B) you will see the image if you run

root@site-b $ rbd ls --pool <pool>

and if you used snapshot based mirroring, you should see snapshots appearing on the target cluster (site B) very quickly.

rbd snap ls --all --pool <pool> <image>

For example:

root@site-b $ rbd snap ls --all --pool rbd vm-100-disk-0

You can get detailed information about the mirroring by running:

root@site-b $ rbd mirror pool status <pool> --verbose

Failover Recovery

A common scenario is that the source cluster, site A in this guide, will have some kind of failure, and we want to fail over to the other cluster, site B.

You will have to make sure yourself that the VM and container configuration files are synced to the other site, for example with a recurring rsync job. The container configuration files for each node are located at

/etc/pve/lxc

and for VMs in

/etc/pve/qemu-server
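
A recurring sync could look roughly like the following sketch. The destination host `root@site-b-node1` and the staging directory `/root/site-a-guest-configs` are placeholders, not values from this guide. The files are copied to a staging directory rather than directly into `/etc/pve` on site B, so that nothing is overwritten there by accident. Setting `DRY_RUN` to a non-empty value only prints the rsync commands.

```shell
#!/bin/sh
# Sketch: copy the guest configs of this node to a staging directory on site B.
# DEST_HOST and DEST_DIR are placeholders. DEST_DIR must already exist on the
# destination and should be different for each source node.
DEST_HOST="${DEST_HOST:-root@site-b-node1}"
DEST_DIR="${DEST_DIR:-/root/site-a-guest-configs}"

sync_guest_configs() {
    for kind in lxc qemu-server; do
        # The trailing slashes make rsync copy the directory contents and
        # follow the /etc/pve/<kind> symlink to this node's directory.
        ${DRY_RUN:+echo} rsync -a --delete \
            "/etc/pve/$kind/" "$DEST_HOST:$DEST_DIR/$kind/"
    done
}

# Usage:
#   DRY_RUN=1 sync_guest_configs    # only show what would be run
```

To run it regularly, the function call can be placed in a script that is started by cron or a systemd timer on each node of site A.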

Make sure that no guest has anything configured that is specific to only the source cluster, like an ISO image or a storage used for the disk images.

If you simply try to start the guests on the remaining secondary cluster (site B), a container will not start, and a VM may start (if KRBD is disabled) but will report IO errors very quickly. This is because the target images are marked as non-primary and do not allow writes from our guests.

Promote images on site B

By promoting an image or all images in a pool, we tell Ceph that they are now the primary ones to be used. In a planned failover, we would first demote the images on site A before promoting the images on site B. In a recovery situation with site A down, we need to `--force` the promotion.

To promote a single image, run the following command:

root@site-b $ rbd mirror image promote <pool>/<image> --force
Note: If you want to test the scenario where both clusters are healthy, do not use the --force flag, but demote the image on site A first:
root@site-a $ rbd mirror image demote <pool>/<image>

To promote all images in a pool, run the following command:

root@site-b $ rbd mirror pool promote <pool> --force

After this, our guests should start fine.

Resync and switch back to site A

Once site A is back up and operational, we want to plan our switch back. For this, we first need to demote the images on site A.

Note: Do not start guests on site A at this point!

For all images in a pool:

root@site-a $ rbd mirror pool demote <pool>

For specific images:

root@site-a $ rbd mirror image demote <pool>/<image>

We also need to set up an RBD mirror daemon on site A that connects to site B (two-way mirror). If not done yet, now is the time to set this up. The steps are the same, but with the roles of site A and site B swapped.

Once the RBD mirror daemon on site A is up and running, the images need to be flagged for a resync. Until then, the RBD mirror daemon on site A will log problems. Run the following commands for each image (or script it):

$ rbd mirror image resync <pool>/<image>
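
Scripting this could look like the sketch below. The function name `resync_pool` is made up for this example, and it assumes that all images in the pool are mirrored and have already been demoted on site A. A resync replaces the local copy with the data from the primary, so run it only on the site whose images should be overwritten.

```shell
#!/bin/sh
# Sketch: flag every image in a pool for a resync.
# Run on site A after the images there have been demoted.
resync_pool() {
    pool="$1"
    rbd ls --pool "$pool" | while IFS= read -r image; do
        # </dev/null keeps the command from consuming the image list on stdin
        rbd mirror image resync "$pool/$image" </dev/null ||
            echo "warning: resync request failed for $pool/$image" >&2
    done
}

# Usage:
#   resync_pool <pool>
```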

After a short time, the images should be mirrored from site B to site A. You can verify it by running

rbd mirror pool status <pool> --verbose

by checking the `last_update` line for each image.

If you want to move a guest back, make sure that the configuration on site A is still valid and hasn't changed during the time on site B.

Then power down the guest and wait for another successful mirroring to site A. Once we are sure that the disk images have been mirrored after the guest was shut down, we can demote the image(s) on site B and promote them on site A.

root@site-b $ rbd mirror image demote <pool>/<image>

Or for all primary images in a pool:

root@site-b $ rbd mirror pool demote <pool>

Promote single images on site A:

root@site-a $ rbd mirror image promote <pool>/<image>

Promote all non-primary images in a pool:

root@site-a $ rbd mirror pool promote <pool>

After a short time, we should see that the images on site A are now the primary ones and on site B that the images are mirrored again:

$ rbd mirror pool status <pool> --verbose