Ceph Luminous to Nautilus

From Proxmox VE
Revision as of 14:22, 31 May 2023 by Thomas Lamprecht (talk | contribs)

Introduction

This article explains how to upgrade from Ceph Luminous to Nautilus (14.2.0 or higher) on Proxmox VE 6.x.

For more information see Release Notes

Assumption

We assume that all nodes are on the latest Proxmox VE 6.x version and Ceph is on version Luminous (12.2.12-pve1).

The cluster must be healthy and working.

Note

  • After upgrading to Proxmox VE 6.x and before upgrading to Ceph Nautilus:
    • Do not use the Proxmox VE 6.x tools for Ceph (pveceph), as they are not intended to work with Ceph Luminous.
    • If it's absolutely necessary to change the Ceph cluster before upgrading to Nautilus, use the Ceph native tools instead.
  • During the upgrade from Luminous to Nautilus it will not be possible to create a new OSD using a Luminous ceph-osd daemon after the monitors have been upgraded to Nautilus. Avoid adding or replacing any OSDs while the upgrade is in progress.
  • Avoid creating any RADOS pools while the upgrade is in progress.
  • You can monitor the progress of your upgrade anytime with the ceph versions command. This will tell you which Ceph version(s) are running for each type of daemon.

Cluster Preparation

If your cluster was originally installed with a version prior to Luminous, ensure that it has completed at least one full scrub of all PGs while running Luminous. Failure to do so will cause your monitor daemons to refuse to join the quorum on start, leaving them non-functional.

If you are unsure whether or not your Luminous cluster has completed a full scrub of all PGs, check the state of your cluster by running:

ceph osd dump | grep ^flags

To proceed to Nautilus, your OSD map must include both of these flags:

  • recovery_deletes
  • purged_snapdirs

If your OSD map does not contain both of these flags, you can simply wait approximately 24-48 hours. In a standard cluster configuration, this should be ample time for all your placement groups to be scrubbed at least once. Then repeat the check above.
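The flag check can also be scripted. The following is a minimal sketch, not an official tool: osd_flags_ready is a hypothetical helper name, and it assumes the flags line of ceph osd dump is a comma-separated list such as "flags sortbitwise,recovery_deletes,purged_snapdirs".

```shell
# Hypothetical helper (not part of Ceph or Proxmox VE).
# Reads `ceph osd dump` output on stdin and succeeds only if both
# required flags appear on the "flags" line.
osd_flags_ready() {
    flags_line=$(grep '^flags' | head -n 1 || true)
    for flag in recovery_deletes purged_snapdirs; do
        # Wrap the list in commas so each flag is matched as a whole word.
        case ",${flags_line#flags }," in
            *",$flag,"*) ;;
            *) echo "missing flag: $flag" >&2; return 1 ;;
        esac
    done
}

# Usage on a cluster node:
#   ceph osd dump | osd_flags_ready && echo "ready for Nautilus"
```

The helper only inspects text on stdin, so it can be tried against saved output before running it on a live cluster.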

If you have just completed an upgrade to Luminous and want to proceed to Nautilus in short order, you can force a scrub on all placement groups with the following command:

ceph osd scrub all

Note that this forced scrub may have a negative impact on the performance of your Ceph clients. After the scrub has finished, verify that the flags mentioned above are set.

Adapt /etc/pve/ceph.conf

Since Nautilus, all daemons use the 'keyring' option for their keyring, so you have to adapt the configuration. The easiest way is to move the global 'keyring' option into the 'client' section and remove it everywhere else. Create the 'client' section if you don't have one.

For example:

From:

[global]
    ...
    keyring = /etc/pve/priv/$cluster.$name.keyring
[osd]
    keyring = /var/lib/ceph/osd/ceph-$id/keyring

To:

[global]
    ...
[client]
    keyring = /etc/pve/priv/$cluster.$name.keyring
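To double-check the result, a small script can list every section other than 'client' that still sets a 'keyring' option. This is only a sketch: keyring_outside_client is a hypothetical helper name, and it assumes a simple ceph.conf layout with one "[section]" header per line and "keyring = ..." entries.

```shell
# Hypothetical helper (not part of Ceph or Proxmox VE).
# Reads a ceph.conf on stdin, prints each section other than [client]
# that sets a 'keyring' option, and fails if any such section exists.
keyring_outside_client() {
    awk '
        /^[ \t]*\[/ {
            # Remember the current section name without the brackets.
            section = $0
            sub(/^[ \t]*\[/, "", section)
            sub(/\].*$/, "", section)
            next
        }
        /^[ \t]*keyring[ \t]*=/ && section != "client" {
            print section
            found = 1
        }
        END { exit (found ? 1 : 0) }
    '
}

# Usage on a cluster node:
#   keyring_outside_client < /etc/pve/ceph.conf && echo "keyring option OK"
```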

Preparation on each Ceph cluster node

Change the current Ceph repositories from Luminous to Nautilus.

sed -i 's/luminous/nautilus/' /etc/apt/sources.list.d/ceph.list

Your /etc/apt/sources.list.d/ceph.list should look like this

deb http://download.proxmox.com/debian/ceph-nautilus buster main

Set the 'noout' flag

Set the noout flag for the duration of the upgrade (optional, but recommended):

ceph osd set noout

Or via the GUI in the OSD tab.

Upgrade on each Ceph cluster node

Upgrade all your nodes with the following commands. This upgrades Ceph on each node to Nautilus.

apt update
apt dist-upgrade

After the update, your cluster still runs the old Luminous binaries until the daemons are restarted.

Restart the monitor daemon

After upgrading all cluster nodes, you have to restart the monitor on each node where a monitor runs.

systemctl restart ceph-mon.target

Once all monitors are up, verify that the monitor upgrade is complete. Look for the nautilus string in the mon map. The command

ceph mon dump | grep min_mon_release

should report

min_mon_release 14 (nautilus)

If it does not, this implies that one or more monitors haven’t been upgraded and restarted, and/or that the quorum doesn't include all monitors.
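The same check can be wrapped in a small helper for use in scripts. This is a sketch only; mons_on_nautilus is a hypothetical name, and the helper simply looks for the line shown above in the output of ceph mon dump.

```shell
# Hypothetical helper (not part of Ceph or Proxmox VE).
# Reads `ceph mon dump` output on stdin and succeeds only if the
# mon map reports Nautilus as the minimum monitor release.
mons_on_nautilus() {
    grep -q '^min_mon_release 14 (nautilus)'
}

# Usage on a cluster node:
#   ceph mon dump | mons_on_nautilus || echo "monitor upgrade not complete"
```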

Restart the manager daemons on all nodes

Then restart all managers on all nodes

systemctl restart ceph-mgr.target

Verify that the ceph-mgr daemons are running by checking ceph -s

ceph -s
...
 services:
  mon: 3 daemons, quorum foo,bar,baz
  mgr: foo(active), standbys: bar, baz
...

Restart the OSD daemon on all nodes

Important Steps before restarting OSD

If you have an IPv6-only cluster, you need to set the following options in the global section of the Ceph configuration:

ms_bind_ipv4 = false
ms_bind_ipv6 = true

Otherwise, each OSD tries to bind to an IPv4 address in addition to the IPv6 address and fails if it cannot find an IPv4 address in the given public/cluster networks.

Next, restart all OSDs on all nodes

systemctl restart ceph-osd.target

On each host, tell ceph-volume to adapt the OSDs created with ceph-disk using the following two commands:

ceph-volume simple scan
ceph-volume simple activate --all

If you get a failure, your OSDs will not be recognized after a reboot.

"type": "filestore"
  • Run again: ceph-volume simple activate --all

To verify that the OSDs start up automatically, it's recommended that each OSD host is rebooted following the step above.
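Before rebooting, it can help to confirm that the scan recorded every OSD on the host. The sketch below rests on two assumptions that are not stated in this article: that ceph-volume simple scan writes one file named <id>-<fsid>.json per OSD into /etc/ceph/osd, and that every directory /var/lib/ceph/osd/ceph-<id> belongs to a ceph-disk OSD (OSDs created with ceph-volume lvm would be reported as false positives). osds_missing_scan is a hypothetical helper name.

```shell
# Hypothetical helper (not part of Ceph or Proxmox VE).
# Prints the ID of every OSD data directory that has no matching
# scan file. Both directories can be overridden for testing.
osds_missing_scan() {
    osd_dir=${1:-/var/lib/ceph/osd}
    json_dir=${2:-/etc/ceph/osd}
    for d in "$osd_dir"/ceph-*; do
        [ -d "$d" ] || continue
        id=${d##*/ceph-}
        # Assumed naming scheme of the scan files: <id>-<fsid>.json
        ls "$json_dir"/"$id"-*.json >/dev/null 2>&1 || echo "$id"
    done
}

# Usage on an OSD host (prints nothing if every OSD was scanned):
#   osds_missing_scan
```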

Note that ceph-volume does not have the hot-plug capability that ceph-disk had, where a newly attached disk is automatically detected via udev events.

You will need to scan the main data partition for each ceph-disk OSD explicitly, if

  • the OSD isn’t currently running when the above scan command is run,
  • a ceph-disk-based OSD is moved to a new host,
  • the OSD host is reinstalled,
  • or the /etc/ceph/osd directory is lost.

For example:

ceph-volume simple scan /dev/sdb1

The output will include the appropriate ceph-volume simple activate command to enable the OSD.

Upgrade all CephFS MDS daemons

For each CephFS file system,

  1. Reduce the number of ranks to 1 (if you plan to restore it later, first take note of the original number of MDS daemons):
    ceph status
    ceph fs set <fs_name> max_mds 1
  2. Wait for the cluster to deactivate any non-zero ranks by periodically checking the status:
    ceph status
  3. Take all standby MDS daemons offline on the appropriate hosts with:
    systemctl stop ceph-mds.target
  4. Confirm that only one MDS is online and is on rank 0 for your FS:
    ceph status
  5. Upgrade the last remaining MDS daemon by restarting the daemon:
    systemctl restart ceph-mds.target
  6. Restart all standby MDS daemons that were taken offline:
    systemctl start ceph-mds.target
  7. Restore the original value of max_mds for the volume:
    ceph fs set <fs_name> max_mds <original_max_mds>
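To record the original value before step 1, the current max_mds can be read from the file system map. This is a sketch under an assumption not stated in this article: that ceph fs get <fs_name> prints a line of the form "max_mds <n>". current_max_mds is a hypothetical helper name.

```shell
# Hypothetical helper (not part of Ceph or Proxmox VE).
# Reads `ceph fs get <fs_name>` output on stdin and prints the value
# of the first max_mds line it finds.
current_max_mds() {
    awk '$1 == "max_mds" { print $2; exit }'
}

# Usage on a cluster node, before reducing the number of ranks:
#   original_max_mds=$(ceph fs get <fs_name> | current_max_mds)
#   echo "original max_mds: $original_max_mds"
```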

Disallow pre-Nautilus OSDs and enable all new Nautilus-only functionality

ceph osd require-osd-release nautilus

Unset 'noout' and check cluster status

Unset the 'noout' flag. You can do this in the GUI or with this command.

ceph osd unset noout

Now check if your Ceph cluster is healthy.

ceph -s

Upgrade Tunables

If your CRUSH tunables are older than Hammer, Ceph will now issue a health warning. If you see a health alert to that effect, you can revert this change with:

ceph config set mon mon_crush_min_required_version firefly

If Ceph does not complain, however, then we recommend you also switch any existing CRUSH buckets to straw2, which was added back in the Hammer release. If you have any ‘straw’ buckets, this will result in a modest amount of data movement, but generally nothing too severe:

ceph osd getcrushmap -o backup-crushmap
ceph osd crush set-all-straw-buckets-to-straw2

If there are problems, you can easily revert with:

ceph osd setcrushmap -i backup-crushmap

Moving to ‘straw2’ buckets will unlock a few recent features, like the crush-compat balancer mode added back in Luminous.
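To estimate beforehand whether the conversion will move any data, you can count the buckets that still use the legacy algorithm. The sketch below assumes that ceph osd crush dump pretty-prints one "alg" field per line in the form "alg": "straw"; if the output format differs, the count will be wrong. count_legacy_straw_buckets is a hypothetical helper name.

```shell
# Hypothetical helper (not part of Ceph or Proxmox VE).
# Reads `ceph osd crush dump` output on stdin and prints the number
# of lines declaring the legacy 'straw' algorithm. The closing quote
# in the pattern keeps "straw2" from matching.
count_legacy_straw_buckets() {
    grep -c '"alg": "straw"' || true
}

# Usage on a cluster node:
#   ceph osd crush dump | count_legacy_straw_buckets
```

A count of 0 suggests that no buckets would be converted and no data movement is expected from this step.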

Enable msgr2 protocol and update Ceph configuration

To enable the new v2 network protocol, issue the following command:

ceph mon enable-msgr2

This will instruct all monitors that bind to the old default port 6789 for the legacy v1 protocol to also bind to the new 3300 v2 protocol port. To see if all monitors have been updated run

ceph mon dump

and verify that each monitor has both a v2: and v1: address listed.
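This verification can be sketched as a script. It assumes that each monitor appears in the ceph mon dump output on its own line ending in "mon.<name>", for example "0: [v2:10.0.0.100:3300/0,v1:10.0.0.100:6789/0] mon.foo"; the helper name all_mons_speak_v2 and the sample addresses are illustrative only.

```shell
# Hypothetical helper (not part of Ceph or Proxmox VE).
# Reads `ceph mon dump` output on stdin and succeeds only if at least
# one monitor line exists and every monitor line lists both a v2: and
# a v1: address.
all_mons_speak_v2() {
    awk '
        BEGIN { total = 0; ok = 0 }
        / mon\./ {
            total++
            if ($0 ~ /v2:/ && $0 ~ /v1:/) ok++
        }
        END { exit ((total > 0 && ok == total) ? 0 : 1) }
    '
}

# Usage on a cluster node:
#   ceph mon dump | all_mons_speak_v2 && echo "all monitors speak msgr2"
```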

Updating /etc/pve/ceph.conf

For each host that has been upgraded, you should update your /etc/pve/ceph.conf file so that it either specifies no monitor port (if you are running the monitors on the default ports) or references both the v2 and v1 addresses and ports explicitly. Things will still work if only the v1 IP and port are listed, but each CLI instantiation or daemon will need to reconnect after learning the monitors also speak the v2 protocol, slowing things down a bit and preventing a full transition to the v2 protocol.

It is recommended to add all monitor IPs (without port) to 'mon_host' in the global section like this:

[global]
    ...
    mon_host = 10.0.0.100 10.0.0.101 10.0.0.102
    ...

For details see: Messenger V2

Legacy BlueStore stats reporting

After the upgrade, ceph -s may show the following message:

HEALTH_WARN Legacy BlueStore stats reporting detected on 6 OSD(s)

In Ceph Nautilus 14.2.0, the reporting of pool utilization stats (ceph df) changed. This change requires an on-disk format change on the BlueStore OSDs.

To get the new stats format, the OSDs need to be manually "repaired". This will change the on-disk format. Alternatively, the OSDs can be destroyed and recreated, but this will create more recovery traffic.

systemctl stop ceph-osd@<N>.service 
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-<N>/
systemctl start ceph-osd@<N>.service 

Once all OSDs are "repaired" the health warning will disappear.
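When a host carries several OSDs, the three commands above can be wrapped in a function. The following is a sketch only: repair_osd_stats and the DRY_RUN variable are hypothetical names, and by default the function only prints the commands so they can be reviewed first.

```shell
# Hypothetical helper (not part of Ceph or Proxmox VE).
# With DRY_RUN=1 (the default) commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stop one OSD, repair its on-disk stats format, and start it again.
# The chain stops at the first command that fails.
repair_osd_stats() {
    id=$1
    run systemctl stop "ceph-osd@${id}.service" &&
    run ceph-bluestore-tool repair --path "/var/lib/ceph/osd/ceph-${id}/" &&
    run systemctl start "ceph-osd@${id}.service"
}

# Usage on an OSD host (review the printed commands first):
#   repair_osd_stats 3
#   DRY_RUN=0 repair_osd_stats 3
```

Repairing one OSD at a time and checking ceph -s in between is a cautious choice here, since each OSD is offline while it is being repaired.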

Resolving the `insecure global_id reclaim` Warning

With Ceph Nautilus version 14.2.20 we released an update to fix a security issue (CVE-2021-20288) where Ceph was not ensuring that reconnecting/renewing clients were presenting an existing ticket when reclaiming their global_id value. An attacker that was able to authenticate could claim a global_id in use by a different client and potentially disrupt other cluster services.

Affected Versions:

  • for server: all previous versions
  • for clients:
    • kernel: none
    • user-space: all since (and including) Luminous 12.2.0

Attacker Requirements/Impact: Don't panic. The risk on a default Proxmox VE managed Ceph setup is rather low, but we still recommend upgrading in a timely manner. An attacker would need to meet all of the following requirements:

  • have a valid authentication key for the cluster
  • know or guess the global_id of another client
  • run a modified version of the Ceph client code to reclaim another client’s global_id
  • construct appropriate client messages or requests to disrupt service or exploit Ceph daemon assumptions about global_id uniqueness

Addressing the Health Warnings

After upgrading to the fixed version, you will still see two health warnings:

  1. client is using insecure global_id reclaim
  2. mons are allowing insecure global_id reclaim

To address them, first ensure that all VMs using Ceph on a storage without KRBD run the newer client library. To do so, either fully restart the VMs (reboot over the API, or stop and start), or migrate them to another node in the cluster that already has the Ceph update installed. You also need to restart the pvestatd and pvedaemon Proxmox VE daemons, which access the Ceph cluster periodically to gather status data or to execute API calls. Either use the web interface (Node -> System) or the command line:

 systemctl try-reload-or-restart pvestatd.service pvedaemon.service

Next you can resolve the monitor warning by enforcing the stricter behavior that is possible now. Execute the following command on one of the nodes in the Proxmox VE Ceph cluster:

 ceph config set mon auth_allow_insecure_global_id_reclaim false

Note: This will cut off any old client after the ticket validity times out (72h), so only execute it once the client warning has been resolved and has disappeared.
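A small guard can help avoid enforcing the stricter behavior too early. The sketch below assumes that the client warning text in the output of ceph health detail contains the phrase "using insecure global_id reclaim", while the monitor warning contains "allowing" instead; clients_still_insecure is a hypothetical helper name.

```shell
# Hypothetical helper (not part of Ceph or Proxmox VE).
# Reads `ceph health detail` output on stdin and succeeds if the
# client-side warning is still present.
clients_still_insecure() {
    grep -q 'using insecure global_id reclaim'
}

# Usage on a cluster node:
#   if ceph health detail | clients_still_insecure; then
#       echo "old clients still connected; restart or migrate them first"
#   else
#       ceph config set mon auth_allow_insecure_global_id_reclaim false
#   fi
```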

See the following forum post for details and discussion: https://forum.proxmox.com/threads/ceph-nautilus-and-octopus-security-update-for-insecure-global_id-reclaim-cve-2021-20288.88038/#post-385756

Command-line Interface

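See https://ceph.com/rbd/new-in-nautilus-rbd-performance-monitoring/ for details on the RBD performance monitoring added in Nautilus.

Enable the rbd_support manager module:

ceph mgr module enable rbd_support

The following commands are then available:

rbd perf image iotop

rbd perf image iostat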