Ceph Luminous to Nautilus
Latest revision as of 14:22, 31 May 2023
Introduction
This article explains how to upgrade from Ceph Luminous to Nautilus (14.2.0 or higher) on Proxmox VE 6.x.
For more information, see the Ceph Nautilus release notes.
Assumption
We assume that all nodes are on the latest Proxmox VE 6.x version and Ceph is on version Luminous (12.2.12-pve1).
The cluster must be healthy and working.
Note
- After upgrading to Proxmox VE 6.x and before upgrading to Ceph Nautilus, do not use the Proxmox VE 6.x tools for Ceph (pveceph), as they are not intended to work with Ceph Luminous.
- If it's absolutely necessary to change the Ceph cluster before upgrading to Nautilus, use the Ceph native tools instead.
- During the upgrade from Luminous to Nautilus it will not be possible to create a new OSD using a Luminous ceph-osd daemon after the monitors have been upgraded to Nautilus. Avoid adding or replacing any OSDs while the upgrade is in progress.
- Avoid creating any RADOS pools while the upgrade is in progress.
- You can monitor the progress of your upgrade at any time with the ceph versions command. It will tell you which Ceph version(s) are running for each type of daemon.
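For illustration, part-way through the upgrade `ceph versions` might report output along the lines of the following (the daemon counts and version strings below are made up, not from a live cluster); a simple completion check is that no daemon type reports luminous any more:

```shell
# Illustrative (not live) 'ceph versions' output during a partial upgrade:
# mon and mgr already run Nautilus, the OSDs still run Luminous.
sample='{
    "mon": { "ceph version 14.2.0 ... nautilus (stable)": 3 },
    "mgr": { "ceph version 14.2.0 ... nautilus (stable)": 3 },
    "osd": { "ceph version 12.2.12 ... luminous (stable)": 6 }
}'
# On a real cluster you would use: sample=$(ceph versions)

# The upgrade is complete once no daemon type reports luminous any more:
if printf '%s' "$sample" | grep -q luminous; then
    status="upgrade still in progress"
else
    status="all daemons on nautilus"
fi
echo "$status"
```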
Cluster Preparation
If your cluster was originally installed with a version prior to Luminous, ensure that it has completed at least one full scrub of all PGs while running Luminous. Failure to do so will cause your monitor daemons to refuse to join the quorum on start, leaving them non-functional.
If you are unsure whether or not your Luminous cluster has completed a full scrub of all PGs, check the state of your cluster by running:
ceph osd dump | grep ^flags
In order to be able to proceed to Nautilus, your OSD map must include the flags
- recovery_deletes flag
- purged_snapdirs flag
If your OSD map does not contain both these flags, you can simply wait for approximately 24-48 hours. In a standard cluster configuration this should be ample time for all your placement groups to be scrubbed at least once. Then repeat the above process to recheck.
If you have just completed an upgrade to Luminous and want to proceed to Nautilus quickly, you can force a scrub on all placement groups with the following command:
ceph osd scrub all
Note that this forced scrub may negatively impact the performance of your Ceph clients. Verify that the above-mentioned flags are set after the scrub has finished.
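The flag check above can be sketched as a small script (the sample flags line below is illustrative; your cluster may list additional flags):

```shell
# Illustrative 'ceph osd dump | grep ^flags' output (exact flag set varies):
flags='flags sortbitwise,recovery_deletes,purged_snapdirs'
# On a real cluster: flags=$(ceph osd dump | grep ^flags)

# Both flags must be present before proceeding to Nautilus:
missing=0
for f in recovery_deletes purged_snapdirs; do
    case "$flags" in
        *"$f"*) echo "$f: ok" ;;
        *) echo "$f: MISSING"; missing=1 ;;
    esac
done
```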
Adapt /etc/pve/ceph.conf
Since Nautilus, all daemons use the 'keyring' option for their keyring, so you have to adapt the configuration. The easiest way is to move the global 'keyring' option into the 'client' section and remove it everywhere else. Create the 'client' section if you don't have one.
For example:
From:
[global]
     ...
     keyring = /etc/pve/priv/$cluster.$name.keyring

[osd]
     keyring = /var/lib/ceph/osd/ceph-$id/keyring
To:
[global]
     ...

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring
Preparation on each Ceph cluster node
Change the current Ceph repositories from Luminous to Nautilus.
sed -i 's/luminous/nautilus/' /etc/apt/sources.list.d/ceph.list
Your /etc/apt/sources.list.d/ceph.list should look like this
deb http://download.proxmox.com/debian/ceph-nautilus buster main
Set the 'noout' flag
Set the noout flag for the duration of the upgrade (optional, but recommended):
ceph osd set noout
Or via the GUI in the OSD tab.
Upgrade on each Ceph cluster node
Upgrade all your nodes with the following commands. They will upgrade Ceph on each node to Nautilus.
apt update
apt dist-upgrade
After the update, the old Luminous binaries are still running.
Restart the monitor daemon
After upgrading all cluster nodes, you have to restart the monitor on each node where a monitor runs.
systemctl restart ceph-mon.target
Once all monitors are up, verify that the monitor upgrade is complete. Look for the nautilus string in the mon map. The command
ceph mon dump | grep min_mon_release
should report
min_mon_release 14 (nautilus)
If it does not, this implies that one or more monitors haven’t been upgraded and restarted, and/or that the quorum doesn't include all monitors.
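The mon map check can also be scripted; the sample line below is what a fully upgraded cluster reports (taken from the expected output above), while on a live cluster it would come from ceph mon dump:

```shell
# Expected output line from 'ceph mon dump | grep min_mon_release':
line='min_mon_release 14 (nautilus)'
# On a real cluster: line=$(ceph mon dump | grep min_mon_release)

case "$line" in
    *'min_mon_release 14'*) mon_status="monitors fully upgraded" ;;
    *) mon_status="at least one monitor not yet on nautilus" ;;
esac
echo "$mon_status"
```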
Restart the manager daemons on all nodes
Then restart all managers on all nodes
systemctl restart ceph-mgr.target
Verify that the ceph-mgr daemons are running by checking ceph -s
ceph -s
...
  services:
    mon: 3 daemons, quorum foo,bar,baz
    mgr: foo(active), standbys: bar, baz
...
Restart the OSD daemon on all nodes
Important Steps before restarting OSD
If you have a cluster with IPv6 only, you need to set the following command in the global section of the ceph config
ms_bind_ipv4 = false
ms_bind_ipv6 = true
Otherwise, each OSD tries to bind to an IPv4 address in addition to the IPv6 address, and fails if it cannot find an IPv4 address in the given public/cluster networks.
Next, restart all OSDs on all nodes
systemctl restart ceph-osd.target
On each host, tell ceph-volume to adapt the OSDs created with ceph-disk using the following two commands:
ceph-volume simple scan
ceph-volume simple activate --all
If either command fails, your OSDs will not be recognized after a reboot.
- One such failure is Required devices (block and data) not present for bluestore. This may happen if you have filestore OSDs, due to a bug in the Ceph tooling (see http://wordpress.hawkless.id.au/index.php/2019/05/10/ceph-nautilus-required-devices-block-and-data-not-present-for-bluestore/).
- To fix it, edit the /etc/ceph/osd/{OSDID}-GUID.json file created for each filestore OSD and add the following line (make sure the result stays valid JSON; every attribute has to end with a comma except the last one):
"type": "filestore"
- Run again: ceph-volume simple activate --all
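For illustration, the adapted JSON file might then look roughly like this (all other attributes elided; only the added "type" line matters here, and the exact set of keys in your files is authoritative):

```
{
    ...,
    "type": "filestore"
}
```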
To verify that the OSDs start up automatically, it's recommended that each OSD host is rebooted following the step above.
Note that ceph-volume does not have the same hot-plug capability as ceph-disk had, where a newly attached disk is automatically detected via udev events.
You will need to scan the main data partition for each ceph-disk OSD explicitly, if
- the OSD isn’t currently running when the above scan command is run,
- a ceph-disk-based OSD is moved to a new host,
- the host OS is reinstalled,
- or the /etc/ceph/osd directory is lost.
For example:
ceph-volume simple scan /dev/sdb1
The output will include the appropriate ceph-volume simple activate command to enable the OSD.
Upgrade all CephFS MDS daemons
For each CephFS file system,
- Reduce the number of ranks to 1 (if you plan to restore it later, first take note of the original number of MDS daemons):
ceph status
ceph fs set <fs_name> max_mds 1
- Wait for the cluster to deactivate any non-zero ranks by periodically checking the status:
ceph status
- Take all standby MDS daemons offline on the appropriate hosts with:
systemctl stop ceph-mds.target
- Confirm that only one MDS is online and is on rank 0 for your FS:
ceph status
- Upgrade the last remaining MDS daemon by restarting the daemon:
systemctl restart ceph-mds.target
- Restart all standby MDS daemons that were taken offline:
systemctl start ceph-mds.target
- Restore the original value of max_mds for the volume:
ceph fs set <fs_name> max_mds <original_max_mds>
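The MDS steps above can be sketched as a single sequence. The file system name 'cephfs' and the original max_mds of 2 below are hypothetical; the script only collects and prints the commands so the order can be reviewed, and in a real run you must execute them one at a time, checking ceph status between steps as described above:

```shell
#!/bin/sh
# Hypothetical values; substitute your own file system name and original max_mds.
FS=cephfs
ORIG_MAX_MDS=2

# Collect the commands instead of executing them (review first, run manually):
plan=""
run() { plan="$plan+ $*
"; }

run ceph fs set "$FS" max_mds 1               # 1. reduce to a single rank
run ceph status                               # 2. wait until only rank 0 is active
run systemctl stop ceph-mds.target            # 3. stop the standby MDS daemons
run systemctl restart ceph-mds.target         # 4. restart the last remaining MDS
run systemctl start ceph-mds.target           # 5. bring the standbys back online
run ceph fs set "$FS" max_mds "$ORIG_MAX_MDS" # 6. restore the original max_mds

printf '%s' "$plan"
```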
Disallow pre-Nautilus OSDs and enable all new Nautilus-only functionality
ceph osd require-osd-release nautilus
Unset 'noout' and check cluster status
Unset the 'noout' flag. You can do this in the GUI or with this command.
ceph osd unset noout
Now check if your Ceph cluster is healthy.
ceph -s
Upgrade Tunables
If your CRUSH tunables are older than Hammer, Ceph will now issue a health warning. If you see a health alert to that effect, you can silence it and revert to the old behavior with:
ceph config set mon mon_crush_min_required_version firefly
If Ceph does not complain, we recommend you also switch any existing CRUSH buckets to straw2, which was added back in the Hammer release. If you have any 'straw' buckets, this will result in a modest amount of data movement, but generally nothing too severe:
ceph osd getcrushmap -o backup-crushmap
ceph osd crush set-all-straw-buckets-to-straw2
If there are problems, you can easily revert with:
ceph osd setcrushmap -i backup-crushmap
Moving to ‘straw2’ buckets will unlock a few recent features, like the crush-compat balancer mode added back in Luminous.
Enable msgrv2 protocol and update Ceph configuration
To enable the new v2 network protocol, issue the following command:
ceph mon enable-msgr2
This will instruct all monitors that bind to the old default port 6789 for the legacy v1 protocol to also bind to the new v2 protocol port 3300. To see if all monitors have been updated, run
ceph mon dump
and verify that each monitor has both a v2: and v1: address listed.
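A per-monitor entry in that dump looks roughly like the line below (monitor name and addresses are hypothetical); checking for both address families can be done like this:

```shell
# Illustrative monitor entry from 'ceph mon dump' after enabling msgr2:
entry='0: [v2:10.0.0.100:3300/0,v1:10.0.0.100:6789/0] mon.foo'
# On a real cluster, check every monitor line from 'ceph mon dump'.

case "$entry" in *v2:*) has_v2=yes ;; *) has_v2=no ;; esac
case "$entry" in *v1:*) has_v1=yes ;; *) has_v1=no ;; esac
echo "v2=$has_v2 v1=$has_v1"
```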
Updating /etc/pve/ceph.conf
For each host that has been upgraded, you should update your /etc/pve/ceph.conf file so that it either specifies no monitor port (if you are running the monitors on the default ports) or references both the v2 and v1 addresses and ports explicitly. Things will still work if only the v1 IP and port are listed, but each CLI instantiation or daemon will need to reconnect after learning the monitors also speak the v2 protocol, slowing things down a bit and preventing a full transition to the v2 protocol.
It is recommended to add all monitor IPs (without ports) to 'mon_host' in the global section, like this:
[global]
     ...
     mon_host = 10.0.0.100 10.0.0.101 10.0.0.102
     ...
For details see: Messenger V2
Legacy BlueStore stats reporting
After the upgrade, ceph -s may show the following message:
HEALTH_WARN Legacy BlueStore stats reporting detected on 6 OSD(s)
In Ceph Nautilus 14.2.0, the pool utilization stats reported by ceph df changed. This change requires an on-disk format change on the BlueStore OSDs.
To get the new stats format, the OSDs need to be manually "repaired". This will change the on-disk format. Alternatively, the OSDs can be destroyed and recreated, but this will create more recovery traffic.
systemctl stop ceph-osd@<N>.service
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-<N>/
systemctl start ceph-osd@<N>.service
Once all OSDs are "repaired" the health warning will disappear.
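The per-OSD repair can be sketched as a loop. The OSD IDs 0..2 below are hypothetical; on a real host list them with ls /var/lib/ceph/osd. The loop only builds and prints the command list for review; run the commands one OSD at a time, waiting for the cluster to return to health in between:

```shell
# Hypothetical OSD IDs; derive the real ones from /var/lib/ceph/osd.
plan=""
for N in 0 1 2; do
    plan="$plan
systemctl stop ceph-osd@$N.service
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-$N/
systemctl start ceph-osd@$N.service"
done
# Print the plan for review instead of executing it:
printf '%s\n' "$plan"
```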
Resolving the `insecure global_id reclaim` Warning
With Ceph Nautilus version 14.2.20 we released an update to fix a security issue (CVE-2021-20288) where Ceph was not ensuring that reconnecting/renewing clients were presenting an existing ticket when reclaiming their global_id value. An attacker that was able to authenticate could claim a global_id in use by a different client and potentially disrupt other cluster services.
Affected Versions:
- for server: all previous versions
- for clients:
- kernel: none
- user-space: all since (and including) Luminous 12.2.0
Attacker Requirements/Impact: Don't panic: the risk on a default Proxmox VE-managed Ceph setup is rather low, but we still recommend upgrading in a timely manner. An attacker would require all of the following:
- have a valid authentication key for the cluster
- know or guess the global_id of another client
- run a modified version of the Ceph client code to reclaim another client’s global_id
- construct appropriate client messages or requests to disrupt service or exploit Ceph daemon assumptions about global_id uniqueness
Addressing the Health Warnings
After upgrading to 14.2.20, you will still see two health warnings:
client is using insecure global_id reclaim
mons are allowing insecure global_id reclaim
To address those, first ensure that all VMs using Ceph on a storage without KRBD run the newer client library. For that, either fully restart the VMs (reboot over the API, or stop and start them), or migrate them to another node in the cluster that already has the Ceph update installed. You also need to restart the pvestatd and pvedaemon Proxmox VE daemons, which periodically access the Ceph cluster to gather status data or to execute API calls. Either use the web interface (Node -> System) or the command line:
systemctl try-reload-or-restart pvestatd.service pvedaemon.service
Next you can resolve the monitor warning by enforcing the stricter behavior that is possible now. Execute the following command on one of the nodes in the Proxmox VE Ceph cluster:
ceph config set mon auth_allow_insecure_global_id_reclaim false
Note: As said, this will cut off any old client after the ticket validity times out (72 h), so only execute it once the client warning has been resolved and has disappeared.
See the following forum post for details and discussion: https://forum.proxmox.com/threads/ceph-nautilus-and-octopus-security-update-for-insecure-global_id-reclaim-cve-2021-20288.88038/#post-385756
Command-line Interface
See https://ceph.com/rbd/new-in-nautilus-rbd-performance-monitoring/
To enable RBD performance monitoring, enable the rbd_support manager module:
ceph mgr module enable rbd_support
Then the following commands are available:
rbd perf image iotop
rbd perf image iostat