Ceph mClock Tuning

From Proxmox VE

With Ceph Quincy (17), the default scheduler for OSD operations changed from wpq to mclock_scheduler. This changes how an OSD is tuned, especially when recovery or rebalance operations impact production performance.

The old methods of controlling how much performance is spent on backfilling and recovering are ignored by the mclock_scheduler. These include parameters like osd_recovery_sleep or osd_snap_trim_sleep.

Instead, mclock_scheduler uses profiles that distinguish different types of operations, such as client or recovery. Each type has three basic parameters: a limit on how many IOPS it may use, a reservation that guarantees a minimum so it is never starved completely, and a weight. For anything between the reservation and the limit, the weight is used to balance the available IOPS between the different kinds of operations.
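The interplay of the three parameters can be illustrated with a little arithmetic. The sketch below is a deliberately simplified model of the description above, not the actual mClock algorithm, which schedules individual requests. The function name split_iops and all numbers (an OSD capacity of 1000 IOPS, the reservations, weights, and the recovery limit) are made up for illustration.

```shell
# Simplified illustration only: the real scheduler tags individual requests
# and does not compute a static split like this.
# Usage: split_iops CAPACITY CLIENT_RES CLIENT_WGT RECOVERY_RES RECOVERY_WGT RECOVERY_LIM
split_iops() {
    capacity=$1; c_res=$2; c_wgt=$3; r_res=$4; r_wgt=$5; r_lim=$6

    # IOPS left over once both reservations are satisfied
    spare=$(( capacity - c_res - r_res ))

    # Recovery gets its reservation plus its weighted share of the spare IOPS
    r_total=$(( r_res + spare * r_wgt / (c_wgt + r_wgt) ))

    # Recovery can never exceed its limit
    if [ "$r_total" -gt "$r_lim" ]; then
        r_total=$r_lim
    fi

    # In this model the client class takes whatever remains
    c_total=$(( capacity - r_total ))
    echo "client=$c_total recovery=$r_total"
}

split_iops 1000 500 2 250 1 400   # prints: client=667 recovery=333
split_iops 1000 500 2 250 1 300   # prints: client=700 recovery=300
```

In the second call the recovery limit of 300 IOPS takes effect, so the client class receives the IOPS that recovery is not allowed to use.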

For more details, check out the Ceph documentation about the Core Concepts of mclock_scheduler and the mClock config reference.

Check Current Scheduler

To check which scheduler an OSD is using, run the following command:

ceph config show-with-defaults osd.<ID> | grep op_queue
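To check every OSD in one go, the lookup can be wrapped in a loop. This is a minimal sketch that assumes it runs on a node with an admin keyring. The function name show_op_queue and the CEPH variable are hypothetical helpers, not part of Ceph.

```shell
# CEPH is a hypothetical override hook so the loop can be stubbed or pointed
# at a wrapper script; by default the regular ceph CLI is used.
CEPH=${CEPH:-ceph}

# Print the active op queue scheduler of every OSD in the cluster.
show_op_queue() {
    for id in $($CEPH osd ls); do
        printf 'osd.%s: %s\n' "$id" "$($CEPH config show "osd.$id" osd_op_queue)"
    done
}

# Usage:
#   show_op_queue
```

Passing the option name to ceph config show prints only its value, which makes it easier to spot OSDs that were not restarted after a scheduler change.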

Adjusting mClock Profile

If the default balanced profile does not prioritize client operations enough, consider switching to the custom profile. This allows you to configure limits, reservations, and weights in more detail for the different types of operations. The following command will change it temporarily for all running OSDs.

ceph tell osd.* injectargs "--osd_mclock_profile=custom"

To make it permanent, either configure it in the ceph.conf file or in the config DB. The section on switching to the old scheduler has examples using a different parameter.
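For the config DB, the command follows the same pattern as the osd_op_queue example in the section on switching to the old scheduler. The sketch below wraps it in a small helper. The function name persist_osd_option and the CEPH variable are hypothetical and only there to allow a dry run.

```shell
# CEPH is a hypothetical override hook: set CEPH="echo ceph" for a dry run
# that only prints the command (relies on sh/bash word splitting).
CEPH=${CEPH:-ceph}

# Store an option for all OSDs in the config DB so that it survives restarts.
persist_osd_option() {
    $CEPH config set osd "$1" "$2"
}

# Usage:
#   persist_osd_option osd_mclock_profile custom
```

Without the helper, this boils down to running ceph config set osd osd_mclock_profile custom.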

Verify that the OSD is using the custom profile by running:

ceph config show-with-defaults osd.<ID> | grep mclock_profile

Then you can adjust the weight, limit and reservation for each type of operation (client, recovery, best-effort). The available parameters are:

  • osd_mclock_scheduler_client_res
  • osd_mclock_scheduler_client_wgt
  • osd_mclock_scheduler_client_lim
  • osd_mclock_scheduler_background_recovery_res
  • osd_mclock_scheduler_background_recovery_wgt
  • osd_mclock_scheduler_background_recovery_lim
  • osd_mclock_scheduler_background_best_effort_res
  • osd_mclock_scheduler_background_best_effort_wgt
  • osd_mclock_scheduler_background_best_effort_lim

Depending on the Ceph version, 'res' and 'lim' are given in IOPS or as a fraction of the OSD's IOPS capacity (see the note below), while 'wgt' has no unit. You can try to weigh client OPs higher:

ceph tell osd.* injectargs "--osd_mclock_scheduler_client_wgt=4"

Additionally, it could be useful to lower the limit and / or reservation for recovery OPs. To get the current values for these parameters, run:

ceph config show-with-defaults osd.<ID> | grep mclock_scheduler

Run it against a few OSDs to get an idea of the current values. By default, they differ slightly from OSD to OSD.

To set them temporarily for all OSDs we use the ceph tell osd.* injectargs command. For example, to reduce the reservation for recovery OPs:

ceph tell osd.* injectargs "--osd_mclock_scheduler_background_recovery_res={value}"

If you want to apply a setting to only one specific OSD, use osd.<ID> instead of osd.*.

It is still possible to adjust the number of backfills, see Ceph docs.

Note: The res and lim values are in IOPS up to and including Ceph version 17.2.6 (archived docs). From versions 17.2.7 and 18.2.0 onward, they are a fraction (0.0 to 1.0) of the OSD's IOPS capacity (docs).
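The version rule from the note can be turned into a small check before choosing values. This is a rough sketch: the function names mclock_unit and iops_to_fraction are hypothetical, the version string is expected in plain MAJOR.MINOR.PATCH form, and treating every 18.x release as fraction-based is an assumption. The capacity passed to iops_to_fraction is assumed to be the IOPS capacity Ceph uses for that OSD.

```shell
# Report which unit res/lim use for a given Ceph version (MAJOR.MINOR.PATCH).
mclock_unit() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    patch=${rest#*.}

    if [ "$major" -ge 18 ]; then
        # Assumption: all Reef (18.x) and newer releases use fractions
        echo fraction
    elif [ "$major" -eq 17 ] && [ "$minor" -eq 2 ] && [ "$patch" -ge 7 ]; then
        echo fraction
    else
        echo iops
    fi
}

# Convert an absolute IOPS value into a fraction of the OSD's IOPS capacity.
# Usage: iops_to_fraction IOPS CAPACITY
iops_to_fraction() {
    awk -v iops="$1" -v cap="$2" 'BEGIN { printf "%.2f\n", iops / cap }'
}

mclock_unit 17.2.6          # prints: iops
mclock_unit 17.2.7          # prints: fraction
iops_to_fraction 300 3000   # prints: 0.10
```

The second helper is useful when a value that was tuned in IOPS on an older release has to be carried over after an upgrade.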


The following example switches to the custom profile, increases the client weight, and pins the background recovery limit and reservation. Which values for the background recovery limit and reservation work well is something you need to find out for your cluster. Start a bit higher and lower them if production is still impacted.

ceph tell osd.* injectargs "--osd_mclock_profile=custom"
ceph tell osd.* injectargs "--osd_mclock_scheduler_client_wgt=4"
ceph tell osd.* injectargs "--osd_mclock_scheduler_background_recovery_lim=0.1"
ceph tell osd.* injectargs "--osd_mclock_scheduler_background_recovery_res=0.1"

Switch to Old Scheduler

For now, it is still possible to use the old wpq scheduler. At some point in the future, it will most likely be deprecated.

To switch back, you need to set it either in the ceph.conf file or in the config DB.


ceph.conf

Add the following section to the ceph.conf file:

[osd]
	osd_op_queue = wpq

Config DB

ceph config set osd osd_op_queue wpq

Afterwards, a restart of the OSDs is needed.
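One cautious way to do the restart is to print the commands for the OSDs on the local node first and then run them one at a time. The sketch below assumes the classic /var/lib/ceph/osd/ceph-<ID> directory layout and the ceph-osd@<ID>.service systemd units. The function names are hypothetical.

```shell
# List the IDs of all OSDs that have a data directory on this node.
# The base directory can be overridden with the first argument.
local_osd_ids() {
    for dir in "${1:-/var/lib/ceph/osd}"/ceph-*; do
        if [ -d "$dir" ]; then
            # The ID is everything after the last hyphen
            echo "${dir##*-}"
        fi
    done
}

# Print one restart command per local OSD without executing anything.
print_restart_cmds() {
    for id in $(local_osd_ids "$1"); do
        echo "systemctl restart ceph-osd@$id.service"
    done
}

print_restart_cmds
```

Restarting the OSDs one by one and waiting for the cluster to become healthy in between keeps the impact on production low. Afterwards, the command from the Check Current Scheduler section confirms that the OSD reports wpq.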