[PVE-User] CEPH: How to remove an OSD without experiencing inactive placement groups

Chris Murray chrismurray84 at gmail.com
Fri Jan 2 15:52:24 CET 2015


I hadn't heard of Sheepdog ... I'll have to keep a lookout for that too.

For years I've been shouting about ZFS, but the mystical 'block pointer rewrite' has yet to surface, which limits flexibility and means there's effectively no 'defragment'. I'm willing to try anything, really, but there's still nothing that ticks all the boxes, which makes me wonder whether I'm being unreasonable to think there'd be a filesystem that could do it all.

Chris

-----Original Message-----
From: Adam Thompson [mailto:athompso at athompso.net] 
Sent: 19 December 2014 18:54
To: Chris Murray; Eneko Lacunza; pve-user at pve.proxmox.com
Subject: Re: [PVE-User] CEPH: How to remove an OSD without experiencing inactive placement groups

On 14-12-19 12:46 PM, Chris Murray wrote:
> Hi Eneko,
>
> Thank you. From what I'm reading, those options affect the amount of concurrent recovery that happens. Forgive my ignorance, but how does that address the 78 placement groups which were inactive from the beginning of the process right through to the end?
>
> My google for the following doesn't turn up much:
> "stuck inactive" "osd max backfills" "osd recovery max active"
>
> I don't understand why these would become 'stuck inactive' until I brought the OSD up again. If it were a case of a lack of spare IO in the pool getting in the way of recovery (which I can understand with only nine disks), why were there 78 pgs inactive from the beginning, and then (presumably the same) 78 at the end? In that situation I'd expect the VMs to be slow, and that part-way through or at the end of the process, once the IO had subsided, CEPH would move them back to one of the active states. I'm not familiar with the inner workings of CEPH, and they're probably complex enough to go over my head anyway; I'm just trying to understand roughly what it's chosen to do there and why. I can see why those tunables might improve responsiveness during the recovery process, though.

AFAIK you're exactly right about those settings.
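For reference, those can be tweaked on a running cluster with something along these lines (the values here are just an illustration, not a recommendation):

  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'

That only throttles how much recovery runs at once, though; it doesn't explain the pgs going inactive.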
The only way I found to work around it was to set the "size" and "min_size" pool options to "1" before removing the OSD, then set them back to whatever you want after the OSD removal.
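If you want to try the same thing, the commands are roughly as follows (the pool name "rbd" is just an example, substitute your own, and put back your normal values at the end):

  ceph osd pool set rbd size 1
  ceph osd pool set rbd min_size 1
  (remove the OSD here)
  ceph osd pool set rbd size <your normal value>
  ceph osd pool set rbd min_size <your normal value>

Obviously with size=1 there's only a single copy of the data while the OSD is out, so it's a trade-off.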
I think what's happening is that CEPH notices there are a bunch of placement groups that, while replicated elsewhere, still have valid copies on the OSD that's now offline... not 100% sure.

I wish sheepdog would hurry up and mature; it's much less complicated for small-scale situations (1<n<32 hosts) like the ones you and I are running.
After ignoring multiple warnings from Proxmox staff, I configured sheepdog, saw fantastic performance (esp. compared to CEPH) and ... 
promptly got burned when the next update changed the metadata format with *no* in-place upgrade option.  (But until then it was awesome.)

CEPH is a solid option, and I'm glad PVE includes it, but it's very big and complex and cumbersome for low disk-count, low host-count setups.  
(E.g. I have 4 hosts, with 2 OSDs each.  CEPH isn't really designed to scale down that small, at least not very well.)

--
-Adam Thompson
  athompso at athompso.net



