[PVE-User] Poor CEPH performance? or normal?

ronny+pve-user at aasen.cx
Fri Jul 27 14:05:53 CEST 2018


rbd striping is a per-image setting. You may need to make a new rbd image
with striping enabled and migrate the data onto it.
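
As a rough sketch (image names, size and stripe values below are only
examples, adjust them for your setup), you would create a new image with
fancy striping and block-copy the data over while the VM is stopped:

   # new image with fancy striping enabled
   rbd create rbd/vm-100-disk-1-striped --size 500G \
       --stripe-unit 65536 --stripe-count 16

   # copy into the pre-created target; -n stops qemu-img from
   # re-creating the image and losing the striping settings
   qemu-img convert -n -f raw -O raw \
       rbd:rbd/vm-100-disk-1 rbd:rbd/vm-100-disk-1-striped

Afterwards point the VM config at the new image.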


kind regards

Ronny Aasen


On 07/26/18 12:25, Mark Adams wrote:
> Hi Ronny,
>
> Thanks for your suggestions. Do you know if it is possible to change an
> existing rbd pool to striping? Or does this have to be done at first setup?
>
> Regards,
> Mark
>
> On Wed, 25 Jul 2018, 19:20 Ronny Aasen, <ronny+pve-user at aasen.cx> wrote:
>
>> On 25. juli 2018 02:19, Mark Adams wrote:
>>> Hi All,
>>>
>>> I have a Proxmox 5.1 + Ceph cluster of 3 nodes, each with 12 x WD 10TB
>>> GOLD drives. Network is 10Gbps on X550-T2, with a separate network for
>>> the Ceph cluster.
>>>
>>> I have 1 VM currently running on this cluster, which is Debian stretch
>>> with a zpool on it. I'm zfs sending into it, but only getting around
>>> ~15MiB/s write speed. Does this sound right? It seems very slow to me.
>>>
>>> Not only that, but when this zfs send is running, I cannot do any
>>> parallel sends to any other zfs datasets inside the same VM. They just
>>> seem to hang, then eventually say "dataset is busy".
>>>
>>> Any pointers or insights greatly appreciated!
>> Greetings
>>
>> Alwin gave you some good advice about filesystems and VMs; I wanted to
>> say a little about Ceph.
>>
>> With 3 nodes and the default (and recommended) size=3 pools, you cannot
>> tolerate any node failures. IOW, if you lose a node, or need to do
>> lengthy maintenance on it, you are running degraded. I always have a
>> 4th "failure domain" node, so my cluster can self-heal (one of Ceph's
>> killer features) from a node failure. Your cluster should be
>>
>> 3+[how-many-node-failures-i-want-to-be-able-to-survive-and-still-operate-sanely]
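>>
>> You can check what a pool is set to with, for example (the pool name
>> "rbd" here is just a placeholder):
>>
>>    ceph osd pool get rbd size
>>    ceph osd pool get rbd min_size
>>
>> With size=3 and min_size=2 on a 3-node cluster, losing one node leaves
>> every PG degraded until that node returns, because there is no spare
>> node left to re-create the third copy on.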
>>
>> Spinning OSDs with Bluestore benefit greatly from SSD DB/WALs. If your
>> OSDs currently have their DB/WAL on the spinning disk, you can gain a
>> lot of performance by moving the DB/WAL to an SSD or better.
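>>
>> As a rough sketch (device names are examples only), on Luminous an OSD
>> with a separate DB device can be created with something like:
>>
>>    ceph-volume lvm create --bluestore --data /dev/sdb \
>>        --block.db /dev/nvme0n1p1
>>
>> pveceph has an equivalent option when creating an OSD; check its man
>> page for the exact syntax on your version.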
>>
>> Ceph gains performance with scale (number of OSD nodes). So while Ceph's
>> aggregate performance is awesome, an individual single thread will not
>> be amazing. A given set of data will exist on all 3 nodes, and you will
>> hit 100% of the nodes with any write. So by using Ceph with 3 nodes you
>> give Ceph the worst case for performance: with 4 nodes a write would hit
>> 75%, and with 6 nodes it would hit 50% of the cluster. You see where
>> this is going...
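>>
>> You can see this for yourself by asking Ceph where it would place an
>> object (pool and object name here are only examples):
>>
>>    ceph osd map rbd some-object
>>
>> With 3 nodes and size=3, the acting set it prints contains one OSD from
>> every node, so every write touches all of your nodes.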
>>
>> But a single write will only hit one disk in each of the 3 nodes, and
>> will not perform better than the disk it hits. You can cheat more
>> performance with rbd caching, and it is important for performance to get
>> a higher queue depth. AFAIK zfs uses a queue depth of 1, the worst
>> possible case for Ceph. You may have some success by buffering on one or
>> both ends of the transfer [1].
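>>
>> Roughly along these lines (buffer sizes, dataset and host names are only
>> examples, see [1] for details):
>>
>>    zfs send tank/data@snap | mbuffer -s 128k -m 1G | \
>>        ssh the-vm 'mbuffer -s 128k -m 1G | zfs receive pool/data'
>>
>> For rbd caching, make sure the VM disk uses cache=writeback in Proxmox,
>> e.g. (VM id, storage and disk names are examples):
>>
>>    qm set 100 --virtio0 ceph-rbd:vm-100-disk-1,cache=writeback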
>>
>> If the VM has an RBD disk, you may (or may not) benefit from rbd fancy
>> striping [2], since operations can hit more OSDs in parallel.
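>>
>> You can check whether an existing image already uses fancy striping
>> with:
>>
>>    rbd info <pool>/<image>
>>
>> which lists the stripe unit and stripe count when the feature is in use.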
>>
>>
>> good luck
>> Ronny Aasen
>>
>>
>> [1]
>>
>> https://everycity.co.uk/alasdair/2010/07/using-mbuffer-to-speed-up-slow-zfs-send-zfs-receive/
>> [2] http://docs.ceph.com/docs/master/architecture/#data-striping