[PVE-User] Poor CEPH performance? or normal?

Mark Adams mark at openvs.co.uk
Sat Jul 28 13:00:15 CEST 2018


Hi Adam,

Thanks for your great round-up there - your points are excellent.

What I ended up doing a few days ago (apologies, I have been too busy to
respond) was setting rbd cache = true for each client in ceph.conf - this
got me from 15MB/s up to about 70MB/s. I then set the disk holding the
zfs dataset to writeback cache in Proxmox (as you note below), and that has
bumped it up to about 130MB/s - which I am happy with for this setup.
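
In case it helps anyone following along, here is a minimal sketch of the two
changes (the VM ID, bus/slot, storage and disk names below are placeholders -
adjust them to your own setup):

    # /etc/ceph/ceph.conf on each client node
    [client]
         rbd cache = true

    # Proxmox: set writeback cache on the VM disk, either in the GUI
    # (VM -> Hardware -> Hard Disk -> Cache) or with something like:
    qm set 100 --scsi1 ceph-pool:vm-100-disk-1,cache=writeback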

Regards,
Mark

On 27 July 2018 at 14:46, Adam Thompson <athompso at athompso.net> wrote:

> On 2018-07-27 07:05, ronny+pve-user at aasen.cx wrote:
>
>> rbd striping is a per-image setting. You may need to make a new rbd
>> image and migrate the data.
>>
>> On 07/26/18 12:25, Mark Adams wrote:
>>
>>> Thanks for your suggestions. Do you know if it is possible to change an
>>> existing rbd pool to striping, or does this have to be done on first
>>> setup?
>>>
>>
> Please be aware that striping will not result in any increased
> performance, if you are using "safe" I/O modes, i.e. your VM waits for a
> successful flush-to-disk after every sector.  In that scenario, CEPH will
> never give you write performance equal to a local disk because you're
> limited to the bandwidth of a single remote disk [subsystem] *plus* the
> network round-trip latency, which even if measured in microseconds, still
> adds up.
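>
> As a rough back-of-the-envelope illustration (assuming, say, a ~0.5 ms
> network round trip and 4 KiB synchronous writes): 1 / 0.0005 s allows at
> most ~2000 writes per second, and 2000 x 4 KiB is only about 8 MB/s, no
> matter how fast the underlying disks are.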
>
> Based on my experience with this and other distributed storage systems, I
> believe you will likely find that you get large write-performance gains by:
>
> 1. Use the largest possible block size during writes.  512B sectors are
> the worst-case scenario for any remote storage.  Try to write in chunks of
> *at least* 1 MByte, and it's not unreasonable nowadays to write in chunks
> of 64MB or larger.  The rationale here is that you're spending more time
> sending data and less time waiting for ACKs; the more you can tilt that
> balance in favor of data, the better off you are.  (There are downsides to
> huge sector/block/chunk sizes, though - this isn't a "free lunch" scenario.
> See #5.)  A rough dd sketch after this list illustrates the difference.
>
> 2. Relax your write-consistency requirements.  If you can tolerate the
> small risk that comes with "Write Back", you should see better performance,
> especially during burst writes.  During large sequential writes there are
> not many ways to violate the laws of physics, and CEPH automatically
> amplifies your writes by (in your case) a factor of 2x due to replication.
> (The same sketch below also shows what per-write flushes cost.)
>
> 3. Switch to storage devices with the best possible local write speed for
> your OSDs.  OSDs are limited by the performance of the underlying device or
> virtual device (e.g. it's totally possible to run OSDs on a hardware RAID6
> controller).
>
> 4. Avoid CoW-on-CoW.  Write amplification means you'll lose around 50% of
> your IOPS and/or I/O bandwidth for each level of CoW nesting, depending on
> workload.  So don't put CEPH OSDs on, say, BTRFS or ZFS filesystems.  A
> worst-case scenario would be something like running a VM using ZFS on top
> of CEPH, where the OSDs are located on BTRFS filesystems, which are in turn
> virtual devices hosted on ZFS filesystems.  Welcome to 1980's storage
> performance, in that case!  (I did it once without realizing... seriously,
> 5 MBps sequential writes was a good day!)  FWIW, CoW filesystems are
> generally awesome - just not when stacked.  A sufficiently fast external
> NAS running ZFS with VMs stored over NFS can provide decent performance,
> *if* tuned correctly.  iX Systems, for example, spends a lot of time &
> effort making this work well, including some lovely HA NAS appliances.
> (A quick df check for this is shown after this list.)
>
> 5. Remember the triangle.  You can optimize a distributed storage system
> for any TWO of: a) cost, b) resiliency/reliability/HA, or c) speed.  (This
> is a specific case of the traditional good/fast/cheap:pick-any-2 adage.)
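>
> As a rough, hypothetical illustration of points 1 and 2 (the test file
> path, sizes and flags below are only an example), you can compare small
> synchronous writes against large writes without per-write flushes from
> inside a VM:
>
>     # 4 KiB writes, flushed after every write - worst case for remote storage
>     dd if=/dev/zero of=/root/ceph-test bs=4k count=65536 oflag=direct,dsync
>
>     # 4 MiB writes without per-write flushes - far fewer round trips per byte
>     dd if=/dev/zero of=/root/ceph-test bs=4M count=64 oflag=direct
>
> The absolute numbers depend on your cluster, but the gap between the two is
> roughly the cost of the per-write round trips described above.
>
> For point 4, a quick way to see which filesystem an OSD sits on (the path
> is the typical default; adjust the OSD ID to your own):
>
>     df -T /var/lib/ceph/osd/ceph-0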
>
>
> I'm not sure I'm saying anything new here; I may have just summarized the
> discussion, but the points remain valid.
>
> Good luck with your performance problems.
> -Adam
>
> _______________________________________________
> pve-user mailing list
> pve-user at pve.proxmox.com
> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>
