[PVE-User] Poor CEPH performance? or normal?

Adam Thompson athompso at athompso.net
Sun Jul 29 16:41:24 CEST 2018


On July 28, 2018 6:00:15 AM CDT, Mark Adams <mark at openvs.co.uk> wrote:
>Hi Adam,
>
>Thanks for your great round up there - Your points are excellent.
>
>What I ended up doing a few days ago (apologies, I have been too busy
>to respond) was setting rbd cache = true under each client in the
>ceph.conf - this got me from 15MB/s up to about 70MB/s. I then set the
>disk holding the zfs dataset to writeback cache in proxmox (as you note
>below) and that has bumped it up to about 130MB/s - which I am happy
>with for this setup.
>
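>For reference, the client-side change is just this stanza in ceph.conf
>on each client node (the [client] section is where client-wide options
>live; nothing else needed changing for this):
>
>  [client]
>      rbd cache = true
>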
>Regards,
>Mark
>
>On 27 July 2018 at 14:46, Adam Thompson <athompso at athompso.net> wrote:
>
>> On 2018-07-27 07:05, ronny+pve-user at aasen.cx wrote:
>>
>>> rbd striping is a per-image setting. you may need to create a new rbd
>>> image with striping enabled and migrate the data over.
>>>
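>>> For illustration only - image names and striping numbers here are
>>> made up, check the rbd man page for your release - copying an
>>> existing image into a new striped one looks something like:
>>>
>>>   rbd export rbd/vm-100-disk-1 - \
>>>     | rbd import --stripe-unit 65536 --stripe-count 16 - rbd/vm-100-disk-1-striped
>>>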
>>> On 07/26/18 12:25, Mark Adams wrote:
>>>
>>>> Thanks for your suggestions. Do you know if it is possible to change
>>>> an existing rbd pool to striping? or does this have to be done on
>>>> first setup?
>>>>
>>>
>> Please be aware that striping will not result in any increased
>> performance, if you are using "safe" I/O modes, i.e. your VM waits for
>> a successful flush-to-disk after every sector.  In that scenario, CEPH
>> will never give you write performance equal to a local disk because
>> you're limited to the bandwidth of a single remote disk [subsystem]
>> *plus* the network round-trip latency, which even if measured in
>> microseconds, still adds up.
>>
>> Based on my experience with this and other distributed storage
>> systems, I believe you will likely find that you get large
>> write-performance gains by:
>>
>> 1. use the largest possible block size during writes.  512B sectors
>> are the worst-case scenario for any remote storage.  Try to write in
>> chunks of *at least* 1 MByte, and it's not unreasonable nowadays to
>> write in chunks of 64MB or larger.  The rationale here is that you're
>> spending more time sending data, and less time waiting for ACKs.  The
>> more you can tilt that in favor of data, the better off you are.
>> (There are downsides to huge sector/block/chunk sizes, though - this
>> isn't a "free lunch" scenario.  See #5.)
>>
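>> A quick way to see this effect for yourself (purely illustrative - the
>> path is a placeholder, point it at a scratch file on the RBD-backed
>> disk; oflag=dsync forces a flush per write to mimic "safe" mode):
>>
>>   # small synchronous writes - worst case
>>   dd if=/dev/zero of=/mnt/scratch/ddtest bs=4k count=2000 oflag=dsync
>>   # large writes - far fewer round trips for the same amount of data
>>   dd if=/dev/zero of=/mnt/scratch/ddtest bs=4M count=2 oflag=dsync
>>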
>> 2. relax your write-consistency requirements.  If you can tolerate the
>> small risk with "Write Back" you should see better performance,
>> especially during burst writes.  During large sequential writes, there
>> are not many ways to violate the laws of physics, and CEPH
>> automatically amplifies your writes by (in your case) a factor of 2x
>> due to replication.
>>
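>> (In Proxmox that is the per-disk Cache setting - Hardware -> Hard Disk
>> -> Edit in the GUI, or roughly like this from the CLI, where the VM
>> id, bus and volume name are placeholders for your own:
>>
>>   qm set 100 --scsi0 ceph-rbd:vm-100-disk-1,cache=writeback
>>
>> Note that re-specifying a drive with qm set replaces its whole option
>> string, so keep any other options you had on that line.)
>>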
>> 3. switch to storage devices with the best possible local write speed,
>> for OSDs.  OSDs are limited by the performance of the underlying
>> device or virtual device.  (e.g. it's totally possible to run OSDs on
>> a hardware RAID6 controller)
>>
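>> (If you want to find out which OSD is holding things back, something
>> like the following is a rough starting point - ceph tell osd.N bench
>> pushes a short write burst through one OSD and reports its speed, and
>> ceph osd perf shows per-OSD latencies; run it per OSD id:
>>
>>   ceph tell osd.0 bench
>>   ceph osd perf
>>
>> Just an illustration, not a proper benchmark.)
>>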
>> 4. Avoid CoW-on-CoW.  Write amplification means you'll lose around 50%
>> of your IOPS and/or I/O bandwidth for each level of CoW nesting,
>> depending on workload.  So don't put CEPH OSDs on, say, BTRFS or ZFS
>> filesystems.  A worst-case scenario would be something like running a
>> VM using ZFS on top of CEPH, where the OSDs are located on BTRFS
>> filesystems, which are in turn virtual devices hosted on ZFS
>> filesystems.  Welcome to 1980's storage performance, in that case!
>> (I did it without realizing once... seriously, 5 MBps sequential
>> writes was a good day!)  FWIW, CoW filesystems are generally awesome -
>> just not when stacked.  A sufficiently fast external NAS running ZFS
>> with VMs stored over NFS can provide decent performance, *if* tuned
>> correctly.  iX Systems, for example, spends a lot of time & effort
>> making this work well, including some lovely HA NAS appliances.
>>
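>> (A quick sanity check of what your OSDs are actually sitting on - both
>> commands are just illustrative, substitute your own OSD ids and paths:
>>
>>   ceph osd metadata 0 | grep osd_objectstore
>>   df -T /var/lib/ceph/osd/ceph-0   # filestore OSDs mount here by default
>>
>> Bluestore OSDs write to raw block devices, so the filesystem-stacking
>> concern mostly applies to filestore.)
>>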
>> 5. Remember the triangle.  You can optimize a distributed storage
>> system for any TWO of: a) cost, b) resiliency/reliability/HA, or c)
>> speed.  (This is a specific case of the traditional
>> good/fast/cheap:pick-any-2 adage.)
>>
>>
>> I'm not sure I'm saying anything new here, I may have just summarized
>> the discussion, but the points remain valid.
>>
>> Good luck with your performance problems.
>> -Adam
>>

That's a pretty good result.  You now have some very small windows where recently-written data could be lost, but for most applications that risk is not unreasonable.
In exchange, you get very good throughput for spinning rust.
(FWIW, I gave up on CEPH because my nodes only have 2Gbps network each, but I am seeing similar speeds with local ZFS+ZIL+L2ARC on 15k SAS drives.  These are older systems, obviously.)
Thanks for sharing your solution!
-Adam
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

