[PVE-User] Poor CEPH performance? or normal?

Adam Thompson athompso at athompso.net
Fri Jul 27 15:46:14 CEST 2018


On 2018-07-27 07:05, ronny+pve-user at aasen.cx wrote:
> rbd striping is a per image setting. you may need to make the rbd
> image and migrate data.
> 
> On 07/26/18 12:25, Mark Adams wrote:
>> Thanks for your suggestions. Do you know if it is possible to change 
>> an
>> existing rbd pool to striping? or does this have to be done on first 
>> setup?

Please be aware that striping will not result in any increased 
performance if you are using "safe" I/O modes, i.e. your VM waits for 
a successful flush-to-disk after every sector.  In that scenario, CEPH 
will never give you write performance equal to a local disk, because 
you're limited to the throughput of a single remote disk [subsystem] 
*plus* the network round-trip latency on every write, which even if 
measured in microseconds, still adds up.
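
To answer the earlier question directly: as Ronny says, striping is a 
per-image setting, so you can't switch it on for an existing image or 
pool.  The usual approach is to create a new image with the stripe 
layout you want and copy the data into it.  Roughly along these lines 
(an untested sketch - the image names and stripe values are only 
placeholders, and option syntax varies between CEPH releases, so check 
"rbd help create" first):

  # new image with explicit striping (values are examples only)
  rbd create rbd/vm-100-disk-1-striped --size 100G \
      --object-size 4M --stripe-unit 1M --stripe-count 4

  # copy the old image's data in, then point the VM at the new image
  qemu-img convert -p -n -f raw -O raw \
      rbd:rbd/vm-100-disk-1 rbd:rbd/vm-100-disk-1-striped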

Based on my experience with this and other distributed storage 
systems, you will likely see large write-performance gains if you:

1. Use the largest possible block size during writes.  512B sectors 
are the worst-case scenario for any remote storage.  Try to write in 
chunks of *at least* 1 MByte; nowadays it's not unreasonable to write 
in chunks of 64 MB or larger.  The rationale is that you spend more 
time sending data and less time waiting for ACKs, and the more you can 
tilt that balance towards data, the better off you are.  (There are 
downsides to huge sector/block/chunk sizes, though - this isn't a 
"free lunch" scenario.  See #5.)  There's a quick fio sketch after 
this list if you want to measure the effect yourself.

2. Relax your write-consistency requirements.  If you can tolerate 
the small risk that comes with "Write Back", you should see better 
performance, especially during burst writes.  During large sequential 
writes there are not many ways to violate the laws of physics, and 
bear in mind that CEPH amplifies your writes by (in your case) a 
factor of 2x due to replication.  (See the qm example after this list 
for setting the cache mode on Proxmox.)

3. Switch to storage devices with the best possible local write speed 
for your OSDs.  OSDs are limited by the performance of the underlying 
device or virtual device.  (For example, it's entirely possible to run 
OSDs on a hardware RAID6 controller.)  There are two quick benchmark 
commands after this list for checking what your OSDs actually deliver.

4. Avoid CoW-on-CoW.  Write amplification means you'll lose around 50% 
of your IOPS and/or I/O bandwidth for each level of CoW nesting, 
depending on workload.  So don't put CEPH OSDs on, say, BTRFS or ZFS 
filesystems.  A worst-case scenario would be something like running a 
VM using ZFS on top of CEPH, where the OSDs are located on BTRFS 
filesystems, which are in turn virtual devices hosted on ZFS 
filesystems.  Welcome to 1980s storage performance, in that case!  (I 
did this once without realizing it... seriously, 5 MBps sequential 
writes was a good day!)  FWIW, CoW filesystems are generally awesome - 
just not when stacked.  A sufficiently fast external NAS running ZFS 
with VMs stored over NFS can provide decent performance, *if* tuned 
correctly.  iX Systems, for example, spends a lot of time & effort 
making this work well, including some lovely HA NAS appliances.

5. Remember the triangle.  You can optimize a distributed storage 
system for any TWO of: a) cost, b) resiliency/reliability/HA, or c) 
speed.  (This is a specific case of the traditional good/fast/cheap, 
pick-any-two adage.)
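
To put rough numbers on point 1 yourself, fio shows the block-size 
effect quite clearly.  A minimal sketch, assuming fio is installed and 
/tmp has a gigabyte to spare (adjust the filename and size to suit):

  # small blocks: many round trips, worst case for remote storage
  fio --name=bs4k --filename=/tmp/fio-test --rw=write --bs=4k \
      --size=1G --direct=1 --ioengine=libaio

  # large blocks: far fewer round trips for the same amount of data
  fio --name=bs4m --filename=/tmp/fio-test --rw=write --bs=4M \
      --size=1G --direct=1 --ioengine=libaio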
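
For point 2, Proxmox sets the cache mode per virtual disk, either in 
the GUI (the disk's Cache option) or with qm.  Something like the 
following - the VM ID, bus and volume name are placeholders, and you 
should keep whatever other disk options you already have on that line:

  qm set 100 --scsi0 ceph-pool:vm-100-disk-1,cache=writeback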
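
And for point 3, it's worth measuring what the OSDs and the pool 
actually deliver before tuning anything else.  Two quick checks (the 
pool name is a placeholder; rados bench writes real data, so run it 
against a test pool if you can):

  # write test against a single OSD's data path
  ceph tell osd.0 bench

  # 10-second write benchmark against a whole pool, 16 threads
  rados bench -p ceph-pool 10 write -t 16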


I'm not sure I'm saying anything new here; I may have just summarized 
the discussion, but the points remain valid.

Good luck with your performance problems.
-Adam


