[PVE-User] Custom storage in ProxMox 5

Lindsay Mathieson lindsay.mathieson at gmail.com
Sat Mar 31 01:58:06 CEST 2018


On 30/03/2018 7:40 PM, Alexandre DERUMIER wrote:
> Hi,
>
>>> Ceph has rather larger overheads
> Agree. There is overhead, but performance increases with each release.
> I think the biggest problem is that you can't reach more than 70-90k iops with 1 vm disk currently.
> and maybe latency could be improved too.

The performance I got with Ceph was suboptimal - as mentioned earlier, 
if you throw lots of money at enterprise hardware & SSDs then it's OK, 
but that sort of expenditure was not possible for our SMB. Something not 
mentioned is that it does not do well on small setups. A bare minimum is 
5 nodes with multiple OSDs.


>
>>> much bigger PITA to admin
> don't agree. I'm running 5 ceph clusters (around 200TB ssd). Almost 0 maintenance.


Adding/replacing/removing OSDs can be a nightmare. The potential to 
trash your pools is only one misstep away. Adding extra nodes with 
lizardfs is trivial and lightweight. It took me minutes to add several 
desktop stations as chunkservers in our office. Judicious use of labels 
and goals allows us to distribute data amongst them as desired, by 
performance and space. The three high-performance compute nodes get a 
copy of all chunks, which speeds up local reads.
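
For anyone curious, the setup is roughly the following (a sketch from 
memory - the label and goal names here are just examples, check the 
lizardfs docs for the exact syntax). Each chunkserver gets a LABEL in 
mfschunkserver.cfg, and mfsgoals.cfg on the master maps goal names to 
label expressions:

    # /etc/mfs/mfschunkserver.cfg on each chunkserver - tag the node
    LABEL = compute            # e.g. on the three compute nodes
    # LABEL = desktop          # on the office desktops

    # /etc/mfs/mfsgoals.cfg on the master: id, name, expression
    # "compute" keeps a copy on a compute-labelled node, "_" means any
    # chunkserver. Goal names are then assigned per file or directory.
    5 allcompute : compute compute compute _
    6 ec52 : $ec(5,2)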

>
>>> does not perform as well on whitebox hardware
> define whitebox hardware ?

Consumer-grade drives and disk controllers. Maybe an Ozzie phrase?

> the only thing is to not use consumer SSDs (because they suck with direct IO)


True, they are shocking. There's also the lack of power-loss protection.

>
>>> Can Ceph run multiple replication and ec levels for different files on the same volume?
> you can manage it by pool. (as block storage).


With lizardfs I can set individual replication levels and EC modes per 
file (VM). Immensely useful for different classes of VMs. I have server 
VMs on replica level 4, desktop VMs on replica 2 (higher performance), 
and archived data on ec(5,2) (fast read, slow write). I use ZFS 
underneath for transparent compression.
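
In practice that is just the goal tool run against each VM's directory 
or image file, something like the following (the goal names have to 
exist in mfsgoals.cfg, and the paths are made-up examples):

    # server VMs: 4 replicas; desktops: 2; archives: erasure coding
    lizardfs setgoal -r 4    /mnt/lizardfs/vms/servers/
    lizardfs setgoal -r 2    /mnt/lizardfs/vms/desktops/
    lizardfs setgoal -r ec52 /mnt/lizardfs/archive/
    # check what a single VM image currently has
    lizardfs getgoal /mnt/lizardfs/vms/servers/vm-100-disk-1.raw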

Halo writes are in testing for the next release; they allow even more 
performance trade-offs for VMs where minor data loss is not critical.

As with ZFS, chunks are checksummed and also versioned.

All this in a single namespace (lizardfs FUSE mount), similar to a 
single Ceph pool.

A qemu block driver is in the works, which should sidestep FUSE 
performance limitations.

>>> Near instantaneous spanshots
> yes
>
>>> and restores
> a little bit slower to rollback


A *lot* slower for me - snapshot restores were taking literally hours, 
and they killed Ceph performance in the process. That made snapshots 
useless for us.

>
>>> at any level of the filesystem you choose?
> why are you talking about filesystems? Are you mounting lizard inside your vm? or for hosting vm disks?

lizardfs exposes a POSIX filesystem via FUSE, similar to CephFS. Its 
replication goals can be set per file, unlike Ceph's, which are per 
pool, and they can be changed on the fly.
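
To illustrate the difference (commands from memory; the pool, profile 
and file names are made-up examples) - in Ceph the replica count and EC 
profile are properties of a pool, in lizardfs the goal is a property of 
the file or directory:

    # Ceph: applies to everything stored in the pool
    ceph osd pool set vm-pool size 3
    ceph osd erasure-code-profile set ec52 k=5 m=2   # EC needs its own pool

    # lizardfs: per file, changeable at any time
    lizardfs setgoal 2 /mnt/lizardfs/vms/desktop-101.raw
    lizardfs getgoal /mnt/lizardfs/vms/desktop-101.raw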


lizardfs is not without its limitations of course.

  * It's FUSE based, which is a performance hit. A block-level qemu
    driver is in the works.
  * Its metadata server is Active/Passive, not Active/Active, and
    failover has to be managed by custom scripts using cluster tools
    such as keepalived, corosync etc. (a sketch follows this list). A
    built-in failover tool is in the next release.
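
For what it's worth, the keepalived side of our failover looks roughly 
like this (a trimmed sketch - the interface, VIP and script path are 
made-up, and the notify script is a hypothetical wrapper around 
whatever promotes the local shadow master, e.g. lizardfs-admin's 
promote-shadow if memory serves):

    vrrp_instance LIZARDFS_MASTER {
        state BACKUP                # both nodes start as BACKUP
        interface eth0              # example interface
        virtual_router_id 51
        priority 100                # lower priority on the shadow node
        advert_int 1
        virtual_ipaddress {
            192.168.1.50/24         # example floating IP the mounts use
        }
        # hypothetical script that promotes the local shadow master
        notify_master "/usr/local/sbin/lizardfs-promote.sh"
    }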



-- 
Lindsay



