[pve-devel] ZFS Storage Patches

Chris Allen ca.allen at gmail.com
Fri Mar 14 20:21:43 CET 2014


> I have already made some tests and have not been able to make any
> conclusive test proving that performance is hurt by using sparse.

Yeah, it shouldn't affect the ZFS mechanics at all; the ZVOL will just
lack a reservation.
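
For reference, this is the difference I'm talking about (pool and volume
names here are made up, not my actual layout):

  # A "thick" zvol carries a refreservation equal to its size; a sparse
  # (-s) zvol has none and only consumes what is actually written.
  zfs create    -V 100G tank/vm-100-disk-1   # thick: refreservation ~100G
  zfs create -s -V 100G tank/vm-101-disk-1   # sparse: refreservation none

  # Compare what each one actually costs the pool:
  zfs get refreservation,used tank/vm-100-disk-1 tank/vm-101-disk-1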


> Is sparse a way to provision more than 100% then?
Yes.  That, and it enables you to take advantage of compression on the
volume.  Without sparse the volume is always going to take away the same
amount of space from the pool (due to the hard reservation) regardless of
whether or not compression and/or dedup is on.  You just have to be careful
to monitor pool capacity.  Bad things will happen if your SAN server runs
out of space...  I attached a quick and dirty script I wrote to monitor
pool capacity and status, and send an e-mail alert if the pool degrades or
a capacity threshold is hit.  I run it from cron every 30 minutes.
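
The attached script is quick and dirty; the gist of it is something like
this sketch (pool name, threshold, and address below are placeholders,
not my real values):

  #!/bin/sh
  # Alert if the pool is not ONLINE or has crossed a capacity threshold.
  POOL="tank"
  THRESHOLD=80
  MAILTO="admin@example.com"

  HEALTH=$(zpool list -H -o health "$POOL")
  CAP=$(zpool list -H -o capacity "$POOL" | tr -d '%')

  if [ "$HEALTH" != "ONLINE" ] || [ "$CAP" -ge "$THRESHOLD" ]; then
      {
          echo "Pool $POOL is $HEALTH at ${CAP}% capacity"
          zpool status "$POOL"
      } | mail -s "ZFS alert on $(hostname): $POOL" "$MAILTO"
  fi

A crontab line like "0,30 * * * * /root/zfs-check.sh" covers the
every-30-minutes part (I don't think the stock cron on OmniOS takes the
*/30 shorthand).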


> For me an 8k block size for volumes seems to give more write speed.
8k for me too is much better than 4k.  With 4k I tend to hit my IOPS limit
easily, with not much throughput, and I get a lot of IO delay on VMs when
the SAN is fairly busy.  Currently I'm leaning towards 16k, sparse, with
lz4 compression.  If you go the sparse route then compression is a
no-brainer, as it accelerates performance on the underlying storage
considerably.  Compression lowers both your IOPS and your data usage, and
both are good things for performance.  ZFS performance drops as usage
rises and gets
really ugly at around 90% capacity.  Some people say it starts to drop with
as little as 10% used, but I have not tested this.  With 16k block sizes
I'm getting good compression ratios - my best volume is 2.21x, my worst
1.33x, and the average is 1.63x.  So as you can see a lot of the time my
real block size on disk is going to be effectively smaller than 16k.  The
tradeoff here is that compression ratios will go up with a larger block
size, but you'll have to do larger operations and thus more waste will
occur when the VM is doing small I/O.  With a large block size on a busy
SAN your I/O is going to get fragmented before it hits the disk anyway, so
I think 16k is a good balance.  I only have 7200 RPM drives in my array, but
a ton of RAM and a big ZFS cache device, which is another reason I went
with 16k, to maximize what I get when I can get it.  I think with 15k RPM
drives an 8k block size might be better, as your IOPS limit will be
roughly double that of 7200 RPM drives.
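
For the record, the kind of volume I'm describing gets created with
something along these lines (names and sizes are just examples):

  # Sparse 16k zvol with lz4 compression:
  zfs create -s -b 16k -o compression=lz4 -V 200G tank/vm-102-disk-1

  # Check how compression is paying off, per volume and pool-wide:
  zfs list -t volume -o name,volsize,used,compressratio
  zfs get compressratio tank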

Dedup did not work out well for me.  Aside from the huge memory
consumption, it didn't save all that much space, and to save the maximum
space you need to match the VM's filesystem cluster size to the ZVOL
block size, which means 4k for ext4 and NTFS (unless you change it during
a Windows install).  Dedup also really slows down zpool scrubbing and
possibly resilvering, which is one of the main reasons I avoid it.  I
don't want scrubs to take forever when I'm paranoid that something might
be wrong.
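
If you're curious what dedup would buy you before committing the RAM, you
can simulate it on an existing pool; it's read-heavy, so I wouldn't run it
against a busy SAN in the middle of the day (pool name is a placeholder):

  # Simulate dedup without enabling it; prints a DDT histogram and an
  # estimated dedup ratio at the end.  Can take a long time on big pools.
  zdb -S tank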


> Regarding write caching: why not simply use sync
> directly on the volume?

Good question.  I don't know.
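
I'm assuming you mean something like the following, i.e. forcing
synchronous semantics at the dataset level rather than disabling the
write cache per LU (volume name is a placeholder):

  # Force every write to the zvol to be committed to stable storage
  # (via the ZIL) before it is acknowledged, whatever the initiator asks:
  zfs set sync=always tank/vm-100-disk-1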


> I have made no tests on Solaris - license costs are out of my league. I
> regularly test FreeBSD, Linux and OmniOS. In production I only use
> OmniOS (r151008, but will migrate everything to r151014 when it is
> released, and then only use LTS in the future).

I'm in the process of trying to run away from all things Oracle at my
company.  We keep getting burned by them.  It's so freakin' expensive, and
they hold you over a barrel with patches for both hardware and software.
 We bought some very expensive hardware from them, and a management
controller for a blade chassis had major bugs to the point it was
practically unusable out of the box.  Oracle would not under any
circumstance supply us with the new firmware unless we spent boatloads of
cash for a maintenance contract.  We ended up doing this because we needed
the controller to work as advertised.  This is what annoys me the most
about them: you buy a product, it doesn't do what is written on the box,
and then you have to pay tons extra for it to do what they said it would
do when you bought it.  I miss Sun...



On Fri, Mar 14, 2014 at 10:52 AM, Michael Rasmussen <mir at datanom.net> wrote:

> On Fri, 14 Mar 2014 10:11:17 -0700
> Chris Allen <ca.allen at gmail.com> wrote:
>
> > > It was also part of the latest 3.1. Double-click the mouse over your
> > > storage specification in Datacenter->storage and the panel pops up.
> > > Patched panel attached.
> >
> I forgot to mention that at the moment the code for creating ZFS
> storage is commented out
> in /usr/share/pve-manager/ext4/pvemanagerlib.js, lines 20465-20473.
>
> >
> > No I haven't.  As far as I understand it sparse should not affect
> > performance whatsoever, it only changes whether or not a reservation is
> > created on the ZVOL.  Turning off write caching on the LU should decrease
> > performance, dramatically so, if you do not have a separate and very fast
> > ZIL device (e.g. ZeusRAM).  Every block write to the ZVOL will be done
> > synchronously when write caching is turned off.
> >
> I have already made some tests and have not been able to make any
> conclusive test proving that performance is hurt by using sparse. Is
> sparse a way to provision more than 100% then?
>
> > I've done some testing with regards to block size, compression, and
> > dedup.  I wanted sparse support for myself and I figured while I was
> > there I might as well add a flag for turning off write caching.  For
> > people with the right (and expensive!) hardware the added safety of no
> > write caching might be worth it.
> >
> I have done the same. For me an 8k block size for volumes seems to give
> more write speed. Regarding write caching: why not simply use sync
> directly on the volume?
>
> > Have you tested the ZFS storage plugin on Solaris 11.1?  I first tried
> > using it with 11.1, but they changed how the LUN assignment for the
> > views works.  In 11.0 and OmniOS the first available LUN will get used
> > when a new view is created if no LUN is given.  But in 11.1 it gets
> > populated with a string that says "AUTO".  This of course means PVE
> > can't connect to the volume because it can't resolve the LUN.
> > Unfortunately I couldn't find anything in the 11.1 documentation that
> > described how to get the LUN.  I'm assuming there's some kind of
> > mechanism in 11.1 where you can get the number on the fly, as it must
> > handle them dynamically now.  But after a lot of Googling and fiddling
> > around I gave up and switched to OmniOS.  I don't have a support
> > contract with Oracle so that was a no go.  Anyway, just thought I'd
> > mention that in case you knew about it.
> >
> > In addition to that problem, 11.1 also has a bug in its handling of the
> > iSCSI feature Immediate Data.  It doesn't implement it properly
> > according to the iSCSI RFC, and so you need to turn off Immediate Data
> > on the client in order to connect.  The patch is available to paying
> > Oracle support customers only.
> >
> I have made no tests on Solaris - license costs are out of my league. I
> regularly test FreeBSD, Linux and OmniOS. In production I only use
> OmniOS (r151008, but will migrate everything to r151014 when it is
> released, and then only use LTS in the future).
>
> --
> Hilsen/Regards
> Michael Rasmussen
>
> Get my public GnuPG keys:
> michael <at> rasmussen <dot> cc
> http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xD3C9A00E
> mir <at> datanom <dot> net
> http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xE501F51C
> mir <at> miras <dot> org
> http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xE3E80917
> --------------------------------------------------------------
> /usr/games/fortune -es says:
> I never failed to convince an audience that the best thing they
> could do was to go away.
>

