[pve-devel] ZFS Storage Patches

Chris Allen ca.allen at gmail.com
Fri Mar 14 20:24:10 CET 2014


Oops, I forgot to attach the script.  Here's the script I mentioned.


On Fri, Mar 14, 2014 at 12:21 PM, Chris Allen <ca.allen at gmail.com> wrote:

> > I have already made some tests and have not been able to produce any
> > conclusive results showing that performance is hurt by using sparse.
>
> Yeah, it shouldn't affect the ZFS mechanics at all; the ZVOL will just lack
> a reservation.
>
>
> > Is sparse a way to provision more than 100% then?
> Yes.  That, and it enables you to take advantage of compression on the
> volume.  Without sparse the volume is always going to take away the same
> amount of space from the pool (due to the hard reservation) regardless of
> whether or not compression and/or dedup is on.  You just have to be careful
> to monitor pool capacity.  Bad things will happen if your SAN server runs
> out of space...  I attached a quick and dirty script I wrote to monitor
> pool capacity and status, and send an e-mail alert if the pool degrades or
> a capacity threshold is hit.  I run it from cron every 30 minutes.
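> The rough shape of it is something like the sketch below (a from-memory
> sketch, not the attached zfs_alert.py; the pool name, threshold, and mail
> addresses are placeholders, and it assumes "zpool" is in PATH with a local
> MTA listening on localhost):
>
> #!/usr/bin/env python
> # Rough sketch of a pool alert script (not the attached zfs_alert.py).
> # Pool name, threshold, and mail addresses below are placeholders.
> import subprocess
> import smtplib
> from email.mime.text import MIMEText
>
> POOL = "tank"                  # placeholder pool name
> CAP_THRESHOLD = 80             # alert when capacity exceeds this percentage
> MAIL_FROM = "zfs-alert@example.com"
> MAIL_TO = "admin@example.com"
>
> def zpool_field(field):
>     # e.g. "zpool list -H -o health tank" prints "ONLINE"
>     out = subprocess.check_output(["zpool", "list", "-H", "-o", field, POOL])
>     return out.decode().strip()
>
> def send_alert(subject, body):
>     msg = MIMEText(body)
>     msg["Subject"] = subject
>     msg["From"] = MAIL_FROM
>     msg["To"] = MAIL_TO
>     smtp = smtplib.SMTP("localhost")
>     smtp.sendmail(MAIL_FROM, [MAIL_TO], msg.as_string())
>     smtp.quit()
>
> health = zpool_field("health")
> capacity = int(zpool_field("capacity").rstrip("%"))
>
> problems = []
> if health != "ONLINE":
>     problems.append("pool %s health is %s" % (POOL, health))
> if capacity >= CAP_THRESHOLD:
>     problems.append("pool %s is %d%% full" % (POOL, capacity))
>
> if problems:
>     status = subprocess.check_output(["zpool", "status", POOL]).decode()
>     send_alert("ZFS alert on %s" % POOL, "\n".join(problems) + "\n\n" + status)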
>
>
> > For me an 8k block size for volumes seems to give more write speed.
> 8k for me too is much better than 4k.  With 4k I tend to hit my IOPS limit
> easily, with not much throughput, and I get a lot of IO delay on VMs when
> the SAN is fairly busy.  Currently I'm leaning towards 16k, sparse, with
> lz4 compression.  If you go the sparse route then compression is a
> no-brainer, as it accelerates performance on the underlying storage
> considerably.  Compression will lower your IOPS and data usage, and both are
> good things for performance.  ZFS performance drops as usage rises and gets
> really ugly at around 90% capacity.  Some people say it starts to drop with
> as little as 10% used, but I have not tested this.  With 16k block sizes
> I'm getting good compression ratios - my best volume is 2.21x, my worst
> 1.33x, and the average is 1.63x.  So, as you can see, a lot of the time my
> effective block size on disk is going to be smaller than 16k.  The
> tradeoff here is that compression ratios will go up with a larger block
> size, but you'll have to do larger operations and thus more waste will
> occur when the VM is doing small I/O.  With a large block size on a busy
> SAN your I/O is going to get fragmented before it hits the disk anyway, so
> I think 16k is a good balance.  I only have 7200 RPM drives in my array, but
> a ton of RAM and a big ZFS cache device, which is another reason I went
> with 16k, to maximize what I get when I can get it.  I think with 15k RPM
> drives an 8k block size might be better, as your IOPS limit will be roughly
> double that of a 7200 RPM drive.
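>
> In case it helps anyone reproduce this by hand, creating a volume with those
> settings boils down to something like the sketch below (pool and volume names
> are made up; it's just the plain zfs invocation, run through Python here for
> consistency with the alert script):
>
> # Sketch: create a sparse 16K-block zvol with lz4 compression.
> # Pool/volume name and size are placeholders.
> import subprocess
>
> subprocess.check_call([
>     "zfs", "create",
>     "-s",                      # sparse: no reservation is set on the zvol
>     "-V", "100G",              # logical size of the zvol
>     "-o", "volblocksize=16K",  # fixed at creation time, cannot be changed later
>     "-o", "compression=lz4",
>     "tank/vm-101-disk-1",      # placeholder dataset name
> ])
>
> # Afterwards you can watch the ratio with:
> #   zfs get compressratio tank/vm-101-disk-1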
>
> Dedup did not work out well for me.  Aside from the huge memory
> consumption, it didn't save all that much space, and to save the maximum
> space you need to match the VM's filesystem cluster size to the ZVOL block
> size, which means 4k for ext4 and NTFS (unless you change it during a
> Windows install).  Also, dedup really, really slows down zpool scrubbing and
> possibly rebuilds.  This is one of the main reasons I avoid it.  I don't
> want scrubs to take forever when I'm paranoid that something is potentially
> wrong.
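>
> To put a rough number on the memory point, this is the back-of-the-envelope
> math I use (the ~320 bytes per dedup-table entry figure is the commonly
> quoted rule of thumb, so treat it as an estimate, and the data size is a
> placeholder):
>
> # Back-of-the-envelope DDT RAM estimate; rule-of-thumb figures, not measured.
> DATA_BYTES = 2 * 1024 ** 4      # e.g. 2 TiB of unique data (placeholder)
> BLOCK_SIZE = 16 * 1024          # zvol volblocksize
> DDT_ENTRY_BYTES = 320           # commonly quoted per-entry cost in RAM
>
> blocks = DATA_BYTES // BLOCK_SIZE
> ram_gib = blocks * DDT_ENTRY_BYTES / 1024.0 ** 3
> print("~%.0f GiB of RAM just to keep the dedup table hot" % ram_gib)  # ~40 GiB
>
> And that is with 16k blocks; at 4k blocks it would be four times that.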
>
>
> > Regarding write caching: why not simply use sync
> > directly on the volume?
>
> Good question.  I don't know.
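> If by "use sync directly" you mean forcing synchronous writes with the sync
> property on the zvol itself, then for reference that would be something like
> this (dataset name is a placeholder; I haven't compared it against disabling
> write caching on the LU):
>
> # Force all writes to this zvol to be synchronous (placeholder dataset name).
> # Equivalent to running: zfs set sync=always tank/vm-100-disk-1
> import subprocess
> subprocess.check_call(["zfs", "set", "sync=always", "tank/vm-100-disk-1"])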
>
>
> > I have done no tests on Solaris - license costs are out of my league. I
> > regularly test FreeBSD, Linux and OmniOS. In production I only use
> > OmniOS (r151008, but I will migrate everything to r151014 when it is
> > released and then only use LTS in the future).
>
> I'm in the process of trying to run away from all things Oracle at my
> company.  We keep getting burned by them.  It's so freakin' expensive, and
> they hold you over a barrel with patches for both hardware and software.
>  We bought some very expensive hardware from them, and a management
> controller for a blade chassis had major bugs to the point it was
> practically unusable out of the box.  Oracle would not under any
> circumstance supply us with the new firmware unless we spent boatloads of
> cash for a maintenance contract.  We ended up doing this because we needed
> the controller to work as advertised.  This is what annoys me the most
> about them - you buy a product, it doesn't do what is written on the box,
> and then you have to pay tons extra for it to do what they said it would do
> when you bought it.  I miss Sun...
>
>
>
> On Fri, Mar 14, 2014 at 10:52 AM, Michael Rasmussen <mir at datanom.net> wrote:
>
>> On Fri, 14 Mar 2014 10:11:17 -0700
>> Chris Allen <ca.allen at gmail.com> wrote:
>>
>> > > It was also part of the latest 3.1. Double-click the mouse over your
>> > > storage specification in Datacenter->Storage and the panel pops up.
>> > > Patched panel attached.
>> >
>> I forgot to mention that at the moment the code for creating ZFS
>> storage is commented out
>> in /usr/share/pve-manager/ext4/pvemanagerlib.js, lines 20465-20473.
>>
>> >
>> > No, I haven't.  As far as I understand it, sparse should not affect
>> > performance whatsoever; it only changes whether or not a reservation is
>> > created on the ZVOL.  Turning off write caching on the LU should decrease
>> > performance, dramatically so, if you do not have a separate and very fast
>> > ZIL device (e.g. ZeusRAM).  Every block write to the ZVOL will be done
>> > synchronously when write caching is turned off.
>> >
>> I have already made some tests and have not been able to produce any
>> conclusive results showing that performance is hurt by using sparse. Is
>> sparse a way to provision more than 100% then?
>>
>> > I've done some testing with regard to block size, compression, and dedup.
>> > I wanted sparse support for myself and I figured while I was there I might
>> > as well add a flag for turning off write caching.  For people with the
>> > right (and expensive!) hardware the added safety of no write caching might
>> > be worth it.
>> >
>> I have done the same. For me an 8k block size for volumes seems to give
>> more write speed. Regarding write caching: why not simply use sync
>> directly on the volume?
>>
>> > Have you tested the ZFS storage plugin on Solaris 11.1?  I first tried
>> > using it with 11.1, but they changed how the LUN assignment for the views
>> > works.  In 11.0 and OmniOS the first available LUN will get used when a
>> > new view is created if no LUN is given.  But in 11.1 it gets populated
>> > with a string that says "AUTO".  This of course means PVE can't connect
>> > to the volume because it can't resolve the LUN.  Unfortunately I couldn't
>> > find anything in the 11.1 documentation that described how to get the
>> > LUN.  I'm assuming there's some kind of mechanism in 11.1 where you can
>> > get the number on the fly, as it must handle them dynamically now.  But
>> > after a lot of Googling and fiddling around I gave up and switched to
>> > OmniOS.  I don't have a support contract with Oracle so that was a no go.
>> > Anyway, just thought I'd mention that in case you knew about it.
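>> >
>> > (If it helps, the COMSTAR-side way to avoid the auto-assignment is to
>> > pass the LUN explicitly when the view is created, roughly as below; the
>> > GUID is a placeholder and I never verified whether this sidesteps the
>> > "AUTO" value on 11.1.)
>> >
>> > # Sketch: create the view with an explicit LUN instead of auto-assignment,
>> > # then read it back.  The logical unit GUID below is a placeholder.
>> > import subprocess
>> >
>> > lu_guid = "600144f000000000000000000000cafe"   # placeholder LU GUID
>> > subprocess.check_call(["stmfadm", "add-view", "-n", "0", lu_guid])  # LUN 0
>> > print(subprocess.check_output(["stmfadm", "list-view", "-l", lu_guid]).decode())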
>> >
>> > In addition to that problem, 11.1 also has a bug in its handling of the
>> > iSCSI Immediate Data feature.  It doesn't implement it properly according
>> > to the iSCSI RFC, so you need to turn off Immediate Data on the client in
>> > order to connect.  The patch is available to paying Oracle support
>> > customers only.
>> >
>> I have done no tests on Solaris - license costs are out of my league. I
>> regularly test FreeBSD, Linux and OmniOS. In production I only use
>> OmniOS (r151008, but I will migrate everything to r151014 when it is
>> released and then only use LTS in the future).
>>
>> --
>> Hilsen/Regards
>> Michael Rasmussen
>>
>> Get my public GnuPG keys:
>> michael <at> rasmussen <dot> cc
>> http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xD3C9A00E
>> mir <at> datanom <dot> net
>> http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xE501F51C
>> mir <at> miras <dot> org
>> http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xE3E80917
>> --------------------------------------------------------------
>> /usr/games/fortune -es says:
>> I never failed to convince an audience that the best thing they
>> could do was to go away.
>>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: zfs_alert.py
Type: text/x-python
Size: 2474 bytes
Desc: not available
URL: <http://lists.proxmox.com/pipermail/pve-devel/attachments/20140314/dd086688/attachment.py>

