[pve-devel] 4K drives

Alexandre DERUMIER aderumier at odiso.com
Thu Nov 8 22:04:14 CET 2012


Some more info:

http://lists.gnu.org/archive/html/qemu-devel/2011-12/msg01570.html

"Running with mismatched host and guest logical block sizes is going
to become more important as 4k-sector disks become more widespread.
This is because we need a 512 byte disk to boot from.

Mismatched block sizes have two problems:

1) with cache=none or with non-raw protocols, you just cannot do 512-byte
granularity output.  You need to do read-modify-write cycles like "hybrid"
512b-logical/4k-physical disks do.  (Note that actually only the iSCSI
protocol supports 4k logical blocks).
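(Illustration only, not code from this series: a minimal sketch in C of such a
read-modify-write cycle, assuming a 4k host block and a 512-byte write that
does not cross a block boundary.)

/* Service a 512-byte guest write on a 4k-block host device by reading,
 * patching and rewriting the containing host block. Sketch only: with
 * cache=none the buffer would also need O_DIRECT alignment, and
 * overlapping writers must be serialized (see problem 2 below). */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define HOST_BLOCK_SIZE 4096

static int rmw_write_sector(int fd, uint64_t offset, const void *buf512)
{
    uint8_t block[HOST_BLOCK_SIZE];
    uint64_t start = offset & ~(uint64_t)(HOST_BLOCK_SIZE - 1);

    if (pread(fd, block, HOST_BLOCK_SIZE, start) != HOST_BLOCK_SIZE)
        return -1;                                  /* read surrounding block */
    memcpy(block + (offset - start), buf512, 512);  /* patch the 512 bytes */
    if (pwrite(fd, block, HOST_BLOCK_SIZE, start) != HOST_BLOCK_SIZE)
        return -1;                                  /* write whole block back */
    return 0;
}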

2) when host block size < guest block size, guests issue 4k-aligned
I/O and expect it to be atomic.  This problem cannot really be solved
completely, because power or I/O failures could leave a partially-written
block ("torn page").  However, at least you can serialize reads against
overlapping writes, which guarantees correctness as long as shutdown is
clean and there are no I/O errors.
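(Illustration only, not the tracked-request code from the patches: the overlap
test that such serialization relies on.)

/* Two byte ranges overlap iff each one starts before the other ends.
 * A request whose range overlaps an in-flight write would be queued
 * until that write completes, so a reader never sees a block that is
 * half old data and half new. */
#include <stdbool.h>
#include <stdint.h>

static bool ranges_overlap(uint64_t off1, uint64_t len1,
                           uint64_t off2, uint64_t len2)
{
    return off1 < off2 + len2 && off2 < off1 + len1;
}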

Read-modify-write cycles are of course slower, and need to serialize
writes which makes the situation even worse.  However, the performance
impact of emulating 512-byte sectors is within noise when partitions are
aligned.  File system blocks are usually 4k or bigger, and OSes tend
to use 4k-aligned buffers.  So when partitions are aligned no misaligned
I/O is sent and no bounce buffer is necessary either.
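(Illustration only: "no misaligned I/O" here just means offset, length and
buffer address are all multiples of the host block size, in which case the
request can be passed straight through with no RMW and no bounce buffer.)

#include <stdbool.h>
#include <stdint.h>

static bool request_is_aligned(uint64_t offset, uint64_t len,
                               const void *buf, uint64_t host_block_size)
{
    return offset % host_block_size == 0 &&
           len % host_block_size == 0 &&
           (uintptr_t)buf % host_block_size == 0;
}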

The situation is much different if partitions are misaligned or if the
guest is using O_DIRECT with a 512-byte aligned buffer.  I benchmarked
only the former using iozone on a RHEL6 guest (2GB memory, 20GB ext4
partition with the whole 4k-sector disk assigned to the guest).  Graphs
aren't really pretty, but two points are more or less discernible (also
more or less obvious):

- writes incur a larger overhead than reads by 5-10%;

- for larger file sizes the penalty is smaller, probably because
the I/O scheduler can work better (with almost no penalty for reads);
for smaller file sizes, up to 1M or even more for some scenarios,
misalignment worsened performance by 10-25%.

The series is structured as follows.

Patches 1 to 6 clean up the handling of flag bits, so that non-raw
protocols can always request read-modify-write operation (even when
cache != none).

Patches 7 to 11 distinguish host and guest block sizes in the
BlockDriverState.

Patches 12 to 15 reuse the request tracking mechanism to implement
RMW and to avoid torn pages.

Patch 16 passes down the host block size as physical block size so
that hopefully guest OSes try to align partitions.

Patch 17 adds an option to qemu-io that lets you test these scenarios
even without a 4k-sector disk.

Paolo Bonzini (17):
  block: do not rely on open_flags for bdrv_is_snapshot
  block: store actual flags in bs->open_flags
  block: pass protocol flags up to the format
  block: non-raw protocols never cache
  block: remove enable_write_cache
  block: move flag bits together
  raw: remove the aligned_buf
  block: rename buffer_alignment to guest_block_size
  block: add host_block_size
  raw: probe host_block_size
  iscsi: save host block size
  block: allow waiting only for overlapping writes
  block: allow waiting at arbitrary granularity
  block: protect against "torn reads" for guest_block_size > host_block_size
  block: align and serialize I/O when guest_block_size < host_block_size
  block: default physical block size to host block size
  qemu-io: add blocksize argument to open

 Makefile.objs     |    4 +-
 block.c           |  313 ++++++++++++++++++++++++++++++++++++++++++++++-------
 block.h           |   17 +---
 block/curl.c      |    1 +
 block/iscsi.c     |    2 +
 block/nbd.c       |    1 +
 block/raw-posix.c |   97 ++++++++++-------
 block/raw-win32.c |   42 +++++++
 block/rbd.c       |    1 +
 block/sheepdog.c  |    1 +
 block/vdi.c       |    1 +
 block_int.h       |   25 ++---
 hw/ide/core.c     |    2 +-
 hw/scsi-disk.c    |    2 +-
 hw/scsi-generic.c |    2 +-
 hw/virtio-blk.c   |    2 +-
 qemu-io.c         |   33 +++++-
 trace-events      |    1 +
 18 files changed, 429 insertions(+), 118 deletions(-)

-- 
1.7.7.1

"
----- Original Message ----- 

From: "Alexandre DERUMIER" <aderumier at odiso.com> 
To: "Dietmar Maurer" <dietmar at proxmox.com> 
Cc: pve-devel at pve.proxmox.com 
Sent: Thursday 8 November 2012 21:55:54 
Subject: [pve-devel] 4K drives 

For 4K drives, I don't know if this is related: 

http://www.linuxtopia.org/online_books/rhel6/rhel_6_technical_notes/rhel_6_technotes_virt.html 

3.1. Known Issues 
" 
Direct Asynchronous IO (AIO) that is not issued on filesystem block boundaries, and falls into a hole in a sparse file on ext4 or xfs filesystems, may corrupt file data if multiple I/O operations modify the same filesystem block. Specifically, if qemu-kvm is used with the aio=native IO mode over a sparse device image hosted on the ext4 or xfs filesystem, guest filesystem corruption will occur if partitions are not aligned with the host filesystem block size. Generally, do not use aio=native option along with cache=none for QEMU. This issue can be avoided by using one of the following techniques: 

Align AIOs on filesystem block boundaries, or do not write to sparse files using AIO on xfs or ext4 filesystems. 
KVM: Use a non-sparse system image file or allocate the space by zeroing out the entire file. 
KVM: Create the image using an ext3 host filesystem instead of ext4. 
KVM: Invoke qemu-kvm with aio=threads (this is the default). 
KVM: Align all partitions within the guest image to the host's filesystem block boundary (default 4k). 
" 


----- Original Message ----- 

From: "Alexandre DERUMIER" <aderumier at odiso.com> 
To: "Dietmar Maurer" <dietmar at proxmox.com> 
Cc: pve-devel at pve.proxmox.com 
Sent: Thursday 8 November 2012 18:40:21 
Subject: Re: [pve-devel] new cache benchmark results 

It seems to occur only with O_DIRECT (so cache=none and cache=directsync). 

This bugzilla was about CD-ROMs (with large sectors), but I don't know if it applies to 4K HDDs: 

https://bugzilla.redhat.com/show_bug.cgi?id=608548 
" Technical note added. If any revisions are required, please edit the "Technical Notes" field 
accordingly. All revisions will be proofread by the Engineering Content Services team. 

New Contents: 
Cause: qemu did not align memory properly for O_DIRECT support 

Fix: qemu was changed to use properly aligned memory for I/O requests 

Consequence: I/O to devices with large sector sizes like CDROMs did not work in cache=none mode 
Result: I/O to devices with large sector sizes like CDROMs work in cache=none mode" 
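(Illustration of the kind of fix described: O_DIRECT needs I/O buffers aligned
to the device's logical block size, which posix_memalign() provides. The helper
name and the 2048-byte CD-ROM sector size are just examples, not the actual
qemu change.)

#include <stdlib.h>

/* Allocate a buffer suitable for O_DIRECT I/O on a device with the given
 * sector size. posix_memalign() returns 0 on success; the caller frees
 * the buffer with free(). */
static void *alloc_dio_buffer(size_t sector_size, size_t len)
{
    void *buf = NULL;

    if (posix_memalign(&buf, sector_size, len) != 0)
        return NULL;
    return buf;
}

Used e.g. as alloc_dio_buffer(2048, 2048) for a single CD-ROM sector read in
cache=none mode.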

----- Original Message ----- 

From: "Alexandre DERUMIER" <aderumier at odiso.com> 
To: "Dietmar Maurer" <dietmar at proxmox.com> 
Cc: pve-devel at pve.proxmox.com 
Sent: Thursday 8 November 2012 18:06:47 
Subject: Re: [pve-devel] new cache benchmark results 

>>Another problem with cache=none is that it does not work with 4K sector drives. 
>> 
>>User reported problems with 4K iscsi and 4K local disks. 
>> 
Yes, I remember the post in the forum. It doesn't work with cache=none or cache=writeback. 

>>Any idea how to solve that? 
This is strange, because it seems to have been fixed a long time ago (I have seen some Red Hat bugzilla entries from 2011 about it), 
but I don't have the hardware to test it :( 


I'll look in my archives tomorrow ;) 


----- Original Message ----- 

From: "Dietmar Maurer" <dietmar at proxmox.com> 
To: "Alexandre DERUMIER" <aderumier at odiso.com>, pve-devel at pve.proxmox.com 
Sent: Thursday 8 November 2012 17:50:58 
Subject: RE: [pve-devel] new cache benchmark results 

> Note: with shared storage, with writeback, I generally see a big spike (faster 
> than cache=none), but afterwards a big slowdown (near zero) for 5-10s. 
> So I think this is when the host needs to flush the data, which adds more 
> overhead. (maybe network latency has an impact...) 

Another problem with cache=none is that it does not work with 4K sector drives. 

Users reported problems with 4K iSCSI and 4K local disks. 

Any idea how to solve that? 

_______________________________________________ 
pve-devel mailing list 
pve-devel at pve.proxmox.com 
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


