[pve-devel] Default cache mode for VM hard drives

Stanislav German-Evtushenko ginermail at gmail.com
Thu May 28 13:50:19 CEST 2015


Alexandre,

> QEMU uses librbd to access Ceph directly, so the host doesn't have any
> /dev/rbd device or filesystem mount.
Ah, I understand: this is not a normal block device but a userspace library.

> Ceph uses O_DIRECT+O_DSYNC to write to the journal of the OSDs.
Is this done inside the KVM process? If so, then KVM keeps a buffer for this
O_DIRECT write, and if multiple threads can access (and change) that buffer
at the same time, then a similar issue can happen here in theory.
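
To illustrate what I mean at the API level (an untested sketch, not how QEMU
actually drives librbd; the pool and image names are placeholders): rbd_write()
takes a plain pointer into the caller's memory, and for QEMU that memory is
guest RAM, so nothing at the API level stops another thread from changing it
while the write is in flight.

/* racy_rbd_write.c - untested sketch, not QEMU code: issue rbd_write()
 * from a buffer that another thread keeps changing.  Pool and image
 * names are placeholders.
 * Build: gcc racy_rbd_write.c -lrados -lrbd -lpthread */
#include <pthread.h>
#include <string.h>
#include <rados/librados.h>
#include <rbd/librbd.h>

#define LEN 4096
static char buf[LEN];
static volatile int stop;

static void *mutator(void *arg)
{
    while (!stop) {              /* stands in for the guest reusing a page */
        memset(buf, 'A', LEN);
        memset(buf, 'B', LEN);
    }
    return NULL;
}

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    rbd_image_t image;
    pthread_t tid;
    int i;

    rados_create(&cluster, NULL);           /* default admin user */
    rados_conf_read_file(cluster, NULL);    /* /etc/ceph/ceph.conf */
    rados_connect(cluster);
    rados_ioctx_create(cluster, "rbd", &io);
    rbd_open(io, "testimage", &image, NULL);

    pthread_create(&tid, NULL, mutator, NULL);
    for (i = 0; i < 1000; i++)
        rbd_write(image, 0, LEN, buf);      /* buffer stays under our control */
    stop = 1;
    pthread_join(tid, NULL);

    rbd_close(image);
    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}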

Stanislav

On Thu, May 28, 2015 at 2:44 PM, Alexandre DERUMIER <aderumier at odiso.com>
wrote:

> >> qemu rbd access is userland only, so the host doesn't have any cache or
> >> buffer.
> >> If the RBD device does not use the host cache then it is very likely that
> >> RBD utilizes O_DIRECT. I am not sure if there are other ways to avoid the
> >> host cache.
>
> QEMU uses librbd to access Ceph directly, so the host doesn't have any
> /dev/rbd device or filesystem mount.
>
> >> When data is written to Ceph, it is written to the journal of each OSD
> >> and its replicas before the ack to the client.
> >> It can't be written to all destinations at exactly the same time. If the
> >> buffer changes in the meantime, then the data that reaches different
> >> nodes can differ.
>
> Ceph uses O_DIRECT+O_DSYNC to write to the journal of the OSDs.
> Reads are always done from the primary OSD.
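>
> Simplified, the journal append looks something like this (a sketch only, not
> the actual ceph-osd code; the journal path is a placeholder):
>
> /* journal_write.c - simplified sketch of an O_DIRECT + O_DSYNC journal
>  * append; not the actual ceph-osd code, the journal path is a placeholder. */
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
>
> int main(void)
> {
>     /* O_DIRECT bypasses the host page cache, O_DSYNC makes every write
>      * wait until the device reports the data as stable. */
>     int fd = open("/var/lib/ceph/osd/journal", O_WRONLY | O_DIRECT | O_DSYNC);
>     if (fd < 0)
>         return 1;
>
>     /* O_DIRECT requires buffer, offset and length to be aligned,
>      * typically to the logical block size (512 or 4096 bytes). */
>     void *buf;
>     if (posix_memalign(&buf, 4096, 4096))
>         return 1;
>     memset(buf, 0, 4096);
>
>     /* Returns only once the journal entry is on disk, so the OSD can
>      * ack the client after it and its replicas have done this. */
>     ssize_t n = pwrite(fd, buf, 4096, 0);
>
>     free(buf);
>     close(fd);
>     return n == 4096 ? 0 : 1;
> }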
>
>
>
> ----- Original Message -----
> From: "Stanislav German-Evtushenko" <ginermail at gmail.com>
> To: "aderumier" <aderumier at odiso.com>
> Cc: "dietmar" <dietmar at proxmox.com>, "pve-devel" <pve-devel at pve.proxmox.com>
> Sent: Thursday, May 28, 2015 13:10:52
> Subject: Re: [pve-devel] Default cache mode for VM hard drives
>
> Alexandre,
>
> The important point is whether O_DIRECT is used with Ceph or not. Do you
> happen to know?
>
> > qemu rbd access is userland only, so the host doesn't have any cache or buffer.
> If the RBD device does not use the host cache then it is very likely that RBD
> utilizes O_DIRECT. I am not sure if there are other ways to avoid the host
> cache.
>
> > When data is written to Ceph, it is written to the journal of each OSD
> > and its replicas before the ack to the client.
> It can't be written to all destinations at exactly the same time. If the
> buffer changes in the meantime, then the data that reaches different nodes
> can differ.
>
> Stanislav
>
> On Thu, May 28, 2015 at 1:58 PM, Alexandre DERUMIER <aderumier at odiso.com> wrote:
>
>
> >>BTW: can anybody test drbd_oos_test.c against Ceph? I guess we will have
> the same result.
>
> I think there is no problem with Ceph; the qemu cache option only
> enables|disables rbd_cache.
> qemu rbd access is userland only, so the host doesn't have any cache or buffer.
> When data is written to Ceph, it is written to the journal of each OSD and
> its replicas before the ack to the client.
>
>
>
>
> ----- Original Message -----
> From: "Stanislav German-Evtushenko" <ginermail at gmail.com>
> To: "aderumier" <aderumier at odiso.com>
> Cc: "dietmar" <dietmar at proxmox.com>, "pve-devel" <pve-devel at pve.proxmox.com>
> Sent: Thursday, May 28, 2015 10:27:34
> Subject: Re: [pve-devel] Default cache mode for VM hard drives
>
> Alexandre,
>
> > That's why we need to use barriers or FUA in recent guest kernels, when
> using O_DIRECT, to be sure that the guest filesystem is consistent and data
> is flushed at regular intervals.
>
> The problems are:
> - Linux swap: no barriers or anything similar
> - Windows: I have no idea what Windows does to ensure consistency, but the
> issue is reproducible on Windows 7.
>
> BTW: can anybody test drbd_oos_test.c against Ceph? I guess we will have
> the same result.
>
> Stanislav
>
> On Thu, May 28, 2015 at 11:22 AM, Stanislav German-Evtushenko <ginermail at gmail.com> wrote:
>
>
>
> Alexandre,
>
> > Do you see the problem with qemu cache=directsync? (O_DIRECT + O_DSYNC)
> Yes, it happens in fewer cases (maybe 10 times fewer) but it still happens.
> I have a reproducible case with Windows 7 and directsync.
>
> Stanislav
>
> On Thu, May 28, 2015 at 11:18 AM, Alexandre DERUMIER <aderumier at odiso.com> wrote:
>
> >> To sum up: when working in O_DIRECT mode, QEMU has to wait until the
> "write" system call is finished before changing this buffer, OR QEMU has to
> create a new buffer every time, OR ... other ideas?
>
> AFAIK, only O_DSYNC can guarantee that data is really written to the last
> layer (the disk platters).
>
> That's why we need to use barriers or FUA in recent guest kernels, when
> using O_DIRECT, to be sure that the guest filesystem is consistent and data
> is flushed at regular intervals.
> (To avoid a filesystem that is incoherent with its data.)
>
>
> Do you see the problem with qemu cache=directsync? (O_DIRECT + O_DSYNC)
>
>
>
>
>
> ----- Original Message -----
> From: "Stanislav German-Evtushenko" <ginermail at gmail.com>
> To: "dietmar" <dietmar at proxmox.com>
> Cc: "aderumier" <aderumier at odiso.com>, "pve-devel" <pve-devel at pve.proxmox.com>
> Sent: Thursday, May 28, 2015 09:54:32
> Subject: Re: [pve-devel] Default cache mode for VM hard drives
>
> Dietmar,
>
> fsync ensures that data reaches the underlying hardware, but it does not
> help to ensure that the buffer is not changed before it is fully written.
>
> Let me describe my understanding of why we get this problem with O_DIRECT
> and don't have it without.
>
> ** Without O_DIRECT **
> 1. The application writes data from its buffer
> 2. The data from the buffer goes to the host cache
> 3. The RAID writers take the data from the host cache and put it on
> /dev/loop1 and /dev/loop2
> Even if the buffer changes, the data in the host cache does not change, so
> the RAID stays consistent.
>
> ** With O_DIRECT **
> 1. The application writes data from its buffer
> 2. The RAID writers take the data directly from the application's (!!!)
> buffer and put it on /dev/loop1 and /dev/loop2
> If the data in the buffer is changed in the meantime (the change can be made
> from a different POSIX thread), then different data reaches /dev/loop1 and
> /dev/loop2.
>
> To sum up: when working in O_DIRECT mode, QEMU has to wait until the "write"
> system call is finished before changing this buffer, OR QEMU has to create a
> new buffer every time, OR ... other ideas?
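>
> A minimal standalone sketch of the race (untested, and not the original
> drbd_oos_test.c, just the same idea with two plain files standing in for
> the two legs of the mirror):
>
> /* odirect_race.c - sketch only: two O_DIRECT writes from the same buffer
>  * while another thread keeps changing it; the two destinations stand in
>  * for the two legs of a mirror (DRBD peers, md raid1 members, ...). */
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <pthread.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
>
> #define LEN 4096
> static char *buf;
> static volatile int stop;
>
> static void *mutator(void *arg)
> {
>     while (!stop) {              /* a guest reusing the page in flight */
>         memset(buf, 'A', LEN);
>         memset(buf, 'B', LEN);
>     }
>     return NULL;
> }
>
> int main(void)
> {
>     int fd1 = open("leg1.img", O_WRONLY | O_CREAT | O_DIRECT, 0644);
>     int fd2 = open("leg2.img", O_WRONLY | O_CREAT | O_DIRECT, 0644);
>     char r1[LEN], r2[LEN];
>     pthread_t tid;
>     int i;
>
>     posix_memalign((void **)&buf, 4096, LEN);   /* O_DIRECT needs alignment */
>     pthread_create(&tid, NULL, mutator, NULL);
>
>     for (i = 0; i < 1000; i++) {
>         /* Same source buffer, two destinations: with O_DIRECT the kernel
>          * reads the user buffer separately for each write, so the copies
>          * can differ if it changes between (or during) the two calls. */
>         pwrite(fd1, buf, LEN, 0);
>         pwrite(fd2, buf, LEN, 0);
>     }
>     stop = 1;
>     pthread_join(tid, NULL);
>
>     /* Compare the two "legs" after the fact. */
>     int rfd1 = open("leg1.img", O_RDONLY);
>     int rfd2 = open("leg2.img", O_RDONLY);
>     pread(rfd1, r1, LEN, 0);
>     pread(rfd2, r2, LEN, 0);
>     printf("legs %s\n", memcmp(r1, r2, LEN) ? "DIFFER" : "match");
>     return 0;
> }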
>
> Stanislav
>
> On Thu, May 28, 2015 at 10:31 AM, Dietmar Maurer <dietmar at proxmox.com> wrote:
>
>
> > I have just done the same test with mdadm instead of DRBD, and I found
> > that this problem was reproducible on the software RAID too, just as Lars
> > Ellenberg claimed. It means that the problem is not only related to DRBD
> > but to O_DIRECT mode in general, when we don't use the host cache and a
> > block device reads data directly from userspace.
>
> We simply think this behavior is correct. If you want to be sure the data
> is on disk, you have to call fsync.
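>
> In code, the contract being described is roughly this (minimal sketch):
> fsync() guarantees that whatever the kernel accepted is on stable storage;
> it says nothing about the caller's buffer staying untouched while write()
> (or an O_DIRECT transfer from it) is still in progress.
>
> #include <unistd.h>
>
> /* sketch: durable write as seen from userspace */
> int write_durable(int fd, const void *buf, size_t len)
> {
>     /* the caller must not modify buf until write() has returned */
>     if (write(fd, buf, len) != (ssize_t)len)
>         return -1;
>     /* flush kernel buffers and device caches to stable storage */
>     return fsync(fd);
> }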

