[pve-devel] Bug 1458 - PVE 5 live migration downtime degraded to several seconds (compared to PVE 4)

Thu Jul 27 14:45:43 CEST 2017

the following issue was reported on the forum[1] and as bug #1458[2],
moving this here for further discussion of potential fixes.

when live-migrating over a unix socket, PVE 5 takes up to a few seconds
between completing the RAM transfer (at which point the source VM is
paused) and resuming the target VM. in PVE 4, the same migration has a
downtime of almost 0.

AFAICT, the reason for this is a bug fix in PVE 5's qemu-server which
was required to support storage live migration in Qemu 2.9.

originally in PVE 4, the target VM in a live migration was started in
incoming migration mode and NOT continued on startup (whereas VMs rolled
back to a RAM snapshot were started in the same mode, but immediately
continued).

in June 2016[3], migration over ssh-forwarded unix sockets was
implemented. the check for skipping the continue command on startup of
the target VM was overlooked, so now VMs migrated over unix sockets were
started in incoming migration mode, but continued on startup. this does
not change the behaviour on startup, as a VM in incoming migration mode
is not actually running until a migration has happened. this does mean
that the downtime is vastly reduced for such migrations, as Qemu will
continue the target VM automatically as soon as the migration job is
completed.
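
to make the resulting start-up behaviour concrete, here is a minimal
Python sketch of which incoming-mode starts got an immediate cont after
the June 2016 change. this is NOT the actual qemu-server code (which is
Perl); the function name, parameters and string values are made up for
illustration only.

```python
def continued_on_startup(purpose, transport=None):
    """Model of the behaviour described above (state after June 2016).

    purpose:   "rollback" (RAM snapshot rollback) or "migration"
    transport: "tcp" or "unix", only relevant for migrations
    All names and values here are hypothetical, not qemu-server's.
    Returns True if the VM is sent 'cont' right after being started
    in incoming migration mode.
    """
    if purpose == "rollback":
        # rollbacks to a RAM snapshot are continued immediately
        return True
    if purpose == "migration":
        if transport not in ("tcp", "unix"):
            raise ValueError(f"unknown transport: {transport!r}")
        # per the description above, the check that skips 'cont' was
        # not extended to unix sockets, so only tcp targets stay paused
        return transport == "unix"
    raise ValueError(f"unknown purpose: {purpose!r}")
```

a target started with continued_on_startup("migration", "unix") is the
case where Qemu resumes the VM by itself once the migration job
completes.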

the only things that happen after this automatic resume are
- finishing the tunnel
- moving the conf file logically between nodes
- resuming on the target side (which is a no-op in this case)

so the risk for inconsistencies seems pretty small.

later on, we introduced live-storage migration. in those cases, we now
have the following scenario:
- start storage migration jobs
- start RAM migration
- wait for RAM to be completed
- finish tunnel
- finish block jobs
- update conf file
- move the conf file logically between the nodes
- resume on target node

so depending on whether the migration goes over tcp (OK) or over unix
(not so much) we have very different behaviour and risk for
inconsistencies.
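
the difference can be sketched by listing which of the steps above run
while the target VM is already executing. the following Python snippet
is only a model under the assumption that "automatic resume" means the
target starts running right after the "wait for RAM to be completed"
step; the function name is hypothetical and the step strings are just
the list items from above.

```python
STORAGE_MIGRATION_STEPS = [
    "start storage migration jobs",
    "start RAM migration",
    "wait for RAM to be completed",
    "finish tunnel",
    "finish block jobs",
    "update conf file",
    "move the conf file logically between the nodes",
    "resume on target node",
]


def steps_with_target_running(steps, auto_resume):
    """Return the steps executed while the target VM is already running,
    up to (but excluding) the explicit resume step.

    auto_resume=True  models the unix socket case (target was 'cont'-ed
                      on startup, Qemu resumes it when RAM is done).
    auto_resume=False models manual resume (target stays paused until
                      the last step).
    """
    ram_done = steps.index("wait for RAM to be completed")
    manual_resume = steps.index("resume on target node")
    running_since = ram_done if auto_resume else manual_resume
    return steps[running_since + 1:manual_resume]
```

with auto_resume=True this returns the four steps from "finish tunnel"
to moving the conf file, including "finish block jobs", which is where
the target VM and the storage migration overlap. with auto_resume=False
the list is empty.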

with the introduction of PVE 5, this different behaviour was fixed /
made consistent, by adopting the "manual resume" stance. this was needed
because Qemu 2.9 does not allow the storage migration over NBD and the
target VM itself to have write access to the same disks at the same
time. this fix was not backported to PVE 4, which means that storage
live-migration is potentially buggy there, but live-migration over unix
sockets is faster.

I wonder whether going the "immediately cont" route for live migrations
without local storage can cause any issues besides the obvious "moving
the conf file failed and VM is now active on the wrong node" one? if
not, I propose doing just that. otherwise, we could think about lowering
the polling interval when waiting for RAM migration to complete (in
phase2) - that should shave off a bit of the downtime as well.
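
as a rough illustration of the polling idea: with manual resume, the
target can only be resumed once the next status poll notices that RAM
migration has completed, so the poll interval bounds the extra downtime
added by polling. the Python sketch below assumes polls happen at fixed
multiples of the interval since migration start and that the resume is
issued immediately on the first poll after completion. the interval
values are made-up examples, not the ones used in qemu-server.

```python
def extra_downtime_ms(ram_done_ms, poll_interval_ms):
    """Time (in ms) between RAM migration completing and the next poll.

    ram_done_ms:      when RAM migration completed, in ms since start
    poll_interval_ms: time between two status polls, in ms
    Both are integers; names and the fixed-grid polling model are
    assumptions made for this sketch.
    """
    if poll_interval_ms <= 0:
        raise ValueError("poll interval must be positive")
    # distance from ram_done_ms up to the next multiple of the interval
    return (-ram_done_ms) % poll_interval_ms
```

for example, extra_downtime_ms(2300, 1000) is 700, while
extra_downtime_ms(2300, 100) is 0 and extra_downtime_ms(2350, 100) is
50. the result is always smaller than the interval, so a shorter
interval directly lowers the worst case, at the cost of more frequent
status queries.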

in any case, I think we need to backport the fix that enforces manual
resume for live migrations with local storage to PVE 4.

1: https://forum.proxmox.com/threads/pve-5-live-migration-downtime-degradation-2-4-sec.35890
2: https://bugzilla.proxmox.com/show_bug.cgi?id=1458
3: 1c9d54bfd05e0d017a6e2ac5524d75466b1a4455