[PVE-User] HA migration behaviour vs. failures

Tue Jul 22 18:30:05 CEST 2014

Greetings,

I've been "playing" with the latest version of Proxmox (3-node cluster + GlusterFS) for a couple of months.
My goal is to replace 3 RedHat 5 KVM servers (no HA) hosting ~100 VMs on NAS storage.

But I have some annoying issues with live migrations.
Sometimes a migration works, but sometimes, for no apparent reason, it doesn't.
When it fails (migration aborted), I try again and then it works! :(

Jul 11 14:48:49 starting ssh migration tunnel
Jul 11 14:48:50 starting online/live migration on localhost:60000
Jul 11 14:48:50 migrate_set_speed: 8589934592
Jul 11 14:48:50 migrate_set_downtime: 0.1
Jul 11 14:48:52 ERROR: online migrate failure - aborting
Jul 11 14:48:52 aborting phase 2 - cleanup resources
Jul 11 14:48:52 migrate_cancel
Jul 11 14:58:52 ERROR: migration finished with problems (duration 00:10:05)
TASK ERROR: migration problems

I tried to:
- disable SPICE.
- set the CPU type to 'default' (kvm64) instead of 'host'.
- use shared 'Directory' storage (FUSE mount) instead of 'GlusterFS'.
But no luck, still random failures.
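
Since a second attempt usually succeeds, the manual retry can be scripted. This is only a minimal sketch: the helper name retry_command, the retry count and the delay are hypothetical, and it assumes that "qm migrate <vmid> <target> -online" exits non-zero when the migration task fails. It is meant for VMs that are not (yet) HA-managed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run a command up to $max_tries times, sleeping
# $delay seconds between attempts. Returns the number of the attempt
# that succeeded, or 0 if every attempt failed.
sub retry_command {
    my ($max_tries, $delay, @cmd) = @_;
    for my $try (1 .. $max_tries) {
        return $try if system(@cmd) == 0;
        warn "attempt $try/$max_tries failed: @cmd\n";
        sleep $delay if $try < $max_tries;
    }
    return 0;
}

# Only runs when called as: script.pl <vmid> <target-node>
# Assumption: qm returns a non-zero exit status on "TASK ERROR".
if (@ARGV == 2) {
    my ($vmid, $target) = @ARGV;
    my $ok = retry_command(3, 30, 'qm', 'migrate', $vmid, $target, '-online');
    exit($ok ? 0 : 1);
}
```

This only hides the symptom; it does not explain why the first attempt aborts.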

My problem is what happens once the VMs are added to the HA cluster, because Proxmox seems to stop the service when a live migration fails.
I can't see why anyone would want to stop an HA VM because a live migration failed while the VM is still running.

I have another cluster here (2-node RedHat 6 KVM cluster, VMs with HA), and when an HA migration fails there, the VMs stay started on the original node.
So I thought it should be possible to achieve the same behaviour with Proxmox?

Having the VMs stopped in an HA cluster is a no-go, so I ended up making some nasty changes in the code.
So far it seems to do what I need, but I'm still interested in a better solution.

+++ /usr/share/cluster/pvevm    2014-07-22 15:22:29.703424516 +0200
@@ -28,6 +28,7 @@
  use constant OCF_NOT_RUNNING => 7;
  use constant OCF_RUNNING_MASTER => 8;
  use constant OCF_FAILED_MASTER => 9;
+use constant OCF_ERR_MIGRATE => 150;

  $ENV{'PATH'} = '/sbin:/bin:/usr/sbin:/usr/bin';

@@ -358,6 +359,9 @@

      upid_wait($upid);

+    check_running($status);
+    exit(OCF_ERR_MIGRATE) if $status->{running};
+
      # something went wrong if old config file is still there
      exit((-f $oldconfig) ? OCF_ERR_GENERIC : OCF_SUCCESS);

+++ /usr/share/perl5/PVE/API2/Qemu.pm   2014-07-22 15:51:31.909558803 +0200
@@ -1634,7 +1634,7 @@

         my $storecfg = PVE::Storage::config();

-       if (&$vm_is_ha_managed($vmid) && $rpcenv->{type} ne 'ha') {
+       if (&$vm_is_ha_managed($vmid) && $rpcenv->{type} ne 'ha' && !defined($migratedfrom)) {

             my $hacmd = sub {
                 my $upid = shift;
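
The intent of the pvevm change above can be sketched in isolation: once the migration task has finished, report a dedicated migration error if the VM is still running on the source node, and only otherwise fall back to the existing check on the old config file. The helper name migrate_exit_code and its two boolean arguments are hypothetical and only illustrate the decision. The values 0 and 1 are the usual OCF codes for success and generic error; 150 is the value used in the patch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use constant OCF_SUCCESS     => 0;
use constant OCF_ERR_GENERIC => 1;
use constant OCF_ERR_MIGRATE => 150;   # value taken from the patch

# Hypothetical helper: pick the agent's exit status after the migration
# task has ended.
#   $still_running     - true if the VM is still running on the source node
#   $old_config_exists - true if the old config file is still present
sub migrate_exit_code {
    my ($still_running, $old_config_exists) = @_;

    # Migration failed but the VM survived: report a migration error
    # instead of a generic failure.
    return OCF_ERR_MIGRATE if $still_running;

    # Same check as the unpatched script: a leftover config file means
    # something went wrong.
    return $old_config_exists ? OCF_ERR_GENERIC : OCF_SUCCESS;
}

# Print the result for each combination of the two inputs.
for my $case ([1, 1], [1, 0], [0, 1], [0, 0]) {
    my ($running, $oldconf) = @$case;
    printf "running=%d oldconfig=%d -> exit %d\n",
        $running, $oldconf, migrate_exit_code($running, $oldconf);
}
```

How the cluster manager reacts to exit status 150 is not shown here and depends on the HA stack in use.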


Regards,

-- 
Alexandre DHAUSSY.