[pve-devel] [PATCH 1/2] migrate: use ssh over socat provided UNIX socks as tunnel

Thomas Lamprecht t.lamprecht at proxmox.com
Tue May 31 11:38:43 CEST 2016



On 05/31/2016 11:34 AM, Thomas Lamprecht wrote:
> We cannot guarantee when the SSH forward tunnel really becomes
> ready. The check with the mtunnel API call did not help here, as it
> only verified that the SSH connection itself worked - which tells us
> nothing, since otherwise SSH would not have connected at all.
>
> The forward tunnel is a separate channel in the SSH connection,
> independent of the SSH `qm mtunnel` channel, so the fact that the
> latter works does not guarantee that our migration tunnel is up and
> ready.
>
> When the node(s) were under load, or when we did parallel
> migrations (migrateall), the migrate command was often started
> before the tunnel was open and ready to receive data. This led to
> an immediate abort of the migration and is the main reason why
> parallel migrations often leave two thirds or more of the VMs on
> the source node.
> The issue was tracked down to SSH: debugging the QEMU process and
> enabling debug logging showed that the tunnel often became ready
> too late, or not at all.
>
> Fixing the forward tunnel is quirky and not straightforward. The
> only possibility SSH offers is to use -N (no command), -f (go to
> background) and -o "ExitOnForwardFailure=yes"; SSH then waits in the
> foreground until all forwardings are ready and only then backgrounds
> itself. This is not a good fit for our use case and our code base.
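> For illustration only (this is not part of the patch; the forward
> spec and target host are placeholders), that variant would roughly
> look like:
>
>   ssh -o ExitOnForwardFailure=yes -N -f -L <lport>:localhost:<rport> root@<target>
>
> i.e. the forward is usable once ssh returns, but it leaves us with a
> backgrounded ssh process to manage and still relies on allocating a
> free TCP port.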
> Waiting for the local port to become open and ready (through
> /proc/net/tcp[6]) as a proof of concept is not enough either: even
> if the port is in the listening state and should theoretically
> accept connections, this still failed often because the tunnel was
> not yet fully ready.
>
> Further, another problem would remain if we tried to patch the SSH
> forward method we currently use - a problem which the approach of
> this patch solves for free - namely that the method to get an
> available port (next_migration_port) has a serious race condition
> which can lead to the same port being used by multiple parallel
> migrations (I observed this in my many tests; it is seldom, but if
> it happens it is really bad).
>
> So let's now use socat to manage a connection over SSH. The
> endpoints are UNIX sockets bound to the VMID - thus no port, so no
> race, and also no limit on available ports (we reserved only 50 for
> migration).
>
> The endpoints are created at /run/qemu-server/VMID.migrate and, as
> KVM/QEMU can use a UNIX socket just as well as TCP, we do not have
> to change much in the interaction with QEMU.
> QEMU is started with the migrate_incoming URI pointing at the local
> destination endpoint and creates the socket file; we then create a
> listening socket on the source side and connect over SSH to the
> destination.
> Now the migration can be started by issuing the migrate qmp command
> with an updated URI.
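> Roughly, assuming a hypothetical VMID 100 and a plain
> 'ssh root@<target>' as the remote command, the source side then runs
> something like
>
>   socat EXEC:'ssh root@<target> "socat - UNIX:/run/qemu-server/100.migrate"' UNIX-LISTEN:/run/qemu-server/100.migrate
>
> and the migration itself is started with the qmp command
>
>   migrate uri => "unix:/run/qemu-server/100.migrate"
>
> (see fork_tunnel and the vm_mon_cmd_nocheck call in the diff below).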
>
> Another small issue was that we used open2 to create the tunnel in
> a child process but never collected that child when we closed the
> tunnel.
>
> Signed-off-by: Thomas Lamprecht <t.lamprecht at proxmox.com>
>
> Cc: Wolfgang Bumiller <w.bumiller at proxmox.com>
> ---
>  PVE/QemuMigrate.pm | 184 ++++++++++++++++++++++-------------------------------
>  PVE/QemuServer.pm  |  22 ++++---
>  2 files changed, 90 insertions(+), 116 deletions(-)
>
> diff --git a/PVE/QemuMigrate.pm b/PVE/QemuMigrate.pm
> index a288627..865d06a 100644
> --- a/PVE/QemuMigrate.pm
> +++ b/PVE/QemuMigrate.pm
> @@ -4,7 +4,7 @@ use strict;
>  use warnings;
>  use PVE::AbstractMigrate;
>  use IO::File;
> -use IPC::Open2;
> +use POSIX qw(:sys_wait_h);
>  use PVE::INotify;
>  use PVE::Tools;
>  use PVE::Cluster;
> @@ -15,108 +15,62 @@ use PVE::RPCEnvironment;
>  
>  use base qw(PVE::AbstractMigrate);
>  
> -sub fork_command_pipe {
> -    my ($self, $cmd) = @_;
> -
> -    my $reader = IO::File->new();
> -    my $writer = IO::File->new();
> -
> -    my $orig_pid = $$;
> -
> -    my $cpid;
> -
> -    eval { $cpid = open2($reader, $writer, @$cmd); };
> -
> -    my $err = $@;
> -
> -    # catch exec errors
> -    if ($orig_pid != $$) {
> -	$self->log('err', "can't fork command pipe\n");
> -	POSIX::_exit(1);
> -	kill('KILL', $$);
> +sub fork_tunnel {
> +    my ($self, $raddr) = @_;
> +
> +    my $ssh_cmd = PVE::Tools::cmd2string([@{$self->{rem_ssh}}]);
> +    my $cmd = ['socat', 'EXEC:'. $ssh_cmd .' "socat - UNIX:' . $raddr .'"', 'UNIX-LISTEN:' . $raddr];
> +
> +    my $tunnel = {};
> +
> +    my $pid = fork();
> +    if (!defined($pid)) {
> +	$self->log('err', "forking tunnel failed");
> +	return undef;
> +    } elsif ($pid == 0) {
> +	exec(@$cmd);
> +	exit(-1);
> +    } else {
> +	$tunnel->{cpid} = $pid;
>      }
>  
> -    die $err if $err;
> +    $tunnel->{raddr} = $raddr;
>  
> -    return { writer => $writer, reader => $reader, pid => $cpid };
> +    return $tunnel;
>  }
>  
> -sub finish_command_pipe {
> -    my ($self, $cmdpipe, $timeout) = @_;
> +sub finish_tunnel {
> +    my ($self) = @_;
>  
> -    my $writer = $cmdpipe->{writer};
> -    my $reader = $cmdpipe->{reader};
> +    my $tunnel = $self->{tunnel};
> +    my $cpid = $tunnel->{cpid};
> +    return undef if !$cpid;
>  
> -    $writer->close();
> -    $reader->close();
> +    # collect child process
> +    for (my $i = 1; $i < 20; $i++) {
> +	my $waitpid = waitpid($cpid, WNOHANG);
> +	last if (defined($waitpid) && ($waitpid == $cpid));
>  
> -    my $cpid = $cmdpipe->{pid};
> +	if ($i == 10) {
> +	    $self->log('info', "ssh tunnel still running - terminating now with SIGTERM");
> +	    kill(15, $cpid);
>  
> -    if ($timeout) {
> -	for (my $i = 0; $i < $timeout; $i++) {
> -	    return if !PVE::ProcFSTools::check_process_running($cpid);
> -	    sleep(1);
> +	} elsif ($i >= 15) {
> +	    $self->log('info', "ssh tunnel still running - terminating now with SIGKILL");
> +	    kill(9, $cpid);
>  	}
> +	sleep (1);
>      }
>  
> -    $self->log('info', "ssh tunnel still running - terminating now with SIGTERM\n");
> -    kill(15, $cpid);
> +    delete $tunnel->{cpid};
>  
> -    # wait again
> -    for (my $i = 0; $i < 10; $i++) {
> -	return if !PVE::ProcFSTools::check_process_running($cpid);
> -	sleep(1);
> -    }
> +    # just to be sure, check on local..
> +    my $cmd = ['rm', '-f', $tunnel->{raddr}]; #
> +    PVE::Tools::run_command($cmd);
>  
> -    $self->log('info', "ssh tunnel still running - terminating now with SIGKILL\n");
> -    kill 9, $cpid;
> -    sleep 1;
> -}
> -
> -sub fork_tunnel {
> -    my ($self, $nodeip, $lport, $rport) = @_;
> -
> -    my @localtunnelinfo = $lport ? ('-L' , "$lport:localhost:$rport" ) : ();
> -
> -    my $cmd = [@{$self->{rem_ssh}}, @localtunnelinfo, 'qm', 'mtunnel' ];
> -
> -    my $tunnel = $self->fork_command_pipe($cmd);
> -
> -    my $reader = $tunnel->{reader};
> -
> -    my $helo;
> -    eval {
> -	PVE::Tools::run_with_timeout(60, sub { $helo = <$reader>; });
> -	die "no reply\n" if !$helo;
> -	die "no quorum on target node\n" if $helo =~ m/^no quorum$/;
> -	die "got strange reply from mtunnel ('$helo')\n"
> -	    if $helo !~ m/^tunnel online$/;
> -    };
> -    my $err = $@;
> -
> -    if ($err) {
> -	$self->finish_command_pipe($tunnel);
> -	die "can't open migration tunnel - $err";
> -    }
> -    return $tunnel;
> -}
> -
> -sub finish_tunnel {
> -    my ($self, $tunnel) = @_;
> -
> -    my $writer = $tunnel->{writer};
> -
> -    eval {
> -	PVE::Tools::run_with_timeout(30, sub {
> -	    print $writer "quit\n";
> -	    $writer->flush();
> -	});
> -    };
> -    my $err = $@;
> -
> -    $self->finish_command_pipe($tunnel, 30);
> -
> -    die $err if $err;
> +    # .. and on the remote side, so the socket is gone and can be reused
> +    unshift @{$cmd}, @{$self->{rem_ssh}};
> +    PVE::Tools::run_command($cmd);
>  }
>  
>  sub lock_vm {
> @@ -330,6 +284,7 @@ sub phase2 {
>  
>      my $raddr;
>      my $rport;
> +    my $ruri; # the whole migration dst. URL (protocol:/address[:port])
>      my $nodename = PVE::INotify::nodename();
>  
>      ## start on remote node
> @@ -353,14 +308,22 @@ sub phase2 {
>      # instead we pipe it through STDIN
>      PVE::Tools::run_command($cmd, input => $spice_ticket, outfunc => sub {
>  	my $line = shift;
> +	$self->log('info', $line);
>  
>  	if ($line =~ m/^migration listens on tcp:(localhost|[\d\.]+|\[[\d\.:a-fA-F]+\]):(\d+)$/) {
>  	    $raddr = $1;
>  	    $rport = int($2);
> +	    $ruri = "tcp:$raddr:$rport";
> +	}
> +	elsif ($line =~ m!^migration listens on unix:(/run/qemu-server/(\d+)\.migrate)$!) {
> +	    $raddr = $1;
> +	    die "Destination UNIX sockets VMID does not match source VMID" if $vmid ne $2;
> +	    $ruri = "unix:$raddr";
>  	}
>  	elsif ($line =~ m/^migration listens on port (\d+)$/) {
>  	    $raddr = "localhost";
>  	    $rport = int($1);
> +	    $ruri = "tcp:$raddr:$rport";
>  	}
>          elsif ($line =~ m/^spice listens on port (\d+)$/) {
>  	    $spice_port = int($1);
> @@ -372,14 +335,26 @@ sub phase2 {
>  
>      die "unable to detect remote migration address\n" if !$raddr;
>  
> -    ## create tunnel to remote port
> -    $self->log('info', "starting ssh migration tunnel");
> -    my $pfamily = PVE::Tools::get_host_address_family($nodename);
> -    my $lport = ($raddr eq "localhost") ? PVE::Tools::next_migrate_port($pfamily) : undef;
> -    $self->{tunnel} = $self->fork_tunnel($self->{nodeip}, $lport, $rport);
> +    if ($ruri =~ /^unix:/) {
> +	## create tunnel to remote port
> +	$self->log('info', "start remote tunnel");
> +	$self->{tunnel} = $self->fork_tunnel($raddr);
> +
> +	my $unix_socket_try = 0; # wait for the socket to become ready
> +	while (! -S "/run/qemu-server/$vmid.migrate") {
> +	    $unix_socket_try++;
> +	    if ($unix_socket_try > 100) {
> +		$self->{errors} = 1;
> +		$self->finish_tunnel();
> +		die "Timeout, migration socket $ruri did not get ready";
> +	    }
> +
> +	    usleep(10000);
> +	}
> +    }
>  
>      my $start = time();
> -    $self->log('info', "starting online/live migration on $raddr:$rport");
> +    $self->log('info', "starting online/live migration on $ruri");
>      $self->{livemigration} = 1;
>  
>      # load_defaults
> @@ -438,10 +413,10 @@ sub phase2 {
>      }
>  
>      eval {
> -        PVE::QemuServer::vm_mon_cmd_nocheck($vmid, "migrate", uri => "tcp:$raddr:$rport");
> +        PVE::QemuServer::vm_mon_cmd_nocheck($vmid, "migrate", uri => $ruri);
>      };
>      my $merr = $@;
> -    $self->log('info', "migrate uri => tcp:$raddr:$rport failed: $merr") if $merr;
> +    $self->log('info', "migrate uri => $ruri failed: $merr") if $merr;
>  
>      my $lstat = 0;
>      my $usleep = 2000000;
> @@ -538,13 +513,10 @@ sub phase2 {
>  	    die "unable to parse migration status '$stat->{status}' - aborting\n";
>  	}
>      }
> -    #to be sure tat the tunnel is closed 
> +
> +    # just to be sure that the tunnel always gets closed
>      if ($self->{tunnel}) {
> -	eval { finish_tunnel($self, $self->{tunnel});  };
> -	if (my $err = $@) {
> -	    $self->log('err', $err);
> -	    $self->{errors} = 1;
> -	}
> +	finish_tunnel($self);
>      }
>  }
>  
> @@ -580,11 +552,7 @@ sub phase2_cleanup {
>      }
>  
>      if ($self->{tunnel}) {
> -	eval { finish_tunnel($self, $self->{tunnel});  };
> -	if (my $err = $@) {
> -	    $self->log('err', $err);
> -	    $self->{errors} = 1;
> -	}
> +	finish_tunnel($self});

Ugh, sent the wrong file, which still has the problematic "}" syntax error here,
sorry, I'll resend
(and I need to clean up my format-patch outputs more strictly... :/).

>      }
>  }
>  
> diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
> index 33779b3..dd94de4 100644
> --- a/PVE/QemuServer.pm
> +++ b/PVE/QemuServer.pm
> @@ -4302,21 +4302,27 @@ sub vm_start {
>  
>  	my ($cmd, $vollist, $spice_port) = config_to_command($storecfg, $vmid, $conf, $defaults, $forcemachine);
>  
> -	my $migrate_port = 0;
>  	my $migrate_uri;
>  	if ($statefile) {
>  	    if ($statefile eq 'tcp') {
> -		my $localip = "localhost";
> +		# default to secure migrations: use ssh over a socat managed UNIX socket
> +		# pair, as an ssh forward tunnel is not reliably ready in time
> +		$migrate_uri = "unix:/run/qemu-server/$vmid.migrate";
> +
>  		my $datacenterconf = PVE::Cluster::cfs_read_file('datacenter.cfg');
> -		my $nodename = PVE::INotify::nodename();
>  		if ($datacenterconf->{migration_unsecure}) {
> -			$localip = PVE::Cluster::remote_node_ip($nodename, 1);
> -			$localip = "[$localip]" if Net::IP::ip_is_ipv6($localip);
> +		    my $nodename = PVE::INotify::nodename();
> +		    my $localip = PVE::Cluster::remote_node_ip($nodename, 1);
> +		    $localip = "[$localip]" if Net::IP::ip_is_ipv6($localip);
> +
> +		    my $pfamily = PVE::Tools::get_host_address_family($nodename);
> +		    my $migrate_port = PVE::Tools::next_migrate_port($pfamily) || 0;
> +
> +		    $migrate_uri = "tcp:${localip}:${migrate_port}";
>  		}
> -		my $pfamily = PVE::Tools::get_host_address_family($nodename);
> -		$migrate_port = PVE::Tools::next_migrate_port($pfamily);
> -		$migrate_uri = "tcp:${localip}:${migrate_port}";
> +
>  		push @$cmd, '-incoming', $migrate_uri;
> +
>  		push @$cmd, '-S';
>  	    } else {
>  		push @$cmd, '-loadstate', $statefile;




