[pve-devel] pvedaemon hanging because of qga retry

Dietmar Maurer dietmar at proxmox.com
Fri May 18 07:27:54 CEST 2018


If you simply skip commands like 'guest-fsfreeze-thaw',
your VM will become totally unusable (frozen). So I am not
sure what you want to suggest?

A correct fix would be to implement an async command queue inside qemu...
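
Very roughly, such a queue could look like the following sketch (illustrative only, not an actual patch; send_qga_command() is a hypothetical placeholder for the real qga dispatch):

my %pending;    # vmid => list of queued guest-agent commands

sub queue_guest_cmd {
    my ($vmid, $cmd, $callback) = @_;
    # enqueue and return immediately - the API worker is never blocked by qga
    push @{$pending{$vmid}}, { cmd => $cmd, callback => $callback };
}

sub drain_guest_queue {
    # called from the daemon's event loop, not from the request handler
    my ($vmid) = @_;
    return if !$pending{$vmid};
    while (my $entry = shift @{$pending{$vmid}}) {
        my $res = eval { send_qga_command($vmid, $entry->{cmd}) };  # hypothetical helper
        $entry->{callback}->($res, $@) if $entry->{callback};
    }
}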


> On May 18, 2018 at 7:13 AM Alexandre DERUMIER <aderumier at odiso.com> wrote:
> 
> 
> It seems to have been introduced a long time ago, in 2012:
> 
> https://git.proxmox.com/?p=qemu-server.git;a=blobdiff;f=PVE/QMPClient.pm;h=9829986ae77e82d340974e4d4128741ef85b4a0e;hp=d026f4d4c3012203d96660a311b1890e84e6aa18;hb=6d04217600f2145ee80d5d62231b8ade34f2e5ff;hpb=037a97463447b06ebf79a7f1d40c596d9955acee
> 
> Previously, the connect timeout was 1s.
> 
> I think we didn't have qga support at that time. Not sure why it has been
> increased for qmp commands?
> 
> 
> (With 1s, it works fine when the qga agent is down.)
> 
> 
> 
> ----- Original Message -----
> From: "aderumier" <aderumier at odiso.com>
> To: "pve-devel" <pve-devel at pve.proxmox.com>
> Sent: Friday, 18 May 2018 00:37:30
> Subject: Re: [pve-devel] pvedaemon hanging because of qga retry
> 
> In QMPClient.pm, open_connection:
> 
> for (;;) {
>     $count++;
>     $fh = IO::Socket::UNIX->new(Peer => $sname, Blocking => 0, Timeout => 1);
>     last if $fh;
>     if ($! != EINTR && $! != EAGAIN) {
>         die "unable to connect to VM $vmid $sotype socket - $!\n";
>     }
>     my $elapsed = tv_interval($starttime, [gettimeofday]);
>     if ($elapsed >= $timeout) {
>         die "unable to connect to VM $vmid $sotype socket - timeout after $count retries\n";
>     }
>     usleep(100000);
> }
> 
> 
> We use $elapsed >= $timeout here.
> 
> Isn't this timeout meant for command execution time rather than connect time?
> 
> I see at the end:
> 
> $self->{mux}->set_timeout($fh, $timeout);
> 
> which seems to be the command execution timeout in the muxer.
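> 
> For illustration only, a rough sketch of what a separate (short) connect timeout could look like; $connect_timeout is hypothetical and not part of the current code:
> 
> my $connect_timeout = 1;   # hypothetical: bound only the connect loop, e.g. 1s
> for (;;) {
>     $count++;
>     $fh = IO::Socket::UNIX->new(Peer => $sname, Blocking => 0, Timeout => 1);
>     last if $fh;
>     if ($! != EINTR && $! != EAGAIN) {
>         die "unable to connect to VM $vmid $sotype socket - $!\n";
>     }
>     my $elapsed = tv_interval($starttime, [gettimeofday]);
>     if ($elapsed >= $connect_timeout) {
>         die "unable to connect to VM $vmid $sotype socket - timeout after $count retries\n";
>     }
>     usleep(100000);
> }
> 
> # the (possibly much longer) per-command timeout would still apply to execution only
> $self->{mux}->set_timeout($fh, $timeout);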
> 
> 
> 
> 
> 
> ----- Original Message -----
> From: "Alexandre Derumier" <aderumier at odiso.com>
> To: "pve-devel" <pve-devel at pve.proxmox.com>
> Sent: Thursday, 17 May 2018 23:16:36
> Subject: [pve-devel] pvedaemon hanging because of qga retry
> 
> Hi, 
> I had a strange behaviour today,
> 
> with a VM running and qga enabled, but the qga service down inside the VM.
> 
> After these attempts,
> 
> May 17 21:54:01 kvm14 pvedaemon[20088]: VM 745 qmp command failed - VM 745 qmp command 'guest-fsfreeze-thaw' failed - unable to connect to VM 745 qga socket - timeout after 101 retries
> May 17 21:55:10 kvm14 pvedaemon[20088]: VM 745 qmp command failed - VM 745 qmp command 'guest-fsfreeze-thaw' failed - unable to connect to VM 745 qga socket - timeout after 101 retries
> 
> 
> Some API requests gave 596 errors, mainly for VM 745
> (/api2/json/nodes/kvm14/qemu/745/status/current),
> but also for node kvm14 on /api2/json/nodes/kvm14/qemu.
> 
> 
> Restarting pvedaemon fixed the problem.
> 
> 10.59.100.141 - root at pam [17/05/2018:21:53:51 +0200] "POST /api2/json/nodes/kvm14/qemu/745/agent/fsfreeze-freeze HTTP/1.1" 596 -
> 10.59.100.141 - root at pam [17/05/2018:21:55:00 +0200] "POST /api2/json/nodes/kvm14/qemu/745/agent/fsfreeze-freeze HTTP/1.1" 596 -
> 10.59.100.141 - root at pam [17/05/2018:22:01:28 +0200] "POST /api2/json/nodes/kvm14/qemu/745/agent/fsfreeze-freeze HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:01:30 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.59.100.141 - root at pam [17/05/2018:22:02:21 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:03:05 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.59.100.141 - root at pam [17/05/2018:22:03:32 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:04:40 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.59.100.141 - root at pam [17/05/2018:22:05:01 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
> 10.59.100.141 - root at pam [17/05/2018:22:05:59 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:06:15 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:07:50 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:09:25 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:11:00 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:12:35 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.59.100.141 - root at pam [17/05/2018:22:14:19 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:15:44 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:17:19 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:18:54 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:20:29 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:22:04 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:23:39 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:25:14 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:26:49 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:28:24 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:29:59 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:31:34 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:34:44 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.18 - root at pam [17/05/2018:22:35:30 +0200] "GET /api2/json/nodes/kvm14/qemu/733/status/current HTTP/1.1" 596 -
> 10.59.100.141 - root at pam [17/05/2018:22:37:16 +0200] "GET /api2/json/nodes/kvm14/qemu HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:37:24 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:38:59 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 10.3.99.10 - root at pam [17/05/2018:22:40:08 +0200] "GET /api2/json/nodes/kvm14/qemu/745/status/current HTTP/1.1" 596 -
> 
> 
> 
> I don't see error logs for the fsfreeze call (issued directly through the API),
> 
> but 
> 
> } elsif ($cmd->{execute} eq 'guest-fsfreeze-freeze') {
>     # freeze syncs all guest FS, if we kill it it stays in an unfreezable
>     # locked state with high probability, so use an generous timeout
>     $timeout = 60*60; # 1 hour
> 
> 
> Was it still running in pvedaemon?
> 
> same with 
> # qm agent 745 fsfreeze-freeze 
> 
> 
> I think we should do a guest-agent ping with a small timeout before sending the
> longer commands.
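> 
> Something like this rough sketch; qga_cmd() here is a hypothetical helper wrapping the qga QMP call with an explicit timeout, only to show the intent, and the 3s value is purely illustrative:
> 
> # probe the agent cheaply before issuing a long-running command
> my $agent_alive = eval {
>     qga_cmd($vmid, { execute => 'guest-ping' }, 3);   # hypothetical helper, short 3s probe
>     1;
> };
> die "guest agent not responding - not sending fsfreeze-freeze\n" if !$agent_alive;
> 
> # only now send the expensive command with its generous timeout
> qga_cmd($vmid, { execute => 'guest-fsfreeze-freeze' }, 60*60);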
> 
> 
> 
> _______________________________________________
> pve-devel mailing list
> pve-devel at pve.proxmox.com
> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel



