[pve-devel] [PATCH 2/6] qemu_drive_mirror : handle multiple jobs

Fri Dec 23 11:33:18 CET 2016

>>Doesn't the job show up in query-block-jobs before the connection
>>finishes? If so then we could just assume if it's still at 0% after a
>>timeout the connection doesn't work? The question is though, what is a
>>good timeout?

the problem is that the qemu monitor is totally hanging...

>>Or we find (or add) a way to query the connection status, then we can 
>>poll that before we start polling the block-job status. 
>>(we already poll for stuff, so it's not worse than before ;p) 

maybe can we mount a tunnel (it's possible to use unix socket like for live migration).
but I would like to have something without ssh if possible.
maybe with socat or something like (it's possible to do native tls in qemu now for migration and drive mirror)
Like this we can be sure that the tunnel is established before launching drive-mirror ?

But yes, it could be great to avoid to maintain another qemu patch.
(and I'm pretty poor with C language and network programming)

----- Mail original -----
De: "Wolfgang Bumiller" <w.bumiller at proxmox.com>
À: "aderumier" <aderumier at odiso.com>
Cc: "pve-devel" <pve-devel at pve.proxmox.com>
Envoyé: Vendredi 23 Décembre 2016 10:53:12
Objet: Re: [pve-devel] [PATCH 2/6] qemu_drive_mirror : handle multiple jobs

On Fri, Dec 23, 2016 at 10:20:03AM +0100, Alexandre DERUMIER wrote: 
> Hi wolfgang, 
> 
> I have done more test with drive mirror and nbd target. 
> 
> It seem that the hang occur only if the target ip is unreacheable (no network reponse) 
> 
> # drive_mirror -n drive-scsi0 nbd://66.66.66.66/target 
> ERROR: VM 183 qmp command 'human-monitor-command' failed - got timeout 
> 
> If the ip address exist and up, 
> # drive_mirror -n drive-scsi0 nbd://10.3.94.89:666/target 
> Failed to connect socket: Connection refused 
> 
> 
> 
> I'm not sure, maybe it can hang too if pve-firewall do a drop instead a reject on target port. 
> 
> 
> 
> I think this come from in qemu net/socket.c, 
> 
> where we have an infinite loop. 
> 
> I'm not sure how to add a timeout here, help is welcome :) 

Doesn't the job show up in query-block-jobs before the connection 
finishes? If so then we could just assume if it's still at 0% after a 
timeout the connection doesn't work? The question is though, what is a 
good timeout? 

Or we find (or add) a way to query the connection status, then we can 
poll that before we start polling the block-job status. 
(we already poll for stuff, so it's not worse than before ;p) 

Then again, an actual timeout on qemu's side might even make it 
upstream...? Maybe? (If it doesn't I'd prefer polling as described above 
over maintaining yet another qemu patch ;-) ) 

But it won't be a simple select/poll timeout as they use non-blocking 
sockets and register the fd with a callback for the connection 
completion later (in net_socket_fd_init() it calls 
qemu_set_fd_handler() where it ends up controlled by the aio handlers). 

It might be possible to add a timer (qemu_timer_new() & friends 
perhaps?) to NetSocketState (the NetSocktState cleanup would have to 
remove it, and if it gets triggered it cancels the connection instead). 

> static int net_socket_connect_init(NetClientState *peer, 
> const char *model, 
> const char *name, 
> const char *host_str) 
> { 
> NetSocketState *s; 
> int fd, connected, ret; 
> struct sockaddr_in saddr; 
> 
> if (parse_host_port(&saddr, host_str) < 0) 
> return -1; 
> 
> fd = qemu_socket(PF_INET, SOCK_STREAM, 0); 
> if (fd < 0) { 
> perror("socket"); 
> return -1; 
> } 
> qemu_set_nonblock(fd); 
> 
> connected = 0; 
> for(;;) { 
> ret = connect(fd, (struct sockaddr *)&saddr, sizeof(saddr)); 
> if (ret < 0) { 
> if (errno == EINTR || errno == EWOULDBLOCK) { 
> /* continue */ 
> } else if (errno == EINPROGRESS || 
> errno == EALREADY || 
> errno == EINVAL) { 
> break; 
> } else { 
> perror("connect"); 
> closesocket(fd); 
> return -1; 
> } 
> } else { 
> connected = 1; 
> break; 
> } 
> } 
> s = net_socket_fd_init(peer, model, name, fd, connected); 
> if (!s) 
> return -1; 
> snprintf(s->nc.info_str, sizeof(s->nc.info_str), 
> "socket: connect to %s:%d", 
> inet_ntoa(saddr.sin_addr), ntohs(saddr.sin_port)); 
> return 0; 
> } 
> 
> 
> 
> 
> ----- Mail original ----- 
> De: "aderumier" <aderumier at odiso.com> 
> À: "Wolfgang Bumiller" <w.bumiller at proxmox.com> 
> Cc: "pve-devel" <pve-devel at pve.proxmox.com> 
> Envoyé: Mercredi 21 Décembre 2016 13:57:10 
> Objet: Re: [pve-devel] [PATCH 2/6] qemu_drive_mirror : handle multiple jobs 
> 
> >>Then it can still hang if the destination disappears between tcp_ping() 
> >>and the `drive-mirror` command, so I'd rather get better behavior on qemu's 
> >>side. It needs a time-out or a way to cancel it or something. 
> Yes sure! 
> 
> I'm currently looking at qemu code to see how nbd client works. 
> 
> ----- Mail original ----- 
> De: "Wolfgang Bumiller" <w.bumiller at proxmox.com> 
> À: "aderumier" <aderumier at odiso.com> 
> Cc: "pve-devel" <pve-devel at pve.proxmox.com>, "dietmar" <dietmar at proxmox.com> 
> Envoyé: Mercredi 21 Décembre 2016 12:20:28 
> Objet: Re: [pve-devel] [PATCH 2/6] qemu_drive_mirror : handle multiple jobs 
> 
> > On December 21, 2016 at 10:51 AM Alexandre DERUMIER <aderumier at odiso.com> wrote: 
> > 
> > 
> > >>IIRC that was the only blocker. 
> > >> 
> > >>Basically the patchset has to work *without* tcp_ping() since it is an 
> > >>unreliable check, and then we still have to catch failing connections 
> > >>_correctly_. (There's no point in knowing that "some time in the past 
> > >>you were able to connect to something which may or may not have been a 
> > >>qemu nbd server", we need to know whether the drive-mirror job itself 
> > >>was able to connect.) 
> > 
> > For me, the mirror job auto abort if connection is failing during the migration. Do you see another behaviour ? 
> 
> That covers one problem. IIRC the disk-deletion problem was that due 
> to wrong [] usage around an ipv6 address it could not connect in the 
> first place and didn't error as I would have hoped. 
> 
> > 
> > the tcp_ping was just before launching the drive mirror command, because it was hanging in this case. 
> 
> Then it can still hang if the destination disappears between tcp_ping() 
> and the `drive-mirror` command, so I'd rather get better behavior on qemu's 
> side. It needs a time-out or a way to cancel it or something. 
> 
> _______________________________________________ 
> pve-devel mailing list 
> pve-devel at pve.proxmox.com 
> http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 
> 
>