[PVE-User] lxc hang situation

Fri Nov 30 16:51:50 CET 2018

Hi @proxmox,

For some months now we have been experiencing frequent 'hang' situations on our
Proxmox nodes.

Today such a situation occurred again, so we took some time to look at it
more closely.

The situation 'started' when we did a

 pct start 1310

This did not return. The process list showed the following:

 21462 ?        Ss     0:00 /usr/bin/lxc-start -n 1310
 21619 ?        Z      0:00  \_ [lxc-start] <defunct>
 21758 ?        Ss     0:00 [lxc monitor] /var/lib/lxc 1310
 24681 ?        D      0:00  \_ [lxc monitor] /var/lib/lxc 1310

When looking at the wait-channel, the namespaces and the stack of 24681 we 
noticed that it was blocked in 

 [<0>] copy_net_ns+0x
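
For anyone who wants to repeat this inspection, here is a minimal sketch of how the three items can be read from /proc. The helper name inspect_pid and its optional second argument (an alternative proc root, only there so the function can be tried against a test directory) are assumptions made for illustration. Reading /proc/<pid>/stack normally requires root.

```shell
# Hypothetical helper: print the wait channel, kernel stack and namespaces
# of a single process. The proc root defaults to /proc.
inspect_pid() {
    pid="$1"
    proc_root="${2:-/proc}"

    echo "== wchan =="
    cat "$proc_root/$pid/wchan" 2>/dev/null
    echo

    echo "== stack =="            # usually readable by root only
    cat "$proc_root/$pid/stack" 2>/dev/null

    echo "== namespaces =="       # one symlink per namespace type
    ls -l "$proc_root/$pid/ns" 2>/dev/null
}

# Example (as root): inspect_pid 24681
```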

After some more searching, we found with

 grep copy_net_ns /proc/[0-9]*/stack

that there were two more processes also blocked on copy_net_ns. These were
two ionclean processes in other containers. Killing them (with -9) showed
that the restarted ionclean processes immediately blocked again on copy_net_ns.
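
The grep above prints the matching stack lines. If only the PIDs are wanted, for example to feed them to ps, a small loop along these lines could be used. This is a sketch: the function name blocked_on and the optional proc root argument are assumptions for illustration, and it needs root to read the stack files.

```shell
# Hypothetical helper: list the PIDs whose kernel stack mentions a symbol.
blocked_on() {
    symbol="$1"
    proc_root="${2:-/proc}"

    for f in "$proc_root"/[0-9]*/stack; do
        [ -r "$f" ] || continue              # skip unreadable or vanished entries
        if grep -q "$symbol" "$f" 2>/dev/null; then
            basename "$(dirname "$f")"       # the directory name is the PID
        fi
    done | sort -n
}

# Example (as root): show state and command of every blocked process.
# for pid in $(blocked_on copy_net_ns); do
#     ps -o pid,stat,wchan:32,cmd -p "$pid"
# done
```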

The system on which Proxmox is running has two Intel(R) Xeon(R) E5-2690 v4
CPUs, each with 14 cores and 28 threads. With hyper-threading, Proxmox shows
this as 56 CPUs, so real concurrency is possible.

The problem looks like a race condition on some resource. But killing (with -9)
all the processes that are hanging on copy_net_ns does not make the kernel
release the contended resource. After killing all the processes on copy_net_ns,
and with no process left whose stack shows copy_net_ns, starting a new container
immediately blocks again on copy_net_ns. So only a reboot (as far as we know)
solves this.

We played around with ip li set netns on the veth devices, etc., but we could
not get the machine out of this situation in any way other than a reboot.

Searching further based on all this, we found the following issue:

 https://github.com/lxc/lxd/issues/4468

It says that this problem should be solved in kernel 4.17.

We run the latest Proxmox enterprise updates on this machine and its kernel is

 PVE 4.15.18-30 (Thu, 15 Nov 2018 13:32:46 +0100)
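
To check quickly on several nodes whether the running kernel is at or above the 4.17 mentioned in that issue, something like the following sketch could be used. The function name kernel_at_least is an assumption for illustration, and the comparison relies on sort -V from GNU coreutils. It only compares version numbers and says nothing about whether a given distribution kernel actually carries the fix.

```shell
# Hypothetical helper: succeed if the kernel version is >= the given minimum.
# The second argument defaults to the running kernel (uname -r).
kernel_at_least() {
    min="$1"
    cur="${2:-$(uname -r)}"
    cur="${cur%%-*}"                  # drop the distro suffix after the first '-'

    # sort -V orders version strings; the minimum must sort first (or be equal).
    lowest=$(printf '%s\n%s\n' "$min" "$cur" | sort -V | head -n 1)
    [ "$lowest" = "$min" ]
}

# Example:
# if kernel_at_least 4.17; then echo "4.17 or newer"; else echo "older than 4.17"; fi
```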

As the kernel is Ubuntu-based, would it be possible to start using the Ubuntu
18.10 kernel, which is 4.18, to get around this problem?

-- 
Kind regards,
Stephan Leemburg
IT Functions