[PVE-User] VM network disconnect issue after upgrade to PVE 6.1

Eneko Lacunza elacunza at binovo.es
Thu Feb 20 10:05:12 CET 2020


Hi all,

On February 11th we upgraded a PVE 5.3 cluster to 5.4, then to 6.1.

This is a hyperconverged cluster with 3 servers, redundant networking, and 
Ceph with two storage pools, one HDD-based and the other SSD-based.

Each server consists of:
- Dell R530
- 1xXeon E5-2620 8c/16t 2.1Ghz
- 64GB RAM
- 4x1Gbit ethernet (2 bonds)
- 2x10Gbit ethernet (1 bond)
- 1xIntel S4500 480GB - System + Bluestore DB for HDDs
- 1xIntel S4500 480GB - Bluestore OSD
- 4x1TB HDD - Bluestore OSD (with 30GB db on SSD)
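
The HDD/SSD pool split is done with device-class CRUSH rules along these 
lines (rule and pool names here are just examples, not our real ones):

    ceph osd crush rule create-replicated replicated-hdd default host hdd
    ceph osd crush rule create-replicated replicated-ssd default host ssd
    ceph osd pool set pool-hdd crush_rule replicated-hdd
    ceph osd pool set pool-ssd crush_rule replicated-ssd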

There are two Dell n1224T switches; each bond has one interface on each 
switch. Bonds are active/passive, and all active interfaces are on the same 
switch.

vmbr0 is on a 2x1Gbit bond (bond0).
Ceph public and private networks are on a 2x10Gbit bond (bond2).
The backup network is IPv6 on a 2x1Gbit bond (bond1), to a Synology NAS.
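
For reference, the bond/bridge part of /etc/network/interfaces on each node 
looks roughly like this (NIC names and addresses below are illustrative, not 
the real ones):

    auto bond0
    iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode active-backup
        bond-primary eno1
        # bond1 (backup) and bond2 (Ceph) are defined the same way on their own NICs

    auto vmbr0
    iface vmbr0 inet static
        address 192.0.2.11/24
        gateway 192.0.2.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0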

SSD disk wearout is at 0%.

It seems that since the upgrade we're experiencing network connectivity 
issues at night, during the backup window.

We think the backups may be the cause. Until yesterday, backups were done 
over vmbr0 with IPv4; as they nearly saturated the 1Gbit link, we changed 
the network and storage configuration so that the backup NAS is now reached 
over bond1, which wasn't used before. We're using IPv6 because the Synology 
GUI can't configure two IPv4 addresses on a bond.
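
The backup storage entry in /etc/pve/storage.cfg now points to the 
Synology's IPv6 address on bond1, roughly like this (assuming an NFS export 
for illustration; storage name, export path and address are made up):

    nfs: synology-backup
        export /volume1/pve-backup
        path /mnt/pve/synology-backup
        server fd00:1234::10
        content backup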

But it seems the issue happened again tonight (an SQL Server connection 
drop). The VM has network connectivity in the morning, so it isn't a 
permanent problem.

We tried running the main VM's backup yesterday morning but couldn't 
reproduce the issue, although during the regular backup window all 3 nodes 
are doing backups, while in the test we only backed up the single VM stored 
on the SSD pool.
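
The manual test was essentially a single-VM run like this (VM 103 here, the 
SSD-pool VM from the logs below; storage name and compression are 
illustrative), while the nightly job starts backups on all 3 nodes:

    vzdump 103 --mode snapshot --storage synology-backup --compress lzo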

This VM has 8 vcores, 10GB of RAM, one 300GB VirtIO SCSI disk (scsi0, 
cache=writeback), and an e1000 network card.
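
In qm config terms the relevant lines look roughly like this (storage name, 
MAC and bridge are placeholders):

    cores: 8
    memory: 10240
    scsihw: virtio-scsi-pci
    scsi0: ceph-ssd:vm-103-disk-0,cache=writeback,size=300G
    net0: e1000=AA:BB:CC:DD:EE:FF,bridge=vmbr0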

Backup reports:
INFO: status: 100% (322122547200/322122547200), sparse 22% (72698785792), 
duration 2416, read/write 3650/0 MB/s
INFO: transferred 322122 MB in 2416 seconds (133 MB/s)

And peaks like:
INFO: status: 70% (225552891904/322122547200), sparse 3% (12228284416), 
duration 2065, read/write 181/104 MB/s
INFO: status: 71% (228727980032/322122547200), sparse 3% (12228317184), 
duration 2091, read/write 122/122 MB/s
INFO: status: 72% (232054063104/322122547200), sparse 3% (12228349952), 
duration 2118, read/write 123/123 MB/s
INFO: status: 73% (235237539840/322122547200), sparse 3% (12230103040), 
duration 2147, read/write 109/109 MB/s
INFO: status: 74% (238500708352/322122547200), sparse 3% (12237438976), 
duration 2177, read/write 108/108 MB/s

Also, during the backup we see the following messages in the syslog of the 
physical node:
Feb 20 00:00:18 sotllo pve-ha-lrm[3930696]: VM 103 qmp command failed - 
VM 103 qmp command 'query-status' failed - got timeout
Feb 20 00:00:18 sotllo pve-ha-lrm[3930696]: VM 103 qmp command 
'query-status' failed - got timeout#012
Feb 20 00:00:28 sotllo pve-ha-lrm[3930759]: VM 103 qmp command failed - 
VM 103 qmp command 'query-status' failed - unable to connect to VM 103 
qmp socket - timeout after 31 retries
Feb 20 00:00:28 sotllo pve-ha-lrm[3930759]: VM 103 qmp command 
'query-status' failed - unable to connect to VM 103 qmp socket - timeout 
after 31 retries#012
Feb 20 00:00:38 sotllo pve-ha-lrm[3930822]: VM 103 qmp command failed - 
VM 103 qmp command 'query-status' failed - unable to connect to VM 103 
qmp socket - timeout after 31 retries
Feb 20 00:00:38 sotllo pve-ha-lrm[3930822]: VM 103 qmp command 
'query-status' failed - unable to connect to VM 103 qmp socket - timeout 
after 31 retries#012
[...]
Feb 20 00:40:38 sotllo pve-ha-lrm[3948846]: VM 103 qmp command failed - 
VM 103 qmp command 'query-status' failed - got timeout
Feb 20 00:40:38 sotllo pve-ha-lrm[3948846]: VM 103 qmp command 
'query-status' failed - got timeout#012
Feb 20 00:41:28 sotllo pve-ha-lrm[3949193]: VM 103 qmp command failed - 
VM 103 qmp command 'query-status' failed - got timeout
Feb 20 00:41:28 sotllo pve-ha-lrm[3949193]: VM 103 qmp command 
'query-status' failed - got timeout#012

So it seems the backup is having a big impact on the VM. These messages are 
only seen for 3 of the 4 VMs in HA, and for the other affected VMs (they're 
on the HDD pool) they are just logged twice, and not every day. For this VM 
there are lots of these messages every day.
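
Next time it happens during the backup window we can also check directly 
whether the VM's QMP socket responds at all, e.g.:

    qm status 103 --verbose
    ls -l /var/run/qemu-server/103.qmp

If that also times out while the backup runs, it would point at the QEMU 
process itself being blocked rather than at the network.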

CPU usage on the physical server during the backup is low: load peaks at 
about 1.5-3.5 and CPU usage peaks at about 10%.

Although it has been working fine until now, could e1000 emulation be the 
issue? We'll have to schedule downtime, but we can try changing to VirtIO.
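
The change itself would be something like this, keeping the existing MAC so 
the guest keeps its addresses (the MAC below is a placeholder); the model 
change takes effect after a full stop/start of the VM, hence the downtime:

    qm set 103 --net0 virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0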

Any other ideas about what could be causing the issue?

Thanks a lot for reading through here!!

All three nodes have the same versions:

root at sotllo:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.13-3-pve)
pve-manager: 6.1-7 (running version: 6.1-7/13e58d5e)
pve-kernel-5.3: 6.1-3
pve-kernel-helper: 6.1-3
pve-kernel-4.15: 5.4-12
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 14.2.6-pve1
ceph-fuse: 14.2.6-pve1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-11
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-4
libpve-storage-perl: 6.1-4
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-19
pve-docs: 6.1-4
pve-edk2-firmware: 2.20191127-1
pve-firewall: 4.0-10
pve-firmware: 3.0-4
pve-ha-manager: 3.0-8
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-5
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1

-- 
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarragako bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es



