[PVE-User] Multicast problems with Intel X540 - 10Gtek network card?

Eneko Lacunza elacunza at binovo.es
Tue Dec 4 15:57:10 CET 2018


Hi all,

We have just upgraded a 3-node Proxmox cluster from 3.4 to 5.2, Ceph from 
Hammer to Luminous, and the network from 1 Gbit to 10 Gbit... one of the 
three Proxmox nodes is new too :)

Generally everything went well and the VMs are working fine. :-)

BUT, we have a problem with the cluster: the proxmox1 node joins and 
then drops out of the cluster after about 4 minutes.
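
(In case it matters: the drop-out is visible with the usual tools; the lines 
below are just a generic sketch, not output from our nodes.)

# on proxmox1, standard PVE/corosync tooling
pvecm status                 # quorum state and member list
corosync-quorumtool -m       # monitor membership changes live
journalctl -u corosync -f    # corosync log (TOTEM / retransmit messages)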

All the multicast tests from
https://pve.proxmox.com/wiki/Multicast_notes#Using_omping_to_test_multicast
run fine except the last one:

*** proxmox1:

root@proxmox1:~# omping -c 600 -i 1 -F -q proxmox1 proxmox3 proxmox4
proxmox3 : waiting for response msg
proxmox4 : waiting for response msg
proxmox3 : joined (S,G) = (*, 232.43.211.234), pinging
proxmox4 : joined (S,G) = (*, 232.43.211.234), pinging
proxmox3 : given amount of query messages was sent
proxmox4 : given amount of query messages was sent

proxmox3 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.073/0.184/0.390/0.061
proxmox3 : multicast, xmt/rcv/%loss = 600/262/56%, min/avg/max/std-dev = 0.092/0.207/0.421/0.068
proxmox4 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.049/0.167/0.369/0.059
proxmox4 : multicast, xmt/rcv/%loss = 600/262/56%, min/avg/max/std-dev = 0.063/0.185/0.386/0.064


*** proxmox3:

root@proxmox3:/etc# omping -c 600 -i 1 -F -q proxmox1 proxmox3 proxmox4
proxmox1 : waiting for response msg
proxmox4 : waiting for response msg
proxmox4 : joined (S,G) = (*, 232.43.211.234), pinging
proxmox1 : waiting for response msg
proxmox1 : joined (S,G) = (*, 232.43.211.234), pinging
proxmox4 : given amount of query messages was sent
proxmox1 : given amount of query messages was sent

proxmox1 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.083/0.193/1.030/0.055
proxmox1 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.102/0.209/1.050/0.054
proxmox4 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.041/0.108/0.172/0.026
proxmox4 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.048/0.123/0.190/0.030


*** proxmox4:

root@proxmox4:~# omping -c 600 -i 1 -F -q proxmox1 proxmox3 proxmox4
proxmox1 : waiting for response msg
proxmox3 : waiting for response msg
proxmox1 : waiting for response msg
proxmox3 : waiting for response msg
proxmox3 : joined (S,G) = (*, 232.43.211.234), pinging
proxmox1 : joined (S,G) = (*, 232.43.211.234), pinging
proxmox1 : given amount of query messages was sent
proxmox3 : given amount of query messages was sent

proxmox1 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.085/0.188/0.356/0.040
proxmox1 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.114/0.208/0.377/0.041
proxmox3 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.048/0.117/0.289/0.023
proxmox3 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.064/0.134/0.290/0.026


OK, so it seems we have a network problem on the proxmox1 node. The network 
cards are as follows:

- proxmox1: Intel X540 (10Gtek)
- proxmox3: Intel X710 (Intel)
- proxmox4: Intel X710 (Intel)

The switch is a Dell N1224T-ON.

Does anyone have experience with Intel X540-based network cards, the Linux 
ixgbe driver, or the manufacturer 10Gtek?
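
If it helps to compare notes, the driver and firmware details for the X540 
can be pulled like this (a generic sketch; <10g-iface> stands for the actual 
interface name on the node):

lspci | grep -i ethernet       # confirm the X540 is detected on the PCI bus
ethtool -i <10g-iface>         # driver (ixgbe), driver and firmware versions
dmesg | grep -i ixgbe          # driver messages from boot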

If we move corosync communication to the 1 Gbit network cards (Broadcom) 
connected to an old HPE 1800-24G switch, the cluster is stable...
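
(In case anyone wants the detail: moving corosync to the 1 Gbit NICs is 
roughly the usual ring0_addr/bindnetaddr edit in /etc/pve/corosync.conf, 
sketched below with placeholder addresses and annotated for this mail; 
config_version has to be bumped on every change.)

nodelist {
  node {
    name: proxmox1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.11    # placeholder 1 Gbit address
  }
  # proxmox3 / proxmox4 entries analogous
}
totem {
  config_version: <n+1>         # must be incremented on every edit
  interface {
    bindnetaddr: 192.168.1.0    # placeholder 1 Gbit subnet
    ringnumber: 0
  }
}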

We also have another cluster running with a Dell N1224T-ON switch and X710 
network cards, without any issues.

Thanks a lot
Eneko


-- 
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es



