[PVE-User] Cluster fiasco and recovery

Thu Feb 14 10:29:04 CET 2019

MANY thanks Ian !!!
I am planning to add another node next month.
I am also using a LAG on bond0, and my DNS failover lives in CTs,
so I will be adding the IPs to /etc/hosts on all nodes ;-)
MANY thanks to Dietmar and Martin, who run this pve-user list !!!

On 2019-02-14 08:15, Ian Coetzee wrote:
> Hi All,
> 
> I just wanted to share something that happened to me yesterday. I am
> hoping that this will save someone else 4 hours in the future.
> 
> We have a smallish cluster of 5 nodes (3 compute, 2 storage). One of
> the compute nodes also doubles as a storage node.
> 
> Yesterday, I was adding another compute node into the mix when suddenly
> the other compute nodes just randomly rebooted with nothing to show in
> the logs. As a result, the cluster lost quorum, as would be expected.
> What wasn't expected was that the cluster could not re-establish quorum.
> It looked to be a network-related issue, and I remembered that back in
> the day when I started the cluster, I had a lot of issues between the
> Linux bonded interface and the switch, where I had to run
> if{down,up} bond0 after a startup. It seems the kernel brings the bond
> up, and then tries to set the bond mode to LACP... But I digress.
> 
> So the first thing I did was to troubleshoot this part by changing the
> bond types on the switches. When that didn't work, I tore down the bonds
> and went back to basics. Still the cluster couldn't establish quorum.
> After pulling out most of my hair in frustration, I went to ask Google
> and came across an older post where the OP had a member offline while
> joining a new member. This is what pointed me to actually look at my
> corosync.conf file.
> 
> It turned out that when I joined the first three members, it was from
> the CLI (before the option was available in the GUI) using the DNS names
> of the other cluster members. This is the important part: as you can
> probably guess, my DNS servers were also running in the cluster, so they
> went offline when the compute nodes rebooted. Thus the cluster members
> had no idea what the IP addresses of the first three nodes were...
> 
> I replaced the DNS hostnames in corosync.conf with the actual IP
> addresses and the cluster established quorum.
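
For anyone who wants to check their own config for this, here is a minimal
sketch in Python. It assumes the node addresses appear as ringN_addr
entries in the nodelist section of corosync.conf. The function names, the
hostname pve1.example.com and the 192.0.2.x addresses are made up for
illustration, so adapt them to your setup.

```python
import ipaddress
import re

# Matches lines such as "ring0_addr: pve1.example.com" in a nodelist block.
RING_ADDR = re.compile(r"^\s*(ring\d+_addr)\s*:\s*(\S+)\s*$", re.MULTILINE)


def is_ip_literal(value):
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def hostname_ring_addrs(conf_text):
    """Return (key, value) pairs whose value is a hostname, not an IP."""
    return [(key, value)
            for key, value in RING_ADDR.findall(conf_text)
            if not is_ip_literal(value)]


# Hypothetical sample; real node names and addresses will differ.
SAMPLE = """\
nodelist {
  node {
    name: pve1
    nodeid: 1
    ring0_addr: pve1.example.com
  }
  node {
    name: pve2
    nodeid: 2
    ring0_addr: 192.0.2.12
  }
}
"""

if __name__ == "__main__":
    for key, value in hostname_ring_addrs(SAMPLE):
        print(f"{key} depends on name resolution: {value}")
```

On a real node you would pass in the contents of your own corosync.conf
(on Proxmox VE it normally lives at /etc/pve/corosync.conf) instead of
SAMPLE. Any entry it reports only works while something can resolve that
name.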
> 
> So my advice to you guys out there: make sure that
> 
>    - Your corosync.conf uses IP addresses
>    or
>    - If you want to use hostnames, put those hostnames in your
>      /etc/hosts file
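
If you go the hostname route, a rough way to confirm that every cluster
name is covered by /etc/hosts could look like the Python sketch below. The
helper names, the node names and the 192.0.2.x addresses are hypothetical.
It only checks that a name is present in the file, not that the address
next to it is the right one.

```python
def hosts_file_names(hosts_text):
    """Collect every hostname and alias defined in /etc/hosts-style text."""
    names = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        fields = line.split()
        if len(fields) >= 2:  # address followed by one or more names
            names.update(name.lower() for name in fields[1:])
    return names


def missing_from_hosts(cluster_names, hosts_text):
    """Return the cluster names that have no entry in the hosts text."""
    known = hosts_file_names(hosts_text)
    return [name for name in cluster_names if name.lower() not in known]


# Hypothetical hosts file content for illustration.
HOSTS = """\
127.0.0.1 localhost
192.0.2.11 pve1.example.com pve1
192.0.2.12 pve2.example.com pve2  # second compute node
"""

if __name__ == "__main__":
    for name in missing_from_hosts(["pve1", "pve2", "pve3"], HOSTS):
        print(f"no /etc/hosts entry for {name}")
```

Run it on each node against that node's own /etc/hosts, since the file is
not shared across the cluster.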
> 
> I also think corosync should have logged errors when it was unable to
> resolve the hostnames to IP addresses.
> 
> Just a little piece of advice from me to you.
> 
> Proxmox rocks though!