[PVE-User] Cluster fiasco and recovery

Ian Coetzee proxmox at iancoetzee.za.net
Thu Feb 14 08:15:49 CET 2019


Hi All,

I just wanted to share something that happened to me yesterday. I am
hoping that this will save someone else 4 hours in the future.

We have a smallish cluster of 5 nodes (3 compute, 2 storage). One of the
compute nodes also doubles as a storage node.

Yesterday, I was adding another compute node into the mix when suddenly the
other compute nodes rebooted with nothing to show in the logs. As a result,
the cluster lost quorum, as would be expected. What wasn't expected was that
the cluster could not re-establish quorum. It looked like a network-related
issue, and I remembered that back when I first set up the cluster, I had a
lot of issues between the Linux bonded interface and the switch, where I had
to run "ifdown bond0; ifup bond0" after a startup. It seems the kernel
brings the bond up, and only then tries to set the bond mode to LACP... But
I digress.

So the first thing I did was to troubleshoot this part by changing the bond
types on the switches. When that didn't work, I tore down the bonds and went
back to basics. Still the cluster couldn't establish quorum. After pulling
out most of my hair in frustration, I went to ask Google and came across an
older post where the OP had a member offline while joining a new member.
That is what pointed me to actually look at my corosync.conf file.

It turned out that I had joined the first three members from the CLI
(before the option was available in the GUI), using the DNS names of the
other cluster members. This is the important part: as you can probably
guess, my DNS servers were also running in the cluster, so they went offline
when the compute nodes rebooted. Thus the cluster members had no idea what
the IP addresses of the first 3 nodes were...

I replaced the DNS hostnames in corosync.conf with the actual IP addresses,
and the cluster established quorum.
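
For anyone who wants to check their own config for the same trap, here is a
rough Python sketch. It assumes the node addresses appear on lines of the
form "ringN_addr: value" in the nodelist section, which is how I understand
corosync.conf to be laid out; the sample config, node names and addresses
are made up for illustration and are not from my cluster.

```python
import ipaddress
import re

# Matches nodelist lines such as "ring0_addr: pve1.example.com".
RING_ADDR = re.compile(r"^\s*(ring\d+_addr)\s*:\s*(\S+)", re.MULTILINE)


def is_ip(value):
    """Return True if value parses as a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def hostname_ring_addrs(conf_text):
    """Return (key, value) pairs whose value is not a literal IP address."""
    return [(k, v) for k, v in RING_ADDR.findall(conf_text) if not is_ip(v)]


# Hypothetical excerpt; names and addresses are placeholders.
SAMPLE = """
nodelist {
  node {
    name: pve1
    nodeid: 1
    ring0_addr: pve1.example.com
  }
  node {
    name: pve2
    nodeid: 2
    ring0_addr: 192.0.2.12
  }
}
"""

if __name__ == "__main__":
    for key, value in hostname_ring_addrs(SAMPLE):
        print(f"{key} uses a hostname: {value}")
```

To run it against a real cluster, read the contents of your own
corosync.conf into a string and pass that in instead of SAMPLE. Any line it
reports depends on name resolution working at the moment corosync starts.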

So my advice to you guys out there is to make sure that:

   - Your corosync.conf uses IP addresses
   or
   - If you want to use hostnames, those hostnames are also in your
   /etc/hosts file
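
If you go the hostname route, something like the following Python sketch
can confirm that every node name has an /etc/hosts entry. It only parses
the usual "address name [aliases...]" line format; the function names, the
sample hosts content and the node list are all hypothetical.

```python
def names_in_hosts(hosts_text):
    """Collect every hostname and alias from /etc/hosts-style text."""
    names = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        # fields[0] is the address; everything after it is a name or alias.
        names.update(name.lower() for name in fields[1:])
    return names


def missing_from_hosts(node_names, hosts_text):
    """Return the node names that have no entry in the hosts text."""
    known = names_in_hosts(hosts_text)
    return [n for n in node_names if n.lower() not in known]


# Hypothetical data; replace with your own hosts file and node names.
SAMPLE_HOSTS = """
127.0.0.1   localhost
192.0.2.11  pve1.example.com pve1   # compute
192.0.2.12  pve2.example.com pve2
"""

NODES = ["pve1.example.com", "pve2.example.com", "pve3.example.com"]

if __name__ == "__main__":
    for name in missing_from_hosts(NODES, SAMPLE_HOSTS):
        print(f"{name} has no /etc/hosts entry")
```

The check would need to be run on every node, since each one resolves the
names from its own /etc/hosts file.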

I also think corosync should have logged errors when it was unable to
resolve the hostnames to IP addresses.

Just a little piece of advice from me to you.

Proxmox rocks though!


