[PVE-User] Fwd: Cluster fiasco and recovery

Thu Feb 14 16:31:57 CET 2019

Thank you, Ian!

In corosync.conf I had some nodes listed by IP address and others by 
hostname, because some were added to the cluster by IP and others by 
hostname. I edited corosync.conf so that it uses only IP addresses. For example:

   node {
     name: proxmox5
     nodeid: 12
     quorum_votes: 1
     ring0_addr: 192.168.80.5
   }

After that, the command "pvecm nodes" returns just the nodes' IP addresses. 
Maybe going through DNS for every corosync lookup was too much, and maybe 
the DNS server was busy.

This way, corosync should work better!

-------- Forwarded Message --------
Subject: 	[PVE-User] Cluster fiasco and recovery
Date: 	Thu, 14 Feb 2019 09:15:49 +0200
From: 	Ian Coetzee <proxmox at iancoetzee.za.net>
Reply-To: 	PVE User List <pve-user at pve.proxmox.com>
To: 	PVE User List <pve-user at pve.proxmox.com>

Hi All,

I just wanted to share something that happened to me yesterday. I am
hoping that this will save someone else 4 hours in the future.

We have a smallish cluster of 5 nodes (3 compute, 2 storage). One of the
compute nodes also doubles as a storage node.

Yesterday, I was adding another compute node into the mix when suddenly the
other compute nodes just randomly rebooted with nothing to show in the
logs. As a result, the cluster lost quorum, as would be expected. What
wasn't expected was that the cluster could not re-establish quorum. It
looked to be a network-related issue, and I remembered that back in the days
when I started the cluster, I had a lot of issues between the Linux bonded
interface and the switch, where I had to run *if{down,up} bond0* after a
startup. It seems the kernel brings the bond up, and then tries to set the
bond mode to LACP... But I digress.

So the first thing I did was to troubleshoot this part by changing the bond
types on the switches. When that didn't work, I tore down the bonds and went
back to basics. Still the cluster could not establish quorum. After pulling
out most of my hair in frustration, I went to ask Google and came across a
post from a while ago where the OP had a member offline while joining a
new member. This is what pointed me to actually look at my corosync.conf
file.

It turned out that when I joined the first three members, I did it from the
CLI (before the option was available in the GUI) using the DNS names of the
other cluster members. This is the important part: as you can probably guess,
my DNS servers were also running in the cluster, so they went offline
when the compute nodes rebooted. Thus the cluster members had no idea what
the IP addresses of the first 3 nodes were...

I replaced the dns hostnames in the corosync.conf with the actual ip
addresses and the cluster established quorum.
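
To spot this kind of entry before it bites, something like the following
rough Python sketch could scan the text of a corosync.conf and list every
ring address that is not a literal IP. The function names and the sample
hostname are made up for illustration, and the regular expression only
assumes the simple "ringN_addr: value" line format seen in this thread.

```python
import ipaddress
import re

# Matches lines such as "ring0_addr: 192.168.80.5" (assumed simple format).
RING_ADDR = re.compile(r"^\s*(ring\d+_addr)\s*:\s*(\S+)\s*$")


def is_ip_literal(value):
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def hostname_ring_addrs(conf_text):
    """Return (line_number, key, value) for each ring address that is a name."""
    findings = []
    for lineno, line in enumerate(conf_text.splitlines(), start=1):
        match = RING_ADDR.match(line)
        if match and not is_ip_literal(match.group(2)):
            findings.append((lineno, match.group(1), match.group(2)))
    return findings


# Hypothetical sample: one node by IP, one by a made-up hostname.
SAMPLE = """\
node {
  name: proxmox5
  ring0_addr: 192.168.80.5
}
node {
  name: proxmox6
  ring0_addr: proxmox6.example
}
"""

for lineno, key, value in hostname_ring_addrs(SAMPLE):
    print(f"line {lineno}: {key} uses a hostname: {value}")
```

To check a real node, read the contents of your own corosync.conf into a
string and pass it to hostname_ring_addrs() instead of SAMPLE.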

So my advice to you guys out there, make sure that:

- Your corosync.conf uses ip addresses
or
- If you want to use hostnames, put those hostnames in your /etc/hosts
file
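
For the second option, a small sketch like this one could confirm that
every hostname you intend to use is actually defined in the hosts file, so
that resolution does not depend on DNS being up. It works on the file's
text, so it can be tried without touching a live node; the function names
and the sample entries (other than proxmox5 and its address from the
example above) are hypothetical.

```python
def names_in_hosts(hosts_text):
    """Collect every hostname and alias defined in hosts-file text."""
    names = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        fields = line.split()
        if len(fields) >= 2:
            # fields[0] is the address, the rest are names and aliases
            names.update(name.lower() for name in fields[1:])
    return names


def missing_from_hosts(hostnames, hosts_text):
    """Return the hostnames that have no entry in the hosts-file text."""
    known = names_in_hosts(hosts_text)
    return [h for h in hostnames if h.lower() not in known]


# Hypothetical hosts file content for illustration.
HOSTS_SAMPLE = """\
127.0.0.1 localhost
192.168.80.5 proxmox5.example proxmox5  # compute node
"""

print(missing_from_hosts(["proxmox5", "proxmox6"], HOSTS_SAMPLE))
```

Any name printed here would still need DNS (or a new hosts entry) to be
resolved.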

I also think corosync should have logged errors saying that it was unable
to resolve the hostnames to IP addresses.
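
Until then, a pre-flight check run by hand can at least produce the missing
error message. The sketch below is not part of corosync; it simply tries
to resolve each name with Python's socket.getaddrinfo and logs the ones
that fail. The logger name, function name and example hostnames are
assumptions, and the resolver is passed in as a parameter so the check can
be exercised without a working network.

```python
import logging
import socket

log = logging.getLogger("corosync-preflight")  # hypothetical logger name


def unresolved_names(names, resolve=socket.getaddrinfo):
    """Try to resolve each name; log and return the ones that fail."""
    failed = []
    for name in names:
        try:
            resolve(name, None)
        except socket.gaierror as exc:
            log.error("cannot resolve %s: %s", name, exc)
            failed.append(name)
    return failed


def fake_resolver(name, port):
    """Stand-in resolver for a dry run: pretends one name is unresolvable."""
    if name == "down.example":
        raise socket.gaierror("simulated lookup failure")
    return []


logging.basicConfig(level=logging.ERROR)
print(unresolved_names(["up.example", "down.example"], resolve=fake_resolver))
```

Calling unresolved_names() without the resolve argument uses the system
resolver, which is what you would want on a real node.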

Just a little piece of advice from me to you.

Proxmox rocks though!