[PVE-User] Cluster disaster

Thomas Lamprecht t.lamprecht at proxmox.com
Mon Nov 14 12:33:27 CET 2016



On 14.11.2016 11:50, Dhaussy Alexandre wrote:
>
> On 11/11/2016 at 19:43, Dietmar Maurer wrote:
>> On November 11, 2016 at 6:41 PM Dhaussy Alexandre
>> <ADhaussy at voyages-sncf.com> wrote:
>>>> you lost quorum, and the watchdog expired - that is how the watchdog
>>>> based fencing works.
>>> I don't expect to lose quorum when _one_ node joins or leaves the cluster.
>> This probably happened a long time before that - but I have not read through
>> the whole logs ...
> That makes no sense to me..
> The fact is: everything has been working fine for weeks.
>
>
> What I can see in the logs is: several cluster nodes suddenly
> rebooted, exactly one minute after one node joined and/or left
> the cluster.

The watchdog is set to a 60 second timeout. The cluster leave caused quorum
loss, or other problems did (you said you had multicast problems around that
time), so the LRM stopped updating the watchdog; one minute later the watchdog
reset all nodes which had left the quorate partition.
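
For illustration, the basic contract of a Linux watchdog device looks
roughly like this minimal C sketch (assuming the standard /dev/watchdog
node and the ioctls from linux/watchdog.h; this is not our actual code):

    #include <fcntl.h>
    #include <linux/watchdog.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/watchdog", O_WRONLY);
        if (fd < 0) {
            perror("open /dev/watchdog");
            return 1;
        }
        int timeout = 60;
        /* the hardware resets the node if nobody pets the dog for 60s */
        ioctl(fd, WDIOC_SETTIMEOUT, &timeout);
        for (;;) {
            ioctl(fd, WDIOC_KEEPALIVE, 0);  /* "pet" the dog */
            sleep(10);  /* stop doing this and the node resets a minute later */
        }
    }

If the petting process stops (or dies), the countdown simply runs out
and the hardware resets the node - nothing in user space can stop it
after the deadline passes.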

> I see no problems with corosync/lrm/crm before that.
> This leads me to a probable network (multicast) malfunction.
>
> I did a bit of homework, reading the wiki about ha-manager..
>
> What I understand so far is that every state/service change from the LRM
> must be acknowledged (cluster-wide) by the CRM master.

Yes and no. The LRM and CRM are two state machines with synced inputs,
but that holds mainly for human-triggered commands and the resulting
communication.
This means that commands like start, stop, and migrate may not go through
from the CRM to the LRM. Fencing and the like works nonetheless, else it
would be a major design flaw :)
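
As a toy illustration of the "synced inputs" idea (not actual ha-manager
code, which is Perl - just the general principle, sketched in C): two
state machines that consume the same ordered command stream end up in
the same state without ever talking to each other directly:

    #include <stdio.h>
    #include <string.h>

    enum svc_state { STOPPED, STARTED };

    /* deterministic transition: same input -> same next state */
    static enum svc_state apply(enum svc_state s, const char *cmd)
    {
        if (!strcmp(cmd, "start")) return STARTED;
        if (!strcmp(cmd, "stop"))  return STOPPED;
        return s;  /* unknown commands leave the state unchanged */
    }

    int main(void)
    {
        /* the synced input stream, e.g. replicated cluster-wide */
        const char *log[] = { "start", "stop", "start" };
        enum svc_state crm = STOPPED, lrm = STOPPED;
        for (unsigned i = 0; i < sizeof(log) / sizeof(log[0]); i++) {
            crm = apply(crm, log[i]);  /* the CRM's view */
            lrm = apply(lrm, log[i]);  /* the LRM's view: identical by construction */
        }
        printf("crm=%d lrm=%d\n", crm, lrm);  /* both end up STARTED */
        return 0;
    }

If the stream stalls (e.g. because quorum is lost), new commands simply
stop flowing - but that only blocks commands, not the local fencing logic.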

> So if a multicast disruption occurs, and I assume the LRM wouldn't be
> able to talk to the CRM master, then it also couldn't reset the
> watchdog, am I right?
>


No, the watchdog runs on each node and is independent of the CRM.
As a watchdog is normally not able to serve more than one client, we
wrote the watchdog-mux (multiplexer).
This is a very simple C program which opens the watchdog with a
60 second timeout and allows multiple clients (at the moment the CRM
and the LRM) to connect to it.
If a client does not reset the dog for about 10 seconds (IIRC), the
watchdog-mux disables updates on the real watchdog.
After that, a node reset will happen *when* the dog runs out of time,
not instantly.

So if the LRM cannot communicate (i.e. has no quorum), it will stop
updating the dog and thus trigger a reset, independent of what the CRM
says or does.
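
A heavily simplified sketch of that multiplexer logic in C (the socket
path here is made up, and the real watchdog-mux differs in details,
e.g. it distinguishes clean from unclean client disconnects):

    #include <fcntl.h>
    #include <linux/watchdog.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/select.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <time.h>
    #include <unistd.h>

    #define CLIENT_TIMEOUT 10  /* seconds a client may stay silent */
    #define MAX_CLIENTS     8

    int main(void)
    {
        int wd = open("/dev/watchdog", O_WRONLY);
        int timeout = 60;
        ioctl(wd, WDIOC_SETTIMEOUT, &timeout);

        int srv = socket(AF_UNIX, SOCK_STREAM, 0);
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        strncpy(addr.sun_path, "/run/wd-mux.sock", sizeof(addr.sun_path) - 1);
        unlink(addr.sun_path);
        bind(srv, (struct sockaddr *)&addr, sizeof(addr));
        listen(srv, MAX_CLIENTS);

        int cfd[MAX_CLIENTS];
        time_t last[MAX_CLIENTS];
        for (int i = 0; i < MAX_CLIENTS; i++)
            cfd[i] = -1;
        int expired = 0;

        for (;;) {
            fd_set rd;
            FD_ZERO(&rd);
            FD_SET(srv, &rd);
            int maxfd = srv;
            for (int i = 0; i < MAX_CLIENTS; i++) {
                if (cfd[i] < 0)
                    continue;
                FD_SET(cfd[i], &rd);
                if (cfd[i] > maxfd)
                    maxfd = cfd[i];
            }
            struct timeval tv = { 1, 0 };
            select(maxfd + 1, &rd, NULL, NULL, &tv);

            if (FD_ISSET(srv, &rd)) {  /* new client (CRM or LRM) */
                int fd = accept(srv, NULL, NULL);
                for (int i = 0; i < MAX_CLIENTS; i++) {
                    if (cfd[i] < 0) {
                        cfd[i] = fd;
                        last[i] = time(NULL);
                        break;
                    }
                }
            }

            for (int i = 0; i < MAX_CLIENTS; i++) {
                if (cfd[i] < 0)
                    continue;
                if (FD_ISSET(cfd[i], &rd)) {
                    char buf[64];
                    if (read(cfd[i], buf, sizeof(buf)) <= 0) {
                        close(cfd[i]);  /* client went away */
                        cfd[i] = -1;
                        continue;
                    }
                    last[i] = time(NULL);  /* client reset its slot */
                }
                if (time(NULL) - last[i] > CLIENT_TIMEOUT)
                    expired = 1;  /* stale client: stop petting for good */
            }

            if (!expired)
                ioctl(wd, WDIOC_KEEPALIVE, 0);
            /* once 'expired' is set, the real dog runs out ~60s after
             * the last keepalive and the node resets */
        }
    }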


> Another thing: I have checked my network configuration, and the cluster
> IP is set on a Linux bridge...
> By default multicast_snooping is set to 1 on a Linux bridge, so I think
> there's a good chance this is the source of my problems...
> Note that we don't use IGMP snooping; it is disabled on almost all
> network switches.
>

Yes, multicast snooping has to be configured properly (recommended) or else turned off on the switch.
That's stated in some wiki articles, various forum posts, and our docs, here:
http://pve.proxmox.com/pve-docs/chapter-pvecm.html#cluster-network-requirements
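
For the bridge side, you can disable snooping at interface bring-up in
/etc/network/interfaces, for example like this (a sketch - vmbr0 and
the addresses are placeholders for your actual setup):

    auto vmbr0
    iface vmbr0 inet static
            address 192.168.1.2
            netmask 255.255.255.0
            bridge_ports eth0
            bridge_stp off
            bridge_fd 0
            post-up echo 0 > /sys/class/net/vmbr0/bridge/multicast_snooping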

Hope that helps you understand it a bit better. :)

cheers,
Thomas

> Plus, I found a post by A. Derumier (yes, 3 years old..) where he had
> similar issues with a bridge and multicast.
> http://pve.proxmox.com/pipermail/pve-devel/2013-March/006678.html



