[PVE-User] Cluster doesn't recover automatically after blackout

Wed Aug 1 13:57:09 CEST 2018

On Wed, Aug 01, 2018 at 01:40:34PM +0200, Eneko Lacunza wrote:
> Hi Alwin,
> 
> On 01/08/18 at 12:56, Alwin Antreich wrote:
> > On Wed, Aug 01, 2018 at 11:02:18AM +0200, Eneko Lacunza wrote:
> > > Hi all,
> > > 
> > > This morning there was a quite long blackout which powered off a cluster of
> > > 3 proxmox 5.1 servers.
> > > 
> > > All 3 servers are the same make and model, so they need the same amount
> > > of time to boot.
> > > 
> > > When the power came back, servers started correctly but corosync couldn't
> > > set up a quorum. Events timing:
> > I recommend against servers returning automatically to their previous
> > power state after a power loss. A manual start-up is better, as by then
> > the admin will have made sure power is back to normal operation. This
> > also reduces the chance of breakage if there are subsequent power or
> > hardware failures.
> This is an off-site place with no knowledgeable sysadmins and servers don't
> have remote control cards. I'm sure they would screw the boot up  :)
> 
> I'm afraid we have to take the risk. :)
A boot delay, if the servers have such a setting, or switchable UPS power
plugs might help. :)

> > 
> > > 07:57:10 corosync start
> > > 07:57:15 first pmxcfs error quorum_initialize_failed: 2
> > > 07:57:52 network up
> > > 07:58:40 Corosync timeout
> > > 07:59:57 time sync works
> > > 
> > > What I can see is that the network switch booted more slowly than the
> > > servers, but nonetheless the network was operational about 45s before
> > > corosync gave up trying to set up a quorum.
> > > 
> > > I also can see that internet access wasn't back until 1 minute after
> > > corosync timeout (the time sync event).
> > > 
> > > A simple restart of pve-cluster at about 9:50 restored the cluster to normal
> > > state.
> > > 
> > > Is this expected? I expected that corosync would set up a quorum after
> > > network was operational....
> > When was multicast working again? That might have taken longer, as IGMP
> > snooping and the querier on the switch might just take longer to get
> > operating again.
> I don't have that info (or I don't know how to look that up in the logs;
> /var/log/corosync is empty). I'm trying to plan an intentional blackout to
> test things again with technicians onsite; we could get more info that day.
Corosync writes to the syslog; there should be more to find there.
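
As a rough way to pull the cluster-related entries out of the syslog and line
them up against the timeline above, something like the following Python sketch
could be used. It is only a sketch under assumptions: it expects the
traditional syslog line format ("Mon DD HH:MM:SS host tag[pid]: message"), the
function name cluster_events is made up for illustration, and the year has to
be supplied because that format does not record it.

```python
import re
from datetime import datetime

# Daemons named in this thread; extend the tuple if other tags are of interest.
TAGS = ("corosync", "pmxcfs")

# Assumed traditional syslog layout: "Aug  1 07:57:10 host tag[pid]: message"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<tag>[^\s:\[]+)(?:\[\d+\])?: (?P<msg>.*)$"
)


def cluster_events(lines, tags=TAGS, year=2018):
    """Yield (timestamp, tag, message) for syslog lines written by `tags`.

    `year` is an assumption: traditional syslog timestamps omit it.
    Lines that do not match the assumed layout are skipped silently.
    """
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match and match.group("tag") in tags:
            stamp = datetime.strptime(
                f"{year} {match.group('stamp')}", "%Y %b %d %H:%M:%S"
            )
            yield stamp, match.group("tag"), match.group("msg")


def show(path="/var/log/syslog"):
    """Print matching events from `path` (assumed default syslog location)."""
    with open(path, errors="replace") as handle:
        for stamp, tag, msg in cluster_events(handle):
            print(stamp.time(), tag, msg)
```

Calling show() on each node and comparing the timestamps should show when
corosync started, when it first saw the other members, and when it gave up.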

> 
> Switch is HPE 1820-24G J9980A, it's L2 but quite dumb; we have several 18x0
> switches deployed with good results so far.
The switch may hold a log that shows its startup process.