[PVE-User] Cluster doesn't recover automatically after blackout

Eneko Lacunza elacunza at binovo.es
Wed Aug 1 16:12:19 CEST 2018


Hi,

On 01/08/18 at 13:57, Alwin Antreich wrote:
> On Wed, Aug 01, 2018 at 01:40:34PM +0200, Eneko Lacunza wrote:
>> On 01/08/18 at 12:56, Alwin Antreich wrote:
>>> On Wed, Aug 01, 2018 at 11:02:18AM +0200, Eneko Lacunza wrote:
>>>> Hi all,
>>>>
>>>> This morning there was quite a long blackout which powered off a cluster of
>>>> 3 Proxmox 5.1 servers.
>>>>
>>>> All 3 servers are the same make and model, so they need the same amount of
>>>> time to boot.
>>>>
>>>> When the power came back, servers started correctly but corosync couldn't
>>>> set up a quorum. Events timing:
>>> I recommend against servers returning automatically to their previous
>>> power state after a power loss. A manual start up is better, as by then
>>> the admin has made sure power is back to normal operation. This will also
>>> reduce the chance of breakage if there are subsequent power or hardware
>>> failures.
>> This is an off-site place with no knowledgeable sysadmins and servers don't
>> have remote control cards. I'm sure they would screw the boot up  :)
>>
>> I'm afraid we have to take the risk. :)
> A boot delay, if the servers have such a setting, or switchable UPS power
> plugs might help. :)
Yes, I can do that at grub level, that's no problem. But first I have to
know the correct length for the delay ;)
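
One way to get a first estimate for that delay is to take the gap between corosync starting and the network coming up in the reported timeline (07:57:10 and 07:57:52). The Python sketch below only does that arithmetic. The 30-second margin is an arbitrary assumption, and the result is a lower bound at best, since the thread does not establish when multicast actually started working.

```python
from datetime import datetime

# Event times taken from the timeline in the original report (same morning).
COROSYNC_START = "07:57:10"
NETWORK_UP = "07:57:52"


def seconds_between(earlier: str, later: str) -> int:
    """Return the number of seconds from `earlier` to `later` (HH:MM:SS)."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return int(delta.total_seconds())


def minimum_boot_delay(margin_s: int = 30) -> int:
    """Gap between corosync start and network up, plus a safety margin.

    margin_s is a hypothetical value chosen for illustration; it is not
    taken from the thread and should be tuned after a real test.
    """
    return seconds_between(COROSYNC_START, NETWORK_UP) + margin_s


if __name__ == "__main__":
    print("network came up", seconds_between(COROSYNC_START, NETWORK_UP),
          "s after corosync started")
    print("suggested minimum boot delay:", minimum_boot_delay(), "s")
```

A planned test blackout with the switch log enabled would give a better figure than this single data point.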
>
>>>
>>>> 07:57:10 corosync start
>>>> 07:57:15 first pmxcfs error quorum_initialize_failed: 2
>>>> 07:57:52 network up
>>>> 07:58:40 Corosync timeout
>>>> 07:59:57 time sync works
>>>>
>>>> What I can see is that the network switch booted more slowly than the
>>>> servers, but nonetheless the network was operational about 45s before
>>>> corosync gave up trying to set up a quorum.
>>>>
>>>> I can also see that internet access wasn't back until 1 minute after
>>>> the corosync timeout (the time sync event).
>>>>
>>>> A simple restart of pve-cluster at about 9:50 restored the cluster to normal
>>>> state.
>>>>
>>>> Is this expected? I expected that corosync would set up a quorum after
>>>> network was operational....
>>> When was multicast working again? That might have taken longer, as IGMP
>>> snooping and the querier on the switch might just take longer to get
>>> operating again.
>> I don't have that info (or I don't know how to look that up in the logs;
>> /var/log/corosync is empty). I'm trying to plan an intentional blackout to
>> test things again with technicians onsite, so we could get more info that day.
> Corosync writes into the syslog, there should be more to find.
There doesn't seem to be any more, as far as I can see:
# grep corosync /var/log/syslog
Aug  1 07:57:11 proxmox1 corosync[1697]:  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Aug  1 07:57:11 proxmox1 corosync[1697]: notice  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Aug  1 07:57:11 proxmox1 corosync[1697]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Aug  1 07:57:11 proxmox1 corosync[1697]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Aug  1 07:58:40 proxmox1 systemd[1]: corosync.service: Start operation timed out. Terminating.
Aug  1 07:58:40 proxmox1 systemd[1]: corosync.service: Unit entered failed state.
Aug  1 07:58:40 proxmox1 systemd[1]: corosync.service: Failed with result 'timeout'.
Aug  1 09:51:35 proxmox1 corosync[32220]:  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.

This last line is our manual pve-cluster restart.
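
The gaps between these events can be read straight off the timestamps. The Python sketch below does so for three of the lines from the grep output above, unwrapped. The year 2018 is taken from the date of this mail, since syslog timestamps carry none, and `%b` parsing assumes an English or C locale. The first gap comes out at 89 seconds, which is close to the 90-second start timeout that systemd applies by default; whether this unit overrides that value is not shown in the thread.

```python
from datetime import datetime
from typing import List

# Three lines copied from the grep output above, with the mail wrapping undone.
LOG = """\
Aug  1 07:57:11 proxmox1 corosync[1697]:  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Aug  1 07:58:40 proxmox1 systemd[1]: corosync.service: Start operation timed out. Terminating.
Aug  1 09:51:35 proxmox1 corosync[32220]:  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
"""


def parse_time(line: str, year: int = 2018) -> datetime:
    """Parse the leading syslog timestamp of a line.

    The year is an assumption supplied by the caller, because the
    traditional syslog format does not record it.
    """
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}",
                             "%Y %b %d %H:%M:%S")


def gaps_in_seconds(log: str) -> List[int]:
    """Return the seconds elapsed between each pair of consecutive lines."""
    times = [parse_time(line) for line in log.splitlines() if line.strip()]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    start_to_timeout, timeout_to_restart = gaps_in_seconds(LOG)
    print("corosync start to systemd timeout:", start_to_timeout, "s")
    print("timeout to manual restart:", timeout_to_restart, "s")
```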

>
>> The switch is an HPE 1820-24G J9980A; it's L2 but quite dumb. We have several
>> 18x0 switches deployed with good results so far.
> The switch may hold a log that shows its startup process.
It seems logging was disabled; we have now enabled it.

Thanks
Eneko

-- 
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es



