[PVE-User] Cluster doesn't recover automatically after blackout

Tue Aug 7 15:54:02 CEST 2018

Just a bit of a rant: I think the presented solutions are the wrong
approach to this problem. An HA cluster should recover automatically from
such simple failures (power loss, network outage) to achieve HA. If
manual intervention is necessary, then the whole thing should not
be called an "HA" cluster.

I know corosync is picky and, for example, does not start when a
configured network interface is not available yet. Hence, corosync
should be restarted automatically if it fails.

Introducing a "sleep" until the network is available is also a dirty
workaround. The problem is the cluster software: it should
keep trying to form a cluster endlessly (why should "HA" software give up?).
Would a Mars rover give up and shut down if it could not ping Earth
for some days? Probably not, it would keep trying endlessly.
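
To illustrate what "try endlessly" could look like, here is a minimal
sketch of a retry loop with a capped backoff. It is not how corosync
works and not a proposed patch. The function name retry_forever and its
parameters are made up for this example, and attempt stands in for
whatever "try to form the cluster" step would apply.

```python
import itertools
import time


def retry_forever(attempt, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call attempt() until it returns True; never give up.

    Hypothetical helper for illustration only. The wait doubles after
    each failed attempt and is capped at max_delay, so a long outage
    does not turn into a busy loop. Returns the number of attempts made.
    """
    delay = base_delay
    for tries in itertools.count(1):
        if attempt():
            return tries
        sleep(delay)                        # wait before the next attempt
        delay = min(delay * 2, max_delay)   # exponential backoff with a cap
```

The point of the cap is that the loop keeps retrying at a steady pace
however long the outage lasts, instead of stopping after a fixed timeout.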

regards
Klaus

On 01.08.2018 at 11:02, Eneko Lacunza wrote:
> Hi all,
> 
> This morning there was quite a long blackout which powered off a cluster
> of 3 Proxmox 5.1 servers.
> 
> All 3 servers are the same make and model, so they need the same amount of
> time to boot.
> 
> When the power came back, the servers started correctly, but corosync
> couldn't set up a quorum. Event timing:
> 
> 07:57:10 corosync start
> 07:57:15 first pmxcfs error quorum_initialize_failed: 2
> 07:57:52 network up
> 07:58:40 Corosync timeout
> 07:59:57 time sync works
> 
> What I can see is that the network switch booted more slowly than the
> servers, but the network was nonetheless operational about 45s before
> corosync gave up trying to set up a quorum.
> 
> I can also see that internet access wasn't back until 1 minute after
> the corosync timeout (the time sync event).
> 
> A simple restart of pve-cluster at about 9:50 restored the cluster to
> its normal state.
> 
> Is this expected? I expected that corosync would set up a quorum once
> the network was operational.
> 
> # pveversion -v
> proxmox-ve: 5.1-41 (running kernel: 4.13.13-6-pve)
> pve-manager: 5.1-46 (running version: 5.1-46/ae8241d4)
> pve-kernel-4.13.13-6-pve: 4.13.13-41
> pve-kernel-4.13.13-2-pve: 4.13.13-33
> ceph: 12.2.2-pve1
> corosync: 2.4.2-pve3
> criu: 2.11.1-1~bpo90
> glusterfs-client: 3.8.8-1
> ksm-control-daemon: 1.2-2
> libjs-extjs: 6.0.1-2
> libpve-access-control: 5.0-8
> libpve-common-perl: 5.0-28
> libpve-guest-common-perl: 2.0-14
> libpve-http-server-perl: 2.0-8
> libpve-storage-perl: 5.0-17
> libqb0: 1.0.1-1
> lvm2: 2.02.168-pve6
> lxc-pve: 2.1.1-2
> lxcfs: 2.0.8-2
> novnc-pve: 0.6-4
> proxmox-widget-toolkit: 1.0-11
> pve-cluster: 5.0-20
> pve-container: 2.0-19
> pve-docs: 5.1-16
> pve-firewall: 3.0-5
> pve-firmware: 2.0-3
> pve-ha-manager: 2.0-5
> pve-i18n: 1.0-4
> pve-libspice-server1: 0.12.8-3
> pve-qemu-kvm: 2.9.1-9
> pve-xtermjs: 1.0-2
> qemu-server: 5.0-22
> smartmontools: 6.5+svn4324-1
> spiceterm: 3.0-5
> vncterm: 1.5-3
> zfsutils-linux: 0.7.6-pve1~bpo9
> 
> Thanks a lot
> Eneko
>