[PVE-User] node not rebooted after corosync crash

Josh Knight josh at noobbox.com
Fri Aug 17 16:30:34 CEST 2018


The ipmi_watchdog is a hardware watchdog which the OS pokes to keep happy.
If the OS hangs/crashes and therefore fails to poke it, then the IPMI
watchdog will reset the system.  It will not catch the case of an
individual daemon/process, like corosync, hanging/crashing on the system.
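
If you want to see what is actually feeding the watchdog on a node, and
whether HA fencing is armed at all, something along these lines should work
(a rough sketch using standard Proxmox VE tooling; adjust to your setup):

-------------->8=========

# which process holds the hardware watchdog device open
lsof /dev/watchdog

# the Proxmox watchdog multiplexer should be the one feeding it
systemctl status watchdog-mux

# as far as I know, self-fencing is only armed while the node runs
# active HA resources
ha-manager status

-------------->8=========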

On Wed, Aug 15, 2018 at 4:41 AM Dmitry Petuhov <mityapetuhov at gmail.com>
wrote:

> A week ago, corosync suddenly crashed on one of my PVE nodes.
>
> -------------->8=========
> corosync[4701]: error   [TOTEM ] FAILED TO RECEIVE
> corosync[4701]:  [TOTEM ] FAILED TO RECEIVE
> corosync[4701]: notice  [TOTEM ] A new membership (10.19.92.53:1992) was
> formed. Members left: 1 2 4
> corosync[4701]: notice  [TOTEM ] Failed to receive the leave message.
> failed: 1 2 4
> corosync[4701]:  [TOTEM ] A new membership (10.19.92.53:1992) was
> formed. Members left: 1 2 4
> corosync[4701]:  [TOTEM ] Failed to receive the leave message. failed: 1 2
> 4
> corosync[4701]: notice  [QUORUM] This node is within the non-primary
> component and will NOT provide any services.
> corosync[4701]: notice  [QUORUM] Members[1]: 3
> corosync[4701]: notice  [MAIN  ] Completed service synchronization,
> ready to provide service.
> corosync[4701]:  [QUORUM] This node is within the non-primary component
> and will NOT provide any services.
> corosync[4701]:  [QUORUM] Members[1]: 3
> corosync[4701]:  [MAIN  ] Completed service synchronization, ready to
> provide service.
> kernel: [29187555.500409] dlm: closing connection to node 2
> corosync[4701]: notice  [TOTEM ] A new membership (10.19.92.51:2000) was
> formed. Members joined: 1 2 4
> corosync[4701]:  [TOTEM ] A new membership (10.19.92.51:2000) was
> formed. Members joined: 1 2 4
> corosync[4701]: notice  [QUORUM] This node is within the primary
> component and will provide service.
> corosync[4701]: notice  [QUORUM] Members[4]: 1 2 3 4
> corosync[4701]: notice  [MAIN  ] Completed service synchronization,
> ready to provide service.
> corosync[4701]:  [QUORUM] This node is within the primary component and
> will provide service.
> corosync[4701]: notice  [CFG   ] Killed by node 1: dlm_controld
> corosync[4701]: error   [MAIN  ] Corosync Cluster Engine exiting with
> status -1 at cfg.c:530.
> corosync[4701]:  [QUORUM] Members[4]: 1 2 3 4
> corosync[4701]:  [MAIN  ] Completed service synchronization, ready to
> provide service.
> dlm_controld[688]: 29187298 daemon node 4 stateful merge
> dlm_controld[688]: 29187298 receive_start 4:6 add node with started_count 2
> dlm_controld[688]: 29187298 daemon node 1 stateful merge
> dlm_controld[688]: 29187298 receive_start 1:5 add node with started_count 4
> dlm_controld[688]: 29187298 daemon node 2 stateful merge
> dlm_controld[688]: 29187298 receive_start 2:17 add node with
> started_count 13
> corosync[4701]:  [CFG   ] Killed by node 1: dlm_controld
> corosync[4701]:  [MAIN  ] Corosync Cluster Engine exiting with status -1
> at cfg.c:530.
> dlm_controld[688]: 29187298 cpg_dispatch error 2
> dlm_controld[688]: 29187298 process_cluster_cfg cfg_dispatch 2
> dlm_controld[688]: 29187298 cluster is down, exiting
> dlm_controld[688]: 29187298 process_cluster quorum_dispatch 2
> dlm_controld[688]: 29187298 daemon cpg_dispatch error 2
> systemd[1]: corosync.service: Main process exited, code=exited,
> status=255/n/a
> systemd[1]: corosync.service: Unit entered failed state.
> systemd[1]: corosync.service: Failed with result 'exit-code'.
> kernel: [29187556.903177] dlm: closing connection to node 4
> kernel: [29187556.906730] dlm: closing connection to node 3
> dlm_controld[688]: 29187298 abandoned lockspace hp-big-gfs
> kernel: [29187556.924279] dlm: dlm user daemon left 1 lockspaces
> -------------->8=========
>
>
> But the node did not reboot.
>
> I use WATCHDOG_MODULE=ipmi_watchdog. The watchdog is still running:
>
>
> -------------->8=========
>
> # ipmitool mc watchdog get
> Watchdog Timer Use:     SMS/OS (0x44)
> Watchdog Timer Is:      Started/Running
> Watchdog Timer Actions: Hard Reset (0x01)
> Pre-timeout interval:   0 seconds
> Timer Expiration Flags: 0x10
> Initial Countdown:      10 sec
> Present Countdown:      9 sec
>
> -------------->8=========
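>
> To confirm that the timer is really still being fed, one can watch the
> countdown resetting itself, e.g.:
>
> -------------->8=========
>
> # "Present Countdown" should keep jumping back up while the timer is fed
> watch -n 1 'ipmitool mc watchdog get | grep Countdown'
>
> -------------->8=========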
>
>
> The only service that is down is corosync.
>
>
> -------------->8=========
>
> # pveversion --verbose
> proxmox-ve: 5.0-21 (running kernel: 4.10.17-2-pve)
> pve-manager: 5.0-31 (running version: 5.0-31/27769b1f)
> pve-kernel-4.10.17-2-pve: 4.10.17-20
> pve-kernel-4.10.17-3-pve: 4.10.17-21
> libpve-http-server-perl: 2.0-6
> lvm2: 2.02.168-pve3
> corosync: 2.4.2-pve3
> libqb0: 1.0.1-1
> pve-cluster: 5.0-12
> qemu-server: 5.0-15
> pve-firmware: 2.0-2
> libpve-common-perl: 5.0-16
> libpve-guest-common-perl: 2.0-11
> libpve-access-control: 5.0-6
> libpve-storage-perl: 5.0-14
> pve-libspice-server1: 0.12.8-3
> vncterm: 1.5-2
> pve-docs: 5.0-9
> pve-qemu-kvm: 2.9.0-5
> pve-container: 2.0-15
> pve-firewall: 3.0-2
> pve-ha-manager: 2.0-2
> ksm-control-daemon: 1.2-2
> glusterfs-client: 3.8.8-1
> lxc-pve: 2.0.8-3
> lxcfs: 2.0.7-pve4
> criu: 2.11.1-1~bpo90
> novnc-pve: 0.6-4
> smartmontools: 6.5+svn4324-1
> zfsutils-linux: 0.6.5.11-pve17~bpo90
> gfs2-utils: 3.1.9-2
> openvswitch-switch: 2.7.0-2
> ceph: 12.2.0-pve1
>
> -------------->8=========
>
>
> I also have GFS2 in this cluster, which did not stop working after the
> corosync crash (which scares me most).
>
>
> Shouldn't the node reboot when corosync fails, and why can it still run?
> Or does the node need HA VMs for a reboot to happen, and will it just
> stay as it is when only regular autostarted VMs and no HA machines are
> present?
>
>
> _______________________________________________
> pve-user mailing list
> pve-user at pve.proxmox.com
> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>


