[PVE-User] DLM bug and quorum device

Thu Jul 12 15:09:43 CEST 2012

Hi,

We are building a two-node cluster (proxmoxdev1 and proxmoxdev2) with:
- LVMed iSCSI as shared storage
- Dell BMC IPMI card as fencing devices
- An iSCSI quorum disk

Each server has two NICs: one for the storage network (iSCSI) and one for
user access and cluster communication (these will be separated with a third
NIC in the future).

Software versions used:
pve-manager: 2.1-1 (pve-manager/2.1/f9b0f63a)
running kernel: 2.6.32-12-pve
proxmox-ve-2.6.32: 2.1-68
pve-kernel-2.6.32-10-pve: 2.6.32-63
pve-kernel-2.6.32-12-pve: 2.6.32-68
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-39
pve-firmware: 1.0-16
libpve-common-perl: 1.0-27
libpve-access-control: 1.0-21
libpve-storage-perl: 2.0-18
vncterm: 1.0-2
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1

All nodes are quorate and live migration works. Now let's run this scenario:
- Unplug the user access NIC on proxmoxdev2
- The heuristic checks fail, proxmoxdev2 is fenced, and resources restart on
proxmoxdev1
- proxmoxdev2 restarts and does NOT become quorate. This is expected, since
the NIC is still unplugged.
- Replug the NIC and check the logs (detail lines have been removed):

Jul 12 14:02:57 proxmoxdev2 kernel: ADDRCONF(NETDEV_CHANGE): eth1: link
becomes ready
Jul 12 14:03:08 proxmoxdev2 corosync[1589]:   [CLM   ] CLM CONFIGURATION
CHANGE
Jul 12 14:03:08 proxmoxdev2 corosync[1589]:   [TOTEM ] A processor joined
or left the membership and a new membership was formed.
Jul 12 14:03:28 proxmoxdev2 pmxcfs[1473]: [status] notice: node has quorum
Jul 12 14:03:28 proxmoxdev2 corosync[1589]:   [MAIN  ] Completed service
synchronization, ready to provide service.
Jul 12 14:03:28 proxmoxdev2 pmxcfs[1473]: [dcdb] notice: all data is up to
date
Jul 12 14:03:29 proxmoxdev2 rgmanager[1997]: Quorum formed
Jul 12 14:03:29 proxmoxdev2 kernel: dlm: no local IP address has been set
Jul 12 14:03:29 proxmoxdev2 kernel: dlm: cannot start dlm lowcomms -107
Jul 12 14:03:31 proxmoxdev2 corosync[1589]:   [QUORUM] Members[2]: 1 2

The "kernel: dlm" error lines seem to refer to a known bug already fixed by
Red Hat (rhbz#688154 and rhbz#679274).
Apparently, a bad timer check in qdiskd breaks the votes for quorum.

Here's a diff from Red Hat:
https://www.redhat.com/archives/cluster-devel/2011-March/msg00074.html
Another link: http://comments.gmane.org/gmane.linux.redhat.cluster/19598
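
For anyone wanting to check their own syslog for the same signature, here is
a minimal Python sketch. It only looks for the two "kernel: dlm" messages
quoted in the log above; the function name find_dlm_failures and the
line-rejoining logic are my own assumptions for illustration, not part of any
Proxmox or Red Hat tool.

```python
import re

# The two kernel messages quoted in the log excerpt above.
DLM_ERRORS = (
    "dlm: no local IP address has been set",
    "dlm: cannot start dlm lowcomms",
)

# A syslog entry starts with a timestamp such as "Jul 12 14:03:29 ".
STAMP = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} ")


def find_dlm_failures(log_text):
    """Return (hostname, message) pairs for dlm startup errors (hypothetical helper)."""
    # Rejoin entries that were hard-wrapped (e.g. by a mail client):
    # a line without a leading timestamp continues the previous entry.
    entries = []
    for line in log_text.splitlines():
        if STAMP.match(line) or not entries:
            entries.append(line.strip())
        else:
            entries[-1] += " " + line.strip()

    hits = []
    for entry in entries:
        for needle in DLM_ERRORS:
            if needle in entry:
                host = entry.split()[3]  # month, day, time, hostname
                hits.append((host, entry[entry.index(needle):]))
    return hits
```

Feeding it the excerpt above should report both dlm lines for proxmoxdev2. It
says nothing about the cause; it only confirms that the symptom is present.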

No services (pvevm) are shown and rgmanager is not running on proxmoxdev2.
Running clustat on proxmoxdev2 returns:

Member Status: Quorate
Member Name     ID   Status
--------------------------------------------
proxmoxdev1       1     Online
proxmoxdev2       2     Online, Local
/dev/block/8:17    0     Online, Quorum Disk

Running clustat on proxmoxdev1 returns:

Member Status: Quorate
Member Name     ID   Status
--------------------------------------------
proxmoxdev1     1 Online, Local, rgmanager
proxmoxdev2     2 Online
/dev/block/8:17  0 Online, Quorum Disk

Service Name     Owner (Last)      State
----------------------------------------------------------
pvevm:100          proxmoxdev1     started
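
The difference between the two outputs is the missing "rgmanager" flag on
proxmoxdev2. A rough Python sketch for spotting that condition in clustat
text follows. It assumes the column layout shown above (name, ID, then a
comma-separated status), and the function name members_without_rgmanager is
made up for this example.

```python
def members_without_rgmanager(clustat_text):
    """List members that are Online but lack the rgmanager flag (hypothetical helper)."""
    missing = []
    in_members = False
    for line in clustat_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Member Name"):
            in_members = True       # member table starts after this header
            continue
        if stripped.startswith("Service Name"):
            break                   # service table follows; stop here
        if not in_members or not stripped or stripped.startswith("-"):
            continue
        parts = stripped.split(None, 2)  # name, ID, rest of line = status
        if len(parts) < 3:
            continue
        name, _node_id, status = parts
        flags = [flag.strip() for flag in status.split(",")]
        if "Quorum Disk" in flags:
            continue                # the quorum disk never runs rgmanager
        if "Online" in flags and "rgmanager" not in flags:
            missing.append(name)
    return missing
```

Run against the proxmoxdev1 output it returns only proxmoxdev2. Run against
the proxmoxdev2 output it returns both nodes, since no rgmanager flag is
shown there at all, so the view from the healthy node is the more useful one.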

The only way to get back to a fully functional two-node cluster is to
manually restart proxmoxdev2 AFTER the NIC has been replugged.

Is it really the same bug as the Red Hat one, and is there a workaround in
Proxmox?
Thanks