[PVE-User] TASK ERROR: cluster not ready - no quorum?

Shain Miley smiley at npr.org
Mon Mar 9 20:04:06 CET 2015


OK... after some testing, it seems like the new 3.4 servers are dropping
(or at least not receiving) multicast packets.

Here is a test between two 3.4 proxmox servers:

root@proxmox3:~# asmping 224.0.2.1 proxmox1.npr.org
asmping joined (S,G) = (*,224.0.2.234)
pinging 172.31.2.141 from 172.31.2.33
   unicast from 172.31.2.141, seq=1 dist=0 time=1.592 ms
   unicast from 172.31.2.141, seq=2 dist=0 time=0.163 ms
   unicast from 172.31.2.141, seq=3 dist=0 time=0.136 ms
   unicast from 172.31.2.141, seq=4 dist=0 time=0.117 ms
........

--- 172.31.2.141 statistics ---
11 packets transmitted, time 10702 ms
unicast:
    11 packets received, 0% packet loss
    rtt min/avg/max/std-dev = 0.107/0.261/1.592/0.421 ms
multicast:
    0 packets received, 100% packet loss



And here is the same test between two other servers (Ubuntu and Debian)
connected to the same set of switches as the servers above:

root@test2:~# asmping 224.0.2.1 testserver1.npr.org
asmping joined (S,G) = (*,224.0.2.234)
pinging 172.31.2.125 from 172.31.2.131
multicast from 172.31.2.125, seq=1 dist=0 time=0.203 ms
   unicast from 172.31.2.125, seq=1 dist=0 time=0.322 ms
   unicast from 172.31.2.125, seq=2 dist=0 time=0.143 ms
multicast from 172.31.2.125, seq=2 dist=0 time=0.150 ms
   unicast from 172.31.2.125, seq=3 dist=0 time=0.138 ms
multicast from 172.31.2.125, seq=3 dist=0 time=0.146 ms
   unicast from 172.31.2.125, seq=4 dist=0 time=0.122 ms
.........

--- 172.31.2.125 statistics ---
9 packets transmitted, time 8115 ms
unicast:
    9 packets received, 0% packet loss
    rtt min/avg/max/std-dev = 0.114/0.150/0.322/0.061 ms
multicast:
    9 packets received, 0% packet loss since first mc packet (seq 1) recvd
    rtt min/avg/max/std-dev = 0.118/0.142/0.203/0.026 ms

As you can see, multicast works fine there.
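
One way to narrow down where the packets are being lost (just a sketch;
I am assuming eth0, or vmbr0 on the Proxmox nodes, is the interface
carrying the cluster traffic, and the 224.0.2.234 group from the test
above):

# on a receiving Proxmox node, while asmping runs on the sender
tcpdump -ni eth0 host 224.0.2.234

If the multicast frames show up in tcpdump but asmping still reports 100%
loss, the host itself is dropping them; if nothing shows up at all, the
switch is not forwarding them to that port.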


All servers are running 2.6.32 kernels, though not all the same build
(2.6.32-23-pve through 2.6.32-37-pve).
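
One thing worth ruling out on the Proxmox side is IGMP snooping on the
Linux bridge: if the bridge filters multicast and there is no IGMP querier
on the VLAN, group traffic can get dropped silently. A rough check
(assuming the cluster link is bridged as vmbr0; these RHEL-based 2.6.32
kernels should expose the setting):

# 1 = snooping enabled, 0 = disabled
cat /sys/class/net/vmbr0/bridge/multicast_snooping

# temporarily disable snooping on the bridge as a test
echo 0 > /sys/class/net/vmbr0/bridge/multicast_snooping

The same logic applies on the switch side: if the Cisco switches do IGMP
snooping, an active IGMP querier is needed somewhere on the VLAN.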

Anyone have any suggestions as to why the Proxmox servers are not seeing 
the multicast traffic?

Thanks,

Shain

On 3/9/15 12:33 PM, Shain Miley wrote:
> I am looking into the possibility that there is a multicast issue here,
> as I am unable to ping any of the multicast IP addresses from any of the
> nodes.
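>
> A quick sanity check (just a sketch; the corosync multicast group is
> whatever cluster.conf specifies, typically 239.192.x.x with cman) would
> be to confirm each node has actually joined its group:
>
> netstat -gn              # lists joined multicast groups per interface
> ip maddr show dev vmbr0  # same info via iproute2 (assuming vmbr0 here)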
>
> I have reached out to Cisco support for some additional help.
>
> I will let you know what I find out.
>
> Thanks again,
>
> Shain
>
>
> On 3/9/15 11:54 AM, Eneko Lacunza wrote:
>> It seems something happened yesterday at 20:40:53:
>>
>> Mar 08 20:40:53 corosync [TOTEM ] FAILED TO RECEIVE
>> Mar 08 20:41:05 corosync [CLM   ] CLM CONFIGURATION CHANGE
>> Mar 08 20:41:05 corosync [CLM   ] New Configuration:
>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.48)
>> Mar 08 20:41:05 corosync [CLM   ] Members Left:
>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.16)
>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.33)
>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.49)
>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.50)
>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.69)
>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.75)
>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.77)
>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.87)
>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.141)
>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.142)
>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.161)
>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.163)
>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.165)
>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.215)
>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.216)
>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.219)
>> Mar 08 20:41:05 corosync [CLM   ] Members Joined:
>> Mar 08 20:41:05 corosync [QUORUM] Members[16]: 1 2 4 5 6 7 8 10 11 12 
>> 13 14 15 16 17 19
>> Mar 08 20:41:05 corosync [QUORUM] Members[15]: 1 2 4 5 6 7 8 11 12 13 
>> 14 15 16 17 19
>> Mar 08 20:41:05 corosync [QUORUM] Members[14]: 1 2 4 5 6 7 8 11 12 14 
>> 15 16 17 19
>> Mar 08 20:41:05 corosync [QUORUM] Members[13]: 1 2 4 5 6 7 8 11 12 15 
>> 16 17 19
>> Mar 08 20:41:05 corosync [QUORUM] Members[12]: 1 2 4 5 6 7 8 11 12 15 
>> 17 19
>> Mar 08 20:41:05 corosync [QUORUM] Members[11]: 1 2 4 5 6 7 8 11 12 15 17
>> Mar 08 20:41:05 corosync [QUORUM] Members[10]: 1 2 4 5 6 7 8 11 12 17
>> Mar 08 20:41:05 corosync [CMAN  ] quorum lost, blocking activity
>> Mar 08 20:41:05 corosync [QUORUM] This node is within the non-primary 
>> component and will NOT provide any services.
>> Mar 08 20:41:05 corosync [QUORUM] Members[9]: 1 2 5 6 7 8 11 12 17
>> Mar 08 20:41:05 corosync [QUORUM] Members[8]: 1 2 5 6 7 11 12 17
>> Mar 08 20:41:05 corosync [QUORUM] Members[7]: 1 2 5 6 7 12 17
>> Mar 08 20:41:05 corosync [QUORUM] Members[6]: 1 2 6 7 12 17
>> Mar 08 20:41:05 corosync [QUORUM] Members[5]: 1 2 7 12 17
>> Mar 08 20:41:05 corosync [QUORUM] Members[4]: 1 2 12 17
>> Mar 08 20:41:05 corosync [QUORUM] Members[3]: 1 12 17
>> Mar 08 20:41:05 corosync [QUORUM] Members[2]: 1 12
>> Mar 08 20:41:05 corosync [QUORUM] Members[1]: 12
>> Mar 08 20:41:05 corosync [CLM   ] CLM CONFIGURATION CHANGE
>> Mar 08 20:41:05 corosync [CLM   ] New Configuration:
>> Mar 08 20:41:05 corosync [CLM   ]     r(0) ip(172.31.2.48)
>> Mar 08 20:41:05 corosync [CLM   ] Members Left:
>> Mar 08 20:41:05 corosync [CLM   ] Members Joined:
>> Mar 08 20:41:05 corosync [TOTEM ] A processor joined or left the 
>> membership and a new membership was formed.
>> Mar 08 20:41:05 corosync [CPG   ] chosen downlist: sender r(0) 
>> ip(172.31.2.48) ; members(old:17 left:16)
>> Mar 08 20:41:05 corosync [MAIN  ] Completed service synchronization, 
>> ready to provide service
>>
>> Is the "pvecm nodes" output similar on all nodes?
>>
>> I don't have experience troubleshooting corosync, but it seems you
>> have to re-establish the corosync cluster and quorum.
>>
>> Check "corosync-quorumtool -l -i". The cman_tool command is also useful
>> for diagnosing the cluster.
>>
>> Is the corosync service loaded and running? Does restarting it
>> (service cman restart) change anything?
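>>
>> A sketch of what I would run (on PVE 3.x with the cman/corosync stack;
>> adjust service names to your setup):
>>
>> cman_tool status             # quorum state, expected vs. total votes
>> cman_tool nodes              # membership as cman sees it
>> service cman restart         # restart the membership layer
>> service pve-cluster restart  # then restart pmxcfs on top of it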
>>
>>
>>
>> On 09/03/15 16:13, Shain Miley wrote:
>>> Oddly enough, there is nothing in the latest corosync logfile...
>>> however, the one from last night (when we started seeing the problem)
>>> has a lot of info in it.
>>>
>>> Here is the link to the entire file:
>>>
>>> http://717b5bb5f6a032ce28eb-fa7f03050c118691fd4b41bf00a93863.r71.cf1.rackcdn.com/corosync.log.1
>>>
>>> Thanks again for your help so far.
>>>
>>> Shain
>>>
>>> On 3/9/15 10:53 AM, Eneko Lacunza wrote:
>>>> What about /var/log/cluster/corosync.log ?
>>>>
>>>> On 09/03/15 15:34, Shain Miley wrote:
>>>>> Yes,
>>>>>
>>>>> All the nodes are pingable and resolvable via their hostnames.
>>>>>
>>>>> Here is the output of 'pvecm nodes':
>>>>>
>>>>>
>>>>> root@proxmox13:~# pvecm nodes
>>>>> Node  Sts   Inc   Joined               Name
>>>>>    1   X    964                        proxmox22
>>>>>    2   X    964                        proxmox23
>>>>>    3   X    756                        proxmox24
>>>>>    4   X    808                        proxmox18
>>>>>    5   X    964                        proxmox19
>>>>>    6   X    964                        proxmox20
>>>>>    7   X    964                        proxmox21
>>>>>    8   X    964                        proxmox1
>>>>>    9   X      0                        proxmox2
>>>>>   10   X    756                        proxmox3
>>>>>   11   X    964                        proxmox4
>>>>>   12   M    696   2014-10-20 01:10:09  proxmox13
>>>>>   13   X    904                        proxmox14
>>>>>   14   X    848                        proxmox15
>>>>>   15   X    856                        proxmox16
>>>>>   16   X    836                        proxmox17
>>>>>   17   X    964                        proxmox25
>>>>>   18   X    960                        proxmox26
>>>>>   19   X    868                        proxmox28
>>>>>
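>>>>> I can also grab the output of 'pvecm status' from the same node if it
>>>>> helps; it shows expected vs. total votes and whether quorum is held:
>>>>>
>>>>> pvecm status
>>>>>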
>>>>> Thanks,
>>>>>
>>>>> Shain
>>>>>
>>>>> On 3/9/15 10:23 AM, Eneko Lacunza wrote:
>>>>>> pvecm nodes
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>


-- 
NPR | Shain Miley | Manager of Systems and Infrastructure, Digital Media
| smiley at npr.org | p: 202-513-3649