[pve-devel] corosync problems - need help

Alexandre DERUMIER aderumier at odiso.com
Wed Sep 17 08:11:06 CEST 2014


One last thing I haven't tested yet

is updating libqb, which is really old on wheezy (0.11).

The latest version is 0.17,

and I have seen bugs related to corosync hanging because of libqb:

https://bugs.launchpad.net/ubuntu/+source/libqb/+bug/1341496


I'll try to backport the package from Debian sid.


----- Original Message -----

From: "Alexandre DERUMIER" <aderumier at odiso.com>
To: "Dietmar Maurer" <dietmar at proxmox.com>
Cc: pve-devel at pve.proxmox.com
Sent: Tuesday, 16 September 2014 23:56:09
Subject: Re: [pve-devel] corosync problems - need help

Some news:

I finally stopped and restarted the node (and had to shut down the VM too :( ),

and it finally joined the cluster correctly.


So I really don't know what could have been hanging... Damn...


BTW, have you already had a look at corosync2 + pacemaker? (It seems that this is the supported model in RHEL 7.)

I know that pacemaker replaces rgmanager; I don't know whether corosync2 would require a lot of changes in pmxcfs.



----- Original Message -----

From: "Alexandre DERUMIER" <aderumier at odiso.com>
To: "Dietmar Maurer" <dietmar at proxmox.com>
Cc: pve-devel at pve.proxmox.com
Sent: Tuesday, 16 September 2014 08:33:56
Subject: Re: [pve-devel] corosync problems - need help

>>First, int is 32 bits. Second, integer overflow does not raise an exception in C.
>>So that cannot be the reason.

OK, sorry. (I thought of this because in the log I saw the counter increment up to around 65000, then nothing more was logged.)


What I did yesterday:

- updated all nodes to the 3.10 kernel
- upgraded openvswitch to 2.3.0 (I had seen a high-CPU bug, and 2.3 fixes it).


But it didn't help.

I was able to bring this node back into the cluster for around 5 minutes, then it began to hang again.


Today, I'll try to shut down corosync on all servers,

then start corosync on this node and join the other nodes.

(I want to be sure that it's not because I have 2 more nodes in my cluster.)


I'll keep you posted.

----- Original Message -----

From: "Dietmar Maurer" <dietmar at proxmox.com>
To: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: pve-devel at pve.proxmox.com
Sent: Tuesday, 16 September 2014 07:51:07
Subject: RE: [pve-devel] corosync problems - need help

> with retry around 65000 (16bits) 
> 
> 
> 
> and 
> int retries = 0;
> result = cpg_join(dfsm->cpg_handle, &dfsm->cpg_group_name);
> if (result == CPG_ERR_TRY_AGAIN) {
>         nanosleep(&tvreq, NULL);
>         ++retries;
>         if ((retries % 10) == 0)
>                 cfs_dom_message(dfsm->log_domain, "cpg_join retry %d",
>                                 retries);
>         goto loop;
> }
> 
> 
> Could it be related to the integer type of retries?

First, int is 32 bits. Second, integer overflow does not raise an exception in C.
So that cannot be the reason. 
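
To illustrate the point about the counter, here is a minimal sketch, assuming a typical Linux platform where int is 32 bits wide. The names simulate_retries and int_is_wider_than_16_bits are hypothetical and exist only for this illustration; they are not part of pmxcfs or corosync. The loop mimics the "++retries, log every 10th retry" pattern from the quoted snippet and shows that the counter simply keeps counting past 65535.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for the quoted retry loop: pretend that the join
 * failed `attempts` times in a row. Returns the final counter value and
 * stores in *logged how many "retry" messages would have been emitted
 * (one for every 10th retry, as in the quoted code). */
static int simulate_retries(int attempts, int *logged)
{
    int retries = 0;

    assert(attempts >= 0);
    *logged = 0;
    for (int i = 0; i < attempts; i++) {
        ++retries;
        if ((retries % 10) == 0)
            ++*logged;
    }
    return retries;
}

/* Returns 1 if int on this platform can hold values beyond the 16-bit
 * limit of 65535, 0 otherwise. */
static int int_is_wider_than_16_bits(void)
{
    return INT_MAX > 65535;
}
```

Calling simulate_retries(70000, &logged) on such a platform returns 70000, so under this assumption nothing special happens to the counter around 65000; the log going quiet at that point would need another explanation.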
_______________________________________________ 
pve-devel mailing list 
pve-devel at pve.proxmox.com 
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 