[PVE-User] Cman crash problem

Jonathan Schaeffer jonathan.schaeffer at univ-brest.fr
Mon Jul 8 12:56:48 CEST 2013


Hi all,

I'm experiencing a serious problem on our 4-node cluster (PVE 3.0).

It appeared after the network team replaced the active network equipment
in the building (although this might not be the cause of the problem).

The symptoms are:

- The nodes appear in red in the web GUI, except the one hosting the web
service IP
- The VMs, while still running correctly, do not show any status
information (running state, RRD graphs, etc.)

- clustat shows the nodes as "online"
- some nodes seem to have been fenced (without being restarted)
(see the fenced log extracts at the end of this message)

- /var/log/cluster/corosync.log shows a lot of messages like this one:
Jul 08 07:06:49 corosync [TOTEM ] Retransmit List: 13f54a 13f54b 13f54c 
13f54d 13f54e 13f54f 13f550 13f551 13f552 13f553 13f554 13f555 13f556 
13f557 13f558 13f559 13f55a 13f55b 13f55c 13f55d 13f55e
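
To get a feeling for how often these retransmits occur, the log could be
summarised per minute. This is only a minimal sketch, assuming each message
is a single line in the real file (it is wrapped above by the mail client)
and has exactly the format shown; the function name retransmits_per_minute
is made up for this example.

```python
import re
from collections import Counter

# Matches lines such as (assumed to be one line in the log file):
# Jul 08 07:06:49 corosync [TOTEM ] Retransmit List: 13f54a 13f54b
RETRANSMIT = re.compile(
    r"^(?P<minute>\w{3} \d{2} \d{2}:\d{2}):\d{2} corosync \[TOTEM \] "
    r"Retransmit List:(?P<seqs>(?: [0-9a-f]+)*)"
)


def retransmits_per_minute(lines):
    """Count retransmitted sequence ids per 'Mon DD HH:MM' bucket."""
    counts = Counter()
    for line in lines:
        match = RETRANSMIT.match(line)
        if match:
            # Each whitespace-separated token is one sequence id.
            counts[match.group("minute")] += len(match.group("seqs").split())
    return counts


# Small in-memory sample; in practice the lines would come from
# open("/var/log/cluster/corosync.log").
sample = [
    "Jul 08 07:06:49 corosync [TOTEM ] Retransmit List: 13f54a 13f54b 13f54c",
    "Jul 08 07:06:50 corosync [TOTEM ] Retransmit List: 13f54d",
]
for minute, count in retransmits_per_minute(sample).items():
    print(minute, count)
```

A steadily high count per minute would suggest that the retransmits are
continuous rather than a short burst.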

If I restart one node, fencing will be triggered and the other nodes will
reboot, along with all the VMs they host. I don't want this to happen.

I can provide more logs if necessary. Do you have any idea that could help
me understand what is going on here?

Thanks,

Jonathan


barbossa_fenced.log :
Jul 03 12:07:21 fenced fencing deferred to jim
Jul 03 13:45:40 fenced receive_start 1:15 add node with started_count 8
Jul 03 13:45:40 fenced receive_start 2:11 add node with started_count 4
Jul 03 13:45:40 fenced receive_start 3:7 add node with started_count 1
Jul 04 00:29:35 fenced receive_start 1:16 add node with started_count 8
Jul 04 00:29:35 fenced receive_start 3:8 add node with started_count 1
Jul 04 00:38:31 fenced receive_start 2:17 add node with started_count 4
Jul 04 00:38:31 fenced receive_start 3:13 add node with started_count 1
Jul 04 00:38:31 fenced receive_start 1:21 add node with started_count 8
Jul 04 10:44:12 fenced receive_start 1:22 add node with started_count 8
Jul 04 10:44:12 fenced receive_start 3:14 add node with started_count 1
Jul 04 10:44:24 fenced receive_start 1:23 add node with started_count 8
Jul 04 10:44:24 fenced telling cman to remove nodeid 2 from cluster


jim_fenced.log :
Jul 03 12:07:21 fenced fencing node longjohn
Jul 03 12:07:32 fenced fence longjohn success
Jul 03 13:45:40 fenced receive_start 5:13 add node with started_count 6
Jul 03 13:45:40 fenced receive_start 2:11 add node with started_count 4
Jul 03 13:45:40 fenced receive_start 3:7 add node with started_count 1
Jul 04 00:29:35 fenced receive_start 3:8 add node with started_count 1
Jul 04 00:29:35 fenced receive_start 5:14 add node with started_count 6
Jul 04 00:38:31 fenced receive_start 2:17 add node with started_count 4
Jul 04 00:38:31 fenced receive_start 3:13 add node with started_count 1
Jul 04 00:38:31 fenced receive_start 5:19 add node with started_count 6
Jul 04 10:44:12 fenced receive_start 5:20 add node with started_count 6
Jul 04 10:44:12 fenced receive_start 3:14 add node with started_count 1
Jul 04 10:44:24 fenced telling cman to remove nodeid 2 from cluster
Jul 04 10:44:24 fenced receive_start 2:23 add node with started_count 4
Jul 04 10:44:24 fenced receive_start 3:15 add node with started_count 1
Jul 04 10:44:24 fenced receive_start 5:21 add node with started_count 6
Jul 04 10:44:46 fenced receive_start 5:22 add node with started_count 6
Jul 04 10:44:46 fenced receive_start 3:16 add node with started_count 1

longjohn_fenced.log :
Jul 03 09:47:12 fenced fenced 1352871249 started
Jul 03 11:28:46 fenced cluster is down, exiting
Jul 03 11:28:46 fenced daemon cpg_dispatch error 2
Jul 03 12:11:43 fenced fenced 1364188437 started
Jul 03 13:45:40 fenced receive_start 5:13 add node with started_count 6
Jul 03 13:45:40 fenced receive_start 1:15 add node with started_count 8
Jul 03 13:45:40 fenced receive_start 2:11 add node with started_count 4
Jul 04 00:29:35 fenced receive_start 1:16 add node with started_count 8
Jul 04 00:29:35 fenced receive_start 5:14 add node with started_count 6
Jul 04 00:38:31 fenced receive_start 2:17 add node with started_count 4
Jul 04 00:38:31 fenced receive_start 1:21 add node with started_count 8
Jul 04 00:38:31 fenced receive_start 5:19 add node with started_count 6
Jul 04 10:44:12 fenced receive_start 1:22 add node with started_count 8
Jul 04 10:44:12 fenced receive_start 5:20 add node with started_count 6
Jul 04 10:44:24 fenced receive_start 1:23 add node with started_count 8
Jul 04 10:44:24 fenced telling cman to remove nodeid 2 from cluster
Jul 04 10:44:24 fenced receive_start 2:23 add node with started_count 4
Jul 04 10:44:24 fenced receive_start 5:21 add node with started_count 6
Jul 04 10:44:46 fenced receive_start 5:22 add node with started_count 6
Jul 04 10:44:46 fenced receive_start 1:24 add node with started_count 8

flint_fenced.log :
Jul 03 11:18:30 fenced fenced 1364188437 started
Jul 03 12:07:21 fenced fencing deferred to jim
Jul 03 13:45:40 fenced receive_start 5:13 add node with started_count 6
Jul 03 13:45:40 fenced receive_start 1:15 add node with started_count 8
Jul 03 13:45:40 fenced receive_start 3:7 add node with started_count 1
Jul 04 00:38:31 fenced receive_start 3:13 add node with started_count 1
Jul 04 00:38:31 fenced receive_start 1:21 add node with started_count 8
Jul 04 00:38:31 fenced receive_start 5:19 add node with started_count 6
Jul 04 10:44:24 fenced receive_start 1:23 add node with started_count 8
Jul 04 10:44:24 fenced receive_start 3:15 add node with started_count 1
Jul 04 10:44:24 fenced receive_start 5:21 add node with started_count 6
Jul 04 10:44:24 fenced cluster is down, exiting
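
To compare the four extracts above, it may help to keep only the lines that
report a state change. This is a rough sketch, assuming the fenced message
formats shown in these logs and treating the receive_start lines as routine
(that is my assumption, not something the logs state); the names EVENTS and
significant_events are made up for this example.

```python
import re

# Message patterns taken from the extracts above. The labels on the left
# are invented for this sketch.
EVENTS = [
    ("fencing", re.compile(r"fenced fencing node (\S+)")),
    ("fence_success", re.compile(r"fenced fence (\S+) success")),
    ("deferred", re.compile(r"fenced fencing deferred to (\S+)")),
    ("removed", re.compile(r"fenced telling cman to remove nodeid (\d+) from cluster")),
    ("cluster_down", re.compile(r"fenced cluster is down, exiting")),
]


def significant_events(lines):
    """Return (timestamp, label, target) tuples, skipping all other lines."""
    found = []
    for line in lines:
        stamp = line[:15]  # "Jul 03 12:07:21" is 15 characters long
        for label, pattern in EVENTS:
            match = pattern.search(line)
            if match:
                # Patterns without a capture group have no target.
                target = match.group(1) if match.groups() else None
                found.append((stamp, label, target))
                break
    return found


# Two lines from jim_fenced.log as a sample.
sample = [
    "Jul 03 12:07:21 fenced fencing node longjohn",
    "Jul 03 13:45:40 fenced receive_start 5:13 add node with started_count 6",
]
for event in significant_events(sample):
    print(event)
```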




-- 
IUEM - Service Informatique
rue Dumont D'Urville
Technopôle Brest-Iroise
29280 Plouzané
France
http://www-iuem.univ-brest.fr/feiri
tel: +33 2 98 49 87 94


