[PVE-User] Nodes in "partial" offline condition

Alexandre Kouznetsov alk at ondore.com
Thu Sep 27 23:19:48 CEST 2012


Hello.

I have 4 nodes in a Proxmox 2.1 cluster.
After I changed the network configuration on the node I'm using as web 
panel (hostname proxmox42) and rebooted it (as the Web GUI requested), I 
see the rest of the nodes as offline (hostnames proxmox43-proxmox45). 
Well, they have a little red dot instead of a green one in their icons 
in the web interface.

The fallen nodes respond via Web and SSH, with some errors on the Web 
GUI. The network configuration change I made was to add a bridge on 
a previously unused NIC.

What can I do (places to look, tests to run) to see what is going on? My 
cluster has to go to production next week, so I'm almost glad this 
happened now and not then.


Random details, don't know what may be relevant:

The "Datacenter" (root of the GUI hierarchy) section of the Web GUI 
shows this status:
The "Search" tab lists all the resources but shows status details only 
for proxmox42's resources.
The "Summary" tab shows all the nodes as "online".
I have reloaded the page, logged out and logged in again (using the root 
PAM account), with the same result.

Curiously, the "Summary" tabs of the fallen nodes are showing a valid 
status. I can see the CPU details, uptime, etc. The only thing out of 
order is the Load Average. The nodes are running nothing, but have a 
Load Average above 1.
Some parts of the GUI do not show details and display a floating 
message "communication failure".

I can SSH to all the nodes and see that "pvecm status" and "pvecm nodes" 
show all 4 nodes online and running.
On each node, "top" confirms a high Load Average but shows less than 1% 
CPU usage.
The Apache access log shows successful connections to the API from 
proxmox42 to the fallen nodes.
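
A note on the Load Average reading above: on Linux, Load Average counts 
processes in uninterruptible sleep (state D, typically waiting on I/O) 
as well as runnable ones, so a load above 1 with under 1% CPU usage 
would be consistent with something blocked on I/O. A minimal sketch for 
checking this on a node, assuming a Linux /proc filesystem and Python; 
the function names are hypothetical and not part of Proxmox:

```python
import os

def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg content."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)

def parse_stat(text):
    """Return (pid, command, state) from the content of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split around the last ')'.
    head, _, tail = text.rpartition(")")
    pid, _, comm = head.partition("(")
    return int(pid), comm, tail.split()[0]

def blocked_processes(proc_root="/proc"):
    """List (pid, command) for every process in uninterruptible sleep (state D)."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited or the entry could not be parsed
        if state == "D":
            blocked.append((pid, comm))
    return blocked

if __name__ == "__main__":
    with open("/proc/loadavg") as f:
        print("load average:", parse_loadavg(f.read()))
    for pid, comm in blocked_processes():
        print("blocked:", pid, comm)
```

Run on one of the fallen nodes, this would list any processes stuck in 
state D. Whether that is the actual cause here is only a guess; it is 
simply one place to look.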

I have rebooted one of the nodes and it appears to be online now and 
seems normal (Load Average, response to GUI). I have not rebooted any 
other node yet. I'm more interested in finding out what the condition is 
and making sure I eliminate the cause than in getting my nodes back 
online ASAP.

Thank you.

-- 
Alexandre Kouznetsov



