Fencing
Note: Article about Proxmox VE 2.0 beta
Introduction
To ensure data integrity, only one node is allowed to run a VM or any other cluster-service at a time. The use of power switches in the hardware configuration enables a node to power-cycle another node before restarting that node's HA services during a fail-over process. This prevents two nodes from simultaneously accessing the same data and corrupting it. Fence devices are used to guarantee data integrity under all failure conditions.
Configure nodes to boot immediately and always after power cycle
Check your BIOS settings and test whether it works: unplug the power cord and verify that the server boots up after you reconnect it.
If you use integrated fence devices, you must configure ACPI (Advanced Configuration and Power Interface) to ensure immediate and complete fencing. The options are:
- make sure that acpid is not installed (remove it with: aptitude remove acpid)
- disable ACPI soft-off in the BIOS
- disable ACPI by adding acpi=off to the kernel boot command line
In any case, you need to make sure that the node turns off immediately when fenced. If you have delays here, the HA resources cannot be moved.
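The first and third options above can be checked from a shell. The following is a minimal sketch, not part of the original setup: the function names are hypothetical, and it assumes a Debian-based node where dpkg is available.

```shell
#!/bin/sh
# Hypothetical helpers for checking the ACPI-related options listed above.

# Succeeds if the given kernel command line contains acpi=off.
# $1: kernel command line string, e.g. the contents of /proc/cmdline
cmdline_has_acpi_off() {
    case " $1 " in
        *" acpi=off "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds if the acpid package is currently installed
# (assumption: the node uses dpkg for package management).
acpid_installed() {
    dpkg -s acpid >/dev/null 2>&1
}
```

On a node you could call `cmdline_has_acpi_off "$(cat /proc/cmdline)"` and `acpid_installed` and inspect the exit status. The BIOS soft-off setting cannot be checked this way and still has to be verified by hand.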
List of supported fence devices
APC Switch Rack PDU
E.g. AP7921; here is an example used in our test lab.
Create a user on the APC web interface
I just configured a new user via "Outlet User Management"
- user name: hpapc
- password: 12345678
Make sure that you enable "Outlet Access" and SSH. Most importantly, make sure that each physical server is connected to the correct outlet of the power switch.
Example /etc/pve/cluster.conf.new with APC power fencing
This example uses the APC power switch as the fencing device. Additionally, a simple "TestIP" service is used for HA and fail-over testing.
cp /etc/pve/cluster.conf /etc/pve/cluster.conf.new
nano /etc/pve/cluster.conf.new
<?xml version="1.0"?>
<cluster name="hpcluster765" config_version="28">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey">
  </cman>
  <fencedevices>
    <fencedevice agent="fence_apc" ipaddr="192.168.2.30" login="hpapc" name="apc" passwd="12345678"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="hp4" votes="1" nodeid="1">
      <fence>
        <method name="power">
          <device name="apc" port="4" secure="on"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="hp1" votes="1" nodeid="2">
      <fence>
        <method name="power">
          <device name="apc" port="1" secure="on"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="hp3" votes="1" nodeid="3">
      <fence>
        <method name="power">
          <device name="apc" port="3" secure="on"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="hp2" votes="1" nodeid="4">
      <fence>
        <method name="power">
          <device name="apc" port="2" secure="on"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <service autostart="1" exclusive="0" name="TestIP" recovery="relocate">
      <ip address="192.168.7.180"/>
    </service>
  </rm>
</cluster>
Note
If you edit this file via the CLI, you ALWAYS need to increase the "config_version" number. This guarantees that all nodes apply the new settings.
To apply the new config, go to the web interface (Datacenter/HA). There you can review the changes and, if the syntax is OK, commit them via the GUI to all nodes. All nodes then receive the new config and apply it automatically.
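If you edit the file from the CLI regularly, the "config_version" bump mentioned in the note can be scripted so it is not forgotten. This is only a sketch: bump_config_version is a hypothetical helper, not a Proxmox tool, and it assumes the attribute is written as config_version="N" on a single line, as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: print a cluster.conf-style file to stdout with the
# first config_version="N" on each matching line replaced by N+1.
# $1: path to the file (the file itself is not modified)
bump_config_version() {
    awk '{
        if (match($0, /config_version="[0-9]+"/)) {
            # RSTART/RLENGTH delimit the whole attribute; keep only the digits
            num = substr($0, RSTART, RLENGTH)
            gsub(/[^0-9]/, "", num)
            $0 = substr($0, 1, RSTART - 1) "config_version=\"" (num + 1) "\"" substr($0, RSTART + RLENGTH)
        }
        print
    }' "$1"
}
```

A possible use is `bump_config_version /etc/pve/cluster.conf > /etc/pve/cluster.conf.new` before opening the new file in an editor. The new config still has to be reviewed and committed through the web interface as described above.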
Enable fencing on all nodes
In order to get fencing active, you also need to join each node to the fencing domain. Do the following on all your cluster nodes.
- Enable fencing in /etc/default/redhat-cluster-pve (just uncomment the last line, see below):
nano /etc/default/redhat-cluster-pve
# CLUSTERNAME=""
# NODENAME=""
# USE_CCS="yes"
# CLUSTER_JOIN_TIMEOUT=300
# CLUSTER_JOIN_OPTIONS=""
# CLUSTER_SHUTDOWN_TIMEOUT=60
# RGMGR_OPTIONS=""
FENCE_JOIN="yes"
- join the fence domain with:
fence_tool join
To check the status, just run (this example shows all 3 nodes already joined):
fence_tool ls
fence domain
member count  3
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1 2 3
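To confirm from a script that every node has joined, the member count can be extracted from this output. The sketch below uses a hypothetical helper name and assumes the output contains the words "member count" followed by the number, as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: read `fence_tool ls` output on stdin and print the
# number that follows the words "member count".
fence_member_count() {
    awk '{
        for (i = 1; i < NF - 1; i++) {
            if ($i == "member" && $(i + 1) == "count") {
                print $(i + 2)
                exit
            }
        }
    }'
}
```

For example, `fence_tool ls | fence_member_count` could be compared against the number of clusternode entries in your cluster.conf.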
Test fencing
Before you use the fencing device, make sure that it works as expected. In my example configuration, the AP7921 uses the IP 192.168.2.30:
Query the status of power supply:
fence_apc -x -l hpapc -p 12345678 -a 192.168.2.30 -o status -n 1 -v
Reboot the server using fence_apc:
fence_apc -x -l hpapc -p 12345678 -a 192.168.2.30 -o reboot -n 1 -v
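With several nodes it can be convenient to run the status query for every outlet in one go. The following is a rough sketch: check_outlets and the FENCE_APC override variable are hypothetical, the fence_apc options are the ones used in the commands above, and it assumes that fence_apc exits with status 0 only when the query succeeds and the outlet is on. Verify that assumption against your fence_apc version.

```shell
#!/bin/sh
# Hypothetical helper: run the status query shown above for several outlets.
# $1: PDU address, $2: login, $3: password, remaining arguments: outlet numbers
# Set FENCE_APC to use a different binary (e.g. a stub for dry runs).
check_outlets() {
    addr=$1
    login=$2
    pass=$3
    shift 3
    failed=0
    for port in "$@"; do
        # Assumption: exit status 0 means the outlet was reached and is on.
        if "${FENCE_APC:-fence_apc}" -x -l "$login" -p "$pass" -a "$addr" -o status -n "$port" >/dev/null 2>&1; then
            echo "outlet $port: on"
        else
            echo "outlet $port: not confirmed on (off or unreachable)"
            failed=1
        fi
    done
    return "$failed"
}
```

With the test lab values from this article, the call would be `check_outlets 192.168.2.30 hpapc 12345678 1 2 3 4`. The function returns non-zero if any outlet could not be confirmed.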
Intel Modular Server HA
to be extended
tbd.