Fencing

Note: Article about Proxmox VE 2.0 beta

Introduction

To ensure data integrity, only one node is allowed to run a VM or any other cluster service at a time. Power switches in the hardware configuration let one node power-cycle another node before restarting that node's HA services during fail-over. This prevents two nodes from simultaneously accessing the same data and corrupting it. Fence devices are used to guarantee data integrity under all failure conditions.

Update to the latest version

Before you start, make sure you have installed the latest packages. Run this on all nodes:

aptitude update && aptitude full-upgrade && aptitude install resource-agents-pve

Configure nodes to boot immediately and always after power cycle

Check your BIOS settings and test whether they work: unplug the power cord and verify that the server boots up after reconnecting.

If you use integrated fence devices, you must configure ACPI (Advanced Configuration and Power Interface) to ensure immediate and complete fencing. The options are:

  • make sure that acpid is not installed (remove with: aptitude remove acpid)
  • disable ACPI soft-off in the BIOS
  • disable ACPI completely by adding acpi=off to the kernel boot command line

In any case, you need to make sure that the node turns off immediately when fenced. If you have delays here, the HA resources cannot be moved.
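The first and third options can be checked from a shell. The following is a minimal sketch, assuming a Debian-based node where dpkg-query is available; the function names has_acpi_off and acpid_installed are hypothetical helpers, not part of Proxmox VE.

```shell
# Returns 0 if the given kernel command line contains the acpi=off option.
# Pass the contents of /proc/cmdline as the first argument.
has_acpi_off() {
    case " $1 " in
        *" acpi=off "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Returns 0 if the acpid package is installed according to the dpkg database.
# Assumption: dpkg-query is present, as on a standard Debian system.
acpid_installed() {
    dpkg-query -W -f='${Status}' acpid 2>/dev/null | grep -q "install ok installed"
}
```

For example, has_acpi_off "$(cat /proc/cmdline)" && echo "acpi=off is set" reports whether the running kernel was booted with ACPI disabled.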

List of supported fence devices

APC Switch Rack PDU

For example, the AP7921. Here is an example used in our test lab.

Create a user on the APC web interface

I configured a new user via "Outlet User Management":

  • user name: hpapc
  • password: 12345678

Make sure that you enable "Outlet Access" and SSH. Most importantly, make sure that each physical server is connected to the correct power outlet.

Example /etc/pve/cluster.conf.new with APC power fencing

This example uses the APC power switch as the fencing device. Additionally, a simple "TestIP" service is used for HA and fail-over testing.

cp /etc/pve/cluster.conf /etc/pve/cluster.conf.new
nano /etc/pve/cluster.conf.new
<?xml version="1.0"?>
<cluster name="hpcluster765" config_version="28">

  <cman keyfile="/var/lib/pve-cluster/corosync.authkey">
  </cman>

  <fencedevices>
    <fencedevice agent="fence_apc" ipaddr="192.168.2.30" login="hpapc" name="apc" passwd="12345678"/>
  </fencedevices>

  <clusternodes>

  <clusternode name="hp4" votes="1" nodeid="1">
    <fence>
      <method name="power">
        <device name="apc" port="4" secure="on"/>
      </method>
    </fence>
  </clusternode>

  <clusternode name="hp1" votes="1" nodeid="2">
    <fence>
      <method name="power">
        <device name="apc" port="1" secure="on"/>
      </method>
    </fence>
  </clusternode>

  <clusternode name="hp3" votes="1" nodeid="3">
    <fence>
      <method name="power">
        <device name="apc" port="3" secure="on"/>
      </method>
    </fence>
  </clusternode>

  <clusternode name="hp2" votes="1" nodeid="4">
    <fence>
      <method name="power">
        <device name="apc" port="2" secure="on"/>
      </method>
    </fence>
  </clusternode>

  </clusternodes>

  <rm>
    <service autostart="1" exclusive="0" name="TestIP" recovery="relocate">
      <ip address="192.168.7.180"/>
    </service>
  </rm>

</cluster>

Note

If you edit this file via CLI, you ALWAYS need to increase the "config_version" number. This guarantees that all nodes pick up the new settings.

In order to apply this new config, go to the web interface (Datacenter/HA). There you can review the changes, and if the syntax is OK you can commit them via the GUI to all nodes. All nodes then receive the new config and apply it automatically.
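Because a forgotten version bump is easy to miss, the increment can be scripted. This is a minimal sketch, assuming the attribute appears as config_version="N" as in the example above; bump_config_version is a hypothetical helper that reads the config on stdin and prints the result, so it never modifies a file in place.

```shell
# Print the cluster.conf content read from stdin with config_version increased by 1.
bump_config_version() {
    awk '{
        if (match($0, /config_version="[0-9]+"/)) {
            # RSTART and RLENGTH delimit the matched attribute; keep only its digits.
            num = substr($0, RSTART, RLENGTH)
            gsub(/[^0-9]/, "", num)
            $0 = substr($0, 1, RSTART - 1) "config_version=\"" (num + 1) "\"" substr($0, RSTART + RLENGTH)
        }
        print
    }'
}
```

For example, bump_config_version < /etc/pve/cluster.conf > /etc/pve/cluster.conf.new would create the working copy with the version already raised, before you open it in an editor.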

Enable fencing on all nodes

In order to get fencing active, you also need to join each node to the fencing domain. Do the following on all your cluster nodes.

  • Enable fencing in /etc/default/redhat-cluster-pve (Just uncomment the last line, see below):
nano /etc/default/redhat-cluster-pve
# CLUSTERNAME=""
# NODENAME=""
# USE_CCS="yes"
# CLUSTER_JOIN_TIMEOUT=300
# CLUSTER_JOIN_OPTIONS=""
# CLUSTER_SHUTDOWN_TIMEOUT=60
# RGMGR_OPTIONS=""
FENCE_JOIN="yes"
  • join the fence domain with:
fence_tool join
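When there are many nodes, the uncommenting step can be done without an editor. This is a minimal sketch, assuming the shipped file contains the line in commented form (for example # FENCE_JOIN="yes"); enable_fence_join is a hypothetical helper that filters stdin to stdout and leaves all other lines unchanged.

```shell
# Read the defaults file on stdin and print it with the FENCE_JOIN line uncommented.
# Assumption: the line is commented with a leading "#", optionally followed by spaces.
enable_fence_join() {
    sed 's/^#[[:space:]]*FENCE_JOIN=.*$/FENCE_JOIN="yes"/'
}
```

For example, enable_fence_join < /etc/default/redhat-cluster-pve prints the adjusted file so the result can be reviewed before it replaces the original.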

To check the status, just run (this example shows all 3 nodes already joined):

fence_tool ls
fence domain
member count  3
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1 2 3
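The status output can also be checked automatically, for instance from a monitoring script. This is a minimal sketch that assumes the output format shown above; fence_member_count and fence_domain_ok are hypothetical helpers that read the fence_tool ls output from stdin.

```shell
# Read `fence_tool ls` output on stdin and print the "member count" value.
fence_member_count() {
    awk '$1 == "member" && $2 == "count" { print $3 }'
}

# Succeed only if the fence domain has the expected number of members
# (first argument) and the victim count is 0.
fence_domain_ok() {
    expected="$1"
    output="$(cat)"
    members="$(printf '%s\n' "$output" | fence_member_count)"
    victims="$(printf '%s\n' "$output" | awk '$1 == "victim" && $2 == "count" { print $3 }')"
    [ "$members" = "$expected" ] && [ "$victims" = "0" ]
}
```

For example, fence_tool ls | fence_domain_ok 3 && echo "fence domain complete" confirms that all three nodes have joined.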

Test fencing

Before you use the fencing device, make sure that it works as expected. In my example configuration, the AP7921 uses the IP 192.168.2.30:

Query the status of the power outlet:

fence_apc -x -l hpapc -p 12345678 -a 192.168.2.30 -o status -n 1 -v

Reboot the server using fence_apc:

fence_apc -x -l hpapc -p 12345678 -a 192.168.2.30 -o reboot -n 1 -v
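To check every outlet used in the cluster configuration, the status query can be generated once per port. This is a minimal sketch; apc_status_commands is a hypothetical helper that only prints the commands, using the same options as above, so they can be reviewed before being run.

```shell
# Print (not run) one fence_apc status command per outlet.
# Arguments: switch address, login, password, then one or more outlet numbers.
apc_status_commands() {
    addr="$1"
    user="$2"
    pass="$3"
    shift 3
    for port in "$@"; do
        printf 'fence_apc -x -l %s -p %s -a %s -o status -n %s -v\n' \
            "$user" "$pass" "$addr" "$port"
    done
}
```

For example, apc_status_commands 192.168.2.30 hpapc 12345678 1 2 3 4 prints four status commands, one for each outlet in the example configuration.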

Intel Modular Server HA

to be extended

tbd.