Separate Cluster Network

Note: Proxmox VE 4.0 with pve-cluster version 4.0-23 or later is required for this to work properly.

Introduction

It is good practice to use a separate network for corosync, which handles the cluster communication in Proxmox VE. Corosync is one of the most important parts of a fault-tolerant (HA) system, and other network traffic may disturb it. Storage communication should never be on the same network as corosync!

It is also good practice to add redundancy to your cluster network. This can be done by using RRP (Redundant Ring Protocol) with two physically separate networks. Besides the obvious benefit that the cluster keeps working when a switch fails, maintenance of your systems also becomes easier: a firmware upgrade of a switch, for example, can be done on a running cluster with no downtime, because the other ring handles the traffic in the meantime.

This article shows a way to use a completely separate corosync network in Proxmox VE 4.0. Version 4.0-23 or later of the pve-cluster package is recommended.

Prerequisites

This HowTo uses a three node cluster with the nodes called 'one', 'two', 'three'.

A dedicated NIC and a dedicated switch are used for corosync (gigabit here, although 100 Mbit should be sufficient). The NIC is configured as the eth1 interface and the network is 10.10.1.0/24.

Reading through the corosync.conf manual page is a good idea to get some hints and to see which option does what.

man corosync.conf

We distinguish two cases in which we want to use a separate network:

- from the beginning, i.e. at cluster creation time
- when we already have a running cluster

Shared Steps

Note: Back up /etc/pve/corosync.conf (it does not exist if no cluster has been created yet) and /etc/hosts on each node, so that you can revert if something goes wrong. Changes to /etc/pve/corosync.conf propagate immediately to all nodes and trigger a corosync config reload; if the reload fails, the old config remains in use.
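The backup can be scripted. The following is a minimal sketch, not an official Proxmox VE tool: the function name backup_cluster_files and the backup directory used in the usage line are assumptions chosen for illustration. Files that do not exist (such as /etc/pve/corosync.conf before cluster creation) are skipped.

```shell
# Copy each existing file into a backup directory; skip files that do not exist.
# Usage: backup_cluster_files <backup-dir> <file>...
# (hypothetical helper, not part of Proxmox VE)
backup_cluster_files() {
    dest=$1
    shift
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        if [ -f "$f" ]; then
            # -p keeps mode and timestamps of the original
            cp -p "$f" "$dest/$(basename "$f")" || return 1
            echo "saved $f"
        else
            echo "skipped $f (not found)"
        fi
    done
}
```

On each node this could be called as: backup_cluster_files /root/corosync-backup /etc/pve/corosync.conf /etc/hosts (the directory /root/corosync-backup is only an example).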

Configure interfaces

Set up a static network by editing /etc/network/interfaces; see the example for one node below.

auto eth1
iface eth1 inet static
        address 10.10.1.151
        netmask 255.255.255.0

Do that on every node and change the address accordingly. (In this example we use *.151, *.152 and *.153, as they mirror the endings of the interface/VM traffic IPs.)

Restart the network and check that you can ping each node on the new network. Make sure that multicast works and is not blocked by the firewall.
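The reachability check can be sketched as a small loop. This is only an illustration: the function name check_peers and the overridable PING variable are assumptions, and the addresses in the usage line are the ones from this example. It only tests unicast ICMP; multicast has to be verified separately, for instance with a tool such as omping run on all nodes at the same time.

```shell
# Ping every given address and report which ones answer.
# PING may be overridden; by default three ICMP echo requests are sent.
# (hypothetical helper, not part of Proxmox VE)
check_peers() {
    failed=0
    for addr in "$@"; do
        # ${PING:-...} is left unquoted on purpose so that "ping -c 3" is split into words
        if ${PING:-ping -c 3} "$addr" >/dev/null 2>&1; then
            echo "$addr reachable"
        else
            echo "$addr NOT reachable"
            failed=1
        fi
    done
    return "$failed"
}
```

Usage on any node: check_peers 10.10.1.151 10.10.1.152 10.10.1.153. The exit status is non-zero if at least one address did not answer.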

Configure hosts file

Now configure the /etc/hosts file so that we can use hostnames in the corosync config. This isn't strictly necessary, as you can also set the addresses directly, but it helps to keep an overview and is considered good practice. Note that entries for the other nodes are added too; this isn't necessary either, but it is good practice because the names can be resolved faster.

127.0.0.1 localhost.localdomain localhost
192.168.15.151 one.proxmox.com one pvelocalhost

# corosync network hosts 
10.10.1.151 one-corosync.proxmox.com one-corosync
10.10.1.152 two-corosync.proxmox.com two-corosync
10.10.1.153 three-corosync.proxmox.com three-corosync

# The following lines are desirable for IPv6 capable hosts
[...]

Setup at Cluster Creation

Since version 4.0-23 of the pve-cluster package, there is built-in support for creating the cluster with separate corosync ring(s) on their own networks. If you're running an earlier version, please update your system first.

bindnetaddr

This specifies the network address the corosync executive should bind to. bindnetaddr should be an IP address configured on the system, or a network address. For example, if the local interface is 192.168.5.151 with netmask 255.255.255.0, you should set bindnetaddr to 192.168.5.151 or 192.168.5.0. If the local interface is 192.168.5.151 with netmask 255.255.255.192, set bindnetaddr to 192.168.5.151 or 192.168.5.128, and so forth. This may also be an IPV6 address, in which case IPV6 networking will be used. In this case, the exact address must be specified and there is no automatic selection of the network interface within a specific subnet as with IPv4.

Note that an FQDN/hostname isn't allowed here; use a 'real' IP address.
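The network address mentioned above is the bitwise AND of the interface address and the netmask. The following sketch reproduces the two IPv4 examples from the description; the function name network_addr is an assumption for illustration and is not part of corosync or Proxmox VE.

```shell
# Print the IPv4 network address for a dotted address and a dotted netmask.
# Usage: network_addr <address> <netmask>
# (hypothetical helper)
network_addr() {
    old_ifs=$IFS
    IFS=.
    # Unquoted on purpose: split both arguments at the dots into $1..$8
    set -- $1 $2
    IFS=$old_ifs
    # AND each address octet with the matching netmask octet
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
}

network_addr 192.168.5.151 255.255.255.0     # prints 192.168.5.0
network_addr 192.168.5.151 255.255.255.192   # prints 192.168.5.128
```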

Note: If you are setting up a cluster with unicast, in most situations the network mask /24 will cause an error. See Use Unicast.

ringX_addr

Hostname (or IP) of this node's address on corosync ringX (X can be 0 or 1). Two rings can also be used; see Redundant Ring Protocol for setup instructions.

Normally this is the hostname defined for corosync in the /etc/hosts file.

Final Command

In our example the following parameters would be used when creating the cluster on the node named 'one':

bindnetaddr: 10.10.1.151
ring0_addr: one-corosync

pvecm create <clustername> -bindnet0_addr 10.10.1.151 -ring0_addr one-corosync

Setup on a Running Cluster

Requires pve-cluster version 4.0-23 or later to work properly.

Note that a reboot of the whole cluster is needed to apply these changes on a running cluster. (TODO: look for a 'no-reboot' way.)

Configure corosync

First copy the current corosync config:

cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new

Then edit the copied file with your favorite editor, or use nano as it is available on every Proxmox VE node by default:

nano /etc/pve/corosync.conf.new

In the editor, adapt the following attributes:
- name: if not already there, add a "name: <nodename>" entry to each node {} section.
- ring0_addr: in every node entry, change it to the newly defined hostname from /etc/hosts.
- bindnetaddr: in the totem section, change it to the matching IP from the separate network (e.g. in our case I use the node with nodeid 1 and change 192.168.15.151 to 10.10.1.151).
- config_version: increase it. This is very important: you can use any number that is higher than the current one, but you must increase it.
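Because forgetting to raise config_version is an easy mistake, the increment can be automated on the copied file. The following is a minimal sketch under the assumption that the file contains a single "config_version: <number>" line, as in the example config; the function name bump_config_version is hypothetical.

```shell
# Increase the config_version value in a corosync config file by one, in place.
# Usage: bump_config_version <file>
# (hypothetical helper, not part of Proxmox VE)
bump_config_version() {
    file=$1
    tmp=$(mktemp) || return 1
    if awk '
        /^[ \t]*config_version:/ {
            # keep the original indentation, replace the number by number + 1
            match($0, /^[ \t]*/)
            printf "%sconfig_version: %d\n", substr($0, 1, RLENGTH), $2 + 1
            next
        }
        { print }
    ' "$file" > "$tmp"; then
        # write back through the existing file so ownership and mode are kept
        cat "$tmp" > "$file"
    fi
    rm -f "$tmp"
}
```

It would be run against the copy, not the live file: bump_config_version /etc/pve/corosync.conf.new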

Here is an example of how it could look:

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: two
    nodeid: 2
    quorum_votes: 1
    ring0_addr: two-corosync
  }

  node {
    name: one
    nodeid: 1
    quorum_votes: 1
    ring0_addr: one-corosync
  }

  node {
    name: three
    nodeid: 3
    quorum_votes: 1
    ring0_addr: three-corosync
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 8
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 10.10.1.151
    ringnumber: 0
  }
}

Rename the config file:

mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf

Reboot the first node, and check in the logs that corosync does not throw errors and can form a healthy cluster by itself on the new network.

If something fails, look at the Troubleshooting section.

Then reboot every node, one after the other. If HA is enabled, reboot the node that is the current HA master last, to speed up the process.

Another, but unsupported, way to bring the changes into effect would be to restart all related services:

systemctl restart corosync.service 
systemctl restart pve-cluster.service
systemctl restart pvedaemon.service
systemctl restart pveproxy.service

Note that a reboot is cleaner and strongly recommended.

Adding nodes in the future

If you add a new node to the cluster in the future, first configure its corosync interface as described above, and edit the /etc/hosts file. You do not need to edit any corosync config file.

Second, use the standard pvecm command with one important addition:

pvecm add <IP addr of a cluster member> -ring0_addr <new nodes ring addr>

This sets the correct ring address in the config. Otherwise you could run into trouble and need to intervene manually.

Redundant Ring Protocol

To stay safe when the switch used for corosync fails, and also to get higher throughput for the cluster communication (which may be helpful on big setups with a lot of nodes), you can use redundant rings. Those rings must run on two physically separate networks; otherwise you won't gain anything on the High Availability side.

To use it, first configure another interface and hostnames for your second ring as described above.

RRP modes

Note: Active mode is not completely stable yet. Always use passive mode in production.

Active replication offers slightly lower latency from transmit to delivery in faulty network environments but with less performance. Passive replication may nearly double the speed of the totem protocol if the protocol doesn't become CPU bound. The final option is none, in which case only one network interface will be used to operate the totem protocol.

On Cluster Creation

The pvecm create command provides the additional parameters '-bindnet1_addr', '-ring1_addr' and '-rrp_mode', which can be used for RRP configuration.

See the bindnetaddr and ringX_addr sections for information about the addresses.

Note: when you only set the ring 1 addresses, ring 0 will be set to the default values (local IP address and node name).
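To make the parameter combination concrete, the sketch below only prints a pvecm create call for two rings so that it can be reviewed before it is run. The function name pvecm_create_rrp_cmd is hypothetical, and the ring 1 values in the usage line (10.10.3.151 and one-corosync1) are assumed example values for a second network, not values defined elsewhere in this article.

```shell
# Print (do not run) a pvecm create call that sets up two rings in passive mode.
# Usage: pvecm_create_rrp_cmd <clustername> <ring0 bind addr> <ring0 addr> <ring1 bind addr> <ring1 addr>
# (hypothetical helper; it only assembles the parameters named above)
pvecm_create_rrp_cmd() {
    echo "pvecm create $1 -bindnet0_addr $2 -ring0_addr $3 -bindnet1_addr $4 -ring1_addr $5 -rrp_mode passive"
}

# Example with assumed ring 1 values:
pvecm_create_rrp_cmd testcluster 10.10.1.151 one-corosync 10.10.3.151 one-corosync1
```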

On Running Cluster

Use the same steps described in the Configure corosync section to edit the corosync config.

In the editor, adapt the following attributes:

- add a new interface section to the totem section of the config; in it, add "ringnumber: 1" and "bindnetaddr: <ring1bindnet_address>".
- add "ring1_addr: <ring1_hostname>" entries to each node section.

It should look something like:

totem {
  cluster_name: tweak
  config_version: 2
  ip_version: ipv4
  rrp_mode: passive
  secauth: on
  version: 2
  interface {
    bindnetaddr: 10.10.1.62
    ringnumber: 0
  }
  interface {
    bindnetaddr: 10.10.3.62
    ringnumber: 1
  }
}

nodelist {
  node {
    name: pvecm62
    nodeid: 1
    quorum_votes: 1
    ring0_addr: coro0-62
    ring1_addr: coro1-62
  }

  node {
    name: pvecm63
    nodeid: 2
    quorum_votes: 1
    ring0_addr: coro0-63
    ring1_addr: coro1-63
  }

  [...] # other cluster nodes here
}

[...] # other config sections here
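Before renaming the file, it can help to confirm that no node section was missed when adding the second ring. The following is a rough sketch, assuming the config uses the "node { ... }" layout shown above; the function name check_rings is hypothetical.

```shell
# Succeed only if every node block in the file has both ring0_addr and ring1_addr.
# Usage: check_rings <corosync config file>
# (hypothetical helper, not part of corosync or Proxmox VE)
check_rings() {
    awk '
        # a "node {" line opens a node block ("nodelist {" does not match)
        /^[ \t]*node[ \t]*[{]/ { in_node = 1; r0 = 0; r1 = 0; nodes++; next }
        in_node && /^[ \t]*ring0_addr:/ { r0 = 1 }
        in_node && /^[ \t]*ring1_addr:/ { r1 = 1 }
        # the next closing brace ends the node block
        in_node && /^[ \t]*[}]/ {
            if (!r0 || !r1) bad++
            in_node = 0
        }
        END {
            printf "%d node(s), %d incomplete\n", nodes, bad
            exit (nodes == 0 || bad > 0)
        }
    ' "$1"
}
```

Usage: check_rings /etc/pve/corosync.conf.new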

Rename the config file:

mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf

Reboot the first node, and check in the logs that corosync does not throw errors and can form a healthy cluster by itself on the new network. If something fails, look at the Troubleshooting section.

Then reboot every node, one after the other. If HA is enabled, reboot the node that is the current HA master last, to speed up the process.

Troubleshooting

Known issues

quorum.expected_votes must be configured

If the logs show something like:

[...]
corosync[1647]:  [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
corosync[1647]:  [SERV  ] Service engine 'corosync_quorum' failed to load for reason 'configuration error: nodelist or quorum.expected_votes must be configured!'
[...]

This means that your hosts file entry for the corosync hostname and the name in ring0_addr in corosync.conf do not match, or that the name could not be resolved.

Fix them and reboot/restart. If you need to change something in corosync.conf but have no write permission, see Write config when not quorate.
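A mismatch between the two files can be spotted by comparing them directly. The sketch below assumes that hostnames (not IP addresses) are used in the ringX_addr entries and that name resolution goes through the hosts file; the function name unresolved_ring_addrs is hypothetical.

```shell
# List every ringX_addr value from a corosync config that has no entry in a hosts file.
# Usage: unresolved_ring_addrs <corosync.conf> <hosts file>
# (hypothetical helper, not part of Proxmox VE)
unresolved_ring_addrs() {
    conf=$1
    hosts=$2
    missing=0
    for name in $(awk '/^[ \t]*ring[0-9]+_addr:/ { print $2 }' "$conf"); do
        # A hosts line is "<address> <name> [alias...]"; compare names and aliases only
        if ! awk -v n="$name" '
                /^[ \t]*#/ { next }
                { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                END { exit !found }
            ' "$hosts"; then
            echo "$name"
            missing=1
        fi
    done
    return "$missing"
}
```

Usage: unresolved_ring_addrs /etc/pve/corosync.conf /etc/hosts. Any name that is printed needs a matching line in the hosts file.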

crit: cpg_send_message failed: 9

If this pops up on only one node, restart the pve-cluster service with:

systemctl restart pve-cluster.service

If that does not solve the problem, or if it occurs on all nodes, check your firewall and switch; they may block or not support multicast.
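The two known messages can be searched for in one pass over the logs. This is a small sketch only: the function name scan_cluster_log is hypothetical, and the patterns are simply the message fragments quoted in this section.

```shell
# Read log lines on stdin and print those that match the known issues described here.
# The exit status is 0 when no such line was found, 1 otherwise.
# (hypothetical helper, not part of Proxmox VE)
scan_cluster_log() {
    if grep -E 'nodelist or quorum\.expected_votes must be configured|cpg_send_message failed'; then
        return 1
    fi
    return 0
}
```

Usage, assuming the services log to the systemd journal: journalctl -b -u corosync -u pve-cluster | scan_cluster_log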

Unknown issues

Ask for support. In the meantime, revert to the backed-up corosync.conf: see 'Write config when not quorate', then overwrite the config with the backup on each node, increase the config version inside it, and make sure that the version is the same on all nodes. Then reboot the cluster.

Write config when not quorate

If you need to change /etc/pve/corosync.conf on a node with no quorum, and you know what you are doing, use:

pvecm expected 1

to set the expected vote count to 1. This makes the cluster quorate, so that you can fix your config or revert it to the backup.

If that wasn't enough (e.g. corosync is dead), use:

systemctl stop pve-cluster
pmxcfs -l

to start pmxcfs in local mode. You now have write access, so be very careful with changes!

After restarting, the filesystem should merge the changes, provided there is no big merge conflict that could result in a split brain.