Difference between revisions of "Full Mesh Network for Ceph Server"

From Proxmox VE
m (title case)
(20 intermediate revisions by 6 users not shown)
Line 1: Line 1:
 
== Introduction ==
 
== Introduction ==
  
This wiki page describes how to configure in Proxmox VE (or any other Debian based Linux distribution) a three node [https://en.wikipedia.org/wiki/Mesh_networking "Meshed Network"] (instead of a network switch) as it can be used e.g. for connecting [[Ceph Server | Ceph Servers]] or nodes in a [[Proxmox VE 4.x Cluster | Proxmox VE Cluster]]. This should also work with a 5-node cluster; in general, each node needs nodes_total - 1 NIC ports. The basic idea is running a small 3 node cluster with 10 Gbit network WITHOUT buying an expensive 10 Gbit network switch.
+
This wiki page describes how to configure a three-node [https://en.wikipedia.org/wiki/Mesh_networking "Meshed Network"] on Proxmox VE (or any other Debian-based Linux distribution). Such a network can be used, for example, to connect [[Deploy Hyper-Converged Ceph Cluster | Ceph Servers]] or nodes in a [[Cluster Manager | Proxmox VE Cluster]] with the maximum possible bandwidth and without using a switch. We recommend using switches for clusters larger than 3 nodes, or if a 3-node cluster is expected to grow in the future.
  
This should work with any kind of ethernet NICs, i.e. also 40 Gbit or even 100 Gbit ones. But for verifying this article, 10 Gbit Intel NICs were used.
+
The big advantage of this setup is that you can achieve a fast network connection between the nodes (10, 25, 40 or 100Gbit/s) WITHOUT buying expensive switches which can handle these fast speeds.
  
There are two possible methods to achieve a full mesh.
+
You need at least two available NICs in each server which each connect to one of the other servers.
Generally, the first one is the recommended one, because it is easier to set up and supports multicast.
+
<pre>
 +
            ┌───────┐
 +
      ┌────┤ Node1 ├────┐
 +
      │    └───────┘    │
 +
  ┌───┴───┐        ┌───┴───┐
 +
  │ Node2 ├─────────┤ Node3 │
 +
  └───────┘        └───────┘
 +
</pre>
 +
 
 +
There are a few possible ways to set up such a network:
 +
 
 +
# [[#Routed Setup (with fallback)|Routed (with fallback)]]: Each packet is sent to the addressed node only. If the direct connection is down, the packets will be routed via the node in between.
 +
# [[#Routed Setup (simple)|Routed (simple)]]: Each packet is sent to the addressed node only
 +
# [[#RSTP Loop Setup|RSTP]]: A loop with the rapid spanning tree protocol enabled
 +
# [[#Broadcast setup|Broadcast]]: Each packet is sent to both other nodes
 +
 
 +
Each setup has benefits and caveats.
 +
The '''simple routed''' one does not need any additional software and delivers the best performance. The '''routed with fallback''' approach delivers similar performance but can handle the loss of one connection in the mesh by routing the traffic via the middle node. Because of this, performance could be impacted in such a scenario.
 +
 
 +
The '''RSTP''' setup gives you fault tolerance similar to the routed setup with fallback. While the loop is complete, RSTP creates an artificial cut-off between two nodes, e.g. between Node 1 and Node 3. Node 2 then sits between Node 1 and Node 3, and the traffic between those two goes via Node 2. Should a cable or NIC fail somewhere else, for example between Node 1 and Node 2, RSTP will remove the cut-off within a few seconds. Node 3 is now in between Node 1 and Node 2 and has to handle that traffic as well. Once the broken part has been replaced and the loop is complete again, RSTP will introduce another artificial cut-off.
 +
 
 +
The advantage of the '''broadcast''' method is an easier setup process, but it will send all data to both other nodes, using up more bandwidth.
 +
 
 +
=== Failure Scenarios ===
 +
 
 +
==== Loss of a Node ====
 +
<pre>
 +
            ┌───────┐
 +
      ┌────┤ Node1 ├────┐
 +
      │    └───────┘    │
 +
  ┌───┴───┐        ┌───┴───┐
 +
  │ Node2 ├─────────┤ XXXXX │
 +
  └───────┘        └───────┘
 +
</pre>
 +
If a node is going down, for example Node 3, the Ceph and Proxmox VE cluster will remain functioning, though with reduced redundancy.
 +
 
 +
==== Loss of a Connection ====
 +
<pre>
 +
            ┌───────┐
 +
      ┌────┤ Node1 ├────┐
 +
      │    └───────┘    X
 +
  ┌───┴───┐        ┌───┴───┐
 +
  │ Node2 ├─────────┤ Node3 │
 +
  └───────┘        └───────┘
 +
</pre>
 +
If one of the connections is failing, for example between Node 1 and Node 3, the resulting behavior depends on the chosen setup variant. For the '''RSTP''' and '''Routed (with fallback)''', everything will '''stay functioning'''. With RSTP it might take a short moment for it to remove the artificial cut-off. With a bit of luck, the artificial cut-off was exactly at the failed connection. When using the Routed (with fallback) setup, the traffic that used to go directly between Node 1 and Node 3 will now be routed via Node 2, resulting in a bit higher latency. This can have a bit of an impact on performance, but the cluster will stay fully functional.
 +
 
 +
For the '''Broadcast''' and '''Routed (simple)''' setups, such a situation is more '''problematic''', because the nodes now have different views of the network: Node 2 can still communicate with Node 3, while Node 1 no longer can. You will see behavior such as Ceph showing the services on Node 3 to be down.
 +
To reduce the chances of a failed connection, you could combine the Broadcast and Routed (simple) with a bond to increase fault tolerance at the cost of more NICs and cables.
 +
 
 +
== Example ==
 +
 
 +
3 servers:
 +
* Node1 with IP addresses x.x.x.50
 +
* Node2 with IP addresses x.x.x.51
 +
* Node3 with IP addresses x.x.x.52
 +
 
 +
3 to 4 Network ports in each server:
 +
* ens18, ens19 will be used for the actual full mesh. Physical direct connections to the other two servers, 10.15.15.y/24
 +
* ens20 connection to WAN (internet/router), used by vmbr0, 192.168.2.y
 +
* ens21 (optional) LAN (for cluster traffic, etc.) 10.14.14.y
 +
 
 +
Direct connections between servers:
 +
* Node1/ens18 - Node2/ens19
 +
* Node2/ens18 - Node3/ens19
 +
* Node3/ens18 - Node1/ens19
 +
 
 +
Please adapt the NIC names and IP addresses according to your situation.
 +
 
 +
<pre>
 +
                    ┌───────────┐
 +
                    │  Node1  │
 +
                    ├─────┬─────┤
 +
                    │ens18│ens19│
 +
                    └──┬──┴──┬──┘
 +
                      │    │
 +
┌───────┬─────┐        │    │        ┌─────┬───────┐
 +
│      │ens19├────────┘    └────────┤ens18│      │
 +
│ Node2 ├─────┤                      ├─────┤ Node3 │
 +
│      │ens18├───────────────────────┤ens19│      │
 +
└───────┴─────┘                      └─────┴───────┘
 +
</pre>
 +
 
 +
== Routed Setup (with Fallback) ==
 +
 
 +
We can make use of the OpenFabric protocol, which ''is a routing protocol derived from IS-IS, providing link-state routing''. [https://frrouting.org/ FRR] has a working [https://docs.frrouting.org/en/latest/fabricd.html implementation].
 +
 
 +
{{Note|This will not work in combination with the EVPN functionality from the [https://pve.proxmox.com/pve-docs/chapter-pvesdn.html Proxmox VE SDN] as it will overwrite our FRR configuration}}
 +
{{Note|If you install Ceph afterwards, you will have to do the initial Ceph configuration on the command line}}
  
== Method 1 ==
+
First install FRR:
Create a "broadcast" bond with the given interfaces on every node.
+
 
This can be done over the GUI or on the command-line.
+
apt install frr
 +
 
 +
==== /etc/frr/daemons ====
 +
Enable the OpenFabric daemon by setting the fabricd line in /etc/frr/daemons to "yes":
 +
<pre>
 +
[...]
 +
fabricd=yes
 +
[...]
 +
</pre>
 +
 
 +
==== /etc/frr/frr.conf ====
 +
 
 +
In this config file, 3 parameters need to be changed for each node:
 +
* hostname
 +
* IP address
 +
* NET
  
=== GUI ===
+
The IP addresses, in our example, are the ones in the 10.15.15.y/24 network.
On the GUI go to the node level -> System -> Network.
 
Then click on "Create" and select "Linux Bond".
 
In the Wizard make your configuration without a gateway and set mode to "broadcast".
 
  
Reboot the node to activate the new network settings.
+
The System ID in the NET (network entity title) needs to be unique. For example we can use the following ones for the three nodes:
 +
* 49.0001.1111.1111.1111.00
 +
* 49.0001.2222.2222.2222.00
 +
* 49.0001.3333.3333.3333.00
  
=== Command-Line ===
 
Add the following lines to '/etc/network/interfaces'.
 
  
 +
By configuring very short interval times, we can achieve almost instantaneous failover.
 
<pre>
 
<pre>
auto bond<No>
+
# default to using syslog. /etc/rsyslog.d/45-frr.conf places the log in
iface bond<No> inet static
+
# /var/log/frr/frr.log
address  <IP>
+
#
netmask <Netmask>
+
# Note:
slaves <Nic1> <Nic2>
+
# FRR's configuration shell, vtysh, dynamically edits the live, in-memory
bond_miimon 100
+
# configuration while FRR is running. When instructed, vtysh will persist the
bond_mode broadcast
+
# live configuration to this file, overwriting its contents. If you want to
#Full Mesh
+
# avoid this, you can edit this file manually before starting FRR, or instruct
 +
# vtysh to write configuration to a different file.
 +
 
 +
frr defaults traditional
 +
hostname node1
 +
log syslog informational
 +
ip forwarding
 +
no ipv6 forwarding
 +
service integrated-vtysh-config
 +
!
 +
interface lo
 +
ip address 10.15.15.50/32
 +
ip router openfabric 1
 +
openfabric passive
 +
!
 +
interface ens18
 +
ip router openfabric 1
 +
openfabric csnp-interval 2
 +
openfabric hello-interval 1
 +
openfabric hello-multiplier 2
 +
!
 +
interface ens19
 +
ip router openfabric 1
 +
  openfabric csnp-interval 2
 +
  openfabric hello-interval 1
 +
openfabric hello-multiplier 2
 +
!
 +
line vty
 +
!
 +
router openfabric 1
 +
net 49.0001.1111.1111.1111.00
 +
lsp-gen-interval 1
 +
max-lsp-lifetime 600
 +
lsp-refresh-interval 180
 
</pre>
 
</pre>
  
Then start the bond
+
==== /etc/network/interfaces ====
 +
The network configuration itself is rather simple. We need to bring the interfaces used for the mesh network up. If we plan to use a large MTU, configure it here.
 +
 
 +
Note the last line! It causes FRR to be restarted whenever changes to the network configuration are applied, for example by clicking the "Apply Configuration" button in the Proxmox VE GUI.
 
<pre>
 
<pre>
ifup bond<No>
+
auto lo
 +
iface lo inet loopback
 +
 
 +
iface ens20 inet manual
 +
 
 +
auto ens21
 +
iface ens21 inet static
 +
        address  10.14.14.51
 +
        netmask  255.255.255.0
 +
 
 +
auto vmbr0
 +
iface vmbr0 inet static
 +
        address  192.168.2.51
 +
        netmask  255.255.240.0
 +
        gateway  192.168.2.1
 +
        bridge_ports ens20
 +
        bridge_stp off
 +
        bridge_fd 0
 +
 
 +
auto ens18
 +
iface ens18 inet static
 +
        mtu 9000
 +
 
 +
auto ens19
 +
iface ens19 inet static
 +
        mtu 9000
 +
 
 +
post-up /usr/bin/systemctl restart frr.service
 
</pre>
 
</pre>
  
== Method 2 ==
+
To apply the changes without a reboot, run the following commands:
 +
 
 +
ifreload -a
 +
systemctl restart frr.service
 +
 
 +
You can check the status via the FRR CLI.
 +
 
 +
vtysh
 +
 
 +
Then enter one of the commands detailed in the [https://docs.frrouting.org/en/latest/fabricd.html#showing-openfabric-information FRR OpenFabric documentation]. For example, "show openfabric route".
 +
 
 +
 
 +
==== Ceph Initialization ====
 +
 
 +
Since the IPs are configured in the FRR configuration and not in the /etc/network/interfaces file, the Ceph GUI configuration assistant won't show it to you. To do the initial Ceph configuration, run
 +
 
 +
pveceph init --network 10.15.15.50/24
 +
 
 +
If you plan to use the Ceph Cluster Network on a different network, add the "--cluster-network" option.
 +
Next you will need to create the first monitor to get Ceph running. Either do that on the GUI or via the CLI on the node you just installed Ceph on with
 +
 
 +
pveceph mon create
 +
 
 +
== Routed Setup (Simple) ==
 +
 
 +
The 3 nodes have to be configured as described in the following sections.
 +
Note that multicast is not possible with this method.
 +
 
 
=== Node1 ===
 
=== Node1 ===
 
==== /etc/network/interfaces ====
 
==== /etc/network/interfaces ====
Line 45: Line 237:
 
iface lo inet loopback
 
iface lo inet loopback
  
iface eth2 inet manual
+
iface ens20 inet manual
  
auto eth3
+
auto ens21
iface eth3 inet static
+
iface ens21 inet static
 
         address  10.14.14.50
 
         address  10.14.14.50
 
         netmask  255.255.255.0
 
         netmask  255.255.255.0
  
 
# Connected to Node2 (.51)
 
# Connected to Node2 (.51)
auto eth0
+
auto ens18
iface eth0 inet static
+
iface ens18 inet static
 
         address  10.15.15.50
 
         address  10.15.15.50
 
         netmask  255.255.255.0
 
         netmask  255.255.255.0
         up route add -net 10.15.15.51 netmask 255.255.255.255 dev eth0
+
         up ip route add 10.15.15.51/32 dev ens18
         down route del -net 10.15.15.51 netmask 255.255.255.255 dev eth0
+
         down ip route del 10.15.15.51/32
  
 
# Connected to Node3 (.52)
 
# Connected to Node3 (.52)
auto eth1
+
auto ens19
iface eth1 inet static
+
iface ens19 inet static
 
         address  10.15.15.50
 
         address  10.15.15.50
 
         netmask  255.255.255.0
 
         netmask  255.255.255.0
         up route add -net 10.15.15.52 netmask 255.255.255.255 dev eth1
+
         up ip route add 10.15.15.52/32 dev ens19
         down route del -net 10.15.15.52 netmask 255.255.255.255 dev eth1
+
         down ip route del 10.15.15.52/32
  
 
auto vmbr0
 
auto vmbr0
Line 73: Line 265:
 
         netmask  255.255.240.0
 
         netmask  255.255.240.0
 
         gateway  192.168.2.1
 
         gateway  192.168.2.1
         bridge_ports eth2
+
         bridge_ports ens20
 
         bridge_stp off
 
         bridge_stp off
 
         bridge_fd 0
 
         bridge_fd 0
 +
 
</pre>
 
</pre>
  
==== route ====
+
==== Route ====
 
<pre>
 
<pre>
root@pve-2-50:~# route -n
+
root@pve-2-50:~# ip route
Kernel IP routing table
+
default via 192.168.2.1 dev vmbr0 onlink
Destination    Gateway        Genmask        Flags Metric Ref    Use Iface
+
10.14.14.0/24 dev ens21 proto kernel scope link src 10.14.14.50
10.15.15.51    0.0.0.0         255.255.255.255 UH    0      0        0 eth0
+
10.15.15.0/24 dev ens18 proto kernel scope link src 10.15.15.50
10.15.15.52    0.0.0.0        255.255.255.255 UH    0      0        0 eth1
+
10.15.15.0/24 dev ens19 proto kernel scope link src 10.15.15.50
10.15.15.0     0.0.0.0        255.255.255.0  U    0      0        0 eth0
+
10.15.15.52 dev ens19 scope link
10.15.15.0      0.0.0.0        255.255.255.0  U    0      0        0 eth1
+
10.15.15.51 dev ens18 scope link
10.14.14.0      0.0.0.0        255.255.255.0  U    0      0        0 eth3
+
192.168.0.0/20 dev vmbr0 proto kernel scope link src 192.168.2.50
192.168.0.0     0.0.0.0        255.255.240.0  U    0      0        0 vmbr0
 
0.0.0.0        192.168.2.1    0.0.0.0        UG    0      0        0 vmbr0
 
 
</pre>
 
</pre>
  
Line 98: Line 289:
 
iface lo inet loopback
 
iface lo inet loopback
  
iface eth2 inet manual
+
iface ens20 inet manual
  
auto eth3
+
auto ens21
iface eth3 inet static
+
iface ens21 inet static
 
         address  10.14.14.51
 
         address  10.14.14.51
 
         netmask  255.255.255.0
 
         netmask  255.255.255.0
  
# Connected to Node1 (.50)
+
# Connected to Node3 (.52)
auto eth0
+
auto ens18
iface eth0 inet static
+
iface ens18 inet static
 
         address  10.15.15.51
 
         address  10.15.15.51
 
         netmask  255.255.255.0
 
         netmask  255.255.255.0
         up route add -net 10.15.15.50 netmask 255.255.255.255 dev eth0
+
         up ip route add 10.15.15.52/32 dev ens18
         down route del -net 10.15.15.50 netmask 255.255.255.255 dev eth0
+
         down ip route del 10.15.15.52/32
  
# Connected to Node3 (.52)
+
# Connected to Node1 (.50)
auto eth1
+
auto ens19
iface eth1 inet static
+
iface ens19 inet static
 
         address  10.15.15.51
 
         address  10.15.15.51
 
         netmask  255.255.255.0
 
         netmask  255.255.255.0
         up route add -net 10.15.15.52 netmask 255.255.255.255 dev eth1
+
         up ip route add 10.15.15.50/32 dev ens19
         down route del -net 10.15.15.52 netmask 255.255.255.255 dev eth1
+
         down ip route del 10.15.15.50/32
  
 
auto vmbr0
 
auto vmbr0
Line 126: Line 317:
 
         netmask  255.255.240.0
 
         netmask  255.255.240.0
 
         gateway  192.168.2.1
 
         gateway  192.168.2.1
         bridge_ports eth2
+
         bridge_ports ens20
 
         bridge_stp off
 
         bridge_stp off
 
         bridge_fd 0
 
         bridge_fd 0
 +
 
</pre>
 
</pre>
  
==== route ====
+
==== Route ====
 
<pre>
 
<pre>
root@pve-2-51:/# route -n
+
root@pve-2-51:/# ip route
Kernel IP routing table
+
default via 192.168.2.1 dev vmbr0 onlink
Destination    Gateway        Genmask        Flags Metric Ref    Use Iface
+
10.14.14.0/24 dev ens21 proto kernel scope link src 10.14.14.51
10.15.15.50    0.0.0.0         255.255.255.255 UH    0      0        0 eth0
+
10.15.15.0/24 dev ens18 proto kernel scope link src 10.15.15.51
10.15.15.52    0.0.0.0        255.255.255.255 UH    0      0        0 eth1
+
10.15.15.0/24 dev ens19 proto kernel scope link src 10.15.15.51
10.15.15.0     0.0.0.0        255.255.255.0  U    0      0        0 eth0
+
10.15.15.52 dev ens18 scope link
10.15.15.0      0.0.0.0        255.255.255.0  U    0      0        0 eth1
+
10.15.15.50 dev ens19 scope link
10.14.14.0      0.0.0.0        255.255.255.0  U    0      0        0 eth3
+
192.168.0.0/20 dev vmbr0 proto kernel scope link src 192.168.2.51
192.168.0.0     0.0.0.0        255.255.240.0  U    0      0        0 vmbr0
 
0.0.0.0        192.168.2.1    0.0.0.0        UG    0      0        0 vmbr0
 
 
</pre>
 
</pre>
  
Line 151: Line 341:
 
iface lo inet loopback
 
iface lo inet loopback
  
iface eth2 inet manual
+
iface ens20 inet manual
  
auto eth3
+
auto ens21
iface eth3 inet static
+
iface ens21 inet static
 
         address  10.14.14.52
 
         address  10.14.14.52
 
         netmask  255.255.255.0
 
         netmask  255.255.255.0
  
# Connected to Node2 (.51)
+
# Connected to Node1 (.50)
auto eth0
+
auto ens18
iface eth0 inet static
+
iface ens18 inet static
 
         address  10.15.15.52
 
         address  10.15.15.52
 
         netmask  255.255.255.0
 
         netmask  255.255.255.0
         up route add -net 10.15.15.51 netmask 255.255.255.255 dev eth0
+
         up ip route add 10.15.15.50/32 dev ens18
         down route del -net 10.15.15.51 netmask 255.255.255.255 dev eth0
+
         down ip route del 10.15.15.50/32
  
# Connected to Node1 (.50)
+
# Connected to Node2 (.51)
auto eth1
+
auto ens19
iface eth1 inet static
+
iface ens19 inet static
 
         address  10.15.15.52
 
         address  10.15.15.52
 
         netmask  255.255.255.0
 
         netmask  255.255.255.0
         up route add -net 10.15.15.50 netmask 255.255.255.255 dev eth1
+
         up ip route add 10.15.15.51/32 dev ens19
         down route del -net 10.15.15.50 netmask 255.255.255.255 dev eth1
+
         down ip route del 10.15.15.51/32
  
 
auto vmbr0
 
auto vmbr0
Line 179: Line 369:
 
         netmask  255.255.240.0
 
         netmask  255.255.240.0
 
         gateway  192.168.2.1
 
         gateway  192.168.2.1
         bridge_ports eth2
+
         bridge_ports ens20
 
         bridge_stp off
 
         bridge_stp off
 
         bridge_fd 0
 
         bridge_fd 0
 
</pre>
 
</pre>
  
==== route ====
+
==== Route ====
 +
<pre>
 +
root@pve-2-52:~# ip route
 +
default via 192.168.2.1 dev vmbr0 onlink
 +
10.14.14.0/24 dev ens21 proto kernel scope link src 10.14.14.52
 +
10.15.15.0/24 dev ens18 proto kernel scope link src 10.15.15.52
 +
10.15.15.0/24 dev ens19 proto kernel scope link src 10.15.15.52
 +
10.15.15.51 dev ens19 scope link
 +
10.15.15.50 dev ens18 scope link
 +
192.168.0.0/20 dev vmbr0 proto kernel scope link src 192.168.2.52
 +
</pre>
 +
 
 +
== RSTP Loop Setup ==
 +
 
 +
This setup requires Open vSwitch (OVS), as it supports RSTP (Rapid Spanning Tree Protocol). The Linux bridge itself only supports STP (without the rapid), which usually takes too long to react to a changed topology. In our tests, the RSTP setup recovered from one network connection going down within a few seconds, while STP took about 30 seconds. This is long enough for Ceph to start complaining and throw some warnings.
 +
 
 +
First the package `openvswitch-switch` needs to be installed on all nodes:
 +
 
 +
apt install openvswitch-switch
 +
 
 +
The network configuration will look the same for each node, except for the IP addresses.
 +
 
 +
==== /etc/network/interfaces ====
 +
<pre>
 +
auto lo
 +
iface lo inet loopback
 +
 
 +
iface ens20 inet manual
 +
 
 +
auto ens21
 +
iface ens21 inet static
 +
        address  10.14.14.51
 +
        netmask  255.255.255.0
 +
 
 +
auto vmbr0
 +
iface vmbr0 inet static
 +
        address  192.168.2.51
 +
        netmask  255.255.240.0
 +
        gateway  192.168.2.1
 +
        bridge_ports ens20
 +
        bridge_stp off
 +
        bridge_fd 0
 +
 
 +
auto ens18
 +
iface ens18 inet manual
 +
    ovs_type OVSPort
 +
    ovs_bridge vmbr1
 +
    ovs_options other_config:rstp-enable=true other_config:rstp-path-cost=150 other_config:rstp-port-admin-edge=false other_config:rstp-port-auto-edge=false other_config:rstp-port-mcheck=true vlan_mode=native-untagged
 +
 
 +
auto ens19
 +
iface ens19 inet manual
 +
    ovs_type OVSPort
 +
    ovs_bridge vmbr1
 +
    ovs_options other_config:rstp-enable=true other_config:rstp-path-cost=150 other_config:rstp-port-admin-edge=false other_config:rstp-port-auto-edge=false other_config:rstp-port-mcheck=true vlan_mode=native-untagged
 +
 
 +
auto vmbr1
 +
iface vmbr1 inet static
 +
    address 10.15.15.50/24
 +
    ovs_type OVSBridge
 +
    ovs_ports ens18 ens19
 +
    up ovs-vsctl set Bridge ${IFACE} rstp_enable=true other_config:rstp-priority=32768 other_config:rstp-forward-delay=4 other_config:rstp-max-age=6
 +
    post-up sleep 10
 +
</pre>
 +
 
 +
If needed, you can set the MTU with `ovs_mtu 9000` in the `vmbr1`, `ens18` and `ens19` configs.
 +
You can check the RSTP status with
 +
 
 +
ovs-appctl rstp/show
 +
 
 +
== Broadcast Setup ==
 +
Create a "broadcast" bond with the given interfaces on every node.
 +
This can be done over the GUI or on the command-line.
 +
 
 +
=== GUI ===
 +
On the GUI go to the node level -> System -> Network.
 +
Then click on "Create" and select "Linux Bond".
 +
In the Wizard make your configuration without a gateway and set mode to "broadcast".
 +
 
 +
Reboot the node to activate the new network settings.
 +
 
 +
=== Command-Line ===
 +
Add the following lines to '/etc/network/interfaces'.
 +
 
 +
<pre>
 +
auto bond<No>
 +
iface bond<No> inet static
 +
address  <IP>
 +
netmask  <Netmask>
 +
slaves <Nic1> <Nic2>
 +
bond_miimon 100
 +
bond_mode broadcast
 +
#Full Mesh
 +
</pre>
 +
 
 +
Then start the bond
 
<pre>
 
<pre>
root@pve-2-52:~# route -n
+
ifup bond<No>
Kernel IP routing table
 
Destination    Gateway        Genmask        Flags Metric Ref    Use Iface
 
10.15.15.51    0.0.0.0        255.255.255.255 UH    0      0        0 eth0
 
10.15.15.50    0.0.0.0        255.255.255.255 UH    0      0        0 eth1
 
10.15.15.0      0.0.0.0        255.255.255.0  U    0      0        0 eth0
 
10.15.15.0      0.0.0.0        255.255.255.0  U    0      0        0 eth1
 
10.14.14.0      0.0.0.0        255.255.255.0  U    0      0        0 eth3
 
192.168.0.0    0.0.0.0        255.255.240.0  U    0      0        0 vmbr0
 
0.0.0.0        192.168.2.1    0.0.0.0        UG    0      0        0 vmbr0
 
 
</pre>
 
</pre>
  
[[Category: HOWTO]] [[Category: Cluster]] [[Category: Technology]]
+
On Node1 of the example setup described above, /etc/network/interfaces will look as follows:
 +
 
 +
<pre>
 +
iface lo inet loopback
 +
 
 +
iface ens20 inet manual
 +
 
 +
auto ens21
 +
iface ens21 inet static
 +
        address  10.14.14.50
 +
        netmask  255.255.255.0
 +
 
 +
 
 +
iface ens18 inet manual
 +
 
 +
iface ens19 inet manual
 +
 
 +
auto bond0
 +
iface bond0 inet static
 +
      address 10.15.15.50
 +
      netmask 255.255.255.0
 +
      slaves ens18 ens19
 +
      bond_miimon 100
 +
      bond_mode broadcast
 +
 
 +
 
 +
auto vmbr0
 +
iface vmbr0 inet static
 +
        address  192.168.2.50
 +
        netmask  255.255.240.0
 +
        gateway  192.168.2.1
 +
        bridge_ports ens20
 +
        bridge_stp off
 +
        bridge_fd 0
 +
</pre>
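
Once the bond is up, it can be worth confirming that it really runs in broadcast mode. The following is only a sketch: `bond_mode` is a hypothetical helper name, and it assumes the bond is called `bond0` and that the kernel bonding driver reports its state under `/proc/net/bonding/`.

```shell
# Sketch: print the mode reported by the Linux bonding driver.
# bond_mode is a hypothetical helper; the status file path is an assumption
# that matches the bond0 example above.
bond_mode() {
    local status_file="${1:-/proc/net/bonding/bond0}"
    # Keep only the text after the "Bonding Mode: " label
    sed -n 's/^Bonding Mode: //p' "$status_file"
}

# Example: bond_mode
# Example with another bond: bond_mode /proc/net/bonding/bond1
```

On a node configured as above, the printed mode should mention "broadcast"; anything else suggests the `bond_mode` line was not applied.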
 +
 
 +
 
 +
[[Category: HOWTO]] [[Category: Cluster]]

Revision as of 15:52, 14 September 2022

Introduction

This wiki page describes how to configure a three-node "Meshed Network" on Proxmox VE (or any other Debian-based Linux distribution). Such a network can be used, for example, to connect Ceph Servers or nodes in a Proxmox VE Cluster with the maximum possible bandwidth and without using a switch. We recommend using switches for clusters larger than 3 nodes, or if a 3-node cluster is expected to grow in the future.

The big advantage of this setup is that you can achieve a fast network connection between the nodes (10, 25, 40 or 100Gbit/s) WITHOUT buying expensive switches which can handle these fast speeds.

You need at least two available NICs in each server which each connect to one of the other servers.

            ┌───────┐
       ┌────┤ Node1 ├────┐
       │    └───────┘    │
   ┌───┴───┐         ┌───┴───┐
   │ Node2 ├─────────┤ Node3 │
   └───────┘         └───────┘

There are a few possible ways to set up such a network:

  1. Routed (with fallback): Each packet is sent to the addressed node only. If the direct connection is down, the packets will be routed via the node in between.
  2. Routed (simple): Each packet is sent to the addressed node only
  3. RSTP: A loop with the rapid spanning tree protocol enabled
  4. Broadcast: Each packet is sent to both other nodes

Each setup has benefits and caveats. The simple routed one does not need any additional software and delivers the best performance. The routed with fallback approach delivers similar performance but can handle the loss of one connection in the mesh by routing the traffic via the middle node. Because of this, performance could be impacted in such a scenario.

The RSTP setup gives you fault tolerance similar to the routed setup with fallback. While the loop is complete, RSTP creates an artificial cut-off between two nodes, e.g. between Node 1 and Node 3. Node 2 then sits between Node 1 and Node 3, and the traffic between those two goes via Node 2. Should a cable or NIC fail somewhere else, for example between Node 1 and Node 2, RSTP will remove the cut-off within a few seconds. Node 3 is now in between Node 1 and Node 2 and has to handle that traffic as well. Once the broken part has been replaced and the loop is complete again, RSTP will introduce another artificial cut-off.

The advantage of the broadcast method is an easier setup process, but it will send all data to both other nodes, using up more bandwidth.

Failure Scenarios

Loss of a Node

            ┌───────┐
       ┌────┤ Node1 ├────┐
       │    └───────┘    │
   ┌───┴───┐         ┌───┴───┐
   │ Node2 ├─────────┤ XXXXX │
   └───────┘         └───────┘

If a node goes down, for example Node 3, the Ceph and Proxmox VE clusters will keep functioning, though with reduced redundancy.

Loss of a Connection

            ┌───────┐
       ┌────┤ Node1 ├────┐
       │    └───────┘    X
   ┌───┴───┐         ┌───┴───┐
   │ Node2 ├─────────┤ Node3 │
   └───────┘         └───────┘

If one of the connections is failing, for example between Node 1 and Node 3, the resulting behavior depends on the chosen setup variant. For the RSTP and Routed (with fallback), everything will stay functioning. With RSTP it might take a short moment for it to remove the artificial cut-off. With a bit of luck, the artificial cut-off was exactly at the failed connection. When using the Routed (with fallback) setup, the traffic that used to go directly between Node 1 and Node 3 will now be routed via Node 2, resulting in a bit higher latency. This can have a bit of an impact on performance, but the cluster will stay fully functional.

For the Broadcast and Routed (simple) setups, such a situation is more problematic, because the nodes now have different views of the network: Node 2 can still communicate with Node 3, while Node 1 no longer can. You will see behavior such as Ceph showing the services on Node 3 to be down. To reduce the chances of a failed connection, you could run the Broadcast or Routed (simple) setup on top of bonds, which increases fault tolerance at the cost of more NICs and cables.

Example

3 servers:

  • Node1 with IP addresses x.x.x.50
  • Node2 with IP addresses x.x.x.51
  • Node3 with IP addresses x.x.x.52

3 to 4 Network ports in each server:

  • ens18, ens19 will be used for the actual full mesh. Physical direct connections to the other two servers, 10.15.15.y/24
  • ens20 connection to WAN (internet/router), used by vmbr0, 192.168.2.y
  • ens21 (optional) LAN (for cluster traffic, etc.) 10.14.14.y

Direct connections between servers:

  • Node1/ens18 - Node2/ens19
  • Node2/ens18 - Node3/ens19
  • Node3/ens18 - Node1/ens19

Please adapt the NIC names and IP addresses according to your situation.

                    ┌───────────┐
                    │   Node1   │
                    ├─────┬─────┤
                    │ens18│ens19│
                    └──┬──┴──┬──┘
                       │     │
┌───────┬─────┐        │     │        ┌─────┬───────┐
│       │ens19├────────┘     └────────┤ens18│       │
│ Node2 ├─────┤                       ├─────┤ Node3 │
│       │ens18├───────────────────────┤ens19│       │
└───────┴─────┘                       └─────┴───────┘
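
The cabling above forms a ring: ens18 always leads to the next node and ens19 to the previous one. As a rough sketch of that rule, the hypothetical helper `mesh_peers` below prints, for a given node number, which peer sits behind each NIC. It assumes the example addressing (Node N has 10.15.15.(49+N)) and the NIC names used on this page.

```shell
# Sketch: list "<nic> <peer-node> <peer-ip>" for node N (1..3).
# mesh_peers is a hypothetical helper, valid only for the example cabling
# and the 10.15.15.y addressing used on this page.
mesh_peers() {
    local n="$1"
    local via18=$(( n % 3 + 1 ))        # ens18 is cabled to the next node in the ring
    local via19=$(( (n + 1) % 3 + 1 ))  # ens19 is cabled to the previous node
    echo "ens18 Node${via18} 10.15.15.$(( 49 + via18 ))"
    echo "ens19 Node${via19} 10.15.15.$(( 49 + via19 ))"
}

# Example: mesh_peers 1
```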

Routed Setup (with Fallback)

We can make use of the OpenFabric protocol, which is a routing protocol derived from IS-IS, providing link-state routing. FRR has a working implementation.

Note: This will not work in combination with the EVPN functionality from the Proxmox VE SDN, as it will overwrite our FRR configuration.
Note: If you install Ceph afterwards, you will have to do the initial Ceph configuration on the command line.

First install FRR:

apt install frr

/etc/frr/daemons

Enable the OpenFabric daemon by setting the fabricd line in /etc/frr/daemons to "yes":

[...]
fabricd=yes
[...]

/etc/frr/frr.conf

In this config file, 3 parameters need to be changed for each node:

  • hostname
  • IP address
  • NET

The IP addresses, in our example, are the ones in the 10.15.15.y/24 network.

The System ID in the NET (network entity title) needs to be unique. For example we can use the following ones for the three nodes:

  • 49.0001.1111.1111.1111.00
  • 49.0001.2222.2222.2222.00
  • 49.0001.3333.3333.3333.00


By configuring very short interval times, we can achieve almost instantaneous failover.

# default to using syslog. /etc/rsyslog.d/45-frr.conf places the log in
# /var/log/frr/frr.log
#
# Note:
# FRR's configuration shell, vtysh, dynamically edits the live, in-memory
# configuration while FRR is running. When instructed, vtysh will persist the
# live configuration to this file, overwriting its contents. If you want to
# avoid this, you can edit this file manually before starting FRR, or instruct
# vtysh to write configuration to a different file.

frr defaults traditional
hostname node1
log syslog informational
ip forwarding
no ipv6 forwarding
service integrated-vtysh-config
!
interface lo
 ip address 10.15.15.50/32
 ip router openfabric 1
 openfabric passive
!
interface ens18
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
!
interface ens19
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
!
line vty
!
router openfabric 1
 net 49.0001.1111.1111.1111.00
 lsp-gen-interval 1
 max-lsp-lifetime 600
 lsp-refresh-interval 180
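To see roughly how fast a dead neighbor is detected with these settings, a small calculation helps. The sketch below assumes the usual IS-IS rule that the holding time of a neighbor is the hello interval multiplied by the hello multiplier; the variable names are only illustrative.

```shell
# Values taken from the interface stanzas above.
hello_interval=1      # seconds between hello packets
hello_multiplier=2    # multiplier applied to the hello interval

# Assumption: holding time = hello-interval * hello-multiplier.
holding_time=$((hello_interval * hello_multiplier))
echo "neighbor is considered down after about ${holding_time}s without hellos"
```

Larger values make the mesh more tolerant of short hiccups at the cost of a slower failover.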

/etc/network/interfaces

The network configuration itself is rather simple. We need to bring the interfaces used for the mesh network up. If we plan to use a large MTU, configure it here.

Note the last line: it causes FRR to be restarted whenever changes to the network configuration are applied, for example by clicking the "Apply Configuration" button in the Proxmox VE GUI.

auto lo
iface lo inet loopback

iface ens20 inet manual

auto ens21
iface ens21 inet static
        address  10.14.14.51
        netmask  255.255.255.0

auto vmbr0
iface vmbr0 inet static
        address  192.168.2.51
        netmask  255.255.240.0
        gateway  192.168.2.1
        bridge_ports ens20
        bridge_stp off
        bridge_fd 0

auto ens18
iface ens18 inet static
        mtu 9000

auto ens19
iface ens19 inet static
        mtu 9000

post-up /usr/bin/systemctl restart frr.service

To apply the changes without a reboot, run the following commands:

ifreload -a
systemctl restart frr.service

You can check the status via the FRR CLI.

vtysh

Then enter one of the commands detailed in the FRR OpenFabric documentation. For example, "show openfabric route".
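The same check can be run without entering the interactive shell, and a configured large MTU can be verified with a ping that prohibits fragmentation. The following is a sketch, not a required step: the function names are made up for this example, the peer address is one of the example addresses of this article, and the payload size assumes plain IPv4 (20 byte IP header plus 8 byte ICMP header).

```shell
# Run a single FRR command non-interactively.
show_mesh_routes() {
    vtysh -c 'show openfabric route'
}

# ICMP payload that exactly fills one packet of the given MTU:
# MTU minus 20 bytes IPv4 header and 8 bytes ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Ping a peer with fragmentation prohibited (-M do); this fails if
# packets of the given MTU (default 9000) do not fit on the path.
check_jumbo() {
    peer="$1"
    mtu="${2:-9000}"
    ping -M do -c 3 -s "$(icmp_payload_for_mtu "$mtu")" "$peer"
}

# Example, run on node1 once FRR is up on all nodes:
#   show_mesh_routes
#   check_jumbo 10.15.15.51
```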


Ceph Initialization

Since the IPs are configured in the FRR configuration and not in the /etc/network/interfaces file, the Ceph GUI configuration assistant will not offer them. To do the initial Ceph configuration, run

pveceph init --network 10.15.15.50/24

If you plan to use the Ceph Cluster Network on a different network, add the "--cluster-network" option. Next you will need to create the first monitor to get Ceph running. Do that either in the GUI or via the CLI, on the node you just installed Ceph on, with

pveceph mon create

Routed Setup (Simple)

The 3 nodes have to be configured as described in the following sections. Note that multicast is not possible with this method.

Node1

/etc/network/interfaces

auto lo
iface lo inet loopback

iface ens20 inet manual

auto ens21
iface ens21 inet static
        address  10.14.14.50
        netmask  255.255.255.0

# Connected to Node2 (.51)
auto ens18
iface ens18 inet static
        address  10.15.15.50
        netmask  255.255.255.0
        up ip route add 10.15.15.51/32 dev ens18
        down ip route del 10.15.15.51/32

# Connected to Node3 (.52)
auto ens19
iface ens19 inet static
        address  10.15.15.50
        netmask  255.255.255.0
        up ip route add 10.15.15.52/32 dev ens19
        down ip route del 10.15.15.52/32

auto vmbr0
iface vmbr0 inet static
        address  192.168.2.50
        netmask  255.255.240.0
        gateway  192.168.2.1
        bridge_ports ens20
        bridge_stp off
        bridge_fd 0

Route

root@pve-2-50:~# ip route
default via 192.168.2.1 dev vmbr0 onlink 
10.14.14.0/24 dev ens21 proto kernel scope link src 10.14.14.50 
10.15.15.0/24 dev ens18 proto kernel scope link src 10.15.15.50 
10.15.15.0/24 dev ens19 proto kernel scope link src 10.15.15.50 
10.15.15.52 dev ens19 scope link 
10.15.15.51 dev ens18 scope link 
192.168.0.0/20 dev vmbr0 proto kernel scope link src 192.168.2.50 
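The last route above shows 192.168.0.0/20, although vmbr0 is configured with 192.168.2.50 and netmask 255.255.240.0. The following sketch reproduces that conversion with plain shell arithmetic; the function names are hypothetical helpers written for this example.

```shell
# Convert a dotted netmask to a prefix length by counting the set bits.
mask_to_prefix() {
    bits=0
    old_ifs=$IFS; IFS=.
    set -- $1                 # split the netmask into its four octets
    IFS=$old_ifs
    for octet in "$@"; do
        while [ "$octet" -gt 0 ]; do
            bits=$((bits + (octet & 1)))
            octet=$((octet >> 1))
        done
    done
    echo "$bits"
}

# Network address: bitwise AND of each address octet with the mask octet.
network_of() {
    old_ifs=$IFS; IFS=.
    set -- $1 $2              # $1-$4: address octets, $5-$8: mask octets
    IFS=$old_ifs
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
}

echo "$(network_of 192.168.2.50 255.255.240.0)/$(mask_to_prefix 255.255.240.0)"
```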

Node2

/etc/network/interfaces

auto lo
iface lo inet loopback

iface ens20 inet manual

auto ens21
iface ens21 inet static
        address  10.14.14.51
        netmask  255.255.255.0

# Connected to Node3 (.52)
auto ens18
iface ens18 inet static
        address  10.15.15.51
        netmask  255.255.255.0
        up ip route add 10.15.15.52/32 dev ens18
        down ip route del 10.15.15.52/32

# Connected to Node1 (.50)
auto ens19
iface ens19 inet static
        address  10.15.15.51
        netmask  255.255.255.0
        up ip route add 10.15.15.50/32 dev ens19
        down ip route del 10.15.15.50/32

auto vmbr0
iface vmbr0 inet static
        address  192.168.2.51
        netmask  255.255.240.0
        gateway  192.168.2.1
        bridge_ports ens20
        bridge_stp off
        bridge_fd 0

Route

root@pve-2-51:/# ip route
default via 192.168.2.1 dev vmbr0 onlink 
10.14.14.0/24 dev ens21 proto kernel scope link src 10.14.14.51 
10.15.15.0/24 dev ens18 proto kernel scope link src 10.15.15.51 
10.15.15.0/24 dev ens19 proto kernel scope link src 10.15.15.51 
10.15.15.52 dev ens18 scope link 
10.15.15.50 dev ens19 scope link 
192.168.0.0/20 dev vmbr0 proto kernel scope link src 192.168.2.51 

Node3

/etc/network/interfaces

auto lo
iface lo inet loopback

iface ens20 inet manual

auto ens21
iface ens21 inet static
        address  10.14.14.52
        netmask  255.255.255.0

# Connected to Node1 (.50)
auto ens18
iface ens18 inet static
        address  10.15.15.52
        netmask  255.255.255.0
        up ip route add 10.15.15.50/32 dev ens18
        down ip route del 10.15.15.50/32

# Connected to Node2 (.51)
auto ens19
iface ens19 inet static
        address  10.15.15.52
        netmask  255.255.255.0
        up ip route add 10.15.15.51/32 dev ens19
        down ip route del 10.15.15.51/32

auto vmbr0
iface vmbr0 inet static
        address  192.168.2.52
        netmask  255.255.240.0
        gateway  192.168.2.1
        bridge_ports ens20
        bridge_stp off
        bridge_fd 0

Route

root@pve-2-52:~# ip route
default via 192.168.2.1 dev vmbr0 onlink 
10.14.14.0/24 dev ens21 proto kernel scope link src 10.14.14.52 
10.15.15.0/24 dev ens18 proto kernel scope link src 10.15.15.52 
10.15.15.0/24 dev ens19 proto kernel scope link src 10.15.15.52 
10.15.15.51 dev ens19 scope link 
10.15.15.50 dev ens18 scope link 
192.168.0.0/20 dev vmbr0 proto kernel scope link src 192.168.2.52 
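The three configurations above follow a simple ring pattern: on every node, ens18 leads to the next node and ens19 to the previous one. As a sketch under that assumption (read off the examples above, with the addresses 10.15.15.50 to 10.15.15.52), the host routes per node can be generated as follows; `mesh_routes` is a hypothetical helper name.

```shell
# Print the two "up ip route add" lines for node index 0, 1 or 2.
# Assumed wiring: ens18 -> next node in the ring, ens19 -> previous node.
mesh_routes() {
    idx="$1"
    next=$(( (idx + 1) % 3 ))
    prev=$(( (idx + 2) % 3 ))
    echo "up ip route add 10.15.15.$((50 + next))/32 dev ens18"
    echo "up ip route add 10.15.15.$((50 + prev))/32 dev ens19"
}

for i in 0 1 2; do
    echo "# Node$((i + 1)) (10.15.15.$((50 + i)))"
    mesh_routes "$i"
done
```

If the cables are plugged in differently, the interface names in the generated lines have to be swapped accordingly.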

RSTP Loop Setup

This setup requires the use of Open vSwitch (OVS), as it supports RSTP (Rapid Spanning Tree Protocol). The Linux bridge itself only supports STP (without the rapid), which usually takes too long to react to a changed topology. In our tests, the RSTP setup recovered from one network connection going down within a few seconds, while STP took about 30 seconds. This is long enough for Ceph to start complaining and throw some warnings.

First the package `openvswitch-switch` needs to be installed on all nodes:

apt install openvswitch-switch

The network configuration will look the same for each node, except for the IP addresses.

/etc/network/interfaces

auto lo
iface lo inet loopback

iface ens20 inet manual

auto ens21
iface ens21 inet static
        address  10.14.14.51
        netmask  255.255.255.0

auto vmbr0
iface vmbr0 inet static
        address  192.168.2.51
        netmask  255.255.240.0
        gateway  192.168.2.1
        bridge_ports ens20
        bridge_stp off
        bridge_fd 0

auto ens18
iface ens18 inet manual
    ovs_type OVSPort
    ovs_bridge vmbr1
    ovs_options other_config:rstp-enable=true other_config:rstp-path-cost=150 other_config:rstp-port-admin-edge=false other_config:rstp-port-auto-edge=false other_config:rstp-port-mcheck=true vlan_mode=native-untagged

auto ens19
iface ens19 inet manual
    ovs_type OVSPort
    ovs_bridge vmbr1
    ovs_options other_config:rstp-enable=true other_config:rstp-path-cost=150 other_config:rstp-port-admin-edge=false other_config:rstp-port-auto-edge=false other_config:rstp-port-mcheck=true vlan_mode=native-untagged

auto vmbr1
iface vmbr1 inet static
    address 10.15.15.50/24
    ovs_type OVSBridge
    ovs_ports ens18 ens19
    up ovs-vsctl set Bridge ${IFACE} rstp_enable=true other_config:rstp-priority=32768 other_config:rstp-forward-delay=4 other_config:rstp-max-age=6
    post-up sleep 10

If needed, you can set the MTU with `ovs_mtu 9000` in the `vmbr1`, `ens18` and `ens19` configs. You can check the RSTP status with

ovs-appctl rstp/show
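The forward delay of 4 and the max age of 6 used in the bridge stanza are not arbitrary: the spanning tree timers have to satisfy the relationships defined in IEEE 802.1D. The sketch below checks them; the hello time of 2 seconds is an assumption (the 802.1D default), since it is not set in the configuration above, and `rstp_timers_ok` is a hypothetical helper.

```shell
# Values from the vmbr1 stanza; hello_time is assumed, not configured.
forward_delay=4
max_age=6
hello_time=2

# 802.1D timer relationships:
#   2 * (forward_delay - 1) >= max_age
#   max_age >= 2 * (hello_time + 1)
rstp_timers_ok() {
    [ $(( 2 * ($1 - 1) )) -ge "$2" ] && [ "$2" -ge $(( 2 * ($3 + 1) )) ]
}

if rstp_timers_ok "$forward_delay" "$max_age" "$hello_time"; then
    echo "timers are consistent"
else
    echo "timers violate the 802.1D relationships"
fi
```

When changing one of the two values, it is worth re-running the check before applying the configuration.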

Broadcast Setup

Create a "broadcast" bond with the given interfaces on every node. This can be done via the GUI or on the command line.

GUI

In the GUI, go to the node level -> System -> Network. Then click on "Create" and select "Linux Bond". In the wizard, configure the bond without a gateway and set the mode to "broadcast".

Reboot the node to activate the new network settings.

Command-Line

Add the following lines to '/etc/network/interfaces'.

auto bond<No>
iface bond<No> inet static
	address  <IP>
	netmask  <Netmask>
	slaves <Nic1> <Nic2>
	bond_miimon 100
	bond_mode broadcast
#Full Mesh

Then start the bond:

ifup bond<No>
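Once the bond is up, its mode and member NICs can be read from the status file of the Linux bonding driver. The following is a sketch: the function names are hypothetical, and the default path assumes the bond is called bond0.

```shell
# Succeeds if the bond described by the status file runs in broadcast mode.
# An alternative file can be passed as first argument.
bond_is_broadcast() {
    grep -q '^Bonding Mode:.*(broadcast)' "${1:-/proc/net/bonding/bond0}"
}

# List the NICs that are members of the bond, one per line.
bond_slaves() {
    sed -n 's/^Slave Interface: //p' "${1:-/proc/net/bonding/bond0}"
}

# Example, run on a node after "ifup bond0":
#   bond_is_broadcast && echo "bond0 is in broadcast mode"
#   bond_slaves
```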

On Node1 of the example setup described above, /etc/network/interfaces will look as follows:

auto lo
iface lo inet loopback

iface ens20 inet manual

auto ens21
iface ens21 inet static
        address  10.14.14.50
        netmask  255.255.255.0


iface ens18 inet manual

iface ens19 inet manual

auto bond0
iface bond0 inet static
       address 10.15.15.50
       netmask 255.255.255.0
       slaves ens18 ens19
       bond_miimon 100
       bond_mode broadcast


auto vmbr0
iface vmbr0 inet static
        address  192.168.2.50
        netmask  255.255.240.0
        gateway  192.168.2.1
        bridge_ports ens20
        bridge_stp off
        bridge_fd 0