Full Mesh Network for Ceph Server
Revision as of 08:54, 13 December 2021
Introduction
This wiki page describes how to configure a three-node "Meshed Network" on Proxmox VE (or any other Debian-based Linux distribution). Such a network can be used, for example, to connect Ceph Servers or the nodes of a Proxmox VE Cluster with the maximum possible bandwidth and without using a switch. We recommend using switches for clusters larger than 3 nodes, or if a 3-node cluster is expected to grow in the future.
The big advantage of this setup is that you can achieve a fast network connection between the nodes (10, 25, 40 or 100 Gbit/s) without buying expensive switches that can handle these speeds.
Each server needs at least two available NICs, each connected directly to one of the other two servers.
         ┌───────┐
    ┌────┤ Node1 ├────┐
    │    └───────┘    │
┌───┴───┐         ┌───┴───┐
│ Node2 ├─────────┤ Node3 │
└───────┘         └───────┘
There are three possible ways to set up such a network:
- RSTP: A loop with the rapid spanning tree protocol enabled
- Routed: Each packet is sent to the addressed node only
- Broadcast: Each packet is sent to both other nodes
Each setup has benefits and caveats. The RSTP setup is the most fault-tolerant. While the loop is complete, RSTP creates an artificial cut-off between two nodes, e.g. between Node 1 and Node 3. Node 2 then sits between Node 1 and Node 3, and the traffic between those two goes via Node 2. Should a cable or NIC fail somewhere else, for example between Node 1 and Node 2, RSTP removes the cut-off within a few seconds. Node 3 is then in between Node 1 and Node 2 and has to forward that traffic as well. Once the broken part has been replaced and the loop is complete again, RSTP introduces another artificial cut-off.
The routed setup is the most performant. It uses less bandwidth by routing the traffic only to the destination node. The advantage of the broadcast method is an easier setup process, but it will send all data to both other nodes, using up more bandwidth.
The routed and broadcast methods do not provide any fault tolerance by themselves, but you can combine them with a bond to increase fault tolerance at the cost of more NICs and cables.
Example
3 servers:
- Node1 with IP addresses x.x.x.50
- Node2 with IP addresses x.x.x.51
- Node3 with IP addresses x.x.x.52
3 to 4 network ports in each server:
- ens18, ens19: used for the actual full mesh. Physical direct connections to the other two servers, 10.15.15.y/24
- ens20: connection to WAN (internet/router), used by vmbr0, 192.168.2.y
- ens21 (optional): LAN (for cluster traffic, etc.), 10.14.14.y
Direct connections between servers:
- Node1/ens18 - Node2/ens19
- Node2/ens18 - Node3/ens19
- Node3/ens18 - Node1/ens19
Please adapt the NIC names and IP addresses according to your situation.
                    ┌───────────┐
                    │   Node1   │
                    ├─────┬─────┤
                    │ens18│ens19│
                    └──┬──┴──┬──┘
                       │     │
┌───────┬─────┐        │     │        ┌─────┬───────┐
│       │ens19├────────┘     └────────┤ens18│       │
│ Node2 ├─────┤                       ├─────┤ Node3 │
│       │ens18├───────────────────────┤ens19│       │
└───────┴─────┘                       └─────┴───────┘
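The example cabling follows a simple ring rule: on every node, ens18 goes to the next node and ens19 to the previous one. The following is a minimal shell sketch of that rule, which may help when double-checking a cable plan. The function name `mesh_peer` is hypothetical and not part of any Proxmox tooling, and the rule only holds for the example cabling listed above.

```shell
#!/bin/sh
# mesh_peer NODE NIC: print the node number (1-3) at the other end of the cable.
# Encodes the example cabling only: ens18 -> next node, ens19 -> previous node.
mesh_peer() {
    node=$1 nic=$2
    case "$nic" in
        ens18) echo $(( node % 3 + 1 )) ;;
        ens19) echo $(( (node + 1) % 3 + 1 )) ;;
        *) echo "unknown NIC: $nic" >&2; return 1 ;;
    esac
}
```

For example, `mesh_peer 1 ens18` prints the node that Node1/ens18 is cabled to.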
RSTP Loop Setup
This setup requires the use of Open vSwitch (OVS), as it supports RSTP (Rapid Spanning Tree Protocol). The Linux bridge itself only supports STP (without the "rapid"), which usually takes too long to react to a changed topology. In our tests, the RSTP setup recovered from one network connection going down within a few seconds, while STP took about 30 seconds. An outage of 30 seconds is long enough for Ceph to start to complain and throw some warnings.
First the package `openvswitch-switch` needs to be installed on all nodes:
apt install openvswitch-switch
The network configuration will look the same for each node, except for the IP addresses.
/etc/network/interfaces
auto lo
iface lo inet loopback

iface ens20 inet manual

auto ens21
iface ens21 inet static
    address 10.14.14.51
    netmask 255.255.255.0

auto vmbr0
iface vmbr0 inet static
    address 192.168.2.51
    netmask 255.255.240.0
    gateway 192.168.2.1
    bridge_ports ens20
    bridge_stp off
    bridge_fd 0

auto ens18
iface ens18 inet manual
    ovs_type OVSPort
    ovs_bridge vmbr1
    ovs_options other_config:rstp-enable=true other_config:rstp-path-cost=150 other_config:rstp-port-admin-edge=false other_config:rstp-port-auto-edge=false other_config:rstp-port-mcheck=true vlan_mode=native-untagged

auto ens19
iface ens19 inet manual
    ovs_type OVSPort
    ovs_bridge vmbr1
    ovs_options other_config:rstp-enable=true other_config:rstp-path-cost=150 other_config:rstp-port-admin-edge=false other_config:rstp-port-auto-edge=false other_config:rstp-port-mcheck=true vlan_mode=native-untagged

auto vmbr1
iface vmbr1 inet static
    address 10.15.15.50/24
    ovs_type OVSBridge
    ovs_ports ens18 ens19
    up ovs-vsctl set Bridge ${IFACE} rstp_enable=true other_config:rstp-priority=32768 other_config:rstp-forward-delay=4 other_config:rstp-max-age=6
    post-up sleep 10
If needed, you can set the MTU with `ovs_mtu 9000` in the `vmbr1`, `ens18` and `ens19` configs.
You can check the RSTP status with:
ovs-appctl rstp/show
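Beyond the RSTP status, it can be useful to confirm that every peer is actually reachable over the mesh, including with jumbo frames if the MTU was raised. The following is a sketch only: the function names are hypothetical, the address list is taken from the example and has to be adapted, and it assumes the `ping` from iputils (the Debian default), whose `-M do` option forbids fragmentation.

```shell
#!/bin/sh
# Mesh addresses from the example; adapt to your network.
MESH_IPS="10.15.15.50 10.15.15.51 10.15.15.52"

# ping_payload MTU: largest ICMP payload that fits into one unfragmented IPv4
# packet (MTU minus 20 bytes IPv4 header minus 8 bytes ICMP header).
ping_payload() {
    echo $(( $1 - 28 ))
}

# mesh_peers OWN_IP: print every mesh address except our own.
mesh_peers() {
    for ip in $MESH_IPS; do
        [ "$ip" = "$1" ] || echo "$ip"
    done
}

# check_mesh OWN_IP MTU: ping each peer with the don't-fragment bit set.
check_mesh() {
    size=$(ping_payload "$2")
    for peer in $(mesh_peers "$1"); do
        if ping -c 3 -M do -s "$size" "$peer" >/dev/null 2>&1; then
            echo "$peer reachable at MTU $2"
        else
            echo "$peer NOT reachable at MTU $2"
        fi
    done
}
```

On Node1 this could be called as `check_mesh 10.15.15.50 9000`, or with 1500 if the MTU was left at its default.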
Routed Setup
Following the example setup described above, the 3 nodes have to be configured as shown in the following sections. Note that multicast is not possible with this method.
Node1
/etc/network/interfaces
auto lo
iface lo inet loopback

iface ens20 inet manual

auto ens21
iface ens21 inet static
    address 10.14.14.50
    netmask 255.255.255.0

# Connected to Node2 (.51)
auto ens18
iface ens18 inet static
    address 10.15.15.50
    netmask 255.255.255.0
    up ip route add 10.15.15.51/32 dev ens18
    down ip route del 10.15.15.51/32

# Connected to Node3 (.52)
auto ens19
iface ens19 inet static
    address 10.15.15.50
    netmask 255.255.255.0
    up ip route add 10.15.15.52/32 dev ens19
    down ip route del 10.15.15.52/32

auto vmbr0
iface vmbr0 inet static
    address 192.168.2.50
    netmask 255.255.240.0
    gateway 192.168.2.1
    bridge_ports ens20
    bridge_stp off
    bridge_fd 0
route
root@pve-2-50:~# ip route
default via 192.168.2.1 dev vmbr0 onlink
10.14.14.0/24 dev ens21 proto kernel scope link src 10.14.14.50
10.15.15.0/24 dev ens18 proto kernel scope link src 10.15.15.50
10.15.15.0/24 dev ens19 proto kernel scope link src 10.15.15.50
10.15.15.52 dev ens19 scope link
10.15.15.51 dev ens18 scope link
192.168.0.0/20 dev vmbr0 proto kernel scope link src 192.168.2.50
Node2
/etc/network/interfaces
auto lo
iface lo inet loopback

iface ens20 inet manual

auto ens21
iface ens21 inet static
    address 10.14.14.51
    netmask 255.255.255.0

# Connected to Node3 (.52)
auto ens18
iface ens18 inet static
    address 10.15.15.51
    netmask 255.255.255.0
    up ip route add 10.15.15.52/32 dev ens18
    down ip route del 10.15.15.52/32

# Connected to Node1 (.50)
auto ens19
iface ens19 inet static
    address 10.15.15.51
    netmask 255.255.255.0
    up ip route add 10.15.15.50/32 dev ens19
    down ip route del 10.15.15.50/32

auto vmbr0
iface vmbr0 inet static
    address 192.168.2.51
    netmask 255.255.240.0
    gateway 192.168.2.1
    bridge_ports ens20
    bridge_stp off
    bridge_fd 0
route
root@pve-2-51:/# ip route
default via 192.168.2.1 dev vmbr0 onlink
10.14.14.0/24 dev ens21 proto kernel scope link src 10.14.14.51
10.15.15.0/24 dev ens18 proto kernel scope link src 10.15.15.51
10.15.15.0/24 dev ens19 proto kernel scope link src 10.15.15.51
10.15.15.52 dev ens18 scope link
10.15.15.50 dev ens19 scope link
192.168.0.0/20 dev vmbr0 proto kernel scope link src 192.168.2.51
Node3
/etc/network/interfaces
auto lo
iface lo inet loopback

iface ens20 inet manual

auto ens21
iface ens21 inet static
    address 10.14.14.52
    netmask 255.255.255.0

# Connected to Node1 (.50)
auto ens18
iface ens18 inet static
    address 10.15.15.52
    netmask 255.255.255.0
    up ip route add 10.15.15.50/32 dev ens18
    down ip route del 10.15.15.50/32

# Connected to Node2 (.51)
auto ens19
iface ens19 inet static
    address 10.15.15.52
    netmask 255.255.255.0
    up ip route add 10.15.15.51/32 dev ens19
    down ip route del 10.15.15.51/32

auto vmbr0
iface vmbr0 inet static
    address 192.168.2.52
    netmask 255.255.240.0
    gateway 192.168.2.1
    bridge_ports ens20
    bridge_stp off
    bridge_fd 0
route
root@pve-2-52:~# ip route
default via 192.168.2.1 dev vmbr0 onlink
10.14.14.0/24 dev ens21 proto kernel scope link src 10.14.14.52
10.15.15.0/24 dev ens18 proto kernel scope link src 10.15.15.52
10.15.15.0/24 dev ens19 proto kernel scope link src 10.15.15.52
10.15.15.51 dev ens19 scope link
10.15.15.50 dev ens18 scope link
192.168.0.0/20 dev vmbr0 proto kernel scope link src 192.168.2.52
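The three routed configurations differ only in the node's own mesh address and in which peer sits behind which NIC. The following sketch prints the two mesh stanzas for a given node. It assumes the example addressing (10.15.15.49 plus the node number) and the example cabling (ens18 to the next node, ens19 to the previous one); the function name `routed_stanzas` is hypothetical.

```shell
#!/bin/sh
# routed_stanzas NODE: print the ens18/ens19 stanzas for node 1, 2 or 3.
routed_stanzas() {
    node=$1
    own="10.15.15.$(( 49 + node ))"
    next="10.15.15.$(( 49 + node % 3 + 1 ))"         # peer behind ens18
    prev="10.15.15.$(( 49 + (node + 1) % 3 + 1 ))"   # peer behind ens19
    for pair in "ens18 $next" "ens19 $prev"; do
        set -- $pair   # $1 = NIC, $2 = peer address
        printf 'auto %s\niface %s inet static\n' "$1" "$1"
        printf '\taddress %s\n\tnetmask 255.255.255.0\n' "$own"
        printf '\tup ip route add %s/32 dev %s\n' "$2" "$1"
        printf '\tdown ip route del %s/32\n\n' "$2"
    done
}
```

The output of `routed_stanzas 2` can be compared against the Node2 configuration above before pasting anything into a real file. Once a configuration is active, `ip route get 10.15.15.51` on Node1 should name `ens18` as the outgoing device.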
Broadcast Setup
Create a "broadcast" bond with the given interfaces on every node. This can be done via the GUI or on the command line.
GUI
In the GUI, go to the node level -> System -> Network. Then click "Create" and select "Linux Bond". In the wizard, configure the bond without a gateway and set the mode to "broadcast".
Reboot the node to activate the new network settings.
Command-Line
Add the following lines to '/etc/network/interfaces'.
auto bond<No>
iface bond<No> inet static
    address <IP>
    netmask <Netmask>
    slaves <Nic1> <Nic2>
    bond_miimon 100
    bond_mode broadcast
#Full Mesh
Then start the bond:
ifup bond<No>
On Node1 of the example setup described above, /etc/network/interfaces will look as follows:
auto lo
iface lo inet loopback

iface ens20 inet manual

auto ens21
iface ens21 inet static
    address 10.14.14.50
    netmask 255.255.255.0

iface ens18 inet manual

iface ens19 inet manual

auto bond0
iface bond0 inet static
    address 10.15.15.50
    netmask 255.255.255.0
    slaves ens18 ens19
    bond_miimon 100
    bond_mode broadcast

auto vmbr0
iface vmbr0 inet static
    address 192.168.2.50
    netmask 255.255.240.0
    gateway 192.168.2.1
    bridge_ports ens20
    bridge_stp off
    bridge_fd 0
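After the bond is up, the kernel exposes its state in a status file under /proc/net/bonding/. The following sketch extracts the mode and the member interfaces from such a file so they can be compared with the configuration. The helper names `bond_mode` and `bond_slaves` are hypothetical, and the sketch assumes the usual `Bonding Mode:` and `Slave Interface:` line prefixes of the Linux bonding driver.

```shell
#!/bin/sh
# bond_mode FILE: print the bonding mode reported in a bonding status file,
# e.g. /proc/net/bonding/bond0.
bond_mode() {
    sed -n 's/^Bonding Mode: //p' "$1"
}

# bond_slaves FILE: print the interfaces enslaved to the bond, one per line.
bond_slaves() {
    sed -n 's/^Slave Interface: //p' "$1"
}
```

For the example, `bond_mode /proc/net/bonding/bond0` is expected to report a broadcast mode, and `bond_slaves /proc/net/bonding/bond0` should list ens18 and ens19.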