DRBD

Introduction

DRBD® refers to block devices designed as a building block to form high availability (HA) clusters. This is done by mirroring a whole block device via an assigned network. DRBD can be understood as network-based RAID 1. For detailed information, please visit Linbit (http://www.linbit.com).

Main features of the integration in Proxmox VE:

  • LVM on top of DRBD (Primary/Primary)
  • All VM disks (LVM volumes on the DRBD device) are replicated in real time on both Proxmox VE nodes via the network
  • Ability to live migrate running machines within a few seconds and without downtime, WITHOUT the need for a SAN (iSCSI, FC, NFS), as the data is already on both nodes.

Note: Currently only KVM guests can use the new Storage Model.

It is highly recommended that you configure two DRBD volumes, as explained in the section "Recovery from communication failure" below.

For this HowTo I used an Intel Entry Server mainboard with two Intel GBit network cards and two simple SATA hard drives. On /dev/sda I installed Proxmox VE as usual; /dev/sdb will be used for DRBD. eth0 is used for the default vmbr0 and eth1 will be used for DRBD.

System requirements

You need 2 identical Proxmox VE servers with the following extra hardware:

  • Free network card (connected with a direct crossover cable)
  • Second raid volume (e.g. /dev/sdb)
  • Use a hardware RAID controller with BBU to eliminate performance issues concerning internal metadata (see Florian's blog: http://fghaas.wordpress.com/2009/08/20/internal-metadata-and-why-we-recommend-it/).

Preparations

Make sure you run at least Proxmox VE 1.4 on both servers and create the well-known standard Proxmox VE Cluster.
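
If you have not created the cluster yet, this can also be done from the command line. A minimal sketch, assuming the pveca cluster tool from the Proxmox VE 1.x series and the example addresses used later in this HowTo (exact options may differ on your version):

# on the first node (becomes the cluster master)
pveca -c

# on the second node, join the cluster using the master's address
pveca -a -h 192.168.7.105

# verify the cluster state on either node
pveca -l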

Network

Configure eth1 on both nodes with a fixed private IP address via the web interface and reboot each server.

For better understanding, here is my /etc/network/interfaces file from my first node after the reboot:

cat /etc/network/interfaces
# network interface settings
auto lo
iface lo inet loopback

iface eth0 inet manual

auto eth1
iface eth1 inet static
        address  10.0.7.105
        netmask  255.255.240.0

auto vmbr0
iface vmbr0 inet static
        address  192.168.7.105
        netmask  255.255.240.0
        gateway  192.168.2.1
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0

And from the second node:

# network interface settings
auto lo
iface lo inet loopback

iface eth0 inet manual

auto eth1
iface eth1 inet static
        address  10.0.7.106
        netmask  255.255.240.0

auto vmbr0
iface vmbr0 inet static
        address  192.168.7.106
        netmask  255.255.240.0
        gateway  192.168.2.1
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0
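
Before continuing, it is worth verifying that the two nodes can reach each other over the dedicated DRBD link. A quick check, assuming the eth1 addresses shown above:

# from the first node
ping -c 3 10.0.7.106

# from the second node
ping -c 3 10.0.7.105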

Disk for DRBD

I will use /dev/sdb1 for DRBD. Therefore I need to create this single big partition on /dev/sdb - make sure that this is 100% identical on both nodes.

Just run fdisk /dev/sdb and create a primary partition (/dev/sdb1):

proxmox-105:~# fdisk /dev/sdb
                                                                                                                                                             
The number of cylinders for this disk is set to 19457.                                                                                                          
There is nothing wrong with that, but this is larger than 1024,                                                                                                 
and could in certain setups cause problems with:                                                                                                                
1) software that runs at boot time (e.g., old versions of LILO)                                                                                                 
2) booting and partitioning software from other OSs                                                                                                             
   (e.g., DOS FDISK, OS/2 FDISK)                                                                                                                                
                                                                                                                                                                
Command (m for help): n                                                                                                                                         
Command action                                                                                                                                                  
   e   extended                                                                                                                                                 
   p   primary partition (1-4)                                                                                                                                  
p                                                                                                                                                               
Partition number (1-4): 1                                                                                                                                       
First cylinder (1-19457, default 1):                                                                                                                            
Using default value 1                                                                                                                                           
Last cylinder or +size or +sizeM or +sizeK (1-19457, default 19457):                                                                                            
Using default value 19457                                                                                                                                       
                                                                                                                                                                
Command (m for help): t                                                                                                                                         
Selected partition 1                                                                                                                                            
Hex code (type L to list codes): 8e                                                                                                                             
Changed system type of partition 1 to 8e (Linux LVM)                                                                                                            
                                                                                                                                                                
Command (m for help): p                                                                                                                                         
                                                                                                                                                                
Disk /dev/sdb: 160.0 GB, 160041885696 bytes                                                                                                                     
255 heads, 63 sectors/track, 19457 cylinders                                                                                                                    
Units = cylinders of 16065 * 512 = 8225280 bytes                                                                                                                
Disk identifier: 0x49e2fd2f                                                                                                                                     
                                                                                                                                                                
   Device Boot      Start         End      Blocks   Id  System                                                                                                  
/dev/sdb1               1       19457   156288321   8e  Linux LVM                                                                                               
                                                                                                                                                                
Command (m for help): w                                                                                                                                         
The partition table has been altered!                                                                                                                           
                                                                                                                                                                
Calling ioctl() to re-read partition table.                                                                                                                     
                                                                                          
Syncing disks.
proxmox-105:~#
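
To verify that the partition layout really is identical on both nodes, you can compare the partition table dumps. A simple sketch, assuming sfdisk is available (it is part of util-linux on Debian):

# run on each node and compare the output line by line
sfdisk -d /dev/sdb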

DRBD configuration

Software installation

Install the DRBD user tools on both nodes (I got DRBD 8.3.2 in the kernel and also drbd8-utils 8.3.2):

apt-get install drbd8-utils

Note: It is advisable to have a DRBD userland version that matches the kernel module version. See this forum post for a short build howto:

http://forum.proxmox.com/threads/7059-Update-DRBD-userland-to-8.3.10-to-match-kernel-in-1.9?p=40033#poststop
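
To compare the kernel module and userland versions on your system, the following sketch can help (only standard Debian commands and the drbd8-utils package name are assumed):

# version of the DRBD kernel module
modinfo drbd | grep ^version

# version of the installed userland tools
dpkg -s drbd8-utils | grep ^Version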

Prepare drbd.conf file

Edit the DRBD config file. Please note that this file needs to be identical on both nodes; here is an example:

proxmox-105:~# cat /etc/drbd.conf
global { usage-count no; }
common { syncer { rate 30M; } }
resource r0 {
        protocol C;
        startup {
                wfc-timeout  15;     # wfc-timeout can be dangerous (http://forum.proxmox.com/threads/3465-Is-it-safe-to-use-wfc-timeout-in-DRBD-configuration)
                degr-wfc-timeout 60;
                become-primary-on both;
        }
        net {
                cram-hmac-alg sha1;
                shared-secret "my-secret";
                allow-two-primaries;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
        }
        on proxmox-105 {
                device /dev/drbd0;
                disk /dev/sdb1;
                address 10.0.7.105:7788;
                meta-disk internal;
        }
        on proxmox-106 {
                device /dev/drbd0;
                disk /dev/sdb1;
                address 10.0.7.106:7788;
                meta-disk internal;
        }
}
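
After editing the file on the first node, you can let drbdadm check that the configuration parses cleanly and then copy it to the peer so both nodes stay identical. A sketch, assuming root SSH access between the nodes:

# parse and dump the configuration of r0 (syntax errors are reported here)
drbdadm dump r0

# copy the identical configuration to the second node
scp /etc/drbd.conf root@10.0.7.106:/etc/drbd.conf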

Start DRBD

On both servers, start DRBD:

/etc/init.d/drbd start

Now create the device metadata, also on both nodes:

drbdadm create-md r0

Bring the device up, also on both nodes:

drbdadm up r0

Now you can check the current status of DRBD; it should look like this on both nodes:

proxmox-105:~# cat /proc/drbd
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by root@oahu, 2009-09-10 15:18:39
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:2096348

Now DRBD has successfully allocated resources and is ready for operation. Start the initial synchronization (on one node only!):

drbdadm -- --overwrite-data-of-peer primary r0

Wait until the initial sync is finished (depending on size and speed, this process can take some time):

proxmox-120:~# watch cat /proc/drbd
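
Besides watching /proc/drbd, you can also query the connection and disk state directly with drbdadm; the sync is finished once both disks report UpToDate:

drbdadm cstate r0    # connection state, e.g. SyncSource while syncing, Connected when done
drbdadm dstate r0    # disk state, should end up as UpToDate/UpToDate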

Finally, check whether DRBD comes up in Primary/Primary mode. In order to do this, just stop the DRBD service on both nodes:

/etc/init.d/drbd stop

And start again on both nodes:

/etc/init.d/drbd start

Now DRBD should be in Primary/Primary mode:

proxmox-105:~# cat /proc/drbd
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by root@oahu, 2009-09-10 15:18:39
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----
    ns:0 nr:0 dw:0 dr:268 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
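
The same information is available without reading /proc/drbd; drbdadm can print the current roles directly:

drbdadm role r0    # should print Primary/Primary on both nodes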

LVM configuration

Now we have a running /dev/drbd0 in Primary/Primary mode, and we need to add LVM on top of this device.

Adapt your lvm.conf

You need to change the filter section in the LVM configuration; just edit the corresponding line in lvm.conf. Do this on both nodes:

nano /etc/lvm/lvm.conf

# By default we accept every block device:
filter = [ "r|/dev/sdb1|", "r|/dev/disk/|", "r|/dev/block/|", "a/.*/" ]

Note: if your device is not /dev/sdb, change this filter according to your system.

Create the physical volume for LVM

On one node, create the physical volume:

proxmox-105:~# pvcreate /dev/drbd0
  Physical volume "/dev/drbd0" successfully created
proxmox-105:~#

Check your physical volumes; the output should look like this:

proxmox-105:~# pvscan
  PV /dev/sda2    VG pve             lvm2 [465.26 GB / 4.00 GB free]
  PV /dev/drbd0                      lvm2 [149.04 GB]
  Total: 2 [613.30 GB] / in use: 1 [613.30 GB] / in no VG: 1 [4.00 GB]
proxmox-105:~#

Create the volume group

On one node, create the volume group:

proxmox-105:~# vgcreate drbdvg /dev/drbd0
  Volume group "drbdvg" successfully created
proxmox-105:~#

Check your physical volumes again; the output should look like this:

proxmox-105:~# pvscan
  PV /dev/sda2    VG pve             lvm2 [465.26 GB / 4.00 GB free]
  PV /dev/drbd0   VG drbdvg          lvm2 [149.04 GB / 149.04 GB free]
...
proxmox-105:~#

Add the LVM group to the Proxmox VE storage list via web interface

In the Proxmox VE web interface, go to 'Configuration/Storage', click on the red arrow and select 'Add LVM group'.

You will see the previously created volume group (drbdvg). Select it, give it a storage name (this cannot be changed later, so think twice), and enable sharing by ticking the 'shared' box.

Create the first VM on DRBD for testing and live migration

In order to check end-to-end functionality, create a new KVM VM and store its disk on the previously created storage. When you install the operating system (in my case Ubuntu 9.04), all the data is already replicated automatically on both nodes; just finish the installation and reboot into your VM.

Try to live migrate the VM - as all data is available on both nodes, it takes just a few seconds. The overall process can take a bit longer if the VM is under load and if there is a lot of RAM involved. But in any case, the downtime is minimal and you will see no interruption at all, just as if you were using iSCSI or NFS as the storage back-end.
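
Live migration can also be triggered from the command line. A hedged sketch - the qm migrate syntax below follows current Proxmox VE releases and is an assumption for the 1.x series; 101 and proxmox-106 are example values:

# migrate VM 101 to the other node while it keeps running
qm migrate 101 proxmox-106 --online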

DRBD support

DRBD can be configured in many ways and there is a lot of room for optimization and performance tuning. If you run DRBD in a production environment, we highly recommend the 24/7 support packages from the DRBD developers (http://www.linbit.com/en/products-services/drbd-support/drbd-support/). The company behind DRBD is Linbit (http://www.linbit.com); they are located just a few miles away from the Proxmox headquarters, which makes it easy to cooperate.

Recovery from communication failure

When the communication channel between the two DRBD nodes fails (commonly known as a "split-brain" situation), both nodes continue independently. When the communication channel is re-established the two nodes will no longer have the same data in their DRBD volumes, and re-synchronisation is required.

In the dual-primary configuration used with Proxmox VE, both nodes may have changes to the same DRBD resource that have occurred since the split-brain condition. In this case it will be necessary to discard the data of one of the nodes. Normally one would choose to discard the data of the least-active node.

Virtual disk data on that node can be preserved by using the PVE web interface to migrate those virtual machines onto the node whose data will be kept. This does mean that these virtual machines will need to be taken out of service while their complete virtual disk data is copied to the other node.

It is therefore advisable to always run "mission critical" VMs on one of the nodes, and use its peer for "development" VMs. Alternatively it is possible to create two DRBD volumes and use one for VMs that normally run on node A, and the other for VMs that are normally on B. Thus little or no forced migration is needed after a split-brain condition.

For specific recovery steps, see DRBD's documentation on resolving split-brain conditions: http://www.drbd.org/users-guide-8.3/s-resolve-split-brain.html
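
Whether a split brain has actually happened can usually be seen from the connection state and the kernel log. A quick check, assuming DRBD 8.3 as used in this HowTo:

# a StandAlone connection state after the link is back is suspicious
cat /proc/drbd

# DRBD logs split-brain detection to the kernel log
dmesg | grep -i "split-brain"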

Recovery from split brain when using two DRBD volumes

Our example configuration uses two DRBD volumes, named as follows:
Resource r0: /dev/drbd0
Resource r1: /dev/drbd1

Server names are NODEA and NODEB

NODEA runs the VMs stored on r0
NODEB runs the VMs stored on r1

Fix resource r1

On NODEA, run these commands:

drbdadm secondary r1
drbdadm -- --discard-my-data connect r1

On NODEB, run this command:

drbdadm connect r1

Now resource r1 will sync changes from NODEB to NODEA.

You can watch the progress using this command:

watch cat /proc/drbd


Fix resource r0

On NODEB, run these commands:

drbdadm secondary r0
drbdadm -- --discard-my-data connect r0

On NODEA, run this command:

drbdadm connect r0

Now resource r0 will sync changes from NODEA to NODEB.

You can watch the progress using this command:

watch cat /proc/drbd


In most cases the recovery will take only a few minutes, since DRBD only needs to copy the data that has changed since the split brain occurred.


Final considerations

Now you have fully redundant storage for your VMs with no need for expensive SAN equipment, configured in about 30 to 60 minutes - starting from bare metal. If you want to achieve a similar setup with traditional virtualization solutions and a SAN, you will need at least four server boxes plus the storage network.

Traditional setup with SAN (e.g. iSCSI, NFS, FC):

  • Two servers for a redundant SAN
  • Two servers for redundant virtualization hosts
  • Extra storage network, VLANs, FC switches, etc.
  • Complex setup

Proxmox VE with DRBD:

  • Only two servers with RAID controllers and a bunch of hard drives (configure two RAID volumes, one for Proxmox VE and one for the DRBD device)

Beginning with kernel 2.6.33, DRBD is integrated into the mainline kernel (most 2.6.32 kernels also have it backported). This makes it much easier to maintain and upgrade the systems, and it can be considered the "highest quality certificate" for kernel module developers.

Additionally, DRBD will be the base for high availability (HA) - please help by testing and reporting bugs and issues in our forum or on the mailing list.