DRBD
Introduction
DRBD® is a replicated block device designed as a building block for high availability (HA) clusters. It mirrors a whole block device over a dedicated network link and can be understood as network-based RAID-1. For detailed information please visit Linbit.
Main features of the integration in Proxmox VE:
- LVM on top of DRBD (Primary/Primary)
- All VM disks (LVM volumes on the DRBD device) can be replicated in real time on both Proxmox VE nodes via the network. Alternatively, all VM storage can be replicated in nearly real time to many other backup servers.
- Ability to live migrate running machines without downtime in a few seconds WITHOUT the need of SAN (iSCSI, FC, NFS) as the data is already on both nodes.
Note: In Proxmox 1.x only KVM guests can use the new Storage Model. In Proxmox 2.0 with a little command line work you can use OpenVZ on DRBD too.
It is highly recommended that you configure two DRBD volumes as explained here. Recovery from split-brain is easier with two volumes than it is with one.
For the remainder of this HowTo I used an Intel Entry Server main board with 2 Intel GBit network cards and two simple SATA hard drives. On /dev/sda I installed Proxmox VE as usual; /dev/sdb will be used for DRBD. eth0 is used for the default vmbr0 and eth1 will be used for DRBD.
If you prefer a HowTo that is based on Proxmox 2.x and is intended more for live backup over high-latency, low-bandwidth links than for HA, you may find the guide at http://www.nedproductions.biz/wiki/configuring-a-proxmox-ve-2.x-cluster-running-over-an-openvpn-intranet/configuring-a-proxmox-ve-2.x-cluster-running-over-an-openvpn-intranet-part-3 useful. In particular, that guide assumes regular split-brain conditions, as live backup only occurs for part of a day (during the early hours of the morning).
System requirements
You need 2 identical Proxmox VE servers with the following extra hardware:
- Free network card (connected with a direct crossover cable)
- Second raid volume (e.g. /dev/sdb)
- Use a hardware raid controller with BBU to eliminate performance issues concerning internal metadata (see Florian's blog).
Preparations
Make sure you run at least Proxmox VE 1.4 on both servers and create the well known standard Proxmox VE Cluster.
Network
Configure eth1 on both nodes with a fixed private IP address via the web interface and reboot each server.
For better understanding, here is my /etc/network/interfaces file from my first node after the reboot:
cat /etc/network/interfaces
# network interface settings
auto lo
iface lo inet loopback

iface eth0 inet manual

auto eth1
iface eth1 inet static
        address 10.0.7.105
        netmask 255.255.240.0

auto vmbr0
iface vmbr0 inet static
        address 192.168.7.105
        netmask 255.255.240.0
        gateway 192.168.2.1
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0
And from the second node:
# network interface settings
auto lo
iface lo inet loopback

iface eth0 inet manual

auto eth1
iface eth1 inet static
        address 10.0.7.106
        netmask 255.255.240.0

auto vmbr0
iface vmbr0 inet static
        address 192.168.7.106
        netmask 255.255.240.0
        gateway 192.168.2.1
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0
Disk for DRBD
I will use /dev/sdb1 for DRBD. Therefore I need to create this single big partition on /dev/sdb - make sure that this is 100% identical on both nodes.
Just run fdisk /dev/sdb and create a primary partition (/dev/sdb1) of type 8e (Linux LVM):
proxmox-105:~# fdisk /dev/sdb

The number of cylinders for this disk is set to 19457.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-19457, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-19457, default 19457):
Using default value 19457

Command (m for help): t
Selected partition 1
Hex code (type L to list codes): 8e
Changed system type of partition 1 to 8e (Linux LVM)

Command (m for help): p

Disk /dev/sdb: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x49e2fd2f

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       19457   156288321   8e  Linux LVM

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
proxmox-105:~#
DRBD configuration
Software installation
Install the DRBD user tools on both nodes (I got DRBD 8.3.2 in the Kernel and also 8.3.2 drbd8-utils):
apt-get install drbd8-utils
Note: It is advisable to have a DRBD userland version that matches the kernel module version. See this forum post for a short build howto:
Prepare drbd configuration
Replace the content of /etc/drbd.d/global_common.conf with the following:
global { usage-count no; }
common {
        syncer { rate 30M; verify-alg md5; }
        handlers { out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root"; }
}
Create a resource configuration file /etc/drbd.d/r0.res
resource r0 {
        protocol C;
        startup {
                wfc-timeout 0;    # non-zero wfc-timeout can be dangerous (http://forum.proxmox.com/threads/3465-Is-it-safe-to-use-wfc-timeout-in-DRBD-configuration)
                degr-wfc-timeout 60;
                become-primary-on both;
        }
        net {
                cram-hmac-alg sha1;
                shared-secret "my-secret";
                allow-two-primaries;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
                #data-integrity-alg crc32c;    # has to be enabled only for test and disabled for production use (check man drbd.conf, section "NOTES ON DATA INTEGRITY")
        }
        on proxmox-105 {
                device /dev/drbd0;
                disk /dev/sdb1;
                address 10.0.7.105:7788;
                meta-disk internal;
        }
        on proxmox-106 {
                device /dev/drbd0;
                disk /dev/sdb1;
                address 10.0.7.106:7788;
                meta-disk internal;
        }
        disk {
                # no-disk-barrier and no-disk-flushes should be applied only to systems with non-volatile (battery backed) controller caches.
                # Follow links for more information:
                # http://www.drbd.org/users-guide-8.3/s-throughput-tuning.html#s-tune-disable-barriers
                # http://www.drbd.org/users-guide/s-throughput-tuning.html#s-tune-disable-barriers
                no-disk-barrier;
                no-disk-flushes;
        }
}
(If you do not have a directory /etc/drbd.d, you are probably running an older DRBD configuration layout. In that case, see the history of this page for how to configure /etc/drbd.conf.)
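If you follow the recommendation to run two DRBD volumes, the second resource needs its own resource name, DRBD device, backing disk and TCP port. The following is a minimal sketch that derives a second resource file from r0.res. The helper name `derive_r1`, the backing partition /dev/sdb2 and the port 7789 are assumptions chosen for illustration; adjust them to your own layout.

```shell
#!/bin/sh
# Sketch: rewrite an r0 resource definition (read from stdin) into an r1
# definition (written to stdout).
# Assumptions, not taken from this HowTo: the second backing partition is
# /dev/sdb2 and the second resource listens on TCP port 7789.
derive_r1() {
    sed -e 's/^resource r0/resource r1/' \
        -e 's#/dev/drbd0#/dev/drbd1#' \
        -e 's#/dev/sdb1#/dev/sdb2#' \
        -e 's/:7788;/:7789;/'
}

# Example call (commented out so that sourcing this file changes nothing):
# derive_r1 < /etc/drbd.d/r0.res > /etc/drbd.d/r1.res
```

Review the generated file before using it, and run the same metadata creation and initial sync steps for r1 as described for r0 below. A second backing partition would presumably also need its own reject entry in the LVM filter.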
Start DRBD
On both servers, start DRBD:
/etc/init.d/drbd start
Now create the device metadata, also on both nodes:
drbdadm create-md r0
Bring the device up, also on both nodes:
drbdadm up r0
Now you can check the current status of DRBD. It should look like this on both nodes:
proxmox-105:~# cat /proc/drbd
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by root@oahu, 2009-09-10 15:18:39
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:2096348
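The three fields to look at in this output are the connection state (cs), the roles (ro) and the disk states (ds). As a rough sketch, a small helper can extract them for scripting; the function name `drbd_state` is hypothetical, and the parsing assumes the DRBD 8.3 /proc/drbd layout shown in this HowTo.

```shell
#!/bin/sh
# Sketch: print "cs ro ds" for one DRBD minor number.
# Usage: drbd_state MINOR [FILE]   (FILE defaults to /proc/drbd)
drbd_state() {
    minor=$1
    file=${2:-/proc/drbd}
    awk -v m="$minor:" '
        $1 == m {                       # status line of the requested minor
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^cs:/) cs = substr($i, 4)
                if ($i ~ /^ro:/) ro = substr($i, 4)
                if ($i ~ /^ds:/) ds = substr($i, 4)
            }
            print cs, ro, ds
        }
    ' "$file"
}

# Example call (commented out so that sourcing this file has no effect):
# drbd_state 0
```

For the status shown above, the helper would print the connection state, roles and disk states of minor 0 on a single line, which is easier to compare between the two nodes than the full output.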
Now DRBD has successfully allocated resources and is ready for operation. Now start the initial synchronization (only on one node!!!):
drbdadm -- --overwrite-data-of-peer primary r0
Wait till the initial sync is finished (depending on the size and speed this process can take some time):
proxmox-120:~# watch cat /proc/drbd
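If you would rather script the wait than watch the output, the following sketch polls until both sides report UpToDate. The function names `drbd_is_synced` and `wait_for_sync` are hypothetical, and the match relies on the `ds:` field format of DRBD 8.3 as shown in this HowTo.

```shell
#!/bin/sh
# Sketch: return 0 when the given minor reports ds:UpToDate/UpToDate.
# Usage: drbd_is_synced MINOR [FILE]   (FILE defaults to /proc/drbd)
drbd_is_synced() {
    minor=$1
    file=${2:-/proc/drbd}
    grep -Eq "^ *${minor}: .*ds:UpToDate/UpToDate( |\$)" "$file"
}

# Poll until the minor is in sync.
# Usage: wait_for_sync MINOR [INTERVAL_SECONDS] [FILE]
wait_for_sync() {
    while ! drbd_is_synced "$1" "${3:-/proc/drbd}"; do
        sleep "${2:-10}"
    done
}

# Example call (commented out so that sourcing this file does not block):
# wait_for_sync 0 30 && echo "initial sync finished"
```

The loop has no timeout, so it is only suitable for interactive use or for wrapping in your own timeout handling.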
Finally check if your DRBD is starting in Primary/Primary mode. In order to do this, just stop DRBD service on both nodes:
/etc/init.d/drbd stop
And start again on both nodes:
/etc/init.d/drbd start
Now DRBD should be in the Primary/Primary mode:
proxmox-105:~# cat /proc/drbd
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by root@oahu, 2009-09-10 15:18:39
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----
    ns:0 nr:0 dw:0 dr:268 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
LVM configuration
We now have a running /dev/drbd0 in Primary/Primary mode and need to add LVM on top of this device.
Adapt your lvm.conf
You need to change the filter section in the LVM configuration, just edit the corresponding line in lvm.conf - do this on both nodes:
nano /etc/lvm/lvm.conf

# By default we accept every block device:
filter = [ "r|/dev/sdb1|", "r|/dev/disk/|", "r|/dev/block/|", "a/.*/" ]
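Note: if your backing device is not /dev/sdb1, change the filter according to your system. The filter makes LVM ignore the backing partition so that the physical volume is only seen through /dev/drbd0.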
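Because the filter has to be identical on both nodes, a quick check can help catch a node that was missed. This is only a sketch: the function name `lvm_filter_rejects` is hypothetical, and it simply looks for a literal reject entry in the form used above rather than evaluating the filter the way LVM does.

```shell
#!/bin/sh
# Sketch: return 0 if an active (uncommented) filter line in lvm.conf
# contains a literal reject entry "r|DEVICE|" for the given device.
# Usage: lvm_filter_rejects DEVICE [CONF]   (CONF defaults to /etc/lvm/lvm.conf)
lvm_filter_rejects() {
    dev=$1
    conf=${2:-/etc/lvm/lvm.conf}
    grep -E '^[[:space:]]*filter[[:space:]]*=' "$conf" |
        grep -Fq "\"r|$dev|\""
}

# Example call (commented out so that sourcing this file has no effect):
# lvm_filter_rejects /dev/sdb1 || echo "WARNING: /dev/sdb1 is not rejected by the LVM filter"
```

Run the check on each node after editing lvm.conf.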
Create the physical volume for LVM
On one node, create the physical volume:
proxmox-105:~# pvcreate /dev/drbd0
  Physical volume "/dev/drbd0" successfully created
proxmox-105:~#
Check your physical volumes, should look like this:
proxmox-105:~# pvscan
  PV /dev/sda2    VG pve   lvm2 [465.26 GB / 4.00 GB free]
  PV /dev/drbd0            lvm2 [149.04 GB]
  Total: 2 [613.30 GB] / in use: 1 [613.30 GB] / in no VG: 1 [4.00 GB]
proxmox-105:~#
Create the volume group
On one node, create the volume group:
proxmox-105:~# vgcreate drbdvg /dev/drbd0
  Volume group "drbdvg" successfully created
proxmox-105:~#
Check your physical volumes again, should look like this:
proxmox-105:~# pvscan
  PV /dev/sda2    VG pve      lvm2 [465.26 GB / 4.00 GB free]
  PV /dev/drbd0   VG drbdvg   lvm2 [149.04 GB / 149.04 GB free]
  ...
proxmox-105:~#
Add the LVM group to the Proxmox VE storage list via web interface
In the Proxmox VE web interface, go to 'Configuration/Storage', click on the red arrow and select 'Add LVM group'.
You will see the previously created volume group (drbdvg). Select it, give it a Storage Name (this cannot be changed later, so think twice), and enable sharing by ticking the 'shared' box.
Create the first VM on DRBD for testing and live migration
In order to check end-to-end functionality, create a new KVM VM and store the VM disk on the previously created storage. When you install the operating system (in my case Ubuntu 9.04), all the data is replicated automatically to both nodes. Just finish the installation and reboot into your VM.
Try to live migrate the VM. As all data is available on both nodes, it takes just a few seconds. The overall process can take a bit longer if the VM is under load and if there is a lot of RAM involved. In any case, the downtime is minimal and you will see no interruption at all, just as if you used iSCSI or NFS as the storage back-end.
- Warning: try to avoid enabling write cache for any virtual drives on top of DRBD as that can cause out of sync blocks. You need to use 'writethrough' or 'directsync' instead of default 'none'. Follow the link for more information: http://forum.proxmox.com/threads/18259-KVM-on-top-of-DRBD-and-out-of-sync-long-term-investigation-results?p=93126#post93126
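To find virtual disks that still use the default cache mode, you can scan the VM configuration files. The sketch below is only an illustration: the function name `unsafe_cache_disks` is hypothetical, and it assumes disk entries of the form `virtio0: storage:volume,cache=writethrough`, which may differ between Proxmox VE versions, as does the location of the configuration files.

```shell
#!/bin/sh
# Sketch: list disk lines of a VM config file whose cache mode is neither
# writethrough nor directsync. A disk line without cache= uses the
# default mode ("none") and is therefore listed as well.
# Usage: unsafe_cache_disks CONFIG_FILE
unsafe_cache_disks() {
    grep -E '^(ide|scsi|virtio|sata)[0-9]+:' "$1" |
        grep -v 'media=cdrom' |
        grep -Ev 'cache=(writethrough|directsync)'
}

# Example call (the path is an assumption; commented out on purpose):
# for f in /etc/pve/qemu-server/*.conf; do unsafe_cache_disks "$f"; done
```

Any line printed by the helper is a candidate for switching to 'writethrough' or 'directsync'.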
DRBD support
DRBD can be configured in many ways and there is a lot of room for optimization and performance tuning. If you run DRBD in a production environment, we highly recommend the 24/7 support packages from the DRBD developers. The company behind DRBD is Linbit; they are located just a few miles away from the Proxmox headquarters, which makes it easy to cooperate.
Recovery from communication failure
When the communication channel between the two DRBD nodes fails (commonly known as a "split-brain" situation), both nodes continue independently. When the communication channel is re-established the two nodes will no longer have the same data in their DRBD volumes, and re-synchronisation is required.
In the dual-primary configuration used with Proxmox VE, both nodes may have changes to the same DRBD resource that have occurred since the split-brain condition. In this case it will be necessary to discard the data of one of the nodes. Normally one would choose to discard the data of the least-active node.
Virtual disk data on that node can be preserved by using the PVE web interface to migrate those virtual machines on to the node which will have its data preserved. This does mean that these virtual machines will need to be taken out of service while their complete virtual disk data is copied to the other node.
It is therefore advisable to always run "mission critical" VMs on one of the nodes, and use its peer for "development" VMs. Alternatively it is possible to create two DRBD volumes and use one for VMs that normally run on node A, and the other for VMs that are normally on B. Thus little or no forced migration is needed after a split-brain condition.
For specific recovery steps, see DRBD's documentation on resolving split-brain conditions.
Recovery from split brain when using two DRBD volumes
Our example configuration is two DRBD volumes named as follows:
Resource r0 /dev/drbd0
Resource r1 /dev/drbd1
Server names are NODEA and NODEB
NODEA runs the vms stored on r0
NODEB runs the vms stored on r1
Fix resource r1
On NODEA you run these commands:
- Make sure nobody uses the DRBD device (associated with the resource r1) on THIS node before making it 'secondary'. As soon as you make the resource 'secondary', the DRBD device becomes unavailable. Usually this means you have to stop all VMs that use that device.
drbdadm secondary r1
drbdadm -- --discard-my-data connect r1
On NODEB, run this command:
drbdadm connect r1
Now resource r1 will sync changes from NODEB to NODEA
You can watch the progress using this command:
watch cat /proc/drbd
After DRBD is synced run this command on NODEA:
drbdadm primary r1
Fix resource r0
On NODEB you run these commands:
- Make sure nobody uses the DRBD device (associated with the resource r0) on THIS node before making it 'secondary'. As soon as you make the resource 'secondary', the DRBD device becomes unavailable. Usually this means you have to stop all VMs that use that device.
drbdadm secondary r0
drbdadm -- --discard-my-data connect r0
On NODEA, run this command:
drbdadm connect r0
Now resource r0 will sync changes from NODEA to NODEB
You can watch the progress using this command:
watch cat /proc/drbd
After DRBD is synced run this command on NODEB:
drbdadm primary r0
In most cases the recovery will take a few minutes, since DRBD only needs to copy the data that has changed since the split-brain occurred.
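The recovery procedure has the same shape for both resources; only the roles of the two nodes are swapped. The following sketch prints the command sequence for a given resource so you can review it before running anything. The function name `print_sb_recovery` and its argument order are hypothetical; the drbdadm commands are the ones used in the steps above.

```shell
#!/bin/sh
# Sketch: print (not execute) the manual split-brain recovery steps.
# Usage: print_sb_recovery RESOURCE VICTIM SURVIVOR
#   VICTIM   = node whose changes to RESOURCE are discarded
#   SURVIVOR = node whose data is kept
print_sb_recovery() {
    res=$1
    victim=$2
    survivor=$3
    cat <<EOF
# on $victim (first stop all VMs that use $res on this node)
drbdadm secondary $res
drbdadm -- --discard-my-data connect $res
# on $survivor
drbdadm connect $res
# on $victim, once /proc/drbd shows UpToDate/UpToDate
drbdadm primary $res
EOF
}

# Example calls for the two-volume layout described above (commented out):
# print_sb_recovery r1 NODEA NODEB
# print_sb_recovery r0 NODEB NODEA
```

Printing the steps instead of executing them is deliberate: discarding data on the wrong node cannot be undone, so each command should be run by hand on the node named in the comment.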
Recovery from split brain when using one DRBD volume
Our example configuration is one DRBD volume named as follows:
Resource r0 /dev/drbd0
Server names are NODEA and NODEB
Both nodes have VMs running on DRBD resource r0 and a split brain has occurred.
You need to turn off all the VMs on one node. In our example we turn off all the VMs on NODEB.
After turning off all the VMs on NODEB, make a backup of the VMs on NODEB.
Now restore those backups on NODEA. If you are not satisfied that the restored VMs are OK, do not proceed; repeat the steps above instead.
The next steps will destroy the data on NODEB, so be sure you have properly restored your VMs on NODEA first.
Delete all the VMs on NODEB. They have been restored to NODEA with different IDs, so they are not needed on NODEB anymore.
Fix resource r0
On NODEB you run these commands:
drbdadm secondary r0
drbdadm -- --discard-my-data connect r0
On NODEA, run this command:
drbdadm connect r0
Now resource r0 will sync changes from NODEA to NODEB
You can watch the progress using this command:
watch cat /proc/drbd
After DRBD is synced run this command on NODEB:
drbdadm primary r0
Now you can live migrate VMs back to NODEB.
You can avoid this disruptive recovery process by using two DRBD volumes.
Integrity checking
- It is a good idea to enable "data-integrity-alg" for testing purposes and to test for at least a week before production use.
- Try to run "drbdadm verify" once a week (or at least once a month) when the servers are under low load.
# /etc/cron.d/drbdadm-verify-weekly
# Have cron start a verification of all DRBD resources every Monday at 42 minutes past midnight
42 0 * * 1 root /sbin/drbdadm verify all
- Check man drbd.conf, section "NOTES ON DATA INTEGRITY" for more information.
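After a verify run, the out-of-sync counter (`oos:`) in /proc/drbd shows whether any blocks differ between the nodes. As a sketch, the helper below extracts that counter for one minor so it can be checked from a script; the function name `drbd_oos` is hypothetical and the parsing assumes the DRBD 8.3 /proc/drbd layout shown earlier in this HowTo.

```shell
#!/bin/sh
# Sketch: print the out-of-sync counter (oos:) of one DRBD minor.
# Usage: drbd_oos MINOR [FILE]   (FILE defaults to /proc/drbd)
drbd_oos() {
    minor=$1
    file=${2:-/proc/drbd}
    awk -v m="$minor:" '
        # A line starting with "N:" opens the status block of minor N.
        $1 ~ /^[0-9]+:$/ { found = ($1 == m); next }
        # Inside the requested block, report the first oos: field.
        found {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^oos:/) { print substr($i, 5); exit }
        }
    ' "$file"
}

# Example call (commented out so that sourcing this file has no effect):
# [ "$(drbd_oos 0)" = "0" ] || echo "WARNING: drbd0 reports out-of-sync blocks"
```

A non-zero value after verification means the two nodes differ for that resource and the cause should be investigated before relying on the mirror.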
Final considerations
You now have fully redundant storage for your VMs without the need for expensive SAN equipment, configured in about 30 to 60 minutes starting from bare metal. To achieve a similar setup with traditional virtualization solutions and a SAN, you need at least four server boxes plus the storage network.
Traditional setup with SAN (e.g. iSCSI, NFS, FC):
- Two servers for a redundant SAN
- Two servers for redundant virtualization hosts
- Extra storage network, VLANs, FC switches, etc.
- Complex setup
Proxmox VE with DRBD:
- Only two servers with raid controllers and a bunch of hard drives (configure two raid volumes, one for Proxmox VE and one for DRBD device)
Beginning with Kernel 2.6.33, DRBD is integrated into the mainline Kernel (also most 2.6.32 Kernels got it backported). This makes it much easier to maintain and upgrade the systems and can be considered as the "highest quality certificate" for Kernel module developers.
Additionally, DRBD will be the base for high availability (HA) - please help by testing, reporting bugs and issues in our forum or mailing list.