Proxmox VE is based on the famous Debian Linux distribution. That means that you have access to the whole world of Debian packages, and the base system is well documented. The Debian Administrator's Handbook is available online, and provides a comprehensive introduction to the Debian operating system (see [Hertzog13]).

A standard Proxmox VE installation uses the default repositories from Debian, so you get bug fixes and security updates through that channel. In addition, we provide our own package repository to roll out all Proxmox VE related packages. This includes updates to some Debian packages when necessary.

We also deliver a specially optimized Linux kernel, where we enable all required virtualization and container features. That kernel includes drivers for ZFS, and several hardware drivers. For example, we ship Intel network card drivers to support their newest hardware.

The following sections will concentrate on virtualization related topics. They either explain things which are different on Proxmox VE, or tasks which are commonly used on Proxmox VE. For other topics, please refer to the standard Debian documentation.

Package Repositories

All Debian based systems use APT as their package management tool. The list of repositories is defined in /etc/apt/sources.list and in .list files found inside /etc/apt/sources.list.d/. Updates can be installed directly using apt-get, or via the GUI.

Apt sources.list files list one package repository per line, with the most preferred source listed first. Empty lines are ignored, and a # character anywhere on a line marks the remainder of that line as a comment. The information available from the configured sources is acquired by apt-get update.

File /etc/apt/sources.list
deb http://ftp.debian.org/debian stretch main contrib

# security updates
deb http://security.debian.org stretch/updates main contrib
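
To see which entries APT will actually consider, it can help to strip comments and empty lines from a sources file. The following is a minimal sketch under the format rules described above; the function name active_apt_sources is only illustrative and is not part of APT.

```shell
# Print only the active repository lines of an APT sources file.
# Everything from a '#' to the end of a line is treated as a comment,
# and lines that end up empty are skipped.
# The function name is hypothetical.
active_apt_sources() {
    sed -e 's/#.*$//' -e 's/[[:space:]]*$//' -e '/^$/d' "$1"
}
# Example: active_apt_sources /etc/apt/sources.list
```

For the file shown above, this would print the two deb lines without the comment and the blank lines.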

In addition, Proxmox VE provides three different package repositories.

Proxmox VE Enterprise Repository

This is the default, stable and recommended repository, available for all Proxmox VE subscription users. It contains the most stable packages, and is suitable for production use. The pve-enterprise repository is enabled by default:

File /etc/apt/sources.list.d/pve-enterprise.list
deb https://enterprise.proxmox.com/debian/pve stretch pve-enterprise

As soon as updates are available, the root@pam user is notified via email about the available new packages. On the GUI, the change-log of each package can be viewed (if available), showing all details of the update. So you will never miss important security fixes.

Please note that you need a valid subscription key to access this repository. We offer different support levels, and you can find further details at http://www.proxmox.com/en/proxmox-ve/pricing.

Note You can disable this repository by commenting out the above line using a # (at the start of the line). This prevents error messages if you do not have a subscription key. Please configure the pve-no-subscription repository in that case.
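
Commenting out the line can also be scripted. The following is a minimal sketch, assuming GNU sed (the Debian default) and a hypothetical helper name; it prefixes every active deb line in the given file with a # and leaves lines that are already commented out untouched.

```shell
# Comment out every active 'deb' line in the given APT sources file.
# Lines that already start with '#' do not match and are left alone,
# so the function can safely be run more than once.
# The function name is hypothetical; 'sed -i' requires GNU sed.
disable_apt_repo() {
    sed -i -e 's/^[[:space:]]*deb[[:space:]]/# &/' "$1"
}
# Example (as root):
#   disable_apt_repo /etc/apt/sources.list.d/pve-enterprise.list
```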

Proxmox VE No-Subscription Repository

As the name suggests, you do not need a subscription key to access this repository. It can be used for testing and non-production use. It's not recommended to run on production servers, as these packages are not always heavily tested and validated.

We recommend configuring this repository in /etc/apt/sources.list.

File /etc/apt/sources.list
deb http://ftp.debian.org/debian stretch main contrib

# PVE pve-no-subscription repository provided by proxmox.com,
# NOT recommended for production use
deb http://download.proxmox.com/debian/pve stretch pve-no-subscription

# security updates
deb http://security.debian.org stretch/updates main contrib

Proxmox VE Test Repository

Finally, there is a repository called pvetest. This one contains the latest packages and is heavily used by developers to test new features. As usual, you can configure this using /etc/apt/sources.list by adding the following line:

sources.list entry for pvetest
deb http://download.proxmox.com/debian/pve stretch pvetest

Warning The pvetest repository should (as the name implies) only be used for testing new features or bug fixes.

SecureApt

We use GnuPG to sign the Release files inside those repositories, and APT uses those signatures to verify that all packages are from a trusted source.

The key used for verification is already installed if you install from our installation CD. If you install by other means, you can manually download the key with:

# wget http://download.proxmox.com/debian/proxmox-ve-release-5.x.gpg -O /etc/apt/trusted.gpg.d/proxmox-ve-release-5.x.gpg

Please verify the checksum afterwards:

# sha512sum /etc/apt/trusted.gpg.d/proxmox-ve-release-5.x.gpg
ffb95f0f4be68d2e753c8875ea2f8465864a58431d5361e88789568673551501ae574283a4e0492f17d79dc67edfb173a56a6304dea39e01f249ebdabc9f074a  /etc/apt/trusted.gpg.d/proxmox-ve-release-5.x.gpg

or

# md5sum /etc/apt/trusted.gpg.d/proxmox-ve-release-5.x.gpg
511d36d0f1350c01c42a3dc9f3c27939  /etc/apt/trusted.gpg.d/proxmox-ve-release-5.x.gpg
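
Instead of comparing the checksum by eye, you can let sha512sum do the comparison. This is a minimal sketch, assuming GNU coreutils; the function name verify_sha512 is hypothetical, and the expected value has to be taken from a trusted source such as the checksum printed above.

```shell
# Return success only if FILE has the expected SHA-512 checksum.
# Usage: verify_sha512 FILE EXPECTED_HEX   (function name is hypothetical)
verify_sha512() {
    # 'sha512sum -c' reads "<checksum>  <file>" lines, here from stdin;
    # '--status' suppresses all output and reports via the exit code only.
    printf '%s  %s\n' "$2" "$1" | sha512sum -c --status -
}
# Example:
#   verify_sha512 /etc/apt/trusted.gpg.d/proxmox-ve-release-5.x.gpg <expected checksum> \
#       && echo "key file OK"
```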

System Software Updates

We provide regular package updates on all repositories. You can install those updates using the GUI, or you can directly run the CLI command apt-get:

apt-get update
apt-get dist-upgrade

Note The apt package management system is extremely flexible and provides countless features; see man apt-get or [Hertzog13] for additional information.

You should do such updates at regular intervals, or when we release versions with security related fixes. Major system upgrades are announced at the Proxmox VE Community Forum. Those announcements also contain detailed upgrade instructions.

Tip We recommend running regular upgrades, because it is important to get the latest security updates.

Network Configuration

Network configuration can be done either via the GUI, or by manually editing the file /etc/network/interfaces, which contains the whole network configuration. The interfaces(5) manual page contains the complete format description. All Proxmox VE tools try hard to preserve direct user modifications, but using the GUI is still preferable, because it protects you from errors.

Once the network is configured, you can use the traditional Debian tools ifup and ifdown to bring interfaces up and down.

Note Proxmox VE does not write changes directly to /etc/network/interfaces. Instead, we write into a temporary file called /etc/network/interfaces.new, and commit those changes when you reboot the node.

Naming Conventions

We currently use the following naming conventions for device names:

  • Ethernet devices: en*, systemd network interface names. This naming scheme is used for new Proxmox VE installations since version 5.0.

  • Ethernet devices: eth[N], where 0 ≤ N (eth0, eth1, …). This naming scheme is used for Proxmox VE hosts which were installed before the 5.0 release. When upgrading to 5.0, the names are kept as-is.

  • Bridge names: vmbr[N], where 0 ≤ N ≤ 4094 (vmbr0 - vmbr4094)

  • Bonds: bond[N], where 0 ≤ N (bond0, bond1, …)

  • VLANs: Simply add the VLAN number to the device name, separated by a period (eno1.50, bond1.30)

This makes it easier to debug network problems, because the device name implies the device type.
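
Because the name implies the type, a small helper can map a device name back to its type. The following is only a sketch of the conventions listed above; the function name net_device_type is hypothetical, and names outside these conventions are reported as unknown.

```shell
# Guess the device type from an interface name, following the naming
# conventions listed above. The function name is hypothetical.
net_device_type() {
    case "$1" in
        *.*)           echo vlan ;;      # <device>.<VLAN number>, e.g. eno1.50
        vmbr[0-9]*)    echo bridge ;;    # vmbr0, vmbr1, ...
        bond[0-9]*)    echo bond ;;      # bond0, bond1, ...
        en*|eth[0-9]*) echo ethernet ;;  # systemd names (en*) or legacy eth[N]
        *)             echo unknown ;;
    esac
}
# Example: net_device_type bond1.30   # prints "vlan"
```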

Systemd Network Interface Names

Systemd uses the two character prefix en for Ethernet network devices. The next characters depend on the device driver and on which schema matches first.

  • o<index>[n<phys_port_name>|d<dev_port>] — devices on board

  • s<slot>[f<function>][n<phys_port_name>|d<dev_port>] — device by hotplug id

  • [P<domain>]p<bus>s<slot>[f<function>][n<phys_port_name>|d<dev_port>] — devices by bus id

  • x<MAC> — device by MAC address

The most common patterns are:

  • eno1 — is the first on board NIC

  • enp3s0f1 — is the NIC on pcibus 3 slot 0, using NIC function 1.

For more information see Predictable Network Interface Names.

Choosing a network configuration

Depending on your current network organization and your resources you can choose either a bridged, routed, or masquerading networking setup.

Proxmox VE server in a private LAN, using an external gateway to reach the internet

The Bridged model makes the most sense in this case, and this is also the default mode on new Proxmox VE installations. Each of your Guest systems will have a virtual interface attached to the Proxmox VE bridge. This is similar in effect to having the Guest network card directly connected to a new switch on your LAN, the Proxmox VE host playing the role of the switch.

Proxmox VE server at hosting provider, with public IP ranges for Guests

For this setup, you can use either a Bridged or Routed model, depending on what your provider allows.

Proxmox VE server at hosting provider, with a single public IP address

In that case the only way to get outgoing network access for your guest systems is to use Masquerading. For incoming network access to your guests, you will need to configure Port Forwarding.

For further flexibility, you can configure VLANs (IEEE 802.1q) and network bonding, also known as "link aggregation". That way it is possible to build complex and flexible virtual networks.

Default Configuration using a Bridge

Bridges are like physical network switches implemented in software. All VMs can share a single bridge, or you can create multiple bridges to separate network domains. Each host can have up to 4094 bridges.

The installation program creates a single bridge named vmbr0, which is connected to the first Ethernet card. The corresponding configuration in /etc/network/interfaces might look like this:

auto lo
iface lo inet loopback

iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.10.2
        netmask 255.255.255.0
        gateway 192.168.10.1
        bridge_ports eno1
        bridge_stp off
        bridge_fd 0

Virtual machines behave as if they were directly connected to the physical network. The network, in turn, sees each virtual machine as having its own MAC, even though there is only one network cable connecting all of these VMs to the network.

Routed Configuration

Most hosting providers do not support the above setup. For security reasons, they disable networking as soon as they detect multiple MAC addresses on a single interface.

Tip Some providers allow you to register additional MACs on their management interface. This avoids the problem, but is clumsy to configure because you need to register a MAC for each of your VMs.

You can avoid the problem by “routing” all traffic via a single interface. This makes sure that all network packets use the same MAC address.

A common scenario is that you have a public IP (assume 198.51.100.5 for this example), and an additional IP block for your VMs (203.0.113.16/29). We recommend the following setup for such situations:

auto lo
iface lo inet loopback

auto eno1
iface eno1 inet static
        address  198.51.100.5
        netmask  255.255.255.0
        gateway  198.51.100.1
        post-up echo 1 > /proc/sys/net/ipv4/ip_forward
        post-up echo 1 > /proc/sys/net/ipv4/conf/eno1/proxy_arp


auto vmbr0
iface vmbr0 inet static
        address  203.0.113.17
        netmask  255.255.255.248
        bridge_ports none
        bridge_stp off
        bridge_fd 0
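
The example writes the /29 block of the VMs as netmask 255.255.255.248. If you only know the prefix length, the netmask can be derived with shell arithmetic. This is a minimal sketch; the function name prefix_to_netmask is hypothetical.

```shell
# Convert a CIDR prefix length (0-32) into a dotted-quad netmask.
# The function name is hypothetical.
prefix_to_netmask() {
    # Shift a 32-bit all-ones value left and cut it back to 32 bits.
    mask=$(( (0xFFFFFFFF << (32 - $1)) & 0xFFFFFFFF ))
    printf '%d.%d.%d.%d\n' \
        $(( (mask >> 24) & 255 )) $(( (mask >> 16) & 255 )) \
        $(( (mask >> 8) & 255 ))  $(( mask & 255 ))
}
# Example: prefix_to_netmask 29   # prints 255.255.255.248
```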

Masquerading (NAT) with iptables

Masquerading allows guests having only a private IP address to access the network by using the host IP address for outgoing traffic. Each outgoing packet is rewritten by iptables to appear as originating from the host, and responses are rewritten accordingly to be routed to the original sender.

auto lo
iface lo inet loopback

auto eno1
#real IP address
iface eno1 inet static
        address  198.51.100.5
        netmask  255.255.255.0
        gateway  198.51.100.1

auto vmbr0
#private sub network
iface vmbr0 inet static
        address  10.10.10.1
        netmask  255.255.255.0
        bridge_ports none
        bridge_stp off
        bridge_fd 0

        post-up echo 1 > /proc/sys/net/ipv4/ip_forward
        post-up   iptables -t nat -A POSTROUTING -s '10.10.10.0/24' -o eno1 -j MASQUERADE
        post-down iptables -t nat -D POSTROUTING -s '10.10.10.0/24' -o eno1 -j MASQUERADE

Linux Bond

Bonding (also called NIC teaming or Link Aggregation) is a technique for binding multiple NICs to a single network device. It can be used to achieve different goals, such as making the network fault-tolerant, increasing the performance, or both.

High-speed hardware like Fibre Channel and the associated switching hardware can be quite expensive. By doing link aggregation, two NICs can appear as one logical interface, resulting in up to double the speed. This is a native Linux kernel feature that is supported by most switches. If your nodes have multiple Ethernet ports, you can distribute your points of failure by running network cables to different switches, and the bonded connection will fail over to one cable or the other in case of network trouble.

Aggregated links can improve live-migration delays and improve the speed of replication of data between Proxmox VE Cluster nodes.

There are 7 modes for bonding:

  • Round-robin (balance-rr): Transmit network packets in sequential order from the first available network interface (NIC) slave through the last. This mode provides load balancing and fault tolerance.

  • Active-backup (active-backup): Only one NIC slave in the bond is active. A different slave becomes active if, and only if, the active slave fails. The single logical bonded interface’s MAC address is externally visible on only one NIC (port) to avoid distortion in the network switch. This mode provides fault tolerance.

  • XOR (balance-xor): Transmit network packets based on [(source MAC address XOR’d with destination MAC address) modulo NIC slave count]. This selects the same NIC slave for each destination MAC address. This mode provides load balancing and fault tolerance.

  • Broadcast (broadcast): Transmit network packets on all slave network interfaces. This mode provides fault tolerance.

  • IEEE 802.3ad Dynamic link aggregation (802.3ad)(LACP): Creates aggregation groups that share the same speed and duplex settings. Utilizes all slave network interfaces in the active aggregator group according to the 802.3ad specification.

  • Adaptive transmit load balancing (balance-tlb): Linux bonding driver mode that does not require any special network-switch support. The outgoing network packet traffic is distributed according to the current load (computed relative to the speed) on each network interface slave. Incoming traffic is received by one currently designated slave network interface. If this receiving slave fails, another slave takes over the MAC address of the failed receiving slave.

  • Adaptive load balancing (balance-alb): Includes balance-tlb plus receive load balancing (rlb) for IPv4 traffic, and does not require any special network switch support. The receive load balancing is achieved by ARP negotiation. The bonding driver intercepts the ARP Replies sent by the local system on their way out and overwrites the source hardware address with the unique hardware address of one of the NIC slaves in the single logical bonded interface such that different network-peers use different MAC addresses for their network packet traffic.

If your switch supports the LACP (IEEE 802.3ad) protocol, then we recommend using the corresponding bonding mode (802.3ad). Otherwise you should generally use the active-backup mode.
If you intend to run your cluster network on the bonding interfaces, then you have to use active-backup mode on the bonding interfaces. Other modes are unsupported.

The following bond configuration can be used as distributed/shared storage network. The benefit would be that you get more speed and the network will be fault-tolerant.

Example: Use bond with fixed IP address
auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno2 inet manual

iface eno3 inet manual

auto bond0
iface bond0 inet static
      slaves eno1 eno2
      address  192.168.1.2
      netmask  255.255.255.0
      bond_miimon 100
      bond_mode 802.3ad
      bond_xmit_hash_policy layer2+3

auto vmbr0
iface vmbr0 inet static
        address  10.10.10.2
        netmask  255.255.255.0
        gateway  10.10.10.1
        bridge_ports eno3
        bridge_stp off
        bridge_fd 0

Another possibility is to use the bond directly as bridge port. This can be used to make the guest network fault-tolerant.

Example: Use a bond as bridge port
auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno2 inet manual

auto bond0
iface bond0 inet manual
      slaves eno1 eno2
      bond_miimon 100
      bond_mode 802.3ad
      bond_xmit_hash_policy layer2+3

auto vmbr0
iface vmbr0 inet static
        address  10.10.10.2
        netmask  255.255.255.0
        gateway  10.10.10.1
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0

VLAN 802.1Q

A virtual LAN (VLAN) is a broadcast domain that is partitioned and isolated in the network at layer two. So it is possible to have multiple networks (4096) in a physical network, each independent of the other ones.

Each VLAN network is identified by a number often called a tag. Network packets are then tagged to identify which virtual network they belong to.

VLAN for Guest Networks

Proxmox VE supports this setup out of the box. You can specify the VLAN tag when you create a VM. The VLAN tag is part of the guest network configuration. The networking layer supports different modes to implement VLANs, depending on the bridge configuration:

  • VLAN awareness on the Linux bridge: In this case, each guest’s virtual network card is assigned to a VLAN tag, which is transparently supported by the Linux bridge. Trunk mode is also possible, but that makes configuration in the guest necessary.

  • "traditional" VLAN on the Linux bridge: In contrast to the VLAN awareness method, this method is not transparent and creates a VLAN device with an associated bridge for each VLAN. For example, in our default network, using VLAN 5 for a guest creates the devices eno1.5 and vmbr0v5, which remain until the next reboot.

  • Open vSwitch VLAN: This mode uses the OVS VLAN feature.

  • Guest configured VLAN: VLANs are assigned inside the guest. In this case, the setup is completely done inside the guest and can not be influenced from the outside. The benefit is that you can use more than one VLAN on a single virtual NIC.

VLAN on the Host

To allow host communication with an isolated network, it is possible to apply VLAN tags to any network device (NIC, Bond, Bridge). In general, you should configure the VLAN on the interface with the fewest abstraction layers between itself and the physical NIC.

For example, in a default configuration, you might want to place the host management address on a separate VLAN.

Note In the examples we use the VLAN at bridge level to ensure the correct function of VLAN 5 in the guest network. In combination with a VLAN-aware bridge, however, this will not work for guest network VLAN 5. The downside of this setup is higher CPU usage.

Example: Use VLAN 5 for the Proxmox VE management IP
auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno1.5 inet manual

auto vmbr0v5
iface vmbr0v5 inet static
        address  10.10.10.2
        netmask  255.255.255.0
        gateway  10.10.10.1
        bridge_ports eno1.5
        bridge_stp off
        bridge_fd 0

auto vmbr0
iface vmbr0 inet manual
        bridge_ports eno1
        bridge_stp off
        bridge_fd 0

The next example is the same setup but a bond is used to make this network fail-safe.

Example: Use VLAN 5 with bond0 for the Proxmox VE management IP
auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno2 inet manual

auto bond0
iface bond0 inet manual
      slaves eno1 eno2
      bond_miimon 100
      bond_mode 802.3ad
      bond_xmit_hash_policy layer2+3

iface bond0.5 inet manual

auto vmbr0v5
iface vmbr0v5 inet static
        address  10.10.10.2
        netmask  255.255.255.0
        gateway  10.10.10.1
        bridge_ports bond0.5
        bridge_stp off
        bridge_fd 0

auto vmbr0
iface vmbr0 inet manual
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0

Time Synchronization

The Proxmox VE cluster stack itself relies heavily on the fact that all the nodes have precisely synchronized time. Some other components, like Ceph, also refuse to work properly if the local time on nodes is not in sync.

Time synchronization between nodes can be achieved with the “Network Time Protocol” (NTP). Proxmox VE uses systemd-timesyncd as NTP client by default, preconfigured to use a set of public servers. This setup works out of the box in most cases.

Using Custom NTP Servers

In some cases, it might be desired to not use the default NTP servers. For example, if your Proxmox VE nodes do not have access to the public internet (e.g., because of restrictive firewall rules), you need to set up local NTP servers and tell systemd-timesyncd to use them:

File /etc/systemd/timesyncd.conf
[Time]
NTP=ntp1.example.com ntp2.example.com ntp3.example.com ntp4.example.com

After restarting the synchronization service (systemctl restart systemd-timesyncd) you should verify that your newly configured NTP servers are used by checking the journal (journalctl --since -1h -u systemd-timesyncd):

...
Oct 07 14:58:36 node1 systemd[1]: Stopping Network Time Synchronization...
Oct 07 14:58:36 node1 systemd[1]: Starting Network Time Synchronization...
Oct 07 14:58:36 node1 systemd[1]: Started Network Time Synchronization.
Oct 07 14:58:36 node1 systemd-timesyncd[13514]: Using NTP server 10.0.0.1:123 (ntp1.example.com).
Oct 07 14:58:36 node1 systemd-timesyncd[13514]: interval/delta/delay/jitter/drift 64s/-0.002s/0.020s/0.000s/-31ppm
...

External Metric Server

Starting with Proxmox VE 4.0, you can define external metric servers, which will be sent various stats about your hosts, virtual machines and storages.

Currently supported are:

  • Graphite

  • InfluxDB

The server definitions are saved in /etc/pve/status.cfg.

Graphite server configuration

The definition of a server is:

graphite:
   server your-server
   port your-port
   path your-path

where your-port defaults to 2003 and your-path defaults to proxmox.

Proxmox VE sends the data over UDP, so the graphite server has to be configured for this.

Influxdb plugin configuration

The definition is:

influxdb:
   server your-server
   port your-port

Proxmox VE sends the data over UDP, so the influxdb server has to be configured for this.

Here is an example configuration for influxdb (on your influxdb server):

[[udp]]
  enabled = true
  bind-address = "0.0.0.0:8089"
  database = "proxmox"
  batch-size = 1000
  batch-timeout = "1s"

With this configuration, your server listens on all IP addresses on port 8089, and writes the data to the proxmox database.

Disk Health Monitoring

Although a robust and redundant storage is recommended, it can be very helpful to monitor the health of your local disks.

Starting with Proxmox VE 4.3, the package smartmontools (see https://www.smartmontools.org) is installed and required. This is a set of tools to monitor and control the S.M.A.R.T. system for local hard disks.

You can get the status of a disk by issuing the following command:

# smartctl -a /dev/sdX

where /dev/sdX is the path to one of your local disks.

If the output says:

SMART support is: Disabled

you can enable it with the command:

# smartctl -s on /dev/sdX

For more information on how to use smartctl, please see man smartctl.

By default, smartmontools daemon smartd is active and enabled, and scans the disks under /dev/sdX and /dev/hdX every 30 minutes for errors and warnings, and sends an e-mail to root if it detects a problem.

For more information about how to configure smartd, please see man smartd and man smartd.conf.
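
If you have several disks, the check and the enable step can be combined in a loop. This is a minimal sketch, assuming your disks appear as /dev/sda to /dev/sdz and that smartctl prints the "SMART support is: Disabled" line shown above; both function names are hypothetical.

```shell
# Succeed if the smartctl output read from stdin reports SMART as disabled.
# It looks for the "SMART support is: Disabled" line shown above.
smart_is_disabled() {
    grep -q 'SMART support is:[[:space:]]*Disabled'
}

# Enable SMART on every /dev/sdX disk that reports it as disabled (run as root).
# Both function names are hypothetical.
enable_smart_on_all_disks() {
    for disk in /dev/sd[a-z]; do
        [ -b "$disk" ] || continue      # skip if the glob matched no block device
        if smartctl -a "$disk" | smart_is_disabled; then
            smartctl -s on "$disk"
        fi
    done
}
```

Calling enable_smart_on_all_disks changes disk settings, so you may prefer to run only the smartctl -a check first.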

If you use your hard disks with a hardware RAID controller, there are most likely tools to monitor the disks in the RAID array and the array itself. For more information about this, please refer to the vendor of your RAID controller.

Logical Volume Manager (LVM)

Most people install Proxmox VE directly on a local disk. The Proxmox VE installation CD offers several options for local disk management, and the current default setup uses LVM. The installer lets you select a single disk for such a setup, and uses that disk as the physical volume for the Volume Group (VG) pve. The following output is from a test installation using a small 8GB disk:

# pvs
  PV         VG   Fmt  Attr PSize PFree
  /dev/sda3  pve  lvm2 a--  7.87g 876.00m

# vgs
  VG   #PV #LV #SN Attr   VSize VFree
  pve    1   3   0 wz--n- 7.87g 876.00m

The installer allocates three Logical Volumes (LV) inside this VG:

# lvs
  LV   VG   Attr       LSize   Pool Origin Data%  Meta%
  data pve  twi-a-tz--   4.38g             0.00   0.63
  root pve  -wi-ao----   1.75g
  swap pve  -wi-ao---- 896.00m

root

Formatted as ext4, and contains the operating system.

swap

Swap partition

data

This volume uses LVM-thin, and is used to store VM images. LVM-thin is preferable for this task, because it offers efficient support for snapshots and clones.

For Proxmox VE versions up to 4.1, the installer creates a standard logical volume called “data”, which is mounted at /var/lib/vz.

Starting from version 4.2, the logical volume “data” is an LVM-thin pool, used to store block based guest images, and /var/lib/vz is simply a directory on the root file system.

Hardware

We highly recommend using a hardware RAID controller (with BBU) for such setups. This increases performance, provides redundancy, and makes disk replacements easier (hot-pluggable).

LVM itself does not need any special hardware, and memory requirements are very low.

Bootloader

We install two boot loaders by default. The first partition contains the standard GRUB boot loader. The second partition is an EFI System Partition (ESP), which makes it possible to boot on EFI systems.

Creating a Volume Group

Let’s assume we have an empty disk /dev/sdb, onto which we want to create a volume group named “vmdata”.

Caution Please note that the following commands will destroy all existing data on /dev/sdb.

First create a partition.

# sgdisk -N 1 /dev/sdb

Create a Physical Volume (PV) without confirmation and with a metadata size of 250K.

# pvcreate --metadatasize 250k -y -ff /dev/sdb1

Create a volume group named “vmdata” on /dev/sdb1

# vgcreate vmdata /dev/sdb1

Creating an extra LV for /var/lib/vz

This can be easily done by creating a new thin LV.

# lvcreate -n <Name> -V <Size[M,G,T]> <VG>/<LVThin_pool>

A real world example:

# lvcreate -n vz -V 10G pve/data

Now a filesystem must be created on the LV.

# mkfs.ext4 /dev/pve/vz

Finally, it has to be mounted.

Warning Be sure that /var/lib/vz is empty. On a default installation it’s not.

To make it always accessible, add the following line to /etc/fstab.

# echo '/dev/pve/vz /var/lib/vz ext4 defaults 0 2' >> /etc/fstab
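
Running the echo command twice would add the same line twice. As a hedged alternative, the following sketch appends an entry only if an identical line is not already present; the function name add_fstab_entry is hypothetical. With the entry in place, mount /var/lib/vz should be able to mount the volume using the options from /etc/fstab.

```shell
# Append an entry to an fstab-style file unless an identical line already exists.
# Usage: add_fstab_entry FILE 'ENTRY'   (function name is hypothetical)
add_fstab_entry() {
    # -x matches whole lines only, -F treats the entry as a fixed string.
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}
# Example (as root):
#   add_fstab_entry /etc/fstab '/dev/pve/vz /var/lib/vz ext4 defaults 0 2'
#   mount /var/lib/vz
```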

Resizing the thin pool

Resizing the LV and the metadata pool can be achieved with the following command.

# lvresize --size +<size[M,G,T]> --poolmetadatasize +<size[M,G]> <VG>/<LVThin_pool>

Note When extending the data pool, the metadata pool must also be extended.

Create an LVM-thin pool

A thin pool has to be created on top of a volume group. To learn how to create a volume group, see Section LVM.

# lvcreate -L 80G -T -n vmstore vmdata

ZFS on Linux

ZFS is a combined file system and logical volume manager designed by Sun Microsystems. Starting with Proxmox VE 3.4, the native Linux kernel port of the ZFS file system is available as an optional file system and also as an additional selection for the root file system. There is no need to manually compile ZFS modules; all packages are included.

By using ZFS, it’s possible to achieve maximum enterprise features with low budget hardware, but also high performance systems by leveraging SSD caching or even SSD only setups. ZFS can replace cost-intensive hardware RAID cards with moderate CPU and memory load combined with easy management.

General ZFS advantages
  • Easy configuration and management with Proxmox VE GUI and CLI.

  • Reliable

  • Protection against data corruption

  • Data compression on file system level

  • Snapshots

  • Copy-on-write clone

  • Various raid levels: RAID0, RAID1, RAID10, RAIDZ-1, RAIDZ-2 and RAIDZ-3

  • Can use SSD for cache

  • Self healing

  • Continuous integrity checking

  • Designed for high storage capacities

  • Asynchronous replication over network

  • Open Source

  • Encryption

Hardware

ZFS depends heavily on memory, so you need at least 8GB to start. In practice, use as much as you can get for your hardware/budget. To prevent data corruption, we recommend the use of high quality ECC RAM.

If you use a dedicated cache and/or log disk, you should use an enterprise class SSD (e.g. Intel SSD DC S3700 Series). This can increase the overall performance significantly.

Important Do not use ZFS on top of a hardware controller which has its own cache management. ZFS needs to directly communicate with disks. An HBA adapter is the way to go, or something like an LSI controller flashed in “IT” mode.

If you are experimenting with an installation of Proxmox VE inside a VM (Nested Virtualization), don’t use virtio for disks of that VM, since they are not supported by ZFS. Use IDE or SCSI instead (this also works with the virtio SCSI controller type).

Installation as Root File System

When you install using the Proxmox VE installer, you can choose ZFS for the root file system. You need to select the RAID type at installation time:

RAID0

Also called “striping”. The capacity of such volume is the sum of the capacities of all disks. But RAID0 does not add any redundancy, so the failure of a single drive makes the volume unusable.

RAID1

Also called “mirroring”. Data is written identically to all disks. This mode requires at least 2 disks with the same size. The resulting capacity is that of a single disk.

RAID10

A combination of RAID0 and RAID1. Requires at least 4 disks.

RAIDZ-1

A variation on RAID-5, single parity. Requires at least 3 disks.

RAIDZ-2

A variation on RAID-5, double parity. Requires at least 4 disks.

RAIDZ-3

A variation on RAID-5, triple parity. Requires at least 5 disks.

The installer automatically partitions the disks, creates a ZFS pool called rpool, and installs the root file system on the ZFS subvolume rpool/ROOT/pve-1.

Another subvolume called rpool/data is created to store VM images. In order to use that with the Proxmox VE tools, the installer creates the following configuration entry in /etc/pve/storage.cfg:

zfspool: local-zfs
        pool rpool/data
        sparse
        content images,rootdir

After installation, you can view your ZFS pool status using the zpool command:

# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

errors: No known data errors

The zfs command is used to configure and manage your ZFS file systems. The following command lists all file systems after installation:

# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool             4.94G  7.68T    96K  /rpool
rpool/ROOT         702M  7.68T    96K  /rpool/ROOT
rpool/ROOT/pve-1   702M  7.68T   702M  /
rpool/data          96K  7.68T    96K  /rpool/data
rpool/swap        4.25G  7.69T    64K  -

Bootloader

The default ZFS disk partitioning scheme does not use the first 2048 sectors. This gives enough room to install a GRUB boot partition. The Proxmox VE installer automatically allocates that space, and installs the GRUB boot loader there. If you use a redundant RAID setup, it installs the boot loader on all disks required for booting. So you can boot even if some disks fail.

Note It is not possible to use ZFS as root file system with UEFI boot.

ZFS Administration

This section gives you some usage examples for common tasks. ZFS itself is really powerful and provides many options. The main commands to manage ZFS are zfs and zpool. Both commands come with great manual pages, which can be read with:

# man zpool
# man zfs

Create a new zpool

To create a new pool, at least one disk is needed. Choose ashift so that the pool sector size (2 to the power of ashift) is equal to or larger than the sector size of the underlying disk.

zpool create -f -o ashift=12 <pool> <device>
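As a rough illustration of how sector size and ashift relate, the following sketch computes the smallest ashift whose sector size (2 to the power of ashift) is at least a given size in bytes. The function name ashift_for_sector_size is hypothetical and not part of ZFS. On a real system, the physical sector size of a disk can be queried with blockdev --getpbsz <device>.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZFS): print the smallest ashift such that
# 2^ashift is equal to or larger than the given sector size in bytes.
ashift_for_sector_size() {
    sector_size=$1
    ashift=0
    pool_sector=1
    while [ "$pool_sector" -lt "$sector_size" ]; do
        pool_sector=$((pool_sector * 2))
        ashift=$((ashift + 1))
    done
    echo "$ashift"
}

# A disk with 4096-byte physical sectors needs at least ashift=12.
ashift_for_sector_size 4096
# A disk with 512-byte sectors needs at least ashift=9.
ashift_for_sector_size 512
```

A larger ashift than strictly required is permitted by the rule above, which is why ashift=12 appears throughout the examples.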

To activate compression:

zfs set compression=lz4 <pool>

Create a new pool with RAID-0

Minimum 1 disk

zpool create -f -o ashift=12 <pool> <device1> <device2>

Create a new pool with RAID-1

Minimum 2 disks

zpool create -f -o ashift=12 <pool> mirror <device1> <device2>

Create a new pool with RAID-10

Minimum 4 disks

zpool create -f -o ashift=12 <pool> mirror <device1> <device2> mirror <device3> <device4>

Create a new pool with RAIDZ-1

Minimum 3 disks

zpool create -f -o ashift=12 <pool> raidz1 <device1> <device2> <device3>

Create a new pool with RAIDZ-2

Minimum 4 disks

zpool create -f -o ashift=12 <pool> raidz2 <device1> <device2> <device3> <device4>

Create a new pool with cache (L2ARC)

It is possible to use a dedicated cache drive partition to increase performance (use an SSD).

For <device>, more than one device can be given, as shown in "Create a new pool with RAID*".

zpool create -f -o ashift=12 <pool> <device> cache <cache_device>

Create a new pool with log (ZIL)

It is possible to use a dedicated log drive partition to increase performance (use an SSD).

For <device>, more than one device can be given, as shown in "Create a new pool with RAID*".

zpool create -f -o ashift=12 <pool> <device> log <log_device>

Add cache and log to an existing pool

If you have a pool without cache and log, first partition the SSD into two partitions with parted or gdisk.

Important Always use GPT partition tables.

The maximum size of a log device should be about half the size of physical memory, so this is usually quite small. The rest of the SSD can be used as cache.
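The sizing rule above can be sketched as a small calculation. The function name log_partition_mib is hypothetical. The sketch assumes a Linux host where MemTotal in /proc/meminfo reports physical memory in KiB, and it only prints a suggested upper bound without partitioning anything.

```shell
#!/bin/sh
# Hypothetical helper: given the physical memory in KiB, print about half of
# it in MiB as an upper bound for the log partition size.
log_partition_mib() {
    kib=$1
    echo $((kib / 2 / 1024))
}

# Example: a host with 32 GiB of RAM (33554432 KiB) gives 16384 MiB.
log_partition_mib 33554432

# On a Linux host, read the installed memory from /proc/meminfo.
if [ -r /proc/meminfo ]; then
    total_kib=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    echo "Suggested maximum log partition size: $(log_partition_mib "$total_kib") MiB"
fi
```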

zpool add -f <pool> log <device-part1> cache <device-part2>

Changing a failed device

zpool replace -f <pool> <old device> <new-device>

Activate E-Mail Notification

ZFS comes with an event daemon, which monitors events generated by the ZFS kernel module. The daemon can also send emails on ZFS events like pool errors. Newer ZFS packages ship the daemon in a separate package, and you can install it using apt-get:

# apt-get install zfs-zed

To activate the daemon it is necessary to edit /etc/zfs/zed.d/zed.rc with your favourite editor, and uncomment the ZED_EMAIL_ADDR setting:

ZED_EMAIL_ADDR="root"

Please note that Proxmox VE forwards mail addressed to root to the email address configured for the root user.

Important The only setting that is required is ZED_EMAIL_ADDR. All other settings are optional.
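To confirm that the setting is no longer commented out, a check along the following lines can be used. The function name zed_email_enabled is hypothetical. The sketch only inspects the file and does not verify that mail delivery works.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given zed.rc-style file contains an
# uncommented ZED_EMAIL_ADDR assignment.
zed_email_enabled() {
    grep -q '^[[:space:]]*ZED_EMAIL_ADDR=' "$1"
}

rc_file=/etc/zfs/zed.d/zed.rc
if [ -r "$rc_file" ]; then
    if zed_email_enabled "$rc_file"; then
        echo "ZED_EMAIL_ADDR is set in $rc_file"
    else
        echo "ZED_EMAIL_ADDR is still commented out in $rc_file"
    fi
fi
```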

Limit ZFS Memory Usage

It is advisable to let the ZFS ARC use at most 50 percent of the system memory (which is the default) to avoid performance problems on the host. Use your preferred editor to change the configuration in /etc/modprobe.d/zfs.conf and insert:

options zfs zfs_arc_max=8589934592

This example setting limits the usage to 8 GiB (the value is given in bytes).
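Because zfs_arc_max is given in bytes, it can be convenient to compute the value rather than type it by hand. The following is a minimal sketch; the function name arc_max_option is hypothetical, and the output still has to be placed in /etc/modprobe.d/zfs.conf manually.

```shell
#!/bin/sh
# Hypothetical helper: print the modprobe option line that limits the ZFS ARC
# to the given number of GiB (zfs_arc_max is specified in bytes).
arc_max_option() {
    gib=$1
    echo "options zfs zfs_arc_max=$((gib * 1024 * 1024 * 1024))"
}

# Prints the line used in the example above (8 GiB).
arc_max_option 8
```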

Important If your root file system is ZFS, you must update your initramfs every time this value changes:

update-initramfs -u

SWAP on ZFS

Swap on ZFS on Linux may cause problems, such as blocking the server or generating a high IO load. This is often seen when starting a backup to an external storage.

We strongly recommend using enough memory, so that you normally do not run into low memory situations. Additionally, you can lower the “swappiness” value. A good value for servers is 10:

sysctl -w vm.swappiness=10

To make the swappiness persistent, open /etc/sysctl.conf with an editor of your choice and add the following line:

vm.swappiness = 10

Table 1. Linux kernel swappiness parameter values

Value                  Strategy

vm.swappiness = 0      The kernel will swap only to avoid an out of memory condition.
vm.swappiness = 1      Minimum amount of swapping without disabling it entirely.
vm.swappiness = 10     This value is sometimes recommended to improve performance when sufficient memory exists in a system.
vm.swappiness = 60     The default value.
vm.swappiness = 100    The kernel will swap aggressively.
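When the persistent swappiness setting is applied by a script rather than by hand, it helps to avoid appending duplicate lines. The following sketch rewrites an existing vm.swappiness line or appends one if none exists. The function name set_swappiness_conf is hypothetical, and the demonstration deliberately works on a scratch file instead of /etc/sysctl.conf.

```shell
#!/bin/sh
# Hypothetical helper: ensure FILE contains a "vm.swappiness = VALUE" line,
# rewriting an existing setting instead of appending a duplicate.
set_swappiness_conf() {
    conf_file=$1
    value=$2
    if grep -q '^[[:space:]]*vm\.swappiness[[:space:]]*=' "$conf_file" 2>/dev/null; then
        tmp_file=$(mktemp)
        sed "s/^[[:space:]]*vm\.swappiness[[:space:]]*=.*/vm.swappiness = $value/" \
            "$conf_file" > "$tmp_file"
        cat "$tmp_file" > "$conf_file"
        rm -f "$tmp_file"
    else
        echo "vm.swappiness = $value" >> "$conf_file"
    fi
}

# Demonstration on a scratch file: the second call replaces the first value.
demo_conf=$(mktemp)
set_swappiness_conf "$demo_conf" 60
set_swappiness_conf "$demo_conf" 10
cat "$demo_conf"
rm -f "$demo_conf"
```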

Certificate Management

Certificates for communication within the cluster

Each Proxmox VE cluster creates its own internal Certificate Authority (CA) and generates a self-signed certificate for each node. These certificates are used for encrypted communication with the cluster’s pveproxy service and the Shell/Console feature if SPICE is used.

The CA certificate and key are stored in the Proxmox Cluster File System (pmxcfs).

Certificates for API and web GUI

The REST API and web GUI are provided by the pveproxy service, which runs on each node.

You have the following options for the certificate used by pveproxy:

  1. By default, the node-specific certificate in /etc/pve/nodes/NODENAME/pve-ssl.pem is used. This certificate is signed by the cluster CA and therefore not trusted by browsers and operating systems by default.

  2. Use an externally provided certificate (e.g., signed by a commercial CA).

  3. Use ACME (e.g., Let’s Encrypt) to get a trusted certificate with automatic renewal.

For options 2 and 3, the file /etc/pve/local/pveproxy-ssl.pem (and /etc/pve/local/pveproxy-ssl.key, which must not be password-protected) is used.

Certificates are managed with the Proxmox VE Node management command (see the pvenode(1) manpage).

Warning Do not replace or manually modify the automatically generated node certificate files in /etc/pve/local/pve-ssl.pem and /etc/pve/local/pve-ssl.key or the cluster CA files in /etc/pve/pve-root-ca.pem and /etc/pve/priv/pve-root-ca.key.

Getting trusted certificates via ACME

Proxmox VE includes an implementation of the Automatic Certificate Management Environment (ACME) protocol, allowing Proxmox VE admins to interface with Let’s Encrypt for easy setup of trusted TLS certificates which are accepted out of the box on most modern operating systems and browsers.

Currently the two ACME endpoints implemented are Let’s Encrypt (LE) and its staging environment (see https://letsencrypt.org), both using the standalone HTTP challenge.

Because of rate limits, you should use the LE staging environment for experiments.

There are a few prerequisites to use Let’s Encrypt:

  1. Port 80 of the node needs to be reachable from the internet.

  2. There must be no other listener on port 80.

  3. The requested (sub)domain needs to resolve to a public IP of the node.

  4. You have to accept the ToS of Let’s Encrypt.
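The second prerequisite can be checked locally before ordering a certificate. The following sketch parses the output of ss -ltn and reports whether any socket listens on port 80. The function name has_listener_on_port is hypothetical, and the sketch assumes that the local address is the fourth column of the ss output. It does not check reachability from the internet or DNS resolution.

```shell
#!/bin/sh
# Hypothetical helper: read "ss -ltn"-style output on stdin and succeed if
# any listening socket is bound to the given port.
has_listener_on_port() {
    awk -v suffix=":$1" '
        $1 == "LISTEN" {
            addr = $4
            if (substr(addr, length(addr) - length(suffix) + 1) == suffix)
                found = 1
        }
        END { exit !found }
    '
}

# Only run the check if the ss utility is available on this host.
if command -v ss >/dev/null 2>&1; then
    if ss -ltn | has_listener_on_port 80; then
        echo "Port 80 is already in use"
    else
        echo "Port 80 is free"
    fi
fi
```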

At the moment the GUI uses only the default ACME account.

Example: Sample pvenode invocation for using Let’s Encrypt certificates
root@proxmox:~# pvenode acme account register default mail@example.invalid
Directory endpoints:
0) Let's Encrypt V2 (https://acme-v02.api.letsencrypt.org/directory)
1) Let's Encrypt V2 Staging (https://acme-staging-v02.api.letsencrypt.org/directory)
2) Custom
Enter selection:
1

Attempting to fetch Terms of Service from 'https://acme-staging-v02.api.letsencrypt.org/directory'..
Terms of Service: https://letsencrypt.org/documents/LE-SA-v1.2-November-15-2017.pdf
Do you agree to the above terms? [y|N]y

Attempting to register account with 'https://acme-staging-v02.api.letsencrypt.org/directory'..
Generating ACME account key..
Registering ACME account..
Registration successful, account URL: 'https://acme-staging-v02.api.letsencrypt.org/acme/acct/xxxxxxx'
Task OK
root@proxmox:~# pvenode acme account list
default
root@proxmox:~# pvenode config set --acme domains=example.invalid
root@proxmox:~# pvenode acme cert order
Loading ACME account details
Placing ACME order
Order URL: https://acme-staging-v02.api.letsencrypt.org/acme/order/xxxxxxxxxxxxxx

Getting authorization details from
'https://acme-staging-v02.api.letsencrypt.org/acme/authz/xxxxxxxxxxxxxxxxxxxxx-xxxxxxxxxxxxx-xxxxxxx'
... pending!
Setting up webserver
Triggering validation
Sleeping for 5 seconds
Status is 'valid'!

All domains validated!

Creating CSR
Finalizing order
Checking order status
valid!

Downloading certificate
Setting pveproxy certificate and key
Restarting pveproxy
Task OK

Automatic renewal of ACME certificates

If a node has been successfully configured with an ACME-provided certificate (either via pvenode or via the GUI), the certificate will be automatically renewed by the pve-daily-update.service. Currently, renewal will be attempted if the certificate has expired or will expire in the next 30 days.
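To see whether a certificate currently falls inside the 30-day window described above, it can be inspected with openssl. This is only a manual check under the assumption that the certificate is stored in /etc/pve/local/pveproxy-ssl.pem; it is not a description of how the renewal service itself decides. The function name days_to_seconds is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert a number of days to seconds, the unit expected
# by "openssl x509 -checkend".
days_to_seconds() {
    echo $(( $1 * 24 * 60 * 60 ))
}

cert=/etc/pve/local/pveproxy-ssl.pem
if command -v openssl >/dev/null 2>&1 && [ -r "$cert" ]; then
    # -checkend exits non-zero if the certificate expires within the window.
    if openssl x509 -in "$cert" -noout -checkend "$(days_to_seconds 30)" >/dev/null; then
        echo "Certificate is valid for more than 30 days"
    else
        echo "Certificate has expired or expires within 30 days"
    fi
fi
```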