1. Introduction

Proxmox VE is a platform to run virtual machines and containers. It is based on Debian Linux and is completely open source. For maximum flexibility, we implemented two virtualization technologies: Kernel-based Virtual Machine (KVM) and container-based virtualization (LXC).

One main design goal was to make administration as easy as possible. You can use Proxmox VE on a single node, or assemble a cluster of many nodes. All management tasks can be done using our web-based management interface, and even a novice user can set up and install Proxmox VE within minutes.

Proxmox Software Stack

1.1. Central Management

While many people start with a single node, Proxmox VE can scale out to a large set of clustered nodes. The cluster stack is fully integrated and ships with the default installation.

Unique Multi-Master Design

The integrated web-based management interface gives you a clean overview of all your KVM guests and Linux containers and even of your whole cluster. You can easily manage your VMs and containers, storage or cluster from the GUI. There is no need to install a separate, complex, and pricey management server.

Proxmox Cluster File System (pmxcfs)

Proxmox VE uses the unique Proxmox Cluster File System (pmxcfs), a database-driven file system for storing configuration files. This enables you to store the configuration of thousands of virtual machines. By using corosync, these files are replicated in real time on all cluster nodes. The file system stores all data inside a persistent database on disk. A copy of the data also resides in RAM, which limits the maximum storage size to 30 MB, more than enough for thousands of VMs.

Proxmox VE is the only virtualization platform using this unique cluster file system.

Web-based Management Interface

Proxmox VE is simple to use. Management tasks can be done via the included web-based management interface; there is no need to install a separate management tool or any additional management node with huge databases. The multi-master tool allows you to manage your whole cluster from any node of your cluster. The central web-based management interface, based on the ExtJS JavaScript framework, lets you control all functionality from the GUI and view the history and syslogs of each node. This includes running backup or restore jobs, live migration, or HA-triggered activities.

Command Line

For advanced users who are used to the comfort of the Unix shell or Windows PowerShell, Proxmox VE provides a command-line interface to manage all the components of your virtual environment. This command-line interface has intelligent tab completion and full documentation in the form of UNIX man pages.

REST API

Proxmox VE uses a RESTful API. We chose JSON as the primary data format, and the whole API is formally defined using JSON Schema. This enables fast and easy integration for third-party management tools like custom hosting environments.
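As a rough illustration of such an integration, the sketch below builds the URL of an API endpoint and shows, in a comment, how it might be queried with curl. The helper name pve_api_url, the host address 192.0.2.10, and the token placeholders are assumptions made for this example; port 8006 is the port of the web interface.

```shell
#!/bin/sh
# pve_api_url HOST PATH
# Hypothetical helper: prints the JSON API URL for PATH on HOST.
pve_api_url() {
    printf 'https://%s:8006/api2/json%s\n' "$1" "$2"
}

# Print the URL of the version endpoint on an example host.
pve_api_url 192.0.2.10 /version

# A call with an API token could then look like this (placeholders throughout):
#   curl -s -H 'Authorization: PVEAPIToken=USER@REALM!TOKENID=SECRET' \
#        "$(pve_api_url 192.0.2.10 /version)"
```

The response is a JSON document, so it can be passed to any JSON-aware tool for further processing.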

Role-based Administration

You can define granular access for all objects (like VMs, storages, nodes, etc.) by using the role-based user and permission management. This allows you to define privileges and helps you to control access to objects. This concept is also known as access control lists: each permission specifies a subject (a user or group) and a role (a set of privileges) on a specific path.

Authentication Realms

Proxmox VE supports multiple authentication sources like Microsoft Active Directory, LDAP, Linux PAM standard authentication or the built-in Proxmox VE authentication server.

1.2. Flexible Storage

The Proxmox VE storage model is very flexible. Virtual machine images can be stored either on one or several local storages or on shared storage such as NFS or a SAN. There are no limits; you may configure as many storage definitions as you like. You can use all storage technologies available for Debian Linux.

One major benefit of storing VMs on shared storage is the ability to live-migrate running machines without any downtime, as all nodes in the cluster have direct access to VM disk images.

We currently support the following network storage types:

  • LVM Group (network backing with iSCSI targets)

  • iSCSI target

  • NFS Share

  • CIFS Share

  • Ceph RBD

  • Directly use iSCSI LUNs

  • GlusterFS

Local storage types supported are:

  • LVM Group (local backing devices like block devices, FC devices, DRBD, etc.)

  • Directory (storage on existing filesystem)

  • ZFS

1.3. Integrated Backup and Restore

The integrated backup tool (vzdump) creates consistent snapshots of running Containers and KVM guests. It basically creates an archive of the VM or CT data which includes the VM/CT configuration files.

KVM live backup works for all storage types, including VM images on NFS, CIFS, iSCSI LUN, and Ceph RBD. The new backup format is optimized for storing VM backups quickly and effectively (sparse files, out-of-order data, minimized I/O).

1.4. High Availability Cluster

A multi-node Proxmox VE HA Cluster enables the definition of highly available virtual servers. The Proxmox VE HA Cluster is based on proven Linux HA technologies, providing stable and reliable HA services.

1.5. Flexible Networking

Proxmox VE uses a bridged networking model. All VMs can share one bridge as if virtual network cables from each guest were all plugged into the same switch. For connecting VMs to the outside world, bridges are attached to physical network cards and assigned a TCP/IP configuration.

For further flexibility, VLANs (IEEE 802.1q) and network bonding/aggregation are possible. In this way it is possible to build complex, flexible virtual networks for the Proxmox VE hosts, leveraging the full power of the Linux network stack.

1.6. Integrated Firewall

The integrated firewall allows you to filter network packets on any VM or Container interface. Common sets of firewall rules can be grouped into “security groups”.

1.7. Hyper-converged Infrastructure

Proxmox VE is a virtualization platform that tightly integrates compute, storage and networking resources, manages highly available clusters, backup/restore as well as disaster recovery. All components are software-defined and compatible with one another.

Therefore, it is possible to administer them like a single system via the centralized web management interface. These capabilities make Proxmox VE an ideal choice to deploy and manage an open source hyper-converged infrastructure.

1.7.1. Benefits of a Hyper-Converged Infrastructure (HCI) with Proxmox VE

A hyper-converged infrastructure (HCI) is especially useful for deployments in which a high infrastructure demand meets a low administration budget, for distributed setups such as remote and branch office environments or for virtual private and public clouds.

HCI provides the following advantages:

  • Scalability: seamless expansion of compute, network and storage devices (i.e. scale up servers and storage quickly and independently from each other).

  • Low cost: Proxmox VE is open source and integrates all components you need such as compute, storage, networking, backup, and management center. It can replace an expensive compute/storage infrastructure.

  • Data protection and efficiency: services such as backup and disaster recovery are integrated.

  • Simplicity: easy configuration and centralized administration.

  • Open Source: No vendor lock-in.

1.7.2. Hyper-Converged Infrastructure: Storage

Proxmox VE has tightly integrated support for deploying a hyper-converged storage infrastructure. You can, for example, deploy and manage the following two storage technologies by using the web interface only:

Besides the above, Proxmox VE supports integrating a wide range of additional storage technologies. You can find out about them in the Storage Manager chapter.

1.8. Why Open Source

Proxmox VE uses a Linux kernel and is based on the Debian GNU/Linux Distribution. The source code of Proxmox VE is released under the GNU Affero General Public License, version 3. This means that you are free to inspect the source code at any time or contribute to the project yourself.

At Proxmox we are committed to using open source software whenever possible. Using open source software guarantees full access to all functionalities, as well as high security and reliability. We think that everybody should have the right to access the source code of software to run it, build on it, or submit changes back to the project. Everybody is encouraged to contribute while Proxmox ensures the product always meets professional quality criteria.

Open source software also helps to keep your costs low and makes your core infrastructure independent from a single vendor.

1.9. Your benefits with Proxmox VE

  • Open source software

  • No vendor lock-in

  • Linux kernel

  • Fast installation and easy to use

  • Web-based management interface

  • REST API

  • Huge active community

  • Low administration costs and simple deployment

1.10. Getting Help

1.10.1. Proxmox VE Wiki

The primary source of information is the Proxmox VE Wiki. It combines the reference documentation with user contributed content.

1.10.2. Community Support Forum

Proxmox VE itself is fully open source, so we always encourage our users to discuss and share their knowledge using the Proxmox VE Community Forum. The forum is moderated by the Proxmox support team, and has a large user base from all around the world. Needless to say, such a large forum is a great place to get information.

1.10.3. Mailing Lists

This is a fast way to communicate with the Proxmox VE community via email.

Proxmox VE is fully open source and contributions are welcome! The primary communication channel for developers is the Proxmox VE development mailing list.

1.10.4. Commercial Support

Proxmox Server Solutions GmbH also offers enterprise support available as Proxmox VE Subscription Service Plans. All users with a subscription get access to the Proxmox VE Enterprise Repository, and—with a Basic, Standard or Premium subscription—also to the Proxmox Customer Portal. The customer portal provides help and support with guaranteed response times from the Proxmox VE developers.

For volume discounts, or more information in general, please contact sales@proxmox.com.

1.10.5. Bug Tracker

Proxmox runs a public bug tracker at https://bugzilla.proxmox.com. If an issue appears, file your report there. An issue can be a bug as well as a request for a new feature or enhancement. The bug tracker helps to keep track of the issue and will send a notification once it has been solved.

1.11. Project History

The project started in 2007, followed by a first stable version in 2008. At the time we used OpenVZ for containers, and KVM for virtual machines. The clustering features were limited, and the user interface was simple (a server-generated web page).

But we quickly developed new features using the Corosync cluster stack, and the introduction of the new Proxmox cluster file system (pmxcfs) was a big step forward, because it completely hides the cluster complexity from the user. Managing a cluster of 16 nodes is as simple as managing a single node.

We also introduced a new REST API, with a complete declarative specification written in JSON-Schema. This enabled other people to integrate Proxmox VE into their infrastructure, and made it easy to provide additional services.

Also, the new REST API made it possible to replace the original user interface with a modern HTML5 application using JavaScript. We also replaced the old Java-based VNC console code with noVNC, so you only need a web browser to manage your VMs.

The support for various storage types is another big task. Notably, Proxmox VE was the first distribution to ship ZFS on Linux by default in 2014. Another milestone was the ability to run and manage Ceph storage on the hypervisor nodes. Such setups are extremely cost effective.

When we started we were among the first companies providing commercial support for KVM. The KVM project itself continuously evolved, and is now a widely used hypervisor. New features arrive with each release. We developed the KVM live backup feature, which makes it possible to create snapshot backups on any storage type.

The most notable change with version 4.0 was the move from OpenVZ to LXC. Containers are now deeply integrated, and they can use the same storage and network features as virtual machines.

1.12. Improving the Proxmox VE Documentation

Contributions and improvements to the Proxmox VE documentation are always welcome. There are several ways to contribute.

If you find errors or other room for improvement in this documentation, please file a bug at the Proxmox bug tracker to propose a correction.

If you want to propose new content, choose one of the following options:

  • The wiki: For specific setups, how-to guides, or tutorials the wiki is the right option to contribute.

  • The reference documentation: For general content that will be helpful to all users please propose your contribution for the reference documentation. This includes all information about how to install, configure, use, and troubleshoot Proxmox VE features. The reference documentation is written in the asciidoc format. To edit the documentation you need to clone the git repository at git://git.proxmox.com/git/pve-docs.git; then follow the README.adoc document.

Note If you are interested in working on the Proxmox VE codebase, the Developer Documentation wiki article will show you where to start.

1.13. Translating Proxmox VE

The Proxmox VE user interface is in English by default. However, thanks to the contributions of the community, translations to other languages are also available. We welcome any support in adding new languages, translating the latest features, and improving incomplete or inconsistent translations.

We use gettext for the management of the translation files. Tools like Poedit offer a nice user interface to edit the translation files, but you can use whatever editor you’re comfortable with. No programming knowledge is required for translating.

1.13.1. Translating with git

The language files are available as a git repository. If you are familiar with git, please contribute according to our Developer Documentation.

You can create a new translation by doing the following (replace <LANG> with the language ID):

# git clone git://git.proxmox.com/git/proxmox-i18n.git
# cd proxmox-i18n
# make init-<LANG>.po

Or you can edit an existing translation, using the editor of your choice:

# poedit <LANG>.po

1.13.2. Translating without git

Even if you are not familiar with git, you can help translate Proxmox VE. To start, you can download the language files here. Find the language you want to improve, then right click on the "raw" link of this language file and select Save Link As…. Make your changes to the file, and then send your final translation directly to office(at)proxmox.com, together with a signed contributor license agreement.

1.13.3. Testing the Translation

In order for the translation to be used in Proxmox VE, you must first translate the .po file into a .js file. You can do this by invoking the following script, which is located in the same repository:

# ./po2js.pl -t pve xx.po >pve-lang-xx.js

The resulting file pve-lang-xx.js can then be copied to the directory /usr/share/pve-i18n on your Proxmox VE server in order to test it.

Alternatively, you can build a deb package by running the following command from the root of the repository:

# make deb

Important For either of these methods to work, you need to have the following Perl packages installed on your system. For Debian/Ubuntu:

# apt-get install perl liblocale-po-perl libjson-perl

1.13.4. Sending the Translation

You can send the finished translation (.po file) to the Proxmox team at the address office(at)proxmox.com, along with a signed contributor license agreement. Alternatively, if you have some developer experience, you can send it as a patch to the Proxmox VE development mailing list. See Developer Documentation.

2. Installing Proxmox VE

Proxmox VE is based on Debian. This is why the install disk images (ISO files) provided by Proxmox include a complete Debian system as well as all necessary Proxmox VE packages.

Tip See the support table in the FAQ for the relationship between Proxmox VE releases and Debian releases.

The installer will guide you through the setup, allowing you to partition the local disk(s), apply basic system configurations (for example, timezone, language, network) and install all required packages. This process should not take more than a few minutes. Installing with the provided ISO is the recommended method for new and existing users.

Alternatively, Proxmox VE can be installed on top of an existing Debian system. This option is only recommended for advanced users because detailed knowledge about Proxmox VE is required.

2.1. System Requirements

We recommend using high-quality server hardware when running Proxmox VE in production. To further decrease the impact of a failed host, you can run Proxmox VE in a cluster with highly available (HA) virtual machines and containers.

Proxmox VE can use local storage (DAS), SAN, NAS, and distributed storage like Ceph RBD. For details, see the storage chapter.

2.1.1. Minimum Requirements, for Evaluation

These minimum requirements are for evaluation purposes only and should not be used in production.

  • CPU: 64-bit (Intel EM64T or AMD64)

  • Intel VT/AMD-V capable CPU/motherboard for KVM full virtualization support

  • RAM: 1 GB RAM, plus additional RAM needed for guests

  • Hard drive

  • One network card (NIC)

2.1.2. Recommended System Requirements

  • Intel EM64T or AMD64 with Intel VT/AMD-V CPU flag.

  • Memory: Minimum 2 GB for the OS and Proxmox VE services, plus designated memory for guests. For Ceph and ZFS, additional memory is required: approximately 1 GB of memory for every TB of used storage.

  • Fast and redundant storage, best results are achieved with SSDs.

  • OS storage: Use a hardware RAID with battery protected write cache (“BBU”) or non-RAID with ZFS (optional SSD for ZIL).

  • VM storage:

    • For local storage, use either a hardware RAID with battery backed write cache (BBU) or non-RAID for ZFS and Ceph. Neither ZFS nor Ceph is compatible with a hardware RAID controller.

    • Shared and distributed storage is possible.

    • SSDs with Power-Loss-Protection (PLP) are recommended for good performance. Using consumer SSDs is discouraged.

  • Redundant (Multi-)Gbit NICs, with additional NICs depending on the preferred storage technology and cluster setup.

  • For PCI(e) passthrough the CPU needs to support VT-d/AMD-Vi.

2.1.3. Simple Performance Overview

To get an overview of the CPU and hard disk performance on an installed Proxmox VE system, run the included pveperf tool.

Note This is just a very quick and general benchmark. More detailed tests are recommended, especially regarding the I/O performance of your system.

2.1.4. Supported Web Browsers for Accessing the Web Interface

To access the web-based user interface, we recommend using one of the following browsers:

  • Firefox, a release from the current year, or the latest Extended Support Release

  • Chrome, a release from the current year

  • Microsoft’s currently supported version of Edge

  • Safari, a release from the current year

When accessed from a mobile device, Proxmox VE will show a lightweight, touch-based interface.

2.2. Prepare Installation Media

The Proxmox VE installation media is a hybrid ISO image. It works in two ways:

  • An ISO image file ready to burn to a CD or DVD.

  • A raw sector (IMG) image file ready to copy to a USB flash drive (USB stick).

Using a USB flash drive to install Proxmox VE is the recommended way because it is the faster option.

2.2.1. Prepare a USB Flash Drive as Installation Medium

The flash drive needs to have at least 1 GB of storage available.

Note Do not use UNetbootin. It does not work with the Proxmox VE installation image.
Important Make sure that the USB flash drive is not mounted and does not contain any important data.

2.2.2. Instructions for GNU/Linux

On Unix-like operating systems, use the dd command to copy the ISO image to the USB flash drive. First find the correct device name of the USB flash drive (see below). Then run the dd command.

# dd bs=1M conv=fdatasync if=./proxmox-ve_*.iso of=/dev/XYZ

Note Be sure to replace /dev/XYZ with the correct device name and adapt the input filename (if) path.

Caution Be very careful, and do not overwrite the wrong disk!

Find the Correct USB Device Name

There are two ways to find out the name of the USB flash drive. The first one is to compare the last lines of the dmesg command output before and after plugging in the flash drive. The second way is to compare the output of the lsblk command. Open a terminal and run:

# lsblk

Then plug in your USB flash drive and run the command again:

# lsblk

A new device will appear. This is the one you want to use. To be on the extra safe side check if the reported size matches your USB flash drive.
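The comparison of the two lsblk runs can also be done mechanically. The following is a minimal sketch, assuming the device names of both runs have been saved to files; the helper name new_devices and the file paths are made up for this example.

```shell
#!/bin/sh
# new_devices BEFORE AFTER
# Hypothetical helper: prints lines that appear in file AFTER but not in BEFORE.
new_devices() {
    # -F: fixed strings, -x: match whole lines, -v: invert, -f: patterns from file
    grep -Fxv -f "$1" "$2"
}

# Intended use (file names are arbitrary):
#   lsblk -dn -o NAME > /tmp/before.txt     # before plugging in the drive
#   lsblk -dn -o NAME > /tmp/after.txt      # after plugging in the drive
#   new_devices /tmp/before.txt /tmp/after.txt
```

Only the NAME column is compared here so that changes in column alignment between the two runs cannot cause false differences. The size check mentioned above still has to be done separately.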

2.2.3. Instructions for macOS

Open the terminal (search for Terminal in Spotlight).

Convert the .iso file to .dmg format using the convert option of hdiutil, for example:

# hdiutil convert proxmox-ve_*.iso -format UDRW -o proxmox-ve_*.dmg

Tip macOS tends to automatically add .dmg to the output file name.

To get the current list of devices run the command:

# diskutil list

Now insert the USB flash drive and run this command again to determine which device node has been assigned to it (for example, /dev/diskX).

# diskutil list
# diskutil unmountDisk /dev/diskX

Note Replace X with the disk number from the last command.

# sudo dd if=proxmox-ve_*.dmg bs=1M of=/dev/rdiskX

Note Using rdiskX instead of diskX in the last command is intentional. It increases the write speed.

2.2.4. Instructions for Windows

Using Etcher

Etcher works out of the box. Download Etcher from https://etcher.io. It will guide you through the process of selecting the ISO and your USB flash drive.

Using Rufus

Rufus is a more lightweight alternative, but you need to use the DD mode to make it work. Download Rufus from https://rufus.ie/. Either install it or use the portable version. Select the destination drive and the Proxmox VE ISO file.

Important Once you Start you have to click No on the dialog asking to download a different version of GRUB. In the next dialog select the DD mode.

2.3. Using the Proxmox VE Installer

The installer ISO image includes the following:

  • Complete operating system (Debian Linux, 64-bit)

  • The Proxmox VE installer, which partitions the local disk(s) with ext4, XFS, BTRFS (technology preview), or ZFS and installs the operating system

  • Proxmox VE Linux kernel with KVM and LXC support

  • Complete toolset for administering virtual machines, containers, the host system, clusters and all necessary resources

  • Web-based management interface

Note All existing data on the selected drives will be removed during the installation process. The installer does not add boot menu entries for other operating systems.

Please insert the prepared installation media (for example, USB flash drive or CD-ROM) and boot from it.

Tip Make sure that booting from the installation medium (for example, USB) is enabled in your server’s firmware settings. Secure boot needs to be disabled when booting an installer prior to Proxmox VE version 8.1.

After choosing the correct entry (for example, Boot from USB) the Proxmox VE menu will be displayed, and one of the following options can be selected:

Install Proxmox VE (Graphical)

Starts the normal installation.

Tip It’s possible to use the installation wizard with a keyboard only. Buttons can be clicked by pressing the ALT key combined with the underlined character from the respective button. For example, ALT + N to press a Next button.

Install Proxmox VE (Terminal UI)

Starts the terminal-mode installation wizard. It provides the same overall installation experience as the graphical installer, but has generally better compatibility with very old and very new hardware.

Install Proxmox VE (Terminal UI, Serial Console)

Starts the terminal-mode installation wizard, additionally setting up the Linux kernel to use the (first) serial port of the machine for in- and output. This can be used if the machine is completely headless and only has a serial console available.

Both modes use the same code base for the actual installation process to benefit from more than a decade of bug fixes and ensure feature parity.

Tip The Terminal UI option can be used if the graphical installer does not work correctly, for example due to driver issues. See also adding the nomodeset kernel parameter.

Advanced Options: Install Proxmox VE (Graphical, Debug Mode)

Starts the installation in debug mode. A console will be opened at several installation steps. This helps to debug the situation if something goes wrong. To exit a debug console, press CTRL-D. This option can be used to boot a live system with all basic tools available. You can use it, for example, to repair a degraded ZFS rpool or fix the bootloader for an existing Proxmox VE setup.

Advanced Options: Install Proxmox VE (Terminal UI, Debug Mode)

Same as the graphical debug mode, but preparing the system to run the terminal-based installer instead.

Advanced Options: Install Proxmox VE (Serial Console Debug Mode)

Same as the terminal-based debug mode, but additionally sets up the Linux kernel to use the (first) serial port of the machine for in- and output.

Advanced Options: Rescue Boot

With this option you can boot an existing installation. It searches all attached hard disks. If it finds an existing installation, it boots directly into that disk using the Linux kernel from the ISO. This can be useful if there are problems with the bootloader (GRUB/systemd-boot) or the BIOS/UEFI is unable to read the boot block from the disk.

Advanced Options: Test Memory (memtest86+)

Runs memtest86+. This is useful to check if the memory is functional and free of errors. Secure Boot must be turned off in the UEFI firmware setup utility to run this option.

You normally select Install Proxmox VE (Graphical) to start the installation.

The first step is to read our EULA (End User License Agreement). Following this, you can select the target hard disk(s) for the installation.

Caution By default, the whole server is used and all existing data is removed. Make sure there is no important data on the server before proceeding with the installation.

The Options button lets you select the target file system, which defaults to ext4. The installer uses LVM if you select ext4 or xfs as a file system, and offers additional options to restrict LVM space (see below).

Proxmox VE can also be installed on ZFS. As ZFS offers several software RAID levels, this is an option for systems that don’t have a hardware RAID controller. The target disks must be selected in the Options dialog. More ZFS specific settings can be changed under Advanced Options.

Warning ZFS on top of any hardware RAID is not supported and can result in data loss.

The next page asks for basic configuration options like your location, time zone, and keyboard layout. The location is used to select a nearby download server, in order to increase the speed of updates. The installer is usually able to auto-detect these settings, so you only need to change them in rare situations when auto-detection fails, or when you want to use a keyboard layout not commonly used in your country.

Next, the password of the superuser (root) and an email address need to be specified. The password must consist of at least 5 characters. It’s highly recommended to use a stronger password. Some guidelines are:

  • Use a password of at least 12 characters.

  • Include lowercase and uppercase alphabetic characters, numbers, and symbols.

  • Avoid character repetition, keyboard patterns, common dictionary words, letter or number sequences, usernames, relative or pet names, romantic links (current or past), and biographical information (for example ID numbers, ancestors' names or dates).
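The mechanical part of these guidelines can be checked with a few lines of shell. This is only a sketch under stated assumptions: the function name check_root_password and its three result words are invented for this example, and it does not detect repetition, keyboard patterns, dictionary words, or personal information.

```shell
#!/bin/sh
# check_root_password PASSWORD
# Hypothetical helper: prints "rejected", "weak", or "ok".
check_root_password() {
    pw=$1
    # The installer requires at least 5 characters.
    if [ "${#pw}" -lt 5 ]; then
        echo rejected
        return
    fi
    # Guidelines: 12+ characters with lowercase, uppercase, digits, and symbols.
    [ "${#pw}" -ge 12 ] || { echo weak; return; }
    case $pw in *[[:lower:]]*) ;; *) echo weak; return ;; esac
    case $pw in *[[:upper:]]*) ;; *) echo weak; return ;; esac
    case $pw in *[[:digit:]]*) ;; *) echo weak; return ;; esac
    case $pw in *[![:alnum:]]*) ;; *) echo weak; return ;; esac
    echo ok
}

check_root_password 'abc'              # shorter than 5 characters
check_root_password 'proxmox2024'      # long enough for the installer, but weak
check_root_password 'Vm-Host#7421xq'   # satisfies the checked guidelines
```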

The email address is used to send notifications to the system administrator. For example:

  • Information about available package updates.

  • Error messages from periodic cron jobs.

All those notification mails will be sent to the specified email address.

The last step is the network configuration. Network interfaces that are UP show a filled circle in front of their name in the drop down menu. Please note that during installation you can either specify an IPv4 or IPv6 address, but not both. To configure a dual stack node, add additional IP addresses after the installation.

The next step shows a summary of the previously selected options. Please re-check every setting and use the Previous button if a setting needs to be changed.

After clicking Install, the installer will begin to format the disks and copy packages to the target disk(s). Please wait until this step has finished; then remove the installation medium and restart your system.

Copying the packages usually takes several minutes, mostly depending on the speed of the installation medium and the target disk performance.

When copying and setting up the packages has finished, you can reboot the server. This will be done automatically after a few seconds by default.

Installation Failure

If the installation failed, check for specific errors on the second TTY (CTRL + ALT + F2) and ensure that the system meets the minimum requirements.

If the installation is still not working, look at the how to get help chapter.

2.3.1. Accessing the Management Interface Post-Installation

After a successful installation and reboot of the system, you can use the Proxmox VE web interface for further configuration.

  1. Point your browser to the IP address given during the installation and port 8006, for example: https://youripaddress:8006

  2. Log in using the root (realm PAM) username and the password chosen during installation.

  3. Upload your subscription key to gain access to the Enterprise repository. Otherwise, you will need to set up one of the public, less tested package repositories to get updates for security fixes, bug fixes, and new features.

  4. Check the IP configuration and hostname.

  5. Check the timezone.

  6. Check your Firewall settings.

2.3.2. Advanced LVM Configuration Options

The installer creates a Volume Group (VG) called pve, and additional Logical Volumes (LVs) called root, data, and swap, if ext4 or xfs is used. To control the size of these volumes use:

hdsize

Defines the total hard disk size to be used. This way you can reserve free space on the hard disk for further partitioning (for example for an additional PV and VG on the same hard disk that can be used for LVM storage).

swapsize

Defines the size of the swap volume. The default is the size of the installed memory, minimum 4 GB and maximum 8 GB. The resulting value cannot be greater than hdsize/8.

Note If set to 0, no swap volume will be created.

maxroot

Defines the maximum size of the root volume, which stores the operating system. The maximum limit of the root volume size is hdsize/4.

maxvz

Defines the maximum size of the data volume. The actual size of the data volume is:

datasize = hdsize - rootsize - swapsize - minfree

Where datasize cannot be bigger than maxvz.

Note In case of LVM thin, the data pool will only be created if datasize is bigger than 4GB.

Note If set to 0, no data volume will be created and the storage configuration will be adapted accordingly.

minfree

Defines the amount of free space that should be left in the LVM volume group pve. With more than 128GB storage available, the default is 16GB, otherwise hdsize/8 will be used.

Note LVM requires free space in the VG for snapshot creation (not required for lvmthin snapshots).
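The sizing rules above can be combined into a rough calculation. The following Python sketch is not the installer's actual algorithm. It assumes that the root volume takes the smaller of maxroot and hdsize/4, that the hdsize/8 cap also applies to an explicitly set swapsize, and that all values are given in GB. The function name lvm_layout and its parameters are hypothetical.

```python
def lvm_layout(hdsize, memory, swapsize=None, maxroot=None, maxvz=None, minfree=None):
    """Estimate LVM volume sizes in GB from the installer options described above."""
    # Default swap: size of the installed memory, clamped to 4 GB .. 8 GB.
    if swapsize is None:
        swapsize = min(max(memory, 4), 8)
    # Swap cannot exceed hdsize/8 (assumed to hold for explicit values, too).
    swap = min(swapsize, hdsize / 8)

    # Assumption: root uses its upper limit hdsize/4 unless maxroot is lower.
    root = hdsize / 4 if maxroot is None else min(maxroot, hdsize / 4)

    # Default minfree: 16 GB with more than 128 GB of storage, else hdsize/8.
    if minfree is None:
        minfree = 16 if hdsize > 128 else hdsize / 8

    # datasize = hdsize - rootsize - swapsize - minfree, capped by maxvz.
    data = hdsize - root - swap - minfree
    if maxvz is not None:
        data = min(data, maxvz)

    return {"root": root, "swap": swap, "data": max(data, 0), "minfree": minfree}
```

Under these assumptions, lvm_layout(hdsize=500, memory=32) yields a 125 GB root volume, 8 GB of swap, 16 GB of free space in the VG, and a 351 GB data volume.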

2.3.3. Advanced ZFS Configuration Options

The installer creates the ZFS pool rpool, if ZFS is used. No swap space is created but you can reserve some unpartitioned space on the install disks for swap. You can also create a swap zvol after the installation, although this can lead to problems (see ZFS swap notes).

ashift

Defines the ashift value for the created pool. The ashift needs to be set at least to the sector-size of the underlying disks (2 to the power of ashift is the sector-size), or any disk which might be put in the pool (for example the replacement of a defective disk).

compress

Defines whether compression is enabled for rpool.

checksum

Defines which checksumming algorithm should be used for rpool.

copies

Defines the copies parameter for rpool. Check the zfs(8) manpage for the semantics, and why this does not replace redundancy on disk-level.

ARC max size

Defines the maximum size the ARC can grow to and thus limits the amount of memory ZFS will use. See also the section on how to limit ZFS memory usage for more details.

hdsize

Defines the total hard disk size to be used. This is useful to save free space on the hard disk(s) for further partitioning (for example to create a swap-partition). hdsize is only honored for bootable disks, that is only the first disk or mirror for RAID0, RAID1 or RAID10, and all disks in RAID-Z[123].
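The ashift rule described above (2 to the power of ashift is the sector size) can be illustrated with a short Python sketch. It assumes that you already know the sector sizes, in bytes, of all disks that are in the pool or might be added later. The function name ashift_for is hypothetical and not part of any ZFS tooling.

```python
def ashift_for(sector_sizes):
    """Return the smallest ashift that covers every given sector size (bytes)."""
    for size in sector_sizes:
        # A power of two has exactly one bit set.
        if size <= 0 or size & (size - 1):
            raise ValueError(f"sector size {size} is not a power of two")
    # 2 ** ashift == sector size, so ashift is log2 of the largest sector size.
    return max(sector_sizes).bit_length() - 1
```

For example, a pool that mixes disks with 512 byte and 4096 byte sectors needs an ashift of at least 12, because 2 to the power of 12 is 4096.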

2.3.4. ZFS Performance Tips

ZFS works best with a lot of memory. If you intend to use ZFS make sure to have enough RAM available for it. A good calculation is 4GB plus 1GB RAM for each TB RAW disk space.
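As a small worked example of this rule of thumb, the following Python sketch adds up the raw capacity of all disks. The function name recommended_zfs_ram_gb is hypothetical, and the result is only a starting point for sizing, not a hard requirement.

```python
def recommended_zfs_ram_gb(disk_sizes_tb):
    """Apply the rule of thumb: 4 GB base plus 1 GB per TB of raw disk space."""
    return 4 + sum(disk_sizes_tb)
```

Six disks of 4 TB each add up to 24 TB of raw space, so the rule suggests 28 GB of RAM for ZFS.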

ZFS can use a dedicated drive as a log device for the ZFS Intent Log (ZIL), which speeds up synchronous writes. Use a fast drive (SSD) for it. It can be added after installation with the following command:

# zpool add <pool-name> log </dev/path_to_fast_ssd>

2.3.5. Adding the nomodeset Kernel Parameter

Problems may arise on very old or very new hardware due to graphics drivers. If the installation hangs during boot, you can try adding the nomodeset parameter. This prevents the Linux kernel from loading any graphics drivers and forces it to continue using the BIOS/UEFI-provided framebuffer.

On the Proxmox VE bootloader menu, navigate to Install Proxmox VE (Terminal UI) and press e to edit the entry. Using the arrow keys, navigate to the line starting with linux, move the cursor to the end of that line and add the parameter nomodeset, separated by a space from the pre-existing last parameter.

Then press Ctrl-X or F10 to boot the configuration.

2.4. Unattended Installation

It is possible to install Proxmox VE automatically in an unattended manner. This enables you to fully automate the setup process on bare-metal. Once the installation is complete and the host has booted up, automation tools like Ansible can be used to further configure the installation.

The necessary options for the installer must be provided in an answer file. This file allows the use of filter rules to determine which disks and network cards should be used.

To use the automated installation, it is first necessary to prepare an installation ISO. Visit our wiki for more details and information on the unattended installation.

2.5. Install Proxmox VE on Debian

Proxmox VE ships as a set of Debian packages and can be installed on top of a standard Debian installation. After configuring the repositories you need to run the following commands:

# apt-get update
# apt-get install proxmox-ve

Installing on top of an existing Debian installation looks easy, but it presumes that the base system has been installed correctly and that you know how you want to configure and use the local storage. You also need to configure the network manually.

In general, this is not trivial, especially when LVM or ZFS is used.

A detailed step by step how-to can be found on the wiki.

3. Host System Administration

The following sections will focus on common virtualization tasks and explain the Proxmox VE specifics regarding the administration and management of the host machine.

Proxmox VE is based on Debian GNU/Linux with additional repositories to provide the Proxmox VE related packages. This means that the full range of Debian packages is available including security updates and bug fixes. Proxmox VE provides its own Linux kernel based on the Ubuntu kernel. It has all the necessary virtualization and container features enabled and includes ZFS and several extra hardware drivers.

For other topics not included in the following sections, please refer to the Debian documentation. The Debian Administrator's Handbook is available online, and provides a comprehensive introduction to the Debian operating system (see [Hertzog13]).

3.1. Package Repositories

Proxmox VE uses APT as its package management tool like any other Debian-based system.

Proxmox VE automatically checks for package updates on a daily basis. The root@pam user is notified via email about available updates. From the GUI, the Changelog button can be used to see more details about a selected update.

3.1.1. Repositories in Proxmox VE

Repositories are collections of software packages. They can be used to install new software and are also important for getting updates.

Note You need valid Debian and Proxmox repositories to get the latest security updates, bug fixes and new features.

APT Repositories are defined in the file /etc/apt/sources.list and in .list files placed in /etc/apt/sources.list.d/.

Repository Management
screenshot/gui-node-repositories.png

Since Proxmox VE 7, you can check the repository state in the web interface. The node summary panel shows a high-level status overview, while the separate Repository panel shows an in-depth status and a list of all configured repositories.

Basic repository management, for example, activating or deactivating a repository, is also supported.

Sources.list

In a sources.list file, each line defines a package repository. The preferred source must come first. Empty lines are ignored. A # character anywhere on a line marks the remainder of that line as a comment. The available packages from a repository are acquired by running apt-get update. Updates can be installed directly using apt-get, or via the GUI (Node → Updates).

File /etc/apt/sources.list
deb http://deb.debian.org/debian bookworm main contrib
deb http://deb.debian.org/debian bookworm-updates main contrib

# security updates
deb http://security.debian.org/debian-security bookworm-security main contrib
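The format rules above (one repository per line, empty lines ignored, everything after a # treated as a comment) can be sketched as a small parser. This Python example is only an illustration and not how APT itself reads the file. It handles plain one-line entries and skips entries that carry bracketed options. The names Repo and parse_sources_list are hypothetical.

```python
from typing import NamedTuple


class Repo(NamedTuple):
    kind: str          # "deb" or "deb-src"
    uri: str
    suite: str
    components: tuple


def parse_sources_list(text):
    """Parse plain one-line sources.list entries, in file order."""
    repos = []
    for line in text.splitlines():
        # A '#' anywhere on the line starts a comment.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue  # empty lines and pure comment lines are ignored
        fields = line.split()
        if len(fields) < 3 or fields[0] not in ("deb", "deb-src"):
            continue  # not a repository entry
        if fields[1].startswith("["):
            continue  # entries with [options] are out of scope for this sketch
        kind, uri, suite, *components = fields
        repos.append(Repo(kind, uri, suite, tuple(components)))
    return repos
```

Applied to the example file above, the parser returns three entries, the last one with the suite bookworm-security and the components main and contrib.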

Proxmox VE provides three different package repositories.

3.1.2. Proxmox VE Enterprise Repository

This is the recommended repository and available for all Proxmox VE subscription users. It contains the most stable packages and is suitable for production use. The pve-enterprise repository is enabled by default:

File /etc/apt/sources.list.d/pve-enterprise.list
deb https://enterprise.proxmox.com/debian/pve bookworm pve-enterprise

Please note that you need a valid subscription key to access the pve-enterprise repository. We offer different support levels, which you can find further details about at https://proxmox.com/en/proxmox-virtual-environment/pricing.

Note You can disable this repository by commenting out the above line using a # (at the start of the line). This prevents error messages if your host does not have a subscription key. Please configure the pve-no-subscription repository in that case.
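Commenting out the entry can also be scripted. The following Python sketch works on the text of a .list file and returns the modified text. It does not touch any file on disk. The function name disable_repo and its component parameter are hypothetical, and the repository panel in the web interface remains the simpler way to deactivate a repository.

```python
def disable_repo(text, component="pve-enterprise"):
    """Comment out every active entry that lists the given component."""
    result = []
    for line in text.splitlines():
        stripped = line.lstrip()
        active = bool(stripped) and not stripped.startswith("#")
        # Only look at the part before a trailing comment.
        fields = stripped.split("#", 1)[0].split()
        if active and component in fields:
            line = "# " + line
        result.append(line)
    return "\n".join(result) + "\n"
```

Running the function a second time on its own output changes nothing, because lines that already start with # are left alone.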

3.1.3. Proxmox VE No-Subscription Repository

As the name suggests, you do not need a subscription key to access this repository. It can be used for testing and non-production use. It’s not recommended to use this on production servers, as these packages are not always as heavily tested and validated.

We recommend configuring this repository in /etc/apt/sources.list.

File /etc/apt/sources.list
deb http://ftp.debian.org/debian bookworm main contrib
deb http://ftp.debian.org/debian bookworm-updates main contrib

# Proxmox VE pve-no-subscription repository provided by proxmox.com,
# NOT recommended for production use
deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription

# security updates
deb http://security.debian.org/debian-security bookworm-security main contrib

3.1.4. Proxmox VE Test Repository

This repository contains the latest packages and is primarily used by developers to test new features. To configure it, add the following line to /etc/apt/sources.list:

sources.list entry for pvetest
deb http://download.proxmox.com/debian/pve bookworm pvetest

Warning The pvetest repository should (as the name implies) only be used for testing new features or bug fixes.

3.1.5. Ceph Reef Enterprise Repository

This repository holds the enterprise Proxmox VE Ceph 18.2 Reef packages. They are suitable for production. Use this repository if you run the Ceph client or a full Ceph cluster on Proxmox VE.

File /etc/apt/sources.list.d/ceph.list
deb https://enterprise.proxmox.com/debian/ceph-reef bookworm enterprise

3.1.6. Ceph Reef No-Subscription Repository

This Ceph repository contains the Ceph 18.2 Reef packages before they are moved to the enterprise repository and after they were on the test repository.

Note It’s recommended to use the enterprise repository for production machines.

File /etc/apt/sources.list.d/ceph.list
deb http://download.proxmox.com/debian/ceph-reef bookworm no-subscription

3.1.7. Ceph Reef Test Repository

This Ceph repository contains the Ceph 18.2 Reef packages before they are moved to the main repository. It is used to test new Ceph releases on Proxmox VE.

File /etc/apt/sources.list.d/ceph.list
deb http://download.proxmox.com/debian/ceph-reef bookworm test

3.1.8. Ceph Quincy Enterprise Repository

This repository holds the enterprise Proxmox VE Ceph Quincy packages. They are suitable for production. Use this repository if you run the Ceph client or a full Ceph cluster on Proxmox VE.

File /etc/apt/sources.list.d/ceph.list
deb https://enterprise.proxmox.com/debian/ceph-quincy bookworm enterprise

3.1.9. Ceph Quincy No-Subscription Repository

This Ceph repository contains the Ceph Quincy packages before they are moved to the enterprise repository and after they were on the test repository.

Note It’s recommended to use the enterprise repository for production machines.

File /etc/apt/sources.list.d/ceph.list
deb http://download.proxmox.com/debian/ceph-quincy bookworm no-subscription

3.1.10. Ceph Quincy Test Repository

This Ceph repository contains the Ceph Quincy packages before they are moved to the main repository. It is used to test new Ceph releases on Proxmox VE.

File /etc/apt/sources.list.d/ceph.list
deb http://download.proxmox.com/debian/ceph-quincy bookworm test

3.1.11. Older Ceph Repositories

Proxmox VE 8 doesn’t support Ceph Pacific, Ceph Octopus, or even older releases for hyper-converged setups. For those releases, you need to first upgrade Ceph to a newer release before upgrading to Proxmox VE 8.

See the respective upgrade guide for details.

3.1.12. Debian Firmware Repository

Starting with Debian Bookworm (Proxmox VE 8) non-free firmware (as defined by DFSG) has been moved to the newly created Debian repository component non-free-firmware.

Enable this repository if you want to set up Early OS Microcode Updates or need additional Runtime Firmware Files not already included in the pre-installed package pve-firmware.

To be able to install packages from this component, run editor /etc/apt/sources.list, append non-free-firmware to the end of each .debian.org repository line and run apt update.
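The manual edit described above can also be expressed as a text transformation. The following Python sketch appends non-free-firmware to every active repository line that points to a .debian.org host and leaves all other lines as they are. It only operates on a string, so reading and writing /etc/apt/sources.list is left to you. The function name enable_non_free_firmware is hypothetical.

```python
def enable_non_free_firmware(text):
    """Append the non-free-firmware component to .debian.org repository lines."""
    result = []
    for line in text.splitlines():
        # Split off a trailing comment so it can be kept at the end of the line.
        entry, sep, comment = line.partition("#")
        fields = entry.split()
        is_debian_repo = (
            len(fields) >= 3
            and fields[0] in ("deb", "deb-src")
            and ".debian.org" in entry
        )
        if is_debian_repo and "non-free-firmware" not in fields:
            suffix = f" {sep}{comment}" if sep else ""
            line = entry.rstrip() + " non-free-firmware" + suffix
        result.append(line)
    return "\n".join(result) + "\n"
```

Lines for download.proxmox.com or enterprise.proxmox.com are not modified, and applying the function twice does not add the component a second time. Remember to run apt update after changing the file.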

3.1.13. SecureApt

The Release files in the repositories are signed with GnuPG. APT uses these signatures to verify that all packages are from a trusted source.

If you install Proxmox VE from an official ISO image, the key for verification is already installed.

If you install Proxmox VE on top of Debian, download and install the key with the following commands:

# wget https://enterprise.proxmox.com/debian/proxmox-release-bookworm.gpg -O /etc/apt/trusted.gpg.d/proxmox-release-bookworm.gpg

Verify the checksum afterwards with the sha512sum CLI tool:

# sha512sum /etc/apt/trusted.gpg.d/proxmox-release-bookworm.gpg
7da6fe34168adc6e479327ba517796d4702fa2f8b4f0a9833f5ea6e6b48f6507a6da403a274fe201595edc86a84463d50383d07f64bdde2e3658108db7d6dc87 /etc/apt/trusted.gpg.d/proxmox-release-bookworm.gpg

or the md5sum CLI tool:

# md5sum /etc/apt/trusted.gpg.d/proxmox-release-bookworm.gpg
41558dc019ef90bd0f6067644a51cf5b /etc/apt/trusted.gpg.d/proxmox-release-bookworm.gpg
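If you want to automate the comparison instead of checking the output by eye, a sketch like the following may help. It uses Python's standard hashlib module to compute the SHA-512 digest of a file and compares it with an expected value that you pass in, for example the sha512sum value shown above. The function name sha512_matches is hypothetical.

```python
import hashlib


def sha512_matches(path, expected_hex):
    """Return True if the SHA-512 digest of the file equals expected_hex."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not need to fit into memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```

The sketch checks SHA-512 only, since it is the stronger of the two checksums listed here.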

3.2. System Software Updates

Proxmox provides updates on a regular basis for all repositories. To install updates use the web-based GUI or the following CLI commands:

# apt-get update
# apt-get dist-upgrade

Note The APT package management system is very flexible and provides many features, see man apt-get, or [Hertzog13] for additional information.

Tip Regular updates are essential to get the latest patches and security related fixes. Major system upgrades are announced in the Proxmox VE Community Forum.

3.3. Firmware Updates

Firmware updates from this chapter should be applied when running Proxmox VE on a bare-metal server. Whether configuring firmware updates is appropriate within guests, e.g. when using device pass-through, depends strongly on your setup and is therefore out of scope.

In addition to regular software updates, firmware updates are also important for reliable and secure operation.

To obtain firmware updates as early as possible, or at all, it is recommended to combine the available options.

The term firmware is usually divided linguistically into microcode (for CPUs) and firmware (for other devices).

3.3.1. Persistent Firmware

This section is suitable for all devices. Updated microcode, which is usually included in a BIOS/UEFI update, is stored on the motherboard, whereas other firmware is stored on the respective device. This persistent method is especially important for the CPU, as it enables the earliest possible regular loading of the updated microcode at boot time.

Caution With some updates, such as for BIOS/UEFI or storage controller, the device configuration could be reset. Please follow the vendor’s instructions carefully and back up the current configuration.

Please check with your vendor which update methods are available.

  • Convenient update methods for servers can include Dell’s Lifecycle Manager or Service Packs from HPE.

  • Sometimes there are Linux utilities available as well. Examples are mlxup for NVIDIA ConnectX or bnxtnvm/niccli for Broadcom network cards.

  • LVFS is also an option if the hardware vendor cooperates with it and supported hardware is in use. The technical requirement for this is that the system was manufactured after 2014 and is booted via UEFI.

Proxmox VE ships its own version of the fwupd package to enable Secure Boot Support with the Proxmox signing key. This package consciously dropped the dependency recommendation for the udisks2 package, due to observed issues with its use on hypervisors. That means you must explicitly configure the correct mount point of the EFI partition in /etc/fwupd/daemon.conf, for example:

File /etc/fwupd/daemon.conf
# Override the location used for the EFI system partition (ESP) path.
EspLocation=/boot/efi

Tip If the update instructions require a host reboot, make sure that it can be done safely. See also Node Maintenance.

3.3.2. Runtime Firmware Files

This method stores firmware on the Proxmox VE operating system and will pass it to a device if its persisted firmware is less recent. It is supported by devices such as network and graphics cards, but not by those that rely on persisted firmware such as the motherboard and hard disks.

In Proxmox VE the package pve-firmware is already installed by default. Therefore, with the normal system updates (APT), included firmware of common hardware is automatically kept up to date.

An additional Debian Firmware Repository exists, but is not configured by default.

If you try to install an additional firmware package but it conflicts, APT will abort the installation. Perhaps the particular firmware can be obtained in another way.

3.3.3. CPU Microcode Updates

Microcode updates are intended to fix discovered security vulnerabilities and other serious CPU bugs. While the CPU performance can be affected, a patched microcode is usually still more performant than an unpatched microcode where the kernel itself has to do mitigations. Depending on the CPU type, it is possible that performance results of the flawed factory state can no longer be achieved without knowingly running the CPU in an unsafe state.

To get an overview of present CPU vulnerabilities and their mitigations, run lscpu. Currently known real-world vulnerabilities only show up if the Proxmox VE host is up to date, its version is not end of life, and it has been rebooted at least once since the last kernel update.

Besides the recommended microcode update via persistent BIOS/UEFI updates, there is also an independent method via Early OS Microcode Updates. It is convenient to use and also quite helpful when the motherboard vendor no longer provides BIOS/UEFI updates. Regardless of the method in use, a reboot is always needed to apply a microcode update.

Set up Early OS Microcode Updates

To set up microcode updates that are applied early on boot by the Linux kernel, you need to:

  1. Enable the Debian Firmware Repository

  2. Get the latest available packages with apt update (or use the web interface, under Node → Updates)

  3. Install the CPU-vendor specific microcode package:

    • For Intel CPUs: apt install intel-microcode

    • For AMD CPUs: apt install amd64-microcode

  4. Reboot the Proxmox VE host

Any future microcode update will also require a reboot to be loaded.
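When setting up many hosts, the choice of the vendor-specific package in step 3 can be derived from the vendor_id field in /proc/cpuinfo. The following Python sketch assumes the usual x86 vendor strings GenuineIntel and AuthenticAMD and takes the file content as a string. The names MICROCODE_PACKAGES and microcode_package are hypothetical.

```python
MICROCODE_PACKAGES = {
    "GenuineIntel": "intel-microcode",
    "AuthenticAMD": "amd64-microcode",
}


def microcode_package(cpuinfo_text):
    """Return the microcode package for the first vendor_id found, else None."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "vendor_id":
            return MICROCODE_PACKAGES.get(value.strip())
    return None
```

On a host, you could pass the result of open("/proc/cpuinfo").read() to the function. A return value of None means that the sketch does not know the vendor, in which case the package has to be chosen manually.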

Microcode Version

To get the current running microcode revision for comparison or debugging purposes:

# grep microcode /proc/cpuinfo | uniq
microcode       : 0xf0

A microcode package has updates for many different CPUs. But updates specifically for your CPU might not come often. So, just looking at the date on the package won’t tell you when the company actually released an update for your specific CPU.

If you’ve installed a new microcode package and rebooted your Proxmox VE host, and this new microcode is newer than both the version baked into the CPU and the one from the motherboard’s firmware, you’ll see a message in the system log saying "microcode updated early".

# dmesg | grep microcode
[    0.000000] microcode: microcode updated early to revision 0xf0, date = 2021-11-12
[    0.896580] microcode: Microcode Update Driver: v2.2.

Troubleshooting

For debugging purposes, the Early OS Microcode Update that is applied at every system boot can be temporarily disabled as follows:

  1. make sure that the host can be rebooted safely

  2. reboot the host to get to the GRUB menu (hold SHIFT if it is hidden)

  3. at the desired Proxmox VE boot entry press E

  4. go to the line which starts with linux and append dis_ucode_ldr, separated by a space

  5. press CTRL-X to boot this time without an Early OS Microcode Update

If a problem related to a recent microcode update is suspected, a package downgrade should be considered instead of package removal (apt purge <intel-microcode|amd64-microcode>). Otherwise, a persisted microcode that is too old might be loaded, even though a more recent one would run without problems.

A downgrade is possible if an earlier microcode package version is available in the Debian repository, as shown in this example:

# apt list -a intel-microcode
Listing... Done
intel-microcode/stable-security,now 3.20230808.1~deb12u1 amd64 [installed]
intel-microcode/stable 3.20230512.1 amd64
# apt install intel-microcode=3.202305*
...
Selected version '3.20230512.1' (Debian:12.1/stable [amd64]) for 'intel-microcode'
...
dpkg: warning: downgrading intel-microcode from 3.20230808.1~deb12u1 to 3.20230512.1
...
intel-microcode: microcode will be updated at next boot
...

Make sure (again) that the host can be rebooted safely. To apply an older microcode potentially included in the microcode package for your CPU type, reboot now.

Tip

It makes sense to hold the downgraded package for a while and try more recent versions again at a later time. Even if the package version is the same in the future, system updates may have fixed the experienced problem in the meantime.

# apt-mark hold intel-microcode
intel-microcode set on hold.
# apt-mark unhold intel-microcode
# apt update
# apt upgrade

3.4. Network Configuration

Proxmox VE uses the Linux network stack. This provides a lot of flexibility on how to set up the network on the Proxmox VE nodes. The configuration can be done either via the GUI, or by manually editing the file /etc/network/interfaces, which contains the whole network configuration. The interfaces(5) manual page contains the complete format description. All Proxmox VE tools try hard to preserve direct user modifications, but using the GUI is still preferable, because it protects you from errors.

A Linux bridge interface (commonly called vmbrX) is needed to connect guests to the underlying physical network. It can be thought of as a virtual switch which the guests and physical interfaces are connected to. This section provides some examples on how the network can be set up to accommodate different use cases like redundancy with a bond, VLANs, or routed and NAT setups.

The Software Defined Network is an option for more complex virtual networks in Proxmox VE clusters.

Warning If unsure, avoid the traditional Debian tools ifup and ifdown, as they have some pitfalls, such as interrupting all guest traffic on ifdown vmbrX but not reconnecting those guests when running ifup on the same bridge later.

3.4.1. Apply Network Changes

Proxmox VE does not write changes directly to /etc/network/interfaces. Instead, we write into a temporary file called /etc/network/interfaces.new. This way, you can make many related changes at once. It also allows you to ensure your changes are correct before applying them, as a wrong network configuration may render a node inaccessible.

Live-Reload Network with ifupdown2

With the recommended ifupdown2 package (default for new installations since Proxmox VE 7.0), it is possible to apply network configuration changes without a reboot. If you change the network configuration via the GUI, you can click the Apply Configuration button. This will move changes from the staging interfaces.new file to /etc/network/interfaces and apply them live.

If you made manual changes directly to the /etc/network/interfaces file, you can apply them by running ifreload -a.

Note If you installed Proxmox VE on top of Debian, or upgraded to Proxmox VE 7.0 from an older Proxmox VE installation, make sure ifupdown2 is installed: apt install ifupdown2

Reboot Node to Apply

Another way to apply a new network configuration is to reboot the node. In that case the systemd service pvenetcommit will activate the staging interfaces.new file before the networking service will apply that configuration.

3.4.2. Naming Conventions

We currently use the following naming conventions for device names:

  • Ethernet devices: en*, systemd network interface names. This naming scheme is used for new Proxmox VE installations since version 5.0.

  • Ethernet devices: eth[N], where 0 ≤ N (eth0, eth1, …) This naming scheme is used for Proxmox VE hosts which were installed before the 5.0 release. When upgrading to 5.0, the names are kept as-is.

  • Bridge names: Commonly vmbr[N], where 0 ≤ N ≤ 4094 (vmbr0 - vmbr4094), but you can use any alphanumeric string that starts with a character and is at most 10 characters long.

  • Bonds: bond[N], where 0 ≤ N (bond0, bond1, …)

  • VLANs: Simply add the VLAN number to the device name, separated by a period (eno1.50, bond1.30)

This makes it easier to debug network problems, because the device name implies the device type.
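To illustrate how the device type follows from the name, the following Python sketch classifies a name according to the conventions listed above. It is a simplification: custom bridge names, which the conventions also allow, are reported as unknown, and the function name classify_interface is hypothetical.

```python
import re


def classify_interface(name):
    """Return (device_type, vlan_id) for a name following the conventions above."""
    # A VLAN is written as <device>.<number>, for example eno1.50.
    base, dot, vlan = name.partition(".")
    if dot and not vlan.isdecimal():
        return ("unknown", None)
    vlan_id = int(vlan) if dot else None

    bridge = re.fullmatch(r"vmbr(\d+)", base)
    if bridge and int(bridge.group(1)) <= 4094:
        kind = "bridge"
    elif re.fullmatch(r"bond\d+", base):
        kind = "bond"
    elif re.fullmatch(r"eth\d+", base):
        kind = "ethernet (legacy name)"
    elif base.startswith("en") and len(base) > 2:
        kind = "ethernet (systemd name)"
    else:
        kind = "unknown"
    return (kind, vlan_id)
```

For example, eno1.50 is classified as an Ethernet device with a systemd name and VLAN 50, while bond1.30 is a bond with VLAN 30.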

Systemd Network Interface Names

Systemd defines a versioned naming scheme for network device names. The scheme uses the two-character prefix en for Ethernet network devices. The next characters depend on the device driver, device location and other attributes. Some possible patterns are:

  • o<index>[n<phys_port_name>|d<dev_port>] — devices on board

  • s<slot>[f<function>][n<phys_port_name>|d<dev_port>] — devices by hotplug id

  • [P<domain>]p<bus>s<slot>[f<function>][n<phys_port_name>|d<dev_port>] — devices by bus id

  • x<MAC> — devices by MAC address

Some examples for the most common patterns are:

  • eno1 — is the first on-board NIC

  • enp3s0f1 — is function 1 of the NIC on PCI bus 3, slot 0

For a full list of possible device name patterns, see the systemd.net-naming-scheme(7) manpage.

A new version of systemd may define a new version of the network device naming scheme, which it then uses by default. Consequently, updating to a newer systemd version, for example during a major Proxmox VE upgrade, can change the names of network devices and require adjusting the network configuration. To avoid name changes due to a new version of the naming scheme, you can manually pin a particular naming scheme version (see below).

However, even with a pinned naming scheme version, network device names can still change due to kernel or driver updates. In order to avoid name changes for a particular network device altogether, you can manually override its name using a link file (see below).

For more information on network interface names, see Predictable Network Interface Names.

Pinning a specific naming scheme version

You can pin a specific version of the naming scheme for network devices by adding the net.naming-scheme=<version> parameter to the kernel command line. For a list of naming scheme versions, see the systemd.net-naming-scheme(7) manpage.

For example, to pin the version v252, which is the latest naming scheme version for a fresh Proxmox VE 8.0 installation, add the following kernel command-line parameter:

net.naming-scheme=v252

See also this section on editing the kernel command line. You need to reboot for the changes to take effect.

Overriding network device names

You can manually assign a name to a particular network device using a custom systemd.link file. This overrides the name that would be assigned according to the latest network device naming scheme. This way, you can avoid naming changes due to kernel updates, driver updates or newer versions of the naming scheme.

Custom link files should be placed in /etc/systemd/network/ and named <n>-<id>.link, where n is a priority smaller than 99 and id is some identifier. A link file has two sections: [Match] determines which interfaces the file will apply to; [Link] determines how these interfaces should be configured, including their naming.

To assign a name to a particular network device, you need a way to uniquely and permanently identify that device in the [Match] section. One possibility is to match the device’s MAC address using the MACAddress option, as it is unlikely to change. Then, you can assign a name using the Name option in the [Link] section.

For example, to assign the name enwan0 to the device with MAC address aa:bb:cc:dd:ee:ff, create a file /etc/systemd/network/10-enwan0.link with the following contents:

[Match]
MACAddress=aa:bb:cc:dd:ee:ff

[Link]
Name=enwan0

Do not forget to adjust /etc/network/interfaces to use the new name. You need to reboot the node for the change to take effect.

Note It is recommended to assign a name starting with en or eth so that Proxmox VE recognizes the interface as a physical network device which can then be configured via the GUI. Also, you should ensure that the name will not clash with other interface names in the future. One possibility is to assign a name that does not match any name pattern that systemd uses for network interfaces (see above), such as enwan0 in the example above.

For more information on link files, see the systemd.link(5) manpage.
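If you need link files for several devices, generating them can reduce typing errors. The following Python sketch renders the content and path of a link file like the one in the example above. It only returns strings and leaves writing the file to you. It enforces the recommendation that the name starts with en or eth and the requirement that the priority is smaller than 99. The function names render_link_file and link_file_path are hypothetical.

```python
import re


def render_link_file(mac, name):
    """Return the content of a systemd link file that names a device by MAC."""
    mac = mac.lower()
    if not re.fullmatch(r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}", mac):
        raise ValueError(f"not a MAC address: {mac}")
    # Recommended so that Proxmox VE treats the interface as a physical device.
    if not name.startswith(("en", "eth")):
        raise ValueError("name should start with 'en' or 'eth'")
    return f"[Match]\nMACAddress={mac}\n\n[Link]\nName={name}\n"


def link_file_path(name, priority=10):
    """Return the path /etc/systemd/network/<n>-<id>.link for the given name."""
    if not 0 <= priority < 99:
        raise ValueError("priority must be smaller than 99")
    return f"/etc/systemd/network/{priority:02d}-{name}.link"
```

Calling render_link_file("aa:bb:cc:dd:ee:ff", "enwan0") produces the same content as the example above, and link_file_path("enwan0") returns /etc/systemd/network/10-enwan0.link.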

3.4.3. Choosing a network configuration

Depending on your current network organization and your resources you can choose either a bridged, routed, or masquerading networking setup.

Proxmox VE server in a private LAN, using an external gateway to reach the internet

The Bridged model makes the most sense in this case, and this is also the default mode on new Proxmox VE installations. Each of your guest systems will have a virtual interface attached to the Proxmox VE bridge. This is similar in effect to having the guest network card directly connected to a new switch on your LAN, with the Proxmox VE host playing the role of the switch.

Proxmox VE server at hosting provider, with public IP ranges for Guests

For this setup, you can use either a Bridged or Routed model, depending on what your provider allows.

Proxmox VE server at hosting provider, with a single public IP address

In that case the only way to get outgoing network access for your guest systems is to use Masquerading. For incoming network access to your guests, you will need to configure Port Forwarding.

For further flexibility, you can configure VLANs (IEEE 802.1q) and network bonding, also known as "link aggregation". That way it is possible to build complex and flexible virtual networks.

3.4.4. Default Configuration using a Bridge

default-network-setup-bridge.svg

Bridges are like physical network switches implemented in software. All virtual guests can share a single bridge, or you can create multiple bridges to separate network domains. Each host can have up to 4094 bridges.

The installation program creates a single bridge named vmbr0, which is connected to the first Ethernet card. The corresponding configuration in /etc/network/interfaces might look like this:

auto lo
iface lo inet loopback

iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.10.2/24
        gateway 192.168.10.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0

Virtual machines behave as if they were directly connected to the physical network. The network, in turn, sees each virtual machine as having its own MAC, even though there is only one network cable connecting all of these VMs to the network.

3.4.5. Routed Configuration

Most hosting providers do not support the above setup. For security reasons, they disable networking as soon as they detect multiple MAC addresses on a single interface.

Tip Some providers allow you to register additional MACs through their management interface. This avoids the problem, but can be clumsy to configure because you need to register a MAC for each of your VMs.

You can avoid the problem by “routing” all traffic via a single interface. This makes sure that all network packets use the same MAC address.

default-network-setup-routed.svg

A common scenario is that you have a public IP (assume 198.51.100.5 for this example), and an additional IP block for your VMs (203.0.113.16/28). We recommend the following setup for such situations:

auto lo
iface lo inet loopback

auto eno0
iface eno0 inet static
        address  198.51.100.5/29
        gateway  198.51.100.1
        post-up echo 1 > /proc/sys/net/ipv4/ip_forward
        post-up echo 1 > /proc/sys/net/ipv4/conf/eno0/proxy_arp


auto vmbr0
iface vmbr0 inet static
        address  203.0.113.17/28
        bridge-ports none
        bridge-stp off
        bridge-fd 0
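To see which addresses of the additional IP block remain for guests in the routed setup above, you can use Python's standard ipaddress module. The sketch assumes conventional subnet usage, where the network and broadcast addresses are not assigned, and that the bridge itself takes one address of the block. The function name guest_addresses is hypothetical.

```python
import ipaddress


def guest_addresses(bridge_address):
    """List the addresses left for guests in the block routed via the bridge."""
    bridge = ipaddress.ip_interface(bridge_address)
    # hosts() omits the network and broadcast address of the block.
    return [host for host in bridge.network.hosts() if host != bridge.ip]
```

For the bridge address 203.0.113.17/28 from the example, this leaves 13 addresses for guests, from 203.0.113.18 to 203.0.113.30.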

3.4.6. Masquerading (NAT) with iptables

Masquerading allows guests having only a private IP address to access the network by using the host IP address for outgoing traffic. Each outgoing packet is rewritten by iptables to appear as originating from the host, and responses are rewritten accordingly to be routed to the original sender.

auto lo
iface lo inet loopback

auto eno1
#real IP address
iface eno1 inet static
        address  198.51.100.5/24
        gateway  198.51.100.1

auto vmbr0
#private sub network
iface vmbr0 inet static
        address  10.10.10.1/24
        bridge-ports none
        bridge-stp off
        bridge-fd 0

        post-up   echo 1 > /proc/sys/net/ipv4/ip_forward
        post-up   iptables -t nat -A POSTROUTING -s '10.10.10.0/24' -o eno1 -j MASQUERADE
        post-down iptables -t nat -D POSTROUTING -s '10.10.10.0/24' -o eno1 -j MASQUERADE

Note In some masquerade setups with the firewall enabled, conntrack zones might be needed for outgoing connections. Otherwise, the firewall could block outgoing connections, since they will prefer the POSTROUTING of the VM bridge (and not MASQUERADE).

Adding these lines to /etc/network/interfaces can fix this problem:

post-up   iptables -t raw -I PREROUTING -i fwbr+ -j CT --zone 1
post-down iptables -t raw -D PREROUTING -i fwbr+ -j CT --zone 1


3.4.7. Linux Bond

Bonding (also called NIC teaming or Link Aggregation) is a technique for binding multiple NICs to a single network device. It can serve different goals, such as making the network fault-tolerant, increasing performance, or both.

High-speed hardware like Fibre Channel and the associated switching hardware can be quite expensive. By doing link aggregation, two NICs can appear as one logical interface, resulting in up to double the aggregate bandwidth. This is a native Linux kernel feature that is supported by most switches. If your nodes have multiple Ethernet ports, you can distribute your points of failure by running network cables to different switches, and the bonded connection will fail over to one cable or the other in case of network trouble.

Aggregated links can improve live-migration delays and improve the speed of replication of data between Proxmox VE Cluster nodes.

There are 7 modes for bonding:

  • Round-robin (balance-rr): Transmit network packets in sequential order from the first available network interface (NIC) slave through the last. This mode provides load balancing and fault tolerance.

  • Active-backup (active-backup): Only one NIC slave in the bond is active. A different slave becomes active if, and only if, the active slave fails. The single logical bonded interface’s MAC address is externally visible on only one NIC (port) to avoid distortion in the network switch. This mode provides fault tolerance.

  • XOR (balance-xor): Transmit network packets based on [(source MAC address XOR’d with destination MAC address) modulo NIC slave count]. This selects the same NIC slave for each destination MAC address. This mode provides load balancing and fault tolerance.

  • Broadcast (broadcast): Transmit network packets on all slave network interfaces. This mode provides fault tolerance.

  • IEEE 802.3ad Dynamic link aggregation (802.3ad)(LACP): Creates aggregation groups that share the same speed and duplex settings. Utilizes all slave network interfaces in the active aggregator group according to the 802.3ad specification.

  • Adaptive transmit load balancing (balance-tlb): Linux bonding driver mode that does not require any special network-switch support. The outgoing network packet traffic is distributed according to the current load (computed relative to the speed) on each network interface slave. Incoming traffic is received by one currently designated slave network interface. If this receiving slave fails, another slave takes over the MAC address of the failed receiving slave.

  • Adaptive load balancing (balance-alb): Includes balance-tlb plus receive load balancing (rlb) for IPv4 traffic, and does not require any special network switch support. The receive load balancing is achieved by ARP negotiation. The bonding driver intercepts the ARP replies sent by the local system on their way out and overwrites the source hardware address with the unique hardware address of one of the NIC slaves in the single logical bonded interface, such that different network peers use different MAC addresses for their network packet traffic.
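
To make the balance-xor selection rule concrete, here is a small sketch of the formula quoted in the list, (source MAC XOR destination MAC) modulo slave count. It follows the simplified formula from the text only; the hash the kernel actually applies may take further fields into account, and the function names and MAC addresses are invented for the example.

```shell
# Convert a colon-separated MAC address to an integer.
mac_to_int() {
    echo $(( 0x$(echo "$1" | tr -d ':') ))
}

# Print the index of the slave chosen for <src-mac> <dst-mac> <slave-count>.
xor_slave() {
    src=$(mac_to_int "$1")
    dst=$(mac_to_int "$2")
    echo $(( (src ^ dst) % $3 ))
}

# The same pair of MAC addresses always maps to the same slave index.
xor_slave 52:54:00:00:00:01 52:54:00:00:00:04 2   # prints: 1
xor_slave 52:54:00:00:00:01 52:54:00:00:00:05 2   # prints: 0
```

Because the result depends only on the address pair, traffic between two given hosts stays on one link, which is why this mode balances across peers rather than within a single connection.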

If your switch supports the LACP (IEEE 802.3ad) protocol, then we recommend using the corresponding bonding mode (802.3ad). Otherwise, you should generally use the active-backup mode.

For the cluster network (Corosync), we recommend configuring it with multiple networks. Corosync does not need a bond for network redundancy, as it can switch between networks by itself if one becomes unusable.

The following bond configuration can be used as a distributed/shared storage network. The benefit is that you get more speed and the network is fault-tolerant.

Example: Use bond with fixed IP address
auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno2 inet manual

iface eno3 inet manual

auto bond0
iface bond0 inet static
      bond-slaves eno1 eno2
      address  192.168.1.2/24
      bond-miimon 100
      bond-mode 802.3ad
      bond-xmit-hash-policy layer2+3

auto vmbr0
iface vmbr0 inet static
        address  10.10.10.2/24
        gateway  10.10.10.1
        bridge-ports eno3
        bridge-stp off
        bridge-fd 0

Another possibility is to use the bond directly as a bridge port. This can be used to make the guest network fault-tolerant.

Example: Use a bond as bridge port
auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno2 inet manual

auto bond0
iface bond0 inet manual
      bond-slaves eno1 eno2
      bond-miimon 100
      bond-mode 802.3ad
      bond-xmit-hash-policy layer2+3

auto vmbr0
iface vmbr0 inet static
        address  10.10.10.2/24
        gateway  10.10.10.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
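
After applying a bond configuration, it can be useful to confirm that all member interfaces are actually up. The Linux bonding driver reports its state in /proc/net/bonding/<bond>. The sketch below counts the members whose link is reported as up; the function name is hypothetical, and the parsing assumes the usual "Slave Interface:" and "MII Status:" line layout of that file.

```shell
# Read the contents of /proc/net/bonding/<bond> on stdin and print the
# number of member interfaces whose MII status is "up".
bond_slaves_up() {
    awk '
        /^Slave Interface:/ { in_slave = 1; next }
        # Only the first MII Status line after a Slave Interface line counts;
        # the bond-level MII Status near the top of the file is ignored.
        in_slave && /^MII Status:/ {
            if ($3 == "up") up++
            in_slave = 0
        }
        END { print up + 0 }
    '
}

# On a node with the bond from the example:
#   bond_slaves_up < /proc/net/bonding/bond0
```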

3.4.8. VLAN 802.1Q

A virtual LAN (VLAN) is a broadcast domain that is partitioned and isolated in the network at layer two. This makes it possible to have multiple networks (the 12-bit tag allows 4096 values, of which 4094 are usable) in one physical network, each independent of the others.

Each VLAN network is identified by a number, often called a tag. Network packets are then tagged to identify which virtual network they belong to.
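
The tag is carried in a 12-bit field. IDs 0 and 4095 are reserved by IEEE 802.1Q, so usable tags run from 1 to 4094. A minimal sketch of a range check that could be run before writing a tag into a configuration (the function name is illustrative only):

```shell
# Succeed if the argument is a usable 802.1Q VLAN tag (1..4094).
valid_vlan_tag() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;    # reject empty or non-numeric input
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 4094 ]
}

if valid_vlan_tag 5; then
    echo "tag 5 is usable"
fi
```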

VLAN for Guest Networks

Proxmox VE supports this setup out of the box. You can specify the VLAN tag when you create a VM. The VLAN tag is part of the guest network configuration. The networking layer supports different modes to implement VLANs, depending on the bridge configuration:

  • VLAN awareness on the Linux bridge: In this case, each guest’s virtual network card is assigned to a VLAN tag, which is transparently supported by the Linux bridge. Trunk mode is also possible, but that makes configuration in the guest necessary.

  • "traditional" VLAN on the Linux bridge: In contrast to the VLAN awareness method, this method is not transparent and creates a VLAN device with associated bridge for each VLAN. That is, creating a guest on VLAN 5 for example, would create two interfaces eno1.5 and vmbr0v5, which would remain until a reboot occurs.

  • Open vSwitch VLAN: This mode uses the OVS VLAN feature.

  • Guest configured VLAN: VLANs are assigned inside the guest. In this case, the setup is completely done inside the guest and can not be influenced from the outside. The benefit is that you can use more than one VLAN on a single virtual NIC.

VLAN on the Host

To allow host communication with an isolated network, it is possible to apply VLAN tags to any network device (NIC, bond, bridge). In general, you should configure the VLAN on the interface with the least abstraction layers between itself and the physical NIC.

For example, in a default configuration, you might want to place the host management address on a separate VLAN.

Example: Use VLAN 5 for the Proxmox VE management IP with traditional Linux bridge
auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno1.5 inet manual

auto vmbr0v5
iface vmbr0v5 inet static
        address  10.10.10.2/24
        gateway  10.10.10.1
        bridge-ports eno1.5
        bridge-stp off
        bridge-fd 0

auto vmbr0
iface vmbr0 inet manual
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0

Example: Use VLAN 5 for the Proxmox VE management IP with VLAN aware Linux bridge
auto lo
iface lo inet loopback

iface eno1 inet manual


auto vmbr0.5
iface vmbr0.5 inet static
        address  10.10.10.2/24
        gateway  10.10.10.1

auto vmbr0
iface vmbr0 inet manual
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

The next example is the same setup but a bond is used to make this network fail-safe.

Example: Use VLAN 5 with bond0 for the Proxmox VE management IP with traditional Linux bridge
auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno2 inet manual

auto bond0
iface bond0 inet manual
      bond-slaves eno1 eno2
      bond-miimon 100
      bond-mode 802.3ad
      bond-xmit-hash-policy layer2+3

iface bond0.5 inet manual

auto vmbr0v5
iface vmbr0v5 inet static
        address  10.10.10.2/24
        gateway  10.10.10.1
        bridge-ports bond0.5
        bridge-stp off
        bridge-fd 0

auto vmbr0
iface vmbr0 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0

3.4.9. Disabling IPv6 on the Node

Proxmox VE works correctly in all environments, irrespective of whether IPv6 is deployed or not. We recommend leaving all settings at the provided defaults.

Should you still need to disable support for IPv6 on your node, do so by creating an appropriate sysctl.conf(5) snippet file and setting the proper sysctls, for example adding /etc/sysctl.d/disable-ipv6.conf with content:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1

This method is preferred to disabling the loading of the IPv6 module on the kernel command line.
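
A minimal sketch of creating that snippet from the shell. The function name is made up for this example, and the optional directory argument exists only so the function can be tried out somewhere other than /etc/sysctl.d.

```shell
# Write the IPv6-disabling sysctl snippet into the given directory
# (defaults to /etc/sysctl.d, which requires root).
write_disable_ipv6() {
    dir=${1:-/etc/sysctl.d}
    printf '%s\n' \
        'net.ipv6.conf.all.disable_ipv6 = 1' \
        'net.ipv6.conf.default.disable_ipv6 = 1' > "$dir/disable-ipv6.conf"
}

# On the node, as root, one would then load and verify the settings:
#   write_disable_ipv6 && sysctl --system
#   sysctl net.ipv6.conf.all.disable_ipv6
```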

3.4.10. Disabling MAC Learning on a Bridge

By default, MAC learning is enabled on a bridge to ensure a smooth experience with virtual guests and their networks.

But in some environments this can be undesired. Since Proxmox VE 7.3, you can disable MAC learning on a bridge by setting bridge-disable-mac-learning 1 on that bridge in /etc/network/interfaces, for example:

# ...

auto vmbr0
iface vmbr0 inet static
        address  10.10.10.2/24
        gateway  10.10.10.1
        bridge-ports ens18
        bridge-stp off
        bridge-fd 0
        bridge-disable-mac-learning 1

Once enabled, Proxmox VE will manually add the configured MAC addresses of VMs and containers to the bridge’s forwarding database to ensure that guests can still use the network, but only when they are using their actual MAC address.

3.5. Time Synchronization

The Proxmox VE cluster stack itself relies heavily on the fact that all the nodes have precisely synchronized time. Some other components, like Ceph, also won’t work properly if the local time on all nodes is not in sync.

Time synchronization between nodes can be achieved using the “Network Time Protocol” (NTP). As of Proxmox VE 7, chrony is used as the default NTP daemon, while Proxmox VE 6 uses systemd-timesyncd. Both come preconfigured to use a set of public servers.

Important If you upgrade your system to Proxmox VE 7, it is recommended that you manually install either chrony, ntp or openntpd.

3.5.1. Using Custom NTP Servers

In some cases, it might be desired to use non-default NTP servers. For example, if your Proxmox VE nodes do not have access to the public internet due to restrictive firewall rules, you need to set up local NTP servers and tell the NTP daemon to use them.

For systems using chrony:

Specify which servers chrony should use in /etc/chrony/chrony.conf:

server ntp1.example.com iburst
server ntp2.example.com iburst
server ntp3.example.com iburst

Restart chrony:

# systemctl restart chronyd

Check the journal to confirm that the newly configured NTP servers are being used:

# journalctl --since -1h -u chrony
...
Aug 26 13:00:09 node1 systemd[1]: Started chrony, an NTP client/server.
Aug 26 13:00:15 node1 chronyd[4873]: Selected source 10.0.0.1 (ntp1.example.com)
Aug 26 13:00:15 node1 chronyd[4873]: System clock TAI offset set to 37 seconds
...

For systems using systemd-timesyncd:

Specify which servers systemd-timesyncd should use in /etc/systemd/timesyncd.conf:

[Time]
NTP=ntp1.example.com ntp2.example.com ntp3.example.com ntp4.example.com

Then, restart the synchronization service (systemctl restart systemd-timesyncd), and verify that your newly configured NTP servers are in use by checking the journal (journalctl --since -1h -u systemd-timesyncd):

...
Oct 07 14:58:36 node1 systemd[1]: Stopping Network Time Synchronization...
Oct 07 14:58:36 node1 systemd[1]: Starting Network Time Synchronization...
Oct 07 14:58:36 node1 systemd[1]: Started Network Time Synchronization.
Oct 07 14:58:36 node1 systemd-timesyncd[13514]: Using NTP server 10.0.0.1:123 (ntp1.example.com).
Oct 07 14:58:36 node1 systemd-timesyncd[13514]: interval/delta/delay/jitter/drift 64s/-0.002s/0.020s/0.000s/-31ppm
...

3.6. External Metric Server


In Proxmox VE, you can define external metric servers, which will periodically receive various stats about your hosts, virtual guests and storages.

Currently supported are Graphite and InfluxDB.

The external metric server definitions are saved in /etc/pve/status.cfg, and can be edited through the web interface.

3.6.1. Graphite server configuration


The default port is set to 2003 and the default graphite path is proxmox.

By default, Proxmox VE sends the data over UDP, so the graphite server has to be configured to accept this. Here the maximum transmission unit (MTU) can be configured for environments not using the standard 1500 MTU.

You can also configure the plugin to use TCP. In order not to block the important pvestatd statistic collection daemon, a timeout is required to cope with network problems.

3.6.2. Influxdb plugin configuration


Proxmox VE sends the data over UDP, so the influxdb server has to be configured for this. The MTU can also be configured here, if necessary.

Here is an example configuration for influxdb (on your influxdb server):

[[udp]]
   enabled = true
   bind-address = "0.0.0.0:8089"
   database = "proxmox"
   batch-size = 1000
   batch-timeout = "1s"

With this configuration, your server listens on all IP addresses on port 8089 and writes the data to the proxmox database.

Alternatively, the plugin can be configured to use the http(s) API of InfluxDB 2.x. InfluxDB 1.8.x also contains a forward-compatible API endpoint for this v2 API.

To use it, set influxdbproto to http or https (depending on your configuration). By default, Proxmox VE uses the organization proxmox and the bucket/db proxmox (they can be set with the organization and bucket settings, respectively).

Since InfluxDB’s v2 API is only available with authentication, you have to generate a token that can write into the correct bucket and set it.

In the v2 compatible API of 1.8.x, you can use user:password as token (if required), and can omit the organization since that has no meaning in InfluxDB 1.x.

You can also set the HTTP Timeout (default is 1s) with the timeout setting, as well as the maximum batch size (default 25000000 bytes) with the max-body-size setting (this corresponds to the InfluxDB setting with the same name).

3.7. Disk Health Monitoring

Although a robust and redundant storage is recommended, it can be very helpful to monitor the health of your local disks.

Starting with Proxmox VE 4.3, the package smartmontools [smartmontools homepage https://www.smartmontools.org] is installed and required. This is a set of tools to monitor and control the S.M.A.R.T. system for local hard disks.

You can get the status of a disk by issuing the following command:

# smartctl -a /dev/sdX

where /dev/sdX is the path to one of your local disks.

If the output says:

SMART support is: Disabled

you can enable it with the command:

# smartctl -s on /dev/sdX

For more information on how to use smartctl, please see man smartctl.

By default, smartmontools daemon smartd is active and enabled, and scans the disks under /dev/sdX and /dev/hdX every 30 minutes for errors and warnings, and sends an e-mail to root if it detects a problem.

For more information about how to configure smartd, please see man smartd and man smartd.conf.
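
The manual check described here can be scripted across several disks. The sketch below is only an illustration: the function names are invented, and it relies solely on the "SMART support is: Disabled" line and the smartctl -a invocation quoted in this section.

```shell
# Succeed if the smartctl report read from stdin says SMART is disabled.
smart_is_disabled() {
    grep -q '^SMART support is:[[:space:]]*Disabled'
}

# List local /dev/sdX disks on which SMART is currently disabled.
# Needs root and smartmontools; not called automatically here.
report_disabled_disks() {
    for disk in /dev/sd?; do
        [ -b "$disk" ] || continue
        if smartctl -a "$disk" | smart_is_disabled; then
            echo "$disk: SMART disabled, enable with: smartctl -s on $disk"
        fi
    done
}

# On a node, as root:
#   report_disabled_disks
```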

If you use your hard disks with a hardware RAID controller, there are most likely tools to monitor the disks in the RAID array and the array itself. For more information about this, please refer to the vendor of your RAID controller.

3.8. Logical Volume Manager (LVM)

Most people install Proxmox VE directly on a local disk. The Proxmox VE installation CD offers several options for local disk management, and the current default setup uses LVM. The installer lets you select a single disk for such setup, and uses that disk as physical volume for the Volume Group (VG) pve. The following output is from a test installation using a small 8GB disk:

# pvs
  PV         VG   Fmt  Attr PSize PFree
  /dev/sda3  pve  lvm2 a--  7.87g 876.00m

# vgs
  VG   #PV #LV #SN Attr   VSize VFree
  pve    1   3   0 wz--n- 7.87g 876.00m

The installer allocates three Logical Volumes (LV) inside this VG:

# lvs
  LV   VG   Attr       LSize   Pool Origin Data%  Meta%
  data pve  twi-a-tz--   4.38g             0.00   0.63
  root pve  -wi-ao----   1.75g
  swap pve  -wi-ao---- 896.00m

root

Formatted as ext4, and contains the operating system.

swap

Swap partition

data

This volume uses LVM-thin, and is used to store VM images. LVM-thin is preferable for this task, because it offers efficient support for snapshots and clones.

For Proxmox VE versions up to 4.1, the installer creates a standard logical volume called “data”, which is mounted at /var/lib/vz.

Starting from version 4.2, the logical volume “data” is a LVM-thin pool, used to store block based guest images, and /var/lib/vz is simply a directory on the root file system.

3.8.1. Hardware

We highly recommend using a hardware RAID controller (with BBU) for such setups. This increases performance, provides redundancy, and makes disk replacements easier (hot-pluggable).

LVM itself does not need any special hardware, and memory requirements are very low.

3.8.2. Bootloader

We install two boot loaders by default. The first partition contains the standard GRUB boot loader. The second partition is an EFI System Partition (ESP), which makes it possible to boot on EFI systems and to apply persistent firmware updates from the user space.

3.8.3. Creating a Volume Group

Let’s assume we have an empty disk /dev/sdb, onto which we want to create a volume group named “vmdata”.

Caution Please note that the following commands will destroy all existing data on /dev/sdb.

First create a partition.

# sgdisk -N 1 /dev/sdb

Create a Physical Volume (PV) without confirmation and with a metadata size of 250K.

# pvcreate --metadatasize 250k -y -ff /dev/sdb1

Create a volume group named “vmdata” on /dev/sdb1

# vgcreate vmdata /dev/sdb1

3.8.4. Creating an extra LV for /var/lib/vz

This can be easily done by creating a new thin LV.

# lvcreate -n <Name> -V <Size[M,G,T]> <VG>/<LVThin_pool>

A real world example:

# lvcreate -n vz -V 10G pve/data

Now a filesystem must be created on the LV.

# mkfs.ext4 /dev/pve/vz

Finally, this has to be mounted.

Warning Be sure that /var/lib/vz is empty. On a default installation it is not.

To make it always accessible, add the following line to /etc/fstab.

# echo '/dev/pve/vz /var/lib/vz ext4 defaults 0 2' >> /etc/fstab
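
The warning about /var/lib/vz can be turned into a small guard before mounting. This is a sketch under the assumption that the fstab entry shown here is already in place; the function names are hypothetical.

```shell
# Succeed if the given path is an existing, empty directory.
dir_is_empty() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Mount /var/lib/vz via its /etc/fstab entry, but only if it is empty,
# so that existing content is not hidden underneath the new mount.
mount_vz() {
    if dir_is_empty /var/lib/vz; then
        mount /var/lib/vz
    else
        echo "/var/lib/vz is not empty; move its contents away first" >&2
        return 1
    fi
}

# On the node, as root:
#   mount_vz
```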

3.8.5. Resizing the thin pool

Resize the LV and the metadata pool with the following command:

# lvresize --size +<size[M,G,T]> --poolmetadatasize +<size[M,G]> <VG>/<LVThin_pool>

Note When extending the data pool, the metadata pool must also be extended.
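
To decide when a resize is due, the fill levels of a thin pool can be read with the lvs report fields data_percent and metadata_percent. The sketch below checks both values against a limit; the function name and the 80% limit are arbitrary choices for this example, not Proxmox VE recommendations.

```shell
# Read "<data%> <metadata%>" from stdin and succeed if either value
# exceeds the limit given as first argument.
pool_over_threshold() {
    awk -v limit="$1" '
        $1 + 0 > limit + 0 || $2 + 0 > limit + 0 { hit = 1 }
        END { exit !hit }
    '
}

# On a node with the default pve/data thin pool:
#   lvs --noheadings -o data_percent,metadata_percent pve/data \
#       | pool_over_threshold 80 && echo "consider extending data and metadata"
```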

3.8.6. Create a LVM-thin pool

A thin pool has to be created on top of a volume group. For how to create a volume group, see Section LVM.

# lvcreate -L 80G -T -n vmstore vmdata

3.9. ZFS on Linux

ZFS is a combined file system and logical volume manager designed by Sun Microsystems. Starting with Proxmox VE 3.4, the native Linux kernel port of the ZFS file system is introduced as an optional file system and also as an additional selection for the root file system. There is no need to manually compile ZFS modules, as all packages are included.

By using ZFS, it’s possible to achieve maximum enterprise features with low-budget hardware, but also high-performance systems by leveraging SSD caching or even SSD-only setups. ZFS can replace cost-intensive hardware RAID cards with moderate CPU and memory load combined with easy management.

General ZFS advantages
  • Easy configuration and management with Proxmox VE GUI and CLI.

  • Reliable

  • Protection against data corruption

  • Data compression on file system level

  • Snapshots

  • Copy-on-write clone

  • Various raid levels: RAID0, RAID1, RAID10, RAIDZ-1, RAIDZ-2, RAIDZ-3, dRAID, dRAID2, dRAID3

  • Can use SSD for cache

  • Self healing

  • Continuous integrity checking

  • Designed for high storage capacities

  • Asynchronous replication over network

  • Open Source

  • Encryption

3.9.1. Hardware

ZFS depends heavily on memory, so you need at least 8GB to start. In practice, use as much as you can get for your hardware/budget. To prevent data corruption, we recommend the use of high quality ECC RAM.

If you use a dedicated cache and/or log disk, you should use an enterprise class SSD. This can increase the overall performance significantly.

Important Do not use ZFS on top of a hardware RAID controller which has its own cache management. ZFS needs to communicate directly with the disks. An HBA adapter or something like an LSI controller flashed in “IT” mode is more appropriate.

If you are experimenting with an installation of Proxmox VE inside a VM (Nested Virtualization), don’t use virtio for disks of that VM, as they are not supported by ZFS. Use IDE or SCSI instead (also works with the virtio SCSI controller type).

3.9.2. Installation as Root File System

When you install using the Proxmox VE installer, you can choose ZFS for the root file system. You need to select the RAID type at installation time:

RAID0

Also called “striping”. The capacity of such volume is the sum of the capacities of all disks. But RAID0 does not add any redundancy, so the failure of a single drive makes the volume unusable.

RAID1

Also called “mirroring”. Data is written identically to all disks. This mode requires at least 2 disks with the same size. The resulting capacity is that of a single disk.

RAID10

A combination of RAID0 and RAID1. Requires at least 4 disks.

RAIDZ-1

A variation on RAID-5, single parity. Requires at least 3 disks.

RAIDZ-2

A variation on RAID-5, double parity. Requires at least 4 disks.

RAIDZ-3

A variation on RAID-5, triple parity. Requires at least 5 disks.

The installer automatically partitions the disks, creates a ZFS pool called rpool, and installs the root file system on the ZFS subvolume rpool/ROOT/pve-1.

Another subvolume called rpool/data is created to store VM images. In order to use that with the Proxmox VE tools, the installer creates the following configuration entry in /etc/pve/storage.cfg:

zfspool: local-zfs
        pool rpool/data
        sparse
        content images,rootdir

After installation, you can view your ZFS pool status using the zpool command:

# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

errors: No known data errors

The zfs command is used to configure and manage your ZFS file systems. The following command lists all file systems after installation:

# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool             4.94G  7.68T    96K  /rpool
rpool/ROOT         702M  7.68T    96K  /rpool/ROOT
rpool/ROOT/pve-1   702M  7.68T   702M  /
rpool/data          96K  7.68T    96K  /rpool/data
rpool/swap        4.25G  7.69T    64K  -

3.9.3. ZFS RAID Level Considerations

There are a few factors to take into consideration when choosing the layout of a ZFS pool. The basic building block of a ZFS pool is the virtual device, or vdev. All vdevs in a pool are used equally and the data is striped among them (RAID0). Check the zpoolconcepts(7) manpage for more details on vdevs.

Performance

Each vdev type has different performance behaviors. The two parameters of interest are the IOPS (Input/Output Operations per Second) and the bandwidth with which data can be written or read.

A mirror vdev (RAID1) will approximately behave like a single disk in regard to both parameters when writing data. When reading data the performance will scale linearly with the number of disks in the mirror.

A common situation is to have 4 disks. When setting them up as 2 mirror vdevs (RAID10), the pool will have the write characteristics of two single disks in regard to IOPS and bandwidth. For read operations, it will resemble 4 single disks.

A RAIDZ of any redundancy level will approximately behave like a single disk in regard to IOPS with a lot of bandwidth. How much bandwidth depends on the size of the RAIDZ vdev and the redundancy level.

A dRAID pool should match the performance of an equivalent RAIDZ pool.

For running VMs, IOPS is the more important metric in most situations.

Size, Space usage and Redundancy

While a pool made of mirror vdevs will have the best performance characteristics, the usable space will be 50% of the disks available. Less if a mirror vdev consists of more than 2 disks, for example in a 3-way mirror. At least one healthy disk per mirror is needed for the pool to stay functional.

The usable space of a RAIDZ type vdev of N disks is roughly N-P, with P being the RAIDZ-level. The RAIDZ-level indicates how many arbitrary disks can fail without losing data. A special case is a 4 disk pool with RAIDZ2. In this situation it is usually better to use 2 mirror vdevs for the better performance as the usable space will be the same.
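
The rough capacity rules above can be written down as two one-line helpers, counted in units of one disk. The names are invented for this sketch, and metadata and other overhead are ignored, as in the text.

```shell
# Usable capacity (in disks) of one RAIDZ vdev: <disks> <parity-level>.
raidz_usable() {
    echo $(( $1 - $2 ))
}

# Usable capacity (in disks) of a pool of mirrors: <disks> <disks-per-mirror>.
mirror_usable() {
    echo $(( $1 / $2 ))
}

# The 4-disk special case: RAIDZ2 and two 2-way mirrors give the same space.
raidz_usable 4 2    # prints: 2
mirror_usable 4 2   # prints: 2
```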

Another important factor when using any RAIDZ level is how ZVOL datasets, which are used for VM disks, behave. For each data block the pool needs parity data which is at least the size of the minimum block size defined by the ashift value of the pool. With an ashift of 12 the block size of the pool is 4k. The default block size for a ZVOL is 8k. Therefore, in a RAIDZ2 each 8k block written will cause two additional 4k parity blocks to be written, 8k + 4k + 4k = 16k. This is of course a simplified approach and the real situation will be slightly different with metadata, compression and such not being accounted for in this example.
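
The arithmetic in this paragraph can be reproduced with a short sketch. It implements only the simplified model used in the text (parity sectors per stripe row, no padding, metadata, or compression), so real pools will deviate from it. The function name and the 6-disk pool are assumptions for the example.

```shell
# Approximate space (KiB) written for one volblocksize block on a RAIDZ vdev.
# Arguments: <disks> <parity-level> <ashift> <volblocksize-in-KiB>
zvol_alloc_kib() {
    n=$1
    p=$2
    sector=$(( 1 << $3 ))                       # pool block size in bytes
    vol=$(( $4 * 1024 ))                        # volblocksize in bytes
    data=$(( (vol + sector - 1) / sector ))     # data sectors needed
    rows=$(( (data + n - p - 1) / (n - p) ))    # stripe rows, rounded up
    echo $(( (data + p * rows) * sector / 1024 ))
}

zvol_alloc_kib 6 2 12 8    # prints: 16  (8k data + 2 x 4k parity)
zvol_alloc_kib 6 2 12 16   # prints: 24  (16k data + 2 x 4k parity)
zvol_alloc_kib 6 2 9 8     # prints: 12  (8k data + 8 x 512 byte parity)
```

In this model, the 8k block stores 50% payload, the 16k block about 67%, and ashift=9 about 67% as well, which illustrates why a larger volblocksize or a smaller ashift improves the data-to-parity ratio.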

This behavior can be observed when checking the following properties of the ZVOL:

  • volsize

  • refreservation (if the pool is not thin provisioned)

  • used (if the pool is thin provisioned and without snapshots present)

# zfs get volsize,refreservation,used <pool>/vm-<vmid>-disk-X

volsize is the size of the disk as it is presented to the VM, while refreservation shows the reserved space on the pool which includes the expected space needed for the parity data. If the pool is thin provisioned, the refreservation will be set to 0. Another way to observe the behavior is to compare the used disk space within the VM and the used property. Be aware that snapshots will skew the value.

There are a few options to counter the increased use of space:

  • Increase the volblocksize to improve the data to parity ratio

  • Use mirror vdevs instead of RAIDZ

  • Use ashift=9 (block size of 512 bytes)

The volblocksize property can only be set when creating a ZVOL. The default value can be changed in the storage configuration. When doing this, the guest needs to be tuned accordingly and depending on the use case, the problem of write amplification is just moved from the ZFS layer up to the guest.

Using ashift=9 when creating the pool can lead to bad performance, depending on the disks underneath, and cannot be changed later on.

Mirror vdevs (RAID1, RAID10) have favorable behavior for VM workloads. Use them, unless your environment has specific needs and characteristics where RAIDZ performance characteristics are acceptable.

3.9.4. ZFS dRAID

In a ZFS dRAID (declustered RAID) the hot spare drive(s) participate in the RAID. Their spare capacity is reserved and used for rebuilding when one drive fails. This provides, depending on the configuration, faster rebuilding compared to a RAIDZ in case of drive failure. More information can be found in the official OpenZFS documentation.
[OpenZFS dRAID https://openzfs.github.io/openzfs-docs/Basic%20Concepts/dRAID%20Howto.html]

Note dRAID is intended for more than 10-15 disks in a dRAID. A RAIDZ setup should be better for a lower amount of disks in most use cases.

Note The GUI requires one more disk than the minimum (i.e. dRAID1 needs 3). It expects that a spare disk is added as well.

  • dRAID1 or dRAID: requires at least 2 disks, one can fail before data is lost

  • dRAID2: requires at least 3 disks, two can fail before data is lost

  • dRAID3: requires at least 4 disks, three can fail before data is lost

Additional information can be found on the manual page:

# man zpoolconcepts

Spares and Data

The number of spares tells the system how many disks it should keep ready in case of a disk failure. The default value is 0 spares. Without spares, rebuilding won’t get any speed benefits.

data defines the number of devices in a redundancy group. The default value is 8, unless disks - parity - spares is less than 8, in which case the lower number is used. In general, a smaller number of data devices leads to higher IOPS, better compression ratios and faster resilvering, but defining fewer data devices reduces the available storage capacity of the pool.
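
The default for data described here amounts to taking the smaller of 8 and disks - parity - spares. A tiny sketch of that rule, with a made-up function name:

```shell
# Effective default number of data devices: <disks> <parity> <spares>.
draid_default_data() {
    avail=$(( $1 - $2 - $3 ))
    if [ "$avail" -lt 8 ]; then
        echo "$avail"
    else
        echo 8
    fi
}

draid_default_data 6 1 1    # prints: 4
draid_default_data 16 2 1   # prints: 8
```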

3.9.5. Bootloader

Proxmox VE uses proxmox-boot-tool to manage the bootloader configuration. See the chapter on Proxmox VE host bootloaders for details.

3.9.6. ZFS Administration

This section gives you some usage examples for common tasks. ZFS itself is really powerful and provides many options. The main commands to manage ZFS are zfs and zpool. Both commands come with great manual pages, which can be read with:

# man zpool
# man zfs

Create a new zpool

To create a new pool, at least one disk is needed. The ashift value should correspond to a sector size (2 to the power of ashift) that is equal to or larger than the sector size of the underlying disk.

# zpool create -f -o ashift=12 <pool> <device>
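
To pick an ashift for a given disk, the relation "sector size = 2 to the power of ashift" can be inverted. The sketch below does this with an invented helper; the sysfs path in the comment is the usual Linux location for a disk's physical sector size.

```shell
# Print the smallest ashift whose sector size (2^ashift) is at least
# the given sector size in bytes.
ashift_for() {
    size=$1
    shift_val=0
    while [ $(( 1 << shift_val )) -lt "$size" ]; do
        shift_val=$(( shift_val + 1 ))
    done
    echo "$shift_val"
}

ashift_for 512    # prints: 9
ashift_for 4096   # prints: 12

# On a node, the physical sector size of a disk can be read with:
#   cat /sys/block/sdX/queue/physical_block_size
```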
Tip

Pool names must adhere to the following rules:

  • begin with a letter (a-z or A-Z)

  • contain only alphanumeric, -, _, ., : or ` ` (space) characters

  • must not begin with one of mirror, raidz, draid or spare

  • must not be log
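
These naming rules can be checked before calling zpool create. The following sketch encodes exactly the four rules listed and nothing more; the function name is hypothetical, and zpool itself remains the authority on what it accepts.

```shell
# Succeed if the argument satisfies the pool naming rules listed above.
valid_pool_name() {
    case "$1" in
        # reserved name and reserved prefixes
        log|mirror*|raidz*|draid*|spare*) return 1 ;;
    esac
    # must start with a letter; then only alphanumerics, _ . : - or space
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z][A-Za-z0-9_.: -]*$'
}

for name in tank vm-data_1 1pool mirror1 log; do
    if valid_pool_name "$name"; then
        echo "$name: ok"
    else
        echo "$name: rejected"
    fi
done
```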

To activate compression (see section Compression in ZFS):

# zfs set compression=lz4 <pool>

Create a new pool with RAID-0

Minimum 1 disk

# zpool create -f -o ashift=12 <pool> <device1> <device2>

Create a new pool with RAID-1

Minimum 2 disks

# zpool create -f -o ashift=12 <pool> mirror <device1> <device2>

Create a new pool with RAID-10

Minimum 4 disks

# zpool create -f -o ashift=12 <pool> mirror <device1> <device2> mirror <device3> <device4>

Create a new pool with RAIDZ-1

Minimum 3 disks

# zpool create -f -o ashift=12 <pool> raidz1 <device1> <device2> <device3>

Create a new pool with RAIDZ-2

Minimum 4 disks

# zpool create -f -o ashift=12 <pool> raidz2 <device1> <device2> <device3> <device4>

Please read the section on ZFS RAID Level Considerations to get a rough estimate of the IOPS and bandwidth to expect before setting up a pool, especially when you want to use a RAID-Z mode.

Create a new pool with cache (L2ARC)

It is possible to use a dedicated device, or partition, as second-level cache to increase the performance. Such a cache device will especially help with random-read workloads of data that is mostly static. As it acts as additional caching layer between the actual storage, and the in-memory ARC, it can also help if the ARC must be reduced due to memory constraints.

Create ZFS pool with an on-disk cache
# zpool create -f -o ashift=12 <pool> <device> cache <cache-device>

Here only a single <device> and a single <cache-device> were used, but it is possible to use more devices, as shown in Create a new pool with RAID.

Note that for cache devices no mirror or RAID modes exist; they are all simply accumulated.

If any cache device produces errors on read, ZFS will transparently divert that request to the underlying storage layer.

Create a new pool with log (ZIL)

It is possible to use a dedicated drive, or partition, for the ZFS Intent Log (ZIL). The ZIL is mainly used to provide safe synchronous transactions, so it often sits in performance-critical paths such as databases or other programs that issue fsync operations frequently.

By default, the ZIL is located on the pool itself. Diverting the ZIL IO load to a separate device can help to reduce transaction latencies while relieving the main pool at the same time, increasing overall performance.

For disks to be used as log devices, directly or through a partition, it’s recommended to:

  • Use fast SSDs with power-loss protection, as those have much smaller commit latencies.

  • Use at least a few GB for the partition (or whole device), but using more than half of your installed memory won’t provide you with any real advantage.

Create ZFS pool with separate log device
# zpool create -f -o ashift=12 <pool> <device> log <log-device>

In the above example, a single <device> and a single <log-device> are used, but you can also combine this with other RAID variants, as described in the Create a new pool with RAID section.

You can also mirror the log device to multiple devices. This is mainly useful to ensure that performance doesn’t immediately degrade if a single log device fails.

If all log devices fail, the ZFS main pool itself will be used again until the log device(s) get replaced.

Add cache and log to an existing pool

If you have a pool without cache and log you can still add both, or just one of them, at any time.

For example, let’s assume you have a good enterprise SSD with power-loss protection that you want to use for improving the overall performance of your pool.

As the maximum size of a log device should be about half the size of the installed physical memory, the ZIL will most likely only take up a relatively small part of the SSD. The remaining space can be used as cache.

First you have to create two GPT partitions on the SSD with parted or gdisk.

Then you’re ready to add them to a pool:

Add both, a separate log device and a second-level cache, to an existing pool
# zpool add -f <pool> log <device-part1> cache <device-part2>

Just replace <pool>, <device-part1> and <device-part2> with the pool name and the two /dev/disk/by-id/ paths to the partitions.
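
The partitioning step can be sketched with sgdisk, the scriptable companion of gdisk. The helper names, the example disk path and the memory figure below are assumptions for illustration; the commands are only printed, not executed, so the output can be reviewed before anything is written to a disk.

```shell
# Hypothetical helper: upper bound for the log partition in MiB,
# i.e. half of the installed memory (given in KiB, as in /proc/meminfo).
log_part_size_mib() {
    echo $(( $1 / 2 / 1024 ))
}

# Hypothetical helper: print (not run) the sgdisk calls that would create
# a log partition of the given size and a cache partition using the rest.
print_partition_cmds() {
    disk=$1
    log_mib=$2
    echo "sgdisk -n 1:0:+${log_mib}M $disk"
    echo "sgdisk -n 2:0:0 $disk"
}

# Example values only: a host with 32 GiB of memory (33554432 KiB) and a
# placeholder disk path. The real value is the MemTotal line of /proc/meminfo.
print_partition_cmds /dev/disk/by-id/EXAMPLE-SSD "$(log_part_size_mib 33554432)"
```

With /dev/disk/by-id/ paths, the resulting partitions usually appear with a -part1 and -part2 suffix, which are the paths to pass to zpool add.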

You can also add ZIL and cache separately.

Add a log device to an existing ZFS pool
# zpool add <pool> log <log-device>
Changing a failed device
# zpool replace -f <pool> <old-device> <new-device>
Changing a failed bootable device

Depending on how Proxmox VE was installed, it is either using systemd-boot or GRUB through proxmox-boot-tool (systems installed with Proxmox VE 6.4 or later, EFI systems installed with Proxmox VE 5.4 or later) or plain GRUB as bootloader (see Host Bootloader). You can check by running:

# proxmox-boot-tool status

The first steps of copying the partition table, reissuing GUIDs and replacing the ZFS partition are the same. To make the system bootable from the new disk, different steps are needed which depend on the bootloader in use.

# sgdisk <healthy bootable device> -R <new device>
# sgdisk -G <new device>
# zpool replace -f <pool> <old zfs partition> <new zfs partition>
Note Use the zpool status -v command to monitor how far the resilvering process of the new disk has progressed.
With proxmox-boot-tool:
# proxmox-boot-tool format <new disk's ESP>
# proxmox-boot-tool init <new disk's ESP> [grub]
Note ESP stands for EFI System Partition, which is set up as partition #2 on bootable disks set up by the Proxmox VE installer since version 5.4. For details, see Setting up a new partition for use as synced ESP.
Note Make sure to pass grub as mode to proxmox-boot-tool init if proxmox-boot-tool status indicates your current disks are using GRUB, especially if Secure Boot is enabled!
With plain GRUB:
# grub-install <new disk>
Note Plain GRUB is only used on systems installed with Proxmox VE 6.3 or earlier, which have not been manually migrated to using proxmox-boot-tool yet.
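
The proxmox-boot-tool variant of the procedure above can be collected into one ordered list. The sketch below only prints the commands; the helper name and the device names in the example call are placeholders (in particular, the ZFS partition number is an assumption and must be checked against the actual partition table).

```shell
# Hypothetical helper: print (not run) the replacement steps in order.
# The optional 7th argument is the mode passed to "proxmox-boot-tool init"
# (for example "grub"), as shown above.
print_boot_disk_replace_steps() {
    healthy=$1 new=$2 pool=$3 old_part=$4 new_part=$5 new_esp=$6
    mode=${7:-}
    echo "sgdisk $healthy -R $new"
    echo "sgdisk -G $new"
    echo "zpool replace -f $pool $old_part $new_part"
    echo "proxmox-boot-tool format $new_esp"
    echo "proxmox-boot-tool init $new_esp${mode:+ $mode}"
}

# Placeholder devices: /dev/sda is healthy, /dev/sdb is the new disk.
print_boot_disk_replace_steps /dev/sda /dev/sdb rpool /dev/sdc3 /dev/sdb3 /dev/sdb2 grub
```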

3.9.7. Configure E-Mail Notification

ZFS comes with an event daemon ZED, which monitors events generated by the ZFS kernel module. The daemon can also send emails on ZFS events like pool errors. Newer ZFS packages ship the daemon in a separate zfs-zed package, which should already be installed by default in Proxmox VE.

You can configure the daemon via the file /etc/zfs/zed.d/zed.rc with your favorite editor. The required setting for email notification is ZED_EMAIL_ADDR, which is set to root by default.

ZED_EMAIL_ADDR="root"

Please note Proxmox VE forwards mails to root to the email address configured for the root user.

3.9.8. Limit ZFS Memory Usage

ZFS uses 50 % of the host memory for the Adaptive Replacement Cache (ARC) by default. For new installations starting with Proxmox VE 8.1, the ARC usage limit will be set to 10 % of the installed physical memory, clamped to a maximum of 16 GiB. This value is written to /etc/modprobe.d/zfs.conf.

Allocating enough memory for the ARC is crucial for IO performance, so reduce it with caution. As a general rule of thumb, allocate at least 2 GiB Base + 1 GiB/TiB-Storage. For example, if you have a pool with 8 TiB of available storage space then you should use 10 GiB of memory for the ARC.

ZFS also enforces a minimum value of 64 MiB.
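
Both sizing rules above are simple arithmetic and can be sketched in shell. The helper names are hypothetical, and the second function merely restates the description of the installer behaviour given above; it is not taken from the installer itself.

```shell
GIB=$(( 1024 * 1024 * 1024 ))

# Hypothetical helper: rule-of-thumb ARC size in bytes for a pool with the
# given amount of storage in TiB (2 GiB base + 1 GiB per TiB).
arc_rule_of_thumb_bytes() {
    echo $(( (2 + $1) * GIB ))
}

# Hypothetical helper: limit as described for new installations since 8.1,
# 10 % of memory (given in bytes), clamped to at most 16 GiB.
arc_installer_default_bytes() {
    limit=$(( $1 / 10 ))
    max=$(( 16 * GIB ))
    if [ "$limit" -gt "$max" ]; then limit=$max; fi
    echo "$limit"
}

arc_rule_of_thumb_bytes 8                      # 10737418240 (10 GiB)
arc_installer_default_bytes $(( 64 * GIB ))    # 6871947673 (about 6.4 GiB)
arc_installer_default_bytes $(( 256 * GIB ))   # 17179869184 (16 GiB)
```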

You can change the ARC usage limit for the current boot (a reboot resets this change again) by writing to the zfs_arc_max module parameter directly:

 echo "$((10 * 1024*1024*1024))" >/sys/module/zfs/parameters/zfs_arc_max

To permanently change the ARC limits, add (or change if already present) the following line to /etc/modprobe.d/zfs.conf:

options zfs zfs_arc_max=8589934592

This example setting limits the usage to 8 GiB (8 * 2^30).

Important In case your desired zfs_arc_max value is lower than or equal to zfs_arc_min (which defaults to 1/32 of the system memory), zfs_arc_max will be ignored unless you also set zfs_arc_min to at most zfs_arc_max - 1.
echo "$((8 * 1024*1024*1024 - 1))" >/sys/module/zfs/parameters/zfs_arc_min
echo "$((8 * 1024*1024*1024))" >/sys/module/zfs/parameters/zfs_arc_max

This example setting (temporarily) limits the usage to 8 GiB (8 * 2^30) on systems with more than 256 GiB of total memory, where simply setting zfs_arc_max alone would not work.
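
Whether zfs_arc_min needs to be lowered as well follows from the rule stated above. The following sketch uses a hypothetical helper that compares the desired zfs_arc_max with the default zfs_arc_min of 1/32 of the system memory; it does not read or write any module parameter.

```shell
# Hypothetical helper: returns 0 if zfs_arc_min must be lowered as well,
# i.e. the desired zfs_arc_max is not above the default zfs_arc_min
# (1/32 of the system memory). Both arguments are in bytes.
arc_min_override_needed() {
    desired_max=$1
    default_min=$(( $2 / 32 ))
    [ "$desired_max" -le "$default_min" ]
}

GIB=$(( 1024 * 1024 * 1024 ))
for mem_gib in 64 256 512; do
    if arc_min_override_needed $(( 8 * GIB )) $(( mem_gib * GIB )); then
        echo "${mem_gib} GiB RAM: also set zfs_arc_min"
    else
        echo "${mem_gib} GiB RAM: setting zfs_arc_max is enough"
    fi
done
```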

Important

If your root file system is ZFS, you must update your initramfs every time this value changes:

# update-initramfs -u -k all

You must reboot to activate these changes.

3.9.9. SWAP on ZFS

Swap space created on a zvol may cause problems, such as blocking the server or generating a high IO load, often seen when starting a backup to an external storage.

We strongly recommend using enough memory, so that you normally do not run into low memory situations. Should you need or want to add swap, it is preferred to create a partition on a physical disk and use it as a swap device. You can leave some space free for this purpose in the advanced options of the installer. Additionally, you can lower the “swappiness” value. A good value for servers is 10:

# sysctl -w vm.swappiness=10

To make the swappiness persistent, open /etc/sysctl.conf with an editor of your choice and add the following line:

vm.swappiness = 10
Table 1. Linux kernel swappiness parameter values

Value                 Strategy
vm.swappiness = 0     The kernel will swap only to avoid an out of memory condition.
vm.swappiness = 1     Minimum amount of swapping without disabling it entirely.
vm.swappiness = 10    This value is sometimes recommended to improve performance when sufficient memory exists in a system.
vm.swappiness = 60    The default value.
vm.swappiness = 100   The kernel will swap aggressively.

3.9.10. Encrypted ZFS Datasets

Warning Native ZFS encryption in Proxmox VE is experimental. Known limitations and issues include replication with encrypted datasets (https://bugzilla.proxmox.com/show_bug.cgi?id=2350), as well as checksum errors when using snapshots or ZVOLs (https://github.com/openzfs/zfs/issues/11688).

ZFS on Linux version 0.8.0 introduced support for native encryption of datasets. After an upgrade from previous ZFS on Linux versions, the encryption feature can be enabled per pool:

# zpool get feature@encryption tank
NAME  PROPERTY            VALUE            SOURCE
tank  feature@encryption  disabled         local

# zpool set feature@encryption=enabled tank

# zpool get feature@encryption tank
NAME  PROPERTY            VALUE            SOURCE
tank  feature@encryption  enabled          local
Warning There is currently no support for booting from pools with encrypted datasets using GRUB, and only limited support for automatically unlocking encrypted datasets on boot. Older versions of ZFS without encryption support will not be able to decrypt stored data.
Note It is recommended to either unlock storage datasets manually after booting, or to write a custom unit to pass the key material needed for unlocking on boot to zfs load-key.
Warning Establish and test a backup procedure before enabling encryption of production data. If the associated key material/passphrase/keyfile has been lost, accessing the encrypted data is no longer possible.

Encryption needs to be set up when creating datasets/zvols, and is inherited by default by child datasets. For example, to create an encrypted dataset tank/encrypted_data and configure it as storage in Proxmox VE, run the following commands:

# zfs create -o encryption=on -o keyformat=passphrase tank/encrypted_data
Enter passphrase:
Re-enter passphrase:

# pvesm add zfspool encrypted_zfs -pool tank/encrypted_data

All guest volumes/disks created on this storage will be encrypted with the shared key material of the parent dataset.

To actually use the storage, the associated key material needs to be loaded and the dataset needs to be mounted. This can be done in one step with:

# zfs mount -l tank/encrypted_data
Enter passphrase for 'tank/encrypted_data':

It is also possible to use a (random) keyfile instead of prompting for a passphrase by setting the keylocation and keyformat properties, either at creation time or with zfs change-key on existing datasets:

# dd if=/dev/urandom of=/path/to/keyfile bs=32 count=1

# zfs change-key -o keyformat=raw -o keylocation=file:///path/to/keyfile tank/encrypted_data
Warning When using a keyfile, special care needs to be taken to secure the keyfile against unauthorized access or accidental loss. Without the keyfile, it is not possible to access the plaintext data!
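
One way to limit access to the keyfile is to create it with a restrictive umask, so that it is never readable by other users. This is a minimal sketch: the helper name and the temporary location are illustrative, and it covers file permissions only, not backups of the key.

```shell
# Hypothetical helper: create a 32-byte raw keyfile readable only by its owner.
make_raw_keyfile() {
    (
        umask 077                      # a newly created file gets mode 0600
        dd if=/dev/urandom of="$1" bs=32 count=1 2>/dev/null
    )
}

keyfile=$(mktemp -d)/keyfile           # example location only
make_raw_keyfile "$keyfile"
wc -c < "$keyfile"                     # 32 bytes, the length keyformat=raw expects
```

The umask only applies to newly created files. If the target path already exists, its permissions stay as they were and should be checked separately.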

A guest volume created underneath an encrypted dataset will have its encryptionroot property set accordingly. The key material only needs to be loaded once per encryptionroot to be available to all encrypted datasets underneath it.

See the encryptionroot, encryption, keylocation, keyformat and keystatus properties, the zfs load-key, zfs unload-key and zfs change-key commands and the Encryption section from man zfs for more details and advanced usage.

3.9.11. Compression in ZFS

When compression is enabled on a dataset, ZFS tries to compress all new blocks before writing them and decompresses them on reading. Already existing data will not be compressed retroactively.

You can enable compression with:

# zfs set compression=<algorithm> <dataset>

We recommend using the lz4 algorithm, because it adds very little CPU overhead. Other algorithms like lzjb and gzip-N, where N is an integer from 1 (fastest) to 9 (best compression ratio), are also available. Depending on the algorithm and how compressible the data is, having compression enabled can even increase I/O performance.

You can disable compression at any time with:

# zfs set compression=off <dataset>

Again, only new blocks will be affected by this change.

3.9.12. ZFS Special Device

Since version 0.8.0 ZFS supports special devices. A special device in a pool is used to store metadata, deduplication tables, and optionally small file blocks.

A special device can improve the speed of a pool consisting of slow spinning hard disks with a lot of metadata changes. For example workloads that involve creating, updating or deleting a large number of files will benefit from the presence of a special device. ZFS datasets can also be configured to store whole small files on the special device which can further improve the performance. Use fast SSDs for the special device.

Important The redundancy of the special device should match the one of the pool, since the special device is a point of failure for the whole pool.
Warning Adding a special device to a pool cannot be undone!
Create a pool with special device and RAID-1:
# zpool create -f -o ashift=12 <pool> mirror <device1> <device2> special mirror <device3> <device4>
Add a special device to an existing pool with RAID-1:
# zpool add <pool> special mirror <device1> <device2>

ZFS datasets expose the special_small_blocks=<size> property. size can be 0 to disable storing small file blocks on the special device, or a power of two in the range from 512B to 1M. After setting the property, new file blocks smaller than size will be allocated on the special device.

Important If the value for special_small_blocks is greater than or equal to the recordsize (default 128K) of the dataset, all data will be written to the special device, so be careful!

Setting the special_small_blocks property on a pool will change the default value of that property for all child ZFS datasets (for example all containers in the pool will opt in for small file blocks).

Opt in for all file blocks smaller than 4K pool-wide:
# zfs set special_small_blocks=4K <pool>
Opt in for small file blocks for a single dataset:
# zfs set special_small_blocks=4K <pool>/<filesystem>
Opt out from small file blocks for a single dataset:
# zfs set special_small_blocks=0 <pool>/<filesystem>
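
The constraints on special_small_blocks described above can be checked before setting the property. The following sketch uses a hypothetical helper that takes sizes in bytes and reflects only the rules stated in this section; the ZFS version in use may accept a different range.

```shell
# Hypothetical helper: classify a special_small_blocks value (in bytes)
# against the rules above. recordsize defaults to 128K (131072 bytes).
check_special_small_blocks() {
    size=$1
    recordsize=${2:-131072}
    if [ "$size" -eq 0 ]; then
        echo "disabled"
    elif [ "$size" -lt 512 ] || [ "$size" -gt 1048576 ] ||
         [ $(( size & (size - 1) )) -ne 0 ]; then
        echo "invalid"                  # out of range or not a power of two
    elif [ "$size" -ge "$recordsize" ]; then
        echo "all-data-on-special"      # every block would land on the special device
    else
        echo "ok"
    fi
}

check_special_small_blocks 0         # disabled
check_special_small_blocks 4096      # ok
check_special_small_blocks 3000      # invalid
check_special_small_blocks 131072    # all-data-on-special
```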

3.9.13. ZFS Pool Features

Changes to the on-disk format in ZFS are only made between major version changes and are specified through features. All features, as well as the general mechanism are well documented in the zpool-features(5) manpage.

Since enabling new features can render a pool not importable by an older version of ZFS, this needs to be done actively by the administrator, by running zpool upgrade on the pool (see the zpool-upgrade(8) manpage).

Unless you need to use one of the new features, there is no upside to enabling them.

In fact, there are some downsides to enabling new features:

  • A system with root on ZFS, that still boots using GRUB will become unbootable if a new feature is active on the rpool, due to the incompatible implementation of ZFS in GRUB.

  • The system will not be able to import any upgraded pool when booted with an older kernel, which still ships with the old ZFS modules.

  • Booting an older Proxmox VE ISO to repair a non-booting system will likewise not work.

Important Do not upgrade your rpool if your system is still booted with GRUB, as this will render your system unbootable. This includes systems installed before Proxmox VE 5.4, and systems booting with legacy BIOS boot (see how to determine the bootloader).
Enable new features for a ZFS pool:
# zpool upgrade <pool>

3.10. BTRFS

Warning BTRFS integration is currently a technology preview in Proxmox VE.

BTRFS is a modern copy on write file system natively supported by the Linux kernel, implementing features such as snapshots, built-in RAID and self healing via checksums for data and metadata. Starting with Proxmox VE 7.0, BTRFS is introduced as optional selection for the root file system.

General BTRFS advantages
  • Main system setup almost identical to the traditional ext4 based setup

  • Snapshots

  • Data compression on file system level

  • Copy-on-write clone

  • RAID0, RAID1 and RAID10

  • Protection against data corruption

  • Self healing

  • natively supported by the Linux kernel

Caveats
  • RAID levels 5/6 are experimental and dangerous

3.10.1. Installation as Root File System

When you install using the Proxmox VE installer, you can choose BTRFS for the root file system. You need to select the RAID type at installation time:

RAID0

Also called “striping”. The capacity of such a volume is the sum of the capacities of all disks. But RAID0 does not add any redundancy, so the failure of a single drive makes the volume unusable.

RAID1

Also called “mirroring”. Data is written identically to all disks. This mode requires at least 2 disks with the same size. The resulting capacity is that of a single disk.

RAID10

A combination of RAID0 and RAID1. Requires at least 4 disks.

The installer automatically partitions the disks and creates an additional subvolume at /var/lib/pve/local-btrfs. In order to use that with the Proxmox VE tools, the installer creates the following configuration entry in /etc/pve/storage.cfg:

dir: local
        path /var/lib/vz
        content iso,vztmpl,backup
        disable

btrfs: local-btrfs
        path /var/lib/pve/local-btrfs
        content iso,vztmpl,backup,images,rootdir

This explicitly disables the default local storage in favor of a BTRFS specific storage entry on the additional subvolume.

The btrfs command is used to configure and manage the BTRFS file system. After the installation, the following command lists all additional subvolumes:

# btrfs subvolume list /
ID 256 gen 6 top level 5 path var/lib/pve/local-btrfs

3.10.2. BTRFS Administration

This section gives you some usage examples for common tasks.

Creating a BTRFS file system

To create BTRFS file systems, mkfs.btrfs is used. The -d and -m parameters are used to set the profile for data and metadata respectively. With the optional -L parameter, a label can be set.

Generally, the following modes are supported: single, raid0, raid1, raid10.

Create a BTRFS file system on a single disk /dev/sdb with the label My-Storage:

 # mkfs.btrfs -m single -d single -L My-Storage /dev/sdb

Or create a RAID1 on the two partitions /dev/sdb1 and /dev/sdc1:

 # mkfs.btrfs -m raid1 -d raid1 -L My-Storage /dev/sdb1 /dev/sdc1
Mounting a BTRFS file system

The new file system can then be mounted manually, for example:

 # mkdir /my-storage
 # mount /dev/sdb /my-storage

A BTRFS file system can also be added to /etc/fstab like any other mount point, so that it is mounted automatically on boot. It’s recommended to avoid block-device paths and to use the UUID value printed by the mkfs.btrfs command instead, especially if there is more than one disk in a BTRFS setup.

For example:

File /etc/fstab
# ... other mount points left out for brevity

# using the UUID from the mkfs.btrfs output is highly recommended
UUID=e2c0c3ff-2114-4f54-b767-3a203e49f6f3 /my-storage btrfs defaults 0 0
Tip If you do not have the UUID available anymore you can use the blkid tool to list all properties of block-devices.

Afterwards you can trigger the first mount by executing:

mount /my-storage

After the next reboot this will be automatically done by the system at boot.
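
The fstab entry shown above can also be generated rather than typed by hand. This is a small sketch with a hypothetical helper that only formats the line; it does not edit /etc/fstab.

```shell
# Hypothetical helper: format an /etc/fstab entry for a BTRFS file system.
# Usage: btrfs_fstab_line <uuid> <mount point> [mount options]
btrfs_fstab_line() {
    printf 'UUID=%s %s btrfs %s 0 0\n' "$1" "$2" "${3:-defaults}"
}

btrfs_fstab_line e2c0c3ff-2114-4f54-b767-3a203e49f6f3 /my-storage

# On a real system the UUID could be looked up with, for example:
#   blkid -s UUID -o value /dev/sdb
```

Printing the line first makes it easy to review before appending it to /etc/fstab.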

Adding a BTRFS file system to Proxmox VE

You can add an existing BTRFS file system to Proxmox VE via the web interface, or using the CLI, for example:

pvesm add btrfs my-storage --path /my-storage
Creating a subvolume

Creating a subvolume links it to a path in the BTRFS file system, where it will appear as a regular directory.

# btrfs subvolume create /some/path

Afterwards /some/path will act like a regular directory.

Deleting a subvolume

Contrary to directories removed via rmdir, subvolumes do not need to be empty in order to be deleted via the btrfs command.

# btrfs subvolume delete /some/path
Creating a snapshot of a subvolume

BTRFS does not actually distinguish between snapshots and normal subvolumes, so taking a snapshot can also be seen as creating an arbitrary copy of a subvolume. By convention, Proxmox VE will use the read-only flag when creating snapshots of guest disks or subvolumes, but this flag can also be changed later on.

# btrfs subvolume snapshot -r /some/path /a/new/path

This will create a read-only "clone" of the subvolume on /some/path at /a/new/path. Any future modifications to /some/path cause the modified data to be copied before modification.

If the read-only (-r) option is left out, both subvolumes will be writable.

Enabling compression

By default, BTRFS does not compress data. To enable compression, the compress mount option can be added. Note that data already written will not be compressed after the fact.

By default, the rootfs will be listed in /etc/fstab as follows:

UUID=<uuid of your root file system> / btrfs defaults 0 1

You can simply append compress=zstd, compress=lzo, or compress=zlib to the defaults above like so:

UUID=<uuid of your root file system> / btrfs defaults,compress=zstd 0 1

This change will take effect after rebooting.

Checking Space Usage

The classic df tool may output confusing values for some BTRFS setups. For a better estimate use the btrfs filesystem usage /PATH command, for example:

# btrfs fi usage /my-storage

3.11. Proxmox Node Management

The Proxmox VE node management tool (pvenode) allows you to control node specific settings and resources.

Currently pvenode allows you to set a node’s description, run various bulk operations on the node’s guests, view the node’s task history, and manage the node’s SSL certificates, which are used for the API and the web GUI through pveproxy.

3.11.1. Wake-on-LAN

Wake-on-LAN (WoL) allows you to switch on a sleeping computer in the network, by sending a magic packet. At least one NIC must support this feature, and the respective option needs to be enabled in the computer’s firmware (BIOS/UEFI) configuration. The option name can vary from Enable Wake-on-Lan to Power On By PCIE Device; check your motherboard’s vendor manual, if you’re unsure. ethtool can be used to check the WoL configuration of <interface> by running:

ethtool <interface> | grep Wake-on

pvenode allows you to wake sleeping members of a cluster via WoL, using the command:

pvenode wakeonlan <node>

This broadcasts the WoL magic packet on UDP port 9, containing the MAC address of <node> obtained from the wakeonlan property. The node-specific wakeonlan property can be set using the following command:

pvenode config set -wakeonlan XX:XX:XX:XX:XX:XX

The interface via which to send the WoL packet is determined from the default route. It can be overwritten by setting the bind-interface via the following command:

pvenode config set -wakeonlan XX:XX:XX:XX:XX:XX,bind-interface=<iface-name>

The broadcast address (default 255.255.255.255) used when sending the WoL packet can further be changed by setting the broadcast-address explicitly using the following command:

pvenode config set -wakeonlan XX:XX:XX:XX:XX:XX,broadcast-address=<broadcast-address>
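
For background, the payload of a standard WoL magic packet consists of six 0xFF bytes followed by sixteen repetitions of the target MAC address. This layout is general WoL knowledge rather than something taken from this guide, and the sketch below says nothing about how pvenode assembles its packet internally. The helper is hypothetical and only prints the payload as hex; it does not send anything.

```shell
# Hypothetical helper: hex representation of the magic packet payload
# for a MAC address given in XX:XX:XX:XX:XX:XX notation.
wol_magic_hex() {
    mac=$(printf '%s' "$1" | tr -d ':' | tr 'A-F' 'a-f')
    payload=ffffffffffff
    i=0
    while [ "$i" -lt 16 ]; do
        payload=$payload$mac
        i=$(( i + 1 ))
    done
    echo "$payload"
}

packet=$(wol_magic_hex AA:BB:CC:DD:EE:FF)
echo "${#packet} hex digits"    # 204 hex digits, i.e. 102 bytes
```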

3.11.2. Task History

When troubleshooting server issues, for example, failed backup jobs, it can often be helpful to have a log of the previously run tasks. With Proxmox VE, you can access the node’s task history through the pvenode task command.

You can get a filtered list of a node’s finished tasks with the list subcommand. For example, to get a list of tasks related to VM 100 that ended with an error, the command would be:

pvenode task list --errors --vmid 100

The log of a task can then be printed using its UPID:

pvenode task log UPID:pve1:00010D94:001CA6EA:6124E1B9:vzdump:100:root@pam:
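
A UPID carries several pieces of information about the task. The sketch below splits the example UPID into fields; the field layout (node, PID, process start, start time, type, ID, user, with the numeric fields in hexadecimal) is an assumption inferred from the example above, and parse_upid is a hypothetical helper rather than a Proxmox VE tool.

```shell
# Hypothetical helper: split a UPID into named fields.
# Assumed layout: UPID:<node>:<pid>:<pstart>:<starttime>:<type>:<id>:<user>:
# with pid, pstart and starttime encoded as hexadecimal numbers.
parse_upid() {
    old_ifs=$IFS
    IFS=:
    set -- $1                  # split the UPID on ":" into $1..$8
    IFS=$old_ifs
    echo "node=$2"
    echo "pid=$(( 0x$3 ))"
    echo "starttime=$(( 0x$5 ))"
    echo "type=$6"
    echo "id=$7"
    echo "user=$8"
}

parse_upid 'UPID:pve1:00010D94:001CA6EA:6124E1B9:vzdump:100:root@pam:'
```

Under that assumption, the start time is a Unix timestamp and can be turned into a readable date with date -d @<starttime> on systems with GNU date.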

3.11.3. Bulk Guest Power Management

In case you have many VMs/containers, starting and stopping guests can be carried out in bulk operations with the startall and stopall subcommands of pvenode. By default, pvenode startall will only start VMs/containers which have been set to automatically start on boot (see Automatic Start and Shutdown of Virtual Machines), however, you can override this behavior with the --force flag. Both commands also have a --vms option, which limits the stopped/started guests to the specified VMIDs.

For example, to start VMs 100, 101, and 102, regardless of whether they have onboot set, you can use:

pvenode startall --vms 100,101,102 --force

To stop these guests (and any other guests that may be running), use the command:

pvenode stopall
Note The stopall command first attempts to perform a clean shutdown and then waits until either all guests have successfully shut down or an overridable timeout (3 minutes by default) has expired. Once that happens and the force-stop parameter is not explicitly set to 0 (false), all virtual guests that are still running are hard stopped.

3.11.4. First Guest Boot Delay

In case your VMs/containers rely on slow-to-start external resources, for example an NFS server, you can also set a per-node delay between the time Proxmox VE boots and the time the first VM/container that is configured to autostart boots (see Automatic Start and Shutdown of Virtual Machines).

You can achieve this by setting the following (where 10 represents the delay in seconds):

pvenode config set --startall-onboot-delay 10

3.11.5. Bulk Guest Migration

In case an upgrade situation requires you to migrate all of your guests from one node to another, pvenode also offers the migrateall subcommand for bulk migration. By default, this command will migrate every guest on the system to the target node. It can however be set to only migrate a set of guests.

For example, to migrate VMs 100, 101, and 102, to the node pve2, with live-migration for local disks enabled, you can run:

pvenode migrateall pve2 --vms 100,101,102 --with-local-disks

3.12. Certificate Management

3.12.1. Certificates for Intra-Cluster Communication

Each Proxmox VE cluster creates by default its own (self-signed) Certificate Authority (CA) and generates a certificate for each node which gets signed by the aforementioned CA. These certificates are used for encrypted communication with the cluster’s pveproxy service and the Shell/Console feature if SPICE is used.

The CA certificate and key are stored in the Proxmox Cluster File System (pmxcfs).

3.12.2. Certificates for API and Web GUI

The REST API and web GUI are provided by the pveproxy service, which runs on each node.

You have the following options for the certificate used by pveproxy:

  1. By default the node-specific certificate in /etc/pve/nodes/NODENAME/pve-ssl.pem is used. This certificate is signed by the cluster CA and therefore not automatically trusted by browsers and operating systems.

  2. use an externally provided certificate (e.g. signed by a commercial CA).

  3. use ACME (Let’s Encrypt) to get a trusted certificate with automatic renewal, this is also integrated in the Proxmox VE API and web interface.

For options 2 and 3 the file /etc/pve/local/pveproxy-ssl.pem (and /etc/pve/local/pveproxy-ssl.key, which needs to be without password) is used.

Note Keep in mind that /etc/pve/local is a node specific symlink to /etc/pve/nodes/NODENAME.

Certificates are managed with the Proxmox VE Node management command (see the pvenode(1) manpage).

Warning Do not replace or manually modify the automatically generated node certificate files in /etc/pve/local/pve-ssl.pem and /etc/pve/local/pve-ssl.key or the cluster CA files in /etc/pve/pve-root-ca.pem and /etc/pve/priv/pve-root-ca.key.

3.12.3. Upload Custom Certificate

If you already have a certificate which you want to use for a Proxmox VE node you can upload that certificate simply over the web interface.

Note that the certificate’s key file, if provided, mustn’t be password protected.

3.12.4. Trusted certificates via Let’s Encrypt (ACME)

Proxmox VE includes an implementation of the Automatic Certificate Management Environment (ACME) protocol, allowing Proxmox VE admins to use an ACME provider like Let’s Encrypt for easy setup of TLS certificates which are accepted and trusted on modern operating systems and web browsers out of the box.

Currently, the two ACME endpoints implemented are the Let’s Encrypt (LE) production and its staging environment. Our ACME client supports validation of http-01 challenges using a built-in web server and validation of dns-01 challenges using a DNS plugin supporting all the DNS API endpoints acme.sh does.

ACME Account

You need to register an ACME account per cluster with the endpoint you want to use. The email address used for that account will serve as contact point for renewal-due or similar notifications from the ACME endpoint.

You can register and deactivate ACME accounts over the web interface Datacenter -> ACME or using the pvenode command-line tool.

 pvenode acme account register account-name mail@example.com
Tip Because of rate-limits you should use LE staging for experiments or if you use ACME for the first time.
ACME Plugins

The ACME plugin’s task is to provide automatic verification that you, and thus the Proxmox VE cluster under your operation, are the real owner of a domain. This is the basic building block for automatic certificate management.

The ACME protocol specifies different types of challenges, for example the http-01 challenge, where a web server provides a file with a certain content to prove that it controls a domain. Sometimes this isn’t possible, either because of technical limitations or because the address a record points to is not reachable from the public internet. The dns-01 challenge can be used in these cases. This challenge is fulfilled by creating a certain DNS record in the domain’s zone.

Proxmox VE supports both of those challenge types out of the box. You can configure plugins either over the web interface under Datacenter -> ACME, or using the pvenode acme plugin add command.

ACME Plugin configurations are stored in /etc/pve/priv/acme/plugins.cfg. A plugin is available for all nodes in the cluster.

Node Domains

Each domain is node specific. You can add new or manage existing domain entries under Node -> Certificates, or using the pvenode config command.

After configuring the desired domain(s) for a node and ensuring that the desired ACME account is selected, you can order your new certificate over the web interface. On success the interface will reload after 10 seconds.

Renewal will happen automatically.

3.12.5. ACME HTTP Challenge Plugin

There is always an implicitly configured standalone plugin for validating http-01 challenges via the built-in webserver spawned on port 80.

Note The name standalone means that it can provide the validation on its own, without any third party service. So, this plugin also works for cluster nodes.

There are a few prerequisites to use it for certificate management with Let’s Encrypt’s ACME.

  • You have to accept the ToS of Let’s Encrypt to register an account.

  • Port 80 of the node needs to be reachable from the internet.

  • There must be no other listener on port 80.

  • The requested (sub)domain needs to resolve to a public IP of the Node.

3.12.6. ACME DNS API Challenge Plugin

On systems where external access for validation via the http-01 method is not possible or desired, it is possible to use the dns-01 validation method. This validation method requires a DNS server that allows provisioning of TXT records via an API.

Configuring ACME DNS APIs for validation

Proxmox VE re-uses the DNS plugins developed for the acme.sh project [acme.sh https://github.com/acmesh-official/acme.sh]. Please refer to its documentation for details on the configuration of specific APIs.

The easiest way to configure a new plugin with the DNS API is using the web interface (Datacenter -> ACME).

screenshot/gui-datacenter-acme-add-dns-plugin.png

Choose DNS as challenge type. Then select your API provider and enter the credential data needed to access your account over their API.

Tip See the acme.sh How to use DNS API wiki for more detailed information about getting API credentials for your provider.

As there are many DNS providers and API endpoints, Proxmox VE automatically generates the credentials form for some providers. For the others, you will see a larger text area; simply copy all the credential KEY=VALUE pairs into it.

DNS Validation through CNAME Alias

A special alias mode can be used to handle the validation on a different domain/DNS server, in case your primary/real DNS does not support provisioning via an API. Manually set up a permanent CNAME record for _acme-challenge.domain1.example pointing to _acme-challenge.domain2.example and set the alias property in the Proxmox VE node configuration file to domain2.example to allow the DNS server of domain2.example to validate all challenges for domain1.example.
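
As a rough illustration of alias mode, the sketch below prints the CNAME record to create by hand and a candidate pvenode command. The helper name acme_alias_plan is hypothetical, and the exact property string (alias= next to plugin= in acmedomain0) is an assumption based on the description above; check man pvenode before using it.

```shell
#!/bin/sh
# Hypothetical helper: prints the CNAME record to create manually and a
# candidate pvenode command for alias mode. It only prints text and does
# not change any configuration.
# Assumption: acmedomain0 accepts "alias=" next to "plugin=".
acme_alias_plan() {
    domain="$1"         # domain the certificate is for, e.g. domain1.example
    alias_domain="$2"   # domain whose DNS server supports API provisioning
    plugin="$3"         # name of an already configured DNS plugin
    printf '_acme-challenge.%s. CNAME _acme-challenge.%s.\n' \
        "$domain" "$alias_domain"
    printf 'pvenode config set --acmedomain0 %s,plugin=%s,alias=%s\n' \
        "$domain" "$plugin" "$alias_domain"
}

# Usage:
#   acme_alias_plan domain1.example domain2.example example_plugin
```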

Combination of Plugins

Combining http-01 and dns-01 validation is possible in case your node is reachable via multiple domains with different requirements / DNS provisioning capabilities. Mixing DNS APIs from multiple providers or instances is also possible by specifying different plugin instances per domain.

Tip Accessing the same service over multiple domains increases complexity and should be avoided if possible.

3.12.7. Automatic renewal of ACME certificates

If a node has been successfully configured with an ACME-provided certificate (either via pvenode or via the GUI), the certificate will be automatically renewed by the pve-daily-update.service. Currently, renewal will be attempted if the certificate has expired already, or will expire in the next 30 days.
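
The renewal rule can be expressed as a small check. This is only an illustration of the rule stated above, not the code that Proxmox VE runs; the function name renewal_due and the use of Unix timestamps are assumptions made for the sketch.

```shell
#!/bin/sh
# Illustration of the renewal rule only; this is not the code Proxmox VE runs.
# Arguments are Unix timestamps in seconds. Returns 0 (true) when the
# certificate has expired already or expires within the next 30 days.
renewal_due() {
    not_after="$1"   # certificate expiry time
    now="$2"         # current time
    window=$((30 * 24 * 60 * 60))
    [ $((not_after - now)) -le "$window" ]
}

# Usage:
#   renewal_due "$expiry_epoch" "$(date +%s)" && echo "renewal would be attempted"
```

For an actual certificate file, openssl x509 -checkend 2592000 -noout -in <certificate> performs a comparable check against a 30 day window.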

3.12.8. ACME Examples with pvenode

Example: Sample pvenode invocation for using Let’s Encrypt certificates
root@proxmox:~# pvenode acme account register default mail@example.invalid
Directory endpoints:
0) Let's Encrypt V2 (https://acme-v02.api.letsencrypt.org/directory)
1) Let's Encrypt V2 Staging (https://acme-staging-v02.api.letsencrypt.org/directory)
2) Custom
Enter selection: 1

Terms of Service: https://letsencrypt.org/documents/LE-SA-v1.2-November-15-2017.pdf
Do you agree to the above terms? [y|N]y
...
Task OK
root@proxmox:~# pvenode config set --acme domains=example.invalid
root@proxmox:~# pvenode acme cert order
Loading ACME account details
Placing ACME order
...
Status is 'valid'!

All domains validated!
...
Downloading certificate
Setting pveproxy certificate and key
Restarting pveproxy
Task OK

Example: Setting up the OVH API for validating a domain

Note The account registration steps are the same no matter which plugins are used, and are not repeated here.

Note OVH_AK and OVH_AS need to be obtained from OVH according to the OVH API documentation.

First you need to get all information so you and Proxmox VE can access the API.

root@proxmox:~# cat /path/to/api-token
OVH_AK=XXXXXXXXXXXXXXXX
OVH_AS=YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
root@proxmox:~# source /path/to/api-token
root@proxmox:~# curl -XPOST -H"X-Ovh-Application: $OVH_AK" -H "Content-type: application/json" \
https://eu.api.ovh.com/1.0/auth/credential  -d '{
  "accessRules": [
    {"method": "GET","path": "/auth/time"},
    {"method": "GET","path": "/domain"},
    {"method": "GET","path": "/domain/zone/*"},
    {"method": "GET","path": "/domain/zone/*/record"},
    {"method": "POST","path": "/domain/zone/*/record"},
    {"method": "POST","path": "/domain/zone/*/refresh"},
    {"method": "PUT","path": "/domain/zone/*/record/"},
    {"method": "DELETE","path": "/domain/zone/*/record/*"}
]
}'
{"consumerKey":"ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ","state":"pendingValidation","validationUrl":"https://eu.api.ovh.com/auth/?credentialToken=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"}

(open validation URL and follow instructions to link Application Key with account/Consumer Key)

root@proxmox:~# echo "OVH_CK=ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" >> /path/to/api-token

Now you can set up the ACME plugin:

root@proxmox:~# pvenode acme plugin add dns example_plugin --api ovh --data /path/to/api-token
root@proxmox:~# pvenode acme plugin config example_plugin
┌────────┬──────────────────────────────────────────┐
│ key    │ value                                    │
╞════════╪══════════════════════════════════════════╡
│ api    │ ovh                                      │
├────────┼──────────────────────────────────────────┤
│ data   │ OVH_AK=XXXXXXXXXXXXXXXX                  │
│        │ OVH_AS=YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY  │
│        │ OVH_CK=ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ  │
├────────┼──────────────────────────────────────────┤
│ digest │ 867fcf556363ca1bea866863093fcab83edf47a1 │
├────────┼──────────────────────────────────────────┤
│ plugin │ example_plugin                           │
├────────┼──────────────────────────────────────────┤
│ type   │ dns                                      │
└────────┴──────────────────────────────────────────┘

Finally, you can configure the domain you want to get certificates for and place the certificate order for it:

root@proxmox:~# pvenode config set -acmedomain0 example.proxmox.com,plugin=example_plugin
root@proxmox:~# pvenode acme cert order
Loading ACME account details
Placing ACME order
Order URL: https://acme-staging-v02.api.letsencrypt.org/acme/order/11111111/22222222

Getting authorization details from 'https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/33333333'
The validation for example.proxmox.com is pending!
[Wed Apr 22 09:25:30 CEST 2020] Using OVH endpoint: ovh-eu
[Wed Apr 22 09:25:30 CEST 2020] Checking authentication
[Wed Apr 22 09:25:30 CEST 2020] Consumer key is ok.
[Wed Apr 22 09:25:31 CEST 2020] Adding record
[Wed Apr 22 09:25:32 CEST 2020] Added, sleep 10 seconds.
Add TXT record: _acme-challenge.example.proxmox.com
Triggering validation
Sleeping for 5 seconds
Status is 'valid'!
[Wed Apr 22 09:25:48 CEST 2020] Using OVH endpoint: ovh-eu
[Wed Apr 22 09:25:48 CEST 2020] Checking authentication
[Wed Apr 22 09:25:48 CEST 2020] Consumer key is ok.
Remove TXT record: _acme-challenge.example.proxmox.com

All domains validated!

Creating CSR
Checking order status
Order is ready, finalizing order
valid!

Downloading certificate
Setting pveproxy certificate and key
Restarting pveproxy
Task OK

Example: Switching from the staging to the regular ACME directory

Changing the ACME directory for an account is unsupported, but as Proxmox VE supports more than one account you can just create a new one with the production (trusted) ACME directory as endpoint. You can also deactivate the staging account and recreate it.

Example: Changing the default ACME account from the staging to the production directory using pvenode
root@proxmox:~# pvenode acme account deactivate default
Renaming account file from '/etc/pve/priv/acme/default' to '/etc/pve/priv/acme/_deactivated_default_4'
Task OK

root@proxmox:~# pvenode acme account register default example@proxmox.com
Directory endpoints:
0) Let's Encrypt V2 (https://acme-v02.api.letsencrypt.org/directory)
1) Let's Encrypt V2 Staging (https://acme-staging-v02.api.letsencrypt.org/directory)
2) Custom
Enter selection: 0

Terms of Service: https://letsencrypt.org/documents/LE-SA-v1.2-November-15-2017.pdf
Do you agree to the above terms? [y|N]y
...
Task OK

3.13. Host Bootloader

Proxmox VE currently uses one of two bootloaders depending on the disk setup selected in the installer.

For EFI systems installed with ZFS as the root filesystem, systemd-boot is used, unless Secure Boot is enabled. All other deployments use the standard GRUB bootloader (this usually also applies to systems which are installed on top of Debian).

3.13.1. Partitioning Scheme Used by the Installer

The Proxmox VE installer creates 3 partitions on all disks selected for installation.

The created partitions are:

  • a 1 MB BIOS Boot Partition (gdisk type EF02)

  • a 512 MB EFI System Partition (ESP, gdisk type EF00)

  • a third partition spanning the set hdsize parameter or the remaining space used for the chosen storage type

Systems using ZFS as root filesystem are booted with a kernel and initrd image stored on the 512 MB EFI System Partition. For legacy BIOS systems, and EFI systems with Secure Boot enabled, GRUB is used; for EFI systems without Secure Boot, systemd-boot is used. Both are installed and configured to point to the ESPs.

GRUB in BIOS mode (--target i386-pc) is installed onto the BIOS Boot Partition of all selected disks on all systems booted with GRUB [These are all installs with root on ext4 or xfs and installs with root on ZFS on non-EFI systems].

3.13.2. Synchronizing the content of the ESP with proxmox-boot-tool

proxmox-boot-tool is a utility used to keep the contents of the EFI System Partitions properly configured and synchronized. It copies certain kernel versions to all ESPs and configures the respective bootloader to boot from the vfat formatted ESPs. In the context of ZFS as root filesystem, this means that you can use all optional features on your root pool, instead of only the subset that is also present in the ZFS implementation in GRUB, and without having to create a separate small boot pool [Booting ZFS on root with GRUB https://github.com/zfsonlinux/zfs/wiki/Debian-Stretch-Root-on-ZFS].

In setups with redundancy, the installer partitions all disks with an ESP. This ensures the system boots even if the first boot device fails or if the BIOS can only boot from a particular disk.

The ESPs are not kept mounted during regular operation. This helps to prevent filesystem corruption to the vfat formatted ESPs in case of a system crash, and removes the need to manually adapt /etc/fstab in case the primary boot device fails.

proxmox-boot-tool handles the following tasks:

  • formatting and setting up a new partition

  • copying and configuring new kernel images and initrd images to all listed ESPs

  • synchronizing the configuration on kernel upgrades and other maintenance tasks

  • managing the list of kernel versions which are synchronized

  • configuring the boot-loader to boot a particular kernel version (pinning)

You can view the currently configured ESPs and their state by running:

# proxmox-boot-tool status

Setting up a new partition for use as synced ESP

To format and initialize a partition as synced ESP, e.g., after replacing a failed vdev in an rpool, or when converting an existing system that pre-dates the sync mechanism, proxmox-boot-tool from proxmox-kernel-helper can be used.

Warning The format command will format the <partition>. Make sure to pass in the right device/partition!

For example, to format an empty partition /dev/sda2 as ESP, run the following:

# proxmox-boot-tool format /dev/sda2

To set up an existing, unmounted ESP located on /dev/sda2 for inclusion in Proxmox VE’s kernel update synchronization mechanism, use the following:

# proxmox-boot-tool init /dev/sda2

or

# proxmox-boot-tool init /dev/sda2 grub

to force initialization with GRUB instead of systemd-boot, for example for Secure Boot support.

Afterwards /etc/kernel/proxmox-boot-uuids should contain a new line with the UUID of the newly added partition. The init command will also automatically trigger a refresh of all configured ESPs.

Updating the configuration on all ESPs

To copy and configure all bootable kernels and keep all ESPs listed in /etc/kernel/proxmox-boot-uuids in sync you just need to run:

# proxmox-boot-tool refresh

(This is the equivalent of running update-grub on systems with ext4 or xfs on root.)

This is necessary should you make changes to the kernel commandline, or want to sync all kernels and initrds.

Note Both update-initramfs and apt (when necessary) will automatically trigger a refresh.

Kernel Versions considered by proxmox-boot-tool

The following kernel versions are configured by default:

  • the currently running kernel

  • the version being newly installed on package updates

  • the two latest already installed kernels

  • the latest version of the second-to-last kernel series (e.g. 5.0, 5.3), if applicable

  • any manually selected kernels

Manually keeping a kernel bootable

Should you wish to add a certain kernel and initrd image to the list of bootable kernels, use proxmox-boot-tool kernel add.

For example, run the following to add the kernel with ABI version 5.0.15-1-pve to the list of kernels to keep installed and synced to all ESPs:

# proxmox-boot-tool kernel add 5.0.15-1-pve

proxmox-boot-tool kernel list will list all kernel versions currently selected for booting:

# proxmox-boot-tool kernel list
Manually selected kernels:
5.0.15-1-pve

Automatically selected kernels:
5.0.12-1-pve
4.15.18-18-pve

Run proxmox-boot-tool kernel remove to remove a kernel from the list of manually selected kernels, for example:

# proxmox-boot-tool kernel remove 5.0.15-1-pve

Note After manually adding or removing a kernel as shown above, you must run proxmox-boot-tool refresh to update all EFI System Partitions (ESPs).

3.13.3. Determine which Bootloader is Used

screenshot/boot-grub.png

The simplest and most reliable way to determine which bootloader is used is to watch the boot process of the Proxmox VE node.

You will either see the blue box of GRUB or the simple black-on-white systemd-boot screen.

screenshot/boot-systemdboot.png

Determining the bootloader from a running system might not be 100% accurate. The safest way is to run the following command:

# efibootmgr -v

If it returns a message that EFI variables are not supported, GRUB is used in BIOS/Legacy mode.

If the output contains a line that looks similar to the following, GRUB is used in UEFI mode.

Boot0005* proxmox       [...] File(\EFI\proxmox\grubx64.efi)

If the output contains a line similar to the following, systemd-boot is used.

Boot0006* Linux Boot Manager    [...] File(\EFI\systemd\systemd-bootx64.efi)
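
The three cases above can be sketched as a small filter over the efibootmgr output. The function name classify_bootloader and its output labels are hypothetical. As an additional assumption, an entry pointing to shimx64.efi is treated as GRUB, since the Secure Boot setup described later boots GRUB through the shim.

```shell
#!/bin/sh
# Hypothetical helper: classifies `efibootmgr -v` output read from stdin,
# following the cases described above. Prints one label per detected
# bootloader and returns 1 if nothing was recognized.
classify_bootloader() {
    output="$(cat)"
    case "$output" in
        *"not supported"*) echo "grub-bios"; return 0 ;;
    esac
    found=1
    # Assumption: a shimx64.efi entry means GRUB is started via the shim.
    if printf '%s\n' "$output" | grep -qiE 'grubx64\.efi|shimx64\.efi'; then
        echo "grub-uefi"
        found=0
    fi
    if printf '%s\n' "$output" | grep -qi 'systemd-bootx64\.efi'; then
        echo "systemd-boot"
        found=0
    fi
    return "$found"
}

# Usage on a node (stderr is included to catch the "not supported" message):
#   efibootmgr -v 2>&1 | classify_bootloader
```

Both labels can appear on the same system, for example after switching a systemd-boot installation to GRUB while the old boot entry is kept. In that case, the firmware boot order decides which one is actually used.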

By running:

# proxmox-boot-tool status

you can find out if proxmox-boot-tool is configured, which is a good indication of how the system is booted.

3.13.4. GRUB

GRUB has been the de-facto standard for booting Linux systems for many years and is quite well documented [GRUB Manual https://www.gnu.org/software/grub/manual/grub/grub.html].

Configuration

Changes to the GRUB configuration are done via the defaults file /etc/default/grub or config snippets in /etc/default/grub.d. To regenerate the configuration file after a change to the configuration, run the following command [Systems using proxmox-boot-tool will call proxmox-boot-tool refresh upon update-grub.]:

# update-grub

3.13.5. Systemd-boot

systemd-boot is a lightweight EFI bootloader. It reads the kernel and initrd images directly from the EFI System Partition (ESP) where it is installed. The main advantage of directly loading the kernel from the ESP is that it does not need to reimplement the drivers for accessing the storage. In Proxmox VE, proxmox-boot-tool is used to keep the configuration on the ESPs synchronized.

Configuration

systemd-boot is configured via the file loader/loader.conf in the root directory of an EFI System Partition (ESP). See the loader.conf(5) manpage for details.

Each bootloader entry is placed in a file of its own in the directory loader/entries/.

An example entry.conf looks like this (/ refers to the root of the ESP):

title    Proxmox
version  5.0.15-1-pve
options   root=ZFS=rpool/ROOT/pve-1 boot=zfs
linux    /EFI/proxmox/5.0.15-1-pve/vmlinuz-5.0.15-1-pve
initrd   /EFI/proxmox/5.0.15-1-pve/initrd.img-5.0.15-1-pve

3.13.6. Editing the Kernel Commandline

You can modify the kernel commandline in the following places, depending on the bootloader used:

GRUB

The kernel commandline needs to be placed in the variable GRUB_CMDLINE_LINUX_DEFAULT in the file /etc/default/grub. Running update-grub appends its content to all linux entries in /boot/grub/grub.cfg.

Systemd-boot

The kernel commandline needs to be placed as one line in /etc/kernel/cmdline. To apply your changes, run proxmox-boot-tool refresh, which sets it as the option line for all config files in loader/entries/proxmox-*.conf.
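
For the systemd-boot case, adding a parameter to the single-line file can be sketched as follows. The helper add_cmdline_param is hypothetical; it takes the file path as an argument so that it can be tried on a copy first, and it only handles the first line of the file.

```shell
#!/bin/sh
# Hypothetical helper: appends one parameter to a single-line kernel
# command line file unless it is already present. The file path is passed
# in so the sketch can be tried on a copy before touching /etc/kernel/cmdline.
add_cmdline_param() {
    file="$1"
    param="$2"
    current="$(head -n 1 "$file")"
    case " $current " in
        *" $param "*) return 0 ;;   # already present, nothing to do
    esac
    # Keep everything on one line; add a separating space only if needed.
    printf '%s\n' "${current:+$current }$param" > "$file"
}

# Usage on a systemd-boot system (commented out, as it changes boot settings):
#   add_cmdline_param /etc/kernel/cmdline quiet
#   proxmox-boot-tool refresh
```

On GRUB systems, the parameter belongs in GRUB_CMDLINE_LINUX_DEFAULT instead, followed by update-grub.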

A complete list of kernel parameters can be found at https://www.kernel.org/doc/html/v<YOUR-KERNEL-VERSION>/admin-guide/kernel-parameters.html. Replace <YOUR-KERNEL-VERSION> with the major.minor version. For example, for kernels based on version 6.5 the URL would be: https://www.kernel.org/doc/html/v6.5/admin-guide/kernel-parameters.html

You can find your kernel version by checking the web interface (Node → Summary), or by running

# uname -r

Use the first two numbers at the front of the output.
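
Deriving the documentation URL from the kernel release string can be sketched as below. The function name kernel_params_url is hypothetical; it assumes the release string starts with major.minor, as in the uname -r output.

```shell
#!/bin/sh
# Hypothetical helper: builds the kernel-parameters documentation URL from a
# kernel release string. Without an argument it uses `uname -r`.
kernel_params_url() {
    release="${1:-$(uname -r)}"
    major="${release%%.*}"        # text before the first dot
    rest="${release#*.}"          # text after the first dot
    minor="${rest%%[!0-9]*}"      # leading digits of the remainder
    printf 'https://www.kernel.org/doc/html/v%s.%s/admin-guide/kernel-parameters.html\n' \
        "$major" "$minor"
}

# Usage:
#   kernel_params_url                 # for the running kernel
#   kernel_params_url 6.5.11-4-pve    # for a given release string
```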

3.13.7. Override the Kernel-Version for next Boot

To select a kernel that is not currently the default kernel, you can either:

  • use the boot loader menu that is displayed at the beginning of the boot process

  • use the proxmox-boot-tool to pin the system to a kernel version either once or permanently (until pin is reset).

This should help you work around incompatibilities between a newer kernel version and the hardware.

Note Such a pin should be removed as soon as possible so that all current security patches of the latest kernel are also applied to the system.

For example: To permanently select the version 5.15.30-1-pve for booting you would run:

# proxmox-boot-tool kernel pin 5.15.30-1-pve

Tip The pinning functionality works for all Proxmox VE systems, not only those using proxmox-boot-tool to synchronize the contents of the ESPs. If your system does not use proxmox-boot-tool for synchronizing, you can skip the proxmox-boot-tool refresh call at the end.

You can also set a kernel version to be booted on the next system boot only. This is for example useful to test if an updated kernel has resolved an issue, which caused you to pin a version in the first place:

# proxmox-boot-tool kernel pin 5.15.30-1-pve --next-boot

To remove any pinned version configuration use the unpin subcommand:

# proxmox-boot-tool kernel unpin

While unpin has a --next-boot option as well, it is only used to clear a pinned version set with --next-boot. As that already happens automatically on boot, invoking it manually is of little use.

After setting or clearing pinned versions, you also need to synchronize the content and configuration on the ESPs by running the refresh subcommand.

Tip You will be prompted to do so automatically on proxmox-boot-tool managed systems if you call the tool interactively.

# proxmox-boot-tool refresh

3.13.8. Secure Boot

Since Proxmox VE 8.1, Secure Boot is supported out of the box via signed packages and integration in proxmox-boot-tool.

The following packages are required for secure boot to work. You can install them all at once by using the ‘proxmox-secure-boot-support’ meta-package.

  • shim-signed (shim bootloader signed by Microsoft)

  • shim-helpers-amd64-signed (fallback bootloader and MOKManager, signed by Proxmox)

  • grub-efi-amd64-signed (GRUB EFI bootloader, signed by Proxmox)

  • proxmox-kernel-6.X.Y-Z-pve-signed (Kernel image, signed by Proxmox)

Only GRUB is supported as bootloader out of the box, since other bootloaders are currently not eligible for Secure Boot code-signing.

Any new installation of Proxmox VE will automatically have all of the above packages included.

More details about how Secure Boot works, and how to customize the setup, are available in our wiki.

Switching an Existing Installation to Secure Boot

Warning This can lead to an unbootable installation in some cases if not done correctly. Reinstalling the host will set up Secure Boot automatically if available, without any extra interactions. Make sure you have a working and well-tested backup of your Proxmox VE host!

An existing UEFI installation can be switched over to Secure Boot if desired, without having to reinstall Proxmox VE from scratch.

First, ensure your whole system is up to date. Next, install proxmox-secure-boot-support. GRUB automatically creates the needed EFI boot entry for booting via the default shim.

systemd-boot

If systemd-boot is used as a bootloader (see Determine which Bootloader is used), some additional setup is needed. This is only the case if Proxmox VE was installed with ZFS-on-root.

To check the latter, run:

# findmnt /

If the host is indeed using ZFS as root filesystem, the FSTYPE column should contain zfs:

TARGET SOURCE           FSTYPE OPTIONS
/      rpool/ROOT/pve-1 zfs    rw,relatime,xattr,noacl,casesensitive

Next, a suitable potential ESP (EFI System Partition) must be found. This can be done using the lsblk command as follows:

# lsblk -o +FSTYPE

The output should look something like this:

NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS FSTYPE
sda      8:0    0   32G  0 disk
├─sda1   8:1    0 1007K  0 part
├─sda2   8:2    0  512M  0 part             vfat
└─sda3   8:3    0 31.5G  0 part             zfs_member
sdb      8:16   0   32G  0 disk
├─sdb1   8:17   0 1007K  0 part
├─sdb2   8:18   0  512M  0 part             vfat
└─sdb3   8:19   0 31.5G  0 part             zfs_member

In this case, the partitions sda2 and sdb2 are the targets. They can be identified by their size of 512M and their FSTYPE being vfat, in this case on a ZFS RAID-1 installation.

These partitions must be properly set up for booting through GRUB using proxmox-boot-tool. This command (using sda2 as an example) must be run separately for each individual ESP:

# proxmox-boot-tool init /dev/sda2 grub
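
Finding the candidate partitions can be sketched as a filter over the lsblk output. The helper list_esp_candidates is hypothetical and applies the same criteria as above (512M size, vfat); an ESP of a different size will not be matched, so always review the list by hand.

```shell
#!/bin/sh
# Hypothetical helper: reads `lsblk -rno NAME,SIZE,TYPE,FSTYPE` output on
# stdin and prints the device path of every 512M vfat partition, which is
# how the ESPs were identified above. Review the list before acting on it.
list_esp_candidates() {
    awk '$3 == "part" && $2 == "512M" && $4 == "vfat" { print "/dev/" $1 }'
}

# Usage on a node (commented out because init changes the boot setup):
#   for esp in $(lsblk -rno NAME,SIZE,TYPE,FSTYPE | list_esp_candidates); do
#       proxmox-boot-tool init "$esp" grub
#   done
```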

Afterwards, you can sanity-check the setup by running the following command:

# efibootmgr -v

This list should contain an entry looking similar to this:

[..]
Boot0009* proxmox       HD(2,GPT,..,0x800,0x100000)/File(\EFI\proxmox\shimx64.efi)
[..]

Note The old systemd-boot bootloader will be kept, but GRUB will be preferred. This way, if booting using GRUB in Secure Boot mode does not work for any reason, the system can still be booted using systemd-boot with Secure Boot turned off.

Now the host can be rebooted and Secure Boot enabled in the UEFI firmware setup utility.

On reboot, a new entry named proxmox should be selectable in the UEFI firmware boot menu, which boots using the pre-signed EFI shim.

If, for any reason, no proxmox entry can be found in the UEFI boot menu, you can try adding it manually (if supported by the firmware), by adding the file \EFI\proxmox\shimx64.efi as a custom boot entry.

Note Some UEFI firmwares are known to drop the proxmox boot option on reboot. This can happen if the proxmox boot entry is pointing to a GRUB installation on a disk, where the disk itself is not a boot option. If possible, try adding the disk as a boot option in the UEFI firmware setup utility and run proxmox-boot-tool again.

Tip To enroll custom keys, see the accompanying Secure Boot wiki page.

Using DKMS/Third Party Modules With Secure Boot

On systems with Secure Boot enabled, the kernel will refuse to load modules which are not signed by a trusted key. The default set of modules shipped with the kernel packages is signed with an ephemeral key embedded in the kernel image which is trusted by that specific version of the kernel image.

In order to load other modules, such as those built with DKMS or manually, they need to be signed with a key trusted by the Secure Boot stack. The easiest way to achieve this is to enroll the signing key as a Machine Owner Key (MOK) with mokutil.

The dkms tool will automatically generate a keypair and certificate in /var/lib/dkms/mok.key and /var/lib/dkms/mok.pub and use it for signing the kernel modules it builds and installs.

You can view the certificate contents with

# openssl x509 -in /var/lib/dkms/mok.pub -noout -text

and enroll it on your system using the following command:

# mokutil --import /var/lib/dkms/mok.pub
input password:
input password again:

The mokutil command will ask for a (temporary) password twice. This password needs to be entered one more time in the next step of the process! Rebooting the system should automatically boot into the MOKManager EFI binary, which allows you to verify the key/certificate and confirm the enrollment using the password selected when starting the enrollment using mokutil. Afterwards, the kernel should allow loading modules built with DKMS (which are signed with the enrolled MOK). The MOK can also be used to sign custom EFI binaries and kernel images if desired.

The same procedure can also be used for custom/third-party modules not managed with DKMS, but the key/certificate generation and signing steps need to be done manually in that case.

3.14. Kernel Samepage Merging (KSM)

Kernel Samepage Merging (KSM) is an optional memory deduplication feature offered by the Linux kernel, which is enabled by default in Proxmox VE. KSM works by scanning a range of physical memory pages for identical content, and identifying the virtual pages that are mapped to them. If identical pages are found, the corresponding virtual pages are re-mapped so that they all point to the same physical page, and the old pages are freed. The virtual pages are marked as "copy-on-write", so that any writes to them will be written to a new area of memory, leaving the shared physical page intact.

3.14.1. Implications of KSM

KSM can optimize memory usage in virtualization environments, as multiple VMs running similar operating systems or workloads could potentially share a lot of common memory pages.

However, while KSM can reduce memory usage, it also comes with some security risks, as it can expose VMs to side-channel attacks. Research has shown that it is possible to infer information about a running VM via a second VM on the same host, by exploiting certain characteristics of KSM.

Thus, if you are using Proxmox VE to provide hosting services, you should consider disabling KSM, in order to provide your users with additional security. Furthermore, you should check your country’s regulations, as disabling KSM may be a legal requirement.

3.14.2. Disabling KSM

To see if KSM is active, you can check the output of:

# systemctl status ksmtuned

If it is, it can be disabled immediately with:

# systemctl disable --now ksmtuned

Finally, to unmerge all the currently merged pages, run:

# echo 2 > /sys/kernel/mm/ksm/run
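
The value written above is one of three states the kernel accepts for this file. The sketch below decodes them; the helper name ksm_run_meaning is hypothetical, and the meanings are taken from the Linux kernel's KSM documentation rather than from this guide.

```shell
#!/bin/sh
# Hypothetical helper: explains a value read from /sys/kernel/mm/ksm/run.
# The meanings follow the Linux kernel's KSM documentation.
ksm_run_meaning() {
    case "$1" in
        0) echo "stopped (already merged pages stay merged)" ;;
        1) echo "running" ;;
        2) echo "stopped (all merged pages are unmerged)" ;;
        *) echo "unknown value: $1"; return 1 ;;
    esac
}

# Usage on a node:
#   ksm_run_meaning "$(cat /sys/kernel/mm/ksm/run)"
#   cat /sys/kernel/mm/ksm/pages_sharing   # should drop to 0 after unmerging
```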

4. Graphical User Interface

Proxmox VE is simple. There is no need to install a separate management tool, and everything can be done through your web browser (Latest Firefox or Google Chrome is preferred). A built-in HTML5 console is used to access the guest console. As an alternative, SPICE can be used.

Because we use the Proxmox cluster file system (pmxcfs), you can connect to any node to manage the entire cluster. There is no need for a dedicated manager node.

You can use the web-based administration interface with any modern browser. When Proxmox VE detects that you are connecting from a mobile device, you are redirected to a simpler, touch-based user interface.

The web interface can be reached via https://youripaddress:8006 (default login is: root, and the password is specified during the installation process).

4.1. Features

  • Seamless integration and management of Proxmox VE clusters

  • AJAX technologies for dynamic updates of resources

  • Secure access to all Virtual Machines and Containers via SSL encryption (https)

  • Fast search-driven interface, capable of handling hundreds and probably thousands of VMs

  • Secure HTML5 console or SPICE

  • Role based permission management for all objects (VMs, storages, nodes, etc.)

  • Support for multiple authentication sources (e.g. local, MS ADS, LDAP, …)

  • Two-Factor Authentication (OATH, Yubikey)

  • Based on ExtJS 7.x JavaScript framework

4.2. Login

screenshot/gui-login-window.png

When you connect to the server, you will first see the login window. Proxmox VE supports various authentication backends (Realm), and you can select the language here. The GUI is translated to more than 20 languages.

Note You can save the user name on the client side by selecting the checkbox at the bottom. This saves some typing when you log in next time.

4.3. GUI Overview

screenshot/gui-datacenter-summary.png

The Proxmox VE user interface consists of four regions.

Header

On top. Shows status information and contains buttons for most important actions.

Resource Tree

At the left side. A navigation tree where you can select specific objects.

Content Panel

Center region. Selected objects display configuration options and status here.

Log Panel

At the bottom. Displays log entries for recent tasks. You can double-click on those log entries to get more details, or to abort a running task.

Note You can shrink and expand the size of the resource tree and log panel, or completely hide the log panel. This can be helpful when you work on small displays and want more space to view other content.

4.3.1. Header

On the top left side, the first thing you see is the Proxmox logo. Next to it is the currently running version of Proxmox VE. In the search bar beside it, you can search for specific objects (VMs, containers, nodes, …). This is sometimes faster than selecting an object in the resource tree.

The right part of the header contains four buttons:

Documentation

Opens a new browser window showing the reference documentation.

Create VM

Opens the virtual machine creation wizard.

Create CT

Opens the container creation wizard.

User Menu

Displays the identity of the user you’re currently logged in with, and clicking it opens a menu with user-specific options.

In the user menu, you’ll find the My Settings dialog, which provides local UI settings. Below that, there are shortcuts for TFA (Two-Factor Authentication) and Password self-service. You’ll also find options to change the Language and the Color Theme. Finally, at the bottom of the menu is the Logout option.

4.3.2. My Settings

screenshot/gui-my-settings.png

The My Settings window allows you to set locally stored settings. These include the Dashboard Storages which allow you to enable or disable specific storages to be counted towards the total amount visible in the datacenter summary. If no storage is checked the total is the sum of all storages, same as enabling every single one.

Below the dashboard settings you find the stored user name and a button to clear it as well as a button to reset every layout in the GUI to its default.

On the right side there are xterm.js Settings. These contain the following options:

Font-Family

The font to be used in xterm.js (e.g. Arial).

Font-Size

The preferred font size to be used.

Letter Spacing

Increases or decreases spacing between letters in text.

Line Height

Specify the absolute height of a line.

4.3.3. Resource Tree

This is the main navigation tree. On top of the tree you can select some predefined views, which change the structure of the tree below. The default view is the Server View, and it shows the following object types:

Datacenter

Contains cluster-wide settings (relevant for all nodes).

Node

Represents the hosts inside a cluster, where the guests run.

Guest

VMs, containers and templates.

Storage

Data Storage.

Pool

It is possible to group guests using a pool to simplify management.

The following view types are available:

Server View

Shows all kinds of objects, grouped by nodes.

Folder View

Shows all kinds of objects, grouped by object type.

Pool View

Shows VMs and containers, grouped by pool.

4.3.4. Log Panel

The main purpose of the log panel is to show you what is currently going on in your cluster. Actions like creating a new VM are executed in the background, and we call such a background job a task.

Any output from such a task is saved into a separate log file. You can view that log by double-clicking a task log entry. It is also possible to abort a running task there.

Please note that we display the most recent tasks from all cluster nodes here. So you can see when somebody else is working on another cluster node in real-time.

Note We remove older and finished task from the log panel to keep that list short. But you can still find those tasks within the node panel in the Task History.

Some short-running actions simply send logs to all cluster members. You can see those messages in the Cluster log panel.

4.4. Content Panels

When you select an item from the resource tree, the corresponding object displays configuration and status information in the content panel. The following sections provide a brief overview of this functionality. Please refer to the corresponding chapters in the reference documentation to get more detailed information.

4.4.1. Datacenter

screenshot/gui-datacenter-search.png

On the datacenter level, you can access cluster-wide settings and information.

  • Search: perform a cluster-wide search for nodes, VMs, containers, storage devices, and pools.

  • Summary: gives a brief overview of the cluster’s health and resource usage.

  • Cluster: provides the functionality and information necessary to create or join a cluster.

  • Options: view and manage cluster-wide default settings.

  • Storage: provides an interface for managing cluster storage.

  • Backup: schedule backup jobs. This operates cluster wide, so it doesn’t matter where the VMs/containers are on your cluster when scheduling.

  • Replication: view and manage replication jobs.

  • Permissions: manage user, group, and API token permissions, and LDAP, MS-AD and Two-Factor authentication.

  • HA: manage Proxmox VE High Availability.

  • ACME: set up ACME (Let’s Encrypt) certificates for server nodes.

  • Firewall: configure and make templates for the Proxmox Firewall cluster wide.

  • Metric Server: define external metric servers for Proxmox VE.

  • Notifications: configure notification behavior and targets for Proxmox VE.

  • Support: display information about your support subscription.

4.4.2. Nodes

screenshot/gui-node-summary.png

Nodes in your cluster can be managed individually at this level.

The top header has useful buttons such as Reboot, Shutdown, Shell, Bulk Actions and Help. Shell has the options noVNC, SPICE and xterm.js. Bulk Actions has the options Bulk Start, Bulk Shutdown and Bulk Migrate.

  • Search: search a node for VMs, containers, storage devices, and pools.

  • Summary: display a brief overview of the node’s resource usage.

  • Notes: write custom comments in Markdown syntax.

  • Shell: access to a shell interface for the node.

  • System: configure network, DNS and time settings, and access the syslog.

  • Updates: upgrade the system and see the available new packages.

  • Firewall: manage the Proxmox Firewall for a specific node.

  • Disks: get an overview of the attached disks, and manage how they are used.

  • Ceph: only used if you have installed a Ceph server on your host. In this case, you can manage your Ceph cluster and see its status here.

  • Replication: view and manage replication jobs.

  • Task History: see a list of past tasks.

  • Subscription: upload a subscription key, and generate a system report for use in support cases.

4.4.3. Guests

screenshot/gui-qemu-summary.png

There are two different kinds of guests and both can be converted to a template. One of them is a Kernel-based Virtual Machine (KVM) and the other is a Linux Container (LXC). Navigation for both is mostly the same; only some options are different.

To access the various guest management interfaces, select a VM or container from the menu on the left.

The header contains commands for items such as power management, migration, console access and type, cloning, HA, and help. Some of these buttons contain drop-down menus, for example, Shutdown also contains other power options, and Console contains the different console types: SPICE, noVNC and xterm.js.

The panel on the right contains an interface for whatever item is selected from the menu on the left.

The available interfaces are as follows.

  • Summary: provides a brief overview of the VM’s activity and a Notes field for Markdown syntax comments.

  • Console: access to an interactive console for the VM/container.

  • (KVM) Hardware: define the hardware available to the KVM VM.

  • (LXC) Resources: define the system resources available to the LXC.

  • (LXC) Network: configure a container’s network settings.

  • (LXC) DNS: configure a container’s DNS settings.

  • Options: manage guest options.

  • Task History: view all previous tasks related to the selected guest.

  • (KVM) Monitor: an interactive communication interface to the KVM process.

  • Backup: create and restore system backups.

  • Replication: view and manage the replication jobs for the selected guest.

  • Snapshots: create and restore VM snapshots.

  • Firewall: configure the firewall on the VM level.

  • Permissions: manage permissions for the selected guest.

4.4.4. Storage

screenshot/gui-storage-summary-local.png

As with the guest interface, the interface for storage consists of a menu on the left for certain storage elements and an interface on the right to manage these elements.

This view is split into two parts: the storage options are on the left, and the content of the selected option is shown on the right.

  • Summary: shows important information about the storage, such as the type, usage, and content which it stores.

  • Content: a menu item for each content type which the storage stores, for example, Backups, ISO Images, CT Templates.

  • Permissions: manage permissions for the storage.

4.4.5. Pools

screenshot/gui-pool-summary-development.png

Again, the pools view comprises two partitions: a menu on the left, and the corresponding interfaces for each menu item on the right.

  • Summary: shows a description of the pool.

  • Members: display and manage pool members (guests and storage).

  • Permissions: manage the permissions for the pool.

4.5. Tags

screenshot/gui-qemu-summary-tags-edit.png

For organizational purposes, it is possible to set tags for guests. Currently, these only provide informational value to users. Tags are displayed in two places in the web interface: in the Resource Tree and in the status line when a guest is selected.

Tags can be added, edited, and removed in the status line of the guest by clicking on the pencil icon. You can add multiple tags by pressing the + button and remove them by pressing the - button. To save or cancel the changes, you can use the ✓ and x buttons respectively.

Tags can also be set via the CLI, where multiple tags are separated by semicolons. For example:

# qm set ID --tags 'myfirsttag;mysecondtag'

4.5.1. Style Configuration

screenshot/gui-datacenter-tag-style.png

By default, the tag colors are derived from their text in a deterministic way. The color, shape in the resource tree, and case-sensitivity, as well as how tags are sorted, can be customized. This can be done via the web interface under Datacenter → Options → Tag Style Override. Alternatively, this can be done via the CLI. For example:

# pvesh set /cluster/options --tag-style color-map=example:000000:FFFFFF

sets the background color of the tag example to black (#000000) and the text color to white (#FFFFFF).
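As a rough illustration of how one color-map entry is structured, the following Python sketch splits an entry such as example:000000:FFFFFF into its parts. The function name parse_color_map_entry is hypothetical, and the sketch assumes that the text color is optional; it is not how Proxmox VE itself parses the option.

```python
import re

HEX_COLOR = re.compile(r"[0-9A-Fa-f]{6}")

def parse_color_map_entry(entry):
    """Split 'tag:BACKGROUND[:TEXT]' into (tag, background, text).

    Hypothetical helper for illustration; assumes the text color may be omitted.
    """
    tag, *colors = entry.split(":")
    if not tag or not 1 <= len(colors) <= 2:
        raise ValueError(f"unexpected color-map entry: {entry!r}")
    for color in colors:
        if not HEX_COLOR.fullmatch(color):
            raise ValueError(f"not a 6-digit hex color: {color!r}")
    background = "#" + colors[0].upper()
    text = "#" + colors[1].upper() if len(colors) == 2 else None
    return tag, background, text

print(parse_color_map_entry("example:000000:FFFFFF"))
```

For the entry used above, this yields the tag example with a black background and white text.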

4.5.2. Permissions

screenshot/gui-datacenter-options.png

By default, users with the privilege VM.Config.Options on a guest (/vms/ID) can set any tags they want (see Permission Management). If you want to restrict this behavior, appropriate permissions can be set under Datacenter → Options → User Tag Access:

  • free: users are not restricted in setting tags (Default)

  • list: users can set tags based on a predefined list of tags

  • existing: like list but users can also use already existing tags

  • none: users are restricted from using tags

The same can also be done via the CLI.

Note that a user with the Sys.Modify privilege on / is always able to set or delete any tags, regardless of the settings here. Additionally, there is a configurable list of registered tags which can only be added and removed by users with the privilege Sys.Modify on /. The list of registered tags can be edited under Datacenter → Options → Registered Tags or via the CLI.
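The access modes and the Sys.Modify override described above can be read as a small decision rule. The following Python sketch only illustrates that rule and is not the Proxmox VE implementation; the function may_set_tag and all of its parameters are hypothetical, and it assumes that registered tags are reserved for privileged users in every mode.

```python
def may_set_tag(tag, mode, allowed=(), existing=(), registered=(),
                has_sys_modify=False):
    """Return True if a user may set `tag` under the given User Tag Access mode.

    Illustrative only: `allowed` is the predefined list, `existing` the tags
    already in use, `registered` the registered tags.
    """
    if has_sys_modify:
        return True              # Sys.Modify on / overrides every restriction
    if tag in registered:
        return False             # registered tags require Sys.Modify on /
    if mode == "free":
        return True              # default: no restriction
    if mode == "list":
        return tag in allowed
    if mode == "existing":
        return tag in allowed or tag in existing
    if mode == "none":
        return False
    raise ValueError(f"unknown mode: {mode!r}")

print(may_set_tag("web", "existing", allowed=["db"], existing=["web"]))
```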

For more details on the exact options and how to invoke them in the CLI, see Datacenter Configuration.

5. Cluster Manager

The Proxmox VE cluster manager pvecm is a tool to create a group of physical servers. Such a group is called a cluster. We use the Corosync Cluster Engine for reliable group communication. There’s no explicit limit for the number of nodes in a cluster. In practice, the actual possible node count may be limited by the host and network performance. Currently (2021), there are reports of clusters (using high-end enterprise hardware) with over 50 nodes in production.

pvecm can be used to create a new cluster, join nodes to a cluster, leave the cluster, get status information, and do various other cluster-related tasks. The Proxmox Cluster File System (“pmxcfs”) is used to transparently distribute the cluster configuration to all cluster nodes.

Grouping nodes into a cluster has the following advantages:

  • Centralized, web-based management

  • Multi-master clusters: each node can do all management tasks

  • Use of pmxcfs, a database-driven file system, for storing configuration files, replicated in real-time on all nodes using corosync

  • Easy migration of virtual machines and containers between physical hosts

  • Fast deployment

  • Cluster-wide services like firewall and HA

5.1. Requirements

  • All nodes must be able to connect to each other via UDP ports 5405-5412 for corosync to work.

  • Date and time must be synchronized.

  • An SSH tunnel on TCP port 22 between nodes is required.

  • If you are interested in High Availability, you need to have at least three nodes for reliable quorum. All nodes should have the same version.

  • We recommend a dedicated NIC for the cluster traffic, especially if you use shared storage.

  • The root password of a cluster node is required for adding nodes.

  • Online migration of virtual machines is only supported when nodes have CPUs from the same vendor. It might work otherwise, but this is never guaranteed.

Note It is not possible to mix Proxmox VE 3.x and earlier with Proxmox VE 4.X cluster nodes.
Note While it’s possible to mix Proxmox VE 4.4 and Proxmox VE 5.0 nodes, doing so is not supported as a production configuration and should only be done temporarily, during an upgrade of the whole cluster from one major version to another.
Note Running a cluster of Proxmox VE 6.x with earlier versions is not possible. The cluster protocol (corosync) between Proxmox VE 6.x and earlier versions changed fundamentally. The corosync 3 packages for Proxmox VE 5.4 are only intended for the upgrade procedure to Proxmox VE 6.0.

5.2. Preparing Nodes

First, install Proxmox VE on all nodes. Make sure that each node is installed with the final hostname and IP configuration. Changing the hostname and IP is not possible after cluster creation.

While it’s common to reference all node names and their IPs in /etc/hosts (or make their names resolvable through other means), this is not necessary for a cluster to work. It may be useful however, as you can then connect from one node to another via SSH, using the easier to remember node name (see also Link Address Types). Note that we always recommend referencing nodes by their IP addresses in the cluster configuration.

5.3. Create a Cluster

You can either create a cluster on the console (login via ssh), or through the API using the Proxmox VE web interface (Datacenter → Cluster).

Note Use a unique name for your cluster. This name cannot be changed later. The cluster name follows the same rules as node names.

5.3.1. Create via Web GUI

screenshot/gui-cluster-create.png

Under Datacenter → Cluster, click on Create Cluster. Enter the cluster name and select a network connection from the drop-down list to serve as the main cluster network (Link 0). It defaults to the IP resolved via the node’s hostname.

As of Proxmox VE 6.2, up to 8 fallback links can be added to a cluster. To add a redundant link, click the Add button and select a link number and IP address from the respective fields. Prior to Proxmox VE 6.2, to add a second link as fallback, you can select the Advanced checkbox and choose an additional network interface (Link 1, see also Corosync Redundancy).

Note Ensure that the network selected for cluster communication is not used for any high traffic purposes, like network storage or live-migration. While the cluster network itself produces small amounts of data, it is very sensitive to latency. Check out full cluster network requirements.

5.3.2. Create via the Command Line

Log in via ssh to the first Proxmox VE node and run the following command:

 hp1# pvecm create CLUSTERNAME

To check the state of the new cluster use:

 hp1# pvecm status

5.3.3. Multiple Clusters in the Same Network

It is possible to create multiple clusters in the same physical or logical network. In this case, each cluster must have a unique name to avoid possible clashes in the cluster communication stack. Furthermore, this helps avoid human confusion by making clusters clearly distinguishable.

While the bandwidth requirement of a corosync cluster is relatively low, the latency of packets and the packets per second (PPS) rate are the limiting factors. Different clusters in the same network can compete with each other for these resources, so it may still make sense to use separate physical network infrastructure for bigger clusters.

5.4. Adding Nodes to the Cluster

Caution All existing configuration in /etc/pve is overwritten when joining a cluster. In particular, a joining node cannot hold any guests, since guest IDs could otherwise conflict, and the node will inherit the cluster’s storage configuration. To join a node with existing guests, as a workaround, you can create a backup of each guest (using vzdump) and restore it under a different ID after joining. If the node’s storage layout differs, you will need to re-add the node’s storages, and adapt each storage’s node restriction to reflect on which nodes the storage is actually available.

5.4.1. Join Node to Cluster via GUI

screenshot/gui-cluster-join-information.png

Log in to the web interface on an existing cluster node. Under Datacenter → Cluster, click the Join Information button at the top. Then, click on the button Copy Information. Alternatively, copy the string from the Information field manually.

screenshot/gui-cluster-join.png

Next, log in to the web interface on the node you want to add. Under Datacenter → Cluster, click on Join Cluster. Fill in the Information field with the Join Information text you copied earlier. Most settings required for joining the cluster will be filled out automatically. For security reasons, the cluster password has to be entered manually.

Note To enter all required data manually, you can disable the Assisted Join checkbox.

After clicking the Join button, the cluster join process will start immediately. After the node has joined the cluster, its current node certificate will be replaced by one signed by the cluster certificate authority (CA). This means that the current session will stop working after a few seconds. You then might need to force-reload the web interface and log in again with the cluster credentials.

Now your node should be visible under Datacenter → Cluster.

5.4.2. Join Node to Cluster via Command Line

Log in to the node you want to join into an existing cluster via ssh.

 # pvecm add IP-ADDRESS-CLUSTER

For IP-ADDRESS-CLUSTER, use the IP or hostname of an existing cluster node. An IP address is recommended (see Link Address Types).

To check the state of the cluster use:

 # pvecm status
Cluster status after adding 4 nodes
 # pvecm status
Cluster information
~~~~~~~~~~~~~~~~~~~
Name:             prod-central
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
~~~~~~~~~~~~~~~~~~
Date:             Tue Sep 14 11:06:47 2021
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000001
Ring ID:          1.1a8
Quorate:          Yes

Votequorum information
~~~~~~~~~~~~~~~~~~~~~~
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
0x00000001          1 192.168.15.91
0x00000002          1 192.168.15.92 (local)
0x00000003          1 192.168.15.93
0x00000004          1 192.168.15.94

If you only want a list of all nodes, use:

 # pvecm nodes
List nodes in a cluster
 # pvecm nodes

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
         1          1 hp1
         2          1 hp2 (local)
         3          1 hp3
         4          1 hp4
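If you want to process the node list in a script, the table can be split into fields. The following Python sketch is one possible way to do that, using the sample output shown above as input; the function name parse_pvecm_nodes is hypothetical, and the sketch assumes the decimal node IDs printed by pvecm nodes (it does not handle the hexadecimal IDs shown by pvecm status).

```python
def parse_pvecm_nodes(output):
    """Turn the text printed by `pvecm nodes` into a list of dicts."""
    nodes = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows start with a numeric node ID followed by a vote count.
        if len(fields) >= 3 and fields[0].isdigit() and fields[1].isdigit():
            nodes.append({
                "nodeid": int(fields[0]),
                "votes": int(fields[1]),
                "name": fields[2],
                "local": "(local)" in fields[3:],
            })
    return nodes

SAMPLE = """\
Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
         1          1 hp1
         2          1 hp2 (local)
         3          1 hp3
         4          1 hp4
"""

for node in parse_pvecm_nodes(SAMPLE):
    print(node)
```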

5.4.3. Adding Nodes with Separated Cluster Network

When adding a node to a cluster with a separated cluster network, you need to use the link0 parameter to set the node’s address on that network:

# pvecm add IP-ADDRESS-CLUSTER --link0 LOCAL-IP-ADDRESS-LINK0

If you want to use the built-in redundancy of the Kronosnet transport layer, also use the link1 parameter.

Using the GUI, you can select the correct interface from the corresponding Link X fields in the Cluster Join dialog.

5.5. Remove a Cluster Node

Caution Read the procedure carefully before proceeding, as it may not be what you want or need.

Move all virtual machines from the node. Ensure that you have made copies of any local data or backups that you want to keep. In addition, make sure to remove any scheduled replication jobs to the node to be removed.

Caution Failure to remove replication jobs to a node before removing said node will result in the replication job becoming irremovable. Especially note that replication automatically switches direction if a replicated VM is migrated, so by migrating a replicated VM from a node to be deleted, replication jobs will be set up to that node automatically.

In the following example, we will remove the node hp4 from the cluster.

Log in to a different cluster node (not hp4), and issue a pvecm nodes command to identify the node ID to remove:

 hp1# pvecm nodes

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
         1          1 hp1 (local)
         2          1 hp2
         3          1 hp3
         4          1 hp4

At this point, you must power off hp4 and ensure that it will not power on again (in the network) with its current configuration.

Important As mentioned above, it is critical to power off the node before removal, and make sure that it will not power on again (in the existing cluster network) with its current configuration. If you power on the node as it is, the cluster could end up broken, and it could be difficult to restore it to a functioning state.

After powering off the node hp4, we can safely remove it from the cluster.

 hp1# pvecm delnode hp4
 Killing node 4
Note At this point, it is possible that you will receive an error message stating Could not kill node (error = CS_ERR_NOT_EXIST). This does not signify an actual failure in the deletion of the node, but rather a failure in corosync trying to kill an offline node. Thus, it can be safely ignored.

Use pvecm nodes or pvecm status to check the node list again. It should look something like:

hp1# pvecm status

...

Votequorum information
~~~~~~~~~~~~~~~~~~~~~~
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
0x00000001          1 192.168.15.90 (local)
0x00000002          1 192.168.15.91
0x00000003          1 192.168.15.92

If, for whatever reason, you want this server to join the same cluster again, you have to:

  • do a fresh install of Proxmox VE on it,

  • then join it, as explained in the previous section.

The configuration files for the removed node will still reside in /etc/pve/nodes/hp4. Recover any configuration you still need and remove the directory afterwards.

Note After removal of the node, its SSH fingerprint will still reside in the known_hosts of the other nodes. If you receive an SSH error after rejoining a node with the same IP or hostname, run pvecm updatecerts once on the re-added node to update its fingerprint cluster wide.

5.5.1. Separate a Node Without Reinstalling

Caution This is not the recommended method, proceed with caution. Use the previous method if you’re unsure.

You can also separate a node from a cluster without reinstalling it from scratch. But after removing the node from the cluster, it will still have access to any shared storage. This must be resolved before you start removing the node from the cluster. A Proxmox VE cluster cannot share the exact same storage with another cluster, as storage locking doesn’t work over the cluster boundary. Furthermore, it may also lead to VMID conflicts.

It’s suggested that you create a new storage, where only the node which you want to separate has access. This can be a new export on your NFS or a new Ceph pool, to name a few examples. It’s just important that the exact same storage does not get accessed by multiple clusters. After setting up this storage, move all data and VMs from the node to it. Then you are ready to separate the node from the cluster.

Warning Ensure that all shared resources are cleanly separated! Otherwise you will run into conflicts and problems.

First, stop the corosync and pve-cluster services on the node:

systemctl stop pve-cluster
systemctl stop corosync

Start the cluster file system again in local mode:

pmxcfs -l

Delete the corosync configuration files:

rm /etc/pve/corosync.conf
rm -r /etc/corosync/*

You can now start the file system again as a normal service:

killall pmxcfs
systemctl start pve-cluster

The node is now separated from the cluster. You can delete it from any remaining node of the cluster with:

pvecm delnode oldnode

If the command fails due to a loss of quorum in the remaining node, you can set the expected votes to 1 as a workaround:

pvecm expected 1

And then repeat the pvecm delnode command.

Now switch back to the separated node and delete all the remaining cluster files on it. This ensures that the node can be added to another cluster again without problems.

rm /var/lib/corosync/*

As the configuration files from the other nodes are still in the cluster file system, you may want to clean those up too. After making absolutely sure that you have the correct node name, you can simply remove the entire directory recursively from /etc/pve/nodes/NODENAME.

Caution The node’s SSH keys will remain in the authorized_keys file. This means that the nodes can still connect to each other with public key authentication. You should fix this by removing the respective keys from the /etc/pve/priv/authorized_keys file.

5.6. Quorum

Proxmox VE uses a quorum-based technique to provide a consistent state among all cluster nodes.

A quorum is the minimum number of votes that a distributed transaction has to obtain in order to be allowed to perform an operation in a distributed system.

Quorum (distributed computing)
— from Wikipedia

In case of network partitioning, state changes require that a majority of nodes are online. The cluster switches to read-only mode if it loses quorum.

Note Proxmox VE assigns a single vote to each node by default.
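The majority rule can be written down as simple arithmetic. The following Python sketch is a minimal illustration with hypothetical function names; it reproduces the Quorum values from the pvecm status examples in the previous sections (3 for 4 expected votes, 2 for 3 expected votes).

```python
def quorum(expected_votes):
    """Smallest number of votes that forms a strict majority."""
    return expected_votes // 2 + 1

def is_quorate(online_votes, expected_votes):
    """True if the online votes reach the majority threshold."""
    return online_votes >= quorum(expected_votes)

for votes in (2, 3, 4, 5):
    print(votes, "expected votes -> quorum", quorum(votes))
```

With one vote per node, a two-node cluster already loses quorum when a single node fails, which is why at least three nodes are recommended for reliable quorum.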

5.7. Cluster Network

The cluster network is the core of a cluster. All messages sent over it have to be delivered reliably to all nodes in their respective order. In Proxmox VE this part is done by corosync, an implementation of a high performance, low overhead, high availability development toolkit. It serves our decentralized configuration file system (pmxcfs).

5.7.1. Network Requirements

The Proxmox VE cluster stack requires a reliable network with latencies under 5 milliseconds (LAN performance) between all nodes to operate stably. While on setups with a small node count a network with higher latencies may work, this is not guaranteed and gets rather unlikely with more than three nodes and latencies above around 10 ms.

The network should not be used heavily by other traffic: while corosync does not use much bandwidth, it is sensitive to latency jitter. Ideally, corosync runs on its own physically separated network. In particular, do not use a shared network for corosync and storage (except as a potential low-priority fallback in a redundant configuration).

Before setting up a cluster, it is good practice to check if the network is fit for that purpose. To ensure that the nodes can connect to each other on the cluster network, you can test the connectivity between them with the ping tool.
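If you collect ping results from several nodes, you may want to compare them against the latency target automatically. The following Python sketch is one possible approach; it assumes the summary line format printed by the Linux iputils ping ("rtt min/avg/max/mdev = ... ms"), the function names are hypothetical, and the sample line contains made-up values rather than a real measurement.

```python
import re

# Summary line of iputils ping: "rtt min/avg/max/mdev = a/b/c/d ms"
RTT_RE = re.compile(r"=\s*([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)\s*ms")

def max_rtt_ms(ping_output):
    """Extract the maximum round-trip time in milliseconds."""
    match = RTT_RE.search(ping_output)
    if match is None:
        raise ValueError("no rtt summary line found")
    return float(match.group(3))

def fits_cluster_network(ping_output, limit_ms=5.0):
    """Compare the worst observed latency against the 5 ms target."""
    return max_rtt_ms(ping_output) < limit_ms

sample = "rtt min/avg/max/mdev = 0.211/0.254/0.301/0.032 ms"  # illustrative values
print(fits_cluster_network(sample))
```

The sketch uses the maximum instead of the average, because corosync is sensitive to latency spikes. A short ping run is only a first indication and does not replace testing under realistic load.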

If the Proxmox VE firewall is enabled, ACCEPT rules for corosync will automatically be generated - no manual action is required.

Note Corosync used Multicast before version 3.0 (introduced in Proxmox VE 6.0). Modern versions rely on Kronosnet for cluster communication, which, for now, only supports regular UDP unicast.
Caution You can still enable Multicast or legacy unicast by setting your transport to udp or udpu in your corosync.conf, but keep in mind that this will disable all cryptography and redundancy support. This is therefore not recommended.

5.7.2. Separate Cluster Network

When creating a cluster without any parameters, the corosync cluster network is generally shared with the web interface and the VMs' network. Depending on your setup, even storage traffic may get sent over the same network. It’s recommended to change that, as corosync is a time-critical, real-time application.

Setting Up a New Network

First, you have to set up a new network interface. It should be on a physically separate network. Ensure that your network fulfills the cluster network requirements.

Separate On Cluster Creation

This is possible via the linkX parameters of the pvecm create command, used for creating a new cluster.

If you have set up an additional NIC with a static address on 10.10.10.1/25, and want to send and receive all cluster communication over this interface, you would execute:

pvecm create test --link0 10.10.10.1

To check if everything is working properly, execute:

systemctl status corosync

Afterwards, proceed as described above to add nodes with a separated cluster network.

Separate After Cluster Creation

You can do this if you have already created a cluster and want to switch its communication to another network, without rebuilding the whole cluster. This change may lead to short periods of quorum loss in the cluster, as nodes have to restart corosync and come up one after the other on the new network.

Check how to edit the corosync.conf file first. Then, open it and you should see a file similar to:

logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: due
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: tre
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: uno
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 3
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }

}
Note ringX_addr actually specifies a corosync link address. The name "ring" is a remnant of older corosync versions that is kept for backwards compatibility.

The first thing you want to do is add the name properties in the node entries, if you do not see them already. Those must match the node name.

Then replace all addresses from the ring0_addr properties of all nodes with the new addresses. You may use plain IP addresses or hostnames here. If you use hostnames, ensure that they are resolvable from all nodes (see also Link Address Types).

In this example, we want to switch cluster communication to the 10.10.10.0/25 network, so we change the ring0_addr of each node respectively.

Note The exact same procedure can be used to change other ringX_addr values as well. However, we recommend only changing one link address at a time, so that it’s easier to recover if something goes wrong.

After we increase the config_version property, the new configuration file should look like:

logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 4
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }

}
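The manual edit shown above (replace each ring0_addr, then increase config_version) can also be sketched as a text transformation. The following Python sketch is an illustration only: the function switch_link_addresses is hypothetical, it assumes that the name property precedes the ringX_addr property inside each node block, as in the example, and it works on a string rather than on the live file. Applying the result still has to follow the edit corosync.conf file section.

```python
def switch_link_addresses(conf_text, new_addrs, link=0):
    """Return conf_text with ring<link>_addr replaced per node name
    and config_version increased by one."""
    out = []
    current_node = None
    key = f"ring{link}_addr:"
    for line in conf_text.splitlines():
        stripped = line.strip()
        indent = line[: len(line) - len(line.lstrip())]
        if stripped.startswith("name:"):
            # "cluster_name:" does not match, only the node's "name:" does.
            current_node = stripped.split(":", 1)[1].strip()
        elif stripped == "}":
            current_node = None
        elif stripped.startswith(key) and current_node in new_addrs:
            line = f"{indent}{key} {new_addrs[current_node]}"
        elif stripped.startswith("config_version:"):
            version = int(stripped.split(":", 1)[1])
            line = f"{indent}config_version: {version + 1}"
        out.append(line)
    return "\n".join(out) + "\n"

EXAMPLE = """\
nodelist {
  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: uno
  }
}
totem {
  cluster_name: testcluster
  config_version: 3
}
"""

print(switch_link_addresses(EXAMPLE, {"uno": "10.10.10.1"}))
```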

Then, after a final check to see that all changed information is correct, we save it and once again follow the edit corosync.conf file section to bring it into effect.

The changes will be applied live, so restarting corosync is not strictly necessary. If you changed other settings as well, or notice corosync complaining, you can optionally trigger a restart.

On a single node execute:

systemctl restart corosync

Now check if everything is okay:

systemctl status corosync

If corosync begins to work again, restart it on all other nodes too. They will then join the cluster membership one by one on the new network.

5.7.3. Corosync Addresses

A corosync link address (for backwards compatibility denoted by ringX_addr in corosync.conf) can be specified in two ways:

  • IPv4/v6 addresses can be used directly. They are recommended, since they are static and usually not changed carelessly.

  • Hostnames will be resolved using getaddrinfo, which means that by default, IPv6 addresses will be used first, if available (see also man gai.conf). Keep this in mind, especially when upgrading an existing cluster to IPv6.

Caution Hostnames should be used with care, since the addresses they resolve to can be changed without touching corosync or the node it runs on - which may lead to a situation where an address is changed without thinking about implications for corosync.

A separate, static hostname specifically for corosync is recommended, if hostnames are preferred. Also, make sure that every node in the cluster can resolve all hostnames correctly.
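To see in which order a name resolves on a given node, you can query getaddrinfo yourself. The following Python sketch uses the standard socket.getaddrinfo call, which wraps the same system function; the helper name resolution_order is hypothetical, and the result reflects the local resolver configuration of the machine it runs on, which is not necessarily identical on every node.

```python
import socket

FAMILY_NAMES = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}

def resolution_order(host):
    """Return (family, address) pairs in the order getaddrinfo reports them."""
    infos = socket.getaddrinfo(host, None, type=socket.SOCK_DGRAM)
    return [
        (FAMILY_NAMES.get(family, str(family)), sockaddr[0])
        for family, _type, _proto, _canonname, sockaddr in infos
    ]

# A literal address needs no lookup; replace it with a node's hostname to test.
print(resolution_order("127.0.0.1"))
```

If the first entry for a hostname is an IPv6 address, that is the address a hostname-based link would likely end up using.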

Since Proxmox VE 5.1, hostnames are still supported, but they are resolved at the time of entry. Only the resolved IP is saved to the configuration.

Nodes that joined the cluster on earlier versions likely still use their unresolved hostname in corosync.conf. It might be a good idea to replace them with IPs or a separate hostname, as mentioned above.

5.8. Corosync Redundancy

Corosync supports redundant networking via its integrated Kronosnet layer by default (it is not supported on the legacy udp/udpu transports). It can be enabled by specifying more than one link address, either via the --linkX parameters of pvecm, in the GUI as Link 1 (while creating a cluster or adding a new node) or by specifying more than one ringX_addr in corosync.conf.

Note To provide useful failover, every link should be on its own physical network connection.

Links are used according to a priority setting. You can configure this priority by setting knet_link_priority in the corresponding interface section in corosync.conf, or, preferably, using the priority parameter when creating your cluster with pvecm:

 # pvecm create CLUSTERNAME --link0 10.10.10.1,priority=15 --link1 10.20.20.1,priority=20

This would cause link1 to be used first, since it has the higher priority.

If no priorities are configured manually (or two links have the same priority), links will be used in order of their number, with the lower number having higher priority.

Even if all links are working, only the one with the highest priority will see corosync traffic. Link priorities cannot be mixed, meaning that links with different priorities will not be able to communicate with each other.

Since lower priority links will not see traffic unless all higher priorities have failed, it becomes a useful strategy to specify networks used for other tasks (VMs, storage, etc.) as low-priority links. If worst comes to worst, a higher latency or more congested connection might be better than no connection at all.
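The selection rule described above can be modeled in a few lines of shell. This is only a toy model of the documented behavior (highest priority wins, ties go to the lower link number), not how Kronosnet is implemented. The function name preferred_link and the LINK:PRIORITY argument format are invented for this sketch.

```shell
# Print the number of the link that would be preferred,
# given arguments of the (hypothetical) form LINK:PRIORITY.
preferred_link() {
    # Sort by priority (descending), then by link number (ascending).
    printf '%s\n' "$@" | sort -t: -k2,2nr -k1,1n | head -n 1 | cut -d: -f1
}

preferred_link 0:15 1:20   # prints 1 (higher priority wins)
preferred_link 0:10 1:10   # prints 0 (tie: lower link number wins)
```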

To add a new link to a running configuration, first check how to edit the corosync.conf file.

Then, add a new ringX_addr to every node in the nodelist section. Make sure that X is the same for every node you add it to, and that the address itself is unique for each node.

Lastly, add a new interface, as shown below, to your totem section, replacing X with the link number chosen above.

Assuming you added a link with number 1, the new configuration file could look like this:

logging {
  debug: off
  to_syslog: yes
}

nodelist {

  node {
    name: due
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
    ring1_addr: 10.20.20.2
  }

  node {
    name: tre
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
    ring1_addr: 10.20.20.3
  }

  node {
    name: uno
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
    ring1_addr: 10.20.20.1
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: testcluster
  config_version: 4
  ip_version: ipv4-6
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}

The new link will be enabled as soon as you follow the last steps to edit the corosync.conf file. A restart should not be necessary. You can check that corosync loaded the new link using:

journalctl -b -u corosync

It might be a good idea to test the new link by temporarily disconnecting the old link on one node and making sure that its status remains online while disconnected:

pvecm status

If you see a healthy cluster state, it means that your new link is being used.

5.9. Role of SSH in Proxmox VE Clusters

Proxmox VE utilizes SSH tunnels for various features.

  • Proxying console/shell sessions (node and guests)

    When using the shell for node B while being connected to node A, the session connects to a terminal proxy on node A, which is in turn connected to the login shell on node B via a non-interactive SSH tunnel.

  • VM and CT memory and local-storage migration in secure mode.

    During the migration, one or more SSH tunnel(s) are established between the source and target nodes, in order to exchange migration information and transfer memory and disk contents.

  • Storage replication

5.9.1. SSH setup

On Proxmox VE systems, the following changes are made to the SSH configuration/setup:

  • the root user’s SSH client config gets set up to prefer AES over ChaCha20

  • the root user’s authorized_keys file gets linked to /etc/pve/priv/authorized_keys, merging all authorized keys within a cluster

  • sshd is configured to allow logging in as root with a password

Note Older systems might also have /etc/ssh/ssh_known_hosts set up as a symlink pointing to /etc/pve/priv/known_hosts, containing a merged version of all node host keys. This system was replaced with explicit host key pinning in pve-cluster <<INSERT VERSION>>. If the symlink is still in place, it can be removed by running pvecm updatecerts --unmerge-known-hosts.

5.9.2. Pitfalls due to automatic execution of .bashrc and siblings

In case you have a custom .bashrc, or similar files that get executed on login by the configured shell, ssh will automatically run it once the session is established successfully. This can cause unexpected behavior, as those commands may be executed with root permissions during any of the operations described above, with potentially problematic side effects.

In order to avoid such complications, it’s recommended to add a check in /root/.bashrc to make sure the session is interactive, and only then run .bashrc commands.

You can add this snippet at the beginning of your .bashrc file:

# Early exit if not running interactively to avoid side-effects!
case $- in
    *i*) ;;
      *) return;;
esac

5.10. Corosync External Vote Support

This section describes a way to deploy an external voter in a Proxmox VE cluster. When configured, the cluster can sustain more node failures without violating safety properties of the cluster communication.

For this to work, there are two services involved:

  • A QDevice daemon which runs on each Proxmox VE node

  • An external vote daemon which runs on an independent server

As a result, you can achieve higher availability, even in smaller setups (for example 2+1 nodes).

5.10.1. QDevice Technical Overview

The Corosync Quorum Device (QDevice) is a daemon which runs on each cluster node. It provides a configured number of votes to the cluster’s quorum subsystem, based on an externally running third-party arbitrator’s decision. Its primary use is to allow a cluster to sustain more node failures than standard quorum rules allow. This can be done safely as the external device can see all nodes and thus choose only one set of nodes to give its vote. This will only be done if said set of nodes can have quorum (again) after receiving the third-party vote.

Currently, only QDevice Net is supported as a third-party arbitrator. This is a daemon which provides a vote to a cluster partition, if it can reach the partition members over the network. It will only give votes to one partition of a cluster at any time. It’s designed to support multiple clusters and is almost configuration and state free. New clusters are handled dynamically and no configuration file is needed on the host running a QDevice.

The only requirements for the external host are that it needs network access to the cluster and to have a corosync-qnetd package available. We provide a package for Debian based hosts, and other Linux distributions should also have a package available through their respective package manager.

Note Unlike corosync itself, a QDevice connects to the cluster over TCP/IP. The daemon can also run outside the LAN of the cluster and isn’t limited to the low-latency requirements of corosync.

5.10.2. Supported Setups

We support QDevices for clusters with an even number of nodes and recommend it for 2 node clusters, if they should provide higher availability. For clusters with an odd node count, we currently discourage the use of QDevices. The reason for this is the difference in the votes which the QDevice provides for each cluster type. Even numbered clusters get a single additional vote, which only increases availability, because if the QDevice itself fails, you are in the same position as with no QDevice at all.

On the other hand, with an odd numbered cluster size, the QDevice provides (N-1) votes — where N corresponds to the cluster node count. This alternative behavior makes sense; if it had only one additional vote, the cluster could get into a split-brain situation. This algorithm allows for all nodes but one (and naturally the QDevice itself) to fail. However, there are two drawbacks to this:

  • If the QNet daemon itself fails, no other node may fail or the cluster immediately loses quorum. For example, in a cluster with 15 nodes, 7 could fail before the cluster becomes inquorate. But, if a QDevice is configured here and it itself fails, no single node of the 15 may fail. The QDevice acts almost as a single point of failure in this case.

  • The fact that all but one node plus QDevice may fail sounds promising at first, but this may result in a mass recovery of HA services, which could overload the single remaining node. Furthermore, a Ceph server will stop providing services if only ((N-1)/2) nodes or fewer remain online.

If you understand the drawbacks and implications, you can decide yourself if you want to use this technology in an odd numbered cluster setup.
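The vote arithmetic behind these drawbacks can be checked with a short calculation. The sketch below assumes that every node carries exactly one vote and that quorum is a strict majority of all expected votes. The function names are invented for illustration and are not part of any Proxmox VE or corosync tool.

```shell
# Votes a QDevice contributes, per the rules above:
# 1 for an even node count, N-1 for an odd node count.
qdevice_votes() {
    if [ $(( $1 % 2 )) -eq 0 ]; then echo 1; else echo $(( $1 - 1 )); fi
}

# Strict majority for a given total number of votes (assumption: floor(total/2) + 1).
quorum_threshold() {
    echo $(( $1 / 2 + 1 ))
}

# Node failures a cluster of N nodes can survive.
# Second argument: "yes" if the QDevice vote is currently available, "no" otherwise.
tolerated_node_failures() {
    n=$1
    qv=$(qdevice_votes "$n")
    need=$(quorum_threshold $(( n + qv )))
    if [ "$2" = yes ]; then have_q=$qv; else have_q=0; fi
    # Nodes that must stay online so that node votes plus QDevice votes reach the threshold.
    min_nodes=$(( need - have_q ))
    if [ "$min_nodes" -lt 1 ]; then min_nodes=1; fi
    echo $(( n - min_nodes ))
}

tolerated_node_failures 15 yes   # prints 14
tolerated_node_failures 15 no    # prints 0
```

For the 15-node example, this reproduces both statements from the list above: with a working QDevice all nodes but one may fail, while a failed QDevice leaves no room for any node failure.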

5.10.3. QDevice-Net Setup

We recommend running any daemon which provides votes to corosync-qdevice as an unprivileged user. Proxmox VE and Debian provide a package which is already configured to do so. The traffic between the daemon and the cluster must be encrypted to ensure a safe and secure integration of the QDevice in Proxmox VE.

First, install the corosync-qnetd package on your external server

external# apt install corosync-qnetd

and the corosync-qdevice package on all cluster nodes

pve# apt install corosync-qdevice

After doing this, ensure that all the nodes in the cluster are online.

You can now set up your QDevice by running the following command on one of the Proxmox VE nodes:

pve# pvecm qdevice setup <QDEVICE-IP>

The SSH key from the cluster will be automatically copied to the QDevice.

Note Make sure to set up key-based access for the root user on your external server, or temporarily allow root login with password during the setup phase. If you receive an error such as Host key verification failed. at this stage, running pvecm updatecerts could fix the issue.

After all the steps have successfully completed, you will see "Done". You can verify that the QDevice has been set up with:

pve# pvecm status

...

Votequorum information
~~~~~~~~~~~~~~~~~~~~~
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes    Qdevice Name
    0x00000001      1    A,V,NMW 192.168.22.180 (local)
    0x00000002      1    A,V,NMW 192.168.22.181
    0x00000000      1            Qdevice

QDevice Status Flags

The status output of the QDevice, as seen above, will usually contain three columns:

  • A / NA: Alive or Not Alive. Indicates if the communication to the external corosync-qnetd daemon works.

  • V / NV: If the QDevice will cast a vote for the node. In a split-brain situation, where the corosync connection between the nodes is down, but they both can still communicate with the external corosync-qnetd daemon, only one node will get the vote.

  • MW / NMW: Master wins (MW) or not (NMW). Default is NMW, see the votequorum_qdevice_master_wins manual page (https://manpages.debian.org/bookworm/libvotequorum-dev/votequorum_qdevice_master_wins.3.en.html).

  • NR: QDevice is not registered.

Note If your QDevice is listed as Not Alive (NA in the output above), ensure that port 5403 (the default port of the qnetd server) of your external server is reachable via TCP/IP!
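If you want to check the result in a script, the Flags line of the status output can be evaluated. This is a small sketch that only relies on the output format shown above. The function name quorate_with_qdevice is hypothetical.

```shell
# Succeeds if the "Flags:" line read from stdin lists both Quorate and Qdevice.
quorate_with_qdevice() {
    awk '$1 == "Flags:" {
             for (i = 2; i <= NF; i++) {
                 if ($i == "Quorate") q = 1
                 if ($i == "Qdevice") d = 1
             }
         }
         END { exit !(q && d) }'
}

# Example:
#   pvecm status | quorate_with_qdevice && echo "cluster is quorate, QDevice registered"
```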

5.10.4. Frequently Asked Questions

Tie Breaking

In case of a tie, where two same-sized cluster partitions cannot see each other but can see the QDevice, the QDevice chooses one of those partitions randomly and provides a vote to it.

Possible Negative Implications

For clusters with an even node count, there are no negative implications when using a QDevice. If it fails to work, it is the same as not having a QDevice at all.

Adding/Deleting Nodes After QDevice Setup

If you want to add a new node or remove an existing one from a cluster with a QDevice setup, you need to remove the QDevice first. After that, you can add or remove nodes normally. Once you have a cluster with an even node count again, you can set up the QDevice again as described previously.

Removing the QDevice

If you used the official pvecm tool to add the QDevice, you can remove it by running:

pve# pvecm qdevice remove

5.11. Corosync Configuration

The /etc/pve/corosync.conf file plays a central role in a Proxmox VE cluster. It controls the cluster membership and its network. For further information about it, check the corosync.conf man page:

man corosync.conf

For node membership, you should always use the pvecm tool provided by Proxmox VE. You may have to edit the configuration file manually for other changes. Here are a few best practice tips for doing this.

5.11.1. Edit corosync.conf

Editing the corosync.conf file is not always very straightforward. There are two on each cluster node, one in /etc/pve/corosync.conf and the other in /etc/corosync/corosync.conf. Editing the one in our cluster file system will propagate the changes to the local one, but not vice versa.

The configuration will get updated automatically, as soon as the file changes. This means that changes which can be integrated in a running corosync will take effect immediately. Thus, you should always make a copy and edit that instead, to avoid triggering unintended changes when saving the file while editing.

cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new

Then, open the config file with your favorite editor, such as nano or vim.tiny, which come pre-installed on every Proxmox VE node.

Note Always increment the config_version number after configuration changes; omitting this can lead to problems.
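Forgetting to raise config_version is an easy mistake, so it can be worth scripting. The helper below is a sketch under the assumption that the file contains a single config_version: N line, as in the example configuration earlier in this chapter. The name bump_config_version is made up and not part of pvecm.

```shell
# Print the given corosync.conf-style file with config_version incremented by one.
# The input file itself is not modified.
bump_config_version() {
    awk '/^[ \t]*config_version:/ && match($0, /[0-9]+/) {
             n = substr($0, RSTART, RLENGTH) + 1
             $0 = substr($0, 1, RSTART - 1) n substr($0, RSTART + RLENGTH)
         }
         { print }' "$1"
}

# Example: write the result to a scratch file and review it before using it.
#   bump_config_version /etc/pve/corosync.conf.new > /tmp/corosync.conf.bumped
```

Because the function writes to standard output, nothing changes on disk until you explicitly copy the reviewed result over your working copy.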

After making the necessary changes, create another copy of the current working configuration file. This serves as a backup if the new configuration fails to apply or causes other issues.

cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak

Then replace the old configuration file with the new one:

mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf

You can check if the changes could be applied automatically, using the following commands:

systemctl status corosync
journalctl -b -u corosync

If the changes could not be applied automatically, you may have to restart the corosync service via:

systemctl restart corosync

On errors, check the troubleshooting section below.

5.11.2. Troubleshooting

Issue: quorum.expected_votes must be configured

When corosync starts to fail and you get the following message in the system log:

[...]
corosync[1647]:  [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
corosync[1647]:  [SERV  ] Service engine 'corosync_quorum' failed to load for reason
    'configuration error: nodelist or quorum.expected_votes must be configured!'
[...]

It means that the hostname you set for a corosync ringX_addr in the configuration could not be resolved.

Write Configuration When Not Quorate

If you need to change /etc/pve/corosync.conf on a node with no quorum, and you understand what you are doing, use:

pvecm expected 1

This sets the expected vote count to 1 and makes the cluster quorate. You can then fix your configuration, or revert it back to the last working backup.

This is not enough if corosync cannot start anymore. In that case, it is best to edit the local copy of the corosync configuration in /etc/corosync/corosync.conf, so that corosync can start again. Ensure that on all nodes, this configuration has the same content to avoid split-brain situations.

5.11.3. Corosync Configuration Glossary

ringX_addr

This names the different link addresses for the Kronosnet connections between nodes.

5.12. Cluster Cold Start

It is obvious that a cluster is not quorate when all nodes are offline. This is a common case after a power failure.

Note It is always a good idea to use an uninterruptible power supply (“UPS”, also called “battery backup”) to avoid this state, especially if you want HA.

On node startup, the pve-guests service is started and waits for quorum. Once quorate, it starts all guests which have the onboot flag set.

When you turn on nodes, or when power comes back after power failure, it is likely that some nodes will boot faster than others. Please keep in mind that guest startup is delayed until you reach quorum.

5.13. Guest VMID Auto-Selection

When creating new guests the web interface will ask the backend for a free VMID automatically. The default range for searching is 100 to 1000000 (lower than the maximal allowed VMID enforced by the schema).

Sometimes admins want to allocate new VMIDs in a separate range, for example to easily separate temporary VMs from ones whose VMID was chosen manually. Other times, it is just desired to provide VMIDs of a stable length, for which setting the lower boundary to, for example, 100000 leaves much more room.

To accommodate this use case, you can set the lower boundary, the upper boundary, or both via the datacenter.cfg configuration file, which can be edited in the web interface under Datacenter → Options.

Note The range is only used for the next-id API call, so it isn’t a hard limit.
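To illustrate what searching a range for a free VMID means, here is a toy model in shell. It is not the backend's implementation. The function name next_free_vmid and its argument order (lower bound, upper bound, then the IDs already in use) are assumptions made for this sketch.

```shell
# Print the lowest ID in [LOWER, UPPER] that is not in the list of used IDs.
# Returns non-zero if the whole range is taken.
next_free_vmid() {
    lower=$1; upper=$2; shift 2
    id=$lower
    while [ "$id" -le "$upper" ]; do
        case " $* " in
            *" $id "*) id=$(( id + 1 )) ;;   # already in use, try the next one
            *) echo "$id"; return 0 ;;
        esac
    done
    return 1
}

next_free_vmid 100 105 100 101 103   # prints 102
```

As the note above says, a manually chosen VMID outside the range is still allowed; the range only influences the suggested ID.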

5.14. Guest Migration

Migrating virtual guests to other nodes is a useful feature in a cluster. There are settings to control the behavior of such migrations. This can be done via the configuration file datacenter.cfg or for a specific migration via API or command-line parameters.

It makes a difference if a guest is online or offline, or if it has local resources (like a local disk).

For details about virtual machine migration, see the QEMU/KVM Migration Chapter.

For details about container migration, see the Container Migration Chapter.

5.14.1. Migration Type

The migration type defines if the migration data should be sent over an encrypted (secure) channel or an unencrypted (insecure) one. Setting the migration type to insecure means that the RAM content of a virtual guest is also transferred unencrypted, which can lead to information disclosure of critical data from inside the guest (for example, passwords or encryption keys).

Therefore, we strongly recommend using the secure channel if you do not have full control over the network and cannot guarantee that no one is eavesdropping on it.

Note Storage migration does not follow this setting. Currently, it always sends the storage content over a secure channel.

Encryption requires a lot of computing power, so this setting is often changed to insecure to achieve better performance. The impact on modern systems is lower because they implement AES encryption in hardware. The performance impact is particularly evident in fast networks, where you can transfer 10 Gbps or more.

5.14.2. Migration Network

By default, Proxmox VE uses the network in which cluster communication takes place to send the migration traffic. This is not optimal, both because sensitive cluster traffic can be disrupted and because this network may not have the best bandwidth available on the node.

Setting the migration network parameter allows the use of a dedicated network for all migration traffic. In addition to the memory, this also affects the storage traffic for offline migrations.

The migration network is set as a network using CIDR notation. This has the advantage that you don’t have to set individual IP addresses for each node. Proxmox VE can determine the real address on the destination node from the network specified in the CIDR form. To enable this, the network must be specified so that each node has exactly one IP in the respective network.
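The requirement that each node has exactly one IP in the migration network can be verified before changing the configuration. The following sketch handles IPv4 only and uses plain shell arithmetic. The function names ip_to_int, in_cidr and count_in_cidr are invented here, and you would have to collect the list of a node's addresses yourself.

```shell
# Convert a dotted IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS; IFS=.; set -- $1; IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeeds if ADDRESS ($1) lies inside the network NET/BITS ($2).
in_cidr() {
    net=${2%/*}; bits=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# Count how many of the given addresses fall into the network ($1).
count_in_cidr() {
    cidr=$1; shift
    count=0
    for addr in "$@"; do
        if in_cidr "$addr" "$cidr"; then count=$(( count + 1 )); fi
    done
    echo "$count"
}

count_in_cidr 10.1.2.0/24 10.1.1.1 10.1.2.1   # prints 1
```

A result other than 1 for any node means that the address on that node cannot be determined unambiguously from the CIDR specification.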

Example

We assume that we have a three-node setup, with three separate networks. One for public communication with the Internet, one for cluster communication, and a very fast one, which we want to use as a dedicated network for migration.

A network configuration for such a setup might look as follows:

iface eno1 inet manual

# public network
auto vmbr0
iface vmbr0 inet static
    address 192.X.Y.57/24
    gateway 192.X.Y.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0

# cluster network
auto eno2
iface eno2 inet static
    address  10.1.1.1/24

# fast network
auto eno3
iface eno3 inet static
    address  10.1.2.1/24

Here, we will use the network 10.1.2.0/24 as a migration network. For a single migration, you can do this using the migration_network parameter of the command-line tool:

# qm migrate 106 tre --online --migration_network 10.1.2.0/24

To configure this as the default network for all migrations in the cluster, set the migration property of the /etc/pve/datacenter.cfg file:

# use dedicated migration network
migration: secure,network=10.1.2.0/24

Note The migration type must always be set when the migration network is set in /etc/pve/datacenter.cfg.

6. Proxmox Cluster File System (pmxcfs)

The Proxmox Cluster file system (“pmxcfs”) is a database-driven file system for storing configuration files, replicated in real time to all cluster nodes using corosync. We use this to store all Proxmox VE related configuration files.

Although the file system stores all data inside a persistent database on disk, a copy of the data resides in RAM. This imposes restrictions on the maximum size, which is currently 128 MiB. This is still enough to store the configuration of several thousand virtual machines.

This system provides the following advantages:

  • Seamless replication of all configuration to all nodes in real time

  • Provides strong consistency checks to avoid duplicate VM IDs

  • Read-only when a node loses quorum

  • Automatic updates of the corosync cluster configuration to all nodes

  • Includes a distributed locking mechanism

6.1. POSIX Compatibility

The file system is based on FUSE, so the behavior is POSIX-like. But some features are simply not implemented, because we do not need them:

  • You can just generate normal files and directories, but no symbolic links, …

  • You can’t rename non-empty directories (because this makes it easier to guarantee that VMIDs are unique).

  • You can’t change file permissions (permissions are based on paths)

  • O_EXCL creates are not atomic (like old NFS)

  • O_TRUNC creates are not atomic (FUSE restriction)

6.2. File Access Rights

All files and directories are owned by user root and have group www-data. Only root has write permissions, but group www-data can read most files. Files below the following paths are only accessible by root:

/etc/pve/priv/
/etc/pve/nodes/${NAME}/priv/

6.3. Technology

We use the Corosync Cluster Engine for cluster communication, and SQLite for the database file. The file system is implemented in user space using FUSE.

6.4. File System Layout

The file system is mounted at:

/etc/pve

6.4.1. Files

authkey.pub

Public key used by the ticket system

ceph.conf

Ceph configuration file (note: /etc/ceph/ceph.conf is a symbolic link to this)

corosync.conf

Corosync cluster configuration file (prior to Proxmox VE 4.x, this file was called cluster.conf)

datacenter.cfg

Proxmox VE datacenter-wide configuration (keyboard layout, proxy, …)

domains.cfg

Proxmox VE authentication domains

firewall/cluster.fw

Firewall configuration applied to all nodes

firewall/<NAME>.fw

Firewall configuration for individual nodes

firewall/<VMID>.fw

Firewall configuration for VMs and containers

ha/crm_commands

Displays HA operations that are currently being carried out by the CRM

ha/manager_status

JSON-formatted information regarding HA services on the cluster

ha/resources.cfg

Resources managed by high availability, and their current state

nodes/<NAME>/config

Node-specific configuration

nodes/<NAME>/lxc/<VMID>.conf

VM configuration data for LXC containers

nodes/<NAME>/openvz/

Prior to Proxmox VE 4.0, used for container configuration data (deprecated, removed soon)

nodes/<NAME>/pve-ssl.key

Private SSL key for pve-ssl.pem

nodes/<NAME>/pve-ssl.pem

Public SSL certificate for web server (signed by cluster CA)

nodes/<NAME>/pveproxy-ssl.key

Private SSL key for pveproxy-ssl.pem (optional)

nodes/<NAME>/pveproxy-ssl.pem

Public SSL certificate (chain) for web server (optional override for pve-ssl.pem)

nodes/<NAME>/qemu-server/<VMID>.conf

VM configuration data for KVM VMs

priv/authkey.key

Private key used by ticket system

priv/authorized_keys

SSH keys of cluster members for authentication

priv/ceph*

Ceph authentication keys and associated capabilities

priv/known_hosts

SSH keys of the cluster members for verification

priv/lock/*

Lock files used by various services to ensure safe cluster-wide operations

priv/pve-root-ca.key

Private key of cluster CA

priv/shadow.cfg

Shadow password file for PVE Realm users

priv/storage/<STORAGE-ID>.pw

Contains the password of a storage in plain text

priv/tfa.cfg

Base64-encoded two-factor authentication configuration

priv/token.cfg

API token secrets of all tokens

pve-root-ca.pem

Public certificate of cluster CA

pve-www.key

Private key used for generating CSRF tokens

sdn/*

Shared configuration files for Software Defined Networking (SDN)

status.cfg

Proxmox VE external metrics server configuration

storage.cfg

Proxmox VE storage configuration

user.cfg

Proxmox VE access control configuration (users/groups/…)

virtual-guest/cpu-models.conf

For storing custom CPU models

vzdump.cron

Cluster-wide vzdump backup-job schedule

6.4.2. Symbolic links

Certain directories within the cluster file system use symbolic links, in order to point to a node’s own configuration files. Thus, the files pointed to in the table below refer to different files on each node of the cluster.

local

nodes/<LOCAL_HOST_NAME>

lxc

nodes/<LOCAL_HOST_NAME>/lxc/

openvz

nodes/<LOCAL_HOST_NAME>/openvz/ (deprecated, removed soon)

qemu-server

nodes/<LOCAL_HOST_NAME>/qemu-server/

6.4.3. Special status files for debugging (JSON)

.version

File versions (to detect file modifications)

.members

Info about cluster members

.vmlist

List of all VMs

.clusterlog

Cluster log (last 50 entries)

.rrd

RRD data (most recent entries)

6.4.4. Enable/Disable debugging

You can enable verbose syslog messages with:

echo "1" >/etc/pve/.debug

And disable verbose syslog messages with:

echo "0" >/etc/pve/.debug

6.5. Recovery

If you have major problems with your Proxmox VE host, for example hardware issues, it could be helpful to copy the pmxcfs database file /var/lib/pve-cluster/config.db, and move it to a new Proxmox VE host. On the new host (with nothing running), you need to stop the pve-cluster service and replace the config.db file (required permissions 0600). Following this, adapt /etc/hostname and /etc/hosts according to the lost Proxmox VE host, then reboot and check (and don’t forget your VM/CT data).
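The file replacement step can be sketched as follows. The helper name restore_config_db and the source path /root/config.db are hypothetical. The optional second argument exists only so that the function can be tried against a scratch location first.

```shell
# Copy a saved pmxcfs database into place with the required 0600 permissions.
# $1: saved database file, $2: destination (defaults to the path named above).
restore_config_db() {
    src=$1
    dest=${2:-/var/lib/pve-cluster/config.db}
    if [ ! -f "$src" ]; then
        echo "missing database file: $src" >&2
        return 1
    fi
    install -m 0600 "$src" "$dest"
}

# On the new host, after stopping the pve-cluster service:
#   systemctl stop pve-cluster
#   restore_config_db /root/config.db
```

Adapting /etc/hostname and /etc/hosts and the final reboot remain manual steps.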

6.5.1. Remove Cluster Configuration

The recommended way is to reinstall the node after you remove it from your cluster. This ensures that all secret cluster/ssh keys and any shared configuration data are destroyed.

In some cases, you might prefer to put a node back to local mode without reinstalling, which is described in Separate A Node Without Reinstalling.

6.5.2. Recovering/Moving Guests from Failed Nodes

For the guest configuration files in nodes/<NAME>/qemu-server/ (VMs) and nodes/<NAME>/lxc/ (containers), Proxmox VE sees the containing node <NAME> as the owner of the respective guest. This concept enables the usage of local locks instead of expensive cluster-wide locks for preventing concurrent guest configuration changes.

As a consequence, if the owning node of a guest fails (for example, due to a power outage, fencing event, etc.), a regular migration is not possible (even if all the disks are located on shared storage), because such a local lock on the (offline) owning node is unobtainable. This is not a problem for HA-managed guests, as Proxmox VE’s High Availability stack includes the necessary (cluster-wide) locking and watchdog functionality to ensure correct and automatic recovery of guests from fenced nodes.

If a non-HA-managed guest has only shared disks (and no other local resources which are only available on the failed node), a manual recovery is possible by simply moving the guest configuration file from the failed node’s directory in /etc/pve/ to an online node’s directory (which changes the logical owner or location of the guest).

For example, recovering the VM with ID 100 from an offline node1 to another node node2 works by running the following command as root on any member node of the cluster:

mv /etc/pve/nodes/node1/qemu-server/100.conf /etc/pve/nodes/node2/qemu-server/

Warning Before manually recovering a guest like this, make absolutely sure that the failed source node is really powered off/fenced. Otherwise Proxmox VE’s locking principles are violated by the mv command, which can have unexpected consequences.

Warning Guests with local disks (or other local resources which are only available on the offline node) are not recoverable like this. Either wait for the failed node to rejoin the cluster or restore such guests from backups.
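If you perform such recoveries more than once, wrapping the mv command in a few sanity checks can prevent typos from doing damage. The function below is only a sketch: recover_vm_config is a made-up name, the fourth argument exists purely for testing against a scratch directory, and it cannot verify that the failed node is really powered off, which you still have to confirm yourself.

```shell
# Move the configuration of VM $1 from node $2 to node $3.
# $4: base directory, defaults to /etc/pve.
recover_vm_config() {
    vmid=$1; from=$2; to=$3; base=${4:-/etc/pve}
    src=$base/nodes/$from/qemu-server/$vmid.conf
    dst_dir=$base/nodes/$to/qemu-server
    if [ ! -f "$src" ]; then
        echo "no such guest configuration: $src" >&2
        return 1
    fi
    if [ ! -d "$dst_dir" ]; then
        echo "target directory does not exist: $dst_dir" >&2
        return 1
    fi
    if [ -e "$dst_dir/$vmid.conf" ]; then
        echo "target already has a configuration for VM $vmid" >&2
        return 1
    fi
    mv "$src" "$dst_dir/"
}

# Example, equivalent to the mv command above:
#   recover_vm_config 100 node1 node2
```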

7. Proxmox VE Storage

The Proxmox VE storage model is very flexible. Virtual machine images can either be stored on one or several local storages, or on shared storage like NFS or iSCSI (NAS, SAN). There are no limits, and you may configure as many storage pools as you like. You can use all storage technologies available for Debian Linux.

One major benefit of storing VMs on shared storage is the ability to live-migrate running machines without any downtime, as all nodes in the cluster have direct access to VM disk images. There is no need to copy VM image data, so live migration is very fast in that case.

The storage library (package libpve-storage-perl) uses a flexible plugin system to provide a common interface to all storage types. This can be easily adapted to include further storage types in the future.

7.1. Storage Types

There are basically two different classes of storage types:

File level storage

File level based storage technologies allow access to a fully featured (POSIX) file system. They are in general more flexible than any Block level storage (see below), and allow you to store content of any type. ZFS is probably the most advanced system, and it has full support for snapshots and clones.

Block level storage

Allows you to store large raw images. It is usually not possible to store other files (ISO, backups, ...) on such storage types. Most modern block level storage implementations support snapshots and clones. RADOS and GlusterFS are distributed systems, replicating storage data to different nodes.

Table 2. Available storage types

Description       Plugin type   Level      Shared   Snapshots   Stable
ZFS (local)       zfspool       both [1]   no       yes         yes
Directory         dir           file       no       no [2]      yes
BTRFS             btrfs         file       no       yes         technology preview
NFS               nfs           file       yes      no [2]      yes
CIFS              cifs          file       yes      no [2]      yes
Proxmox Backup    pbs           both       yes      n/a         yes
GlusterFS         glusterfs     file       yes      no [2]      yes
CephFS            cephfs        file       yes      yes         yes
LVM               lvm           block      no [3]   no          yes
LVM-thin          lvmthin       block      no       yes         yes
iSCSI/kernel      iscsi         block      yes      no          yes
iSCSI/libiscsi    iscsidirect   block      yes      no          yes
Ceph/RBD          rbd           block      yes      yes         yes
ZFS over iSCSI    zfs           block      yes      yes         yes

1: Disk images for VMs are stored in ZFS volume (zvol) datasets, which provide block device functionality.

2: On file based storages, snapshots are possible with the qcow2 format.

3: It is possible to use LVM on top of an iSCSI or FC-based storage. That way you get a shared LVM storage.

7.1.1. Thin Provisioning

A number of storages, and the QEMU image format qcow2, support thin provisioning. With thin provisioning activated, only the blocks that the guest system actually uses will be written to the storage.

Say, for instance, you create a VM with a 32 GB hard disk, and after installing the guest system OS, the root file system of the VM contains 3 GB of data. In that case, only 3 GB are written to the storage, even if the guest VM sees a 32 GB hard drive. In this way, thin provisioning allows you to create disk images which are larger than the currently available storage blocks. You can create large disk images for your VMs, and when the need arises, add more disks to your storage without resizing the VMs' file systems.

All storage types which have the “Snapshots” feature also support thin provisioning.

Caution If a storage runs full, all guests using volumes on that storage receive IO errors. This can cause file system inconsistencies and may corrupt your data. So it is advisable to avoid over-provisioning of your storage resources, or carefully observe free space to avoid such conditions.
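A simple way to keep an eye on over-provisioning is to compare the sum of all provisioned disk sizes with the physical capacity of the storage. The sketch below only does the arithmetic. The function name overcommit_percent is invented, and gathering the actual sizes from your storage is left to you.

```shell
# Print provisioned space as a percentage of the pool capacity.
# $1: pool capacity, remaining arguments: provisioned disk sizes (same unit, e.g. GB).
overcommit_percent() {
    pool=$1; shift
    total=0
    for size in "$@"; do
        total=$(( total + size ))
    done
    echo $(( total * 100 / pool ))
}

overcommit_percent 100 32 32 64   # prints 128
```

A value above 100 means that the guests could, in the worst case, try to write more data than the storage can hold.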

7.2. Storage Configuration

All Proxmox VE related storage configuration is stored within a single text file at /etc/pve/storage.cfg. As this file is within /etc/pve/, it gets automatically distributed to all cluster nodes. So all nodes share the same storage configuration.

Sharing storage configuration makes perfect sense for shared storage, because the same “shared” storage is accessible from all nodes. But it is also useful for local storage types. In this case such local storage is available on all nodes, but it is physically different and can have totally different content.

7.2.1. Storage Pools

Each storage pool has a <type>, and is uniquely identified by its <STORAGE_ID>. A pool configuration looks like this:

<type>: <STORAGE_ID>
        <property> <value>
        <property> <value>
        <property>
        ...

The <type>: <STORAGE_ID> line starts the pool definition, which is then followed by a list of properties. Most properties require a value. Some have reasonable defaults, in which case you can omit the value.

To be more specific, take a look at the default storage configuration after installation. It contains one special local storage pool named local, which refers to the directory /var/lib/vz and is always available. The Proxmox VE installer creates additional storage entries depending on the storage type chosen at installation time.

Default storage configuration (/etc/pve/storage.cfg)
dir: local
        path /var/lib/vz
        content iso,vztmpl,backup

# default image store on LVM based installation
lvmthin: local-lvm
        thinpool data
        vgname pve
        content rootdir,images

# default image store on ZFS based installation
zfspool: local-zfs
        pool rpool/data
        sparse
        content images,rootdir
Caution It is problematic to have multiple storage configurations pointing to the exact same underlying storage. Such an aliased storage configuration can lead to two different volume IDs (volid) pointing to the exact same disk image. Proxmox VE expects that the images which volume IDs point to are unique. Choosing different content types for aliased storage configurations can be fine, but is not recommended.
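
To make the pool definition format more concrete, the following Python sketch reads text in that layout into a dictionary. It is not the parser used by Proxmox VE; the function name parse_storage_cfg is hypothetical, and the sketch assumes that pool headers start in the first column while properties are indented.

```python
def parse_storage_cfg(text):
    """Parse storage.cfg-style text into {storage_id: {property: value}}."""
    pools = {}
    current = None
    for raw in text.splitlines():
        stripped = raw.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if not raw[0].isspace():
            # "<type>: <STORAGE_ID>" starts a new pool definition.
            storage_type, sep, storage_id = stripped.partition(":")
            if not sep or not storage_id.strip():
                raise ValueError(f"malformed pool header: {raw!r}")
            current = {"type": storage_type.strip()}
            pools[storage_id.strip()] = current
        elif current is None:
            raise ValueError(f"property outside of a pool: {raw!r}")
        else:
            # "<property> <value>", or a bare "<property>" used as a flag.
            parts = stripped.split(None, 1)
            current[parts[0]] = parts[1] if len(parts) > 1 else True
    return pools


SAMPLE = """\
dir: local
        path /var/lib/vz
        content iso,vztmpl,backup

# default image store on ZFS based installation
zfspool: local-zfs
        pool rpool/data
        sparse
        content images,rootdir
"""

pools = parse_storage_cfg(SAMPLE)
print(pools["local-zfs"])
```

A property without a value, such as sparse, is stored as True in this sketch.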

7.2.2. Common Storage Properties

A few storage properties are common among different storage types.

nodes

List of cluster node names where this storage is usable/accessible. One can use this property to restrict storage access to a limited set of nodes.

content

A storage can support several content types, for example virtual disk images, CD-ROM ISO images, container templates or container root directories. Not all storage types support all content types. One can set this property to select what this storage is used for.

images

QEMU/KVM VM images.

rootdir

Allows storing container data.

vztmpl

Container templates.

backup

Backup files (vzdump).

iso

ISO images.

snippets

Snippet files, for example guest hook scripts.

shared

Indicate that this is a single storage with the same contents on all nodes (or all listed in the nodes option). It will not make the contents of a local storage automatically accessible to other nodes, it just marks an already shared storage as such!

disable

You can use this flag to disable the storage completely.

maxfiles

Deprecated, please use prune-backups instead. Maximum number of backup files per VM. Use 0 for unlimited.

prune-backups

Retention options for backups. For details, see Backup Retention.

format

Default image format (raw|qcow2|vmdk)

preallocation

Preallocation mode (off|metadata|falloc|full) for raw and qcow2 images on file-based storages. The default is metadata, which is treated like off for raw images. When using network storages in combination with large qcow2 images, using off can help to avoid timeouts.

Warning It is not advisable to use the same storage pool on different Proxmox VE clusters. Some storage operations need exclusive access to the storage, so proper locking is required. While this is implemented within a cluster, it does not work between different clusters.
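
As an illustration of how the nodes and disable properties interact, the following Python sketch decides whether a pool should be considered usable on a given node. The function name storage_usable_on and the node names are hypothetical, and the sketch assumes that nodes holds a comma-separated list of node names and that a pool is given as a plain dictionary of its properties.

```python
def storage_usable_on(pool, node):
    """Return True if the pool (a dict of properties) is usable on node."""
    # A set disable flag switches the storage off everywhere.
    if pool.get("disable") in (True, "1"):
        return False
    nodes = pool.get("nodes")
    if not nodes:
        return True  # no restriction: usable on all nodes
    return node in [name.strip() for name in nodes.split(",")]


restricted = {"type": "dir", "path": "/mnt/backup", "nodes": "node1,node2"}
print(storage_usable_on(restricted, "node1"))
print(storage_usable_on(restricted, "node3"))
```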

7.3. Volumes

We use a special notation to address storage data. When you allocate data from a storage pool, it returns such a volume identifier. A volume is identified by the <STORAGE_ID>, followed by a storage type dependent volume name, separated by a colon. A valid <VOLUME_ID> looks like:

local:230/example-image.raw
local:iso/debian-501-amd64-netinst.iso
local:vztmpl/debian-5.0-joomla_1.5.9-1_i386.tar.gz
iscsi-storage:0.0.2.scsi-14f504e46494c4500494b5042546d2d646744372d31616d61

To get the file system path for a <VOLUME_ID> use:

pvesm path <VOLUME_ID>

7.3.1. Volume Ownership

There exists an ownership relation for image type volumes. Each such volume is owned by a VM or Container. For example volume local:230/example-image.raw is owned by VM 230. Most storage backends encode this ownership information into the volume name.

When you remove a VM or Container, the system also removes all associated volumes which are owned by that VM or Container.
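
The volume ID notation and the ownership encoding can be illustrated with a short Python sketch. It is not part of Proxmox VE; the helper names are hypothetical, and the owner lookup assumes the <VMID>/<name> form shown in the example above, which does not apply to every backend.

```python
import re


def split_volume_id(volid):
    """Split '<STORAGE_ID>:<volume name>' at the first colon."""
    storage_id, sep, volname = volid.partition(":")
    if not sep or not storage_id or not volname:
        raise ValueError(f"not a volume ID: {volid!r}")
    return storage_id, volname


def owner_vmid(volid):
    """Return the owning VMID for '<VMID>/<name>' volume names, else None."""
    _, volname = split_volume_id(volid)
    match = re.match(r"(\d+)/", volname)
    return int(match.group(1)) if match else None


print(split_volume_id("local:230/example-image.raw"))
print(owner_vmid("local:230/example-image.raw"))
print(owner_vmid("local:iso/debian-501-amd64-netinst.iso"))
```

ISO images and container templates have no owner, so the sketch returns None for them.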

7.4. Using the Command-line Interface

It is recommended to familiarize yourself with the concept behind storage pools and volume identifiers, but in real life, you are not forced to do any of those low level operations on the command line. Normally, allocation and removal of volumes is done by the VM and Container management tools.

Nevertheless, there is a command-line tool called pvesm (“Proxmox VE Storage Manager”), which is able to perform common storage management tasks.

7.4.1. Examples

Add storage pools

pvesm add <TYPE> <STORAGE_ID> <OPTIONS>
pvesm add dir <STORAGE_ID> --path <PATH>
pvesm add nfs <STORAGE_ID> --path <PATH> --server <SERVER> --export <EXPORT>
pvesm add lvm <STORAGE_ID> --vgname <VGNAME>
pvesm add iscsi <STORAGE_ID> --portal <HOST[:PORT]> --target <TARGET>

Disable storage pools

pvesm set <STORAGE_ID> --disable 1

Enable storage pools

pvesm set <STORAGE_ID> --disable 0

Change/set storage options

pvesm set <STORAGE_ID> <OPTIONS>
pvesm set <STORAGE_ID> --shared 1
pvesm set local --format qcow2
pvesm set <STORAGE_ID> --content iso

Remove storage pools. This does not delete any data, and does not disconnect or unmount anything. It just removes the storage configuration.

pvesm remove <STORAGE_ID>

Allocate volumes

pvesm alloc <STORAGE_ID> <VMID> <name> <size> [--format <raw|qcow2>]

Allocate a 4G volume in local storage. The name is auto-generated if you pass an empty string as <name>.

pvesm alloc local <VMID> '' 4G

Free volumes

pvesm free <VOLUME_ID>
Warning This really destroys all volume data.

List storage status

pvesm status

List storage contents

pvesm list <STORAGE_ID> [--vmid <VMID>]

List volumes allocated by VMID

pvesm list <STORAGE_ID> --vmid <VMID>

List iso images

pvesm list <STORAGE_ID> --content iso

List container templates

pvesm list <STORAGE_ID> --content vztmpl

Show file system path for a volume

pvesm path <VOLUME_ID>

Export the volume local:103/vm-103-disk-0.qcow2 to the file target. This is mostly used internally with pvesm import. The stream format qcow2+size is different from the qcow2 format. Consequently, the exported file cannot simply be attached to a VM. This also holds for the other formats.

pvesm export local:103/vm-103-disk-0.qcow2 qcow2+size target --with-snapshots 1
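
If you script storage setup, the pvesm add invocations listed above can be assembled programmatically. The following Python sketch only builds the argument list and does not run anything; the helper name pvesm_add is hypothetical, and the NFS values are example values.

```python
import shlex


def pvesm_add(storage_type, storage_id, options):
    """Build the argument list for 'pvesm add <TYPE> <STORAGE_ID> <OPTIONS>'."""
    argv = ["pvesm", "add", storage_type, storage_id]
    for key, value in options.items():
        argv += ["--" + key, str(value)]
    return argv


cmd = pvesm_add(
    "nfs",
    "iso-templates",
    {
        "path": "/mnt/pve/iso-templates",
        "server": "10.0.0.10",
        "export": "/space/iso-templates",
    },
)
# shlex.join quotes each argument for display in a shell.
print(shlex.join(cmd))
```

Keeping the command as a list, rather than a single string, avoids shell quoting problems when it is later handed to a process launcher.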

7.5. Directory Backend

Storage pool type: dir

Proxmox VE can use local directories or locally mounted shares for storage. A directory is a file level storage, so you can store any content type like virtual disk images, containers, templates, ISO images or backup files.

Note You can mount additional storages via the standard Linux /etc/fstab, and then define a directory storage for that mount point. This way you can use any file system supported by Linux.

This backend assumes that the underlying directory is POSIX compatible, but nothing else. This implies that you cannot create snapshots at the storage level. But there exists a workaround for VM images using the qcow2 file format, because that format supports snapshots internally.

Tip Some storage types do not support O_DIRECT, so you can’t use cache mode none with such storages. Simply use cache mode writeback instead.

We use a predefined directory layout to store different content types into different sub-directories. This layout is used by all file level storage backends.

Table 3. Directory layout
Content type         Subdir
VM images            images/<VMID>/
ISO images           template/iso/
Container templates  template/cache/
Backup files         dump/
Snippets             snippets/

7.5.1. Configuration

This backend supports all common storage properties, and adds two additional properties. The path property is used to specify the directory. This needs to be an absolute file system path.

The optional content-dirs property allows for the default layout to be changed. It consists of a comma-separated list of identifiers in the following format:

vtype=path

Where vtype is one of the allowed content types for the storage, and path is a path relative to the mountpoint of the storage.

Configuration Example (/etc/pve/storage.cfg)
dir: backup
        path /mnt/backup
        content backup
        prune-backups keep-last=7
        max-protected-backups 3
        content-dirs backup=custom/backup/dir

The above configuration defines a storage pool called backup. That pool can be used to store up to 7 regular backups (keep-last=7) and 3 protected backups per VM. The real path for the backup files is /mnt/backup/custom/backup/dir/....
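
How the default layout and a content-dirs override combine into a real path can be sketched as follows. This Python fragment is illustrative only; the names DEFAULT_SUBDIRS, parse_content_dirs and content_dir are hypothetical, and the default mapping simply restates the directory layout table above.

```python
import posixpath

# Default sub-directories, taken from the directory layout table.
DEFAULT_SUBDIRS = {
    "images": "images",
    "iso": "template/iso",
    "vztmpl": "template/cache",
    "backup": "dump",
    "snippets": "snippets",
}


def parse_content_dirs(value):
    """Parse a comma-separated list of vtype=path items into a dict."""
    overrides = {}
    for item in value.split(","):
        vtype, sep, path = item.strip().partition("=")
        if not sep or not vtype or not path:
            raise ValueError(f"expected vtype=path, got {item!r}")
        overrides[vtype] = path.lstrip("/")  # paths are relative to the storage
    return overrides


def content_dir(storage_path, vtype, content_dirs=""):
    """Return the directory used for vtype on a directory storage."""
    subdirs = dict(DEFAULT_SUBDIRS)
    if content_dirs:
        subdirs.update(parse_content_dirs(content_dirs))
    return posixpath.join(storage_path, subdirs[vtype])


print(content_dir("/mnt/backup", "backup", "backup=custom/backup/dir"))
print(content_dir("/var/lib/vz", "iso"))
```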

7.5.2. File naming conventions

This backend uses a well defined naming scheme for VM images:

vm-<VMID>-<NAME>.<FORMAT>
<VMID>

This specifies the owner VM.

<NAME>

This can be an arbitrary name (ASCII) without white space. The backend uses disk-[N] as default, where [N] is replaced by an integer to make the name unique.

<FORMAT>

Specifies the image format (raw|qcow2|vmdk).

When you create a VM template, all VM images are renamed to indicate that they are now read-only, and can be used as a base image for clones:

base-<VMID>-<NAME>.<FORMAT>
Note Such base images are used to generate cloned images. So it is important that those files are read-only, and never get modified. The backend changes the access mode to 0444, and sets the immutable flag (chattr +i) if the storage supports that.
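
The naming scheme above can be checked with a regular expression. The following Python sketch is an illustration rather than the pattern Proxmox VE uses internally; the names IMAGE_NAME and parse_image_name are hypothetical.

```python
import re

# vm-<VMID>-<NAME>.<FORMAT> for normal images, base-<VMID>-<NAME>.<FORMAT> for templates.
IMAGE_NAME = re.compile(
    r"^(?P<kind>vm|base)-(?P<vmid>\d+)-(?P<name>\S+)\.(?P<format>raw|qcow2|vmdk)$"
)


def parse_image_name(filename):
    """Return the parts of an image file name, or None if it does not conform."""
    match = IMAGE_NAME.match(filename)
    if match is None:
        return None
    parts = match.groupdict()
    parts["vmid"] = int(parts["vmid"])
    return parts


print(parse_image_name("vm-100-disk10.raw"))
print(parse_image_name("base-100-disk-0.qcow2"))
print(parse_image_name("notes.txt"))
```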

7.5.3. Storage Features

As mentioned above, most file systems do not support snapshots out of the box. To work around that problem, this backend is able to use qcow2 internal snapshot capabilities.

The same applies to clones. The backend uses the qcow2 base image feature to create clones.

Table 4. Storage features for backend dir
Content types                              Image formats          Shared  Snapshots  Clones
images rootdir vztmpl iso backup snippets  raw qcow2 vmdk subvol  no      qcow2      qcow2

7.5.4. Examples

Please use the following command to allocate a 4GB image on storage local:

# pvesm alloc local 100 vm-100-disk10.raw 4G
Formatting '/var/lib/vz/images/100/vm-100-disk10.raw', fmt=raw size=4294967296
successfully created 'local:100/vm-100-disk10.raw'
Note The image name must conform to the above naming conventions.

The real file system path is shown with:

# pvesm path local:100/vm-100-disk10.raw
/var/lib/vz/images/100/vm-100-disk10.raw

And you can remove the image with:

# pvesm free local:100/vm-100-disk10.raw

7.6. NFS Backend

Storage pool type: nfs

The NFS backend is based on the directory backend, so it shares most properties. The directory layout and the file naming conventions are the same. The main advantage is that you can directly configure the NFS server properties, so the backend can mount the share automatically. There is no need to modify /etc/fstab. The backend can also test if the server is online, and provides a method to query the server for exported shares.

7.6.1. Configuration

The backend supports all common storage properties, except the shared flag, which is always set. Additionally, the following properties are used to configure the NFS server:

server

Server IP or DNS name. To avoid DNS lookup delays, it is usually preferable to use an IP address instead of a DNS name - unless you have a very reliable DNS server, or list the server in the local /etc/hosts file.

export

NFS export path (as listed by pvesm nfsscan).

You can also set NFS mount options:

path

The local mount point (defaults to /mnt/pve/<STORAGE_ID>/).

content-dirs

Overrides for the default directory layout. Optional.

options

NFS mount options (see man nfs).

Configuration Example (/etc/pve/storage.cfg)
nfs: iso-templates
        path /mnt/pve/iso-templates
        server 10.0.0.10
        export /space/iso-templates
        options vers=3,soft
        content iso,vztmpl
Tip After an NFS request times out, NFS requests are retried indefinitely by default. This can lead to unexpected hangs on the client side. For read-only content, it is worth considering the NFS soft option, which limits the number of retries to three.
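
When storage definitions are generated by a script, a stanza like the NFS example above can be rendered from a mapping of properties. This Python sketch is illustrative only; the function name render_pool is hypothetical, and on a real system changes are normally made with pvesm or the web interface rather than by writing the file directly.

```python
def render_pool(storage_type, storage_id, properties):
    """Render one '<type>: <STORAGE_ID>' stanza with indented properties."""
    lines = [f"{storage_type}: {storage_id}"]
    for key, value in properties.items():
        if value is True:
            lines.append(f"        {key}")  # flag without a value
        else:
            lines.append(f"        {key} {value}")
    return "\n".join(lines) + "\n"


stanza = render_pool(
    "nfs",
    "iso-templates",
    {
        "path": "/mnt/pve/iso-templates",
        "server": "10.0.0.10",
        "export": "/space/iso-templates",
        "options": "vers=3,soft",
        "content": "iso,vztmpl",
    },
)
print(stanza, end="")
```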

7.6.2. Storage Features

NFS does not support snapshots, but the backend uses qcow2 features to implement snapshots and cloning.

Table 5. Storage features for backend nfs
Content types                              Image formats   Shared  Snapshots  Clones
images rootdir vztmpl iso backup snippets  raw qcow2 vmdk  yes     qcow2      qcow2

7.6.3. Examples

You can get a list of exported NFS shares with:

# pvesm nfsscan <server>

7.7. CIFS Backend

Storage pool type: cifs

The CIFS backend extends the directory backend, so that no manual setup of a CIFS mount is needed. Such a storage can be added directly through the Proxmox VE API or the web UI, with all our backend advantages, like server heartbeat check or convenient selection of exported shares.

7.7.1. Configuration

The backend supports all common storage properties, except the shared flag, which is always set. Additionally, the following CIFS-specific properties are available:

server

Server IP or DNS name. Required.

Tip To avoid DNS lookup delays, it is usually preferable to use an IP address instead of a DNS name - unless you have a very reliable DNS server, or list the server in the local /etc/hosts file.
share

CIFS share to use (get available ones with pvesm scan cifs <address> or the web UI). Required.

username

The username for the CIFS storage. Optional, defaults to ‘guest’.

password

The user password. Optional. It will be saved in a file only readable by root (/etc/pve/priv/storage/<STORAGE-ID>.pw).

domain

Sets the user domain (workgroup) for this storage. Optional.

smbversion

SMB protocol version. Optional, default is 3. SMB1 is not supported due to security issues.

path

The local mount point. Optional, defaults to /mnt/pve/<STORAGE_ID>/.

content-dirs

Overrides for the default directory layout. Optional.

options

Additional CIFS mount options (see man mount.cifs). Some options are set automatically and shouldn’t be set here. Proxmox VE will always set the option soft. Depending on the configuration, these options are set automatically: username, credentials, guest, domain, vers.

subdir

The subdirectory of the share to mount. Optional, defaults to the root directory of the share.

Configuration Example (/etc/pve/storage.cfg)
cifs: backup
        path /mnt/pve/backup
        server 10.0.0.11
        share VMData
        content backup
        options noserverino,echo_interval=30
        username anna
        smbversion 3
        subdir /data

7.7.2. Storage Features

CIFS does not support snapshots on a storage level. But you may use qcow2 backing files if you still want to have snapshots and cloning features available.

Table 6. Storage features for backend cifs
Content types                              Image formats   Shared  Snapshots  Clones
images rootdir vztmpl iso backup snippets  raw qcow2 vmdk  yes     qcow2      qcow2

7.7.3. Examples

You can get a list of exported CIFS shares with:

# pvesm scan cifs <server> [--username <username>] [--password]

Then you could add this share as a storage to the whole Proxmox VE cluster with:

# pvesm add cifs <storagename> --server <server> --share <share> [--username <username>] [--password]

7.8. Proxmox Backup Server

Storage pool type: pbs

This backend allows direct integration of a Proxmox Backup Server into Proxmox VE like any other storage. A Proxmox Backup storage can be added directly through the Proxmox VE API, CLI or the web interface.

7.8.1. Configuration

The backend supports all common storage properties, except the shared flag, which is always set. Additionally, the following properties specific to Proxmox Backup Server are available:

server

Server IP or DNS name. Required.

port

Use this port instead of the default one, i.e. 8007. Optional.

username

The username for the Proxmox Backup Server storage. Required.

Tip Do not forget to add the realm to the username. For example, root@pam or archiver@pbs.
password

The user password. The value will be saved in a file under /etc/pve/priv/storage/<STORAGE-ID>.pw with access restricted to the root user. Required.

datastore

The ID of the Proxmox Backup Server datastore to use. Required.

fingerprint

The fingerprint of the Proxmox Backup Server API TLS certificate. You can get it in the Servers Dashboard or using the proxmox-backup-manager cert info command. Required for self-signed certificates or any other one where the host does not trust the server's CA.

encryption-key

A key to encrypt the backup data from the client side. Currently, only keys that are not password protected (no key derivation function (kdf)) are supported. The key will be saved in a file under /etc/pve/priv/storage/<STORAGE-ID>.enc with access restricted to the root user. Use the magic value autogen to automatically generate a new one using proxmox-backup-client key create --kdf none <path>. Optional.

master-pubkey

A public RSA key used to encrypt the backup encryption key as part of the backup task. The encrypted copy will be appended to the backup and stored on the Proxmox Backup Server instance for recovery purposes. Optional, requires encryption-key.

Configuration Example (/etc/pve/storage.cfg)
pbs: backup
        datastore main
        server enya.proxmox.com
        content backup
        fingerprint 09:54:ef:..snip..:88:af:47:fe:4c:3b:cf:8b:26:88:0b:4e:3c:b2
        prune-backups keep-all=1
        username archiver@pbs
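
The fingerprint value is a colon-separated string of hexadecimal byte values. As a rough sketch, and assuming the fingerprint is the SHA-256 digest of the DER-encoded certificate (an assumption not stated in this section), it could be derived in Python as shown below. The function name format_fingerprint is hypothetical, and the input in the example is a placeholder byte string, not a real certificate.

```python
import hashlib


def format_fingerprint(der_bytes):
    """Return the SHA-256 digest of der_bytes as colon-separated hex pairs."""
    digest = hashlib.sha256(der_bytes).hexdigest()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


# Placeholder input; a real call would pass the DER-encoded certificate bytes.
fingerprint = format_fingerprint(b"abc")
print(fingerprint)
```

The value reported by the proxmox-backup-manager cert info command remains the authoritative source for the fingerprint.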

7.8.2. Storage Features

Proxmox Backup Server only supports backups, which can be block-level or file-level based. Proxmox VE uses block-level backups for virtual machines and file-level backups for containers.

Table 7. Storage features for backend pbs
Content types  Image formats  Shared  Snapshots  Clones
backup         n/a            yes     n/a        n/a

7.8.3. Encryption

Optionally, you can configure client-side encryption with AES-256 in GCM mode. Encryption can be configured either via the web interface, or on the CLI with the encryption-key option (see above). The key will be saved in the file /etc/pve/priv/storage/<STORAGE-ID>.enc, which is only accessible by the root user.

Warning Without their key, backups will be inaccessible. Thus, you should keep keys ordered and in a place that is separate from the contents being backed up. It can happen, for example, that you back up an entire system, using a key on that system. If the system then becomes inaccessible for any reason and needs to be restored, this will not be possible as the encryption key will be lost along with the broken system.

It is recommended that you keep your key safe, but easily accessible, in order for quick disaster recovery. For this reason, the best place to store it is in your password manager, where it is immediately recoverable. As a backup to this, you should also save the key to a USB flash drive and store that in a secure place. This way, it is detached from any system, but is still easy to recover from, in case of emergency. Finally, in preparation for the worst case scenario, you should also consider keeping a paper copy of your key locked away in a safe place. The paperkey subcommand can be used to create a QR encoded version of your key. The following command sends the output of the paperkey command to a text file, for easy printing.

# proxmox-backup-client key paperkey /etc/pve/priv/storage/<STORAGE-ID>.enc --output-format text > qrkey.txt

Additionally, it is possible to use a single RSA master key pair for key recovery purposes: configure all clients doing encrypted backups to use a single public master key, and all subsequent encrypted backups will contain an RSA-encrypted copy of the used AES encryption key. The corresponding private master key allows recovering the AES key and decrypting the backup even if the client system is no longer available.

Warning The same safe-keeping rules apply to the master key pair as to the regular encryption keys. Without a copy of the private key, recovery is not possible! The paperkey command supports generating paper copies of private master keys for storage in a safe, physical location.

Because the encryption is managed on the client side, you can use the same datastore on the server for unencrypted backups and encrypted backups, even if they are encrypted with different keys. However, deduplication between backups with different keys is not possible, so it is often better to create separate datastores.

Note Do not use encryption if there is no benefit from it, for example, when you are running the server locally in a trusted network. It is always easier to recover from unencrypted backups.

7.8.4. Example: Add Storage over CLI

You can add a Proxmox Backup Server datastore as a storage to the whole Proxmox VE cluster with:

# pvesm add pbs <id> --server <server> --datastore <datastore> --username <username> --fingerprint 00:B4:... --password

7.9. GlusterFS Backend

Storage pool type: glusterfs

GlusterFS is a scalable network file system. The system uses a modular design, runs on commodity hardware, and can provide a highly available enterprise storage at low cost. Such a system is capable of scaling to several petabytes, and can handle thousands of clients.

Note After a node/brick crash, GlusterFS does a full rsync to make sure data is consistent. This can take a very long time with large files, so this backend is not suitable to store large VM images.

7.9.1. Configuration

The backend supports all common storage properties, and adds the following GlusterFS specific options:

server

GlusterFS volfile server IP or DNS name.

server2

Backup volfile server IP or DNS name.

volume

GlusterFS Volume.

transport

GlusterFS transport: tcp, unix or rdma.

Configuration Example (/etc/pve/storage.cfg)
glusterfs: Gluster
        server 10.2.3.4
        server2 10.2.3.5
        volume glustervol
        content images,iso

7.9.2. File naming conventions

The directory layout and the file naming conventions are inherited from the dir backend.

7.9.3. Storage Features

The storage provides a file level interface, but no native snapshot/clone implementation.

Table 8. Storage features for backend glusterfs
Content types                      Image formats   Shared  Snapshots  Clones
images vztmpl iso backup snippets  raw qcow2 vmdk  yes     qcow2      qcow2

7.10. Local ZFS Pool Backend

Storage pool type: zfspool

This backend allows you to access local ZFS pools (or ZFS file systems inside such pools).

7.10.1. Configuration

The backend supports the common storage properties content, nodes, disable, and the following ZFS specific properties:

pool

Select the ZFS pool/filesystem. All allocations are done within that pool.

blocksize

Set ZFS blocksize parameter.

sparse

Use ZFS thin-provisioning. A sparse volume is a volume whose reservation is not equal to the volume size.

mountpoint

The mount point of the ZFS pool/filesystem. Changing this does not affect the mountpoint property of the dataset seen by zfs. Defaults to /<pool>.

Configuration Example (/etc/pve/storage.cfg)
zfspool: vmdata
        pool tank/vmdata
        content rootdir,images
        sparse

7.10.2. File naming conventions

The backend uses the following naming scheme for VM images:

vm-<VMID>-<NAME>      // normal VM images
base-<VMID>-<NAME>    // template VM image (read-only)
subvol-<VMID>-<NAME>  // subvolumes (ZFS filesystem for containers)
<VMID>

This specifies the owner VM.

<NAME>

This can be an arbitrary name (ASCII) without white space. The backend uses disk[N] as default, where [N] is replaced by an integer to make the name unique.

7.10.3. Storage Features

ZFS is probably the most advanced storage type regarding snapshot and cloning. The backend uses ZFS datasets for both VM images (format raw) and container data (format subvol). ZFS properties are inherited from the parent dataset, so you can simply set defaults on the parent dataset.

Table 9. Storage features for backend zfs
Content types   Image formats  Shared  Snapshots  Clones
images rootdir  raw subvol     no      yes        yes

7.10.4. Examples

It is recommended to create an extra ZFS file system to store your VM images:

# zfs create tank/vmdata

To enable compression on that newly allocated file system:

# zfs set compression=on tank/vmdata

You can get a list of available ZFS filesystems with:

# pvesm zfsscan

7.11. LVM Backend

Storage pool type: lvm

LVM is a light software layer on top of hard disks and partitions. It can be used to split available disk space into smaller logical volumes. LVM is widely used on Linux and makes managing hard drives easier.

Another use case is to put LVM on top of a big iSCSI LUN. That way you can easily manage space on that iSCSI LUN, which would not be possible otherwise, because the iSCSI specification does not define a management interface for space allocation.

7.11.1. Configuration

The LVM backend supports the common storage properties content, nodes, disable, and the following LVM specific properties:

vgname

LVM volume group name. This must point to an existing volume group.

base

Base volume. This volume is automatically activated before accessing the storage. This is mostly useful when the LVM volume group resides on a remote iSCSI server.

saferemove

Called "Wipe Removed Volumes" in the web UI. Zero-out data when removing LVs. When removing a volume, this makes sure that all data gets erased and cannot be accessed by other LVs created later (which happen to be assigned the same physical extents). This is a costly operation, but may be required as a security measure in certain environments.

saferemove_throughput

Wipe throughput (cstream -t parameter value).

Configuration Example (/etc/pve/storage.cfg)
lvm: myspace
        vgname myspace
        content rootdir,images

7.11.2. File naming conventions

The backend uses basically the same naming conventions as the ZFS pool backend.

vm-<VMID>-<NAME>      // normal VM images

7.11.3. Storage Features

LVM is a typical block storage, but this backend does not support snapshots and clones. Unfortunately, normal LVM snapshots are quite inefficient, because they interfere with all writes on the entire volume group during snapshot time.

One big advantage is that you can use it on top of a shared storage, for example, an iSCSI LUN. The backend itself implements proper cluster-wide locking.

Tip The newer LVM-thin backend allows snapshots and clones, but does not support shared storage.
Table 10. Storage features for backend lvm
Content types   Image formats  Shared    Snapshots  Clones
images rootdir  raw            possible  no         no

7.11.4. Examples

List available volume groups:

# pvesm lvmscan

7.12. LVM thin Backend

Storage pool type: lvmthin

LVM normally allocates blocks when you create a volume. LVM thin pools instead allocate blocks when they are written. This behaviour is called thin-provisioning, because volumes can be much larger than physically available space.

You can use the normal LVM command-line tools to manage and create LVM thin pools (see man lvmthin for details). Assuming you already have an LVM volume group called pve, the following commands create a new LVM thin pool (size 100G) called data:

lvcreate -L 100G -n data pve
lvconvert --type thin-pool pve/data

7.12.1. Configuration

The LVM thin backend supports the common storage properties content, nodes, disable, and the following LVM specific properties:

vgname

LVM volume group name. This must point to an existing volume group.

thinpool

The name of the LVM thin pool.

Configuration Example (/etc/pve/storage.cfg)
lvmthin: local-lvm
        thinpool data
        vgname pve
        content rootdir,images

7.12.2. File naming conventions

The backend uses basically the same naming conventions as the ZFS pool backend.

vm-<VMID>-<NAME>      // normal VM images

7.12.3. Storage Features

LVM thin is a block storage, but fully supports snapshots and clones efficiently. New volumes are automatically initialized with zero.

It must be mentioned that LVM thin pools cannot be shared across multiple nodes, so you can only use them as local storage.

Table 11. Storage features for backend lvmthin
Content types   Image formats  Shared  Snapshots  Clones
images rootdir  raw            no      yes        yes

7.12.4. Examples

List available LVM thin pools on volume group pve:

# pvesm lvmthinscan pve

7.13. Open-iSCSI initiator

Storage pool type: iscsi

iSCSI is a widely employed technology used to connect to storage servers. Almost all storage vendors support iSCSI. There are also open source iSCSI target solutions available, e.g. OpenMediaVault, which is based on Debian.

To use this backend, you need to install the Open-iSCSI (open-iscsi) package. This is a standard Debian package, but it is not installed by default to save resources.

# apt-get install open-iscsi

Low-level iSCSI management tasks can be done using the iscsiadm tool.

7.13.1. Configuration

The backend supports the common storage properties content, nodes, disable, and the following iSCSI specific properties:

portal

iSCSI portal (IP or DNS name with optional port).

target

iSCSI target.

Configuration Example (/etc/pve/storage.cfg)
iscsi: mynas
     portal 10.10.10.1
     target iqn.2006-01.openfiler.com:tsn.dcb5aaaddd
     content none
Tip If you want to use LVM on top of iSCSI, it makes sense to set content none. That way it is not possible to create VMs using iSCSI LUNs directly.

7.13.2. File naming conventions

The iSCSI protocol does not define an interface to allocate or delete data. Instead, that needs to be done on the target side and is vendor specific. The target simply exports them as numbered LUNs. So Proxmox VE iSCSI volume names just encode some information about the LUN as seen by the Linux kernel.

7.13.3. Storage Features

iSCSI is a block level type storage, and provides no management interface. So it is usually best to export one big LUN, and set up LVM on top of that LUN. You can then use the LVM plugin to manage the storage on that iSCSI LUN.

Table 12. Storage features for backend iscsi
Content types  Image formats  Shared  Snapshots  Clones
images none    raw            yes     no         no

7.13.4. Examples

Scan a remote iSCSI portal and return a list of possible targets:

pvesm scan iscsi <HOST[:PORT]>

7.14. User Mode iSCSI Backend

Storage pool type: iscsidirect

This backend provides basically the same functionality as the Open-iSCSI backend, but uses a user-level library to implement it. You need to install the libiscsi-bin package in order to use this backend.

It should be noted that there are no kernel drivers involved, so this can be viewed as a performance optimization. But this comes with the drawback that you cannot use LVM on top of such an iSCSI LUN. So you need to manage all space allocations at the storage server side.

7.14.1. Configuration

The user mode iSCSI backend uses the same configuration options as the Open-iSCSI backend.

Configuration Example (/etc/pve/storage.cfg)
iscsidirect: faststore
     portal 10.10.10.1
     target iqn.2006-01.openfiler.com:tsn.dcb5aaaddd

7.14.2. Storage Features

Note This backend works with VMs only. Containers cannot use this driver.
Table 13. Storage features for backend iscsidirect
Content types  Image formats  Shared  Snapshots  Clones
images         raw            yes     no         no

7.15. Ceph RADOS Block Devices (RBD)

Storage pool type: rbd

Ceph is a distributed object store and file system designed to provide excellent performance, reliability and scalability. RADOS block devices implement a feature rich block level storage, and you get the following advantages:

  • thin provisioning

  • resizable volumes

  • distributed and redundant (striped over multiple OSDs)

  • full snapshot and clone capabilities

  • self healing

  • no single point of failure

  • scalable to the exabyte level

  • kernel and user space implementation available

Note For smaller deployments, it is also possible to run Ceph services directly on your Proxmox VE nodes. Recent hardware has plenty of CPU power and RAM, so running storage services and VMs on the same node is possible.

7.15.1. Configuration

This backend supports the common storage properties nodes, disable, content, and the following rbd specific properties:

monhost

List of monitor daemon IPs. Optional, only needed if Ceph is not running on the Proxmox VE cluster.

pool

Ceph pool name.

username

RBD user ID. Optional, only needed if Ceph is not running on the Proxmox VE cluster. Note that only the user ID should be used. The "client." type prefix must be left out.

krbd

Enforce access to RADOS block devices through the krbd kernel module. Optional.

Note Containers will use krbd independent of the option value.
Configuration Example for an external Ceph cluster (/etc/pve/storage.cfg)
rbd: ceph-external
        monhost 10.1.1.20 10.1.1.21 10.1.1.22
        pool ceph-external
        content images
        username admin
Tip You can use the rbd utility to do low-level management tasks.
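
As a hedged illustration of such low-level tasks, the sketch below prints rbd command lines for the external cluster from the configuration example. It assumes the keyring has been stored under /etc/pve/priv/ceph/<STORAGE_ID>.keyring as described in the Authentication section; the image name vm-100-disk-0 is a hypothetical example. The helper only prints the command so that it can be checked before being run.

```shell
#!/bin/sh
# Sketch: build an rbd command line for an external Ceph cluster.
# Storage ID, monitor address and user come from the example above;
# the image name used below is hypothetical.

rbd_cmd() {
    storage=$1  # Proxmox VE storage ID, e.g. ceph-external
    mon=$2      # address of one monitor
    user=$3     # RBD user ID without the "client." prefix
    shift 3     # remaining arguments are the rbd subcommand
    echo "rbd -m $mon --id $user --keyring /etc/pve/priv/ceph/${storage}.keyring $*"
}

# List the images in the pool, then show details of one image.
rbd_cmd ceph-external 10.1.1.20 admin ls -p ceph-external
rbd_cmd ceph-external 10.1.1.20 admin info ceph-external/vm-100-disk-0
```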

7.15.2. Authentication

Note If Ceph is installed locally on the Proxmox VE cluster, the following is done automatically when adding the storage.

If you use cephx authentication, which is enabled by default, you need to provide the keyring from the external Ceph cluster.

To configure the storage via the CLI, you first need to make the file containing the keyring available. One way is to copy the file from the external Ceph cluster directly to one of the Proxmox VE nodes. The following example will copy it to the /root directory of the node on which we run it:

# scp <external cephserver>:/etc/ceph/ceph.client.admin.keyring /root/rbd.keyring

Then use the pvesm CLI tool to configure the external RBD storage. The --keyring parameter must be the path to the keyring file that you copied. For example:

# pvesm add rbd <name> --monhost "10.1.1.20 10.1.1.21 10.1.1.22" --content images --keyring /root/rbd.keyring

When configuring an external RBD storage via the GUI, you can copy and paste the keyring into the appropriate field.

The keyring will be stored at

# /etc/pve/priv/ceph/<STORAGE_ID>.keyring
Tip Creating a keyring with only the needed capabilities is recommended when connecting to an external cluster. For further information on Ceph user management, see the Ceph docs [Ceph User Management].
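
One possible way to follow that recommendation is sketched below. It assumes a Ceph release that supports the rbd capability profiles; the user ID pve and the keyring file names are hypothetical. The first helper prints the command to run on the external Ceph cluster, the second prints the matching pvesm call for a Proxmox VE node once the keyring has been copied there.

```shell
#!/bin/sh
# Sketch: print the two commands for a least-privilege RBD user.
# User ID, pool, storage ID and monitor list are hypothetical arguments.

# For the external Ceph cluster: create client.<user> limited to one pool.
ceph_user_cmd() {
    user=$1 pool=$2
    echo "ceph auth get-or-create client.$user mon 'profile rbd' osd 'profile rbd pool=$pool' -o /etc/ceph/ceph.client.$user.keyring"
}

# For a Proxmox VE node: add the storage, assuming the keyring was copied
# to /root/<user>.keyring beforehand.
pvesm_add_cmd() {
    storage=$1 user=$2 pool=$3 monhosts=$4
    echo "pvesm add rbd $storage --monhost \"$monhosts\" --pool $pool --username $user --content images --keyring /root/$user.keyring"
}

ceph_user_cmd pve ceph-external
pvesm_add_cmd ceph-external pve ceph-external "10.1.1.20 10.1.1.21 10.1.1.22"
```

Note that the user ID is passed to pvesm without the "client." prefix, as required by the username property.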

7.15.3. Ceph client configuration (optional)

Connecting to an external Ceph storage doesn’t always allow setting client-specific options in the config DB on the external cluster. You can add a ceph.conf beside the Ceph keyring to change the Ceph client configuration for the storage.

The ceph.conf needs to have the same name as the storage.

# /etc/pve/priv/ceph/<STORAGE_ID>.conf
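
A minimal sketch of such a file is shown below, written by a small helper function. The two cache options are taken from the Ceph RBD configuration reference, but the values are purely illustrative and not a tuning recommendation; the helper name and its optional directory argument are assumptions made for this example.

```shell
#!/bin/sh
# Sketch: write a client-side ceph.conf for one storage.
# Option values are illustrative, not tuning recommendations.

write_client_conf() {
    storage=$1                    # Proxmox VE storage ID
    dir=${2:-/etc/pve/priv/ceph}  # target directory (override for testing)
    cat > "$dir/$storage.conf" <<EOF
[client]
rbd_cache = true
rbd_cache_size = 67108864
EOF
}
```

Calling write_client_conf ceph-external would then create /etc/pve/priv/ceph/ceph-external.conf next to the keyring of that storage.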

See the RBD configuration reference [https://docs.ceph.com/en/quincy/rbd/rbd-config-ref/] for possible settings.