NVIDIA vGPU on Proxmox VE


Introduction

NVIDIA vGPU technology enables multiple virtual machines to use a single supported[1] physical GPU.

This article explains how to use NVIDIA vGPU on Proxmox VE. The instructions were tested using an RTX A5000.

Disclaimer

At the time of writing, Proxmox VE is not an officially supported platform for NVIDIA vGPU. This means that even with valid vGPU licenses, you may not be eligible for NVIDIA enterprise support for this use case. However, Proxmox VE's kernel is derived from the Ubuntu kernel, which is a supported platform for NVIDIA vGPU as of 2024.

Note that although we are using some consumer hardware in this article, for optimal performance in production workloads, we recommend using appropriate enterprise-grade hardware. Please refer to NVIDIA's support page to verify hardware compatibility [2] [3] [1].

Hardware Setup

We're using the following hardware configurations for our tests:

Test Systems

              Test System 1 (Intel)                   Test System 2 (AMD)
CPU           Intel Core i7-12700K                    AMD Ryzen 7 3700X
Motherboard   ASUS PRIME Z690-A                       ASUS PRIME X570-P
Memory        128 GB DDR5 (4x Crucial CT32G48C40U5)   64 GB DDR4 (4x Corsair CMK64GX4M4A2666C16)
GPU           PNY NVIDIA RTX A5000                    PNY NVIDIA RTX A5000

Some NVIDIA GPUs, such as the RTX A5000 we tested, support vGPU but do not have it enabled by default. To enable vGPU for these models, switch the display mode using the NVIDIA Display Mode Selector Tool[4]. Note that this will disable the display ports.
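
The tool is distributed by NVIDIA as a standalone Linux binary. A minimal sketch of its use follows; the mode name below is an assumption, so verify it against the tool's own help output before running:

chmod +x displaymodeselector
./displaymodeselector --gpumode displayless    # mode name is an assumption; check the tool's help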

For a list of GPUs where this is necessary, check the NVIDIA documentation[5]. Note that this should be the exception and should only be necessary for workstation GPUs like the one used in this guide.

The installation was tested on the following versions of Proxmox VE, Linux kernel, and NVIDIA drivers:

pve-manager   Kernel                vGPU Software Branch   NVIDIA Host Driver
8.2.8         6.8.12-4-pve          17.4                   550.127.06
8.2.8         6.11.0-1-pve          17.4                   550.127.06

Older, now outdated, tested versions:

pve-manager   Kernel                vGPU Software Branch   NVIDIA Host Driver
7.2-7         5.15.39-2-pve         14.1                   510.73.06
7.2-7         5.15.39-2-pve         14.2                   510.85.03
7.4-3         5.15.107-2-pve        15.2                   525.105.14
7.4-17        6.2.16-20-bpo11-pve   16.0                   535.54.06
8.1.4         6.5.11-8-pve          16.3                   535.154.02
8.1.4         6.5.13-1-pve          16.3                   535.154.02
8.2.8         6.8.12-4-pve          17.3                   550.90.05
8.2.8         6.11.0-1-pve          17.3                   550.90.05
Note: With 6.8+ based kernels / GRID version 17.3+, the lower-level interface of the driver changed and requires qemu-server ≥ 8.2.6 to be installed on the host.

It is recommended to use the latest stable and supported version of Proxmox VE and NVIDIA drivers. However, newer driver versions within the same vGPU software branch should also work with the same or older kernel versions.

A mapping of which NVIDIA vGPU software version corresponds to which driver version is available in the official documentation[6].

Since version 16.0, certain cards are no longer supported by the NVIDIA vGPU driver, but are supported by NVIDIA AI Enterprise [1] [7]. We have tested the NVIDIA AI Enterprise driver with an A16 and vGPU technology and found that it behaves similarly to the vGPU driver, so the following steps should also apply. Note that vGPU and NVIDIA AI Enterprise are different products with different licenses.

Preparation

Before actually installing the host drivers, there are a few steps to be done on the Proxmox VE host.

Tip: If you need to use a root shell, you can, for example, open one by connecting via SSH or using the node shell on the Proxmox VE web interface.

Enable PCIe Passthrough

Make sure that your system is compatible with PCIe passthrough. See the PCI(e) Passthrough documentation.

Additionally, confirm that the following features are enabled in your firmware settings (BIOS/UEFI):

  • VT-d for Intel, or AMD-Vi for AMD (sometimes named IOMMU)
  • SR-IOV (this may not be necessary for older pre-Ampere GPU generations)
  • Above 4G decoding
  • Alternative Routing ID Interpretation (ARI) (not necessary for pre-Ampere GPUs)

The firmware of your host might use different naming. If you are unable to locate some of these options, refer to the documentation provided by your firmware or motherboard manufacturer.

Note: It is crucial to ensure that IOMMU is enabled both in your firmware and in the kernel.
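
To verify on the host that the IOMMU is actually active after booting, you can check the kernel log; the exact messages vary by platform:

dmesg | grep -e DMAR -e IOMMU

On Intel systems this should include a line like "DMAR: IOMMU enabled", on AMD systems lines mentioning AMD-Vi. On kernels older than 6.8 you may additionally have to add intel_iommu=on to the kernel command line on Intel systems; see the PCI(e) Passthrough documentation for details.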

Setup Proxmox VE Repositories

Proxmox VE comes with the enterprise repository set up by default; this repository provides better-tested software and is recommended for production use. The enterprise repository requires a valid subscription per node. For evaluation or non-production use cases, you can simply switch to the public no-subscription repository, which provides the same feature set but with a higher frequency of updates.

You can use the Repositories management panel in the Proxmox VE web UI for managing package repositories, see the documentation for details.
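
If you prefer to do this from a shell, the no-subscription repository corresponds to an APT entry like the following (assuming Proxmox VE 8.x on Debian Bookworm) in /etc/apt/sources.list:

deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription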

Update to Latest Package Versions

Proxmox VE uses a rolling release model and should be updated frequently to ensure that your Proxmox VE installation has the latest bug fixes, security fixes, and features available.

You can update your Proxmox VE node using the update panel on the web UI.
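
Alternatively, update from a root shell with:

apt update
apt dist-upgrade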

Prepare using pve-nvidia-vgpu-helper

Since pve-manager version 8.3.4, the tool pve-nvidia-vgpu-helper is included. If you're on an older version, please upgrade to the latest version or install it manually with:

apt install pve-nvidia-vgpu-helper

This tool takes care of some necessary preparations, such as blacklisting the nouveau driver and installing header packages, DKMS, and so on. You can start the setup with:

pve-nvidia-vgpu-helper setup

It may ask whether you want to install packages; answer with 'y'. After it has successfully installed all necessary packages, it should say:

All done, you can continue with the NVIDIA vGPU driver installation.
Note: If you install an opt-in kernel later, you also have to install the corresponding proxmox-headers-X.Y package for DKMS to work.
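
For example, when opting into the 6.11 kernel series, install the matching headers alongside it (package names assume Proxmox VE 8.x):

apt install proxmox-kernel-6.11 proxmox-headers-6.11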

Host Driver Installation

Note: The driver/file versions shown in this section are examples only; use the correct file names for the driver you're installing.
Note: If you're using Secure Boot, see the Secure Boot chapter below before continuing.


To get started, you will need the appropriate host and guest drivers; see the NVIDIA Virtual GPU Software Quick Start Guide[8] for instructions on how to obtain them. Choose Linux KVM as the target hypervisor when downloading.

In our case we got the following host driver file:

NVIDIA-Linux-x86_64-525.105.14-vgpu-kvm.run

Copy this file over to your Proxmox VE node, for example with SCP or an SSH file copy tool.
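
For example, with scp (the hostname is a placeholder for your node):

scp NVIDIA-Linux-x86_64-525.105.14-vgpu-kvm.run root@your-node:/root/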

To start the installation, you need to make the installer executable first, and then pass the --dkms option when running it, to ensure that the module is rebuilt after a kernel upgrade:

chmod +x NVIDIA-Linux-x86_64-525.105.14-vgpu-kvm.run
./NVIDIA-Linux-x86_64-525.105.14-vgpu-kvm.run --dkms

Follow the steps of the installer.

When you're asked if you want to register the kernel module sources with DKMS, answer 'yes'.

After the installer has finished successfully, you will need to reboot your system, either using the web interface or by executing reboot.

Enabling SR-IOV

On newer NVIDIA GPUs (those based on the Ampere architecture and beyond), you must first enable SR-IOV before being able to use vGPUs. You can do that with the sriov-manage script from NVIDIA.

The pve-nvidia-vgpu-helper package comes with a systemd service template for enabling this automatically on boot.

To enable it, use

systemctl enable --now pve-nvidia-sriov@ALL.service

You can replace ALL with a specific PCI ID (like 0000:01:00.0) if you only want to enable it for a specific card.
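
For example, to enable the service for a single card only:

systemctl enable --now pve-nvidia-sriov@0000:01:00.0.service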

This will then run before the NVIDIA vGPU daemons and the Proxmox VE virtual guest auto start-up.

Verify that there are multiple virtual functions for your device with:

# lspci -d 10de:

In our case there are now 24 virtual functions in addition to the physical card (01:00.0):

01:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:00.4 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:00.5 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:00.6 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:00.7 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:01.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:01.1 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:01.2 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:01.3 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:01.4 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:01.5 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:01.6 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:01.7 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:02.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:02.1 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:02.2 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:02.3 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:02.4 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:02.5 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:02.6 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:02.7 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:03.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:03.1 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:03.2 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:03.3 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)

Create a PCI Resource Mapping

For convenience and privilege separation, you can now create resource mappings for PCI devices. A mapping can contain multiple PCI IDs, such as all virtual functions; the first available ID is automatically selected when the guest is started. Go to Datacenter -> Resource Mappings to create a new one. For details, see Resource Mapping.

For example, using an RTX A5000 card we want to check 'Use with mediated devices' and select all virtual functions:

[Screenshot: PCI resource mapping for the RTX A5000 virtual functions (Pve-vgpu-mapping.png)]

Guest Configuration

General Setup

First, set up a VM as you normally would, without adding a vGPU. This can be done either with the Virtual Machine wizard in the Web UI or via the CLI tool qm. For guest-specific notes, see for example Windows 11 guest best practices.

Please note that all Linux commands shown are assumed to be run as a privileged user, for example directly as the root user or prefixed with sudo.

Remote Desktop Software

The next step is to configure remote desktop software in the guest. There are many options available:

  • VNC
    many different options, some free, some commercial
  • Remote Desktop
    built into Windows itself; available on Linux via xrdp
  • Parsec
    costs money for commercial use, allows using hardware-accelerated encoding
  • NoMachine
    provides hardware-accelerated encoding, but is not open source and costs money for business use
  • RustDesk
    free and open source; more comparable to TeamViewer or AnyDesk than the ones above

There are many more options available; see [1] for more examples.

We show how to enable two examples here, Remote Desktop for Windows 10/11, and VNC (via x11vnc) on Linux:

Remote Desktop on Windows 10/11

To enable Remote Desktop on Windows 10/11, go to Settings -> System -> Remote Desktop and enable the Remote Desktop option.

VNC on Linux via x11vnc (Ubuntu/Rocky Linux)

Note that this is just an example; how you want to configure remote desktops on Linux will depend on your use case.

Ubuntu 24.04 and Rocky Linux 9 ship with GDM3 and GNOME by default, which makes it a bit harder to share the screen with x11vnc. So the first step is to install a different display manager. We successfully tested LightDM here, but others may work as well.

Note that for Rocky Linux you might need to enable the EPEL repository beforehand with:

# dnf install epel-release

First, we install and activate the new display manager:

Ubuntu

# apt install lightdm

Select 'LightDM' as the default display manager when prompted.

Rocky Linux

# dnf install lightdm
# systemctl disable --now gdm.service
# systemctl enable --now lightdm.service

After that, install x11vnc with:

Ubuntu

# apt install x11vnc

Rocky Linux

# dnf install x11vnc

Then add a systemd service that starts the VNC server on the X.Org server provided by LightDM, by creating /etc/systemd/system/x11vnc.service:

[Unit]
Description=Start x11vnc
After=multi-user.target

[Service]
Type=simple
ExecStart=/usr/bin/x11vnc -display :0 -auth /var/run/lightdm/root/:0 -forever -loop -repeat -rfbauth /etc/x11vnc.passwd -rfbport 5900 -shared -noxdamage

[Install]
WantedBy=multi-user.target

You can set the password by executing:

# x11vnc -storepasswd /etc/x11vnc.passwd
# chmod 0400 /etc/x11vnc.passwd
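
After creating the unit file, make systemd pick it up and enable it so the VNC server starts at boot:

# systemctl daemon-reload
# systemctl enable x11vnc.service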

On Rocky Linux you might need to allow VNC in the firewall:

# firewall-cmd --permanent --add-port=5900/tcp
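
Because --permanent only changes the saved configuration, reload the firewall (or reboot) for the rule to become active:

# firewall-cmd --reload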

After setting up LightDM and x11vnc and restarting the VM, you should now be able to connect via VNC.

vGPU Configuration

After configuring the VM to your liking, shut down the VM and add a vGPU by selecting one of the virtual functions and selecting the appropriate mediated device type.

For example:

Via the CLI:

qm set VMID -hostpci0 01:00.4,mdev=nvidia-660
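
If you created a PCI resource mapping as described above, you can reference it by name instead of a fixed PCI address; the mapping name here is an example:

qm set VMID -hostpci0 mapping=nvidia-a5000,mdev=nvidia-660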

Via the web interface:

[Screenshot: Selecting a vGPU model]

To find the correct mediated device type, you can use pvesh get /nodes/NODENAME/hardware/pci/MAPPINGNAME/mdev. This will query sysfs for all supported types that can be created. Note that, depending on the driver and kernel versions in use, not all models may be visible here, but only those that are currently available.

NVIDIA Guest Driver Installation

Windows 10/11

Refer to the NVIDIA documentation[9] to find which guest driver is compatible with your host driver. For example:

553.24_grid_win10_win11_server2022_dch_64bit_international.exe

Start the installer and follow the instructions; after it has finished, restart the guest as prompted.

From this point on, Proxmox VE's built-in noVNC console will no longer work, so use your desktop sharing software to connect to the guest. You can now use the vGPU to run 3D applications such as Blender, 3D games, etc.


Ubuntu Desktop

To install the NVIDIA driver on Ubuntu, use apt to install the .deb package that NVIDIA provides for Ubuntu. Check the NVIDIA documentation[9] for which guest driver matches your host driver.

In our case this was nvidia-linux-grid-550_550.127.05_amd64.deb. For apt to recognize it as a local file, you must prefix the relative path, for example ./ if the .deb file is located in the current directory.

# apt install ./nvidia-linux-grid-550_550.127.05_amd64.deb

Then use NVIDIA's tool to generate the X.Org configuration:

# nvidia-xconfig

Now you can reboot and use a VNC client to connect and use the vGPU for 3D applications.

Rocky Linux

To install the NVIDIA driver on Rocky Linux, use dnf to install the .rpm package that NVIDIA provides for Red Hat based distributions. Check the NVIDIA documentation[9] for which guest driver matches your host driver.

In our case this was nvidia-linux-grid-550-550.127.05-1.x86_64.rpm. If the file is located in the current directory, run:

# dnf install nvidia-linux-grid-550-550.127.05-1.x86_64.rpm

Then use NVIDIA's tool to generate the X.Org configuration:

# nvidia-xconfig

Now you can reboot and use a VNC client to connect and use the vGPU for 3D applications.

CUDA on Linux

If you want to use CUDA on a Linux Guest, you might need to install the CUDA Toolkit manually[10]. Check the NVIDIA documentation which version of CUDA is supported for your vGPU drivers.
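
To sanity-check the result from within the guest: nvidia-smi reports the highest CUDA version the installed driver supports, and nvcc (available once the toolkit is installed) reports the toolkit version:

# nvidia-smi
# nvcc --version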

Guest vGPU Licensing

To use the vGPU without restriction, you must adhere to NVIDIA's licensing. Check the NVIDIA vGPU documentation[11] for instructions on how to do so.

Tip: Ensure that the guest system time is properly synchronized using NTP, otherwise the guest will be unable to request a license for the vGPU.
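
As a rough sketch of the procedure for a Linux guest (consult the linked documentation for the authoritative steps; paths and the FeatureType value depend on your product and driver version): copy the client configuration token from your license server to /etc/nvidia/ClientConfigToken/, set FeatureType=1 (vGPU) in /etc/nvidia/gridd.conf, then restart the licensing daemon and check the license status:

# systemctl restart nvidia-gridd
# nvidia-smi -q | grep -i license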

Troubleshooting

A warning like the following might get logged by QEMU on VM startup. This usually only happens on consumer hardware that does not support PCIe AER[12] error recovery properly. It generally should not have any adverse effects on normal operation, but PCIe link errors might not be (soft-)recoverable in such cases.

 kvm: -device vfio-pci,host=0000:09:00.5,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0: warning: vfio 0000:09:00.5: Could not enable error recovery for the device

Secure Boot

When booting the host with Secure Boot, kernel modules must be signed with a trusted key. Here we show how to set up your host so that the NVIDIA driver is signed and can be loaded. For more details, see Secure Boot Setup. To be able to enroll the keys into the UEFI, make sure you have access to the physical display output during boot; this is necessary for confirming the key import. On servers this can usually be achieved with IPMI/iKVM/etc.

Before installing the NVIDIA host driver, we need to install a few prerequisites to enroll the DKMS signing key into UEFI:

apt install shim-signed grub-efi-amd64-signed mokutil

Now you can install the NVIDIA driver, but with an additional parameter:

./NVIDIA-Linux-x86_64-525.105.14-vgpu-kvm.run --dkms --skip-module-load

When asked if the installer should sign the module, select 'no'.

After the installer is finished, we now want to rebuild the kernel modules with DKMS, which will sign the kernel module for us with a generated key. First, check what module version is installed with:

dkms status

Which will output a line like this:

nvidia/550.144.02, 6.8.12-6-pve, x86_64: installed 

You need to rebuild and reinstall the listed module with (replace the version with the one on your system):

dkms build -m nvidia -v 550.144.02 --force
dkms install -m nvidia -v 550.144.02 --force

This will ensure that the modules are signed with the DKMS key located in /var/lib/dkms/mok.pub. If you have not already done so, enroll the DKMS key as described in Using DKMS with Secure Boot.
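
A typical enrollment looks like this; after the reboot, confirm the key import in the MOK manager on the physical console:

mokutil --import /var/lib/dkms/mok.pub
reboot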

You should then be able to load the signed NVIDIA kernel module. You can verify this by checking if the PCI devices have their driver loaded, e.g. with

lspci -d 10de: -nnk

It should say

Kernel driver in use: nvidia

You can now continue with the next step after the driver installation.

Notes