NVIDIA vGPU on Proxmox VE

Latest revision as of 13:01, 19 November 2024

Introduction

NVIDIA vGPU technology enables multiple virtual machines to use a single supported[1] physical GPU.

This article explains how to use NVIDIA vGPU on Proxmox VE. The instructions were tested using an RTX A5000.

Disclaimer

At the time of writing, Proxmox VE is not an officially supported platform for NVIDIA vGPU. This means that even with valid vGPU licenses, you may not be eligible for NVIDIA enterprise support for this use-case. However, Proxmox VE's kernel is derived from the Ubuntu kernel, which is a supported platform for NVIDIA vGPU as of 2024.

Note that although we are using some consumer hardware in this article, for optimal performance in production workloads, we recommend using appropriate enterprise-grade hardware. Please refer to NVIDIA's support page to verify hardware compatibility [2] [1].

Hardware Setup

We're using the following hardware configurations for our tests:

Test Systems

              System 1                                    System 2
CPU           Intel Core i7-12700K                        AMD Ryzen 7 3700X
Motherboard   ASUS PRIME Z690-A                           ASUS PRIME X570-P
Memory        128 GB DDR5 (4x Crucial CT32G48C40U5)       64 GB DDR4 (4x Corsair CMK64GX4M4A2666C16)
GPU           PNY NVIDIA RTX A5000                        PNY NVIDIA RTX A5000

Some NVIDIA GPUs do not have vGPU enabled by default, even though they support vGPU, like the RTX A5000 we tested. To enable vGPU there, switch the display using the NVIDIA Display Mode Selector Tool[3]. This will disable the display ports.

For a list of GPUs where this is necessary check their documentation[4].

The installation was tested on the following versions of Proxmox VE, Linux kernel, and NVIDIA drivers:

pve-manager   Kernel                vGPU Software Branch   NVIDIA Host drivers
7.2-7         5.15.39-2-pve         14.1                   510.73.06
7.2-7         5.15.39-2-pve         14.2                   510.85.03
7.4-3         5.15.107-2-pve        15.2                   525.105.14
7.4-17        6.2.16-20-bpo11-pve   16.0                   535.54.06
8.1.4         6.5.11-8-pve          16.3                   535.154.02
8.1.4         6.5.13-1-pve          16.3                   535.154.02
8.2.8         6.8.12-4-pve          17.3                   550.90.05
8.2.8         6.11.0-1-pve          17.3                   550.90.05
8.2.8         6.8.12-4-pve          17.4                   550.127.06
8.2.8         6.11.0-1-pve          17.4                   550.127.06
Note: With 6.8+ based kernels / GRID version 17.3+, the lower-level interface of the driver changed and requires qemu-server ≥ 8.2.6 to be installed on the host.

It is recommended to use the latest stable and supported versions of Proxmox VE and the NVIDIA drivers. However, newer releases within one vGPU Software Branch should also work with the same or older kernel versions.
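To compare a host against the compatibility table above, the relevant versions can be printed from a root shell. This is only a sketch; `nvidia-smi` becomes available once the NVIDIA host driver is installed.

```shell
# Sketch: print the versions relevant to the compatibility table.
# nvidia-smi only works after the NVIDIA host driver is installed.
show_vgpu_versions() {
    pveversion    # pve-manager version
    uname -r      # running kernel
    nvidia-smi --query-gpu=driver_version --format=csv,noheader
}
```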

A mapping of which NVIDIA vGPU software version corresponds to which driver version is available in the official documentation [5][6].

Since version 16.0, certain cards are no longer supported by the NVIDIA vGPU driver, but are supported by the NVIDIA AI Enterprise driver [1][7]. We have tested the AI Enterprise driver with an A16 and vGPU technology and found that it behaves similarly to the old vGPU driver. Therefore, the following steps also apply to it.

Preparation

Before actually installing the host drivers, there are a few steps to be done on the Proxmox VE host.

Tip: If you need to use a root shell, you can, for example, open one by connecting via SSH or using the node shell on the Proxmox VE web interface.

Enable PCIe Passthrough

Make sure that your system is compatible with PCIe passthrough. See the PCI(e) Passthrough documentation.

Additionally, confirm that the following features are enabled in your firmware settings (BIOS/UEFI):

  • VT-d for Intel, or AMD-Vi for AMD (sometimes named IOMMU)
  • SR-IOV (this may not be necessary for older pre-Ampere GPU generations)
  • Above 4G decoding
  • PCI AER (Advanced Error Reporting)
  • PCI ASPM (Active State Power Management)

The firmware of your host might use different naming. If you are unable to locate some of these options, refer to the documentation provided by your firmware or motherboard manufacturer.

Note: It is crucial to ensure that IOMMU support is enabled both in your firmware and in the kernel.
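To verify on the kernel side, you can inspect the kernel command line and boot messages from a root shell. A minimal sketch, assuming the IOMMU is requested via the common `intel_iommu=on` / `amd_iommu=on` parameters (on recent kernels, Intel systems may enable it by default even without the parameter):

```shell
# Returns success if the given kernel command line explicitly requests
# the IOMMU. The string check is pure shell, so it is easy to test.
cmdline_has_iommu() {
    case " $1 " in
        *" intel_iommu=on "*|*" amd_iommu=on "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a live system (root shell):
#   cmdline_has_iommu "$(cat /proc/cmdline)" && echo "IOMMU requested"
#   dmesg | grep -e DMAR -e IOMMU   # look for "IOMMU enabled" messages
```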

Setup Proxmox VE Repositories

Proxmox VE comes with the enterprise repository set up by default, as this repository provides better-tested software and is recommended for production use. The enterprise repository requires a valid subscription per node. For evaluation or non-production use cases, you can simply switch to the public no-subscription repository, which provides the same feature set but with a higher frequency of updates.

You can use the Repositories management panel in the Proxmox VE web UI for managing package repositories, see the documentation for details.

Update to Latest Package Versions

Proxmox VE uses a rolling release model and should be updated frequently to ensure that your installation has the latest bug fixes, security fixes, and features available.

You can update your Proxmox VE node using the update panel on the web UI.
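The same can also be done from a root shell; a sketch of the equivalent CLI commands, wrapped in a function for clarity:

```shell
# Sketch: CLI equivalent of the web UI update panel (root shell,
# correctly configured repositories assumed).
pve_full_update() {
    apt update
    apt dist-upgrade    # on Proxmox VE, use dist-upgrade rather than upgrade
}
```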

Blacklist the Nouveau Driver

Next, blacklist the open source nouveau kernel module to prevent it from interfering with the proprietary NVIDIA one.

To do that, add a line with blacklist nouveau to a file in the /etc/modprobe.d/ directory. For example, open a root shell and execute:

echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf

Then, update your initramfs, to ensure that the module is blocked from loading at early boot, and then reboot your host.
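The verify-update-reboot sequence can be sketched as follows, assuming the blacklist entry was written to /etc/modprobe.d/blacklist.conf as above:

```shell
# Sketch: verify the blacklist entry, rebuild the initramfs for all
# installed kernels, then reboot the host (root shell).
apply_nouveau_blacklist() {
    grep -q '^blacklist nouveau' /etc/modprobe.d/blacklist.conf || return 1
    update-initramfs -u -k all
    reboot
}
```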

Setup DKMS

Because the NVIDIA module is separate from the kernel, it must be rebuilt with Dynamic Kernel Module Support (DKMS) for each new kernel update.

To set up DKMS, you must install the headers package for the kernel and the DKMS helper package. In a root shell, run

apt update
apt install dkms libc6-dev proxmox-default-headers --no-install-recommends
Note: If you do not have the default kernel version installed, but for example an opt-in kernel, you must install the appropriate proxmox-headers-X.Y package instead of proxmox-default-headers.

Host Driver Installation

Note: The driver/file versions shown in this section are examples only; use the correct file names for the selected driver you're installing.

To get started, you will need the appropriate host and guest drivers; see the NVIDIA Virtual GPU Software Quick Start Guide[8] for instructions on how to obtain them. Choose Linux KVM as the target hypervisor when downloading.

In our case we got the following host driver file:

NVIDIA-Linux-x86_64-525.105.14-vgpu-kvm.run

Copy this file over to your Proxmox VE node.

To start the installation, you need to make the installer executable first, and then pass the --dkms option when running it, to ensure that the module is rebuilt after a kernel upgrade:

chmod +x NVIDIA-Linux-x86_64-525.105.14-vgpu-kvm.run
./NVIDIA-Linux-x86_64-525.105.14-vgpu-kvm.run --dkms

Follow the steps of the installer.

After the installer has finished successfully, you will need to reboot your system, either using the web interface or by executing reboot.

Enabling SR-IOV

On some NVIDIA GPUs (for example, those based on the Ampere architecture), you must first enable SR-IOV before being able to use vGPUs. You can do that with the sriov-manage script from NVIDIA.

/usr/lib/nvidia/sriov-manage -e <pciid|ALL>

Since this setting is lost on reboot, it is a good idea to write a cron job or systemd service to re-enable it at boot.

Here is an example systemd service for enabling SR-IOV on all found NVIDIA GPUs:

[Unit]
Description=Enable NVIDIA SR-IOV
After=network.target nvidia-vgpud.service nvidia-vgpu-mgr.service
Before=pve-guests.service

[Service]
Type=oneshot
#ExecStartPre=/bin/sleep 5
ExecStart=/usr/lib/nvidia/sriov-manage -e ALL

[Install]
WantedBy=multi-user.target

Depending on the actual hardware, it might be necessary to give the nvidia-vgpud.service a bit more time to start. This can be done by removing the '#' at the beginning of the ExecStartPre line of the above service file, replacing '5' with an appropriate number of seconds.

You can save this in /usr/local/lib/systemd/system/nvidia-sriov.service, then enable and start it by running:

systemctl daemon-reload
systemctl enable --now nvidia-sriov.service

This service will then run after the NVIDIA vGPU daemons have started, but before the Proxmox VE virtual guest autostart.

Verify that there are multiple virtual functions for your device with:

# lspci -d 10de:

In our case there are now 24 virtual functions in addition to the physical card (01:00.0):

01:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:00.4 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:00.5 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:00.6 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:00.7 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:01.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:01.1 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:01.2 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:01.3 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:01.4 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:01.5 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:01.6 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:01.7 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:02.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:02.1 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:02.2 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:02.3 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:02.4 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:02.5 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:02.6 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:02.7 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:03.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:03.1 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:03.2 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
01:03.3 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)

Guest Configuration

General Setup

First, set up a VM as you normally would, without adding a vGPU.

After configuring the VM to your liking, shut down the VM and add a vGPU by selecting one of the virtual functions and selecting the appropriate mediated device type.

For example:

Via the CLI:

qm set VMID -hostpci0 01:00.4,mdev=nvidia-660

Via the web interface:

[Screenshot: Selecting a vGPU model]

To find the correct mediated device type, you can use sysfs. Here is a sample shell script that prints the type, then the name (which corresponds to the NVIDIA documentation) and the description, which contains helpful information (such as the maximum number of instances available). Adjust the PCI path to your needs:

#!/bin/sh
set -e

for i in /sys/bus/pci/devices/0000:01:00.4/mdev_supported_types/*; do
    basename "$i"
    cat "$i/name"
    cat "$i/description"
    echo
done

Since pve-manager version 7.2-8 and libpve-common-perl version 7.2-3, the GUI shows the correct name for the type.

If your qemu-server version is below 7.2-4, you must add an additional parameter to the VM:

# qm set VMID -args '-uuid <UUID-OF-THE-MDEV>'

The UUID of the mediated device is automatically generated from the VMID and the hostpciX index of the config, where the host PCI index is used as the first part and the VMID as the last part. For example, if you configure hostpci2 for a VM with VMID 12345, the generated UUID will be

00000002-0000-0000-0000-000000012345
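Assuming both parts are zero-padded decimal numbers, as the example suggests, this scheme can be reproduced with a single printf call, which is handy for scripting:

```shell
# Print the auto-generated mdev UUID for a given hostpciX index and VMID,
# assuming both parts are zero-padded decimal numbers as in the example.
mdev_uuid() {
    printf '%08d-0000-0000-0000-%012d\n' "$1" "$2"
}

mdev_uuid 2 12345    # prints 00000002-0000-0000-0000-000000012345
```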

You can now start the VM and continue configuring the guest from within.

We tested a Windows 10 and Ubuntu 22.04 installation, but the setup will be similar for other supported operating systems.

Windows 10

First install and configure a desktop sharing software that matches your requirements. Some examples of such software include:

  • VNC
    many different options, some free, some commercial
  • Remote Desktop
    built into Windows itself
  • Parsec
    Costs money for commercial use, allows using hardware accelerated encoding
  • RustDesk
    free and open source, but relatively new as of 2022

We used the simple Windows built-in Remote Desktop for testing.

[Screenshot: Enabling Remote Desktop in Windows 10]

Then you can install the Windows guest driver that is published by NVIDIA. Refer to their documentation[9] to find a compatible guest-driver-to-host-driver mapping. In our case this was the file

528.89_grid_win10_win11_server2019_server2022_dch_64bit_international.exe

Start the installer and follow the instructions; after it has finished, restart the guest as prompted.

From this point on, the integrated noVNC console of Proxmox VE will no longer be usable, so use your desktop sharing software to connect to the guest. You can now use the vGPU to start 3D applications such as Blender, 3D games, etc.

Ubuntu 22.04 Desktop

Before installing the guest driver, install and configure a desktop sharing software, for example:

  • VNC
    many options. We use x11vnc here, which is free and open source, but does not currently provide hardware accelerated encoding
  • NoMachine
    provides hardware accelerated encoding, but is not open source and costs money for business use
  • RustDesk
    free and open source, but relatively new as of 2022

We installed x11vnc in this example. While we're showing how to install and configure it, this is not the only way to set up desktop sharing properly.

Since Ubuntu 22.04 ships GDM3 + GNOME + Wayland by default, you first need to switch the login manager to one that uses X.org. We successfully tested LightDM here, but others may work as well.

# apt install lightdm

Select 'LightDM' as the default login manager when prompted. After that, install x11vnc with

# apt install x11vnc

We then added a systemd service that starts the VNC server on the x.org server provided by LightDM in /etc/systemd/system/x11vnc.service

[Unit]
Description=Start x11vnc
After=multi-user.target

[Service]
Type=simple
ExecStart=/usr/bin/x11vnc -display :0 -auth /var/run/lightdm/root/:0 -forever -loop -repeat -rfbauth /etc/x11vnc.passwd -rfbport 5900 -shared -noxdamage

[Install]
WantedBy=multi-user.target

You can set the password by executing:

# x11vnc -storepasswd /etc/x11vnc.passwd
# chmod 0400 /etc/x11vnc.passwd

After setting up LightDM and x11vnc and restarting the VM, you can connect via VNC.

Now, install the .deb package that NVIDIA provides for Ubuntu. Check the NVIDIA documentation[9] for a compatible guest driver to host driver mapping.

In our case this was nvidia-linux-grid-525_525.105.17_amd64.deb, and we installed it directly from the local file using apt. For that to work, you must prefix the relative path, for example ./ when the .deb file is located in the current directory.

# apt install ./nvidia-linux-grid-525_525.105.17_amd64.deb

Then generate the X.org configuration with NVIDIA's tool:

# nvidia-xconfig

Now you can reboot and use a VNC client to connect and use the vGPU for 3D applications.

Note: If you want to use CUDA on a Linux guest, you must install the CUDA Toolkit manually[10].

Check the NVIDIA documentation to see which CUDA version is supported by your vGPU drivers.

In our case we needed to install CUDA 11.6 (only the toolkit, not the driver) with the file:

cuda_11.6.2_510.47.03_linux.run 
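The runfile installer can be told to install only the toolkit; a sketch, assuming the --silent and --toolkit runfile options, which skip the interactive menu and the bundled driver component:

```shell
# Sketch: non-interactive, toolkit-only install from the CUDA runfile
# (--silent skips the menu, --toolkit installs only the toolkit,
# not the bundled driver).
install_cuda_toolkit() {
    sh cuda_11.6.2_510.47.03_linux.run --silent --toolkit
}
```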

Guest vGPU Licensing

To use the vGPU without restriction, you must adhere to NVIDIA's licensing. Check the NVIDIA vGPU documentation[11] for instructions on how to do so.

Tip: Ensure that the guest system time is properly synchronized using NTP, otherwise the guest will be unable to request a license for the vGPU.
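On systemd-based guests such as Ubuntu 22.04, the synchronization state can be checked with timedatectl; a sketch:

```shell
# Sketch: print whether the guest clock is NTP-synchronized
# ("yes"/"no" on systemd-based guests such as Ubuntu 22.04).
check_time_sync() {
    timedatectl show --property=NTPSynchronized --value
}
```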

Troubleshooting

A warning like the following might be logged by QEMU on VM startup. This usually only happens on consumer hardware that does not properly support PCIe AER[12] error recovery. It generally has no adverse effects on normal operation, but PCIe link errors might not be (soft-)recoverable in such cases.

 kvm: -device vfio-pci,host=0000:09:00.5,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0: warning: vfio 0000:09:00.5: Could not enable error recovery for the device

Notes