Recover From Grub Failure

== General advice ==
The following article provides pointers on how to prepare a <code>chroot</code> environment for Proxmox VE systems when repairing issues with boot loaders. One example is finding oneself confronted with:

* <code>grub rescue ></code>


You can use the current Proxmox installation ISO and select debug mode. On the second prompt you'll have the full Linux tools, including LVM, ZFS, and so on, available for mounting your filesystems and entering a <code>chroot</code> for repair. After you exit that prompt (using Ctrl+D or <code>exit</code>), you will come to the installation screens; simply hit abort there and reset the system.


Alternatively, one can use a 64-bit Ubuntu or Debian rescue CD, if you do not use ZFS as the root filesystem (ZFS is usually not available on most rescue CDs).

The following commands need to be run as <code>root</code>, or using <code>sudo</code> or similar. In the examples, we will use <code>/media/RESCUE</code> as the mountpoint for the root filesystem and <code>/dev/sdX</code> as the device on which Proxmox VE is installed.
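
If you are unsure which device actually holds the Proxmox VE installation, listing the block devices first can help; this is a minimal check, and the column selection is just one sensible choice:

# show block devices with size, filesystem type and current mountpoints
lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT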


Create the mountpoint:
mkdir /media/RESCUE


=== LVM (Ext4/XFS) based systems ===
Enable the volume group and all the logical volumes in it:
vgscan
vgchange -ay
 
Mount the relevant filesystems. Your paths will vary depending on your drive configuration.
mount /dev/pve/root /media/RESCUE/
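
You can verify the activated logical volumes with <code>lvs</code>. If your installation has a separate <code>/boot</code> partition, mount it inside the rescue root too; the partition number below is only an example:

# check that the logical volumes are active
lvs
# only if /boot is a separate partition on your system
mount /dev/sdX1 /media/RESCUE/boot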
 
=== ZFS based systems ===
Import the pool with an alternative root:
zpool import -f -R /media/RESCUE rpool
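
After the import you can check that the datasets were mounted under the alternative root. On a default Proxmox VE ZFS installation the root dataset is typically <code>rpool/ROOT/pve-1</code>, but names may differ on your system:

# list datasets and their (alt-rooted) mountpoints
zfs list -o name,mountpoint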
 
As the <code>hostid</code> of the rescue environment differs from that of the installed system, you will need to run <code>zpool import -f rpool</code> once in the initramfs after booting back into your system.
 
=== Mount relevant filesystems and hostpaths ===
 
mount -o rbind /proc /media/RESCUE/proc
mount -o rbind /sys /media/RESCUE/sys
mount -o rbind /dev /media/RESCUE/dev
mount -o rbind /run /media/RESCUE/run
 
=== Chroot and repair ===
Chroot into your install.
chroot /media/RESCUE
 
Inside the <code>chroot</code>, first check whether your system is using <code>proxmox-boot-tool</code>:
proxmox-boot-tool status
 
If it is not in use, it will print:
E: /etc/kernel/proxmox-boot-uuids does not exist.
 
* If <code>proxmox-boot-tool</code> is used, run:
proxmox-boot-tool reinit
 
* If not, mount the ESP and reinstall grub:
mount /dev/sdX2 /boot/efi
grub-install /dev/sdX
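
Optionally, while still inside the chroot, you can regenerate the grub configuration as well; this uses standard Debian tooling and is useful if the configuration itself may be stale:

# rebuild /boot/grub/grub.cfg
update-grub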


If there are no error messages, you should be able to reboot now.


Credit: https://www.nerdoncoffee.com/operating-systems/re-install-grub-on-proxmox/

== Recovering from grub "disk not found" error when booting from LVM ==
This section applies to the following setups:
* PVE 7.4 (or earlier) hosts with their boot disk on LVM
* PVE 8 hosts that have their boot disk on LVM, boot in UEFI mode and were upgraded from PVE 7
In these setups, the host might end up in a state in which grub fails to boot and prints an error <code>disk `lvmid/<vg uuid>/<lv uuid>` not found</code>. An example (of course, the UUIDs vary):
<nowiki>
Welcome to GRUB!
error: disk `lvmid/p3y5O2-jync-R2Ao-Gtlj-It3j-FZXE-ipEDYG/bApewq-qSRB-zYqT-mzvP-pGiV-VQaf-di4Rcz` not found.
grub rescue> </nowiki>
This error "disk `...` not found" error is originally [https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=987008 caused by a grub bug]. LVM metadata is stored on-disk in a ring buffer, so occasionally the current metadata will wrap around the end of the ring buffer. However, if there is a wraparound in the ring buffer, grub fails to parse the metadata and fails to boot with the above error.
The recommended steps differ between PVE 7.x and PVE 8.
=== PVE 7.x ===
This subsection applies to PVE 7.4 (or earlier) hosts with their boot disk on LVM.
PVE 7.4 ships <code>grub 2.06-3~deb11u5</code>, which is affected by the [https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=987008 bug] (earlier versions may also be affected). It has also been reported multiple times in the forum; see [https://forum.proxmox.com/threads/98761/ here] and [https://forum.proxmox.com/threads/123512/ here].
==== Temporary Workaround ====
In order to '''temporarily''' work around this bug and get the host to a bootable state again, it is sufficient to trigger an LVM metadata update. The updated metadata will reside in one contiguous section of the metadata ring buffer, so no wraparound occurs anymore. grub will then be able to parse the metadata correctly and boot again.
One simple way to trigger an LVM metadata update is to create a small logical volume (see the consolidated sketch after this list):
* Boot from a live USB/CD/DVD with LVM support, e.g. [https://grml.org/ grml]
* Run <code>vgscan</code>
* Create a 4MB logical volume named <code>grubtemp</code> in the <code>pve</code> volume group: <code>lvcreate -L 4M pve -n grubtemp</code>
* Reboot. PVE should boot normally again.
* You can now remove the <code>grubtemp</code> volume: <code>lvremove pve/grubtemp</code>
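
Put together, a minimal session from the live system might look like this (the volume group name <code>pve</code> is the Proxmox default; adjust it if yours differs):

# from the live system, as root
vgscan                          # detect volume groups
lvcreate -L 4M pve -n grubtemp  # tiny LV; its only purpose is to rewrite the LVM metadata
reboot
# once PVE boots normally again:
lvremove pve/grubtemp           # clean up the temporary LV
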
Note that there are many other options for triggering a metadata update, e.g. using <code>lvextend</code> to grow an existing logical volume or <code>lvchange --addtag</code> to add a tag to one.
The workaround is only temporary: If the host is (re)booted at a time when there is again a wraparound in the metadata ring buffer, grub will fail to boot again.
On a running PVE system, you can check whether there is a wraparound in the metadata ring buffer using the following command:
<nowiki>
vgscan -vvv 2>&1 | grep "Reading metadata" </nowiki>
If the output lines end with <code>(+0)</code>, there is no wraparound. If they end with <code>(+N)</code> for any other number <code>N</code>, there is a wraparound and grub will most likely fail to boot after a reboot.
==== Permanent Fix ====
The only '''permanent''' fix for PVE 7.x is:
* Apply the temporary workaround to be able to boot PVE again
* Upgrade to PVE 8 by following the [[Upgrade_from_7_to_8|upgrade guide]].
=== PVE 8 ===
This subsection applies to PVE 8 hosts that have their boot disk on LVM, boot in UEFI mode and were upgraded from PVE 7.
PVE 8 ships <code>grub 2.06-13</code> in which the [https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=987008 grub bug] is fixed. However, on hosts that boot in UEFI mode and were upgraded from PVE 7, it can happen that the updated grub 2.06-13 EFI binary is not installed to the EFI system partition (ESP) at <code>/boot/efi/EFI/proxmox/grubx64.efi</code>. As a result, when booting in UEFI mode, the host still runs the older <code>grub 2.06-3~deb11u5</code> binary that is affected by the grub bug. To find out whether this is the case, check its mtime using <code>ls -l /boot/efi/EFI/proxmox/grubx64.efi</code>. If it is older than the time of the upgrade from PVE 7 to 8, the host still runs the older grub binary when booting in UEFI mode.
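
For example (a sketch; the apt log location is the Debian default, and older entries may have been rotated into compressed files):

# mtime of the grub EFI binary currently on the ESP
ls -l /boot/efi/EFI/proxmox/grubx64.efi
# when apt last touched the grub packages, to compare against
zgrep -h grub /var/log/apt/history.log* 2>/dev/null | head
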
==== Temporary Workaround ====
The temporary workaround for PVE 8 to get the host in a bootable state [[#Temporary_Workaround|is the same as for PVE 7.x (see above)]].
==== Permanent Fix ====
The issue can be fixed permanently on PVE 8 by installing the correct grub metapackage for UEFI and choosing the correct UEFI boot entry.
First, apply the [[#Temporary_Workaround|temporary workaround]] to be able to boot into PVE 8 again. When booted into PVE 8, run the following command; it checks whether the host is indeed booted in UEFI mode and, if so, installs the correct grub metapackage for UEFI:
<nowiki>
[ -d /sys/firmware/efi ] && apt install grub-efi-amd64 </nowiki>
This will remove the <code>grub-pc</code> package, and update the binary on the ESP. You can verify that the mtime of <code>/boot/efi/EFI/proxmox/grubx64.efi</code> was updated.
Note that this will not update the default EFI binary at <code>/boot/efi/EFI/BOOT/BOOTx64.EFI</code>, which might still be the grub binary that is affected by the bug. Consequently, make sure that you select the <code>proxmox</code> boot entry when booting in UEFI mode. If needed, you can adjust the boot order directly in the UEFI firmware or using the <code>efibootmgr</code> tool (see [https://manpages.debian.org/stable/efibootmgr/efibootmgr.8.en.html#Changing_the_boot_order its manpage]).
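
For example, to inspect and reorder the boot entries with <code>efibootmgr</code> (the entry numbers are purely illustrative; use the ones shown on your own system):

# list current boot entries and the boot order
efibootmgr
# put the proxmox entry (here assumed to be Boot0002) first
efibootmgr -o 0002,0001,0000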
