Recover From Grub Failure

From Proxmox VE
Revision as of 07:39, 23 October 2023 by Fweber (talk | contribs) (update section on "disk not found" error: separate instructions for PVE 7 and PVE 8)

General advice

During the upgrade from 3.x to 4.x, I found myself without a working grub and unable to boot. The monitor showed:

  • grub rescue >

You can use the Proxmox VE installation ISO, version 5.4 or newer, and select debug mode. At the second prompt, the full set of Linux tools, including LVM and ZFS, is available. If you exit that prompt, the installation screens appear; simply hit abort there.

Alternatively, one can use a 64 bit version of Ubuntu or Debian Rescue CD.

Boot the Proxmox VE ISO in debug mode, or boot Ubuntu/Debian live from the ISO/DVD. We do not want to install Ubuntu/Debian, just run it live.

First, we need to activate LVM and mount the root partition that is inside the LVM container.

  • sudo vgscan
  • sudo vgchange -ay

Mount all the filesystems that are already there so we can upgrade/install grub. Your paths may vary depending on your drive configuration.

  • sudo mkdir /media/RESCUE
  • sudo mount /dev/pve/root /media/RESCUE/
  • sudo mount /dev/sda1 /media/RESCUE/boot
  • sudo mount -t proc proc /media/RESCUE/proc
  • sudo mount -t sysfs sys /media/RESCUE/sys
  • sudo mount -o bind /dev /media/RESCUE/dev
  • sudo mount -o bind /run /media/RESCUE/run
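Because the device paths vary between systems, it can help to generate the mount commands for your layout and review them before running anything. The following is a minimal sketch, not part of Proxmox VE: the helper name print_rescue_mounts, the optional mount-point argument and the option to pass an empty boot device (for layouts without a separate /boot partition) are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: print (do not run) the mount commands from the list above
# for a given device layout. print_rescue_mounts is a hypothetical helper.
print_rescue_mounts() {
    root_dev=$1                  # e.g. /dev/pve/root
    boot_dev=$2                  # e.g. /dev/sda1; pass "" if there is no separate /boot
    target=${3:-/media/RESCUE}   # mount point, defaults to the one used above

    echo "mkdir -p $target"
    echo "mount $root_dev $target"
    if [ -n "$boot_dev" ]; then
        echo "mount $boot_dev $target/boot"
    fi
    echo "mount -t proc proc $target/proc"
    echo "mount -t sysfs sys $target/sys"
    echo "mount -o bind /dev $target/dev"
    echo "mount -o bind /run $target/run"
}

# Example: review the printed commands first, then run them as root, e.g.
# print_rescue_mounts /dev/pve/root /dev/sda1 | sudo sh -e
```

Printing the commands instead of executing them keeps the helper harmless if the wrong device is passed by mistake.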

Chroot into your Proxmox VE installation.

  • sudo chroot /media/RESCUE

Then update grub and install it.

  • update-grub
  • grub-install /dev/sda

If there are no error messages, you should be able to reboot now.

Credit: https://www.nerdoncoffee.com/operating-systems/re-install-grub-on-proxmox/

Recovering from grub "disk not found" error when booting from LVM

This section applies to the following setups:

  • PVE 7.4 (or earlier) hosts with their boot disk on LVM
  • PVE 8 hosts that have their boot disk on LVM, boot in UEFI mode and were upgraded from PVE 7

In these setups, the host might end up in a state in which grub fails to boot and prints the error disk `lvmid/<vg uuid>/<lv uuid>` not found. An example (the UUIDs will vary):

Welcome to GRUB!

error: disk `lvmid/p3y5O2-jync-R2Ao-Gtlj-It3j-FZXE-ipEDYG/bApewq-qSRB-zYqT-mzvP-pGiV-VQaf-di4Rcz` not found.
grub rescue> 

The "disk `...` not found" error is caused by a grub bug. LVM metadata is stored on-disk in a ring buffer, so occasionally the current metadata will wrap around the end of the ring buffer. When such a wraparound occurs, grub fails to parse the metadata and fails to boot with the above error.

The recommended steps differ between PVE 7.x and PVE 8.

PVE 7.x

This subsection applies to PVE 7.4 (or earlier) hosts with their boot disk on LVM.

PVE 7.4 ships grub 2.06-3~deb11u5, which is affected by the bug (earlier versions may also be affected). The issue has also been reported multiple times in the forum.

Temporary Workaround

In order to temporarily work around this bug and get the host to a bootable state again, it is sufficient to trigger an LVM metadata update. The updated metadata will reside in one contiguous section of the metadata ring buffer, so no wraparound occurs anymore. grub will then be able to parse the metadata correctly and boot again.

One simple way to trigger an LVM metadata update is to create a small logical volume:

  • Boot from a live USB/CD/DVD with LVM support, e.g. grml
  • Run vgscan
  • Create a 4MB logical volume named grubtemp in the pve volume group: lvcreate -L 4M pve -n grubtemp
  • Reboot. PVE should boot normally again.
  • You can now remove the grubtemp volume: lvremove pve/grubtemp

Note that there are many other ways to trigger a metadata update, e.g. using lvextend to extend an existing logical volume or lvchange --addtag to add a tag to an existing logical volume.

The workaround is only temporary: If the host is (re)booted at a time when there is again a wraparound in the metadata ring buffer, grub will fail to boot again.

On a running PVE system, you can check whether there is a wraparound in the metadata ring buffer using the following command:

vgscan -vvv 2>&1 | grep "Reading metadata" 

If the output lines end with (+0), there is no wraparound. If they end with (+N) for any other number N, there is a wraparound and grub will most likely fail to boot after a reboot.
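The check described above can be scripted. The following is a minimal sketch: has_wraparound is a hypothetical helper name, and it relies only on what is stated above, namely that the relevant lines contain "Reading metadata" and end with (+N).

```shell
#!/bin/sh
# Sketch: read "vgscan -vvv" output on stdin and report a wraparound.
# has_wraparound is a hypothetical helper, not an LVM or Proxmox VE tool.
# Exit status 0: at least one "Reading metadata" line ends with (+N), N != 0.
# Exit status 1: no such line was found.
has_wraparound() {
    awk '
        /Reading metadata/ {
            sub(/[ \t]+$/, "")                    # drop trailing whitespace
            if (match($0, /\(\+[0-9]+\)$/)) {
                # extract the digits between "(+" and ")"
                n = substr($0, RSTART + 2, RLENGTH - 3)
                if (n + 0 != 0) found = 1
            }
        }
        END { exit(found ? 0 : 1) }
    '
}

# Example, run as root on the PVE host:
# vgscan -vvv 2>&1 | has_wraparound && echo "wraparound present, grub may fail on reboot"
```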

Permanent Fix

The only permanent fix for PVE 7.x is:

  • Apply the temporary workaround to be able to boot PVE again
  • Upgrade to PVE 8 by following the upgrade guide.

PVE 8

This subsection applies to PVE 8 hosts that have their boot disk on LVM, boot in UEFI mode and were upgraded from PVE 7.

PVE 8 ships grub 2.06-13, in which the grub bug is fixed. However, on hosts that boot in UEFI mode and were upgraded from PVE 7, it can happen that the updated grub 2.06-13 EFI binary is not installed to the EFI system partition (ESP) at /boot/efi/EFI/proxmox/grubx64.efi. As a result, when booting in UEFI mode, the host still runs the older grub 2.06-3~deb11u5 binary that is affected by the grub bug. To find out whether this is the case, check the mtime of the binary using ls -l /boot/efi/EFI/proxmox/grubx64.efi. If the mtime is older than the time of the upgrade from PVE 7 to 8, the host still runs the older grub binary when booting in UEFI mode.
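The mtime comparison can also be done in a script rather than by reading the ls -l output. The following is a minimal sketch assuming GNU stat and date, as found on Debian-based systems; efi_binary_is_stale is a hypothetical helper name, and the upgrade date in the example is a placeholder that you need to replace with your own.

```shell
#!/bin/sh
# Sketch: check whether a file was last modified before a given point in time.
# efi_binary_is_stale is a hypothetical helper, not a Proxmox VE tool.
# Usage: efi_binary_is_stale FILE UPGRADE_EPOCH
# Exit status 0: FILE is older than UPGRADE_EPOCH (seconds since the Unix epoch).
# Exit status 1: FILE is not older. Exit status 2: FILE could not be read.
efi_binary_is_stale() {
    file=$1
    upgrade_epoch=$2
    mtime=$(stat -c %Y "$file") || return 2   # GNU stat: mtime in epoch seconds
    [ "$mtime" -lt "$upgrade_epoch" ]
}

# Example (the date is a placeholder for the day of your PVE 7 to 8 upgrade):
# efi_binary_is_stale /boot/efi/EFI/proxmox/grubx64.efi "$(date -d '2023-07-01' +%s)" \
#     && echo "grubx64.efi predates the upgrade"
```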

Temporary Workaround

The temporary workaround for PVE 8 to get the host in a bootable state is the same as for PVE 7.x (see above).

Permanent Fix

The issue can be fixed permanently on PVE 8 by installing the correct grub metapackage for UEFI and choosing the correct UEFI boot entry.

First, apply the temporary workaround to be able to boot into PVE 8 again. Once booted into PVE 8, run the following command. It checks whether the host is indeed booted in UEFI mode and, if so, installs the correct grub metapackage for UEFI:

[ -d /sys/firmware/efi ] && apt install grub-efi-amd64 

This will remove the grub-pc package, and update the binary on the ESP. You can verify that the mtime of /boot/efi/EFI/proxmox/grubx64.efi was updated.

Note that this will not update the default EFI binary at /boot/efi/EFI/BOOT/BOOTx64.EFI, which might still be the grub binary that is affected by the bug. Consequently, make sure that you select the proxmox boot entry when booting in UEFI mode. If needed, you can adjust the boot order directly in the UEFI firmware or using the efibootmgr tool (see its manpage).
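To adjust the boot order with efibootmgr, you first need the number of the proxmox entry. The following is a minimal sketch that extracts it from the efibootmgr listing; find_boot_entry is a hypothetical helper name, and it assumes the usual listing format in which each entry line starts with BootXXXX followed by the label.

```shell
#!/bin/sh
# Sketch: read "efibootmgr" output on stdin and print the 4-digit hex number
# of the first boot entry whose label starts with the given word.
# find_boot_entry is a hypothetical helper, not part of efibootmgr.
find_boot_entry() {
    awk -v name="$1" '
        # Entry lines look like "Boot0004* proxmox"; lines such as
        # "BootOrder: ..." or "BootCurrent: ..." do not match this pattern.
        /^Boot[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]/ {
            num = substr($1, 5, 4)
            if ($2 == name) { print num; exit }
        }
    '
}

# Example, run as root on a host booted in UEFI mode:
# entry=$(efibootmgr | find_boot_entry proxmox)
# echo "proxmox boot entry: $entry"
# The order can then be set with "efibootmgr -o", listing $entry first;
# check the manpage and your current BootOrder before changing anything.
```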