Difference between revisions of "Recover From Grub Failure"

From Proxmox VE
(update section on "disk not found" error: separate instructions for PVE 7 and PVE 8)
 
(8 intermediate revisions by 4 users not shown)
Line 1: Line 1:
During the upgrade from 3.x to 4.x, I found myself without a working grub and unable to boot.
+
== General advice ==
I attempted to use the proxmox 4.1 install disk, but found it had a bug where the prompt would not accept input.
 
  
You'll need an ISO for a 64-bit version of Ubuntu; I used 14.04 LTS.
+
During the upgrade from 3.x to 4.x, I found myself without a working grub and unable to boot. The monitor showed:
 +
*<code>grub rescue ></code>
  
Boot ubuntu off the ISO.  We do not want to install ubuntu, just run it live off the ISO/DVD.
+
You can use the Proxmox installation ISO, version 5.4 or newer, and select debug mode.
 +
On the second prompt you'll have the full Linux tools, including LVM, ZFS, ..., available.
 +
If you exit that prompt you will come to the installation screens, simply hit abort there.
 +
 
 +
Alternatively, you can use a 64-bit version of Ubuntu or a Debian rescue CD.
 +
 
 +
Boot Proxmox VE in debug mode, or boot Ubuntu/Debian off the ISO.  We do not want to install Ubuntu/Debian, just run it live off the ISO/DVD.
  
 
First, we need to activate LVM and mount the root partition that is inside the LVM container.
 
First, we need to activate LVM and mount the root partition that is inside the LVM container.
*sudo vgscan
+
*<code>sudo vgscan</code>
*sudo vgchange -ay
+
*<code>sudo vgchange -ay</code>
  
Mount all the filesystems that are already there so we can upgrade/install grub. Your paths may vary depending on your drive configuration.
+
Mount all the filesystems that are already there so we can upgrade/install grub. Your paths may vary depending on your drive configuration.
*sudo mkdir /media/USB
+
*<code>sudo mkdir /media/RESCUE</code>
*sudo mount /dev/pve/root /media/USB/
+
*<code>sudo mount /dev/pve/root /media/RESCUE/</code>
*sudo mount /dev/sda1 /media/USB/boot
+
*<code>sudo mount /dev/sda1 /media/RESCUE/boot</code>
*sudo mount -t proc proc /media/USB/proc
+
*<code>sudo mount -t proc proc /media/RESCUE/proc</code>
*sudo mount -t sysfs sys /media/USB/sys
+
*<code>sudo mount -t sysfs sys /media/RESCUE/sys</code>
*sudo mount -o bind /dev /media/USB/dev
+
*<code>sudo mount -o bind /dev /media/RESCUE/dev</code>
 +
*<code>sudo mount -o bind /run /media/RESCUE/run</code>
  
 
Chroot into your proxmox install.
 
Chroot into your proxmox install.
*chroot /media/USB
+
*<code>chroot /media/RESCUE</code>
  
Then upgrade grub and install it.
+
Then update grub and install it.
*grub-upgrade
+
*<code>update-grub</code>
*grub-install /dev/sda
+
*<code>grub-install /dev/sda</code>
  
 
If there are no error messages, you should be able to reboot now.
 
If there are no error messages, you should be able to reboot now.
  
 
Credit: https://www.nerdoncoffee.com/operating-systems/re-install-grub-on-proxmox/
 
Credit: https://www.nerdoncoffee.com/operating-systems/re-install-grub-on-proxmox/
 +
 +
== Recovering from grub "disk not found" error when booting from LVM ==
 +
 +
This section applies to the following setups:
 +
 +
* PVE 7.4 (or earlier) hosts with their boot disk on LVM
 +
* PVE 8 hosts that have their boot disk on LVM, boot in UEFI mode and were upgraded from PVE 7
 +
 +
In these setups, the host might end up in a state in which grub fails to boot and prints an error <code>disk `lvmid/<vg uuid>/<lv uuid>` not found</code>. An example (of course, the UUIDs vary):
 +
 +
<nowiki>
 +
Welcome to GRUB!
 +
 +
error: disk `lvmid/p3y5O2-jync-R2Ao-Gtlj-It3j-FZXE-ipEDYG/bApewq-qSRB-zYqT-mzvP-pGiV-VQaf-di4Rcz` not found.
 +
grub rescue> </nowiki>
 +
 +
The "disk `...` not found" error is [https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=987008 caused by a grub bug]. LVM metadata is stored on disk in a ring buffer, so the current metadata occasionally wraps around the end of the buffer. When that happens, grub fails to parse the metadata and fails to boot with the above error.
 +
 +
The recommended steps differ between PVE 7.x and PVE 8.
 +
 +
=== PVE 7.x ===
 +
 +
This subsection applies to PVE 7.4 (or earlier) hosts with their boot disk on LVM.
 +
 +
PVE 7.4 ships <code>grub 2.06-3~deb11u5</code> which is affected by the [https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=987008 bug] (though earlier versions may also be affected). This was also reported multiple times in the forum already, see [https://forum.proxmox.com/threads/98761/ here] and [https://forum.proxmox.com/threads/123512/ here].
 +
 +
==== Temporary Workaround ====
 +
 +
In order to '''temporarily''' work around this bug and get the host to a bootable state again, it is sufficient to trigger an LVM metadata update. The updated metadata will reside in one contiguous section of the metadata ring buffer, so no wraparound occurs anymore. grub will then be able to parse the metadata correctly and boot again.
 +
 +
One simple way to trigger an LVM metadata update is to create a small logical volume:
 +
* Boot from a live USB/CD/DVD with LVM support, e.g. [https://grml.org/ grml]
 +
* Run <code>vgscan</code>
 +
* Create a 4MB logical volume named <code>grubtemp</code> in the <code>pve</code> volume group: <code>lvcreate -L 4M pve -n grubtemp</code>
 +
* Reboot. PVE should boot normally again.
 +
* You can now remove the <code>grubtemp</code> volume: <code>lvremove pve/grubtemp</code>
 +
 +
Note that there are many other options for triggering a metadata update, e.g. using <code>lvextend</code> to extend an existing logical volume or <code>lvchange --addtag</code> to add a tag to an existing logical volume.
 +
 +
The workaround is only temporary: If the host is (re)booted at a time when there is again a wraparound in the metadata ring buffer, grub will fail to boot again.
 +
 +
On a running PVE system, you can check whether there is a wraparound in the metadata ring buffer using the following command:
 +
 +
<nowiki>
 +
vgscan -vvv 2>&1 | grep "Reading metadata" </nowiki>
 +
 +
If the output lines end with <code>(+0)</code>, there is no wraparound. If they end with <code>(+N)</code> for any other number <code>N</code>, there is a wraparound and grub will most likely fail to boot after a reboot.
 +
 +
==== Permanent Fix ====
 +
 +
The only '''permanent''' fix for PVE 7.x is:
 +
* Apply the temporary workaround to be able to boot PVE again
 +
* Upgrade to PVE 8 by following the [[Upgrade_from_7_to_8|upgrade guide]].
 +
 +
=== PVE 8 ===
 +
 +
This subsection applies to PVE 8 hosts that have their boot disk on LVM, boot in UEFI mode and were upgraded from PVE 7.
 +
 +
PVE 8 ships <code>grub 2.06-13</code> in which the [https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=987008 grub bug] is fixed. However, on hosts that boot in UEFI mode and were upgraded from PVE 7, it can happen that the updated grub 2.06-13 EFI binary is not installed to the EFI system partition (ESP) at <code>/boot/efi/EFI/proxmox/grubx64.efi</code>. As a result, when booting in UEFI mode, the host still runs the older <code>grub 2.06-3~deb11u5</code> binary that is affected by the grub bug. To find out whether this is the case, check its mtime using <code>ls -l /boot/efi/EFI/proxmox/grubx64.efi</code>. If it is older than the time of the upgrade from PVE 7 to 8, the host still runs the older grub binary when booting in UEFI mode.
 +
 +
==== Temporary Workaround ====
 +
 +
The temporary workaround for PVE 8 to get the host in a bootable state [[#Temporary_Workaround|is the same as for PVE 7.x (see above)]].
 +
 +
==== Permanent Fix ====
 +
 +
The issue can be fixed permanently on PVE 8 by installing the correct grub metapackage for UEFI and choosing the correct UEFI boot entry.
 +
 +
First, apply the [[#Temporary_Workaround|temporary workaround]] to be able to boot into PVE 8 again. When booted into PVE 8, run the following command. It checks if the host is indeed booted in UEFI mode, and if yes, installs the correct grub metapackage for UEFI:
 +
 +
<nowiki>
 +
[ -d /sys/firmware/efi ] && apt install grub-efi-amd64 </nowiki>
 +
 +
This will remove the <code>grub-pc</code> package, and update the binary on the ESP. You can verify that the mtime of <code>/boot/efi/EFI/proxmox/grubx64.efi</code> was updated.
 +
 +
Note that this will not update the default EFI binary at <code>/boot/efi/EFI/BOOT/BOOTx64.EFI</code>, which might still be the grub binary that is affected by the bug. Consequently, make sure that you select the <code>proxmox</code> boot entry when booting in UEFI mode. If needed, you can adjust the boot order directly in the UEFI firmware or using the <code>efibootmgr</code> tool (see [https://manpages.debian.org/stable/efibootmgr/efibootmgr.8.en.html#Changing_the_boot_order its manpage]).

Latest revision as of 07:39, 23 October 2023

General advice

During the upgrade from 3.x to 4.x, I found myself without a working grub and unable to boot. The monitor showed:

  • grub rescue >

You can use the Proxmox installation ISO, version 5.4 or newer, and select debug mode. At the second prompt you'll have the full set of Linux tools available, including LVM and ZFS. If you exit that prompt you will reach the installation screens; simply hit abort there.

Alternatively, you can use a 64-bit version of Ubuntu or a Debian rescue CD.

Boot Proxmox VE in debug mode, or boot Ubuntu/Debian off the ISO. We do not want to install Ubuntu/Debian, just run it live off the ISO/DVD.

First, we need to activate LVM and mount the root partition that is inside the LVM container.

  • sudo vgscan
  • sudo vgchange -ay

Mount all the filesystems that are already there so we can upgrade/install grub. Your paths may vary depending on your drive configuration.

  • sudo mkdir /media/RESCUE
  • sudo mount /dev/pve/root /media/RESCUE/
  • sudo mount /dev/sda1 /media/RESCUE/boot
  • sudo mount -t proc proc /media/RESCUE/proc
  • sudo mount -t sysfs sys /media/RESCUE/sys
  • sudo mount -o bind /dev /media/RESCUE/dev
  • sudo mount -o bind /run /media/RESCUE/run

Chroot into your proxmox install.

  • sudo chroot /media/RESCUE

Then update grub and install it.

  • update-grub
  • grub-install /dev/sda

If there are no error messages, you should be able to reboot now.
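
The steps above can be collected into one script. The following is only a sketch under the same assumptions as the list (root LV at /dev/pve/root, /boot on /dev/sda1, grub installed to /dev/sda); the `run` helper and the `DRY_RUN` variable are hypothetical names introduced here, not part of any Proxmox tooling. By default it only prints the commands, so you can review them before running anything as root.

```shell
#!/bin/sh
# Sketch of the chroot-based grub reinstall described above.
# Assumptions: run as root from the live/debug environment; the device
# names below match your system. DRY_RUN and run() are illustrative only.

run() {
    # Print the command; execute it only when DRY_RUN is set to something other than 1.
    echo "+ $*"
    if [ "${DRY_RUN:-1}" != "1" ]; then
        "$@"
    fi
}

rescue_grub() {
    target=${1:-/media/RESCUE}   # mount point for the installed system
    boot_part=${2:-/dev/sda1}    # partition holding /boot
    boot_disk=${3:-/dev/sda}     # disk that grub is installed to

    run vgscan || return 1
    run vgchange -ay || return 1
    run mkdir -p "$target" || return 1
    run mount /dev/pve/root "$target" || return 1
    run mount "$boot_part" "$target/boot" || return 1
    run mount -t proc proc "$target/proc" || return 1
    run mount -t sysfs sys "$target/sys" || return 1
    run mount -o bind /dev "$target/dev" || return 1
    run mount -o bind /run "$target/run" || return 1
    # Run the two grub commands inside the chroot instead of an interactive shell.
    run chroot "$target" update-grub || return 1
    run chroot "$target" grub-install "$boot_disk" || return 1
}

# Review the plan first:   rescue_grub
# Then actually run it:    DRY_RUN=0 rescue_grub
```

Stopping at the first failing command mirrors the advice above: only reboot if there were no error messages.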

Credit: https://www.nerdoncoffee.com/operating-systems/re-install-grub-on-proxmox/

Recovering from grub "disk not found" error when booting from LVM

This section applies to the following setups:

  • PVE 7.4 (or earlier) hosts with their boot disk on LVM
  • PVE 8 hosts that have their boot disk on LVM, boot in UEFI mode and were upgraded from PVE 7

In these setups, the host might end up in a state in which grub fails to boot and prints an error disk `lvmid/<vg uuid>/<lv uuid>` not found. An example (of course, the UUIDs vary):

Welcome to GRUB!

error: disk `lvmid/p3y5O2-jync-R2Ao-Gtlj-It3j-FZXE-ipEDYG/bApewq-qSRB-zYqT-mzvP-pGiV-VQaf-di4Rcz` not found.
grub rescue> 

The "disk `...` not found" error is caused by a grub bug (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=987008). LVM metadata is stored on disk in a ring buffer, so the current metadata occasionally wraps around the end of the buffer. When that happens, grub fails to parse the metadata and fails to boot with the above error.

The recommended steps differ between PVE 7.x and PVE 8.

PVE 7.x

This subsection applies to PVE 7.4 (or earlier) hosts with their boot disk on LVM.

PVE 7.4 ships grub 2.06-3~deb11u5, which is affected by the bug (earlier versions may also be affected). This has been reported multiple times in the forum, see https://forum.proxmox.com/threads/98761/ and https://forum.proxmox.com/threads/123512/.

Temporary Workaround

In order to temporarily work around this bug and get the host to a bootable state again, it is sufficient to trigger an LVM metadata update. The updated metadata will reside in one contiguous section of the metadata ring buffer, so no wraparound occurs anymore. grub will then be able to parse the metadata correctly and boot again.

One simple way to trigger an LVM metadata update is to create a small logical volume:

  • Boot from a live USB/CD/DVD with LVM support, e.g. grml
  • Run vgscan
  • Create a 4MB logical volume named grubtemp in the pve volume group: lvcreate -L 4M pve -n grubtemp
  • Reboot. PVE should boot normally again.
  • You can now remove the grubtemp volume: lvremove pve/grubtemp

Note that there are many other options for triggering a metadata update, e.g. using lvextend to extend an existing logical volume or lvchange --addtag to add a tag to an existing logical volume.
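
As a rough illustration of two of these options, the sketch below prints the command pairs for the temporary-volume method and for a tag-based method. The function name `metadata_update_cmds` is hypothetical, the tag name `grubtemp` is reused from the volume example purely for illustration, and the tag variant assumes the root logical volume is named `root` (as in /dev/pve/root). The function only prints commands; it does not change any LVM state.

```shell
#!/bin/sh
# Print (do not run) the commands for one way of triggering an LVM
# metadata update. The first command is run from the live system; the
# second is the cleanup to run once PVE has booted normally again.
metadata_update_cmds() {
    vg=${1:-pve}        # volume group name, "pve" on a default install
    method=${2:-lv}     # "lv" = temporary volume, "tag" = temporary tag
    case "$method" in
        lv)
            echo "lvcreate -L 4M $vg -n grubtemp"
            echo "lvremove $vg/grubtemp"
            ;;
        tag)
            # Assumes the root LV is called "root".
            echo "lvchange --addtag grubtemp $vg/root"
            echo "lvchange --deltag grubtemp $vg/root"
            ;;
        *)
            echo "unknown method: $method" >&2
            return 1
            ;;
    esac
}

# Example: metadata_update_cmds pve tag
```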

The workaround is only temporary: If the host is (re)booted at a time when there is again a wraparound in the metadata ring buffer, grub will fail to boot again.

On a running PVE system, you can check whether there is a wraparound in the metadata ring buffer using the following command:

vgscan -vvv 2>&1 | grep "Reading metadata" 

If the output lines end with (+0), there is no wraparound. If they end with (+N) for any other number N, there is a wraparound and grub will most likely fail to boot after a reboot.
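
The check above can be wrapped in a small helper that interprets the output for you. This is a sketch only: the function name `check_wraparound` is made up here, and it relies solely on the rule stated above (lines containing "Reading metadata" that end in (+0) versus (+N)). It reads the vgscan output from standard input, so it does not need to call LVM itself.

```shell
#!/bin/sh
# Read `vgscan -vvv` output on stdin and report whether any
# "Reading metadata" line ends in (+N) with N other than 0.
# Exit status: 0 = no wraparound, 1 = wraparound, 2 = no matching lines.
check_wraparound() {
    lines=$(grep "Reading metadata" || true)
    if [ -z "$lines" ]; then
        echo "no 'Reading metadata' lines found" >&2
        return 2
    fi
    # A non-zero number inside the trailing (+N) means a wraparound.
    if printf '%s\n' "$lines" | grep -Eq '\(\+0*[1-9][0-9]*\)[[:space:]]*$'; then
        echo "wraparound detected"
        return 1
    fi
    echo "no wraparound"
    return 0
}

# Usage on a running PVE system:
#   vgscan -vvv 2>&1 | check_wraparound
```

The non-zero exit status on a wraparound makes it possible to call the helper from a pre-reboot check, so that the temporary workaround can be applied before the host goes down.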

Permanent Fix

The only permanent fix for PVE 7.x is:

  • Apply the temporary workaround to be able to boot PVE again
  • Upgrade to PVE 8 by following the upgrade guide (wiki page "Upgrade from 7 to 8").

PVE 8

This subsection applies to PVE 8 hosts that have their boot disk on LVM, boot in UEFI mode and were upgraded from PVE 7.

PVE 8 ships grub 2.06-13, in which the grub bug is fixed. However, on hosts that boot in UEFI mode and were upgraded from PVE 7, it can happen that the updated grub 2.06-13 EFI binary is not installed to the EFI system partition (ESP) at /boot/efi/EFI/proxmox/grubx64.efi. As a result, when booting in UEFI mode, the host still runs the older grub 2.06-3~deb11u5 binary that is affected by the grub bug. To find out whether this is the case, check the mtime of the binary using ls -l /boot/efi/EFI/proxmox/grubx64.efi. If it is older than the time of the upgrade from PVE 7 to 8, the host still runs the older grub binary when booting in UEFI mode.
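
The mtime comparison can also be scripted rather than read off the ls output. The helper below is a sketch that assumes GNU stat and date, as shipped with Debian-based PVE; the function name `is_older_than` is hypothetical, and the date in the usage comment is only a placeholder for the time at which you actually upgraded.

```shell
#!/bin/sh
# Return 0 if FILE was last modified before REF_EPOCH (seconds since
# the Unix epoch), 1 if not, and 2 if FILE cannot be examined.
is_older_than() {
    file=$1
    ref_epoch=$2
    # %Y prints the modification time as seconds since the epoch (GNU stat).
    mtime=$(stat -c %Y "$file") || return 2
    [ "$mtime" -lt "$ref_epoch" ]
}

# Usage (replace the placeholder date with your own upgrade time):
#   if is_older_than /boot/efi/EFI/proxmox/grubx64.efi "$(date -d '2023-07-01' +%s)"; then
#       echo "EFI grub binary predates the upgrade"
#   fi
```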

Temporary Workaround

The temporary workaround for PVE 8 to get the host in a bootable state is the same as for PVE 7.x (see above).

Permanent Fix

The issue can be fixed permanently on PVE 8 by installing the correct grub metapackage for UEFI and choosing the correct UEFI boot entry.

First, apply the temporary workaround to be able to boot into PVE 8 again. Once booted into PVE 8, run the following command. It checks whether the host is indeed booted in UEFI mode and, if so, installs the correct grub metapackage for UEFI:

[ -d /sys/firmware/efi ] && apt install grub-efi-amd64 

This will remove the grub-pc package, and update the binary on the ESP. You can verify that the mtime of /boot/efi/EFI/proxmox/grubx64.efi was updated.

Note that this will not update the default EFI binary at /boot/efi/EFI/BOOT/BOOTx64.EFI, which might still be the grub binary that is affected by the bug. Consequently, make sure that you select the proxmox boot entry when booting in UEFI mode. If needed, you can adjust the boot order directly in the UEFI firmware or using the efibootmgr tool (see its manpage at https://manpages.debian.org/stable/efibootmgr/efibootmgr.8.en.html#Changing_the_boot_order).
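
To adjust the boot order with efibootmgr you first need the number of the proxmox entry. The sketch below extracts it from the efibootmgr listing; the function name `find_boot_entry` is hypothetical, and it assumes the usual listing format in which each entry line starts with "Boot", a four-digit hexadecimal number and an optional "*", followed by the label. It only matches single-word labels such as proxmox.

```shell
#!/bin/sh
# Read `efibootmgr` output on stdin and print the four-digit number of
# the first boot entry whose label equals the given word.
find_boot_entry() {
    awk -v label="$1" '
        /^Boot[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]/ {
            # $1 is e.g. "Boot0003*"; characters 5 to 8 are the entry number.
            if ($2 == label) { print substr($1, 5, 4); exit }
        }'
}

# Usage:
#   num=$(efibootmgr | find_boot_entry proxmox)
#   efibootmgr -o "$num"    # boot order now contains only this entry;
#                           # append further numbers, comma-separated, to keep them
```

Lines such as BootCurrent and BootOrder are skipped because they do not have four hexadecimal digits directly after "Boot".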