Recover From Grub Failure
Latest revision as of 07:39, 23 October 2023
General advice
During the upgrade from 3.x to 4.x, I found myself without a working grub and unable to boot. The monitor showed:
grub rescue >
You can use the Proxmox VE installation ISO in version 5.4 or newer and select debug mode. At the second prompt, the full set of Linux tools, including LVM, ZFS, ..., is available. If you exit that prompt, you will reach the installation screens; simply hit abort there.
Alternatively, you can use a 64-bit version of an Ubuntu or Debian rescue CD.
Boot the Proxmox VE ISO in debug mode, or boot Ubuntu/Debian from its ISO. Do not install Ubuntu/Debian; just run it live from the ISO/DVD.
First, activate LVM so that the root partition inside the LVM container can be mounted.
sudo vgscan
sudo vgchange -ay
Mount all the filesystems that are already there so we can upgrade/install grub. Your paths may vary depending on your drive configuration.
sudo mkdir /media/RESCUE
sudo mount /dev/pve/root /media/RESCUE/
sudo mount /dev/sda1 /media/RESCUE/boot
sudo mount -t proc proc /media/RESCUE/proc
sudo mount -t sysfs sys /media/RESCUE/sys
sudo mount -o bind /dev /media/RESCUE/dev
sudo mount -o bind /run /media/RESCUE/run
Chroot into your Proxmox install.
sudo chroot /media/RESCUE
Then update grub and install it.
update-grub
grub-install /dev/sda
If there are no error messages, you should be able to reboot now.
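Since the device paths vary between hosts, the steps above can be sketched as a small script with the paths factored out. This is a minimal sketch, not a tested recovery tool: the variable names (ROOT_LV, BOOT_PART, GRUB_DISK, TARGET, DRY_RUN) and the helper functions are made up for illustration, and it assumes you are already root in the live environment. With DRY_RUN=1 (the default here) it only prints the commands it would run.

```shell
#!/bin/sh
# Illustrative defaults taken from the example paths above; adjust to your layout.
ROOT_LV="${ROOT_LV:-/dev/pve/root}"
BOOT_PART="${BOOT_PART:-/dev/sda1}"
GRUB_DISK="${GRUB_DISK:-/dev/sda}"
TARGET="${TARGET:-/media/RESCUE}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Activate LVM, mount the installation, then reinstall grub inside the chroot.
# The chain stops at the first failing step.
rescue_grub() {
    run vgscan &&
    run vgchange -ay &&
    run mkdir -p "$TARGET" &&
    run mount "$ROOT_LV" "$TARGET" &&
    run mount "$BOOT_PART" "$TARGET/boot" &&
    run mount -t proc proc "$TARGET/proc" &&
    run mount -t sysfs sys "$TARGET/sys" &&
    run mount -o bind /dev "$TARGET/dev" &&
    run mount -o bind /run "$TARGET/run" &&
    run chroot "$TARGET" update-grub &&
    run chroot "$TARGET" grub-install "$GRUB_DISK"
}
```

Call rescue_grub once to review the printed commands, then set DRY_RUN=0 and call it again to execute them.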
Credit: https://www.nerdoncoffee.com/operating-systems/re-install-grub-on-proxmox/
Recovering from grub "disk not found" error when booting from LVM
This section applies to the following setups:
- PVE 7.4 (or earlier) hosts with their boot disk on LVM
- PVE 8 hosts that have their boot disk on LVM, boot in UEFI mode and were upgraded from PVE 7
In these setups, the host might end up in a state in which grub fails to boot and prints an error disk `lvmid/<vg uuid>/<lv uuid>` not found. An example (of course, the UUIDs vary):
Welcome to GRUB!
error: disk `lvmid/p3y5O2-jync-R2Ao-Gtlj-It3j-FZXE-ipEDYG/bApewq-qSRB-zYqT-mzvP-pGiV-VQaf-di4Rcz` not found.
grub rescue>
This "disk `...` not found" error is caused by a grub bug (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=987008). LVM metadata is stored on-disk in a ring buffer, so occasionally the current metadata will wrap around the end of the ring buffer. When such a wraparound is present, grub fails to parse the metadata and fails to boot with the above error.
The recommended steps differ between PVE 7.x and PVE 8.
PVE 7.x
This subsection applies to PVE 7.4 (or earlier) hosts with their boot disk on LVM.
PVE 7.4 ships grub 2.06-3~deb11u5, which is affected by the bug (though earlier versions may also be affected). This has also been reported multiple times in the forum, see https://forum.proxmox.com/threads/98761/ and https://forum.proxmox.com/threads/123512/.
Temporary Workaround
In order to temporarily work around this bug and get the host to a bootable state again, it is sufficient to trigger an LVM metadata update. The updated metadata will reside in one contiguous section of the metadata ring buffer, so no wraparound occurs anymore. grub will then be able to parse the metadata correctly and boot again.
One simple way to trigger an LVM metadata update is to create a small logical volume:
- Boot from a live USB/CD/DVD with LVM support, e.g. grml (https://grml.org/)
- Run vgscan
- Create a 4MB logical volume named grubtemp in the pve volume group: lvcreate -L 4M pve -n grubtemp
- Reboot. PVE should boot normally again.
- You can now remove the grubtemp volume: lvremove pve/grubtemp
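The LVM commands from the list above can be wrapped in two small functions, one for the live environment and one for after the reboot. This is only a sketch: the function names and the VG and TMP_LV variables are made up for illustration, and the volume group name pve is the default from this page, which may differ on your host.

```shell
#!/bin/sh
# Volume group and temporary LV name as used in the steps above.
VG="${VG:-pve}"
TMP_LV="grubtemp"

# Run from the live system: rescan volume groups, then create a 4M LV.
# Creating the LV forces LVM to write a fresh copy of its metadata.
create_temp_lv() {
    vgscan && lvcreate -L 4M "$VG" -n "$TMP_LV"
}

# Run after PVE has booted normally again: remove the temporary LV.
# lvremove asks for confirmation before removing an active volume.
remove_temp_lv() {
    lvremove "$VG/$TMP_LV"
}
```

Splitting creation and removal mirrors the list: the reboot happens between the two calls.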
Note that there are many other ways to trigger a metadata update, e.g. extending an existing logical volume with lvextend or adding a tag to an existing logical volume with lvchange.
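As an illustration of the tag variant, the following sketch adds a tag to an existing logical volume and removes it again later. The function names, the tag name grubfix and the default volume pve/root are assumptions chosen for the example; any existing logical volume and any unused tag name should work the same way.

```shell
#!/bin/sh
# Add a tag to an existing LV; this rewrites the LVM metadata.
# $1: logical volume (default pve/root), $2: tag name (default grubfix, hypothetical)
add_metadata_tag() {
    lv="${1:-pve/root}"
    tag="${2:-grubfix}"
    lvchange --addtag "$tag" "$lv"
}

# Remove the tag again once the host boots normally.
remove_metadata_tag() {
    lv="${1:-pve/root}"
    tag="${2:-grubfix}"
    lvchange --deltag "$tag" "$lv"
}
```

Unlike the grubtemp approach, this does not allocate any space in the volume group.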
The workaround is only temporary: If the host is (re)booted at a time when there is again a wraparound in the metadata ring buffer, grub will fail to boot again.
On a running PVE system, you can check whether there is a wraparound in the metadata ring buffer using the following command:
vgscan -vvv 2>&1 | grep "Reading metadata"
If the output lines end with (+0), there is no wraparound. If they end with (+N) for any other number N, there is a wraparound and grub will most likely fail to boot after a reboot.
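The (+N) rule can also be checked mechanically. The following is a minimal sketch of a filter that reads the vgscan debug output on standard input; the function name has_wraparound is made up, and the pattern only relies on what is described above, namely that the relevant lines contain "Reading metadata" and end with (+N).

```shell
#!/bin/sh
# Reads `vgscan -vvv` output on stdin.
# Returns 0 if any "Reading metadata" line ends in (+N) with N other than 0,
# and non-zero otherwise.
has_wraparound() {
    grep "Reading metadata" |
        grep -E '\(\+[0-9]+\)[[:space:]]*$' |
        grep -Evq '\(\+0+\)[[:space:]]*$'
}

# Example use on a running PVE host:
#   if vgscan -vvv 2>&1 | has_wraparound; then
#       echo "wraparound present: grub may fail on the next boot"
#   fi
```

The last grep inverts the match on (+0), so the function succeeds only if at least one line reports a non-zero offset.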
Permanent Fix
The only permanent fix for PVE 7.x is:
- Apply the temporary workaround to be able to boot PVE again
- Upgrade to PVE 8 by following the upgrade guide.
PVE 8
This subsection applies to PVE 8 hosts that have their boot disk on LVM, boot in UEFI mode and were upgraded from PVE 7.
PVE 8 ships grub 2.06-13, in which the grub bug is fixed. However, on hosts that boot in UEFI mode and were upgraded from PVE 7, it can happen that the updated grub 2.06-13 EFI binary is not installed to the EFI system partition (ESP) at /boot/efi/EFI/proxmox/grubx64.efi. As a result, when booting in UEFI mode, the host still runs the older grub 2.06-3~deb11u5 binary that is affected by the grub bug. To find out whether this is the case, check its mtime using ls -l /boot/efi/EFI/proxmox/grubx64.efi. If it is older than the time of the upgrade from PVE 7 to 8, the host still runs the older grub binary when booting in UEFI mode.
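The mtime comparison can be scripted if you know roughly when the upgrade happened. The sketch below assumes GNU stat (as shipped with Debian) and takes the upgrade time as a Unix timestamp that you have to supply yourself; the function name grub_efi_status and its output words are made up for illustration.

```shell
#!/bin/sh
# $1: path to the EFI binary, $2: time of the PVE 7 to 8 upgrade as a Unix timestamp.
# Prints "outdated" if the file was last modified before the upgrade, else "updated".
grub_efi_status() {
    file="$1"
    upgrade_epoch="$2"
    mtime="$(stat -c %Y "$file")" || return 2   # %Y = mtime in seconds since the epoch
    if [ "$mtime" -lt "$upgrade_epoch" ]; then
        echo "outdated"
    else
        echo "updated"
    fi
}

# Example (the date is a placeholder for your own upgrade date):
#   grub_efi_status /boot/efi/EFI/proxmox/grubx64.efi "$(date -d '2023-10-01' +%s)"
```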
Temporary Workaround
The temporary workaround for PVE 8 to get the host in a bootable state is the same as for PVE 7.x (see above).
Permanent Fix
The issue can be fixed permanently on PVE 8 by installing the correct grub metapackage for UEFI and choosing the correct UEFI boot entry.
First, apply the temporary workaround to be able to boot into PVE 8 again. When booted into PVE 8, run the following command. It checks if the host is indeed booted in UEFI mode, and if yes, installs the correct grub metapackage for UEFI:
[ -d /sys/firmware/efi ] && apt install grub-efi-amd64
This will remove the grub-pc package and update the binary on the ESP. You can verify that the mtime of /boot/efi/EFI/proxmox/grubx64.efi was updated.
Note that this will not update the default EFI binary at /boot/efi/EFI/BOOT/BOOTx64.EFI, which might still be the grub binary that is affected by the bug. Consequently, make sure that you select the proxmox boot entry when booting in UEFI mode. If needed, you can adjust the boot order directly in the UEFI firmware or using the efibootmgr tool (see its manpage at https://manpages.debian.org/stable/efibootmgr/efibootmgr.8.en.html#Changing_the_boot_order).
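As a rough sketch of the efibootmgr route, the following functions read the output of a plain efibootmgr call and compute a boot order that puts the proxmox entry first. The function names are made up, and the parsing assumes the usual output layout with a "BootOrder: ..." line and entries of the form "Boot0001* proxmox"; check the output on your host before relying on it.

```shell
#!/bin/sh
# Reads `efibootmgr` output on stdin and prints the 4-digit number
# of the first entry whose label starts with "proxmox".
proxmox_bootnum() {
    sed -n 's/^Boot\([0-9A-Fa-f]\{4\}\)[* ] proxmox.*/\1/p' | head -n 1
}

# Reads `efibootmgr` output on stdin and prints the current boot order
# with the proxmox entry moved to the front. Fails if no such entry exists.
new_boot_order() {
    out="$(cat)"
    num="$(printf '%s\n' "$out" | proxmox_bootnum)"
    [ -n "$num" ] || return 1
    rest="$(printf '%s\n' "$out" | sed -n 's/^BootOrder: //p' |
        tr ',' '\n' | grep -vix "$num" | paste -sd, -)"
    printf '%s\n' "${num}${rest:+,$rest}"
}

# Example: review the proposed order first, then apply it with -o.
#   efibootmgr | new_boot_order
#   efibootmgr -o "$(efibootmgr | new_boot_order)"
```

Printing the proposed order before applying it keeps the change reviewable, since a wrong boot order can leave the host booting the affected binary again.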