[PVE-User] Boot disk corruption after Ceph OSD destroy with cleanup

Alwin Antreich a.antreich at proxmox.com
Fri Mar 22 15:04:31 CET 2019


On Fri, Mar 22, 2019 at 10:40:17AM +0100, Eneko Lacunza wrote:
> Hi,
> 
> > On 22/3/19 at 9:59, Alwin Antreich wrote:
> > On Fri, Mar 22, 2019 at 09:03:22AM +0100, Eneko Lacunza wrote:
> > > On 22/3/19 at 8:35, Alwin Antreich wrote:
> > > > On Thu, Mar 21, 2019 at 03:58:53PM +0100, Eneko Lacunza wrote:
> > > > > We have removed an OSD disk from a server in our office cluster, removing
> > > > > its partitions (with --cleanup 1), and that has made the server unable to
> > > > > boot (we have seen this on 2 servers in a row...)
> > > > > 
> > > > > Looking at the command output:
> > > > > 
> > > > > --- cut ---
> > > > > root at sanmarko:~# pveceph osd destroy 5 --cleanup 1
> > > > > destroy OSD osd.5
> > > > > Remove osd.5 from the CRUSH map
> > > > > Remove the osd.5 authentication key.
> > > > > Remove OSD osd.5
> > > > > Unmount OSD osd.5 from  /var/lib/ceph/osd/ceph-5
> > > > > remove partition /dev/sda1 (disk '/dev/sda', partnum 1)
> > > > > The operation has completed successfully.
> > > > > remove partition /dev/sdd7 (disk '/dev/sdd', partnum 7)
> > > > > Warning: The kernel is still using the old partition table.
> > > > > The new table will be used at the next reboot or after you
> > > > > run partprobe(8) or kpartx(8)
> > > > > The operation has completed successfully.
> > > > > wipe disk: /dev/sda
> > > > > 200+0 records in
> > > > > 200+0 records out
> > > > > 209715200 bytes (210 MB, 200 MiB) copied, 1.29266 s, 162 MB/s
> > > > > wipe disk: /dev/sdd
> > > > > 200+0 records in
> > > > > 200+0 records out
> > > > > 209715200 bytes (210 MB, 200 MiB) copied, 1.00753 s, 208 MB/s
> > > > > --- cut ---
> > > > > 
> > > > > The boot disk is the SSD; note that the script says it is wiping /dev/sdd!!
> > > > > Shouldn't it do that only to the journal partition (/dev/sdd7)?
> > > > > 
> > > > > This cluster is on PVE 5.3.
> > > > Can you please update? I suppose you don't have pve-manager version
> > > > 5.3-10 or newer installed yet; the issue has been fixed there.
> > > > 
> > > > But if you do and the issue still persists, then please post the output
> > > > of 'pveversion -v'.
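
A minimal, untested sketch of that version check and upgrade on a PVE 5.x
node, assuming the standard Proxmox repositories are configured:

--- cut ---
pveversion -v | grep pve-manager   # show the installed pve-manager version
apt update
apt list --upgradable              # check whether pve-manager >= 5.3-10 is offered
apt full-upgrade                   # pull in the fixed packages
--- cut ---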
> > > Seems both servers were on 5.3-8, thanks for the hint.
> > > 
> > > Maybe it would be helpful if you could publish some release notes for each
> > > package push made to pve-enterprise/pve-no-subscription (perhaps capturing
> > > the changed packages' changelogs?), so that this kind of (maybe corner-case,
> > > but) grave problem is better communicated when the fix isn't first released
> > > in a point release.
> > I am not quite sure what you mean by that, but each package ships with a
> > changelog; see pve-manager_5.3-11:
> > http://download.proxmox.com/debian/pve/dists/stretch/pve-no-subscription/binary-amd64/pve-manager_5.3-11.changelog
> > 
> > There is also the possibility to subscribe to bug reports and get
> > notifications. See the corresponding bug report for this issue:
> > https://bugzilla.proxmox.com/show_bug.cgi?id=2051
> Right now it isn't announced when a repository is updated, nor what
> changes/fixes/improvements have been included. (That's only done for point
> releases, not for new package uploads.)
For a point release, an ISO is generated and the release info is needed
for that.

The sheer volume of package updates makes a separate announcement of changes
impractical. The changelog shows what changed from one version to the next,
and with 'apt update' and 'apt list --upgradable' one can see which packages
have upgrades. And if needed, with a little bit of shell scripting you can
fetch all the changelogs directly from the repo server (see the sketch below).
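
A rough, untested sketch of what that could look like. The changelog URL
pattern is assumed from the pve-manager_5.3-11 link above; adjust the
distribution and repository to match your setup:

--- cut ---
# List upgradable packages, then fetch each package's changelog from the
# no-subscription repository.
REPO=http://download.proxmox.com/debian/pve/dists/stretch/pve-no-subscription/binary-amd64
apt update
apt list --upgradable 2>/dev/null | awk -F'[/ ]' '/\// {print $1 "_" $3}' |
while read pkg; do
    echo "=== $pkg ==="
    wget -q -O - "$REPO/${pkg}.changelog"
done
--- cut ---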


--
Cheers,
Alwin



