[PVE-User] Ceph jewel to luminous upgrade problem

Mon Nov 13 16:26:31 CET 2017

Hi all,

We're in the process of upgrading our office Proxmox v4.4 cluster to v5.1.

As a first step, we followed the instructions at
https://pve.proxmox.com/wiki/Ceph_Jewel_to_Luminous

to upgrade Ceph from Jewel to Luminous.

The upgrade apparently succeeded:
# ceph -s
   cluster:
     id:     8ee074d4-005c-4bd6-a077-85eddde543b5
     health: HEALTH_OK

   services:
     mon: 3 daemons, quorum 0,2,3
     mgr: butroe(active), standbys: guadalupe, sanmarko
     osd: 12 osds: 12 up, 12 in

   data:
     pools:   2 pools, 640 pgs
     objects: 518k objects, 1966 GB
     usage:   4120 GB used, 7052 GB / 11172 GB avail
     pgs:     640 active+clean

   io:
     client:   644 kB/s rd, 3299 kB/s wr, 61 op/s rd, 166 op/s wr

The versions look right too:
# ceph mon versions
{
     "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 3
}
# ceph osd versions
{
     "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 12
}

But this weekend there were problems backing up some VMs, all with the
same error:
no such volume 'ceph-proxmox:vm-120-disk-1'

The "missing" volumes don't show up in the storage content view, but they
DO appear in the output of "rbd -p proxmox ls".

When we run an info command on one of them, though, we get an error:
# rbd -p proxmox info vm-120-disk-1
2017-11-13 16:04:02.979006 7f99d8ff9700 -1 librbd::image::OpenRequest: failed to retreive immutable metadata: (2) No such file or directory
rbd: error opening image vm-120-disk-1: (2) No such file or directory

Other VM disk images behave normally:
# rbd -p proxmox info vm-119-disk-1
rbd image 'vm-119-disk-1':
     size 3072 MB in 768 objects
     order 22 (4096 kB objects)
     block_name_prefix: rbd_data.575762ae8944a
     format: 2
     features: layering
     flags:
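
To find every affected image rather than testing them one by one, the
two commands above (ls and info) can be combined. The following is only
a minimal sketch: the function name list_unopenable_images and the RBD
override variable are made up for illustration, and it assumes that
"rbd info" exits with a non-zero status whenever it cannot open an
image, as it does in the failing case shown above.

```shell
# Print the names of images that "rbd ls" lists but "rbd info" cannot open.
# $1 is the pool name. RBD is a hypothetical override for the rbd binary,
# useful for testing; it defaults to plain "rbd".
list_unopenable_images() {
    pool=$1
    "${RBD:-rbd}" -p "$pool" ls | while IFS= read -r img; do
        # Discard the output; only the exit status matters here.
        # stdin is redirected so the command cannot eat the image list.
        if ! "${RBD:-rbd}" -p "$pool" info "$img" </dev/null >/dev/null 2>&1; then
            printf '%s\n' "$img"
        fi
    done
}
```

Running "list_unopenable_images proxmox" on one of the nodes should then
print one image name per line for each disk that is in the same state as
vm-120-disk-1.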

I don't really know where to look next to diagnose this. I recall
that rbd had an older image format (version 1), but I doubt the "missing"
disk images use it (and I really don't know how to check that when info
doesn't work...)
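
One possible way to check the format without "rbd info" is to look at
the RADOS objects in the pool directly: a format 1 image keeps its
header in an object named "<image>.rbd", while a format 2 image has an
"rbd_id.<image>" object. The sketch below classifies object names on
that basis. The function name classify_rbd_headers is made up for
illustration, and it only inspects names, so it says nothing about
whether the header contents are intact.

```shell
# Read RADOS object names on stdin and print "<image> format1" or
# "<image> format2" for every object that looks like an RBD image header.
# All other objects (data objects, rbd_directory, rbd_header.*) are ignored.
classify_rbd_headers() {
    while IFS= read -r obj; do
        case "$obj" in
            rbd_id.*) printf '%s format2\n' "${obj#rbd_id.}" ;;
            *.rbd)    printf '%s format1\n' "${obj%.rbd}" ;;
        esac
    done
}
```

It could be fed from the pool listing, for example
"rados -p proxmox ls | classify_rbd_headers | grep '^vm-120-disk-1 '".
If nothing is printed for an image, neither kind of header object was
found under that name.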

Some of the "missing" disk images are still in use by "old" (already
running) qemu processes, and those VMs work correctly; but if we stop
such a VM, it fails to start again with the error reported above. VMs
whose disk images are not "missing" can be stopped and started normally.

Any hints about what to try next?

OSDs are filestore with XFS (created from GUI).

# pveversion -v
proxmox-ve: 4.4-96 (running kernel: 4.4.83-1-pve)
pve-manager: 4.4-18 (running version: 4.4-18/ef2610e8)
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.4.76-1-pve: 4.4.76-94
pve-kernel-4.4.83-1-pve: 4.4.83-96
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0.1-1
pve-cluster: 4.0-53
qemu-server: 4.0-113
pve-firmware: 1.1-11
libpve-common-perl: 4.0-96
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.9.0-5~pve4
pve-container: 1.0-101
pve-firewall: 2.0-33
pve-ha-manager: 1.0-41
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
ceph: 12.2.1-1~bpo80+1

Thanks a lot
Eneko

-- 
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es