[PVE-User] Ceph: Some trouble creating OSD with journal on a software raid device...

Marco Gaiarin gaio at sv.lnf.it
Thu Oct 13 12:13:18 CEST 2016


I'm a bit confused.

I'm trying to create 4 OSDs on a server where the OS resides on a
RAID-1. On the same (pair of) disks there are 4 50MB partitions for
the journals (the two disks are SSDs).
Better shown with a command:

 root at vedovanera:~# blkid 
 /dev/sdf1: UUID="75103d23-83a6-9f5d-eb1e-f021e729041b" UUID_SUB="70aa73ab-585c-5df1-dfef-bbc847766504" LABEL="vedovanera:0" TYPE="linux_raid_member" PARTUUID="180187f1-01"
 /dev/sdf2: UUID="e21df9d5-3230-f991-d70d-f948704a7594" UUID_SUB="fe105e88-252a-97ab-d543-c2c2a89499d0" LABEL="vedovanera:1" TYPE="linux_raid_member" PARTUUID="180187f1-02"
 /dev/sdf5: UUID="ba35e389-814d-dc29-8818-c9e86d9d8f08" UUID_SUB="66104965-2a53-c142-43fb-1bc35f66bf41" LABEL="vedovanera:2" TYPE="linux_raid_member" PARTUUID="180187f1-05"
 /dev/sdf6: UUID="90778432-b426-51e9-b0a2-48d76ef24364" UUID_SUB="00db7dd7-f0fd-ea54-e52f-0a8725ed7866" LABEL="vedovanera:3" TYPE="linux_raid_member" PARTUUID="180187f1-06"
 /dev/sdf7: UUID="09be7173-4edc-1e14-5e06-dfdcd677943c" UUID_SUB="876a79e0-be59-6153-cede-97aefcdec849" LABEL="vedovanera:4" TYPE="linux_raid_member" PARTUUID="180187f1-07"
 /dev/sdf8: UUID="fd54393a-2969-7f9f-8e29-f4120dc4ab00" UUID_SUB="d576b4c8-dfc5-8ddd-25f2-9b0da0c7241c" LABEL="vedovanera:5" TYPE="linux_raid_member" PARTUUID="180187f1-08"
 /dev/sda: PTUUID="cf6dccb4-4f6f-472a-9f1a-5945de4f1703" PTTYPE="gpt"
 /dev/sdc: PTUUID="3ecc2e48-b12d-4cb1-add8-87f0e611b7e8" PTTYPE="gpt"
 /dev/sde1: UUID="75103d23-83a6-9f5d-eb1e-f021e729041b" UUID_SUB="ab4416c0-a715-ef87-466a-6a58096eb2b9" LABEL="vedovanera:0" TYPE="linux_raid_member" PARTUUID="03210f34-01"
 /dev/sde2: UUID="e21df9d5-3230-f991-d70d-f948704a7594" UUID_SUB="2355caea-4102-7269-38be-22779790c388" LABEL="vedovanera:1" TYPE="linux_raid_member" PARTUUID="03210f34-02"
 /dev/sde5: UUID="ba35e389-814d-dc29-8818-c9e86d9d8f08" UUID_SUB="b3211065-8c5d-3fa5-8f57-2a50ef461a34" LABEL="vedovanera:2" TYPE="linux_raid_member" PARTUUID="03210f34-05"
 /dev/sde6: UUID="90778432-b426-51e9-b0a2-48d76ef24364" UUID_SUB="296a78cf-0e97-62f6-d136-cefb9abffa3e" LABEL="vedovanera:3" TYPE="linux_raid_member" PARTUUID="03210f34-06"
 /dev/sde7: UUID="09be7173-4edc-1e14-5e06-dfdcd677943c" UUID_SUB="36667e33-d801-c114-cb59-8770b66fc98d" LABEL="vedovanera:4" TYPE="linux_raid_member" PARTUUID="03210f34-07"
 /dev/sde8: UUID="fd54393a-2969-7f9f-8e29-f4120dc4ab00" UUID_SUB="b5eac45a-2693-e2c5-3e00-2b8a33658a00" LABEL="vedovanera:5" TYPE="linux_raid_member" PARTUUID="03210f34-08"
 /dev/sdb: PTUUID="000e025c" PTTYPE="dos"
 /dev/md0: UUID="a751e134-b3ed-450c-b694-664d80f07c68" TYPE="ext4"
 /dev/sdd: PTUUID="000b1250" PTTYPE="dos"
 /dev/md1: UUID="8bd0c899-0317-4d20-a781-ff662e92b0b1" TYPE="swap"
 /dev/md2: PTUUID="a7eb14f0-d2f9-4552-8e2d-b5165e654ea8" PTTYPE="gpt"
 /dev/md3: PTUUID="ba4073c3-fab2-41e9-9612-28d28ae6468d" PTTYPE="gpt"
 /dev/md4: PTUUID="c3dfbbfa-28da-4bc8-88fd-b49785e7e212" PTTYPE="gpt"
 /dev/md5: PTUUID="c616bbf8-41f0-4e62-b77f-b0e8eeb624e2" PTTYPE="gpt"

'md0' is /, 'md1' is the swap, md2-md5 are the journal partitions, and
sda-sdd are the disks for the OSDs.
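
For completeness, the RAID layout can be cross-checked with something
like this (output omitted here):

 root at vedovanera:~# cat /proc/mdstat            # assembled arrays and their member disks
 root at vedovanera:~# mdadm --detail /dev/md2     # detail on one of the journal arrays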


Proxmox correctly sees the 4 OSD candidate disks, but does not see the
journal partitions, so I've used the command line:

 root at vedovanera:~# pveceph createosd /dev/sda --journal_dev /dev/md2
 command '/sbin/zpool list -HPLv' failed: open3: exec of /sbin/zpool list -HPLv failed at /usr/share/perl5/PVE/Tools.pm line 409.
 
 create OSD on /dev/sda (xfs)
 using device '/dev/md2' for journal
 Caution: invalid backup GPT header, but valid main header; regenerating
 backup header from main header.
 
 ****************************************************************************
 Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
 verification and recovery are STRONGLY recommended.
 ****************************************************************************
 GPT data structures destroyed! You may now partition the disk using fdisk or
 other utilities.
 Creating new GPT entries.
 The operation has completed successfully.
 WARNING:ceph-disk:OSD will not be hot-swappable if journal is not the same device as the osd data
 Setting name!
 partNum is 0
 REALLY setting name!
 The operation has completed successfully.
 Setting name!
 partNum is 0
 REALLY setting name!
 The operation has completed successfully.
 meta-data=/dev/sda1              isize=2048   agcount=4, agsize=122094597 blks
          =                       sectsz=4096  attr=2, projid32bit=1
          =                       crc=0        finobt=0
 data     =                       bsize=4096   blocks=488378385, imaxpct=5
          =                       sunit=0      swidth=0 blks
 naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
 log      =internal log           bsize=4096   blocks=238466, version=2
          =                       sectsz=4096  sunit=1 blks, lazy-count=1
 realtime =none                   extsz=4096   blocks=0, rtextents=0
 Warning: The kernel is still using the old partition table.
 The new table will be used at the next reboot.
 The operation has completed successfully.

So it seems all went well... but the OSD does not show up in the web
interface, although it seems to be ''counted'' (I have 2 OSDs working on
another server):

 root at vedovanera:~# ceph -s
    cluster 8794c124-c2ec-4e81-8631-742992159bd6
     health HEALTH_WARN
            64 pgs degraded
            64 pgs stale
            64 pgs stuck degraded
            64 pgs stuck stale
            64 pgs stuck unclean
            64 pgs stuck undersized
            64 pgs undersized
            noout flag(s) set
     monmap e2: 2 mons at {0=10.27.251.7:6789/0,1=10.27.251.8:6789/0}
            election epoch 6, quorum 0,1 0,1
     osdmap e29: 3 osds: 2 up, 2 in
            flags noout
      pgmap v42: 64 pgs, 1 pools, 0 bytes data, 0 objects
            67200 kB used, 3724 GB / 3724 GB avail
                  64 stale+active+undersized+degraded
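
Presumably 'ceph osd tree' would show whether osd.2 got registered in the
CRUSH map but is simply down/out; something like (output not kept):

 root at vedovanera:~# ceph osd tree    # list OSDs with host, weight and up/down in/out state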

The previous command also created a partition (of 5GB) on md2:

 root at vedovanera:~# blkid | grep md2
 /dev/md2: PTUUID="a7eb14f0-d2f9-4552-8e2d-b5165e654ea8" PTTYPE="gpt"
 /dev/md2p1: PARTLABEL="ceph journal" PARTUUID="d1ccfdb2-539e-4e6a-ad60-be100304832b"
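
The size can be cross-checked by printing the partition table on md2,
something like:

 root at vedovanera:~# sgdisk -p /dev/md2    # print the GPT table, including partition sizes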

Now, if I destroy the OSD:

 root at vedovanera:~# pveceph destroyosd 2
 destroy OSD osd.2
 /etc/init.d/ceph: osd.2 not found (/etc/pve/ceph.conf defines mon.0 mon.1, /var/lib/ceph defines )
 command 'setsid service ceph -c /etc/pve/ceph.conf stop osd.2' failed: exit code 1
 Remove osd.2 from the CRUSH map
 Remove the osd.2 authentication key.
 Remove OSD osd.2
 Unmount OSD osd.2 from  /var/lib/ceph/osd/ceph-2
 umount: /var/lib/ceph/osd/ceph-2: mountpoint not found
 command 'umount /var/lib/ceph/osd/ceph-2' failed: exit code 32

delete the /dev/md2p1 partition, recreate it (type Linux) at 50GB, zap the
sda disk, and redo the OSD creation, it works, with some strange ''warnings''.
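
The manual repartitioning was roughly along these lines (partition number
and exact flags from memory, so treat it as a sketch):

 root at vedovanera:~# sgdisk --delete=1 /dev/md2                         # drop the 5GB ''ceph journal'' partition
 root at vedovanera:~# sgdisk --new=1:0:+50G --typecode=1:8300 /dev/md2   # recreate it as a plain 50GB Linux partition
 root at vedovanera:~# partprobe /dev/md2                                 # make the kernel re-read the table
 root at vedovanera:~# ceph-disk zap /dev/sda                             # wipe the partition table on the OSD disk

Then, redoing the OSD creation: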

 root at vedovanera:~# pveceph createosd /dev/sda --journal_dev /dev/md2p1
 command '/sbin/zpool list -HPLv' failed: open3: exec of /sbin/zpool list -HPLv failed at /usr/share/perl5/PVE/Tools.pm line 409.
 
 create OSD on /dev/sda (xfs)
 using device '/dev/md2p1' for journal
 Caution: invalid backup GPT header, but valid main header; regenerating
 backup header from main header.
 
 ****************************************************************************
 Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
 verification and recovery are STRONGLY recommended.
 ****************************************************************************
 GPT data structures destroyed! You may now partition the disk using fdisk or
 other utilities.
 Creating new GPT entries.
 The operation has completed successfully.
 WARNING:ceph-disk:OSD will not be hot-swappable if journal is not the same device as the osd data
 WARNING:ceph-disk:Journal /dev/md2p1 was not prepared with ceph-disk. Symlinking directly.
 Setting name!
 partNum is 0
 REALLY setting name!
 The operation has completed successfully.
 meta-data=/dev/sda1              isize=2048   agcount=4, agsize=122094597 blks
          =                       sectsz=4096  attr=2, projid32bit=1
          =                       crc=0        finobt=0
 data     =                       bsize=4096   blocks=488378385, imaxpct=5
          =                       sunit=0      swidth=0 blks
 naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
 log      =internal log           bsize=4096   blocks=238466, version=2
          =                       sectsz=4096  sunit=1 blks, lazy-count=1
 realtime =none                   extsz=4096   blocks=0, rtextents=0
 Warning: The kernel is still using the old partition table.
 The new table will be used at the next reboot.
 The operation has completed successfully.

Now the OSD shows up in the PVE web interface, and seems to work as expected.
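
As a sanity check, the OSD's journal link should now point at the md
partition; something like:

 root at vedovanera:~# ls -l /var/lib/ceph/osd/ceph-2/journal         # should be a symlink to the journal device
 root at vedovanera:~# readlink -f /var/lib/ceph/osd/ceph-2/journal   # should resolve to /dev/md2p1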

I've also tried to ''reformat'' the journal, i.e. stop the OSD, flush and recreate it:

 root at vedovanera:~# ceph-osd -i 2 --flush-journal
  HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
  HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 2016-10-13 12:06:14.209250 7ffb7c596880 -1 flushed journal /var/lib/ceph/osd/ceph-2/journal for object store /var/lib/ceph/osd/ceph-2
 root at vedovanera:~# ceph-osd -i 2 --mkjournal
  HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 2016-10-13 12:06:45.034323 7f774cef7880 -1 created new journal /var/lib/ceph/osd/ceph-2/journal for object store /var/lib/ceph/osd/ceph-2

The OSD restarts correctly, but I'm still in doubt whether I'm doing
something wrong...
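
For the record, the stop/start around the flush was done with the usual
init script, roughly:

 root at vedovanera:~# service ceph -c /etc/pve/ceph.conf stop osd.2    # stop the OSD before flushing the journal
 root at vedovanera:~# service ceph -c /etc/pve/ceph.conf start osd.2   # restart it on the recreated journal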


Thanks.

-- 
dott. Marco Gaiarin				        GNUPG Key ID: 240A3D66
  Associazione ``La Nostra Famiglia''          http://www.lanostrafamiglia.it/
  Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
  marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797

		Donate your 5 PER MILLE to LA NOSTRA FAMIGLIA!
    http://www.lanostrafamiglia.it/25/index.php/component/k2/item/123
	(tax code 00307430132, category ONLUS or RICERCA SANITARIA)


