[PVE-User] sharing zfs experience

Tonči Stipičević tonci at suma-informatika.hr
Tue Jul 9 12:34:03 CEST 2019


Stoiko hi ,

thank you for your reply

Now I'm even more worried after reading the recent thread you sent me 
:(  ...  I'm not sure anymore what to expect after the next reboot :)

So the question is: how do I avoid such scenarios in the future? ...

My pool seems to be fully correct ... zpool status shows no errors at 
all. I think something went wrong at the container level ... How come 
disk-1 survived and disk-0 did not?
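
Just for reference, this is roughly how I would check whether the 
disk-0 dataset was really mounted over its mountpoint, or whether the 
container only saw an empty directory underneath (a sketch, assuming 
standard ZFS tooling; the dataset name is the one from my zfs list 
output quoted below):

  zfs get mounted,mountpoint rpool/data/subvol-104-disk-0
  zfs mount rpool/data/subvol-104-disk-0   # only if "mounted" says no
  ls -al /rpool/data/subvol-104-disk-0/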

I can send some reports like the syslog or something, so please just 
tell me which ones?
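
In the meantime, here is the kind of output I could collect right away 
(a sketch; I'm assuming the usual systemd/ZFS units that ship with 
Proxmox):

  journalctl -b | grep -iE 'zfs|zpool'
  journalctl -b -u zfs-import-cache.service -u zfs-mount.service
  zpool status -v rpool
  zpool history rpool | tail -n 50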


Thank you very much in advance


srdačan pozdrav / best regards

Tonči Stipičević, dipl. ing. elektr.
direktor / manager

d.o.o. / ltd.

podrška / upravljanje IT sustavima za male i srednje tvrtke
Small & Medium Business IT support / management

Badalićeva 27 / 10000 Zagreb / Hrvatska – Croatia
url: www.suma-informatika.hr
mob: +385 91 1234003
fax: +385 1 5560007

On 08. 07. 2019. 20:19, Stoiko Ivanov wrote:
> hi,
>
> On Mon, 8 Jul 2019 18:50:07 +0200
> Tonči Stipičević <tonci at suma-informatika.hr> wrote:
>
>> Hi to all,
>>
>> A customer of mine runs two clusters:
>>
>> 1. a 2-node cluster with an IBM V370 SAN as shared storage (shared LVM)
>>
>> 2. a 3-node cluster, all nodes running ZFS ... no shared storage
>>
>>
>> A couple of days ago he had a power outage, and during that period of
>> time I was kind of worried about how apcupsd & Proxmox would handle
>> the situation.
>>
>> 1. Both nodes were properly shut down, but one of the two died,
>> independently of the power outage :) just at the same time. I booted
>> up the remaining node, adjusted the "votes" and started all VMs
>> residing on the shared LVM storage ... No further questions ... Proxmox
>> handled that correctly.
>>
>> 2. All 3 nodes started up, but the most important LXC container could
>> not start.
>>
>> Reason: Job for pve-container@104.service failed because the control
>> process exited with error code. See "systemctl status
>> pve-container@104.service" and "journalctl -xe" for details. TASK
>> ERROR: command 'systemctl start pve-container@104' failed: exit code 1
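>>
>> For the record, the usual next step here is something like starting
>> the container in the foreground with debug logging (the log path is
>> only an example):
>>
>> lxc-start -n 104 -F -l DEBUG -o /tmp/lxc-104.log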
>>
>> Upgrading, restarting etc. did not help at all. The problem was that
>> the rootfs of this container was completely empty (it contained only
>> the /dev/ and /mnt/ dirs). Fortunately the second mount point (aka the
>> 2nd disk) with 2T of data was perfectly healthy and visible. So one
>> option was to restore it from backup, but the zfs list command showed
>> that the dataset still holds as much data as it should (disk-0):
> This somehow reminds me of a recent thread in the forum:
> https://forum.proxmox.com/threads/reboot-of-pve-host-breaks-lxc-container-startup.55486/#post-255641
>
> did the rpool get imported completely - or are there some errors in the
> journal while the system booted?
>
> In any case - glad you managed to resolve the issue!
>
>
>> root@pve01-hrz-zm:~# ls -al /rpool/data/subvol-104-disk-0/
>> total 10
>> drwxr-xr-x 4 root root 4 Srp  4 14:07 .
>> drwxr-xr-x 9 root root 9 Srp  4 23:17 ..
>> drwxr-xr-x 2 root root 2 Srp  4 14:07 dev
>> drwxr-xr-x 3 root root 3 Srp  4 14:07 mnt
>>
>> root@pve01-hrz-zm:~# zfs list
>> NAME                            USED  AVAIL  REFER  MOUNTPOINT
>> rpool                          2,15T  1,36T   104K  /rpool
>> rpool/data                     2,15T  1,36T   128K  /rpool/data
>> rpool/data/subvol-104-disk-0    751M  15,3G   751M  /rpool/data/subvol-104-disk-0
>> rpool/data/subvol-104-disk-1   2,15T   894G  2,15T  /rpool/data/subvol-104-disk-1
>>
>>
>> Interestingly, both LXC containers on this node had an "empty" disk-0
>> (though the other one was not that big, it had only disk-0) and
>> neither of them could start.
>>
>> After many tries I decided to migrate this little container to another
>> node just to see what would happen: the migration was successful, and
>> so was starting it up. OK (true relief, finally :). Then I tried to
>> make a backup of this VM, again just to see what would happen. No, the
>> backup was not successful ... the backup archive was only 1.7 KB. OK,
>> let's get back to the migration scenario. So, the final conclusion was
>> that the migration itself was not the solution; the snapshot was the
>> right one. The snapshot was the step that revived this disk-0.
>>
>> So, in the end I just made a snapshot of 104-disk-0, cloned it right
>> back to 1044-disk-0 and then just changed the reference in the LXC
>> configuration. After that the container started successfully.
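>>
>> In ZFS terms that boiled down to something like the following (the
>> snapshot name and the storage ID in the config line are only examples,
>> not literally what I typed):
>>
>> zfs snapshot rpool/data/subvol-104-disk-0@revive
>> zfs clone rpool/data/subvol-104-disk-0@revive rpool/data/subvol-1044-disk-0
>> # then point the rootfs line in /etc/pve/lxc/104.conf at the clone, e.g.:
>> # rootfs: local-zfs:subvol-1044-disk-0,size=16G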
>>
>>
>> I'm really wondering why this happened, but I'm also very happy that
>> the simple steps above saved my day.
>>
>> Hopefully this information helps somebody who runs into the same
>> problem, but at the same time I truly hope that it won't happen :)
>>
>>
>> BR
>>
>> Tonci Stipicevic
>>
>>
>>
>>
>> _______________________________________________
>> pve-user mailing list
>> pve-user at pve.proxmox.com
>> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user


