[PVE-User] Proxmox Kernel / Ceph Integration

Thomas Lamprecht t.lamprecht at proxmox.com
Fri Jul 27 11:12:24 CEST 2018


Hi,

On 07/27/2018 at 11:02 AM, Marcus Haarmann wrote:
> Hi experts,
> 
> we are using a Proxmox cluster with an underlying ceph storage.
> Versions are pve 5.2-2 with kernel 4.15.18-1-pve and ceph luminous 12.2.5
> We are running a couple of VMs and also containers there.
> 3 virtual NICs (as bond balance-alb); ceph uses 2 bonded 10 GBit interfaces (public/cluster separated)
> 
> During the nightly backup, the backup stalls. At the same time, we get lots of messages like these in dmesg:
> [137612.371311] libceph: mon0 192.168.16.31:6789 session established
> [137643.090541] libceph: mon0 192.168.16.31:6789 session lost, hunting for new mon
> [137643.091383] libceph: mon1 192.168.16.32:6789 session established
> [137673.810526] libceph: mon1 192.168.16.32:6789 session lost, hunting for new mon
> [137673.811388] libceph: mon2 192.168.16.34:6789 session established
> [137704.530567] libceph: mon2 192.168.16.34:6789 session lost, hunting for new mon
> [137704.531363] libceph: mon0 192.168.16.31:6789 session established
> [137735.250593] libceph: mon0 192.168.16.31:6789 session lost, hunting for new mon
> [137735.251352] libceph: mon1 192.168.16.32:6789 session established
> [137765.970608] libceph: mon1 192.168.16.32:6789 session lost, hunting for new mon
> [137765.971544] libceph: mon0 192.168.16.31:6789 session established
> [137796.690605] libceph: mon0 192.168.16.31:6789 session lost, hunting for new mon
> [137796.691412] libceph: mon1 192.168.16.32:6789 session established
> 
> We have been searching for the cause for a while, since the blocked backup is not easy to recover from (unblocking
> does not help; we can only stop and migrate to a different server, since the rbd device seems to block).
> It seems to be related to the ceph messages.
> We found the following patch related to these messages (which may lead to a blocking state in kernel):
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7b4c443d139f1d2b5570da475f7a9cbcef86740c
> 
> We have tried to patch the kernel ourselves, but this was not successful.
> 

Was porting the patch not successful, or did the patch not work as
expected?

> Although I presume the real error situation is related to a network problem, it would be nice to have an
> official backport of this patch in the pve kernel.
> Can anybody do that? (It is only one line of code.)
> 

A single line can also wreak havoc just fine ;-)
This one seems/sounds harmless, regression-wise, but it would be
really good to first know if the patch addresses the issue at all.
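
One way to judge whether a patched kernel changes anything is to measure
how long each monitor session lasts before it is lost, and to compare
those numbers before and after. The following is only a minimal sketch
in Python: the function name session_durations and the regular
expression are illustrative assumptions, and the sample lines are copied
from the dmesg excerpt quoted above.

```python
import re

# Matches libceph kernel log lines such as:
# [137643.090541] libceph: mon0 192.168.16.31:6789 session lost, hunting for new mon
# search() is used so that a leading "> " quote prefix does not matter.
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+libceph:\s+(?P<mon>mon\d+)\s+\S+\s+"
    r"session\s+(?P<event>established|lost)"
)


def session_durations(lines):
    """Return (mon, seconds) for every session that was established and then lost."""
    established = {}  # mon name -> timestamp of the last "session established"
    durations = []
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        ts = float(match.group("ts"))
        mon = match.group("mon")
        if match.group("event") == "established":
            established[mon] = ts
        elif mon in established:
            durations.append((mon, round(ts - established.pop(mon), 3)))
    return durations


# Sample lines taken from the dmesg output quoted in this thread.
sample = [
    "[137612.371311] libceph: mon0 192.168.16.31:6789 session established",
    "[137643.090541] libceph: mon0 192.168.16.31:6789 session lost, hunting for new mon",
    "[137643.091383] libceph: mon1 192.168.16.32:6789 session established",
    "[137673.810526] libceph: mon1 192.168.16.32:6789 session lost, hunting for new mon",
]

for mon, seconds in session_durations(sample):
    print(f"{mon}: session lasted {seconds:.3f} s")
```

In the quoted log every session is dropped after roughly the same
interval, so a change in that pattern (or its disappearance) after
booting a patched kernel would be a useful data point.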

> We are trying to modify the bonding mode because the network connection seems to be unstable,
> maybe this solves the issue.
> 

Sounds like it's worth a shot, if you already know that the network may
not be fully stable, as you may want to do something about that sooner
or later anyway.
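
When changing the bonding mode, it can help to record the mode and the
per-slave link failure counters before and after. The Linux bonding
driver exposes these under /proc/net/bonding/<bond>. Below is a rough
sketch that parses that text; the helper name parse_bonding, the bond
name bond0 and the sample content are assumptions for illustration, not
data from the cluster discussed here.

```python
def parse_bonding(text):
    """Parse /proc/net/bonding/<bond> style text into the mode plus per-slave stats."""
    info = {"mode": None, "slaves": []}
    current = None  # the slave section currently being read, if any
    for raw in text.splitlines():
        key, sep, value = raw.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            info["mode"] = value
        elif key == "Slave Interface":
            current = {"name": value, "mii_status": None, "link_failures": None}
            info["slaves"].append(current)
        elif current is not None and key == "MII Status":
            current["mii_status"] = value
        elif current is not None and key == "Link Failure Count":
            current["link_failures"] = int(value)
    return info


# Illustrative sample only (hypothetical interface names and counters).
# On a real host the text would come from, for example:
#     open("/proc/net/bonding/bond0").read()
SAMPLE = """\
Bonding Mode: adaptive load balancing
MII Status: up

Slave Interface: eth0
MII Status: up
Link Failure Count: 0

Slave Interface: eth1
MII Status: up
Link Failure Count: 3
"""

status = parse_bonding(SAMPLE)
print("mode:", status["mode"])
for slave in status["slaves"]:
    print(slave["name"], slave["mii_status"], "failures:", slave["link_failures"])
```

A link failure counter that keeps growing on one slave would point at
that link (cable, port or NIC) rather than at the bonding mode itself.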

cheers,
Thomas