[PVE-User] lots of 'heartbeat_check: no reply from ...' in the logs

mj lists at merit.unu.edu
Fri Feb 8 09:07:09 CET 2019


Hi Alwin,

Thanks for your reply! Appreciated.

> These messages are not necessarily caused by a network issue. It might
> well be that the daemon osd.18 can not react to heartbeat messages.

The thing is: the two OSDs are on the same host. I checked 
ceph-osd.18.log, and it contains just regular ceph stuff, nothing 
special.

I noticed that on host pm2 there are multiple kworker PIDs running at 
100% CPU utilisation. Also, swap usage is at 100%, while regular RAM 
usage (as reported by the Proxmox GUI) is only 54%.

No idea what to make of that...
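
For reference, this is roughly what I'm using to watch the kworker/swap 
situation, nothing exotic (<pid> being one of the busy kworker threads):

   free -h                                   # overall RAM/swap picture
   swapon --show                             # which device is swapping
   vmstat 1 5                                # si/so columns show swap-in/out activity
   ps -eo pid,stat,pcpu,comm | grep kworker  # the busy kworker threads
   cat /proc/<pid>/stack                     # kernel stack of one busy kworker, as root

perf top would probably also show where the kworkers spend their time, 
if it is installed.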

> Check the logs on the host of osd.18.

Here they are:

> 2019-02-08 08:44:01.953390 7f6dc08b4700  1 leveldb: Level-0 table #1432303: started
> 2019-02-08 08:44:02.108622 7f6dc08b4700  1 leveldb: Level-0 table #1432303: 1299359 bytes OK
> 2019-02-08 08:44:02.181135 7f6dc08b4700  1 leveldb: Delete type=0 #1432295

Also, ceph-mon.1.log contains nothing special, just the regular stuff.

> The cluster is doing scrubbing too, this is an intensive operation and
> taxes your OSDs. This intensifies the issue. But in general, you need to
> find out what caused the slow requests. Ceph is able to throttle and
> tries to get IOs done, even under pressure.

Yes, I unset the noscrub and nodeep-scrub flags again after the issues 
of yesterday morning were resolved. The system has been HEALTH_OK for 
24 hours now, with no issues (except for the worrying log lines 
appearing every second).
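
To be precise about what I mean by unsetting them again, it was roughly:

   ceph osd unset noscrub         # re-enable normal scrubbing
   ceph osd unset nodeep-scrub    # re-enable deep scrubbing

The corresponding "ceph osd set ..." commands are what I had used while 
the problems were ongoing.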

> If you describe your system further (eg. osd tree, crush map, system
> specs) then we may be able to point you in the right direction. ;)

Here you go:

> root@pm2:/var/log/ceph# ceph osd tree
> ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF 
> -1       87.35376 root default                           
> -2       29.11688     host pm1                           
>  0   hdd  3.64000         osd.0      up  1.00000 1.00000 
>  1   hdd  3.64000         osd.1      up  1.00000 1.00000 
>  2   hdd  3.63689         osd.2      up  1.00000 1.00000 
>  3   hdd  3.64000         osd.3      up  1.00000 1.00000 
> 12   hdd  3.64000         osd.12     up  1.00000 1.00000 
> 13   hdd  3.64000         osd.13     up  1.00000 1.00000 
> 14   hdd  3.64000         osd.14     up  1.00000 1.00000 
> 15   hdd  3.64000         osd.15     up  1.00000 1.00000 
> -3       29.12000     host pm2                           
>  4   hdd  3.64000         osd.4      up  1.00000 1.00000 
>  5   hdd  3.64000         osd.5      up  1.00000 1.00000 
>  6   hdd  3.64000         osd.6      up  1.00000 1.00000 
>  7   hdd  3.64000         osd.7      up  1.00000 1.00000 
> 16   hdd  3.64000         osd.16     up  1.00000 1.00000 
> 17   hdd  3.64000         osd.17     up  1.00000 1.00000 
> 18   hdd  3.64000         osd.18     up  1.00000 1.00000 
> 19   hdd  3.64000         osd.19     up  1.00000 1.00000 
> -4       29.11688     host pm3                           
>  8   hdd  3.64000         osd.8      up  1.00000 1.00000 
>  9   hdd  3.64000         osd.9      up  1.00000 1.00000 
> 10   hdd  3.64000         osd.10     up  1.00000 1.00000 
> 11   hdd  3.64000         osd.11     up  1.00000 1.00000 
> 20   hdd  3.64000         osd.20     up  1.00000 1.00000 
> 21   hdd  3.64000         osd.21     up  1.00000 1.00000 
> 22   hdd  3.64000         osd.22     up  1.00000 1.00000 
> 23   hdd  3.63689         osd.23     up  1.00000 1.00000

We have journals on SSD.
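
If it helps, I can also post per-OSD and per-disk latency numbers; I'd 
gather them roughly like this:

   ceph osd perf    # commit/apply latency per OSD
   iostat -x 1      # per-device utilisation on the journal SSDs and HDDs
                    # (iostat is from the sysstat package)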

The crush map:

> root@pm2:/var/log/ceph# cat /tmp/decomp
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable chooseleaf_vary_r 1
> tunable chooseleaf_stable 1
> tunable straw_calc_version 1
> tunable allowed_bucket_algs 54
> 
> # devices
> device 0 osd.0 class hdd
> device 1 osd.1 class hdd
> device 2 osd.2 class hdd
> device 3 osd.3 class hdd
> device 4 osd.4 class hdd
> device 5 osd.5 class hdd
> device 6 osd.6 class hdd
> device 7 osd.7 class hdd
> device 8 osd.8 class hdd
> device 9 osd.9 class hdd
> device 10 osd.10 class hdd
> device 11 osd.11 class hdd
> device 12 osd.12 class hdd
> device 13 osd.13 class hdd
> device 14 osd.14 class hdd
> device 15 osd.15 class hdd
> device 16 osd.16 class hdd
> device 17 osd.17 class hdd
> device 18 osd.18 class hdd
> device 19 osd.19 class hdd
> device 20 osd.20 class hdd
> device 21 osd.21 class hdd
> device 22 osd.22 class hdd
> device 23 osd.23 class hdd
> 
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
> 
> # buckets
> host pm1 {
> 	id -2		# do not change unnecessarily
> 	id -5 class hdd		# do not change unnecessarily
> 	# weight 29.117
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.0 weight 3.640
> 	item osd.1 weight 3.640
> 	item osd.3 weight 3.640
> 	item osd.12 weight 3.640
> 	item osd.13 weight 3.640
> 	item osd.14 weight 3.640
> 	item osd.15 weight 3.640
> 	item osd.2 weight 3.637
> }
> host pm2 {
> 	id -3		# do not change unnecessarily
> 	id -6 class hdd		# do not change unnecessarily
> 	# weight 29.120
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.4 weight 3.640
> 	item osd.5 weight 3.640
> 	item osd.6 weight 3.640
> 	item osd.7 weight 3.640
> 	item osd.16 weight 3.640
> 	item osd.17 weight 3.640
> 	item osd.18 weight 3.640
> 	item osd.19 weight 3.640
> }
> host pm3 {
> 	id -4		# do not change unnecessarily
> 	id -7 class hdd		# do not change unnecessarily
> 	# weight 29.117
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.8 weight 3.640
> 	item osd.9 weight 3.640
> 	item osd.10 weight 3.640
> 	item osd.11 weight 3.640
> 	item osd.20 weight 3.640
> 	item osd.21 weight 3.640
> 	item osd.22 weight 3.640
> 	item osd.23 weight 3.637
> }
> root default {
> 	id -1		# do not change unnecessarily
> 	id -8 class hdd		# do not change unnecessarily
> 	# weight 87.354
> 	alg straw
> 	hash 0	# rjenkins1
> 	item pm1 weight 29.117
> 	item pm2 weight 29.120
> 	item pm3 weight 29.117
> }
> 
> # rules
> rule replicated_ruleset {
> 	id 0
> 	type replicated
> 	min_size 1
> 	max_size 10
> 	step take default
> 	step chooseleaf firstn 0 type host
> 	step emit
> }
> 
> # end crush map
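
For completeness, the decompiled map above (/tmp/decomp) was produced 
roughly like this (the intermediate filename is arbitrary):

   ceph osd getcrushmap -o /tmp/crushmap.bin
   crushtool -d /tmp/crushmap.bin -o /tmp/decomp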

The three servers are identical: 128GB memory (50% used), dual Xeon(R) 
CPU E5-2630 v4 @ 2.20GHz, PVE 5.3.

Any ideas where to look? I could of course try rebooting node pm2 to 
see if that makes the issue go away, but I'd rather understand why 
osd.18 does not respond to heartbeat messages, why swap usage is at 
100%, and why multiple high-CPU kworker threads are running on this 
host only.
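
If it would help, I can also post the output of something like:

   ceph health detail                     # any slow request / heartbeat warnings
   ceph daemon osd.18 dump_ops_in_flight  # run on pm2, via the admin socket
   ceph daemon osd.18 dump_historic_ops   # recent slow ops on that OSD
   ceph daemon osd.18 perf dump           # per-OSD performance counters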

MJ


