[PVE-User] lots of 'heartbeat_check: no reply from ...' in the logs

Alwin Antreich a.antreich at proxmox.com
Fri Feb 8 07:48:31 CET 2019


Hello Mj,

On Thu, Feb 07, 2019 at 08:15:52PM +0100, mj wrote:
> Hi,
> 
> We are getting continuous lines like this in our logs, between osd.19 and
> osd.18, both of which are on the same host pm2:
> 
> > 2019-02-07T19:59:24.724447+01:00 pm2 ceph-osd 3093 - - 2019-02-07 19:59:24.723800 7f902e9f0700 -1 osd.19 15136 heartbeat_check: no reply from 10.10.89.2:6807 osd.18 ever on either front or back, first ping sent 2019-02-07 07:58:32.526903 (cutoff 2019-02-07 19:59:04.723796)
> 
> I can ping the IP address 10.10.89.2 from host pm2, and nc also confirms
> that the port is listening:
> 
> > root@pm2:~# nc -vz 10.10.89.2 6807
> > nc: 10.10.89.2 (10.10.89.2) 6807 [6807] open

These messages are not necessarily caused by a network issue. It may
well be that the osd.18 daemon cannot respond to heartbeat messages.
Check the logs of osd.18 on its host.
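
To see how widespread the messages are, a small script can count them per
OSD pair. The following is only a rough sketch, not part of any Ceph
tooling: the function name is made up for illustration, and the pattern is
modeled solely on the Luminous log line quoted above, so other releases may
need a different expression.

```python
import re
from collections import Counter

# Pattern modeled on the heartbeat_check line quoted in this thread
# (Ceph 12.2.x); treat it as an assumption for other releases.
HEARTBEAT_RE = re.compile(
    r"(?P<reporter>osd\.\d+) \d+ heartbeat_check: no reply from "
    r"(?P<addr>\S+) (?P<peer>osd\.\d+)"
)


def summarize_heartbeat_failures(lines):
    """Count 'no reply' messages per (reporting OSD, silent peer) pair."""
    counts = Counter()
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            counts[(match["reporter"], match["peer"])] += 1
    return counts
```

It could be fed an open log file, for example
`summarize_heartbeat_failures(open("/var/log/ceph/ceph-osd.19.log"))`; that
path is the usual default location and is assumed here. If only one pair
shows up, the problem is more likely with that one daemon than with the
network.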

> 
> We had some trouble this morning, doing too many things at the same time,
> causing slow requests, etc, but the system recovered, and has been up and
> running the whole day, no issues anymore. However, this is when these
> messages started appearing.
> 
> Some info on our system, consisting of three identical nodes:
> 
> > root@pm2:~# ceph -v
> > ceph version 12.2.10 (fc2b1783e3727b66315cc667af9d663d30fe7ed4) luminous (stable)
> > root@pm2:~# ceph health detail
> > HEALTH_OK
> > root@pm2:~# ceph -s
> >   cluster:
> >     id:     1397f1dc-7d94-43ea-ab1xxxxxxxc1
> >     health: HEALTH_OK
> >   services:
> >     mon: 3 daemons, quorum 0,1,2
> >     mgr: pm1(active), standbys: pm3, pm2
> >     osd: 24 osds: 24 up, 24 in
> >   data:
> >     pools:   2 pools, 1088 pgs
> >     objects: 4.50M objects, 17.1TiB
> >     usage:   51.5TiB used, 35.8TiB / 87.3TiB avail
> >     pgs:     1085 active+clean
> >              2    active+clean+scrubbing
> >              1    active+clean+scrubbing+deep
> >   io:
> >     client:   18.3MiB/s rd, 39.8MiB/s wr, 87op/s rd, 539op/s wr
> 
> What can I do to get rid of these messages..? They sound serious. More info
> required, just let me know...!

The cluster is also scrubbing, which is an intensive operation that
taxes your OSDs and intensifies the issue. In general, though, you need
to find out what caused the slow requests. Ceph is able to throttle and
tries to get IOs done, even under pressure.

If you describe your system further (e.g. osd tree, crush map, system
specs), then we may be able to point you in the right direction. ;)
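
One possible way to gather that information in one go is sketched below.
The choice of commands is only a suggestion, and the helper name is
hypothetical; the commands themselves are read-only and available in
Luminous.

```python
import shutil
import subprocess

# Read-only commands covering the details asked for above.
# The selection is a suggestion, not an official checklist.
DIAGNOSTIC_COMMANDS = [
    ["ceph", "osd", "tree"],           # OSD-to-host layout and up/in state
    ["ceph", "osd", "df", "tree"],     # per-OSD utilisation and PG counts
    ["ceph", "osd", "crush", "dump"],  # CRUSH map as JSON
    ["ceph", "versions"],              # daemon versions across the cluster
]


def collect_diagnostics(commands=DIAGNOSTIC_COMMANDS, timeout=30):
    """Run each command and return {command string: output or error text}."""
    report = {}
    for cmd in commands:
        key = " ".join(cmd)
        if shutil.which(cmd[0]) is None:
            report[key] = f"skipped: {cmd[0]} not found in PATH"
            continue
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout
            )
            report[key] = result.stdout if result.returncode == 0 else result.stderr
        except subprocess.TimeoutExpired:
            report[key] = f"timed out after {timeout}s"
    return report
```

Printing each key and value of the returned dictionary gives a block of
text that can be pasted into a reply. System specs (CPU, RAM, disk and
network hardware) still have to be added by hand.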

--
Cheers,
Alwin



