[pve-devel] pveceph createmon timeout in probing state

Lorenz Schori lo at znerol.ch
Fri Oct 19 20:13:35 CEST 2018


Hi,

This is about a 3-node PVE ceph cluster where one monitor failed, and
the 3-day struggle to revive it.

First day
I just waited and hoped that ceph would do its magic and heal itself.
Of course I also checked everything else (firewall, network, time),
moved VMs around and rebooted nodes. Nothing.

Second day
I scrapped the failing monitor and attempted to recreate it. Repeatedly.

pveceph destroymon --exclude-manager 1 <nodename>
pveceph createmon --exclude-manager 1
admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
INFO:ceph-create-keys:ceph-mon admin socket not ready yet.
INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
[...]
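
While createmon is stuck probing like this, the prospective monitor's
own view can be queried through its admin socket. A rough sketch,
assuming the monitor id matches the node name and the admin socket
under /var/run/ceph has already been created:

# prints "state": "probing" plus the monmap this monitor knows about,
# i.e. whether it has learned anything about its peers yet
ceph daemon mon.<nodename> mon_status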

This went on until it eventually failed after 10 minutes. I checked
logs and looked for hints in bug reports, mailing lists and forum
posts. I lost a couple of hours digging into cephx because of the first
of the following log lines, which suggests authentication problems:

2018-10-19 16:14:16.162125 XXX  0 mon.<leader-name>@0(leader) e4 ms_verify_authorizer bad authorizer from mon <node-ip>:6789/0
2018-10-19 16:14:16.162179 XXX  0 -- <leader-ip>:6789/0 >> <node-ip>:6789/0 conn(<con-id> :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: challenging authorizer

The "bad authorizer" line does not imply that the prospective monitor
did send bad credentials. The relevant line is the second one
indicating that the leader node responds to the prospective monitor with
a challenge.

At least it was clear by now that the prospective monitor was actually
communicating with the other nodes. But other than that, still nothing.

Third day
I spent a lot of time producing an Ansible script which bumps up log
levels (a sketch of how to raise them follows further down), captures
network traffic and collects logs and pcap files on all nodes at the
same time. This made it a little bit easier to correlate debug
information. The following log lines are key to understanding the
problem:

on leader:
2018-10-19 16:14:16.162506 XXX 10 -- <leader-ip>:6789/0 >> <node-ip>:6789/0 conn(<con-id> :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept setting up session_security.
2018-10-19 16:14:16.162516 XXX 10 -- <leader-ip>:6789/0 >> <node-ip>:6789/0 conn(<con-id> :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept existing <other-con-id>.gseq 0 <= 2, looks ok
2018-10-19 16:14:16.162521 XXX  0 -- <leader-ip>:6789/0 >> <node-ip>:6789/0 conn(<con-id> :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY
2018-10-19 16:14:16.162533 XXX 10 -- <leader-ip>:6789/0 >> <node-ip>:6789/0 conn(<con-id> :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connection race, existing <other-con-id>.cseq 0 == 0, sending WAIT

on prospective monitor:
2018-10-19 16:14:16.162724 XXX  1 -- <node-ip>:6789/0 >> <leader-ip>:6789/0 conn(<con-id> :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got WAIT (connection race)
2018-10-19 16:14:16.162735 XXX  1 -- <node-ip>:6789/0 >> <leader-ip>:6789/0 conn(<con-id> :-1 s=STATE_WAIT pgs=0 cs=0 l=0).process enter wait state, failing

The leader answers with WAIT because it thinks there already is an
open (or half-open) connection to this peer which should win the race.
However, this is the very first attempt of the freshly created monitor
to talk to the leader; on its side there simply is no connection which
could be reused.
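
For reference, these messenger-level traces only show up with raised
debug settings. One way to do that at runtime, as a sketch (assuming
the leader's monitor id and a ceph version which accepts config
changes via the admin socket):

# raise monitor and messenger logging on the leader at runtime;
# remember to lower them back to the defaults afterwards
ceph daemon mon.<leader-name> config set debug_mon 10
ceph daemon mon.<leader-name> config set debug_ms 10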

Interestingly, it is possible to inspect open connections/sessions
using the following command:

ceph daemon mon.<leader-name> sessions

There should be some sessions for OSDs, the other monitors and the
managers. If the output lists many sessions which are not obviously
related to any of the aforementioned components, this might indicate
stale connections. I have no idea what circumstances lead to that
situation.
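
As a rough heuristic one can group the session entries by entity type.
The exact output format of the sessions command differs between ceph
releases, so treat this only as a sketch:

# count sessions per entity type on the leader; adapt the pattern to
# the output format of your ceph release
ceph daemon mon.<leader-name> sessions \
  | grep -oE '(mon|osd|mgr|client)\.' | sort | uniq -c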

Long story short, in order to persuade the forlorn monitor to rejoin
its pals, it was enough to simply restart the ceph-mon processes on the
other nodes shortly before calling pveceph createmon.
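
In concrete terms that amounted to something along these lines,
assuming the standard ceph-mon systemd units and monitor ids matching
the node names (a sketch, not a polished procedure):

# on each of the other, healthy nodes
systemctl restart ceph-mon@<id>.service

# then, right afterwards, on the node with the missing monitor
pveceph createmon --exclude-manager 1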

I'm not sure whether this merits a bug report. I have no idea where to
report it in the first place since it is not clear to me if this is a
problem in PVE, ceph or any of the gazillion cluster components.

Cheers,
Lorenz