[pve-devel] [PATCH ha-manager 4/6] lrm: handle an error during service_status update

Thomas Lamprecht t.lamprecht at proxmox.com
Tue Nov 7 15:27:11 CET 2017


We may get an error here if the cluster filesystem is (temporarily)
unavailable. This error resulted in stopping the whole CRM service
immediately, which then triggered a node reset (if it happened on
the current master), even if we still had time left to retry and
thus, for example, handle an update of pve-cluster gracefully.

Add a method which wraps the status read in an eval and logs any
error, but does not abort the service. Instead we rely on our
get_protected_ha_agent_lock method to detect a problem and switch
to the lost_agent_lock state.

The pmxcfs outage may be really short, so that the manager status
read failed but the lock update worked again. To cover this case we
also always update the service status before doing real work when in
the 'active' state. If this update fails we return from the eval and
try again next round, as there is no point in doing anything without
a consistent state.

Signed-off-by: Thomas Lamprecht <t.lamprecht at proxmox.com>
---
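Note for reviewers, not part of the commit: the behaviour described
above can be tried outside the HA stack with a minimal standalone
sketch. MockEnv and MockLRM are hypothetical stand-ins invented for
this illustration; only read_manager_status(), log() and
update_service_status() mirror names used in the patch, and the
returned status data is made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the HA environment; only the two methods
# used by update_service_status are mocked.
package MockEnv;

sub new {
    my ($class, %param) = @_;
    return bless { fail => $param{fail}, logs => [] }, $class;
}

# Dies while 'fail' is set, simulating an unavailable pmxcfs.
sub read_manager_status {
    my ($self) = @_;
    die "cluster filesystem unavailable\n" if $self->{fail};
    return { service_status => { 'vm:100' => { state => 'started' } } };
}

sub log {
    my ($self, $level, $msg) = @_;
    push @{$self->{logs}}, "$level: $msg";
}

# Reduced LRM carrying only the state the method touches.
package MockLRM;

sub new {
    my ($class, $haenv) = @_;
    return bless { haenv => $haenv, service_status => {} }, $class;
}

# Same logic as the method added by the patch.
sub update_service_status {
    my ($self) = @_;

    my $haenv = $self->{haenv};

    my $ms = eval { $haenv->read_manager_status(); };
    if (my $err = $@) {
        $haenv->log('err', "updating service status from manager failed: $err");
        return undef;
    }
    $self->{service_status} = $ms->{service_status} || {};
    return 1;
}

package main;

my $env = MockEnv->new(fail => 1);
my $lrm = MockLRM->new($env);

# Read fails: the error is logged, the old status is kept, nothing dies.
my $failed = $lrm->update_service_status();

# pmxcfs is back: the next round picks up the current status.
$env->{fail} = 0;
my $ok = $lrm->update_service_status();
```

The caller only has to check the return value in scalar context,
which is what the "return if !..." line in the 'active' branch does.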
 src/PVE/HA/LRM.pm | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
index 49e9f68..f076735 100644
--- a/src/PVE/HA/LRM.pm
+++ b/src/PVE/HA/LRM.pm
@@ -136,6 +136,21 @@ sub update_lrm_status {
     return 1;
 }
 
+sub update_service_status {
+    my ($self) = @_;
+
+    my $haenv = $self->{haenv};
+
+    my $ms = eval { $haenv->read_manager_status(); };
+    if (my $err = $@) {
+	$haenv->log('err', "updating service status from manager failed: $err");
+	return undef;
+    } else {
+	$self->{service_status} =  $ms->{service_status} || {};
+	return 1;
+    }
+}
+
 sub get_protected_ha_agent_lock {
     my ($self) = @_;
 
@@ -215,8 +230,7 @@ sub do_one_iteration {
     my $status = $self->get_local_status();
     my $state = $status->{state};
 
-    my $ms = $haenv->read_manager_status();
-    $self->{service_status} =  $ms->{service_status} || {};
+    $self->update_service_status();
 
     my $fence_request = PVE::HA::Tools::count_fenced_services($self->{service_status}, $haenv->nodename());
     
@@ -277,6 +291,10 @@ sub do_one_iteration {
 	eval {
 	    # fixme: set alert timer
 
+	    # if we could not get the current service status there's no point
+	    # in doing anything, try again next round.
+	    return if !$self->update_service_status();
+
 	    if ($self->{shutdown_request}) {
 
 		if ($self->{mode} eq 'restart') {
-- 
2.11.0




