[pve-devel] [PATCH ha-manager 2/4] LRM: do not count erroneous service as active

Thomas Lamprecht t.lamprecht at proxmox.com
Tue May 3 10:03:15 CEST 2016


Fixes an infinite hang on updating or restarting a node's LRM when a
service in the error state is located on it.

We fix that by not counting those services towards the "active
service count", as a service in the error state is *not* managed by
HA and thus also not active.

As we use the active service count in three places, this change has
the following effects:

* when determining if we should try to acquire our agent_lock and
make the state transition from wait_for_agent_lock => active. Here
it's safe, as we cannot do anything with erroneous services anyhow.
The only way to move a service out of the error state is to disable
it (thus marking it as stopped), and as we also skip stopped
services it still is not counted as active afterwards. So this is
safe and also wanted.

* when restarting the LRM, to see if it has frozen all active
services and is thus allowed to finish the restart. This is the
primary intent of this patch, as it allows the LRM to restart safely
without hanging forever (and then being killed by systemd, which
results in a watchdog node reset).

* when shutting down an LRM in lost_agent_lock state. With this
patch the node won't get shot by its watchdog if it has erroneous
services but no active ones. This is not only safe but wanted: it
makes no sense to let the watchdog trigger when we have no services
besides the ones in the error state, as a restart does not make
anything better (error services won't get touched after all).

Signed-off-by: Thomas Lamprecht <t.lamprecht at proxmox.com>
---
 src/PVE/HA/LRM.pm | 2 ++
 1 file changed, 2 insertions(+)
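
For illustration only, the counting rule described in the commit message
can be sketched as a standalone Perl script. This is a minimal sketch,
not the code from LRM.pm: the function name count_active_services, the
shape of the $services hash and the service IDs are assumptions made up
for the example. Only the skipped states ('stopped', 'freeze', 'error')
are taken from the patch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the LRM's active service count. The real
# method reads its data from the LRM object; here the service table and
# the node name are passed in explicitly (an assumption for the sketch).
sub count_active_services {
    my ($services, $nodename) = @_;

    my $count = 0;
    foreach my $sid (sort keys %$services) {
        my $sd = $services->{$sid};

        # only services located on this node are of interest
        next if !defined($sd->{node}) || $sd->{node} ne $nodename;

        my $req_state = $sd->{state};
        next if !defined($req_state);
        next if $req_state eq 'stopped';
        next if $req_state eq 'freeze';
        # erroneous services are not managed by HA, don't count them
        next if $req_state eq 'error';

        $count++;
    }

    return $count;
}

# Made-up example data: one running, one erroneous and one stopped
# service on node1, plus a running service on another node.
my $services = {
    'vm:100' => { node => 'node1', state => 'started' },
    'vm:101' => { node => 'node1', state => 'error' },
    'vm:102' => { node => 'node1', state => 'stopped' },
    'vm:103' => { node => 'node2', state => 'started' },
};

print count_active_services($services, 'node1'), "\n"; # prints 1
```

With only vm:101 left on node1 the count drops to 0, which is what lets
a restart or shutdown proceed instead of waiting on a service that HA
will not touch.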

diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
index d49873f..a49410f 100644
--- a/src/PVE/HA/LRM.pm
+++ b/src/PVE/HA/LRM.pm
@@ -180,6 +180,8 @@ sub active_service_count {
 	next if !defined($req_state);
 	next if $req_state eq 'stopped';
 	next if $req_state eq 'freeze';
+	# erroneous services are not managed by HA, don't count them as active
+	next if $req_state eq 'error';
 
 	$count++;
     }
-- 
2.1.4




