[pve-devel] [PATCH ha-manager] fix inf. loop error on orphaned workers

Thomas Lamprecht t.lamprecht at proxmox.com
Fri Feb 5 15:56:26 CET 2016


When a service which still has a running job gets removed from
HA, the LRM logs an error because its resource configuration is
missing. This is normally not problematic if the worker was
already started (i.e. has a PID). Otherwise we may trigger a loop
of errors when "$max_workers" workers are already active and we
remove a service with a queued CRM command.

Signed-off-by: Thomas Lamprecht <t.lamprecht at proxmox.com>
---


Steps to reproduce it:

Add more than 4 resources to HA and wait until they are started,
then disable them all at once and remove them from HA directly
afterwards.

The first 4 resources will get the
"missing resource configuration for '$sid'"
error until their workers have finished.

But the remaining, already queued workers (number 5 and upwards)
will end up in infinite "resource config missing" errors, which
then result in a failed LRM (if you try to restart it).
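
The intended handling can be sketched in isolation as below. This
is only a simplified model and not the actual LRM code: the helper
name prune_orphaned_workers, the plain hashes, the log array, and
the sample service IDs, PID and state values are all assumptions
made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of PVE::HA::LRM.
# $workers: sid => { state => ..., pid => ... } (pid only if started)
# $sc:      sid => resource configuration
# $log:     array reference collecting the log messages
sub prune_orphaned_workers {
    my ($workers, $sc, $log) = @_;

    foreach my $sid (sort keys %$workers) {
        my $w = $workers->{$sid};
        next if $sc->{$sid}; # still configured, nothing to do

        if (!$w->{pid}) {
            # queued but never started: drop it, so that it is not
            # looked at (and logged) again on every pass
            delete $workers->{$sid};
            push @$log, "missing resource configuration for " .
                "'$sid' - do not start worker [$w->{state}]";
        } else {
            # already running: keep it until it exits, only report
            push @$log, "orphaned active worker [$w->{state}] for" .
                " service '$sid' with no resource configuration";
        }
    }
    return $workers;
}

# illustrative data: both services were removed from HA
my $workers = {
    'vm:101' => { state => 'stopped', pid => 4242 }, # already started
    'vm:105' => { state => 'stopped' },              # still queued
};
my $sc = {};
my @log;

prune_orphaned_workers($workers, $sc, \@log);
print "$_\n" for @log;
```

In this model the queued entry for 'vm:105' disappears after the
first pass, while the started worker for 'vm:101' stays in the
table and is only reported.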


 src/PVE/HA/LRM.pm | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
index 1894f3c..44940db 100644
--- a/src/PVE/HA/LRM.pm
+++ b/src/PVE/HA/LRM.pm
@@ -378,7 +378,16 @@ sub run_workers {
 	    my $w = $self->{workers}->{$sid};
 	    my $cd = $sc->{$sid};
 	    if (!$cd) {
-		$haenv->log('err', "missing resource configuration for '$sid'");
+		# if not already started don't start the worker at all,
+		# as the service was removed from HA management, else warn
+		if (!$w->{pid}) {
+		    delete $self->{workers}->{$sid};
+		    $haenv->log('err', "missing resource configuration for " .
+				"'$sid' - do not start worker [$w->{state}]");
+		} else {
+		    $haenv->log('err', "orphaned active worker [$w->{state}] for" .
+				" service '$sid' with no resource configuration");
+		}
 		next;
 	    }
 	    if (!$w->{pid}) {
-- 
2.1.4




