[pve-devel] [PATCH ha-manager] fix inf. loop error on orphaned workers

Thomas Lamprecht t.lamprecht at proxmox.com
Fri Feb 5 16:00:51 CET 2016


My tests were done as follows.
Prerequisites: some (dummy) VMs with consecutive VMIDs (makes it easier).

for i in {1000..1010}; do ha-manager add "vm:$i"; done

## wait until all are started
for i in {1000..1010}; do ha-manager disable "vm:$i"; done

## wait a few seconds until the first series of workers is started.
for i in {1000..1010}; do ha-manager remove "vm:$i"; done

Then look at
journalctl -f
to see the errors
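
For convenience, the steps above could be wrapped into one small script. This is only a sketch: the function names, the HA_MANAGER/FIRST/LAST variables and both sleep durations are my own assumptions, not part of ha-manager, and the waits will likely need tuning for a given cluster. Setting HA_MANAGER=echo gives a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of the reproduction steps as one script.
# Assumptions (not from ha-manager itself): the variable names below
# and the wait times; dummy VMs FIRST..LAST must already exist.
HA_MANAGER="${HA_MANAGER:-ha-manager}"   # set to "echo" for a dry run
FIRST="${FIRST:-1000}"
LAST="${LAST:-1010}"

# Run "ha-manager <action> vm:<id>" for every VMID in FIRST..LAST.
for_each_vm() {
    action="$1"
    id="$FIRST"
    while [ "$id" -le "$LAST" ]; do
        $HA_MANAGER "$action" "vm:$id"
        id=$((id + 1))
    done
}

reproduce() {
    for_each_vm add
    sleep "${WAIT_STARTED:-120}"   # guess: time until all are started
    for_each_vm disable
    sleep "${WAIT_WORKERS:-5}"     # guess: first series of workers running
    for_each_vm remove
}

# Only touch the cluster when explicitly asked to.
if [ "${1:-}" = "--run" ]; then
    reproduce
fi
```

Call it with --run on a test node and keep journalctl -f open in a second terminal.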

On 02/05/2016 03:56 PM, Thomas Lamprecht wrote:
> When we have a running job for a service which gets removed from
> HA, it can result in an error. This is normally not problematic if
> the worker was already started (= has a PID); otherwise we may trigger
> a loop of errors when "$max_workers" workers are already active and we
> remove a service with a queued CRM command.
>
> Signed-off-by: Thomas Lamprecht <t.lamprecht at proxmox.com>
> ---
>
>
> Steps to reproduce it:
>
> Add > 4 resources to HA and wait until started, then disable
> them all at once, and directly after that remove them from HA.
>
> The first 4 resources will get the
> "missing resource configuration for '$sid'"
> error until their workers have finished.
>
> But the remaining, already queued, workers (number 5 and upwards)
> will end in infinite "resource config missing" errors which
> then result in a failed LRM (if you try to restart it)
>
>
>   src/PVE/HA/LRM.pm | 11 ++++++++++-
>   1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
> index 1894f3c..44940db 100644
> --- a/src/PVE/HA/LRM.pm
> +++ b/src/PVE/HA/LRM.pm
> @@ -378,7 +378,16 @@ sub run_workers {
>   	    my $w = $self->{workers}->{$sid};
>   	    my $cd = $sc->{$sid};
>   	    if (!$cd) {
> -		$haenv->log('err', "missing resource configuration for '$sid'");
> +		# if not already started don't start the worker at all,
> +		# as the service was removed from HA management, else warn
> +		if (!$w->{pid}) {
> +		    delete $self->{workers}->{$sid};
> +		    $haenv->log('err', "missing resource configuration for " .
> +				"'$sid' - do not start worker [$w->{state}]");
> +		} else {
> +		    $haenv->log('err', "orphaned active worker [$w->{state}] for" .
> +				" service '$sid' with no resource configuration");
> +		}
>   		next;
>   	    }
>   	    if (!$w->{pid}) {

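The decision the patched branch makes can be modelled outside of the LRM. The following is a toy model only, not the real code: prune_workers and the "sid=pid" argument format are hypothetical, chosen just to show the three outcomes (configured, queued without PID, already running).

```shell
#!/bin/sh
# Toy model of the patched check in run_workers (not the real LRM code).
# Usage: prune_workers "CONFIGURED_SIDS" WORKER...
# Each WORKER is "sid=pid"; an empty pid means queued, never forked.
prune_workers() {
    configured=" $1 "
    shift
    for w in "$@"; do
        sid="${w%%=*}"
        pid="${w#*=}"
        case "$configured" in
            *" $sid "*)
                # service still has a resource configuration
                echo "keep $sid"
                continue
                ;;
        esac
        if [ -z "$pid" ]; then
            # config gone and worker not started: forget it entirely
            echo "drop $sid (queued, never started)"
        else
            # config gone but worker runs: only log, let it finish
            echo "warn $sid (orphaned active worker, pid $pid)"
        fi
    done
}
```

For example, prune_workers "vm:1001" vm:1000=4242 vm:1001=4243 vm:1005= prints a "warn" line for vm:1000, a "keep" line for vm:1001 and a "drop" line for vm:1005. Dropping the queued entry is what breaks the loop, as it is never looked at again.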