[pve-devel] [PATCH ha-manager 2/3] fix #1378: allow to specify a service shutdown policy

Fabian Grünbichler f.gruenbichler at proxmox.com
Wed Dec 19 13:10:17 CET 2018


On Wed, Dec 19, 2018 at 11:39:40AM +0100, Thomas Lamprecht wrote:
> Allow an admin to set a datacenter wide HA policy which can change
> the way we handle services on a node shutdown.
> 
> There's:
> 
> * freeze: always freeze servivces, independent of the shutdown type
>   (reboot, poweroff)
> * failover: never freeze services, this means that a service will get
>   recovered to another node if possible and if the current node does
>   not comes back up in the grace period of 1 minute.

s/comes/come

why 1 minute here, and 2 minutes in the other patch? ;)

> * default: this is the current behavior, freeze on reboot but do not
>   freeze on poweroff
> 
> Add to tests, shutdown-policy1 which is based of the reboot1 test,
> but enforces no freeze with a failover policy, and shutdown-policy2
> which is based on the shutdown1 test but with a explicit freeze
> policy. You can compare (diff) each tests log result to the test it's
> based on to see what changes.
> 
> Signed-off-by: Thomas Lamprecht <t.lamprecht at proxmox.com>
> ---
>  src/PVE/HA/LRM.pm                             | 22 +++++++++-
>  src/test/test-shutdown-policy1/cmdlist        |  4 ++
>  src/test/test-shutdown-policy1/datacenter.cfg |  5 +++
>  .../test-shutdown-policy1/hardware_status     |  5 +++
>  src/test/test-shutdown-policy1/log.expect     | 40 +++++++++++++++++++
>  src/test/test-shutdown-policy1/manager_status |  1 +
>  src/test/test-shutdown-policy1/service_config |  3 ++
>  src/test/test-shutdown-policy2/cmdlist        |  4 ++
>  src/test/test-shutdown-policy2/datacenter.cfg |  5 +++
>  .../test-shutdown-policy2/hardware_status     |  5 +++
>  src/test/test-shutdown-policy2/log.expect     | 34 ++++++++++++++++
>  src/test/test-shutdown-policy2/manager_status |  1 +
>  src/test/test-shutdown-policy2/service_config |  3 ++
>  13 files changed, 130 insertions(+), 2 deletions(-)
>  create mode 100644 src/test/test-shutdown-policy1/cmdlist
>  create mode 100644 src/test/test-shutdown-policy1/datacenter.cfg
>  create mode 100644 src/test/test-shutdown-policy1/hardware_status
>  create mode 100644 src/test/test-shutdown-policy1/log.expect
>  create mode 100644 src/test/test-shutdown-policy1/manager_status
>  create mode 100644 src/test/test-shutdown-policy1/service_config
>  create mode 100644 src/test/test-shutdown-policy2/cmdlist
>  create mode 100644 src/test/test-shutdown-policy2/datacenter.cfg
>  create mode 100644 src/test/test-shutdown-policy2/hardware_status
>  create mode 100644 src/test/test-shutdown-policy2/log.expect
>  create mode 100644 src/test/test-shutdown-policy2/manager_status
>  create mode 100644 src/test/test-shutdown-policy2/service_config
> 
> diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
> index dda82eb..fc560fa 100644
> --- a/src/PVE/HA/LRM.pm
> +++ b/src/PVE/HA/LRM.pm
> @@ -53,6 +53,20 @@ sub shutdown_request {
>  
>      my ($shutdown, $reboot) = $haenv->is_node_shutdown();
>  
> +    my $dc_ha_cfg = $haenv->get_ha_settings();
> +    my $shutdown_policy = $dc_ha_cfg->{shutdown_policy} // 'default';
> +
> +    my $freeze_all = $reboot;
> +    if ($shutdown_policy eq 'default') {
> +	$freeze_all = $reboot;
> +    } elsif ($shutdown_policy eq 'freeze') {
> +	$freeze_all = 1;
> +    } elsif ($shutdown_policy eq 'failover') {
> +	$freeze_all = 0;
> +    } else {
> +	$haenv->log('err', "unkown shutdown policy '$shutdown_policy', fall back to default");
> +    }

s/unkown/unknown/

this is a bit weird to read IMHO. how about

my $freeze_all;
if ($shutdown_policy eq 'freeze') {
  freeze_all = 1;
} elsif ($shutdown_policy eq 'failover') {
  freeze_all = 0;
} else {
  env->log('err', "unknown shutdown policy '$shutdown_policy', fall back to default")
    if $shutdown_policy ne 'default';
  freeze_all = $reboot;
}

> +
>      if ($shutdown) {
>  	# *always* queue stop jobs for all services if the node shuts down,
>  	# independent if it's a reboot or a poweroff, else we may corrupt
> @@ -69,8 +83,12 @@ sub shutdown_request {
>      }
>  
>      if ($shutdown) {
> -	if ($reboot) {
> -	    $haenv->log('info', "reboot LRM, stop and freeze all services");
> +	if ($freeze_all) {
> +	    if ($shutdown_policy eq 'default') {

wouldn't $reboot be more appropriate here instead of matching on the
policy?

alternatively, we could just log shutdown vs. reboot and policy
separately (the latter is patch #3 anyway), and drop this combined
output? or just log shutdown type + policy?

> +		$haenv->log('info', "reboot LRM, stop and freeze all services");
> +	    } else {
> +		$haenv->log('info', "shutdown LRM, stop and freeze all services");
> +	    }
>  	    $self->{mode} = 'restart';
>  	} else {
>  	    $haenv->log('info', "shutdown LRM, stop all services");

isn't this missing the second case as well? i.e., reboot + failover/stop?

> diff --git a/src/test/test-shutdown-policy1/cmdlist b/src/test/test-shutdown-policy1/cmdlist
> new file mode 100644
> index 0000000..8558351
> --- /dev/null
> +++ b/src/test/test-shutdown-policy1/cmdlist
> @@ -0,0 +1,4 @@
> +[
> +    [ "power node1 on", "power node2 on", "power node3 on"],
> +    [ "reboot node3" ]
> +]
> diff --git a/src/test/test-shutdown-policy1/datacenter.cfg b/src/test/test-shutdown-policy1/datacenter.cfg
> new file mode 100644
> index 0000000..6108ece
> --- /dev/null
> +++ b/src/test/test-shutdown-policy1/datacenter.cfg
> @@ -0,0 +1,5 @@
> +{
> +    "ha": {
> +        "shutdown_policy": "failover"
> +    }
> +}
> diff --git a/src/test/test-shutdown-policy1/hardware_status b/src/test/test-shutdown-policy1/hardware_status
> new file mode 100644
> index 0000000..451beb1
> --- /dev/null
> +++ b/src/test/test-shutdown-policy1/hardware_status
> @@ -0,0 +1,5 @@
> +{
> +  "node1": { "power": "off", "network": "off" },
> +  "node2": { "power": "off", "network": "off" },
> +  "node3": { "power": "off", "network": "off" }
> +}
> diff --git a/src/test/test-shutdown-policy1/log.expect b/src/test/test-shutdown-policy1/log.expect
> new file mode 100644
> index 0000000..385b07a
> --- /dev/null
> +++ b/src/test/test-shutdown-policy1/log.expect
> @@ -0,0 +1,40 @@
> +info      0     hardware: starting simulation
> +info     20      cmdlist: execute power node1 on
> +info     20    node1/crm: status change startup => wait_for_quorum
> +info     20    node1/lrm: status change startup => wait_for_agent_lock
> +info     20      cmdlist: execute power node2 on
> +info     20    node2/crm: status change startup => wait_for_quorum
> +info     20    node2/lrm: status change startup => wait_for_agent_lock
> +info     20      cmdlist: execute power node3 on
> +info     20    node3/crm: status change startup => wait_for_quorum
> +info     20    node3/lrm: status change startup => wait_for_agent_lock
> +info     20    node1/crm: got lock 'ha_manager_lock'
> +info     20    node1/crm: status change wait_for_quorum => master
> +info     20    node1/crm: node 'node1': state changed from 'unknown' => 'online'
> +info     20    node1/crm: node 'node2': state changed from 'unknown' => 'online'
> +info     20    node1/crm: node 'node3': state changed from 'unknown' => 'online'
> +info     20    node1/crm: adding new service 'vm:103' on node 'node3'
> +info     22    node2/crm: status change wait_for_quorum => slave
> +info     24    node3/crm: status change wait_for_quorum => slave
> +info     25    node3/lrm: got lock 'ha_agent_node3_lock'
> +info     25    node3/lrm: status change wait_for_agent_lock => active
> +info     25    node3/lrm: starting service vm:103
> +info     25    node3/lrm: service status vm:103 started
> +info    120      cmdlist: execute reboot node3
> +info    120    node3/lrm: shutdown LRM, stop all services
> +info    125    node3/lrm: stopping service vm:103
> +info    125    node3/lrm: service status vm:103 stopped
> +info    126    node3/lrm: exit (loop end)
> +info    126       reboot: execute crm node3 stop
> +info    125    node3/crm: server received shutdown request
> +info    145    node3/crm: exit (loop end)
> +info    145       reboot: execute power node3 off
> +info    145       reboot: execute power node3 on
> +info    145    node3/crm: status change startup => wait_for_quorum
> +info    140    node3/lrm: status change startup => wait_for_agent_lock
> +info    145    node3/lrm: got lock 'ha_agent_node3_lock'
> +info    145    node3/lrm: status change wait_for_agent_lock => active
> +info    145    node3/lrm: starting service vm:103
> +info    145    node3/lrm: service status vm:103 started
> +info    164    node3/crm: status change wait_for_quorum => slave
> +info    720     hardware: exit simulation - done
> diff --git a/src/test/test-shutdown-policy1/manager_status b/src/test/test-shutdown-policy1/manager_status
> new file mode 100644
> index 0000000..0967ef4
> --- /dev/null
> +++ b/src/test/test-shutdown-policy1/manager_status
> @@ -0,0 +1 @@
> +{}
> diff --git a/src/test/test-shutdown-policy1/service_config b/src/test/test-shutdown-policy1/service_config
> new file mode 100644
> index 0000000..c6860e7
> --- /dev/null
> +++ b/src/test/test-shutdown-policy1/service_config
> @@ -0,0 +1,3 @@
> +{
> +    "vm:103": { "node": "node3", "state": "enabled" }
> +}
> diff --git a/src/test/test-shutdown-policy2/cmdlist b/src/test/test-shutdown-policy2/cmdlist
> new file mode 100644
> index 0000000..a86b9e2
> --- /dev/null
> +++ b/src/test/test-shutdown-policy2/cmdlist
> @@ -0,0 +1,4 @@
> +[
> +    [ "power node1 on", "power node2 on", "power node3 on"],
> +    [ "shutdown node3" ]
> +]
> diff --git a/src/test/test-shutdown-policy2/datacenter.cfg b/src/test/test-shutdown-policy2/datacenter.cfg
> new file mode 100644
> index 0000000..0d411c1
> --- /dev/null
> +++ b/src/test/test-shutdown-policy2/datacenter.cfg
> @@ -0,0 +1,5 @@
> +{
> +    "ha": {
> +        "shutdown_policy": "freeze"
> +    }
> +}
> diff --git a/src/test/test-shutdown-policy2/hardware_status b/src/test/test-shutdown-policy2/hardware_status
> new file mode 100644
> index 0000000..451beb1
> --- /dev/null
> +++ b/src/test/test-shutdown-policy2/hardware_status
> @@ -0,0 +1,5 @@
> +{
> +  "node1": { "power": "off", "network": "off" },
> +  "node2": { "power": "off", "network": "off" },
> +  "node3": { "power": "off", "network": "off" }
> +}
> diff --git a/src/test/test-shutdown-policy2/log.expect b/src/test/test-shutdown-policy2/log.expect
> new file mode 100644
> index 0000000..a36c628
> --- /dev/null
> +++ b/src/test/test-shutdown-policy2/log.expect
> @@ -0,0 +1,34 @@
> +info      0     hardware: starting simulation
> +info     20      cmdlist: execute power node1 on
> +info     20    node1/crm: status change startup => wait_for_quorum
> +info     20    node1/lrm: status change startup => wait_for_agent_lock
> +info     20      cmdlist: execute power node2 on
> +info     20    node2/crm: status change startup => wait_for_quorum
> +info     20    node2/lrm: status change startup => wait_for_agent_lock
> +info     20      cmdlist: execute power node3 on
> +info     20    node3/crm: status change startup => wait_for_quorum
> +info     20    node3/lrm: status change startup => wait_for_agent_lock
> +info     20    node1/crm: got lock 'ha_manager_lock'
> +info     20    node1/crm: status change wait_for_quorum => master
> +info     20    node1/crm: node 'node1': state changed from 'unknown' => 'online'
> +info     20    node1/crm: node 'node2': state changed from 'unknown' => 'online'
> +info     20    node1/crm: node 'node3': state changed from 'unknown' => 'online'
> +info     20    node1/crm: adding new service 'vm:103' on node 'node3'
> +info     22    node2/crm: status change wait_for_quorum => slave
> +info     24    node3/crm: status change wait_for_quorum => slave
> +info     25    node3/lrm: got lock 'ha_agent_node3_lock'
> +info     25    node3/lrm: status change wait_for_agent_lock => active
> +info     25    node3/lrm: starting service vm:103
> +info     25    node3/lrm: service status vm:103 started
> +info    120      cmdlist: execute shutdown node3
> +info    120    node3/lrm: shutdown LRM, stop and freeze all services
> +info    120    node1/crm: service 'vm:103': state changed from 'started' to 'freeze'
> +info    125    node3/lrm: stopping service vm:103
> +info    125    node3/lrm: service status vm:103 stopped
> +info    126    node3/lrm: exit (loop end)
> +info    126     shutdown: execute crm node3 stop
> +info    125    node3/crm: server received shutdown request
> +info    145    node3/crm: exit (loop end)
> +info    145     shutdown: execute power node3 off
> +info    160    node1/crm: node 'node3': state changed from 'online' => 'unknown'
> +info    720     hardware: exit simulation - done
> diff --git a/src/test/test-shutdown-policy2/manager_status b/src/test/test-shutdown-policy2/manager_status
> new file mode 100644
> index 0000000..0967ef4
> --- /dev/null
> +++ b/src/test/test-shutdown-policy2/manager_status
> @@ -0,0 +1 @@
> +{}
> diff --git a/src/test/test-shutdown-policy2/service_config b/src/test/test-shutdown-policy2/service_config
> new file mode 100644
> index 0000000..c6860e7
> --- /dev/null
> +++ b/src/test/test-shutdown-policy2/service_config
> @@ -0,0 +1,3 @@
> +{
> +    "vm:103": { "node": "node3", "state": "enabled" }
> +}
> -- 
> 2.19.2
> 
> 
> _______________________________________________
> pve-devel mailing list
> pve-devel at pve.proxmox.com
> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel




More information about the pve-devel mailing list