[pve-devel] [PATCH manager 1/4] ceph: add perf data cache helpers

Thomas Lamprecht t.lamprecht at proxmox.com
Tue Nov 5 18:33:56 CET 2019


On 11/5/19 1:51 PM, Dominik Csapak wrote:
> add a helper to cache the ceph performance data inside pmxcfs
> with broadcast_node_kv, and also a helper to read it out
> 
> merge the data from all nodes that sent performance data
> 
> the '$perf_cache' variable actually serves two purposes:
> the writer (will be pvestatd) uses it to broadcast only its own values,
> and the reader (pvedaemon) uses it to cache all nodes' data
> 
> Signed-off-by: Dominik Csapak <d.csapak at proxmox.com>
> ---
> merging the data on read seems like a good idea, since we have the data
> and it would make little sense to throw it away, but I noticed some
> weird glitches when the pvestatd update calls are near each other, since
> the timestamps are then e.g. ..01, ..02, ..03, ..11, ..12, etc.
> 
> this looks slightly weird on the extjs charts, since you have some
> clustered data points at some timestamps but not at others...
> 
> we could of course always only use the data from the current node,
> this way we would have a more consistent interval
> 
>  PVE/Ceph/Services.pm | 54 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 54 insertions(+)
> 
> diff --git a/PVE/Ceph/Services.pm b/PVE/Ceph/Services.pm
> index 45eb6c3f..70cb9b9a 100644
> --- a/PVE/Ceph/Services.pm
> +++ b/PVE/Ceph/Services.pm
> @@ -360,4 +360,58 @@ sub destroy_mgr {
>      return undef;
>  }
>  
> +my $perf_cache = [];

rather pass that to the method as a parameter, so that the caller has it in
scope, not the module..

> +
> +sub cache_perf_data {
> +    my ($keep_seconds) = @_;
> +
> +    $keep_seconds //= 5*60; # 5 minutes
> +
> +    my $rados = PVE::RADOS->new();

wouldn't it make sense to cache this? Else we fork on each call of this[0],
which doesn't seem too performant; fork isn't particularly cheap..

[0]: https://git.proxmox.com/?p=librados2-perl.git;a=blob;f=PVE/RADOS.pm;h=11af8a69ee1f3564538c856464908810cfa24eec;hb=HEAD#l124
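
just as a sketch (untested, no reconnect handling), the connection could be
kept around and created only once:

my $rados_conn;
sub get_rados_connection {
    # connect (and fork the librados worker) only once, reuse afterwards
    $rados_conn //= PVE::RADOS->new();
    return $rados_conn;
}

or, like with the cache above, let the caller hold the connection and pass
it in.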

> +    my $status = $rados->mon_command({ prefix => "status" });

quite a bit of information, but the mgr(s) caches it anyway, and there seems
to be no easy way to get just the PGDigest::print_summary info over
librados2 directly, at least I found none, albeit the API docs seem to be
pretty lacking.. So for now I'd say this is OK

> +
> +    my $pgmap = $status->{pgmap};
> +
> +    my $time = time();

it's a bit of a shame that the iostats in pgmap have no timestamp.. :/

> +    my $entry = {
> +	time => $time,
> +	ops_r => $pgmap->{read_op_per_sec},
> +	ops_w => $pgmap->{write_op_per_sec},
> +	bytes_r => $pgmap->{read_bytes_sec},
> +	bytes_w => $pgmap->{write_bytes_sec},

these can all be undefined if there is no activity on the ceph cluster
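
so defaulting them would probably make sense, e.g.:

    ops_r => $pgmap->{read_op_per_sec} // 0,
    ops_w => $pgmap->{write_op_per_sec} // 0,
    bytes_r => $pgmap->{read_bytes_sec} // 0,
    bytes_w => $pgmap->{write_bytes_sec} // 0,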

> +    };
> +
> +    push @$perf_cache, $entry;
> +
> +    # remove entries older than $keep_seconds
> +    $perf_cache = [grep { $time - $_->{time} < $keep_seconds } @$perf_cache ];

why not just keep N entries, and remove all those which fall outside of that
length restriction, something like (untested):

my $max_cache_depth = 512;

my $cache_depth = scalar @$cache;
if ($cache_depth > $max_cache_depth) {
    splice(@$cache, 0, $cache_depth - $max_cache_depth);
}

probably faster, and it limits this to what we actually want: a maximal
entry count. As each entry has the time saved anyway, a client can still
cut off after a certain time, if desired.

> +
> +    my $data = encode_json($perf_cache);
> +    PVE::Cluster::broadcast_node_kv("ceph-perf", $data);

not sure if we want this, I mean those are pve ceph-server stats and so
they're the same on the whole cluster.. So we're uselessly broadcasting
this (nodecount - 1) times..

A lockless design with only a single sending node would be to just let the
node with the highest node-id from a quorate partition broadcast this into a
cluster-wide status hashtable.
If we can broadcast, we're quorate, and thus we can trust the membership
info; even for the short race window where a node with a higher priority
becomes quorate, either we or it will effectively just overwrite the
other's, also valid, values in this hash.
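
rough sketch of the sender selection (untested; assumes the 'id' and
'online' fields as returned by PVE::Cluster::get_members()):

my $members = PVE::Cluster::get_members();
my $nodename = PVE::INotify::nodename();

# the online node with the highest node-id becomes the designated sender
my ($sender) = sort { $members->{$b}->{id} <=> $members->{$a}->{id} }
    grep { $members->{$_}->{online} } keys %$members;

if (defined($sender) && $sender eq $nodename) {
    # broadcast into the cluster-wide status hash here
}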


> +}
> +
> +sub get_cached_perf_data {

> +
> +    # only get new data if the already cached one is older than 10 seconds
> +    if (scalar(@$perf_cache) > 0 && (time() - $perf_cache->[-1]->{time}) < 10) {
> +	return $perf_cache;
> +    }
> +
> +    my $raw = PVE::Cluster::get_node_kv("ceph-perf");
> +
> +    my $res = [];
> +    my $times = {};
> +
> +    for my $host (keys %$raw) {

why multi host? Those stats are the same ceph-cluster-wide, AFAICT,
distributed through the MgrStatMonitor PAXOS child class. E.g., I never saw
different values when I executed the following command cluster-wide at the
same time:

perl -we 'use PVE::RADOS; use PVE::RPCEnvironment;
PVE::RPCEnvironment->setup_default_cli_env();
my $rados = PVE::RADOS->new();
my $stat = $rados->mon_command({ prefix => "status" })->{pgmap};
print "w/r: $stat->{write_bytes_sec}/$stat->{read_bytes_sec}\n";'

> +	my $tmp = eval { decode_json($raw->{$host}) };
> +	for my $entry (@$tmp) {
> +	    my $etime = $entry->{time};
> +	    push @$res, $entry if !$times->{$etime}++;
> +	}
> +    }
> +
> +    $perf_cache = [sort { $a->{time} <=> $b->{time} } @$res];
> +    return $perf_cache;
> +}
> +
>  1;
> 