[pve-devel] [Patch V2 guest-common] fix #1694: Replication risks permanently losing sync in high loads due to timeout bug

Wolfgang Link w.link at proxmox.com
Thu Apr 12 11:33:52 CEST 2018


> Dietmar Maurer <dietmar at proxmox.com> wrote on 12 April 2018 at 11:06:
> 
> 
> > diff --git a/PVE/Replication.pm b/PVE/Replication.pm
> > index 9bc4e61..d8ccfaf 100644
> > --- a/PVE/Replication.pm
> > +++ b/PVE/Replication.pm
> > @@ -136,8 +136,18 @@ sub prepare {
> >  		$last_snapshots->{$volid}->{$snap} = 1;
> >  	    } elsif ($snap =~ m/^\Q$prefix\E/) {
> >  		$logfunc->("delete stale replication snapshot '$snap' on $volid");
> > -		PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snap);
> > -		$cleaned_replicated_volumes->{$volid} = 1;
> > +
> > +		eval {
> > +		    PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snap);
> > +		    $cleaned_replicated_volumes->{$volid} = 1;
> > +		};
> > +
> > +		# If deleting the snapshot fails, we can not be sure if it was due to an
> > +		# error or a timeout.
> > +		# In case of a timeout, the delete most likely succeeded anyway.
> > +		# If it really failed, the snapshot will be removed on the next run.
> > +		warn $@ if $@;
> > +
> > +		$logfunc->("delete stale replication snapshot error: $@") if $@;
> 
> why do we need this in prepare?
Because we have the same problem here.
If the ZFS pool is under heavy load, the snapshot delete can run into a timeout.
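
For illustration, here is a minimal standalone sketch of the eval guard the patch adds around the
snapshot cleanup. delete_snapshot_or_timeout() and the example volume/snapshot names are
hypothetical stand-ins for PVE::Storage::volume_snapshot_delete() and a real replication snapshot;
only the error-handling pattern follows the patch:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical stand-in for PVE::Storage::volume_snapshot_delete(): dies
    # roughly half the time to simulate 'zfs destroy' running into a timeout
    # on a loaded pool.
    sub delete_snapshot_or_timeout {
        my ($volid, $snap) = @_;
        die "zfs error: got timeout\n" if rand() < 0.5;
        return 1;
    }

    my $logfunc = sub { print "$_[0]\n" };
    my $cleaned_replicated_volumes = {};
    my ($volid, $snap) = ('local-zfs:vm-100-disk-0', '__replicate_100-0_1523527200__');

    # Same guard as in the patch: a failed (or timed-out) delete must not abort
    # prepare(); on a timeout the snapshot is most likely gone anyway, and a
    # real failure is simply retried on the next replication run.
    eval {
        delete_snapshot_or_timeout($volid, $snap);
        $cleaned_replicated_volumes->{$volid} = 1;
    };
    $logfunc->("delete stale replication snapshot error: $@") if $@;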



