Coda File System

Re: crash in rvmlib_free during repair

From: Piotr Isajew <pki_at_ex.com.pl>
Date: Fri, 5 Jul 2013 06:50:55 +0200
On Thu, Jul 04, 2013 at 02:35:41AM -0400, Jan Harkes wrote:

[...]
> 
> Odd, I don't think I have seen such a crash before, the usual cases that
> I see involve the server crashing because it ran out of available
> resolution log entries and the next mutating operation sent to the
> server triggers an assertion.

I have a great talent for bringing down such things :(. I promise
to pay more attention next time, so maybe we will be able to
reproduce this.

> > After restarting everything I still have the conflict in the same
> > node or its parent node depending on the situation.
> 
> Was this directory by any chance moved from one directory to another?

No, it was just a copy -> crash -> repair sequence. The conflict
started deep down the directory tree. I tried to repair it with
removedir, but that crashed the non-SCM server, got repaired on the
SCM, and the conflict just propagated up the path.

I copied with cp -a, so maybe there was an attribute inconsistency
between the two replicas, but there were definitely no renames or moves.
> 
> With files I've seen rename related conflicts where the default repair
> suggestion when the source directory is resolved is to recreate the
> renamed object but repair then fails because the server already has that
> same object in the not-yet resolved destination directory. But this is
> different since it is a directory, and it is a remove.

I tried to remove the conflicting directory on both replicas, but
that also crashed the non-SCM server.
> 
> [...]
> Instead of removing the directory, recreate it on the other replica.

That worked up to a point. In the end I had a situation where the
two replicas contained the same files and comparedirs generated an
empty fix, but it complained about the vectors being different.
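
To illustrate why identical directory contents can still count as a
conflict, here is a minimal sketch of a generic version-vector
comparison. It assumes the "vectors" above are per-replica version
vectors holding one update counter per server. The names Order,
compare_vectors and replicas_agree are hypothetical and are not taken
from the Coda sources or from the comparedirs implementation.

```python
from enum import Enum


class Order(Enum):
    EQUAL = "equal"
    DOMINATES = "dominates"    # first vector has seen every update of the second, and more
    SUBMISSIVE = "submissive"  # second vector has seen every update of the first, and more
    CONFLICT = "conflict"      # each side has updates the other has not seen


def compare_vectors(a, b):
    """Compare two version vectors given as equal-length sequences of counters."""
    if len(a) != len(b):
        raise ValueError("version vectors must have the same length")
    a_ahead = any(x > y for x, y in zip(a, b))
    b_ahead = any(y > x for x, y in zip(a, b))
    if a_ahead and b_ahead:
        return Order.CONFLICT
    if a_ahead:
        return Order.DOMINATES
    if b_ahead:
        return Order.SUBMISSIVE
    return Order.EQUAL


def replicas_agree(files_a, vv_a, files_b, vv_b):
    """Hypothetical check: same entries AND equal version vectors."""
    same_entries = set(files_a) == set(files_b)
    return same_entries and compare_vectors(vv_a, vv_b) is Order.EQUAL
```

In this model two replicas holding the same files, one with vector
[2, 1] and the other with [1, 2], still do not agree, which matches the
situation described above: nothing left to fix in the contents, yet
the vectors differ.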

> If that doesn't work, and it is reliably only one server that crashes,
> you can try to repair the conflict with only the other server running.
> If that works you can bring the crashed server back up, extract all the
> volume replica information with volutil info volumename.0 or .1 and then
> remove and recreate the corrupted replica and then repopulate the volume
> through runt resolution by doing a 'find /coda/path/to/volume -noleaf'.

That was what solved the problem.
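
The last step Jan describes, walking the whole volume so that every
object gets looked at, can be sketched as a small script. This is only
a rough equivalent of the find command quoted above. It assumes that
an lstat on each entry through the client is enough to make the
servers resolve it, which is not confirmed here. The function name
touch_tree is hypothetical, and /coda/path/to/volume is the same
placeholder path as in the quoted command.

```python
import os


def touch_tree(root):
    """lstat every entry below root and return (visited, errors).

    visited counts the entries that could be stat-ed; errors lists the
    paths that could not be read, for example objects still in conflict.
    """
    visited = 0
    errors = []

    def on_error(exc):
        # os.walk passes the OSError raised while listing a directory
        errors.append(exc.filename)

    for dirpath, dirnames, filenames in os.walk(root, onerror=on_error):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                os.lstat(path)  # does not follow symlinks
                visited += 1
            except OSError:
                errors.append(path)
    return visited, errors


# Example (placeholder path, as in the quoted find command):
# visited, errors = touch_tree("/coda/path/to/volume")
```

Collecting the failing paths instead of stopping at the first error
leaves a list of objects to check by hand after the walk.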

> Good luck,
> 
> Jan

Thank you, Jan. I very much appreciate your help.
Received on 2013-07-05 00:51:09