Coda File System

Re: Unable to do beginrepair...

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Thu, 12 Dec 2002 10:52:53 -0500
On Thu, Dec 12, 2002 at 09:32:09AM -0500, Greg Troxel wrote:
>   If global is a symlink to the volumeid, the client hasn't been able to
>   mount the volume. Perhaps it timed out during the volume lookup or
>   something. Another reason would be a server-server conflict, but as you
>   only have a single server that probably isn't the case.
> 
> This seemed to persist, even though I could do 'cfs lv'.  My real
> problem isn't that something bad happened, it's that I could not
> recover.

Actually, both are a problem. You have a single-client, single-server
setup and should not get any conflicts, period. And then the recovery
from these unwanted conflicts is unreliable.

>   Yup, norton is only a server thing. 'cfs fl' is in many cases an evil
>   operation, it flushes the data of cached objects in the specified
>   subtree. It shouldn't touch 'dirty' objects (i.e. the ones with
>   associated CML entries), but you never know.
> 
> So really 'cfs fl' should only discard cached data that is still in
> the 'read-only' state, and thus should be safe at any time.  If not,
> it probably should be fixed.

It does. At least, it never flushes any object that has the 'dirty flag'
set or has an active refcount. And any object with a pending CML
operation should have this dirty flag set.
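
A minimal sketch of the flush rule described above. This is not the actual
venus code; the class, field, and function names are hypothetical and only
illustrate the "skip dirty or referenced objects" check.

```python
from dataclasses import dataclass


@dataclass
class CachedObject:
    name: str
    dirty: bool = False    # hypothetical: set while a CML entry is pending
    refcount: int = 0      # hypothetical: number of active references
    has_data: bool = True  # whether cached data is currently held


def can_flush(obj: CachedObject) -> bool:
    """An object is flushable only if it is clean and unreferenced."""
    return not obj.dirty and obj.refcount == 0


def flush_subtree(objects: list[CachedObject]) -> list[str]:
    """Drop cached data for every flushable object; return their names."""
    flushed = []
    for obj in objects:
        if obj.has_data and can_flush(obj):
            obj.has_data = False
            flushed.append(obj.name)
    return flushed
```

Under these assumptions, an object with a pending CML entry or an open
reference keeps its data no matter how often the subtree is flushed.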

There could still be something wrong (just theorizing here) when an
object has several pending operations and only a few are reintegrated.
I'm not sure _where_ we actually reset this dirty flag.
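
To make the concern concrete, here is a sketch (not the actual venus code)
of one way to keep an object dirty until all of its operations have been
reintegrated: derive the flag from a count of pending operations instead of
resetting it by hand. The class and method names are hypothetical.

```python
class TrackedObject:
    """Hypothetical cache object whose dirty state follows its pending ops."""

    def __init__(self) -> None:
        self.pending_ops = 0  # assumed: one per CML entry touching this object

    @property
    def dirty(self) -> bool:
        # Dirty as long as at least one operation is still waiting.
        return self.pending_ops > 0

    def log_operation(self) -> None:
        self.pending_ops += 1

    def operation_reintegrated(self) -> None:
        if self.pending_ops == 0:
            raise ValueError("no pending operation to reintegrate")
        self.pending_ops -= 1
```

With this shape, reintegrating only a few of several pending operations
cannot leave the object looking clean by accident.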

> Do you mean
> 
>   modification on client while WD
>   try to reintegrate
>   Backfetch times out, causing Store to fail
>   pause
>   try to reintegrate the same Store
>   Backfetch works this time
>   [successful store with no conflict]
> 
> is what you think happens with the current code?

Not think, know! I've tested that scenario hundreds of times; it was
even part of the demo I gave to highlight some of Coda's features. It
is really nice to have a laptop chugging along, physically pull the
network cable out of the wall, and have the client simply switch to
disconnected mode. Put the cable back, run 'cfs cs', and it picks up
where it left off.
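
The retry behaviour in the quoted scenario can be sketched as follows. This
is an illustration under assumed names, not the real reintegration code:
`BackfetchTimeout`, `reintegrate`, and the `send_store` callback are all
hypothetical. The point is only that a Store which fails on a timeout stays
in the log and is replayed on the next attempt.

```python
class BackfetchTimeout(Exception):
    """Hypothetical error: the server's backfetch did not finish in time."""


def reintegrate(cml: list, send_store) -> list:
    """Replay logged records in order; return the ones still pending.

    Replay stops at the first timeout so that the failed record and
    everything logged after it are kept, in order, for the next attempt.
    """
    for index, record in enumerate(cml):
        try:
            send_store(record)
        except BackfetchTimeout:
            return cml[index:]
    return []
```

A caller would keep the returned list as the new log and call `reintegrate`
again once the server is reachable.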

> What's the state of the realms branch and the future repair changes?
> It seems like repair (venus's representation of stuff) is bletcherous
> now.  But, it may be that the problem is in the NetBSD kernel code.

Right now I'm trying to freeze the tree for a release. Some of the new
repair code is already in, but it isn't active yet. I will need
some time to get the initial mount right while avoiding the same mess
that local-global repair managed to get into.

> I wonder if putting some more aggressive cache flushing into
> venus/netbsd would help.  I'd take not losing over performance
> happily, and then we'd know where to fix.  I admit I have assumed that
> the problem is in venus, and that isn't necessarily clear.

In a way that is one of the fun (and admittedly difficult) things about
working with distributed systems. There are so many possible problems
and strange interactions between different components
(kernel<>venus<>servers).

Jan
Received on 2002-12-12 10:55:12