Coda File System

Re: volume connection timeouts

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Tue, 3 May 2005 20:43:25 -0400
On Tue, May 03, 2005 at 05:05:22PM -0600, Patrick Walsh wrote:
> > > 	Unfortunately, the conflict resolution process was frustrating because
> > > it would show identical files in "local" and "global" down to the file
> > > size and time stamp.  The checklocal command inside the repair utility
> > 
> > cfs getfid would probably show a different version-vector/store-id for
> > the local and global files. Coda never uses things like size or
> > timestamp information to detect conflicts.
> 
> 	But what about checksum or md5 hash?  If the local file and the global
> file have different version-vectors/store-ids it seems reasonable for
> the autorepair facility to check the checksum/hash/whatever to see if
> the files are in fact the same.  If the files are identical in content
> and meta information (date, etc.) then the resolution should be
> automatic.  Or am I missing a point that makes this difficult to
> implement?

The Coda server is very strict when it comes to conflict resolution and
never looks at the contents of objects. The server only uses the version
vector and store-id information and decides based on that if a replica
only missed a COP2 message (identical store-id), or if a server was
unreachable during some operation (other server has a dominating version
vector). If none of the common scenarios apply, then it declares a
conflict for files. For directories there is an additional attempt to
merge the logs of recently committed operations (the resolution log).
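
The decision order described above can be sketched roughly as follows.
This is only an illustration, not the server code: the names
ReplicaState, dominates and classify are made up for this example, and
representing a version vector as a tuple of per-server counters and a
store-id as an opaque pair is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReplicaState:
    # Assumed layout: one update counter per server replica.
    version_vector: Tuple[int, ...]
    # Assumed layout: opaque identifier of the last update applied.
    store_id: Tuple[int, int]


def dominates(a: Tuple[int, ...], b: Tuple[int, ...]) -> bool:
    """True if a is at least b in every slot and differs from b."""
    return all(x >= y for x, y in zip(a, b)) and a != b


def classify(left: ReplicaState, right: ReplicaState) -> str:
    """Compare two replicas of one file using only metadata, never contents."""
    if left.version_vector == right.version_vector:
        return "in-sync"
    if left.store_id == right.store_id:
        # Same last update on both sides; one replica only missed the COP2.
        return "missed-cop2"
    if dominates(left.version_vector, right.version_vector):
        return "left-dominates"   # right was unreachable during an update
    if dominates(right.version_vector, left.version_vector):
        return "right-dominates"  # left was unreachable during an update
    return "conflict"             # none of the common scenarios apply
```

Note that file contents, size and timestamps never enter the decision,
so two byte-identical files with different store-ids and incomparable
version vectors still come out as a conflict.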

> > I strongly believe that connected mode (and cfs strong) mostly provide
> > the user with the perception that he won't get conflicts, and in 99% of
> > the cases this perception is probably true. But the remaining 1% will
> > still cause conflicts or disconnections and reintegrations. So I'd
> > rather work on making (write-)disconnected mode, reintegration and
> > repair reliable enough so that people don't really have a need for
> > connected mode anymore.
> 
> 	This is a point that should definitely be made in the manual or the cfs
> man page.  It never occurred to me that strongly connected mode might be
> *less* reliable than weakly connected or write-disconnected mode.

Connected operation is clearly more reliable in most cases; why else
would we try to avoid switching to the modes that would actually give us
far better performance? Connected mode gives immediate feedback if an
operation fails, whereas in all other modes such a failure becomes a
reintegration conflict. But since we do so many optimistic things in
the way we handle replication, callbacks and such, there is no 'flag'
we can set to guarantee that connected mode will never ever get a
conflict. So while in practice it is probably more reliable most of the
time, the few failure cases make it only as good as write-disconnected.

At the moment we might be a little worse off since sometimes a switch to
logging is handled as a new operation and we end up conflicting with
ourselves. The idea really is to improve the client so that we don't hit
such problems in the single-writer case even when we are disconnected.

> > But to have reliable conflict resolution with ASRs requires repair to
> > work in all possible situations and right now there are still far too
> > many cases of unnecessary or unrepairable conflicts.
> 
> 	By application-specific-resolution, do you mean filetype-specific
> resolution?  That is, would the server look at the type of file and use
> an appropriate resolution script or policy according to that?  It's an
> interesting concept...

Actually, these are run on the client. There are some hooks in various
places so that whenever a conflict is detected a message is sent to a
helper process. The helper then runs a script that is given exclusive
access to the volume based on the process group id. If it succeeds in
automatically repairing the problem, the operation that failed because
of the conflict is restarted in the client. So ideally the application
never actually knows there was a problem. The code had been neglected
for a while, got a small refresher about 2 years ago, but its success
hinges on whether we can reliably repair the conflict, which isn't
always the case.
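
As a rough sketch of that control flow (not the actual client code):
ConflictError, run_with_asr and the repair callable are hypothetical
names for this example, and the exclusive volume access granted to the
script by process group id is not modelled here.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class ConflictError(Exception):
    """Hypothetical signal that an operation hit a conflict."""


def run_with_asr(operation: Callable[[], T], repair: Callable[[], bool]) -> T:
    """Run operation; on a conflict, try the repair script and restart once."""
    try:
        return operation()
    except ConflictError:
        # The helper runs the resolver script. If it cannot repair the
        # object, the conflict is passed on to the caller unchanged.
        if not repair():
            raise
        # Repair succeeded: restart the operation that failed, so the
        # application ideally never notices there was a problem.
        return operation()
```

If the repair callable reports failure, the caller still sees the
conflict, which is exactly the case where the approach breaks down
today.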

> > The only other reason for connected mode that I see is because people
> > want their updates to be visible on other clients as soon as they 'save
> > a file', but that can be done with a synchronous mode where we force a
> > reintegration before returning back to the application.
> 
> 	Right, having updates appear immediately is usually quite desirable in
> our case -- at least for certain volumes.  Is there another way to do
> this besides cfs strong?  Is there a cfs fsync planned for the future?

cfs forcereintegrate already exists. In the future we might add a
per-user or per-volume flag that forces reintegration whenever a (store)
operation is added to the log.
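
To make the idea concrete, such a flag could behave roughly like the
sketch below. Nothing like this exists yet: ModifyLog, force_on_store
and the reintegrate callback are invented names for this example, and
the real client log looks nothing like a Python list.

```python
from typing import Callable, List, Tuple

LogEntry = Tuple[str, str]  # (operation kind, path); assumed shape


class ModifyLog:
    """Toy model of a client log with an optional 'synchronous' flag."""

    def __init__(self, reintegrate: Callable[[List[LogEntry]], None],
                 force_on_store: bool = False) -> None:
        self.reintegrate = reintegrate        # sends pending entries to servers
        self.force_on_store = force_on_store  # the hypothetical per-volume flag
        self.pending: List[LogEntry] = []

    def append(self, kind: str, path: str) -> None:
        self.pending.append((kind, path))
        # With the flag set, a store triggers reintegration before we
        # return to the application, so other clients see it right away.
        if self.force_on_store and kind == "store":
            self.flush()

    def flush(self) -> None:
        if self.pending:
            self.reintegrate(list(self.pending))
            self.pending.clear()
```

Without the flag the entries simply stay in the log until the next
regular or forced reintegration.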

Jan
Received on 2005-05-03 20:47:56