Coda File System

Re: volume connection timeouts

From: Patrick Walsh <pwalsh_at_esoft.com>
Date: Tue, 03 May 2005 08:38:36 -0600
On Mon, 2005-05-02 at 16:32 -0400, Jan Harkes wrote:

> could also be that you have a reintegration conflict, which makes the
> volume switch to disconnected state. This is because a reintegration
> (local-global) conflict actually moves everything in the volume to a
> temporary local repair volume so that we can show both the local and the
> global versions later on when the conflict is expanded. The problem is

	What would you suggest as the best way to detect when a conflict
occurs so that an administrator can be notified?  Is there a particular
log message we can monitor for?  Or perhaps a cron job with a find
command similar to this:

find . -type l -a -not \( -xtype f -o -xtype d \)

Or perhaps monitoring the /usr/coda/spool directory?  How is this
managed in other places?
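
	For a cron job, the same test as the find command above could be
written as a small script.  This is only a sketch: the function name
is made up for illustration, and it assumes that a symlink resolving
to neither a regular file nor a directory is worth reporting, which
is exactly what the find expression matches.

```python
import os

def find_unresolved_symlinks(root):
    """Yield symlinks under root whose target is neither a regular
    file nor a directory (hypothetical helper, mirrors the find
    expression: -type l -a -not ( -xtype f -o -xtype d ))."""
    for dirpath, dirnames, filenames in os.walk(root):
        # os.walk lists symlinks to directories in dirnames and all
        # other symlinks, including dangling ones, in filenames.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            # isfile() and isdir() follow the link, like -xtype.
            if not (os.path.isfile(path) or os.path.isdir(path)):
                yield path
```

A cron wrapper would only need to mail the administrator whenever the
generator yields anything.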

	Also, the repair utility doesn't seem to have a way to list what
objects are in conflict -- you have to already know the full path to
them.  Are there any undocumented commands or shortcuts for using this
utility?

> So the client ends up dirtying a lot of memory, both rvm and the data
> associated with the container file, but doesn't flush anything to disk.
> Then it sends an RPC to the server who fetches the data, which is most
> likely for the most part still in memory since the client didn't flush.
> But now the server is hit with a double whammy, not only is he writing
> his own dirtied data, but also the client's dirty stuff.
> 
> And because we're single threaded (userspace threading), the server
> process is blocked until all the data has hit the disk. In the mean time
> the client is eager for a response, earlier when it was fetching data
> the server was real quick to respond, the RTT estimate probably ended up
> being near zero. So it gets impatient and assumes the request got
> dropped and retransmits. 

	Would it help my situation if there were a minimum for the RTT
estimate, for the case where the estimate is near zero?  That way the
server could take a moment to flush a file without the client going
write-disconnected.
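
	A minimal sketch of such a floor follows.  It is not taken from
Coda's RPC code: the timeout formula is the usual TCP-style smoothed
RTT plus four times the variance, and the function name, parameter
names, and the 0.3 second default are all assumptions chosen for
illustration.

```python
def retransmit_timeout(srtt, rttvar, min_rto=0.3):
    """Return a retransmission timeout in seconds.

    srtt and rttvar are the smoothed round-trip time and its variance
    estimate.  min_rto is a hypothetical floor, so that a near-zero
    estimate cannot trigger an almost immediate retransmit.
    """
    rto = srtt + 4.0 * rttvar  # TCP-style estimate, assumed here
    return max(rto, min_rto)   # never drop below the floor
```

With a near-zero estimate the floor decides the result, while a slow
link still gets the larger computed value.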
	
> At the same time, the poor server is still
> stuck waiting for the disk, and can't even dash off a quick ack telling
> the client that it did get the request and is working on it.

	Are there any plans to make the server multi-threaded to avoid these
sorts of bottlenecks?

> but in this case the server probably ended up completing the operation,
> we just missed out on the final reply. And when we reintegrate we
> automatically hit a conflict because the locally cached version vector
> is different from the server's version _and_ there is a different store
> identifier associated with the operation. So we assume we got an
> update/update conflict, i.e. another client wrote to the file while we
> were disconnected.

	Is there any way to compare the files when this happens?  In our
case this is usually what is happening: the server did write the file,
and the local and global versions are identical in content, timestamp,
and size.  I thought Coda used some checksumming to detect this sort of
thing.  Is that something else?  Could it be applied here to reduce
false conflicts?
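
	The kind of comparison meant here could look like the sketch
below.  It says nothing about how Coda itself detects conflicts; it
assumes both versions are reachable as ordinary file paths, and the
function names and the choice of SHA-256 are illustrative only.

```python
import hashlib
import os

def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def same_content(local_path, global_path):
    """True if both files have the same size and the same digest."""
    # Cheap size check first; hash only when the sizes match.
    if os.path.getsize(local_path) != os.path.getsize(global_path):
        return False
    return file_digest(local_path) == file_digest(global_path)
```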

-- 
Patrick Walsh
eSoft Incorporated
303.444.1600 x3350
http://www.esoft.com/

Received on 2005-05-03 10:40:08