Coda File System

Re: coda client hangs

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Thu, 26 May 2005 12:08:31 -0400
On Wed, May 25, 2005 at 02:51:04PM -0600, Patrick Walsh wrote:
> > This could mean that your client cache is too small, or that the file
> > was deleted on the server between the status and data validation phases.
> > As there were 131 interrupts, this would indicate that this is happening
> > quite often.
> 
> 	These log entries were referring to a log file, so I'm sure it was
> being opened and closed pretty frequently.  As a workaround we've moved
> the log files to /tmp and just use cron to copy them to coda.

That's where write-disconnected operation would be really nice, because the
writes would happen in the background. Of course it also increases the
chances of getting a conflict.

> > So we fetch the new version (which is 653 bytes larger), and again it
> > already changed on the server before we were done. I guess some process
> > is opening/writing/closing the file a lot, and every close results in
> > a store.
> 
> 	True to the second part, but there is only one client writing to the
> file and no conflicts ever occurred, so I'm not sure the version vectors
> differed.

Interesting, I wonder why the client would bother checking the version
vectors if it is the one that performed the store in the first place.

I found why we see the VV difference errors. UpdateStatus used to be
split into a mutating and a non-mutating version depending on whether
we had results from a mutating operation. Both versions were merged, but
the comparison function was called as if we were always performing an
attribute revalidation. As a result you would _want_ to see the VV
difference error, otherwise the local object wouldn't get refreshed.

> > Ah, when tokens are refreshed all old connections are first destroyed.
> 
> 	Why is this?  We have cron jobs that refresh tokens three times a day,
> plus for various operations, like our cron jobs that download updates,
> the tokens are refreshed at the beginning of the operation to ensure
> successful writes.  So are you saying that if something is trying to
> write a file and something else refreshes the tokens for that user (even
> though the coda user <-> local user mapping is the same) that the write
> operation will be broken off?  (And potentially cause a crash?)  If so,
> can this behavior be changed?

It shouldn't affect pending operations. All connections have an 'inuse'
flag (refcount). If a connection is still active we simply flag it as
dying and it will get reclaimed when the operation completes. All other
connections send out a Disconnect message to the server first and then
get reclaimed.

I have read the code up and down and only found one place where we
possibly can go wrong if the refcount is > 1, but that should never
happen since we shouldn't grab a reference to any connection object that
is already in use...

Got the culprit in,

userent::Reset()

    /* Delete the user's connections. */
    {
    struct ConnKey Key; Key.host.s_addr = INADDR_ANY; Key.uid = uid;
    conn_iterator next(&Key);
    connent *c = 0;
    connent *tc = 0;
    for (c = next(); c != 0; c = tc) {
        tc = next(); /* read ahead */
        if (tc) tc->GetRef(); /* make sure we don't lose the next connent */
        (void)c->Suicide(1);
        if (tc) tc->PutRef();
    }
    }

Combined with,

PutConn(connent **cpp)

    if (!c->HasRef())
        { c->print(logFile); CHOKE("PutConn: conn not in use"); }

    if (c->dying) {
        connent::conntab->remove(&c->tblhandle);
        delete c;
    }
    else
        c->PutRef();

If a connection is already in use, the userent::Reset iterator bumps the
refcount on the 'next connent' to 2. Suicide makes an RPC2 call to the
server and yields waiting for a reply. Then if the in-use connection
returns first, and it was already marked as 'dying', the call to PutConn
takes the entry off the list and destroys it. Then when Suicide returns
and tries to iterate to the next entry it will segfault.

This also explains why we don't get to see this all that often; it
depends on having long-running RPC2 operations while connections are
being destroyed. The only puzzle left is why the next connection is
already marked as dying: that would indicate we had already gone through
the loop once before, and I haven't figured out how that might have
happened.

The attached patch should fix the problem though.

Jan

Index: comm.cc
===================================================================
RCS file: /afs/cs/project/coda-src/cvs/coda/coda-src/venus/comm.cc,v
retrieving revision 4.90
diff -u -u -r4.90 comm.cc
--- comm.cc	7 Mar 2005 21:53:43 -0000	4.90
+++ comm.cc	26 May 2005 15:54:48 -0000
@@ -256,12 +256,12 @@
     if (!c->HasRef())
 	{ c->print(logFile); CHOKE("PutConn: conn not in use"); }
 
-    if (c->dying) {
+    c->PutRef();
+
+    if (!c->HasRef() && c->dying) {
 	connent::conntab->remove(&c->tblhandle);
 	delete c;
     }
-    else
-	c->PutRef();
 }
 
 
Received on 2005-05-26 12:11:40