Coda File System

Re: coda client hangs

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Fri, 27 May 2005 13:44:49 -0400
On Thu, May 26, 2005 at 04:32:13PM -0600, Patrick Walsh wrote:
> 14:50:01 Coda token for user 0 has been discarded
> 15:00:00 Coda token for user 0 has been discarded
> 15:00:00 Coda token for user 0 has been discarded
> 15:00:00 Coda token for user 0 has been discarded
> 15:00:01 Coda token for user 0 has been discarded
> 15:10:00 Coda token for user 0 has been discarded
> 15:15:00 Coda token for user 0 has been discarded
> 15:20:00 Coda token for user 0 has been discarded

I wonder why these tokens are being discarded; this message is only
shown in two cases:

- A server believes the token is invalid (expired or unable to decrypt)
- A user has explicitly called cunlog

Since it seems to happen so regularly (every 5 minutes), my guess is
that this is the result of a cron job that uses cunlog. But then why
don't we see it at 14:55 or 15:05, while we see four calls at 15:00?

Multiple cron jobs that each use 'clog root ; do_work ; cunlog
root'? And four of them happen to run simultaneously at 15:00?

> Assertion failed: nlink != (olink *)-1, file
> "/home/pwalsh/working/coda/BUILD/coda-6.0.10/coda-src/util/olist.cc",
> line 257
> Sleeping forever.  You may use gdb to attach to process 8378.

Crash in a list iterator. We use olists in many places; I think it is
a singly linked list.

> 	And gdb gives this (I'm sure it compiled with -g so I don't know why
> we're not getting symbols):

Are you still creating an RPM package? It could be that rpmbuild
implicitly strips everything before it packages the binaries. If you
still have the build tree around somewhere, you should be able to use
the venus binary from the build tree even when the running venus is
stripped.
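
For example (the path to the build tree here is just illustrative,
guessed from the paths in your assertion message), something like

    gdb /home/pwalsh/working/coda/BUILD/coda-6.0.10/coda-src/venus/venus 8378

should make gdb read symbols from the unstripped binary while it
attaches to the running process.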

> (gdb) bt
> #0  0xb747d761 in __libc_nanosleep () from /lib/libc.so.6
> #1  0xb747d6ae in __sleep (seconds=1)
> at ../sysdeps/unix/sysv/linux/sleep.c:70
> #2  0x080cce1c in strcpy () at ../sysdeps/generic/strcpy.c:31
> #3  0x080c8ca5 in strcpy () at ../sysdeps/generic/strcpy.c:31
> #4  0x0804e55d in strcpy () at ../sysdeps/generic/strcpy.c:31
> #5  0x080ac414 in strcpy () at ../sysdeps/generic/strcpy.c:31

This is definitely a different trace from the other one, since we
don't have that 'callback function' style jump through a high address.
It is also not a segfault, but an ordinary assertion failure.
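
That also explains the nanosleep/sleep frames at the top of the
trace: the assert handler doesn't abort, it prints the message and
then sleeps forever so a developer can attach gdb to the live
process. A minimal sketch of that style of macro (not the exact
definition Coda uses):

    #include <cstdio>
    #include <unistd.h>

    #define SLEEPING_ASSERT(pred)                                        \
        do {                                                             \
            if (!(pred)) {                                               \
                fprintf(stderr, "Assertion failed: %s, file \"%s\", "    \
                        "line %d\n", #pred, __FILE__, __LINE__);         \
                fprintf(stderr, "Sleeping forever.  You may use gdb to " \
                        "attach to process %d.\n", (int)getpid());       \
                for (;;) sleep(1);                                       \
            }                                                            \
        } while (0)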

Argh, I think I know what is going on... The conn_iterator (which
iterates over the list of connections) is derived from olist_iterator,
and internally it does the same 'trick' of saving the 'next' pointer
for the next iteration. But it doesn't know anything about locking
down objects.

So the pinning down that we do while walking the list is completely
useless, since the iterator itself doesn't really need the current
object (except that it still tests 'current == last()'). For the
iteration it uses the saved next pointer, which might get unlinked or
destroyed because it was never pinned down with a refcount.
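
A minimal sketch (simplified, not the actual olist code) of the
pattern and why it breaks:

    #include <cstdio>

    struct olink { olink *next; };

    /* In the style of olist_iterator: read ahead so the *current*
     * object can be deleted safely; nothing protects the cached
     * 'next' node itself. */
    struct list_iter {
        olink *cached;
        explicit list_iter(olink *head) : cached(head) {}
        olink *operator()() {
            olink *cur = cached;
            if (cur) cached = cur->next; /* read ahead */
            return cur;
        }
    };

    int main() {
        olink c = { 0 };
        olink b = { &c };
        olink a = { &b };

        list_iter next(&a);
        olink *cur = next(); /* returns &a; iterator caches &b */

        /* If the thread yields here (say, for an RPC2 call) and
         * another thread unlinks and destroys b, the iterator
         * never notices: */
        a.next = &c; /* b unlinked (and freed, in the real code) */

        cur = next(); /* still returns the stale pointer to b */
        printf("iterator returned unlinked node %p\n", (void *)cur);
        return 0;
    }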

Soooo, now I have to go through the complete code, identify all
places where we use these olist_iterators either directly or
indirectly, and check whether they already track the next pointer
themselves (like we do when destroying connections) and whether the
objects are locked when we yield. Then I can remove the useless
next-ptr bit.

All these iterators that support 'safe deletion of the current
object' never work right and cause some very nasty race conditions
when multiple threads are involved. Luckily Coda has cooperative
threading, so we typically don't yield, except in a few cases, one of
which is of course when we destroy rpc2 connections.

Quick fix for you, in coda-src/venus/user.cc around line 372.

        tc = next(); /* read ahead */
        if (tc) tc->GetRef(); /* make sure we don't lose the next connection */
        (void)c->Suicide(1);
        if (tc) tc->PutRef();

Change that Suicide(1) to Suicide(0). This way the client won't tell
the server it is disconnecting, so we won't make an RPC2 call and as a
result will not yield.
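
With the change applied, the snippet would read:

        tc = next(); /* read ahead */
        if (tc) tc->GetRef(); /* make sure we don't lose the next connection */
        (void)c->Suicide(0); /* don't tell the server: no RPC2 call, no yield */
        if (tc) tc->PutRef();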

The problem with this fix is that the server will slowly build up a
lot of old connections, which stick around until either the server is
rebooted or the client is disconnected for a couple of minutes.

Jan
Received on 2005-05-27 13:46:13