Coda File System

Re: coda client hangs

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Tue, 31 May 2005 14:15:07 -0400
On Tue, May 31, 2005 at 09:09:35AM -0600, Patrick Walsh wrote:
> 	OK, over the weekend coda hung again.  On two machines with two
> different problems (a signal 11 and an assert failure in realms.cc), but

I've been working on the iterators over the weekend. The 'safe deletion'
parts have been removed from the core iterator functions, and I believe I
caught all the places where we were doing unsafe deletion.

I also improved the conn_iterator and volent_iterator objects to keep
references to the objects while we walk the list; those two looked like
the most intensive users of the problematic sequence:

    for entry in list:
	call yielding function
	delete entry

One thing I couldn't figure out until just now is why you actually have
conn entries, which are used for non-replicated or backup volumes and
during weak reintegration, instead of mgrp entries, which are used for
replicated volumes.

That turned out to be quite simple: there is a special 'V_UID' that is
used for some operations when we don't know which user actually triggered
the operation, and for certain background operations. Some of the obvious
ones are fetching volume information and the periodic server
probes/backprobes. V_UID is set to '0', which means that the root user
ends up with a bunch of connections associated with it. Most people
probably aren't giving Coda tokens to root, so this explains why you
seem to be seeing these problems a lot more than most people.

I changed those places to use ANYUSER_UID, which means that it will try
to use the first available (authenticated?) connection or else allocate
one that should belong to nobody. I haven't been able to get my client
to segfault yet, even when hitting it concurrently with disconnections,
reauthentications and read and write operations. But then again, your
'test-environment' might be hitting it harder than I ever can.

> one of the machines we'll ignore for now since I haven't updated it with
> all the newest RPMs.  Unfortunately, it seems that the debugging symbols
> are still not in the venus binary, though I don't have the slightest
> idea why not.  I'll look into this yet again.

I think rpmbuild actually strips the binaries before it packs them into
the RPM; the unstripped versions seem to be placed in an associated
'debuginfo' RPM, although that might only be the case for libraries.

> 	In the meantime, I've noticed a trend.  These signal 11's seem to
> happen almost always at 00:30:01.  I've tried running all our cronjobs
...
> somewhere outside of cron.  There is nothing in the syslog or cronlog to
> indicate anything happening at that time.  Is that a special time in
> Coda?  Here's the latest batch of logs and backtraces:

Not really. There are periodic hoard walks and server probes, but those
are scheduled relative to when venus starts, so unless you always start
your client at exactly the same time I wouldn't expect them to coincide
so nicely on real-time 00:30 boundaries.

> 18:30:00 Coda token for user 0 has been discarded
> 19:00:00 Coda token for user 0 has been discarded
> 20:00:01 Coda token for user 0 has been discarded

If the venus.log contains

    userent::Connect: Authenticated bind failure, uid = 0

then this is the result of a server rejecting the token; otherwise it is
caused by an explicit call to cunlog.

> Date: Sat 05/28/2005
> 
> 00:30:01 Fatal Signal (11); pid 8041 becoming a zombie...
> 00:30:01 You may use gdb to attach to 8041
> 
> [ D(829) : 0000 : 00:30:01 ] userent::Connect: ViceGetAttrPlusSHA (dir130)
> [ D(829) : 0000 : 00:30:01 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
> [ D(829) : 0000 : 00:30:01 ] userent::Connect: VGAPlusSHA_Supported -> 1
> [ D(829) : 0000 : 00:30:01 ] userent::Connect: ViceGetAttrPlusSHA (dir129)
> [ D(829) : 0000 : 00:30:01 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
> [ D(829) : 0000 : 00:30:01 ] userent::Connect: VGAPlusSHA_Supported -> 1
> [ W(190) : 0000 : 00:30:01 ] *****  FATAL SIGNAL (11) *****

'D' is the probe-daemon thread, and 'W' is a worker thread.

My guess is that 'D' is sending out the periodic server probes (using
the V_UID = 0 connections) while, at the same time, a 'cunlog' for the
root user triggers the worker thread to destroy any connections owned
by root. To me that indicates I'm fixing the correct bugs in the
iterators.

Jan
Received on 2005-05-31 14:16:10