Coda File System

Re: coda client hangs

From: Patrick Walsh <pwalsh_at_esoft.com>
Date: Tue, 31 May 2005 09:09:35 -0600
	OK, over the weekend coda hung again, on two machines with two
different problems (a signal 11 and an assert failure in realms.cc).
We'll ignore one of the machines for now, since I haven't updated it
with all the newest RPMs.  Unfortunately, it seems that the debugging
symbols are still not in the venus binary, though I don't have the
slightest idea why not.  I'll look into this yet again.

	In the meantime, I've noticed a trend.  These signal 11s seem to
happen almost always at 00:30:01.  I've tried running all our cron jobs
at once to force the crash, and I've analyzed which cron jobs would be
running around that time, including the cron.daily type stuff.  My
conclusion is that the problem is probably coming from somewhere
outside of cron: there is nothing in the syslog or cronlog to indicate
anything happening at that time.  Is that a special time in Coda?
Here's the latest batch of logs and backtraces:

18:30:00 Coda token for user 0 has been discarded
19:00:00 Coda token for user 0 has been discarded
20:00:01 Coda token for user 0 has been discarded

Date: Sat 05/28/2005

00:30:01 Fatal Signal (11); pid 8041 becoming a zombie...
00:30:01 You may use gdb to attach to 8041


[ D(829) : 0000 : 00:30:01 ] userent::Connect: ViceGetAttrPlusSHA (dir130)
[ D(829) : 0000 : 00:30:01 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ D(829) : 0000 : 00:30:01 ] userent::Connect: VGAPlusSHA_Supported -> 1
[ D(829) : 0000 : 00:30:01 ] userent::Connect: ViceGetAttrPlusSHA (dir129)
[ D(829) : 0000 : 00:30:01 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ D(829) : 0000 : 00:30:01 ] userent::Connect: VGAPlusSHA_Supported -> 1

[ W(190) : 0000 : 00:30:01 ] *****  FATAL SIGNAL (11) *****
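
To check the 00:30:01 pattern against older logs rather than by eye, a
small script along these lines could tally the fatal-signal timestamps.
This is only a sketch: the line format is assumed from the excerpts in
this thread, and the function name fatal_signal_times is made up for
illustration.

```python
import re
from collections import Counter

# Assumed line format, taken from the excerpts in this thread:
#   "00:30:01 Fatal Signal (11); pid 8041 becoming a zombie..."
FATAL_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}) Fatal Signal \((\d+)\); pid (\d+)")


def fatal_signal_times(lines):
    """Count fatal signals per (time-of-day, signal number) pair."""
    counts = Counter()
    for line in lines:
        # Drop mail quoting ("> > ") and indentation before matching.
        match = FATAL_RE.match(line.lstrip("> \t"))
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Sample input copied from the two crashes reported in this thread.
SAMPLE = [
    "20:00:01 Coda token for user 0 has been discarded",
    "00:30:01 Fatal Signal (11); pid 8041 becoming a zombie...",
    "00:30:01 You may use gdb to attach to 8041",
    "> > 00:30:01 Fatal Signal (11); pid 25080 becoming a zombie...",
]

if __name__ == "__main__":
    for (when, signo), n in sorted(fatal_signal_times(SAMPLE).items()):
        print(f"{when} signal {signo}: {n} crash(es)")
```

Feeding it the full log file instead of SAMPLE would show whether the
crashes really cluster at one time of day or only appear to.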


(gdb) bt
#0  0xb73f79d6 in __sigsuspend (set=0x1560b0dc)
    at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
#1  0x080ab4b1 in strcpy () at ../sysdeps/generic/strcpy.c:31
#2  <signal handler called>
#3  0x080c8cae in strcpy () at ../sysdeps/generic/strcpy.c:31
#4  0x0817e3c0 in ?? ()
#5  0x0804e55d in strcpy () at ../sysdeps/generic/strcpy.c:31
#6  0x080ac414 in strcpy () at ../sysdeps/generic/strcpy.c:31
#7  0x080ac0e8 in strcpy () at ../sysdeps/generic/strcpy.c:31
#8  0x080abd66 in strcpy () at ../sysdeps/generic/strcpy.c:31
#9  0x080a4b57 in strcpy () at ../sysdeps/generic/strcpy.c:31
#10 0x080a5d39 in strcpy () at ../sysdeps/generic/strcpy.c:31
#11 0x080aa35e in strcpy () at ../sysdeps/generic/strcpy.c:31
#12 0x080a13b6 in strcpy () at ../sysdeps/generic/strcpy.c:31
#13 0xb741a8c4 in __makecontext () from /lib/libc.so.6
#14 0x08128bb8 in ?? ()
Cannot access memory at address 0x30303a30
(gdb)
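
One detail in that last line may be worth a look: every byte of the
address gdb could not read, 0x30303a30, is printable ASCII.  The sketch
below decodes it, assuming a little-endian 32-bit x86 machine (which the
addresses in the trace suggest, but that is an assumption).  The helper
name address_as_ascii is made up for illustration.

```python
import struct


def address_as_ascii(addr):
    """Return the four bytes of a 32-bit address, in little-endian
    memory order, decoded as ASCII text."""
    raw = struct.pack("<I", addr)  # "<I" = little-endian unsigned 32-bit
    return raw.decode("ascii", errors="replace")


if __name__ == "__main__":
    print(repr(address_as_ascii(0x30303A30)))
```

Under that assumption the bytes in memory spell "0:00", which looks like
a fragment of a timestamp such as 20:00:01.  That could mean a frame
pointer or return address on the stack was overwritten by a time string,
though it may also be a coincidence.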


..Patrick





On Fri, 2005-05-27 at 13:00 -0400, Jan Harkes wrote:
> On Fri, May 27, 2005 at 08:29:03AM -0600, Patrick Walsh wrote:
> > 	Another day, another Signal 11.  On the machine running the latest
> > patches and versions, we have these logs:
> 
> > [ W(327) : 0000 : 00:30:01 ] *****  FATAL SIGNAL (11) *****
> > 00:30:01 Fatal Signal (11); pid 25080 becoming a zombie...
> > 00:30:01 You may use gdb to attach to 25080
> > 
> > 	And this gdb trace:
> > 
> > 0xb73f79d6 in __sigsuspend (set=0x159250fc)
> >     at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
> > 45      ../sysdeps/unix/sysv/linux/sigsuspend.c: No such file or
> > directory.
> > ---Type <return> to continue, or q <return> to quit---
> >         in ../sysdeps/unix/sysv/linux/sigsuspend.c
> > (gdb) bt
> > #0  0xb73f79d6 in __sigsuspend (set=0x159250fc)
> >     at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
> > #1  0x080ab4b1 in strcpy () at ../sysdeps/generic/strcpy.c:31
> > #2  <signal handler called>
> > #3  0x0804e53c in strcpy () at ../sysdeps/generic/strcpy.c:31
> > #4  0x0c319ef8 in ?? ()
> > #5  0x080ac414 in strcpy () at ../sysdeps/generic/strcpy.c:31
> > #6  0x080ac0e8 in strcpy () at ../sysdeps/generic/strcpy.c:31
> 
> Still no debug symbols in the binary, that's kind of annoying. Also,
> this trace looks suspiciously similar to the previous ones so we're
> probably looking at the same bug. Which means that the one I found
> wasn't actually triggered in your case.
> 
> That jump from 0x08 to 0x0c and then back to 0x08 looks a lot like we
> called a library function which then called a callback in the main
> program.
> 
> Jan
> 
-- 
Patrick Walsh
eSoft Incorporated
303.444.1600 x3350
http://www.esoft.com/

Received on 2005-05-31 11:10:33