Coda File System

Re: coda client hangs

From: Patrick Walsh <pwalsh_at_esoft.com>
Date: Tue, 24 May 2005 08:20:57 -0600
	Can someone (Jan?) please take a look at this and let me know what you
think?  We're going to have to abandon coda and start investigating
commercial solutions soon if we can't resolve this.  The reason for our
using coda is to have a high-availability distributed file system.
Without the high-availability, it becomes a liability.  I've invested so
much time into coda that I'd really like to see it work.

	The clients have hung again, but because we were in the middle of
testing some other things, I couldn't take the time to attach gdb.  The
hangs seem somewhat coordinated, since 3 of our 4 coda clients hung at
around the same time and needed to be restarted.

	Another issue: although we have a cron job that gets fresh tokens 3
times per day, root (and possibly other users) sometimes lose their
tokens.  I suspect this is because we run several clog commands at the
same time for different users (user nobody, user root, etc., all try to
get tokens at the same time).  Is it possible that this would cause a
problem?
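
	If simultaneous clog runs do turn out to be the cause, one workaround
would be to have the cron jobs take turns.  Below is a minimal Python
sketch of that idea, not something we run today.  The name
run_serialized and the lock file location are made up for illustration,
and the command is passed in by the caller, so nothing is assumed here
about clog's own options.  The lock file has to live somewhere every
user involved can open it for reading.

```python
import fcntl
import os
import subprocess


def run_serialized(cmd, lock_path):
    """Run cmd while holding an exclusive lock on lock_path.

    Concurrent callers (for example cron jobs for different users)
    queue up on the lock instead of running at the same moment.
    Returns the exit status of cmd.
    """
    # A read-only descriptor is enough for flock(); O_CREAT creates the
    # lock file on first use (hypothetical location chosen by the caller).
    fd = os.open(lock_path, os.O_RDONLY | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until other holders release
        return subprocess.run(cmd, check=False).returncode
    finally:
        os.close(fd)  # closing the descriptor also drops the lock
```

Each cron entry would then call run_serialized() with its own token
command and the same lock_path, so the token requests reach venus one
at a time.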

Thanks,

..Patrick (original problem details below)


On Fri, 2005-05-20 at 10:43 -0600, Patrick Walsh wrote:
> 	To recap: we are setting up a cluster of servers using coda as the
> shared filesystem.  This cluster of servers uses coda for html files,
> ftp files, etc.  We have two dedicated coda servers.
> 
> 	These servers haven't moved into production yet as we want to make sure
> they are absolutely stable.  Alas, it seems they are not.
> 
> 	Twice in recent times the coda client has hung.  Restarting venus fixed
> the problem.  When this happens next time I'll attach gdb to the process
> to try to see what happened.  In the meantime, all I have is the console
> and venus log files.  (We're using client version 6.0.8.)  The venus log
> file is extremely long with lots of messages like this:
> 
> [ H(07) : 0207 : 04:42:20 ] Hoard Walk interrupted -- object missing!
> <606e1fc8.7f000001.964.56d1>
> [ H(07) : 0207 : 04:42:20 ] Number of interrupt failures = 131
> 
> and like this:
> 
> [ W(823) : 0000 : 11:13:17 ] Cachefile::SetLength 552280
> [ W(823) : 0000 : 11:13:17 ] fsobj::StatusEq: (606e1fc8.7f000002.e.28),
> VVs differ
> [ W(823) : 0000 : 11:13:19 ] Cachefile::SetLength 552933
> [ W(823) : 0000 : 11:13:20 ] fsobj::StatusEq: (606e1fc8.7f000002.e.28),
> VVs differ
> 
> changing to this:
> 
> [ W(1783) : 0000 : 17:17:18 ] Cachefile::SetLength 3243845
> [ W(1783) : 0000 : 17:17:19 ] *** Long Running (Multi)Store: code =
> -2001, elapsed = 1252.4 ***
> [ W(1783) : 0000 : 17:17:19 ] fsobj::StatusEq: (606e1fc8.7f000002.e.28),
> VVs differ
> 
> and then eventually to this:
> 
> [ W(1783) : 0000 : 21:54:50 ] Cachefile::SetLength 7015276
> [ W(1779) : 0000 : 21:54:53 ] WAITING(606e1fc8.7f000002.e.28): level =
> RD, readers = 0, writers = 1
> [ W(1783) : 0000 : 21:54:53 ] *** Long Running (Multi)Store: code =
> -2001, elapsed = 3633.3 ***
> [ W(1783) : 0000 : 21:54:53 ] fsobj::StatusEq: (606e1fc8.7f000002.e.28),
> VVs differ
> [ W(1779) : 0000 : 21:54:53 ] WAIT OVER, elapsed = 361.2
> 
> 	The very end of venus.log looks like this:
> 
> [ W(1783) : 0000 : 21:54:57 ] Cachefile::SetLength 7016538
> [ D(1804) : 0000 : 21:55:00 ] WAITING(SRVRQ):
> [ W(821) : 0000 : 21:55:00 ] WAITING(SRVRQ):
> [ W(823) : 0000 : 21:55:00 ] *****  FATAL SIGNAL (11) *****
> 
> 
> 	Most of the complaints I think are harmless and I think result from
> this file: 606e1fc8.7f000002.e.28, which I believe is the apache log
> file.
> 
> 	Here's the end of console.log:
> 
> 12:55:00 root acquiring Coda tokens!
> 12:55:01 root acquiring Coda tokens!
> 12:55:01 Coda token for user 0 has been discarded
> 15:55:00 root acquiring Coda tokens!
> 15:55:00 root acquiring Coda tokens!
> 18:55:00 root acquiring Coda tokens!
> 18:55:00 root acquiring Coda tokens!
> 21:55:00 root acquiring Coda tokens!
> 21:55:00 root acquiring Coda tokens!
> 21:55:00 Fatal Signal (11); pid 1708 becoming a zombie...
> 21:55:00 You may use gdb to attach to 1708
> 
> 
> 	Finally, to my questions: 1) is there something I can do to prevent
> future signal 11's?  2) If such a signal (whatever it means) happens,
> can coda just restart itself instead of going into a zombie state and
> causing httpd and proftpd to hang?
> 
> 	Thanks for your help.
> 
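
	On question 2 above: until venus can recover by itself, an external
watchdog might at least shorten the outage.  The following is a rough
Python sketch under stated assumptions, not a tested fix.  The names
probe_path and check_and_recover are invented for illustration, and the
restart action is a callable supplied by the caller, since how venus
should be restarted depends on the installation.  The probe runs in a
child process so that a hung mount blocks the child rather than the
watchdog.

```python
import subprocess
import sys


def probe_path(path, timeout=10.0):
    """Return True if os.stat(path) succeeds in a child within timeout seconds."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stderr=subprocess.DEVNULL,
    )
    try:
        return child.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Do not wait again after kill(): a child stuck inside the
        # kernel may not exit until the file system responds.
        child.kill()
        return False


def check_and_recover(path, restart, timeout=10.0):
    """Probe path; call restart() if the probe fails or times out.

    Returns True when the path answered in time, False when restart()
    was invoked.
    """
    if probe_path(path, timeout):
        return True
    restart()
    return False
```

Run from cron every minute or so against a path under the coda mount,
this would turn a silent hang into a restart attempt.  It does nothing
to explain the signal 11 itself.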
-- 
Patrick Walsh
eSoft Incorporated
303.444.1600 x3350
http://www.esoft.com/

Received on 2005-05-24 10:21:51