Coda File System

Re: Venus Segfault

From: Jan Harkes <>
Date: Thu, 10 Feb 2005 22:03:04 -0500

On Thu, Feb 10, 2005 at 06:31:37AM +0800, Alan Tam wrote:
> Jan Harkes wrote:
> >So I have a pretty good idea where it crashed, but no idea how it
> >managed to crash there.
> Maybe it is caused by my manual editing of these files to [1] correct 
> the wrongly detected machine names. Probably I should remove everything 
> else and install again. I've got a lot of such experience anyway.

Changing stuff like that shouldn't crash a client; at worst it would
leave the client unable to reach a server. I'll try to mess around a bit
with stuff to see if I can reproduce it.

> But still sometimes I do have no way to discover where the problems are. 
> Process can be frozen, my not knowing what it is waiting for [2]. And in 

That looks like a normal 60 second rpc2 timeout. It used to be 15
seconds, but so many people had problems with unexpected disconnections
over weak links (or links with asymmetric bandwidth like ADSL or cable
modems) that we bumped up the delays to more conservative values.

If you had 2 or more root servers for your realm, the delay would
probably have been a multiple of this. That is a clear indication that
we really should teach RPC2 to look at ICMP error responses
(NETUNREACH/HOSTUNREACH) so that we can abort useless retries to
unreachable servers more quickly.
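
As a rough illustration of the idea (this is not RPC2 code), the retry
loop could give up on a server as soon as an unreachable error is
reported, instead of waiting out every retry. In this Python sketch,
try_servers, attempt and the retry count are all hypothetical; attempt
stands in for one request that raises TimeoutError on silence and
OSError when the kernel reports a network error.

```python
import errno

# errno values that can surface when an ICMP error comes back for a request.
FATAL_ERRNOS = {errno.ENETUNREACH, errno.EHOSTUNREACH, errno.ECONNREFUSED}


def try_servers(servers, attempt, retries=3):
    """Return (server, result) for the first server that answers.

    Returns (None, None) if no server answers. attempt(server) is a
    hypothetical callable that performs one request.
    """
    for server in servers:
        for _ in range(retries):
            try:
                return server, attempt(server)
            except TimeoutError:
                continue  # silence: retry the same server
            except OSError as exc:
                if exc.errno in FATAL_ERRNOS:
                    break  # reported unreachable: skip remaining retries
                raise  # anything else is unexpected, let it propagate
    return None, None
```

With two servers and no such shortcut, a dead first server costs the
full timeout before the second one is even tried, which is where the
"multiple of this" comes from.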

> most cases, the messages logged are simply not enough to track down what 
> is configured wrong.

Are you running 'codacon'? I tend to run it permanently in a separate
xterm on my desktop and it does often give some more feedback about the
transient stuff that is going on.

> sltam_at_beta:/coda$ date; ls -l; date
> Thu Feb 10 06:23:35 HKT 2005
> total 9
> dr-xr-xr-x   2 root guest 2048 Dec 25 02:57 ./
> drwxr-xr-x  25 root root  4096 Feb  5 19:12 ../
> lrw-r--r--   1 root guest    9 Feb 10 03:29 -> 
> Thu Feb 10 06:24:26 HKT 2005

Ok, so we know that '' is already known by venus
either because you accessed it earlier, or because you have obtained
tokens for that realm.

Now the question is where we waited for that 60 second timeout. Name
resolution must have succeeded, because I don't get any delay if I do
'ls /coda/'. So there are 2 RPC operations between here and
successfully mounting the volume. One is the volume location query, and
the second is where we try to get the attributes of the root directory
of the volume. If the server is not running we probably timed out on the
first, and if the IP address in the location information is wrong we
probably timed out on the second.

What you could try is the 'getvolinfo' command. I'm not sure whether it
is installed with the client or the server, but it should be in
/usr/sbin. Do something like 'getvolinfo ""', and
that should return the volume location information of the rootvolume.
The result will contain information like,

    Replica0 id c7000085, Server0

Check if that IP is actually valid and reachable; my guess is that
either that address is, or that your server is bound to a
different address.
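
A quick sanity check on the address reported by getvolinfo could look
something like the Python sketch below. The function name and the list
of checks are assumptions about what typically makes an advertised
address unusable from another machine; the code only inspects the
address string and does not contact the server.

```python
import ipaddress


def address_problems(addr):
    """Return a list of reasons why addr is unlikely to be reachable remotely."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        return ["not a valid IP address"]

    problems = []
    if ip.is_loopback:
        problems.append("loopback")      # only reachable from the server itself
    if ip.is_unspecified:
        problems.append("unspecified")   # 0.0.0.0 or ::
    if ip.is_link_local:
        problems.append("link-local")    # not routable beyond the local segment
    if ip.is_multicast:
        problems.append("multicast")     # not a single host address
    return problems
```

An empty list only means the address looks plausible; whether the server
is actually listening there still has to be checked from the client.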

Received on 2005-02-10 22:04:59