Coda File System

Re: cfs lv hangs on NetBSD/sparc, NetBSD/sparc64

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Thu, 27 Jul 2006 16:13:48 -0400
On Wed, Jul 26, 2006 at 09:39:52PM -0400, Sean Caron wrote:
> Hi all,
> 
> Running coda 6.0.14 on NetBSD/sparc64 back-end servers and a mix of
> NetBSD/sparc, NetBSD/sparc64, and NetBSD/macppc clients (well
> described in the list archives). I have recently taken note of the
> fact that the 'cfs lv' command does not work on any of the SPARC
> systems -- perhaps an endianness bug?

More likely a 64-bit issue.

How did you build the servers on the 64-bit machines? Were they compiled
as 32-bit applications?
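
If it is not clear how the binaries were built, a small C program compiled with the same compiler and flags as the servers can show it. This is only a sketch: the helper names below are made up for illustration and are not part of Coda.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: pointer width in bits for this build
 * (32 for an ILP32 build, 64 for an LP64 build). */
static int pointer_bits(void)
{
    /* POSIX systems use 8-bit bytes. */
    assert(CHAR_BIT == 8);
    return (int)(sizeof(void *) * CHAR_BIT);
}

/* Hypothetical helper: print the type sizes that usually differ
 * between 32-bit and 64-bit builds on Unix-like systems. */
static void print_data_model(void)
{
    printf("int=%zu long=%zu void*=%zu bytes\n",
           sizeof(int), sizeof(long), sizeof(void *));
    printf("built as a %d-bit application\n", pointer_bits());
}
```

Calling print_data_model() from main() in a program built the same way as the servers reports 64 for a 64-bit build and 32 for a 32-bit one.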

> sonnet% cfs lv /coda/diablonet.net/tmp
> (just sits forever)
> ^C/coda/diablonet.net/tmp: Interrupted system call
> sonnet%
> 
> (this happens both against my own servers, and against
> testserver.coda.cs.cmu.edu so I know its not anything wrong with my
> server configuration)

So the 'cfs lv' hangs on a 32-bit sparc client, even when talking to
testserver? If this is the first time you access the realm it could be a
DNS resolver problem.

> After this point, venus is completely hosed -- if you go to CODA-space
> and try to do anything like list a directory, read/write a file,
> whatever, it is just hung up and will sit, until you kill it, always
> with the interrupted system call error. You have to kill venus off and
> re-invoke with venus -init to make it work again.

I've noticed that when the realm lookup fails, some things are not
cleaned up correctly and venus crashes. When venus dies it hangs around
waiting for a debugger, and commands tend to get 'stuck' until venus is
killed.

> If I ktruss it, I see it is hanging up on the system call,
> 
>   803 cfs      open("/coda/.CONTROL", 0, 0)       = 3
> 
> and cranking venus up with debuglevel -d 100, I see this in the logs:
> 
> [ W(15) : 0000 : 21:21:03 ] fsobj::Lookup: (diablonet.net/tmp), uid = 0
> [ W(15) : 0000 : 21:21:03 ] fsobj::Access : (diablonet.net, 8, 0), uid = 0
> [ W(15) : 0000 : 21:21:03 ] Realm::GetUser local uid '0' for realm 'diablonet.net'

Ah, the realm lookup did complete, otherwise we wouldn't see lookups and
access calls for subdirectories.

> [ W(15) : 0000 : 21:21:03 ] srvent::GetConn: host = blossom.diablnet.net, uid = -1, force = 0
> [ W(15) : 0000 : 21:21:03 ] PutConn: host = blossom.diablnet.net, uid = -1, cid = 391227550, auth = 0
> [ W(15) : 0000 : 21:21:03 ] PutServer: blossom.diablnet.net

This is kind of strange: the hostname reads 'diablnet' while the realm is
'diablonet', so there seems to be an 'o' missing. A typo in /etc/hosts?

> [ W(15) : 0000 : 21:21:03 ] volent::volent: (7f00000b, diablonet.tmp)
> [ W(15) : 0000 : 21:21:03 ] repvol::repvol 5043a4c8 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> [ W(15) : 0000 : 21:21:03 ] vsgdb::GetVSG 451276c8 451276c0 451276b8 451276b0 451276a8 451276a0 45127698 45127690

This is strange, it seems to think that the volume with only a single
replica is replicated on 8 servers.

Normally you would see the repvol::repvol and the vsgdb::GetVSG lines
match up.

[ W(12073) : 0000 : 15:59:41 ] repvol::repvol 5108d908 51000148 50fe9688 00000000 00000000 00000000 00000000 00000000
[ W(12073) : 0000 : 15:59:41 ] vsgdb::GetVSG c7d10280 6fde0280 c0bf0280 00000000 00000000 00000000 00000000 00000000

Also the host values in the VSG array are very suspicious. I would
expect the GetVSG line to look like,
    vsgdb::GetVSG 97a00740 00000000 00000000 00000000 00000000 00000000 ...
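
To make the mismatch concrete, here is a rough C sketch that counts how many of the eight host slots are in use, fed with the values copied from the log lines in this message. It assumes an unused slot is stored as zero, which is how the healthy log lines print them. The names are hypothetical and not taken from the venus sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VSG_SLOTS 8 /* the log lines print eight host slots */

/* Hypothetical helper: count the slots holding a non-zero host value.
 * Assumption: an unused slot is stored as 0. */
static size_t used_slots(const uint32_t hosts[VSG_SLOTS])
{
    size_t used = 0;

    assert(hosts != NULL);
    for (size_t i = 0; i < VSG_SLOTS; i++) {
        if (hosts[i] != 0)
            used++;
    }
    return used;
}

/* GetVSG line from the failing sparc client. */
static const uint32_t reported_vsg[VSG_SLOTS] = {
    0x451276c8, 0x451276c0, 0x451276b8, 0x451276b0,
    0x451276a8, 0x451276a0, 0x45127698, 0x45127690
};

/* GetVSG line expected for a singly replicated volume. */
static const uint32_t expected_vsg[VSG_SLOTS] = { 0x97a00740 };

/* GetVSG line from the healthy triply replicated example. */
static const uint32_t healthy_vsg[VSG_SLOTS] = {
    0xc7d10280, 0x6fde0280, 0xc0bf0280
};
```

With these inputs, used_slots() gives 8 for the reported line, 1 for the expected line, and 3 for the healthy example.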

And finally, we end up getting a bus error, probably when we try to
release this strangely initialized structure.
    
> [ W(15) : 0000 : 21:21:03 ] mgrpent::CheckNonMutating: acode = -2001
>                hosts = [0x45120ed0 0x45120ec8 0x45120ec0 0x45120eb8 
>                0x45120eb0 0x45120ea8 0x45120ea0 0x45120e98],
>                retcodes = [0 -2002 -2002 -2002 -2002 -2002 -2002 -2002]
> [ W(15) : 0000 : 21:21:03 ] mgrpent::Put 0x10dc00, uid = 0, mid = 1, auth = 1, refcount = 2, detached = 0
> [ W(15) : 0000 : 21:21:03 ] mgrpent::PutHostSet: 0x10dc00
> [ W(15) : 0000 : 21:21:03 ] *****  FATAL SIGNAL (10) *****

I'll look at the logs a bit more. It is almost as if your struct in_addr
contains a pointer instead of a 32-bit integer.
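
One thing that points in this direction: the values in the GetVSG and CheckNonMutating lines step down by exactly 8 from one slot to the next, which is what addresses of adjacent 8-byte objects look like, not unrelated IPv4 addresses. Below is a rough C sketch of both checks. The helper names are made up for illustration, and the stride test is only a heuristic, not proof of what the structure holds.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if struct in_addr is exactly one
 * 32-bit word in this build, 0 otherwise. */
static int in_addr_is_32bit(void)
{
    struct in_addr a;

    return sizeof(a) == sizeof(uint32_t) &&
           sizeof(a.s_addr) == sizeof(uint32_t);
}

/* Hypothetical helper: if every consecutive pair of values differs by
 * the same amount, return that difference; otherwise return 0. */
static int64_t constant_stride(const uint32_t *v, size_t n)
{
    assert(v != NULL);
    if (n < 2)
        return 0;

    int64_t stride = (int64_t)v[1] - (int64_t)v[0];
    for (size_t i = 2; i < n; i++) {
        if ((int64_t)v[i] - (int64_t)v[i - 1] != stride)
            return 0;
    }
    return stride;
}

/* Values copied from the log lines quoted in this message. */
static const uint32_t getvsg_hosts[8] = {
    0x451276c8, 0x451276c0, 0x451276b8, 0x451276b0,
    0x451276a8, 0x451276a0, 0x45127698, 0x45127690
};
static const uint32_t mgrp_hosts[8] = {
    0x45120ed0, 0x45120ec8, 0x45120ec0, 0x45120eb8,
    0x45120eb0, 0x45120ea8, 0x45120ea0, 0x45120e98
};
static const uint32_t healthy_hosts[8] = {
    0xc7d10280, 0x6fde0280, 0xc0bf0280
};
```

Both arrays from the failing client give a stride of -8, while the healthy example has no constant stride. If in_addr_is_32bit() returns 0 when built with the same flags as venus, that would be a strong hint in the same direction.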

Jan
Received on 2006-07-27 16:16:19