Coda File System

Re: Resolve endian problem kill x86 server?

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Fri, 16 Jul 2004 13:53:33 -0400
On Thu, Jul 15, 2004 at 11:28:18AM -0500, Troy Benjegerdes wrote:
> Is there going to be any reasonable way to run testcases or otherwise
> audit the code for all the potential endian and 64 bit problems?

There are two issues here; I'll deal with them one at a time.

Endianness:

Anything that only uses RPC2 request/reply should be OK. That code has
seen enough use. However, in some places we send things around with the
side-effect (SFTP) as a large buffer. Reintegration is one of these, and
that code actually borrows a lot of the marshalling/unmarshalling
functionality from RPC2 and it works without a problem.

Then there is resolution. The code dealing with resolution logs isn't
all that pretty, and I am not sure if it actually even tries to marshal
the logs before sending them to the other side. So there could be
endian-related problems there.
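
To illustrate what marshalling a log record would mean here, a minimal
sketch follows. The struct log_hdr and its two fields are hypothetical
(the real resolution log entry layout is not shown in this thread); the
point is only that each 32-bit field is written byte by byte in a fixed
order, so the sender's host byte order never reaches the wire.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record header, NOT the actual Coda resolution log layout. */
struct log_hdr {
    uint32_t stamp;   /* begin-of-record marker */
    uint32_t length;  /* length of the record body in bytes */
};

/* Store a 32-bit value most-significant byte first, whatever the host order. */
void put_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/* Read back a 32-bit value stored most-significant byte first. */
uint32_t get_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Marshal the header into buf (at least 8 bytes); returns bytes written. */
size_t log_hdr_pack(const struct log_hdr *h, unsigned char *buf)
{
    put_be32(buf, h->stamp);
    put_be32(buf + 4, h->length);
    return 8;
}

/* Unmarshal the header from buf; returns bytes consumed. */
size_t log_hdr_unpack(struct log_hdr *h, const unsigned char *buf)
{
    h->stamp  = get_be32(buf);
    h->length = get_be32(buf + 4);
    return 8;
}
```

Copying the struct out of memory as-is would instead put the bytes on
the wire in whatever order the sending CPU uses, which is the kind of
thing that breaks between a big-endian and a little-endian server.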

The other issue, 64-bit

Big problems just about everywhere. The only parts that I've cleaned up
and actually tested on a 64-bit Alpha machine are LWP and RVM. In this
area, RPC2 actually causes a lot of problems. RPC2_Integer is defined as
a long integer instead of an int32_t. This effectively leaked into
everything that is using RPC2, so almost everywhere an unsigned or
signed long is currently used, we should really be talking about ints.
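
As a rough sketch of the direction this would take: define the wire
integer types in terms of the fixed-width <stdint.h> types and let the
compiler reject any platform where the size is wrong. The names
wire_int32, wire_uint32 and wire_pair below are made up for
illustration; they are not the actual RPC2 typedefs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration, NOT the real RPC2 definitions. */
typedef int32_t  wire_int32;   /* exactly 32 bits on every platform */
typedef uint32_t wire_uint32;

/* C11 compile-time check: the build fails if a wire integer is not 4 bytes.
 * A plain 'long' would fail this on LP64 systems, where long is 8 bytes. */
_Static_assert(sizeof(wire_int32) == 4, "wire integers must be 4 bytes");
_Static_assert(sizeof(wire_uint32) == 4, "wire integers must be 4 bytes");

/* A struct built only from the fixed-width types. */
struct wire_pair {
    wire_int32  a;
    wire_uint32 b;
};

/* Reports the in-memory size of the struct. */
size_t wire_pair_size(void)
{
    return sizeof(struct wire_pair);
}
```

With long in place of int32_t the same struct would typically be 8 bytes
on a 32-bit x86 build and 16 bytes on a 64-bit Alpha build.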

I think I also saw somewhere that a string was originally a pointer in a
struct, but when it is 'flattened' for storage in RVM or to send it
across the network the pointer is replaced with an offset. These
structures will either have to use a 64-bit integer, or not try to
alternately use both pointers and offsets in the same field.
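
A sketch of the second option, keeping the pointer and the offset in
separate types, might look like the following. The struct and function
names (entry_mem, entry_flat, entry_flatten, entry_flat_name) are
hypothetical and not taken from the Coda sources. It only addresses the
pointer-size problem; the flattened fields are still in host byte order
and would need marshalling before going over the network.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* In-memory form: holds a real pointer (4 or 8 bytes depending on platform). */
struct entry_mem {
    uint32_t    id;
    const char *name;
};

/* Flattened form: a fixed-width offset from the start of the buffer to the
 * NUL-terminated name, so the layout does not depend on pointer size. */
struct entry_flat {
    uint32_t id;
    uint32_t name_off;
};

/* Flatten e into buf. Returns the bytes used, or 0 if buf is too small. */
size_t entry_flatten(const struct entry_mem *e, unsigned char *buf, size_t buflen)
{
    struct entry_flat f;
    size_t namelen = strlen(e->name) + 1;   /* include the terminating NUL */

    if (buflen < sizeof f || namelen > buflen - sizeof f)
        return 0;

    f.id = e->id;
    f.name_off = (uint32_t)sizeof f;        /* name follows the header */
    memcpy(buf, &f, sizeof f);
    memcpy(buf + sizeof f, e->name, namelen);
    return sizeof f + namelen;
}

/* Return a pointer to the name stored inside a flattened buffer. */
const char *entry_flat_name(const unsigned char *buf)
{
    struct entry_flat f;
    memcpy(&f, buf, sizeof f);              /* avoid unaligned access */
    return (const char *)buf + f.name_off;
}
```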

> I managed to kill the x86 server on a resolve this time...
> 
> The last thing in 'SrvLog' is this:
> 
> 10:58:04 rsle::InitFromBuf Bad begin stamp 0x84ea32fb
> 10:58:04 rsle::InitFromBuf Bad begin stamp 0x84ea32fb

That does look like resolution logs are not marshalled to be platform
independent and are simply dumped straight from RVM.

> 10:58:03 Incomplete host set in COP2.
> 10:58:03 Incomplete host set in COP2.

These happen when an operation does not complete successfully on all
replicas. Often caused by a crashed server, or a client that (believes
it) is disconnected from a server.

> 10:58:03 CheckRetCodes: server 209.234.73.41 returned error 102
> 10:58:03 ViceResolve:Couldnt lock volume 7f000001 at all accessible
> servers

I had to do some searching, but 102 is VNOVNODE. To me this says that
the object we're trying to resolve doesn't yet exist on all servers. The
real problem in this case is in fact with the parent directory. The
client should automatically go up one level and try to resolve the
directory, which would create the directory entry as well as a runt
(empty) object. Only then can we resolve the contents of the file.

This is done to avoid creating lots of orphan objects where we don't
even know where they belong in the tree. It could even be that this
involves a removed file and the server that returned VNOVNODE is the one
that actually is correct here, although that is probably unlikely if you
only have a single client.

Jan
Received on 2004-07-16 13:54:42