Coda File System

Re: Coda always crashes while copying files across

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Wed, 15 Aug 2001 09:38:48 -0400
On Wed, Aug 15, 2001 at 09:31:56AM +0800, Jeremy Malcolm wrote:
> I am running Coda 5.3 on Red Hat Linux 6.2.  It seems to be installed OK
> (after a fair amount of trial and error) and so now I am trying to copy
> all my data into the coda store.  I am doing that from the coda server
> which is also running venus.  While copying, periodically I will get:
> 
>   08:23:22 Cache Overflow: (52, -214828)

You have got over 200MB worth of data that doesn't fit in the client
cache. This is probably because the client went disconnected and is
logging modifications locally. Because a client with a 20MB cache is not
expected to hold that much data, the number of FSOs (cacheable objects)
and CML entries (modification log entries) is very limited. And because
of some C++ peculiarities, namely the rather stupid running of object
initializers when an allocation fails and returns a NULL pointer, we are
pretty much unable to avoid crashing when the client exceeds its FSO or
CML limits.

The only 'official' solution to the C++ allocation problem is to use
exceptions, which weren't implemented in gcc until 2.95, and what is
implemented doesn't work with threaded (or at least LWP-threaded)
programs.

> But then after about five or ten minutes I will get an error about the
> device being full (sorry, it had scrolled out of my buffer so I can't
> copy it into this email), followed by a message like:

Your client probably ran out of FSOs because there were too many
pending reintegrations.

> 08:23:20 Local inconsistent object at
> /coda/programs/distfiles/Windows/fireworks4-TBYB.exe, please check!
> ...snip...
> 08:23:38 Cache Overflow: (52, -221652)
> cp: preserving permissions for ./dreamweaver4/Dreamweaver
> 4/Configuration/Objects/Frames/Left Top.gif: No such device

And at about this point venus dies, probably because it couldn't
allocate another CML entry.

> The disk is not full, and neither would coda's store be full.  df shows:

The limit for the client is not really disk space, but the number of
cacheable objects.

> After I kill the copying process (which is still churning through
> "device not configured" lines) the last line changes to:
> 
> Coda                   9000000         0   9000000   0% /coda

The STATFS upcall fails and the kernel module falls back on returning
fake information to avoid locking up your system.

> and when I do ls I get:
> 
>   ls: /coda: Input/output error
> 
> This happens even after I restart venus.  After doing so, venus shows up
> as a process, but when I run codacon I get:

Did you kill the old venus process and unmount /coda before restarting?
Venus cannot reattach to a 'running' filesystem, because some files might
be open, etc. Requiring you to unmount the filesystem, which the Linux
kernel doesn't allow as long as a file is still open, forces you to kill
any processes that still have references to files in /coda.

> There are no errors in /usr/coda/etc/venus.log, at this stage, however.
> In /vice/srv/SrvLog I have lots of messages like this:
> 
>   08:27:37 VLDB_Lookup: no more records in VLDB

This is strange; it indicates that a lookup is performed for a volume
that doesn't exist, or at least one this server doesn't know about. Are
the /vice/db/VRDB and /vice/db/VLDB files on both servers the same?
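
One way to answer that is a byte-for-byte comparison of the two database
files. A minimal sketch, assuming the other server's copies have already
been fetched into a local directory (e.g. with scp); the compare_vice_db
helper and the example directory are hypothetical:

```shell
# Hypothetical helper: compare this server's VRDB and VLDB with copies
# taken from the other server (fetched beforehand, e.g. with scp).
# Usage: compare_vice_db LOCAL_DB_DIR OTHER_DB_DIR
# Returns 0 only if both files are byte-for-byte identical.
compare_vice_db() {
    local_dir=$1
    other_dir=$2
    status=0
    for db in VRDB VLDB; do
        # cmp -s prints nothing; a non-zero exit means "differs" or "missing"
        if cmp -s "$local_dir/$db" "$other_dir/$db" 2>/dev/null; then
            echo "$db: identical"
        else
            echo "$db: DIFFERENT (or missing)"
            status=1
        fi
    done
    return $status
}

# Example (the directory holding the other server's copies is a placeholder):
#   compare_vice_db /vice/db /tmp/other-server-db
```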

> Even more interestingly it seems to be only when I delete the file on
> the other machine that venus finally shuts down on the main server and I
> get the following in its venus.log:

Ok, so the old venus was probably still hanging around, or the new venus
was being 'busied' by one of the servers. The callback that resulted
from the delete operation probably triggered the release of some lock
that allowed progress.

> [ X(00) : 0000 : 08:39:49 ] fsobj::Recover: invalid fso
> (fireworks4-TBYB.exe, (0xffffffff.0xfffffffe.0x2)), attempting to
> GC...0x20206b88 : fid = ((0xffffffff.0xfffffffe.0x2)), comp =
> fireworks4-TBYB.exe, vol = 20212e88

Recovery only happens during startup; I'm not sure why you are seeing
it this late. You might have to reinitialize the crashed venus.

> To finally get coda working again on the main machine, I seem to have to
> reboot.  Shutting down and restarting the coda services doesn't cut the
> mustard (even if I check with ps that they are all dead.  I also check
> if there are any processes still listening on port 370).

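    killall -9 venus    # make sure no old venus process is left running
    umount /coda
    # has to succeed, otherwise the new venus process won't start

    venus -init &       # reinitialize the client cache, discarding local state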

The umount will most likely fail because some processes still have
open references to files in Coda. Sometimes they can be found using
'lsof | grep /coda'.
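
A minimal sketch of that check, which pulls the PIDs out of the lsof
output so the processes can be looked at before anything is killed. It
assumes lsof's default column layout (command name first, PID second);
the coda_holders function name is hypothetical:

```shell
# Hypothetical helper: read `lsof` output on stdin and print the unique
# PIDs of processes that have something under /coda open.
# Matches a NAME field that is exactly /coda or starts with /coda/,
# so paths like /home/user/coda-notes.txt are not picked up.
coda_holders() {
    grep -E ' /coda(/|$| )' | awk '{ print $2 }' | sort -un
}

# Typical use:
#   lsof | coda_holders              # list the PIDs holding /coda open
#   kill $(lsof | coda_holders)      # then terminate them and retry umount /coda
```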

To minimize the chance of the client switching to logging mode and
running out of FSO/CML objects, use 'cfs strong' before copying more data
than would fit in the client cache. 'cfs adaptive' returns the client to
its normal behaviour, in which it switches to write-disconnected mode when
the estimated network bandwidth is low, or when the server is getting
loaded.
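
A minimal sketch of that workflow: switch to strong connectivity, run the
copy, and go back to adaptive mode even if the copy fails. Only the
'cfs strong' and 'cfs adaptive' commands come from the advice above; the
with_strong_connectivity wrapper is hypothetical and the paths in the
example are placeholders:

```shell
# Hypothetical wrapper: run a command with the Coda client pinned to strong
# connectivity, then restore the normal adaptive behaviour afterwards.
with_strong_connectivity() {
    cfs strong || return 1      # don't start the copy if the switch fails
    "$@"                        # run the actual copy command
    rc=$?
    cfs adaptive                # always restore adaptive mode, even on failure
    return $rc                  # report the copy command's own exit status
}

# Example with placeholder paths:
#   with_strong_connectivity cp -a ./dreamweaver4 /coda/programs/
```

Restoring adaptive mode unconditionally matters because a copy that dies
halfway would otherwise leave the client pinned to strong connectivity.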

Jan
Received on 2001-08-15 09:38:55