Coda File System

RE: venus crash

From: Florin Muresan <Florin.Muresan_at_atc.ro>
Date: Tue, 13 Mar 2007 14:58:43 +0200
Hello everybody!

I have a similar problem in my coda realm. Venus crashes when I try to
copy(overwrite)/delete many files. 
I use coda in a production environment for a web hosting solution.
During the testing period this situation never happened. My guess is
that the problem occurs because of the high number of accessed files per
second that triggers false conflicts.
I am running coda-server and coda-client version 6.0.16 on Debian
Sarge(stable) boxes and so far I didn't encounter any other problems. It
works very smoothly.

Do you think that if I install the version 6.9.0 the problem with false
conflicts will be avoided?
Are there any other suggestions for this situation?

Regards,
Florin

-----Original Message-----
From: Jan Harkes [mailto:jaharkes_at_cs.cmu.edu] 
Sent: Monday, March 12, 2007 8:09 PM
To: codalist_at_coda.cs.cmu.edu
Subject: Re: venus crash

On Mon, Mar 12, 2007 at 05:45:28AM +0100, S. Cance wrote:
> I missed something in the console log file :
> 
> 05:38:55 RecovTerminate: dirty shutdown (1 uncommitted transactions)
> Assertion failed: 0, file "fso_dir.cc", line 96
> Sleeping forever.  You may use gdb to attach to process 1890.

This is an assertion that triggers when we try to create a new filename
entry in a directory, but the name already exists. This is a situation
that should never occur: before we try to create this new directory
entry we should have checked whether it exists or not.

So we may be forgetting (in some code path) to check if the name already
exists. It is also possible that we do try to check for existence, but
somehow combine that test with a check for the validity of the object
the name refers to and if that object is missing (which is definitely
possible during disconnected mode) we assume that the directory entry
can be safely created.
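
As a rough illustration of the second possibility (this is not Venus code;
the class and function names below are hypothetical and the directory is
modelled as a plain dictionary), the following sketch shows how folding the
"does the name exist" test into an "is the object usable" test can let a
create through for a name that is already present:

```python
class Directory:
    """Toy directory: maps a name to an object id (hypothetical model)."""

    def __init__(self):
        self.entries = {}

    def create(self, name, obj_id):
        # Stand-in for the assertion in fso_dir.cc: the name must be new.
        assert name not in self.entries, "directory entry already exists"
        self.entries[name] = obj_id


def may_create_conflated(directory, cached_objects, name):
    # Suspected bug pattern: "name is free" and "object is usable" share
    # one test, so a name whose object is not cached looks free.
    obj_id = directory.entries.get(name)
    return obj_id is None or obj_id not in cached_objects


def may_create_by_name(directory, name):
    # Checks only the directory entry, regardless of the object's state.
    return name not in directory.entries


d = Directory()
d.create(".file.swp", 1)
cached_objects = set()  # object 1 is unavailable, e.g. while disconnected

print(may_create_conflated(d, cached_objects, ".file.swp"))
print(may_create_by_name(d, ".file.swp"))
```

With the conflated check the caller would go on to call create() for a
name that is still in the directory, which is exactly the case the
assertion guards against.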

Either way, it is a bug. It may be hard to track down. Clearly we're
adding some name. If you are using vim, it has several different ways it
could possibly handle how files are written (create backup, overwrite
original, or move original to backup, create new version, etc.).

Is the problem reliably reproducible? In that case you could rotate the
log, bump the debug level on your client and try to trigger the problem.

    (vutil -swap ; vutil -d 100) # rotate logfiles and set debug level 100

Then if the problem occurs again the log hopefully will contain enough
detail to figure out what sequence of operations caused the problem. If
the problem didn't occur it is probably best to set debugging back to 0
to avoid filling up your local disk.

    (vutil -d 0) # turn off excessive debug logging.

> I lost the modifications on the file, but what is surprising is that I
> get CML conflicts on vim's swp's files.
> 
> is it normal behaviour ?

There is a case where false conflicts are detected even when it only
involves write operations, or more specifically the store of new file
data when a file is closed, from a single client.

If your client is older than 6.9.0, it can switch between 2 different
modes of operation. In the 'connected' mode it will send individual
operations to the server (create/store/chown/chmod). The other mode is
typically called 'write-disconnected', but since the same mechanism is
used after disconnections a better name is probably 'reintegrating'. It
logs the operations and reintegrates them several at a time.

Now when a connected mode store operation fails (e.g. network timeout)
we don't know if it ever reached the server. So to make sure we don't
lose any data we switch from connected mode to 'reintegrating' and
perform the store operation again, this time logging it in the CML so
that we will resend the operation when we get connected again.

Now the false conflict is triggered when the server did in fact see the
connected store operation and committed it locally, but the reply was
lost because of the disconnection/network timeout. In that case the
retried store from the CML is trying to update a file that was already
updated on the server and it is flagged as a conflict.

Reintegration does know how to detect repeated operations because it
assumes we have bad connectivity, but it only works when the
operation was previously sent by reintegration. 6.9.0 never uses
connected mode, so everything is logged and reintegrated and this
type of false conflict doesn't happen.
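
To make the sequence concrete, here is a deliberately simplified model
(not the real protocol: ToyServer, the integer version counter, the
"op-1" identifier and the set of remembered ids are all assumptions made
for illustration). It compares a store whose first attempt went out in
connected mode with one whose first attempt was already a reintegration,
with the reply lost in both cases:

```python
class ToyServer:
    """Hypothetical server holding one file with a version counter."""

    def __init__(self):
        self.version = 0
        self.data = None
        self.reintegrated_ids = set()  # ids of operations seen via reintegration

    def _commit(self, base_version, data):
        # The update only applies if the client saw the current version.
        if base_version != self.version:
            return "conflict"
        self.version += 1
        self.data = data
        return "ok"

    def connected_store(self, base_version, data):
        # Connected mode: no operation id is remembered.
        return self._commit(base_version, data)

    def reintegrate(self, op_id, base_version, data):
        # A repeated reintegrated operation is recognised and accepted.
        if op_id in self.reintegrated_ids:
            return "ok"
        result = self._commit(base_version, data)
        if result == "ok":
            self.reintegrated_ids.add(op_id)
        return result


def store_with_lost_reply(first_attempt_connected):
    server = ToyServer()
    # First attempt is committed by the server, but the reply never arrives.
    if first_attempt_connected:
        server.connected_store(0, "new contents")
    else:
        server.reintegrate("op-1", 0, "new contents")
    # The client retries from the CML, still based on version 0.
    return server.reintegrate("op-1", 0, "new contents")


print(store_with_lost_reply(first_attempt_connected=True))
print(store_with_lost_reply(first_attempt_connected=False))
```

In this model the retry after a connected-mode store is flagged as a
conflict, while the retry after a reintegrated store is recognised as a
repeat, which mirrors the difference described above.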

Jan
Received on 2007-03-13 09:14:21