Coda File System

Re: rw access hangs forever

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Tue, 27 Mar 2007 21:53:02 -0400
On Wed, Mar 28, 2007 at 12:39:53AM +0200, Enrico Weigelt wrote:
> well, I just had several problems again: 
> 
> First the writing hung; after restarting venus, it worked again,
> and then a few mins later the coda directory was completely 
> inaccessible.
> 
> But here's my venus.err log from this time. 
> 
> Hope that helps a bit.

Yes, it does.

> Date: Wed 03/28/2007
> 
> 00:22:48 Coda Venus, version 6.0.15
...
> 00:22:49 	31 CML entries on free-list
> 00:22:49 starting FSDB scan (4166, 100000) (25, 75, 4)
> 00:22:50 	(history.dat, 50868448.7f000000.1e20.46be) found owrite object, discarding
> 00:22:50 	(lost file data backed up to /usr/coda/spool/history.dat-7)
> 00:22:50 	(.parentlock, 50868448.7f000000.2f6c.4c3c) found owrite object, discarding
> 00:22:50 	(lost file data backed up to /usr/coda/spool/.parentlock-8)
> 00:22:50 	(appreg, 50868448.7f000000.f84.c33) found owrite object, discarding
> 00:22:50 	(lost file data backed up to /usr/coda/spool/appreg-8)

Looks like the client crashed the last time it ran while several files
were still open for writing. Since we cannot be sure their contents are
consistent, we save a backup copy of each file in /usr/coda/spool.

This has clearly happened before; the -N extension is used to avoid
overwriting previous backup copies. I did a quick locate, and both
appreg and history.dat are firefox/mozilla-related files; I'm guessing
.parentlock is from the same application. There are also 31 changes
that have not yet been propagated back to the server.
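In case it helps to picture it, the recovery pass effectively picks the
first unused -N suffix for each backup, along these lines (a sketch
only; the helper below is made up and not the actual venus code):

    #include <cstdio>
    #include <string>
    #include <sys/stat.h>

    // Hypothetical helper, not the real venus code: find the first
    // unused "<file>-N" name in the spool directory so that earlier
    // backup copies are never overwritten.
    static std::string backup_name(const char *spooldir, const char *file)
    {
        char buf[4096];
        struct stat sb;
        for (int n = 0; ; n++) {
            snprintf(buf, sizeof(buf), "%s/%s-%d", spooldir, file, n);
            if (stat(buf, &sb) == -1)
                return buf;     // name is not taken yet
        }
    }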

...
> 00:29:48 fatal error -- Assertion failed: file "fso_cfscalls2.cc", line 269

This happened during the same run, and the assertion is interesting: it
indicates that we received a close upcall for a file that we never knew
was opened. This should never happen, which is why the fatal assertion
triggers (we don't even know whether the file was opened for reading or
for writing).
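The invariant is essentially this (an illustrative sketch, not the
actual code from fso_cfscalls2.cc; the type and function names are made
up):

    #include <cassert>

    // Venus tracks per-file open counts; a close upcall for a file
    // with no recorded opens is an internal inconsistency.
    struct OpenCounts {
        int readers;    // opens for reading we have seen
        int writers;    // opens for writing we have seen
    };

    void handle_close(OpenCounts &f, bool for_write)
    {
        // If neither count is positive we never saw the matching
        // open, so we cannot even tell which count to decrement.
        assert(f.readers > 0 || f.writers > 0);
        if (for_write)
            f.writers--;
        else
            f.readers--;
    }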

I am not sure how this could happen, unless venus was restarted with an
(intentionally) undocumented flag to rebind to an existing mountpoint.
If you started venus normally, there must be some sequence of
operations, probably from firefox/mozilla, that triggers a bug where we
forget that a file was previously opened.

> Date: Wed 03/28/2007
... 
> 00:30:58 Coda Venus, version 6.0.15

Ok, venus is restarted.

...
> 00:30:58 	32 CML entries on free-list

Ah, one additional change was added during the previous run before we
crashed.

> 00:30:58 starting FSDB scan (4166, 100000) (25, 75, 4)
> 00:30:58 	(key3.db, 50868448.7f000000.f9e.c40) found owrite object, discarding
> 00:30:58 	(lost file data backed up to /usr/coda/spool/key3.db-8)

And yet another mozilla/firefox file that was open for writing during
the crash. Maybe this is the one whose open count we got mixed up
about, but it may also be that the new CML record was the problematic
one, i.e. we thought we saw the final close for an owrite file, added a
CML_STORE record, and then got an unexpected additional close upcall.
Clearly we have crashed before while this file was still open for
writing.
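The suspected sequence would look roughly like this (again a made-up
sketch, continuing the earlier one and reusing its OpenCounts type;
append_cml_store is hypothetical):

    // The last write-close logs a store record; a duplicate close
    // arriving after that finds the writer count already at zero and
    // trips the assertion.
    void close_for_write(OpenCounts &f)
    {
        assert(f.writers > 0);        // a duplicate close fails here
        if (--f.writers == 0) {
            // append_cml_store(...); // hypothetical: log the final
            //                        // content as a CML_STORE record
        }
    }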

...
> 00:31:09 Resolved realm 'jupiter.local'
> 00:33:08 Reintegrate: /, 3/3 records, result = SUCCESS

About 2 minutes after we reconnect to the realm we manage to
reintegrate 3 records. In a way this is good news: there was no
conflict. On the other hand, there is no clear indication why the
client is unresponsive at this point. There should be at least 29 more
records waiting to be reintegrated, but that doesn't seem to be
happening.

Now, 6.0.15 is an older release, so I'm looking through the commit logs
to see if any fixes were committed that may be related to these
problems. I don't see anything obvious; one that may affect you in some
cases is the following:

    commit 58ad8b0f0ce88a28c8a1f5a27f929b49deb1a5a9
    Author: Jan Harkes <jaharkes_at_cs.cmu.edu>
    Date:   Wed Jun 7 16:56:04 2006 -0400

    Fix client crash in fsobj::Matriculate

    We were calling Matriculate outside of an RVM transaction.

This is an interesting one that would cause another assertion to
trigger. If we get two (almost) simultaneous requests for an uncached
object, the first one sends an RPC2 call to the server to get the
object metadata. At that point the object is in an uninitialized state.
The second opener will find the uninitialized object and block until
the first thread has received the server response and finalized the
object; this finalization is done by Matriculate. Now, when we are
reintegrating changes, the server sends callbacks to ensure that the
client revalidates that it has the same version of the object as is
present on the servers. So shortly after a reintegration there is a
good chance that we run into exactly this situation. The fix was
released as part of Coda-6.0.16.
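For the curious, here is a self-contained analogue of the race in plain
C++ with a mutex and condition variable (this is not the venus code,
and the transaction markers are only comments showing where the fix
moved the finalization):

    #include <condition_variable>
    #include <mutex>

    struct CachedObject {
        std::mutex m;
        std::condition_variable cv;
        bool initialized = false;

        // First opener: fetch the metadata, then finalize. The bug
        // was that the finalization ("Matriculate") ran outside the
        // RVM transaction that makes the new state persistent.
        void fetch_and_matriculate()
        {
            // ... RPC2 call to the server for the object metadata ...
            std::lock_guard<std::mutex> lk(m);
            // begin_rvm_transaction();  // the fix: finalize inside
            initialized = true;          // "Matriculate"
            // end_rvm_transaction();
            cv.notify_all();             // wake any blocked openers
        }

        // Second opener: found the object uninitialized, so block
        // until the first thread has finalized it.
        void wait_until_valid()
        {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [this] { return initialized; });
        }
    };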

The only other change that may be relevant: if you are currently using
the Coda kernel module that came as part of Linux-2.6, you might want
to try building the experimental one from

    http://www.coda.cs.cmu.edu/pub/coda/linux/kernel/linux-coda-6.5.tar.gz

One of the experimental changes in this version involves the way we
block signals while we are waiting for venus to read an upcall. I think
there is a chance that with the old method a signal could restart the
close() system call, which would cause venus to see a duplicate close
upcall.
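The idea, in userspace terms (a simplified analogue; the kernel module
obviously does not use sigprocmask, but the shape is the same):

    #include <signal.h>

    // Block all signals for the duration of the wait so it cannot be
    // interrupted and restarted; a restarted close() is what could
    // queue the close upcall a second time.
    void wait_with_signals_blocked()
    {
        sigset_t all, old;
        sigfillset(&all);
        sigprocmask(SIG_BLOCK, &all, &old);
        // ... wait for venus to read and answer the upcall ...
        sigprocmask(SIG_SETMASK, &old, nullptr);
    }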

Jan
Received on 2007-03-27 21:55:33