Coda File System

Re: Server keeps crashing after conflict resolve attempt

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Tue, 27 May 2008 11:53:37 -0400
On Tue, May 27, 2008 at 08:21:51AM -0400, Janusz Krzysztofik wrote:
> After I have tried to resolve a local conflict, my server has crashed.  
> Now it keeps crashing just after the client reconnects. My SrvLog ends 
> with:

Clearly something like this should not happen. If an operation sent by
the client doesn't apply cleanly to the server's internal state, we
should be sending an error back to the client instead of crashing.

So I am really interested in the full server log as well as the list of
operations the client is trying to reintegrate. As the server logs are
pretty big, please send them to me off-list. To get the list of pending
operations, run 'cfs listlocal /path/to/problematicvolume' on the
client.
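
Something like this should do it (untested; I am assuming the listing
is printed to standard output on your build, and the output file name
is just my suggestion):

    # on the client: save the pending CML records to a file you can send along
    cfs listlocal /path/to/problematicvolume > cml-listing.txt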

> 13:09:21 --PO: 1000003.8cfd0.c04d2
> 13:09:21 Entering VFlushVnode for vnode 8cfd0
> 13:09:21 Entering ObjectExists(volindex= 2, (467e7.c04d2)
> 13:09:21 ObjectExists: NO object 467e7.c04d2
> 13:09:21 ****** FILE SERVER INTERRUPTED BY SIGNAL 11 ******
> 13:09:21 ****** Aborting outstanding transactions, stand by...
> 13:09:21 Uncommitted transactions: 1
> 13:09:21 Uncommitted transactions: 1
> 13:09:21 Committing suicide now ........
>
> I had similar problems several times before and managed to restore
> operation by reinitializing the client cache. This time I would prefer
> to keep all the changes waiting for reintegration. Is there a way to
> skip the error-provoking operation without purging the client cache?

I guess we would first have to identify which operation is the
problematic one, and then we can discard operations one at a time from
the beginning of the reintegration log with 'cfs discardlocal' until we
get past the problematic ones.
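
Roughly like the following, although I have not tried it. I am assuming
'cfs discardlocal' takes the same volume path argument as 'cfs
listlocal' and drops a single record from the head of the log per call;
also keep in mind that every discarded record is a local change that is
permanently lost:

    # drop the head record of the CML, then inspect what remains
    cfs discardlocal /path/to/problematicvolume
    cfs listlocal /path/to/problematicvolume

Repeat the pair until the listing no longer starts with the operation
that kills the server.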

The problematic object seems to be the one with the file identifier
'1000003.8cfd0.c04d2'. The 1000003 part is the non-replicated
volume-id, so the object probably has a different volume-id on the
client. The failure seems to indicate that another object is missing,
most likely a directory. So maybe we are trying to create or move an
object into a directory that no longer exists, and I wonder if there is
a rename/removedir or create/removedir pair of operations in your CML.
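
If you want to check for such a pair yourself, something along these
lines may help. This is just a sketch: 'cml-listing.txt' is the file I
suggested saving above, and the operation names are my guess at what the
listing calls them, so adjust the pattern to match your actual output:

    # look for rename/rmdir or create/rmdir combinations in the saved listing
    grep -inE 'rename|rmdir|removedir|create|mkdir' cml-listing.txt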

If the problem is caused by such a combination, it may be possible to
artificially force the client to reintegrate in really small batches so
that the two operations end up in different reintegration attempts.
There is no easy way to do this; it would probably involve both
throttling the client's available bandwidth (a Lua script for rpc2) and
lowering the reintegration parameters (cfs wd -time 0.001) to the point
that we only push one record at a time. In theory it should be
possible, but I've never actually tried it.
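
The cfs half of that would look something like the following sketch.
The path is a placeholder, I am assuming 'cfs wd' accepts a volume path
the way the other commands do, and the bandwidth throttle still has to
come from the rpc2 side:

    # put the volume in write-disconnected mode with a tiny reintegration
    # time budget, so each attempt bundles as little as possible
    cfs wd -time 0.001 /path/to/problematicvolume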

Jan
Received on 2008-05-27 11:54:45