Coda File System

Re: emacs and coda on NetBSD/1.6

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Wed, 12 Feb 2003 13:55:40 -0500
On Wed, Feb 12, 2003 at 12:00:13PM -0500, Greg Troxel wrote:
> 11:49:10 worker::Return: message write error 3 (op = 20, seq = 92135),
> wrote -1 of 12 bytes
> 11:49:10 DispatchWorker: signal received (seq = 92135)

Error 3 (ESRCH) is returned when the upcall was cancelled because the
userspace process received an interrupt. In this case we tried to reply
after the signal was handled, but before it was seen by venus.
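
For anyone matching these log lines up by hand, here is a minimal sketch that pulls the error number, opcode and sequence number out of a "message write error" line. The struct and function names are hypothetical and not part of venus. The format string is only inferred from the two excerpts quoted in this thread, and the claim that error 3 is ESRCH assumes the usual BSD/Linux errno numbering.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Hypothetical record for one "message write error" log line.
struct WriteError {
    int err;   // errno value reported by the failed write (3 == ESRCH here)
    int op;    // upcall opcode
    long seq;  // upcall sequence number
};

// Returns true and fills *out if the line contains a write-error report.
// The expected layout is an assumption based on the quoted log excerpts.
bool parse_write_error(const char* line, WriteError* out) {
    assert(line != nullptr && out != nullptr);
    const char* p = std::strstr(line, "message write error");
    if (p == nullptr) return false;
    return std::sscanf(p, "message write error %d (op = %d, seq = %ld)",
                       &out->err, &out->op, &out->seq) == 3;
}
```

Matching the seq value against the "DispatchWorker: signal received" lines shows whether the signal was seen before or after the reply was attempted.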

> 11:49:24 DispatchWorker: signal received (seq = 92151)
> 11:49:25 worker::Return: message write error 3 (op = 20, seq = 92151),
> wrote -1 of 12 bytes

And here we saw the signal about a second before we tried to send a
reply back. Venus tries to interrupt the worker thread that is handling
the upcall, but generally won't be able to actually stop it.

> I do not have this problem on NetBSD/{i386,sparc} 1.5.4ish or
> FreeBSD/i386 4.7ish.
> 
> I wonder if emacs is doing some sort of asynchronous IO that isn't
> handled correctly in the NetBSD kernel.

We block some signals in the kernel module while we're waiting for a
reply from venus. Perhaps the signal handling has changed and they are
still coming through.

In the Linux kernel module, we never allow any interrupts for
CODA_CLOSE. This is very important: if that upcall is aborted before
venus picks it off the queue, venus will have a non-zero 'pending
writers' count and will never propagate changes back to the server. The
file will also be considered 'dirty' during startup and moved aside to
/usr/coda/spool.

Besides that, we also ignore all other signals except SIGKILL and SIGINT
for at least 30 seconds. I believe it was xemacs that was sending
various signals to itself during system calls.
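
Written out as a plain decision function, the Linux policy described above might look like the sketch below. This is a userspace model only, not the kernel module's code: the enum, the function name and the constant are made up for illustration, and the 30 seconds is the "at least 30 seconds" figure from this message.

```cpp
#include <cassert>
#include <signal.h>

// Hypothetical upcall classification; only Close is treated specially.
enum class Upcall { Close, Fsync, Other };

// Assumed grace period during which ordinary signals are ignored.
constexpr int kGraceSeconds = 30;

// Returns true if a signal should be allowed to abort a pending upcall.
bool may_interrupt(Upcall op, int signo, int waited_seconds) {
    assert(waited_seconds >= 0);
    // CODA_CLOSE is never interruptible, so venus always sees the close.
    if (op == Upcall::Close) return false;
    // SIGKILL and SIGINT are let through immediately.
    if (signo == SIGKILL || signo == SIGINT) return true;
    // Everything else is ignored until the grace period has passed.
    return waited_seconds >= kGraceSeconds;
}
```

With this model, a SIGALRM five seconds into a CODA_FSYNC upcall is ignored, while the same signal after 30 seconds aborts the upcall.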

I thought Bob Baron implemented something similar in the *BSD kernel
modules, except that he didn't even let SIGINT and SIGKILL through, and
automatically aborted the upcall after 60 seconds or something.

> I also get emacs into a state where it is nonresponsive and stuck in
> 'R' according to ps.

Does NetBSD have the 'D' state (i.e. blocked in a syscall)? Perhaps emacs
is repeatedly calling the same system call, a signal interrupts it, and
because the call 'failed' it tries again.
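
To make the suspected pattern concrete, here is a minimal sketch of an application retrying fsync(2) whenever it is interrupted. I have not checked what emacs actually does, so this is an assumption about its behaviour; the function name and the attempt limit are hypothetical. Without the limit, a process that keeps getting interrupted would spin in exactly the 'R' state described above.

```cpp
#include <cassert>
#include <cerrno>
#include <unistd.h>

// Retry fsync() while it fails with EINTR, up to max_attempts times.
// Returns 0 on success, -1 on failure with errno set by the last attempt.
int fsync_with_retry(int fd, int max_attempts) {
    assert(max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (fsync(fd) == 0) return 0;   // data reached stable storage
        if (errno != EINTR) return -1;  // real error, do not retry
        // EINTR: a signal arrived during the call, so try again.
    }
    errno = EINTR;  // gave up after repeated interruptions
    return -1;
}
```

If every retry turns into a fresh CODA_FSYNC upcall that is aborted in turn, the log would show pairs of "signal received" and "message write error 3" lines like the ones quoted earlier.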

Interestingly enough the upcall that is aborted is CODA_FSYNC. When the
file is open for writing, venus will call sync(2) and then flush any
pending RVM operations to the log (which calls fsync). Seems a bit
useless really because it doesn't 'commit' venus to anything and doesn't
really guarantee that updates will be seen by venus if it crashes.
With all the syncs, I can believe venus will block for a bit, especially
when that causes an impatient emacs to loop on the fsync(fd).

The simplest thing is probably to avoid the useless syncs in venus (patch
attached). What I really want at some point is for fsync to trigger a
store operation, i.e. it would be a synchronization point, and any
updates made up to that point in time would be propagated to the server.
That would allow an application to commit updates without
closing/reopening the file. Right now this wouldn't do much in
write-disconnected mode, because pending stores are optimized away when
the file is opened for writing.

Jan


--- coda/coda-src/venus/worker.cc.orig	2003-01-31 22:22:53.000000000 -0500
+++ coda/coda-src/venus/worker.cc	2003-02-12 13:50:56.000000000 -0500
@@ -1179,7 +1179,7 @@
 		{
 		LOG(100, ("CODA_FSYNC: u.u_pid = %d u.u_pgid = %d\n", u.u_pid, u.u_pgid));
 		MAKE_CNODE(vtarget, in->coda_fsync.Fid, 0);
-		fsync(&vtarget);
+		//fsync(&vtarget);
 		break;
 		}
 
Received on 2003-02-12 13:59:46