Coda File System

Re: process suspension issues

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Tue, 18 May 2004 23:41:34 -0400
On Sun, May 16, 2004 at 11:54:37PM -0400, shivers_at_cc.gatech.edu wrote:
> Suppose on my coda client, I say
>     cp *.c /coda/myserver/src
> and copy a lot of files to my coda filesystem. Both client & server are on the
> net, the client via cable modem. If, in the middle of the cp, I suspend
> the process by typing ^z in unix, two bad things happen:
> 
>   - It takes 3 minutes to suspend. (Which basically renders a large portion
>     of the reason to suspend moot.)

The close(2) syscall is the only syscall that we absolutely can't
interrupt in the kernel because if we do we don't know whether the
kernel still has an active reference to the file. So if you had hit ^Z
anywhere during the open or write calls it would have suspended right
away.

If your client was operating write-disconnected the close upcall would
have simply triggered a log update and returned quickly, while the
actual store to the server would have happened in the background.

You probably told the client to (try to) stay in fully connected mode
and in that case the semantics are such that close(2) won't return until
we know for sure that the update has been committed on the server. This
clearly took about 3 minutes.
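
Since in connected mode the store is only known to be committed once
close(2) returns, an application writing into /coda would want to look at
the result of close rather than ignore it. A minimal sketch in C of what
that could look like; the helper name write_whole_file is made up for
illustration and is not taken from cp or from Coda:

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes to path.
 * Returns 0 on success, -1 on failure with errno set.
 * The point is the final close(): on a file system that ships the data
 * at close time, a store failure can first become visible there. */
static int write_whole_file(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    size_t off = 0;
    while (off < len) {
        ssize_t n = write(fd, buf + off, len - off);
        if (n == -1) {
            if (errno == EINTR)
                continue;       /* interrupted before any data was written */
            int saved = errno;
            close(fd);
            errno = saved;
            return -1;
        }
        off += (size_t)n;
    }

    /* close() is not retried on failure: the descriptor state afterwards
     * is unspecified, so the error is simply reported to the caller. */
    return close(fd) == -1 ? -1 : 0;
}
```

A caller would treat a -1 return as "this file may not have reached the
server", regardless of whether the individual write calls succeeded.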

>   - The current copy operation aborts with an "syscall interrupted" error and
>     so when I later resume the cp job, I will find that one of my files failed
>     to get copied.

Once we get a response from the server that the operation completed, we
re-enable interrupts. At that time we notice the SIGTSTP signal and
process it, which triggers the syscall interrupted message. However, the
operation must have been completed and committed on the server, as we got
a reply for the upcall. I don't know for sure why a file would not be
copied, but maybe the signal isn't seen until the next operation starts
(the open call for the next file), and we return EINTR or something to
the cp application, which simply handles that as a fatal failure
instead of a retryable one.
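
If the open call really does come back with EINTR, treating that as
retryable takes only a small loop. The following is a sketch under that
assumption, not a description of what cp actually does; open_retry is a
hypothetical name:

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical helper: call open(2) again whenever it was interrupted by
 * a signal before it could complete. Any other error is returned as-is. */
static int open_retry(const char *path, int flags, mode_t mode)
{
    int fd;

    do {
        fd = open(path, flags, mode);
    } while (fd == -1 && errno == EINTR);

    return fd;
}
```

The same loop must not be applied to close(2); after an interrupted close
it is not safe to assume the descriptor is still valid.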

> Is this standard behavior for coda? Am I doing something wrong?

I'm guessing it is an application behaviour of cp that is normally quite
hard to trigger. You would have to time the suspend signal so that it
arrives during the open call and while the kernel happens to be in an
interruptible sleep, maybe while performing disk IO. It is trivial to
trigger this case on Coda because we actually go into such an
interruptible sleep every time we need to inform the userspace cache
manager, and for longer periods of time than a typical disk IO operation.

Jan
Received on 2004-05-18 23:43:36