Coda File System

Re: Some more question about argument against having both coda-server and client on the same machine

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Tue, 25 Oct 2005 13:41:00 -0400
On Mon, Oct 24, 2005 at 03:48:28PM +0400, mkondrin wrote:
> I have read the Coda wiki page with Jan Harkes' arguments against
> installing the Coda server and Coda client on the same host. If I get it
> right, the problem is that the client checks the server's response time
> to decide whether it is disconnected. If the server does not respond
> within a period equal to "typical response time" times "some unknown
> multiplier" times "number of retries", the client marks this server as
> disconnected. When both server and client reside on the same
> machine typical response time is about zero. So in the moments when the 

Actually, there is a lower limit of about 300ms on the roundtrip time
estimates; otherwise we'd disconnect almost immediately. However, if the
server process is blocked while syncing updates to disk, a reply can
still take longer than that.
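
As a rough sketch of how such a floor changes the arithmetic from the
question: the multiplier and retry count below are made-up values and
the function name is hypothetical; none of them come from the Coda or
RPC2 sources. Only the ~300ms floor is taken from the text above.

```python
# Illustrative only: multiplier and retries are assumed values.
RTT_FLOOR = 0.300  # seconds; the ~300ms lower limit on the RTT estimate


def disconnect_deadline(rtt_estimate, multiplier=2.0, retries=4):
    """Seconds of silence after which the client would give up."""
    # Clamp the estimate so a near-zero local RTT cannot collapse
    # the deadline to (almost) nothing.
    clamped = max(rtt_estimate, RTT_FLOOR)
    return clamped * multiplier * retries


local = disconnect_deadline(0.0001)  # server on the same host
remote = disconnect_deadline(0.5)    # slow link, estimate above the floor
```

With these assumed numbers a same-host server still gets a couple of
seconds before the client gives up, rather than a fraction of a
millisecond.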

The disconnection is actually wanted in many cases. A reintegrating
client sends updates in batches of up to 100 operations, which are
committed by the server in a single transaction. Last time I checked,
committing a transaction was the part where the server spent most time
(not CPU, but wall-clock, waiting for the sync to disk to complete).

So reintegration ends up significantly reducing server load; it can be
up to 100 times faster, because one commit covers up to 100 operations.
The adaptation works very well.
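
To make the "up to 100 times" figure concrete, here is a minimal
sketch that just counts commits. It assumes a connected client costs
the server one transaction per operation, which is a simplification
for illustration and not a statement about the actual server code; the
function name is hypothetical.

```python
import math

BATCH_LIMIT = 100  # "batches of up to 100 operations"


def commits_needed(num_ops, reintegrating):
    """Count server transactions (disk syncs) needed for num_ops updates.

    Assumption: without reintegration every operation is its own
    transaction; with reintegration one transaction covers a batch.
    """
    if not reintegrating:
        return num_ops
    return math.ceil(num_ops / BATCH_LIMIT)


connected_cost = commits_needed(250, reintegrating=False)
reintegration_cost = commits_needed(250, reintegrating=True)
```

Since the sync to disk dominates the wall-clock time, the number of
commits is a reasonable first-order proxy for how long the server is
busy.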

> is resided on the same machine as the client. If the client thinks that 
> this one server is disconnected wouldn't it try another server? Or is 
> the "zero response time" applied for this server too?

No, each server will time out differently, but if the system is
thrashing, it will affect the client just as much as the server. So the
client might be blocked while swapping and not see the response from the
second server fast enough.

> Can we disable disconnection check on the client (suppose we have an
> environment where all connections are physically strong) so it would
> never disconnect even if it "hangs"?

Not really. Even if you tweak the timeout values to make the client
wait longer, something, somewhere will probably cause a problem. For
instance, your client might become more prone to livelock, which is
currently resolved by the fact that when an RPC operation times out we
disconnect, and operations then either fall back to cached contents or
fail. Or it might deadlock when it runs out of worker threads.
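
The fallback described here can be sketched roughly as follows. This
is a toy model, not the real client: the `Client` class, the
`RpcTimeout` exception and the `fetch` callable are all hypothetical
names introduced for illustration.

```python
class RpcTimeout(Exception):
    """Raised by the (hypothetical) fetch callable when the server is silent."""


class Client:
    def __init__(self, fetch):
        self.fetch = fetch      # callable(path) -> data, may raise RpcTimeout
        self.cache = {}         # path -> last data seen from the server
        self.connected = True

    def read(self, path):
        if self.connected:
            try:
                data = self.fetch(path)
                self.cache[path] = data
                return data
            except RpcTimeout:
                # Give up on the server instead of waiting forever.
                self.connected = False
        # Disconnected: serve cached contents, or fail.
        if path in self.cache:
            return self.cache[path]
        raise TimeoutError(f"{path}: server unreachable and not cached")
```

The point of the sketch is that the timeout is what frees the caller:
without it, `read` would simply sit inside `fetch` holding a worker
thread.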

Also, it isn't always the client that gives up on the server; sometimes
the server intentionally drops the client, either to handle token
expiry, or because the client isn't acking callback breaks, which often
happens when clients are mobile or behind masquerading firewalls.

If the server waited indefinitely for a callback break to succeed, some
writer would be blocked waiting for that callback, which means a client
would be blocked. Now there are situations where two clients are trying
to update different files, each of which requires a callback to be
broken on the other client, but neither client has a spare thread to
handle those callbacks. This would lead to a complete deadlock were it
not for the fact that some RPC will time out. If either of the write
operations times out, that client is disconnected and the other can
proceed. If the callback times out, the client is nak'ed, which is
handled like a quick disconnect/reconnect. Both writes can proceed,
but the nak'ed client has to revalidate its cache for potentially
missed callbacks.
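
A toy model of the callback-timeout case, under loud assumptions: the
`Server` class, its fields and the `acks_in_time` predicate are
hypothetical, and real callback bookkeeping is considerably more
involved. It only shows that a write never waits on a client that
fails to ack.

```python
class Server:
    def __init__(self):
        self.callbacks = {}  # path -> set of client names holding a callback
        self.naked = set()   # clients dropped after a callback timeout

    def write(self, path, writer, acks_in_time):
        """Break callbacks held by other clients, then let the write proceed.

        acks_in_time(client) -> bool stands in for "the callback RPC
        returned before its timeout".
        """
        others = self.callbacks.get(path, set()) - {writer}
        for client in sorted(others):
            if not acks_in_time(client):
                # Timed out: nak the client; it must revalidate its
                # cache after reconnecting.
                self.naked.add(client)
        # Assumption: afterwards only the writer holds a callback.
        self.callbacks[path] = {writer}
        return True


server = Server()
server.callbacks = {"f": {"A", "B"}, "g": {"A", "B"}}
no_spare_threads = lambda client: False  # nobody can service callbacks
ok_a = server.write("f", "A", no_spare_threads)
ok_b = server.write("g", "B", no_spare_threads)
```

In this run both clients end up nak'ed and both writes complete, which
corresponds to the "both writes can proceed" outcome above rather than
a deadlock.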

Distributed systems are fun :)

Jan
Received on 2005-10-25 13:42:16