Coda File System

Re: CODA and DRBD, speed of transport

From: Jan Harkes
Date: Mon, 19 Jun 2006 14:01:01 -0400
On Mon, Jun 19, 2006 at 04:39:57PM +0200, Reiner Dassing wrote:
> 1. Is it possible to setup SCM as a HA-Cluster via DRBD/heartbeat?
>    Is it possible to mirror the raw devices with the help of DRBD?
>    I think. Yes

The only reason why the SCM is 'special' is that the contents of
/vice/db are replicated by a master-slave mechanism by the updatesrv and
updateclnt daemons. That data is however mostly read-only. If the SCM is
unavailable it only prevents you from adding new volumes and users, and
users cannot change their passwords.

For the rest all Coda servers are equivalent. So if your realm is set up
to have 2 or more 'root servers', the volume lookup requests will
succeed as long as any of these root servers is available. Also, if your
rootvolume has at least 2 replicas, clients will be able to connect as
long as at least one replica is still alive.

Writes are allowed even when some replicas are unavailable. However, to
facilitate server-server resolution when the unavailable replicas come
back online, the server maintains a per-volume log of recent operations.
This log has a limited size, I think it defaults to 4096 operations, but
it can be increased with volutil setlogparms.

If a server's log fills up completely the server will die, and it will
have some trouble resolving once it is brought back on-line as some
additional log space is needed to complete the resolution.

I guess you could use DRBD to keep the contents of /vice/db in sync
between the servers. The only missing bit is that when updateclnt pulls
new updates it signals the server to re-read the volume location and
replication databases. It might be possible to make the server
periodically check the timestamps on these files and avoid the need for
such a signal.
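
A minimal sketch of the timestamp-polling idea, in Python. It is not how
the Coda server is implemented; DbWatcher and poll_once are hypothetical
names, and the files to watch are whatever database files you pass in.

```python
import os
from typing import Callable, Dict, Iterable, List, Optional


class DbWatcher:
    """Detect changes to a set of files by comparing modification times."""

    def __init__(self, paths: Iterable[str]) -> None:
        self.paths = list(paths)
        # Remember the mtime seen at start-up for every watched file.
        self.seen: Dict[str, Optional[int]] = {
            p: self._mtime(p) for p in self.paths
        }

    @staticmethod
    def _mtime(path: str) -> Optional[int]:
        # A missing file maps to None, so its later creation is a change.
        try:
            return os.stat(path).st_mtime_ns
        except FileNotFoundError:
            return None

    def changed(self) -> List[str]:
        """Return paths whose mtime differs from the last check."""
        out = []
        for p in self.paths:
            current = self._mtime(p)
            if current != self.seen[p]:
                self.seen[p] = current
                out.append(p)
        return out


def poll_once(watcher: DbWatcher,
              reload: Callable[[List[str]], None]) -> bool:
    """Call reload(changed_paths) if anything changed; report whether it did."""
    changed = watcher.changed()
    if changed:
        reload(changed)
    return bool(changed)
```

A server would call poll_once from a periodic timer and re-read its
databases in the reload callback, which takes the place of the signal
from updateclnt.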

One thing I can't see from the DRBD documentation is how applications
are supposed to figure out whether they are running on a primary or a
secondary node. And if the auth2 daemon doesn't know, then how can the
user that wants to change his password tell which auth2 daemon
supposedly has read/write access to the file?

If your current SCM dies, you either restart it, or if the failure is
more serious you could elect a new SCM by writing the hostname of the
new SCM in /vice/db/scm on all servers and restarting the auth2 and
update daemons.

> 2. As we want to use CODA for data exchange between remote hosts
>    and lokal one with the help of a CODA server, we must know
>    whether a specific file is definitely stored on CODA
>    before, we make a cfs disconect and cunlog at the remote client.
>    (For some remote clients we have to pay for volume and time.)
>    How is this possible?

There is a 'cfs forcereintegrate' command. This will synchronously
trigger a reintegration and only return when it either completes
successfully or fails. Once the reintegration returns, cfs checks the
length of the client modification log to see if everything got
reintegrated. I'm not sure if the exit code correctly indicates whether
reintegration succeeded.
I'm not sure cunlog is really necessary on the remote client. cfs
disconnect simply makes RPC2 unable to transmit (or receive) any data.
So it is a pretty effective way to disable all Coda traffic. On the
other hand, it only adds the low-level filter, so the client doesn't
actually know it cannot reach the servers until RPC2 times out.
Similarly, when reconnecting we only remove the filter, but leave it up
to the higher layers to figure out for themselves that the servers are
actually available again. (disconnect/reconnect are mostly a debugging
aid so that we don't have to pull our network cables all the time.)

So you might want to add a 'cfs checkservers' after the disconnect and
reconnect to force the higher layers to notice the state change faster.
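
The sequence above could be scripted roughly as follows. This is a
sketch under assumptions: it uses only the cfs subcommands named in this
mail, the function names are hypothetical, and whether forcereintegrate
accepts path arguments should be checked against your installed cfs. A
non-zero exit status is treated as failure, but a zero status is not
proof that everything was reintegrated.

```python
import subprocess
from typing import Callable, List, Sequence

# A runner takes the cfs arguments and returns the exit status.
Runner = Callable[[List[str]], int]


def run_cfs(args: List[str]) -> int:
    """Run one cfs subcommand and return its exit status."""
    return subprocess.run(["cfs"] + args).returncode


def flush_and_disconnect(paths: Sequence[str] = (),
                         run: Runner = run_cfs) -> bool:
    """Reintegrate synchronously, then cut off Coda traffic.

    Returns False, and stays connected, if forcereintegrate
    reports a failure.
    """
    # ASSUMPTION: passing paths to forcereintegrate is optional.
    if run(["forcereintegrate", *paths]) != 0:
        return False
    run(["disconnect"])
    # Make the higher layers notice the state change right away.
    run(["checkservers"])
    return True


def reconnect(run: Runner = run_cfs) -> None:
    """Remove the RPC2 filter and probe the servers."""
    run(["reconnect"])
    run(["checkservers"])
```

The runner is injectable so the order of commands can be checked without
a Coda client installed.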

> 3. Is it possible to setup the CODA client to force venus to make the
>    update to the CODA server as fast as possible;
>    i.e., as soon the file is copied (via cp) to /coda/... the update
>    should start?

That depends on which 'state' the client is in.

If the client is operating in 'fully connected' mode it will immediately
send the update to all available servers as soon as the application
closes the file descriptor. If the client is in disconnected mode, there
is nothing we can do until the servers become available.

If the client is in write-disconnected mode then there are some
parameters that define how long an operation needs to remain in the log
before it is eligible for reintegration and how long a reintegration is
allowed to 'hog' the connection. The default age is set to about a
minute, because many files only have a very short life-span and we
actually save a lot of network traffic by not immediately writing them
to the server.
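
The age rule can be pictured with a toy model in Python. This is not
venus code; Record and eligible_prefix are hypothetical names, and the
model assumes records are reintegrated in log order, so only a leading
run of sufficiently old records is eligible. The hog time is not
modelled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    op: str           # e.g. "store", "create", "remove"
    path: str
    logged_at: float  # seconds, time the record entered the log


def eligible_prefix(log: List[Record], now: float,
                    age: float) -> List[Record]:
    """Return the leading records that have been logged for at least `age` seconds."""
    out = []
    for rec in log:
        if now - rec.logged_at < age:
            # Too young; later records are younger still, so stop here.
            break
        out.append(rec)
    return out
```

With an age of 60 and records logged at t=0, t=30 and t=100, a check at
t=100 selects the first two records. With an age of 0 every record is
eligible as soon as it is logged.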

These parameters can be changed by the user, but I'm not sure they are
persistent in the released stable version; a disconnection followed by
a reconnection may drop them back to the default settings.

I've been working on a version of Coda that does not have a connected
mode operation, which does persistently store the CML age and hog time
values. Running it with age 0 and hogtime 1 works really nicely. On a
fast network everything is reintegrated pretty quickly after an update
(within 5 seconds), but on a slow link and a lot of local write activity
it will not be able to reintegrate everything within a second, so it
will start to build up a little CML backlog and this allows the client
to optimize away unnecessary operations that created temporary files and
such. There is also a special combination, age 0 and hogtime 0, which
forces a synchronous reintegration before we return to the user. So
effectively when for instance the close() syscall has returned the store
operation has already been reintegrated. It is looking very promising,
but dealing with conflicts is considerably more difficult with the new
code base.
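
Why a small CML backlog saves traffic can be shown with a toy model in
Python. It is only an illustration, not the optimizer in venus, and
cancel_temporaries is a hypothetical name. The assumption is that a file
both created and removed while its records are still in the log never
needs to reach the server at all.

```python
from typing import Dict, List, Tuple

Op = Tuple[str, str]  # (operation, path)


def cancel_temporaries(log: List[Op]) -> List[Op]:
    """Drop every record for paths created and later removed within the log."""
    first: Dict[str, str] = {}
    last: Dict[str, str] = {}
    for op, path in log:
        first.setdefault(path, op)  # first operation seen for this path
        last[path] = op             # most recent operation for this path
    doomed = {p for p in first
              if first[p] == "create" and last[p] == "remove"}
    # Keep the original order of the surviving records.
    return [(op, p) for op, p in log if p not in doomed]
```

A file that already existed on the server (its first logged operation is
not a create) keeps its records, because the server still has to see the
remove.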

(Actually, dealing with a reintegration conflict on top of a
server-server conflict has been impossible for Coda clients. With the
new code they can in principle be repaired, but there are still far too
many cases that are not handled correctly.)

Received on 2006-06-19 14:02:35