Coda File System

replica restoration by runt resolution

From: <u+codalist-wk5r_at_chalmers.se>
Date: Sat, 31 Jan 2009 17:53:17 +0100

Hi All,

Runt resolution (see subject) is a very convenient way to do disaster
recovery, highly recommended when you know what you are doing.
(Set up a fresh server, create empty volume replicas and run an ls -alR...)

Unfortunately I was hit by a subtle problem, so here is a word of caution.

If you happen to resolve big files while your servers have limited
bandwidth to each other, the resolution will fail.

High latency severely limits sftp transfer speeds and a file copy
may take over 5 minutes. I guess the following happens:
The lock watcher thread comes in, finds a lock over 5 minutes old,
believes it is stale and removes it.
The file is not resolved but instead becomes a server-server conflict,
and the new replica remains 0 bytes long.
You have to do a manual repair (replaceinc), which can become a disaster
if you have many such files.
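
A rough way to judge in advance which files could run into this is to
compare each file's estimated copy time with the lock age. The sketch below
is only illustrative: the 300-second limit is the 5-minute figure guessed
above, not a confirmed server setting, and the throughput is a number you
would have to measure on your own link. The names LOCK_TIMEOUT_S,
transfer_seconds and files_at_risk are hypothetical helpers, not part of Coda.

```python
import os

# Assumption: the ~5 minute stale-lock age guessed in the post, not a
# documented Coda constant.
LOCK_TIMEOUT_S = 300


def transfer_seconds(size_bytes, throughput_bytes_per_s):
    """Estimate how long one file copy takes at a sustained throughput."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_bytes / throughput_bytes_per_s


def files_at_risk(root, throughput_bytes_per_s, timeout_s=LOCK_TIMEOUT_S):
    """Yield (path, size, estimated_seconds) for every regular file under
    root whose estimated copy time exceeds timeout_s."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size
            except OSError:
                # Skip entries that cannot be examined.
                continue
            estimate = transfer_seconds(size, throughput_bytes_per_s)
            if estimate > timeout_s:
                yield path, size, estimate
```

For example, at a sustained 100 kB/s any file larger than about 30 MB
(300 s x 100 kB/s) would be flagged. Note that walking a volume from a
client after the empty replica exists may itself trigger resolution, so
such a scan is best run beforehand.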

(I wonder whether this lock timeout is, or could be made, configurable?)

My workaround was to set up a temporary server near the remaining one,
populate it by runt resolution, then ship the server contents (the local
files) over the slow link - or physically - to the actual server host.
Unfortunately this means the server's IP address changes twice, so
the client(s) involved have to be reinitialised and the servers
restarted at least twice.

Regards,
Rune
Received on 2009-01-31 11:54:28