Coda File System

Re: How To Repopulate A Server

From: Patrick J. Walsh <>
Date: Mon, 11 Sep 2006 14:11:34 -0600

	We'll give that a try in the next day or two and report back on
how it went.  Thanks very, very much for your help.

Patrick Walsh
eSoft Incorporated
303.444.1600 x3350

On Sep 11, 2006, at 1:09 PM, Jan Harkes wrote:

> On Mon, Sep 11, 2006 at 08:36:02AM -0600, Patrick J. Walsh wrote:
>> 	Looking at the source code, I *think* we hit the limit for the
>> number of files we can have in a directory.
>
> Looks like it, the error seems to be EFBIG (file too large) when it
> tries to add a new entry to the directory.
>
>> Luckily, and for some odd reason, our other coda server was still
>> running without problems.  So we turned off the problematic coda
>> server and pruned out the directories.
>
> That's nice, servers have an annoying habit of dying at the same time
> in these cases. I guess your client was weakly-connected and tried to
> reintegrate with only this replica. Actually the server log you
> attached seems to indicate that the server starts up fine, but then
> dies during a resolution attempt. So the problem may actually be in
> the server that is still running and is being propagated to the
> crashing server during log-based resolution.
>
> The safest thing right now would be to create a backup tarball of
> anything in that volume that you care about. Destroying/re-resolving
> the replica on the crashing server will use a different resolution
> mechanism (runt-resolution), which may work and solve the problem
> (successful resolution truncates the resolution logs so the bad
> create won't get sent anymore), but it may also cause the still
> running server to realize something is wrong and die.
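The backup step above could be sketched roughly as follows. This demo runs on a scratch directory under /tmp, since the real mountpoint under /coda depends on the setup; the directory and file names here are made up for illustration.

```shell
# Demo of the backup step on a scratch directory. On a real client you
# would point voldir at the affected volume's mountpoint under /coda.
voldir=/tmp/demo-volume
mkdir -p "$voldir"
echo "irreplaceable data" > "$voldir/file.txt"
# Create the tarball, then list it to confirm the contents made it in.
tar -czf /tmp/volume-backup.tar.gz -C /tmp demo-volume
tar -tzf /tmp/volume-backup.tar.gz
```

Keeping the tarball outside /coda matters here, since the point is to survive the replica being destroyed.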
>> 	Now the question is, how can we get the problematic coda server
>> started back up?  Assuming there isn't some other problem, is there a
>> way to start up the coda server and have it wipe out its existing
>> knowledge of what files are on what volumes and then rebuild that
>> knowledge from the working server, similar to how we set it up in the
>> first place (with an ls -lR or something)?
>
> If your server really crashes during the salvage phase, we can
> temporarily disable salvaging and make sure there are no other volumes
> with problems.
>
>     cat > /vice/vol/skipsalvage << EOF
> 1
> 2000004
> EOF
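Judging from the example, the skipsalvage format appears to be a count on the first line followed by one volume replica id per line; that is an inference from the example, not a documented spec. A sketch that writes such a file for several replicas (the second id is made up, and the path is a scratch file rather than the real /vice/vol/skipsalvage):

```shell
# Sketch: build a skipsalvage file for several replicas. Format assumed
# from the example above: first line is the number of ids that follow.
vols="2000004 2000005"       # replica ids to skip (second one is made up)
file=/tmp/skipsalvage        # /vice/vol/skipsalvage on a real server
set -- $vols                 # split the list into positional parameters
echo "$#" > "$file"          # first line: how many ids follow
for v in "$@"; do
    echo "$v" >> "$file"     # one replica id per line
done
cat "$file"
```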
>
> Then start the server and see if it comes up. Because the volume will
> not be attached there are going to be errors in the logs about VLDB
> lookup failures when clients attach and try to revalidate the missing
> replica.
>
> If this worked we can shut the server back down and use 'norton' to
> mark the volume so that it will get deleted during startup before it
> tries to fsck everything. Then the server should be able to start with
> the missing replica. Finally we have to recreate the underlying volume
> replica that was marked for destruction and purged during startup.
>
> You'll need to gather some information, which is probably easier to
> get now before we start blowing away replicas and such; besides, it
> is good information to know so we can double-check we're actually
> blowing away the right volume.
>
> It looks like the broken replica is 2000004; you need to find which
> replicated volume it belongs to.
>
>     grep -i 2000004 /vice/db/VRList
>
> The replicated volume number is the one in the second column, starting
> with 7f. Also note which position this replica has in the list, e.g.
>
>     vm:u.jaharkes 7f000604 2 d1000129 c80000df 00000000 00000000 00000000 ...
>
>     replicated volume id = 7f000604
>     replica index for d1000129 = 0
>     replica index for c80000df = 1
>
> Knowing the index is useful because the replicas are named based on
> the replicated volume name + index. So in my example volume d1000129
> has the name vm:u.jaharkes.0 and volume c80000df has the name
> vm:u.jaharkes.1.
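That name/index mapping can be derived mechanically from a VRList line. A sketch, using the example entry from this message (real entries carry more fields after the replica ids, which this ignores; the `...` fields are stand-ins):

```shell
# Hedged sketch: compute replica indexes and names from a VRList line.
# Fields: name, replicated volume id, replica count, then replica ids.
line="vm:u.jaharkes 7f000604 2 d1000129 c80000df 00000000 00000000 00000000"
set -- $line                      # split the line on whitespace
name=$1 repvol=$2 count=$3
shift 3                           # $1.. are now the replica ids
i=0
while [ "$i" -lt "$count" ]; do
    echo "replica $1 -> index $i, name $name.$i"
    shift
    i=$((i+1))
done
```

With this entry it reports d1000129 as index 0 (vm:u.jaharkes.0) and c80000df as index 1 (vm:u.jaharkes.1), matching the example above.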
>
> You also need to get the rvm log and data parameters from
> /etc/coda/server.conf.
>
>     grep ^rvm /etc/coda/server.conf
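For scripted use, the rvm values can also be pulled out individually rather than sourcing the whole file. A sketch; the config values below are made-up placeholders written to a scratch file, not values from this server (on a real server you would point `conf` at /etc/coda/server.conf and skip the heredoc):

```shell
# Placeholder config for the demo; a real server.conf has key=value
# lines like these among its other settings.
conf=/tmp/server.conf
cat > "$conf" << 'EOF'
rvm_log=/vice/LOG
rvm_data=/vice/DATA
rvm_data_length=0x10000000
EOF
# Extract each rvm_* value by stripping its key= prefix.
rvm_log=$(sed -n 's/^rvm_log=//p' "$conf")
rvm_data=$(sed -n 's/^rvm_data=//p' "$conf")
rvm_data_length=$(sed -n 's/^rvm_data_length=//p' "$conf")
echo "norton -rvm $rvm_log $rvm_data $rvm_data_length"
```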
>
> It should also be possible to have bash parse that file. So now we'll
> shut down the server.
>
>     volutil shutdown
>
> ... check the log to see if the server is completely shut down.
>
>     . /etc/coda/server.conf
>     norton -rvm $rvm_log $rvm_data $rvm_data_length
>
> Then with norton we can double-check the values we have,
>
>     norton> show volume 0x2000004
>
> This should show the name and replicated volume id (groupid?). If
> everything seems to match up correctly we can mark the volume for
> deletion,
>
>     norton> delete volume 0x2000004
>     norton> quit
>
> Now we can remove the skipsalvage file; the volume will be completely
> purged so there is no reason to skip it during salvage,
>
>     rm /vice/vol/skipsalvage
>
> Then we restart the server; it will take a while because it is going
> to delete everything related to that volume.
>
>     startserver &
>
> Starting it in the background so we can keep an eye on the server log.
>
> Once the server is back we can recreate the volume replica.
>
>     volutil create_rep /vicepa <volume replica name> <replicated volume id> \
>         0x2000004
>
> (with my example the <volume replica name> is something like
> vm:u.jaharkes.1 and <replicated volume id> is 0x7f000604)
>
> At this point running 'cfs checkservers' and 'ls -lR /coda/path/to/volume'
> should trigger runt resolution and rebuild the contents of the newly
> created replica.
>
> Jan

Received on 2006-09-11 16:13:37