Coda File System

Re: Recovering from hard disk failure on replicated server

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Mon, 26 Mar 2001 16:54:18 -0500
On Mon, Mar 26, 2001 at 01:35:35PM -0500, Brad Clements wrote:
> I've looked through the archives and docs, can't find a ready answer to this.
> 
> I have 3 servers replicating a variety of volumes.
> 
> One of the servers has lost a hard drive that contained the RVM (both
> partitions) and /vicepa

Cool.

> Fortunately all of the volumes on the failed drive are on the SCM.
> 
> What's the proper procedure to get this server up and running again?

We need to know exactly which volume replicas were stored on that
server. There should be a /vice/vol/BigVolumeList that contains all the
important information (and then some).

We need the following information for every volume that used to be on
the server.

partition    replica name     replicated groupid    volume id
/vicepa      xx:user.one.N

It shouldn't really matter whether we use the same partition or not,
we're rebuilding the lost replica anyways.

The replica name is the replicated volume name plus ".<nr>", e.g. for
the replicated volume "vm:u.jaharkes.mail", the replica names on the
different servers are "vm:u.jaharkes.mail.0" and "vm:u.jaharkes.mail.1".

Replicated groupid is simply the replicated volume id (eg. 0x7F0000A8).

Volume id is the underlying replica id. This is recorded in the volume
replication database (VRDB/VRList), so it had better be the same one ;)

OK, get /vice/vol/BigVolumeList. It looks like this:

P/vicepa Hmahler.coda.cs.cmu.edu T23dbe3 F584b8
P/vicepb Hmahler.coda.cs.cmu.edu T23dbe3 Fd5c18
Wvm:www.root.1 Ic8000001 Hc8 P/vicepa m0 M0 U7cbf Wc8000001 C388376d4 D388376d4 B3abed327 Af66
Wvm:www.public.1 Ic8000002 Hc8 P/vicepa m0 M0 U87a1 Wc8000002 C36b6097c D36b6097c B3abed2bc Af218c
Bvm:www.public.1.backup Ic8000003 Hc8 P/vicepa m0 M0 U8ac6 Wc8000002 C36b687c5 D36b687c5 B0 A0

It is pretty simple to extract the required info from here. The lines
starting with P identify partitions on servers; search down to the
P lines of the server that is gone.

Cut and paste everything up until the next "P". Now we should have the
info for all volume replicas (and backup volumes).

Strip out the volume replica lines only, e.g. with the pasted text saved
in a file named "volumes":

    grep '^W' volumes
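
The cut-and-paste step and the filtering on W lines can also be done in
one go with awk. This is only a sketch: it assumes the layout shown in the
sample above (a server's P lines first, then that server's volume lines up
to the next server's P lines), and the function name replicas_for_host is
made up for illustration.

```shell
# Print the W (volume replica) lines that belong to one server.
# $1 = server hostname as it appears after the "H" on the P lines
# $2 = path to a copy of BigVolumeList
replicas_for_host() {
    awk -v host="$1" '
        # A P line starts a partition entry; remember which host it names.
        /^P/ { current = substr($2, 2); next }   # drop the leading "H"
        # Keep replica lines only while we are inside the wanted host.
        /^W/ && current == host { print }
    ' "$2"
}
```

For example, "replicas_for_host mahler.coda.cs.cmu.edu /vice/vol/BigVolumeList"
should print only the W lines listed under mahler's partitions.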

Now we have lines like:
    W<volume replica name> I<replica id> H?? P<partition name>

Either by hand or using awk, pull the info out. Now we're only
missing the replicated volume ids. Those are in /vice/db/VRList.
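
A possible awk version of that step, as a sketch only. It assumes the
field order shown in the W lines above (name first, id second, partition
fourth); replica_fields is a made-up name.

```shell
# Turn W lines into "partition replica-name replica-id".
# Reads the files given as arguments, or stdin if there are none.
replica_fields() {
    awk '/^W/ {
        name = substr($1, 2)   # "Wvm:www.root.1" -> "vm:www.root.1"
        id   = substr($2, 2)   # "Ic8000001"      -> "c8000001"
        part = substr($4, 2)   # "P/vicepa"       -> "/vicepa"
        print part, name, id
    }' "$@"
}
```

Running "replica_fields volumes" on the grep output gives one line per
replica, still without the replicated volume id.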

"example VRList lines"
e:braam.rep2 7F00042E 2 e60000ed e500001e 0 0 0 0 0 0 E0000149
e:braam.tallis 7F0003F6 1 e500001a 0 0 0 0 0 0 0 E0000153

<replicated volume name> <replicated volume id> <nr of volumes> <volumeid's>

So we can get the replicated volume ids by running:

    grep -i <volumeid> /vice/db/VRList | cut -d' ' -f2

Hopefully we'll end up with all the necessary information in a nice list
similar to the following;

/vicepa userone.1 7F000125 e60000de
/vicepa usertwo.1 7F000126 e60000df
etc.
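
For the VRList lookup, a plain grep matches the id anywhere on the line,
so a slightly stricter sketch is given here that compares only the replica
id columns. It assumes the whitespace-separated layout described above
(name, replicated volume id, number of replicas, then the replica ids);
group_id_for is a made-up name.

```shell
# Print the replicated volume id (group id) that a replica id belongs to.
# $1 = replica id in hex, any case    $2 = path to a copy of VRList
group_id_for() {
    awk -v id="$1" '
        BEGIN { id = tolower(id) }
        {
            n = $3                            # number of replica ids listed
            for (i = 4; i < 4 + n; i++)       # look at the replica ids only
                if (tolower($i) == id) { print $2; exit }
        }
    ' "$2"
}
```

With the example lines above, "group_id_for e60000ed /vice/db/VRList"
would print 7F00042E.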

Once we have all this info we're in pretty good shape. Put a working
drive in the failed server, and reinstall an empty codaserver
    (rm -rf /vicepa/* ; vice-setup-rvm) or even
    (rm -rf /vice ; rm -rf /vicepa/* ; vice-setup)

Then bring the servers up. Each of the lines of information we
created earlier exactly matches the arguments of

    volutil create_rep partition-path volumeName grpid [rw-volid]

which is what we need to do for every lost volume. Once the volumes are
created there is just one thing left to do: get the data back onto the
newly reinitialized server.

    cfs strong
    cd /coda/some/restored/volume ; volmunge -a `pwd`

Instead of using volmunge, "ls -lR" will also do the trick, but it will
cross volume boundaries. It can take a few hours, but it should work.
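
Going back to the create_rep step: with many volumes it may help to
generate the commands from the list and look them over before running
anything. The sketch below only prints the commands, it does not execute
them. It assumes the list has exactly the four columns shown earlier
(partition, replica name, replicated volume id, replica id);
print_create_commands is a made-up name.

```shell
# Read "partition name grpid volid" lines on stdin and print the
# matching volutil create_rep command for each (dry run, nothing is run).
print_create_commands() {
    while read -r part name grpid volid; do
        # Skip blank or incomplete lines rather than emit a broken command.
        [ -n "$part" ] && [ -n "$volid" ] || continue
        printf 'volutil create_rep %s %s %s %s\n' \
            "$part" "$name" "$grpid" "$volid"
    done
}
```

Once the printed commands look right, they can be pasted into a shell on
the rebuilt server.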

Jan
Received on 2001-03-26 16:55:42