Coda File System

Re: Replicated volume upgrades

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Sun, 29 Feb 2004 23:33:39 -0500
On Sat, Feb 28, 2004 at 08:50:04PM -0500, Jan Harkes wrote:
> I'll detail the experimental way in another message when I have a bit of
> time. But it is worth a wait, because it might just be what you're
> looking for.

Alright, got some time now.

The following trick is only possible with 6.0 or more recent clients and
servers. The problem with earlier versions of the server is that they
internally still relied on the VSG numbers to find the servers hosting
the replicas of a volume. This was at some point replaced by a lookup in
the VRDB (volume replication database).

There are some additional issues. Resolution logging is turned off for
singly replicated volumes, and turning it back on is often problematic.
The clients also have to refetch the volume information and recognize
that there are suddenly more replicas available. Because there are
several issues, and this is a relatively recent change, there are still
a lot of places that are not well tested and you could actually lose the
existing replica. So having an off-Coda copy of your data is definitely
very strongly recommended.
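For example (the realm, volume name and target path here are just
placeholders), a plain copy made through a connected client is already
good enough:

    tar -cf /backup/myvolume-backup.tar /coda/example.org/myvolume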

In any case, if this works it should save a lot of time; if it doesn't,
the 'recovery procedure' (destroying the old volume, creating a new
replicated volume and copying all the data back from the backup) is what
you would have had to do anyway.

Let's look at how a replicated volume works. There is a two-level
lookup: first we get the 'replicated volume' information, which returns
a list of 'volume replicas'. I guess we could call the higher level
replicated volumes 'logical volumes', because in reality we only store a
logical mapping on the server. The volume replicas are 'physical
volumes'; they actually have some tangible presence on a server. The
createvol_rep script first creates all the physical volume replicas on
all servers, and then adds the logical replicated volume information
to the VRList file (which forms the basis for the VRDB). Now pretty much
everything from here on is done on the SCM.
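The VRList is a plain text file, so the record that maps a replicated
volume to its replicas is easy to look at on the SCM, for example (with
'myvolume' as a placeholder name):

    grep '^myvolume ' /vice/db/VRList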

So we need some of the information which can be found in the
/vice/db/VRList file.

    * volume name
    * replicated volume id
    * volume replica id

For the singly replicated volume a line in the VRList would look like,

    <volumename> <replicated id> 1 <volumeid> 0 0 0 0 0 0 0 E0000xxx
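As a completely made-up example, a volume 'myvolume' with replicated
volume id 7f000427 and a single volume replica ce000123 would show up as

    myvolume 7f000427 1 ce000123 0 0 0 0 0 0 0 E0000107

(all of the ids and the E0000107 group here are invented, yours will
differ).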

The first difficulty is that when we created the singly replicated
volume, we turned off resolution logging for the underlying volume
replica as it really isn't useful; because there is no resolution with
other servers, there is also no way to truncate the log if it grows too
much. So we need to turn resolution back on. This is something that can
very easily go wrong and might actually cause problems later, because
all directories normally have at least a single NULL resolution log
entry; without resolution logging those entries do not exist, and there
might still be some places left over that expect at least one entry. I
just tried it on a test server running the current CVS code and it
worked fine.

    volutil setlogparms <replicaid> reson 4 logsize 8192
      * 'reson 4'      - turn resolution on
      * 'logsize 8192' - allow for 8192 resolution log entries
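With the made-up replica id from the example above, the command would
look like

    volutil setlogparms ce000123 reson 4 logsize 8192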

Then we can create the additional volume replica on the new server.

    volutil -h <newserver> create_rep /vicepa <volumename>.1 <replicated id>
      * creates a volume replica on the /vicepa partition of <newserver>.
      * volume names for volume replicas are typically 'name'.number;
	there is no real reason for this except to keep them separate in
	the namespace. Interestingly, we pretty much only perform lookups
	on the volume replica id and not on the name, except when we try
	to mount a replicated volume by name.
      * resolution logging is already enabled for new volume replicas.
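Continuing the made-up example, with the new server called
server2.example.org this would be

    volutil -h server2.example.org create_rep /vicepa myvolume.1 7f000427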

When the command completes it should have picked a new 'volume replica
id' for the newly created volume and dumped it to stdout. We can then
take this number and add it into the VRList record for the replicated
volume. We leave most of the line the same,

    <volumename> <replicated id> 2 <volumeid> <new volid> 0 0 0 0 0 0 E0000xxx
      * bump the replica count from 1 to 2
      * replace the first 0 (the slot right after <volumeid>) with the
	new volume id
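With the made-up ids from before, and assuming create_rep printed a new
volume replica id of cf000456, the record would change from

    myvolume 7f000427 1 ce000123 0 0 0 0 0 0 0 E0000107

to

    myvolume 7f000427 2 ce000123 cf000456 0 0 0 0 0 0 E0000107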

Basically we're done, except for the fact that no client or server
actually knows that the new volume replica exists or that it is part of
the replicated volume. For that we have to tell the server to update the
VLDB and VRDB databases.

    bldvldb.sh <newserver>
      * build the volume location database. This gets the current list
        of volume replicas from <newserver> and builds a new volume
	location database.

    volutil makevrdb /vice/db/VRList
      * builds a new VRDB file.
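With the made-up names from the example, that boils down to running the
following on the SCM:

    bldvldb.sh server2.example.org
    volutil makevrdb /vice/db/VRList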

Once these databases are propagated by the updatesrv/updateclnt daemons,
all servers are aware of the new replication information. Clients will
not realize it until they refresh the locally cached volume information.
For the root volume this normally happens only when the client is
restarted. For all other volumes, you can alternatively use 'cfs
checkvolumes', which invalidates the mountpoints so they will be looked
up again. There really should be some sort of callback to automate this.
Another gotcha here: before release 6.0.3, clients would segfault when
the replication of a volume changed.
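So on a client, once the new databases have propagated, refreshing the
volume information for a non-root volume is simply

    cfs checkvolumes

and for the root volume restarting venus is the safe way to pick up the
change.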

Finally, when all of this is done, a simple recursive ls (ls -lR) in the
volume will trigger resolution and the new replica will get populated
with all the files from the original.
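For example, with the made-up mount point used earlier:

    ls -lR /coda/example.org/myvolume > /dev/null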

Jan
Received on 2004-02-29 23:38:07