Coda File System

Next Previous Contents

12. The Backup System

12.1 Introduction: Design of the Coda Backup Subsystem

As the use of the Coda file system increased, the need for a reliable backup storage system with a large capacity and a mininal loss of service became apparent. A one operation backup system was determined to be infeasible given the volume of data in Coda, the nature of a distributed filesystem, and the long downtime that would normaly be required to backup the system in one operation.

In order to meet the goals of high availability and reliability inherient in Coda design and to make efficient use of backup hardware and materials, the volume was chosen as the unit of data, and 24 hours was choosen as the time unit for system management and administration. The result of these design considerations is a volume by volume backup mechanism that occurs in three phases:

cloning

The cloning phase consists of freezing the (replicated) volume, creating a readonly clone of each of the replicas, and then unfreezing the volume. This allows mutating operations on the replicated volume to occur while maintaining a snapshot to backup. Once the cloning phase has been completed, normal read-write services can be resumed without fear of data corruption due to mutating operations on an active file system.

dumping to diskfiles on a backup spool machine

The dumping phase consists of converting the read-only volume clones to disk images stored as regular disk files, on a spool machine. A dump can either be full, in which all files are dumped; or incremental, in which only those files or directories which have changed since the last successful backup are included in the dump. This allows for a system in which only a subset of volumes need a full backup at anyone time (with incrementals done between full backups), thus reducing the amount of offline storage and network bandwidth needed at any one time. However, it allows the re-creation of data a granularity of 24 hours when combined with incremental dumps. Incremental dumps, however, are only supported for replicated volumes. Since there is little need for non-replicated volumes, only full dumps are supported for non-replicated volumes.

saving to media

The last phase consists of writing all the dump files from the backed up volumes dumped on local partitions to an archival media such as tape. Any standard backup system can be used for this phase. At CMU, we use the BSD dump and restore utilities to write and retrieve the disk images of Coda volumes to tape.

Practically, this system has been implemented as series of tasks The first two tasks are carried out by the backup program the latter by a Coda independent perl script (tape.pl).

Remeber, in practice, many restores are a result of a user accidently deleting or corrupting thier own files. In this case, users may use the the cfs mechanism to retrive files for the last 24 hour time line. For example:

cfs mkmount OldFiles u.hmpierce.0.backup

will mount the hmpierce's user backup volume from replica 0 to the OldFiles mount point. The file can then be copied out (backup volumes are read-only). Only if restoration needs are older than 24-hours or some chatastrophic event outside of the users control occurs do restores from tape normally need be done

Several tools have been developed to help in the creation, analysis, and restoration of data backups. Some of these tools have been developed by the Coda team (those tools concerning Coda FS to local disk conversion) such as backup and tape.pl (used to coordinate the efforts of backup , dump , etc), others employ off-the-shelf software such as the traditional UNIX dump or tar to tranfer the disk images created from the dump phase to the backup media. Coda, however, provides a perl script frontend to dump .

12.2 Installing a Coda Backup Coordinator Machine

The Coda Backup coordinator should be a trusted machine. It should be able get all the files that exist in the /vice subtree on the servers, although it is not necessary to run a fileserver on the backup coordinator (nor is it recommended that the backup machine be a fileserver).

Assuming the Backup Coodinator has been setup with an appropriate operating system, the steps are as follows:

NEEDS SECTION RECOMMENDING DATA/LOCAL DISK SPACE RATIO

On the Coda File Server designated as the SCM, create a file called /vice/db/dumplist . The dumplist contains three fields: volume id specified as its hex value, the full/incremental backup schedule, and a comment which is generally the human readable volume name. For example:

7f000001        IFIIIII         s:coda
7f000002        IIIIIFI         u:satya

The first column specifies the volume id to be backed up, the second column specifies the backup schedule by the day of the week begining Sunday, and the third column is a comment, ussually the volume name in human readable form. So, the volume id 7f000001 is scheduled for a Full backup every Monday, and incrementals Tuesday through Sunday and from the comment, we know this is a system volume called "coda". Likewise, the second volume 7f000002 is scheduled for a Full backup on Friday with incrementals being done Saturday through Thursday and is a user volume called "satya".

On the SCM, modify /vice/db/vicetab to indicate which host is acting as the backup coordinator and which partitions on the backup server are to be used by the backup coordinator to store the dump files. On a tripily replicated system, vicetab might look like this:


tye             /vicepa         ftree  width=8,depth=5
taverner        /vicepa         ftree  width=8,depth=5
tallis          /vicepa         ftree  width=8,depth=5
dvorak          /backup1        backup
dvorak          /backup2        backup
dvorak          /backup3        backup

vicetab , in addition to listing information on the servers providing replicated data, must also include information on the backup coordinator with backup coordinator's name in the first column, the backup partitions in the second column, and the designation "backup" in the third column. The 4th column is not used for the backup sub-system. Please see the vicetab(5) man page for additional information.

Note: that the number of partitions available for dumping may be controlled by the system administrator. Because the volume of data may be both large and variable, the backup program intelligently decides where to store individual dump files based on size accross the specified backup partitions. The directories in the sample vicetab , are assumed to be seperate local disk partitions. An organized central symbolic link tree is created by the backup.sh script in the directory /backup that points to the actual files scattered accross the /backup1, /backup2, and /backup3 given in this example.

  1. On the Backup Coordinator, the backup binaries and shell scripts need to be installed. The following directories should be have been created under /vice upon the installation of the Coda backup package:

/backup
/vice/backup
/vice/backuplogs
/vice/db
/vice/vol    
/vice/lib
/vice/bin
/vice/spool
/vice/srv

In addition, the file /vice/UpdateMonitor should be created once the update monitor is run for the first time. The primary binaries that should be installed under /vice/bin to get started are:


backup
backup.sh
bldvldb.sh
merge
updateclnt
updatesrv
updfetch
tape.pl

Once it has been verified that the backup system is installed, the files


/vice/db/hosts 
/vice/db/files

on the SCM should be manually copied to the same location on the Backup Coordinator. These are needed the first time by the updateclt daemon the when it runs. Also, Coda currently relies on the BSD dump and restore command to manipulate the tapes. A copy of dump package should be installed on the backup coordinator. BSD dump is available for all UNIX and UNIX like operating systems we have sucesfully run Coda on. Please check with your OS Vendor if you need help obtaining a copy.

=============================== Upon completion backup will print which volumes were successfully backed up, the volumes on which backup failed, and those volumes which were not specified for backup.

The merge program allows a system administrator to update the state seen in a full dump by the partial state in an incremental dump. This is useful when a user wishes to restore to a state that was captured by a full and some number of incremental dumps (For instance, in the middle of the week) The merge program applies an incremental to a full dump, producing a new full dump file.

An incremental is a partial snapshot with respect to the previous dump. The Coda backup facility maintains an order on dumps for a volume. The merge program will only allow an incremental to be applied to its predecessor in the order. This predecessor may be a full dump or the output of the merge program.

Once the administrator has created or retrieved the full dump which contains the desired state of a volume, she can create a read-only copy of that state by using the volutil restore facility. This volutil command creates a new read-only volume on a server. The new volume can be mounted as any other Coda volume. Regular Unix file operations can then be used to extract the desired old data. The obvious exception is that mutating options will fail on files in a readonly volume.

12.3 Incremental Dumps

In every dump (full or incremental) produces a file containing the version vectors and StoreIds of every vnode in the volume. These files have names of the form /vice/backup/ < groupid > . < volid > .newlist for replicated volumes and /vice/backup/ < volid > .newlist for non-replicated volumes. When the backup coordinator is convinced that the backup of a volume has completed, the *.newlist file is renamed to be *.ancient via the volutil ancient call. These files are stored in a human-readable format for convenience.

When creating an incremental dump, the server looks for the .ancient file corresponding to the volume. If it doesnt exist, a full dump is created. If it does exist, it is used to determine which files have changed since the last successful backup. The server iterates through the vnode lists for the volume and the version vector lists from the *.ancient file comparing the version vectors and storeIds. A discrepancy between the two implies that the file has changed and should be included in the incremental dump. Since version vectors are not maintained for non-replicated volumes, incremental dumps are not supported by the coda backup facility.

By comparing the sequence numbers in the vnodes lists, it can also be determined if a file or directory had been deleted (since the vnode is no longer in use). Vnodes that are freed and then reallocated between dumps look like vnodes which have been modified, and so are safely included in the incremental dump.

It is also important to maintain an ordering on the incremental dumps. To correctly restore to a particular day each incremental dump must be applied to the appropriate full dump. To ensure that this happens, each dump is labled with a uniquifier, and each incremental is labeled with the uniquifier of the dump with respect to which it is taken. During merge, the full dumps uniquifier is compared with the uniquifier of the dump used to create the incremental. If they do not match, the incremental should not be applied to the full dump.

12.4 Tape files

Once the dump files have been created, they must be written to tape. This is due to the fact that disk space is usually a limited commodity. The basic mechanism for the writing is the unix tar (1) facility.

Each tape contains a series of tar files, the first and last of which are labels. The start and end labels are indentical, and contain version information, the date the backup was taken, and an index which maps individual dump files into offsets into the tape. Thus the Coda backup tapes are self identifying for easy sanity checks. The label is a tar file which only contains a simple unix file called TAPELABEL .

The dump files are first sequenced by size. They are then broken down into groups, where the total size of the group must be larger than a certain size, currently .5 Megabytes. Each group is stored in a single tar file on the tape. These data tar files are the 2nd through n-1st records on the tape, the first and nth being a tar files containing just the tape label.

This structure was chosen for several reasons. The first is that it is easy to implement. Tar has been used for many years, and has been proven to be reliable. The second is easy access of information on the tape. Using a single monolithic tar file would often require hours of waiting to retrieve a single dump file. This way you can skip over most of the data using mt (1) and its fast-forward facility. Finally, it provides a simple and effective end-to-end check to validate that all the information has made it to tape.

At CMU, we have created a convention for capturing sufficient information for reliability, while trying to avoid excess use of tapes. Full backups are taken once a week. However, since our staging disks are not large enough to hold full dumps for all the replicas of all the volumes, we stagger the full backups across the week.

There are three kinds of requests for restorations: users who have mistakenly trashed a file, users who lost data but didnt know it, or bugs which require us to roll back to a substantially earlier state. The first class of restores can be typically handled by yesterdays state, which we keep on-line in the form of read-only backup clones. Thus almost all forms of requests never reach the system administrator at all. To give users easy access to the previous days backup, create a directory, OldFiles, in their coda directory, and mount each of the backups in the OldFiles directory.

If the user didnt catch the loss of data immediately, its reasonable to expect that they will catch it before a week has passed. We keep all incrementals and fulls to guarantee we can restore state from any day in the last week. This requires 14 tapes, or two weekly sets. One weeks worth is not sufficient, because state from later incrementals relies on earlier incrementals in order to be restored. Thus as soon as the first incremental tape is over-written (say Mondays), the state from the remainder of the last week is lost (last Tuesdays, Wednesdays, etc).

The third class of data loss is either due to infrequently used files or to catastrophy. (Weve actually been forced to rely on the backup system to restore all Coda state due to major bugs in the servers ). Since its unreasonable to keep all the tapes around, we only save tapes containing full dumps. Weekly tapes are saved for a month, and monthly tapes are saved for eternity.

12.5 Restoring a backup clone

A basic assumption of performing backups is that eventually someone will need to restore old state of a volume. To do this they should contact the system administrator, specifying the volume (groupid and repid for replicated volumes or just the volid for non-replicated volumes) and the date of the state they wish to restore.

The system administrator must then determine which dump files contain the state. There could be more than one involved since the state may have been captured by a full and some incremental dumps. Once the administrator knows the dates of the backups involved, she must get the appropriate tapes and extract the dump files (via the extract.sh script) .

The administrator then creates the full state to be restored by iteratively applying the incrementals to the full state via the merge program. Once the state for the date in question has been restored, a read-only clone is created by choosing a server to hold the clone, and invoking the volutil restore operation, directing the call to the chosen server. Once the clone has been restored, the administrator should build a new VLDB, and mount the volume in the Coda name space so the user can access it. When the user has finished with it, she should notify the administrator in order for the clone to be purged.

12.6 Backup Scripts

Although the backup program handles all the tricky details involved in Coda backup, there still remains some issues to be handled, most notably the saving of the dump files to tape. This is done by a series of scripts, backup.sh , writetotape.sh , and checktape.sh . The job of extracting dump files from tape is handled by extract.sh .

backup.sh takes the name of the directory in which to run backups. It creates a subdirectory whos name indicates the date that backup was run. It then runs the backup program, using the dumplist file in the directory specified in the arguments, saving the output of backup in a logfile in the newly created subdirectory. It copies in the current Coda databases (so they will be saved to tape along with the dump files.) It then invokes writetotape.sh and checktape.sh to write and verify that the files have been safely recoreded.

writetotape.sh performs the work of saving the files on tape. It takes the directory in which the backup was taken (the subdirectory generated by backup.sh) , and the device name of the tape drive. It first checks to see that the tape to be used is either empty or has the correct label. For Coda at CMU this means checking that the tape was last used on the same day of the week. It then gathers the dump files and databases into groups and generates the tape label for this backup. Finally it writes the tape label and all the groups to the tape via the tar (1) facility, marking the end of the tape with another copy of the tape label.

checktape.sh verifies that writetotape.sh did its job correctly. Like writetotape.sh , it takes the backup directory and the name of the tape drive as input. It first reads off the tape label, comparing it with one stored in the backup directory. It then scans all the data tar files, comparing their actual contents with what it expected, and finally reads the tape label at the end, comparing it with the saved value.

extract.sh is used to extract a dump file from a Coda backup tape. It takes the name of the tape device, the date the backup was taken, and the identifier of the volume to be restored. The date should be specified in the form DDMMMYYYY, as in 10Feb1992. Volume identifiers are "groupid.repid" for replicated volumes and "volid" for non-replicated volumes. extract.sh will locate the correct group by reading the tape label, fast-forward the tape to the correct tape file, and extract the dump file for the volume.


Next Previous Contents