Coda File System

coda server crashed and won't recover

From: Stephan Koledin <>
Date: Tue, 15 Aug 2000 22:37:35 -0400
I'm currently running a krb5-built Coda 5.3.8 on RedHat 6.2, with no other
source modifications except the changes suggested by README.kerberos.

While I had a read-only volume restored and mounted from a previous dump, and
directly after dumping some backup volumes to disk, my codasrv crashed and
won't come back up. On restart it seems to be looking for the restored volume
but can't find it; I assume that's because the volume was only temporarily
restored from a file. The restored volume was given id 1000004, as you can see
in the log below, where the server tries to start up and recover.
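For reference, the dump/restore steps involved were roughly along these lines
(reconstructed from memory, so the volume ids, file names, and exact volutil
arguments are approximate, not a transcript):

```shell
# Approximate reconstruction -- ids and paths are from memory.

# dump a volume to a file on disk
volutil dump 0x1000006 /vice/backup/h.skoledin.dump

# restore that dump file into /vicepa as a temporary read-only volume;
# this is the sort of restore that ended up with id 1000004
volutil restore /vice/backup/h.skoledin.dump /vicepa restored.h.skoledin
```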

Is there any way to recover from this, or will I just need to rebuild the
server again?
I'd appreciate any help, as this has happened twice now after a server crash
while a restored volume was still up and mounted. The server seems stable most
of the time; it only seems to go flaky with restored volumes.
Thanks. -Stephan

16:22:25 New SrvLog started at Tue Aug 15 16:22:25 2000
16:22:25 Resource limit on data size are set to 2147483647
16:22:25 Server etext 0x80c44ba, edata 0x80fa5a0
16:22:25 RvmType is Rvm
16:22:25 Main process doing a LWP_Init()
16:22:25 Main thread just did a RVM_SET_THREAD_DATA
16:22:25 Setting Rvm Truncate threshhold to 5.
Partition /vicepa: inodes in use: 23230, total: 16777216.
16:22:42 Partition /vicepa: 5148943K available (minfree=5%), 4960987K free.
16:22:42 The server (pid 4941) can be controlled using volutil commands
16:22:42 "volutil -help" will give you a list of these commands
16:22:42 If desperate,
                "kill -SIGWINCH 4941" will increase debugging level
16:22:42        "kill -SIGUSR2 4941" will set debugging level to zero
16:22:42        "kill -9 4941" will kill a runaway server
16:22:42 Vice file system salvager, version 3.0.
16:22:42 SanityCheckFreeLists: Checking RVM Vnode Free lists.
16:22:42 DestroyBadVolumes: Checking for destroyed volumes.
16:22:42 Salvaging file system partition /vicepa
16:22:42 Force salvage of all volumes on this partition
16:22:42 Scanning inodes in directory /vicepa...
16:22:46 SFS: There are some volumes without any inodes in them
16:22:46 Entering DCC(0x1000001)
16:22:46 DCC: Salvaging Logs for volume 0x1000001
16:22:46 done:  10 files/dirs,  13 blocks
16:22:46 SFS:No Inode summary for volume 0x1000002; skipping full salvage
16:22:46 SalvageFileSys: Therefore only resetting inUse flag
16:22:46 Entering DCC(0x1000004)
Magic wrong in Page i           
16:22:46 DCC: Bad Dir(0x1000004.6d.68e9) in rvm...Aborting
16:22:46 JE: directory vnode 0x1000004.6d.68e9: invalid entry ; 
16:22:46 JE: child vnode not allocated or uniqfiers dont match; cannot

**** Here's some of the final SrvLog data from before the crash. My SrvLog
seems to have been much busier than normal, though perhaps I turned on more
detailed logging somehow? You can spot the crash at the end of this segment
easily enough, but there don't seem to be any clues as to what caused it,
since all the preceding actions finished properly.

16:08:53 --DC: (0x100000c.0x1.0x1) ct: 1
16:08:53 VN_PutDirHandle: Vn 1 Uniq 1: cnt 0, vn_cnt 0
16:08:53 VN_GetDirHandle for Vnode 0x1 Uniq 0x1 cnt 1, vn_cnt 1
16:08:53 VN_GetDirHandle for Vnode 0x1 Uniq 0x1 cnt 2, vn_cnt 2
16:08:53 VN_PutDirHandle: Vn 1 Uniq 1: cnt 1, vn_cnt 1
16:08:53 VN_GetDirHandle for Vnode 0x1 Uniq 0x1 cnt 2, vn_cnt 2
16:08:53 VN_PutDirHandle: Vn 1 Uniq 1: cnt 1, vn_cnt 1
16:08:53 --DC: (0x100000c.0x1.0x1) ct: 1
16:08:53 VN_PutDirHandle: Vn 1 Uniq 1: cnt 0, vn_cnt 0

16:10:58 VAttachVolumeById: vol 1000007 (h.skoledin.backup) attached and
16:10:58 S_VolMakeBackups: backup (1000007) made of volume 1000006 
16:10:58 NewDump: file /vice/backup/7f000004.1000006.newlist volnum 7f000004
id 1000007 parent 10000
16:11:06 S_VolNewDump:  volume dump succeeded
16:11:06 VAttachVolumeById: vol 100000d (pub.install.0.backup) attached and
16:11:06 S_VolMakeBackups: backup (100000d) made of volume 1000009 
16:11:06 NewDump: file /vice/backup/7f000006.1000009.newlist volnum 7f000006
id 100000d parent 10000
16:11:08 S_VolNewDump:  volume dump succeeded
16:11:08 VAttachVolumeById: vol 100000e (pub.jabber.0.backup) attached and
16:11:08 S_VolMakeBackups: backup (100000e) made of volume 100000a 
16:11:08 NewDump: file /vice/backup/7f000007.100000a.newlist volnum 7f000007
id 100000e parent 10000
16:11:17 S_VolNewDump:  volume dump succeeded
16:11:17 VAttachVolumeById: vol 100000f (pub.krb5.0.backup) attached and
16:11:17 S_VolMakeBackups: backup (100000f) made of volume 100000b 
16:11:17 NewDump: file /vice/backup/7f000008.100000b.newlist volnum 7f000008
id 100000f parent 10000
16:11:18 S_VolNewDump:  volume dump succeeded
16:11:20 VAttachVolumeById: vol 1000010 (pub.coda.0.backup) attached and
16:11:20 S_VolMakeBackups: backup (1000010) made of volume 100000c 
16:11:20 NewDump: file /vice/backup/7f000009.100000c.newlist volnum 7f000009
id 1000010 parent 10000
16:12:13 S_VolNewDump:  volume dump succeeded
16:12:22 ****** FILE SERVER INTERRUPTED BY SIGNAL 11 ******
16:12:22 ****** Aborting outstanding transactions, stand by...
16:12:22 Uncommitted transactions: 0
16:12:22 Uncommitted transactions: 0
16:12:22 Becoming a zombie now ........
16:12:22 You may use gdb to attach to 1853

Date: Tue 08/15/2000

16:18:20 Starting new SrvLog file

**** I couldn't seem to attach a debugger to it either; it looked like it just
crashed outright and didn't even become a zombie like it said it would...
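For what it's worth, here's roughly how I tried to attach, using the pid the
server printed (1853); the codasrv binary path is a guess for where the RPM
put it on my install:

```shell
# try to attach gdb to the pid codasrv printed before "becoming a zombie"
# (binary path may differ between installs)
gdb /usr/sbin/codasrv 1853
```

but by the time I ran it, the process seemed to be gone already.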

Thanks again for any help with this.

Stephan B. Koledin
The Motley Fool
Systems DORC
Received on 2000-08-15 22:39:54