Coda File System

Unresponsive repair operation lets CML grow

From: Simon de Hartog <simon_at_speakup.nl>
Date: Fri, 01 Apr 2011 21:15:17 +0200
Hi all,

For a few weeks now, we have repeatedly had problems with one Coda client
that doesn't seem to push its updates to the servers. We have monitoring
on every client and get a call when the number of CML entries goes over 25.
I've found what I think is a local/global conflict. I'll just post some
info, as I'm not sure what you need to be able to point me in the right
direction.

We have two servers and currently about 8 clients. The problem client is
called cmp06. The volume with the conflict is named cmpprod. This
has happened before. The last two times, we resorted to stopping all apps
using files in /coda, stopping venus, uninstalling venus, running
"rm -rf /var/log/coda /var/lib/coda /var/cache/coda" and then
reinstalling venus from scratch. This worked for a while:
modifications were correctly pushed to the servers and showed up on
other clients.

Output of commands run on cmp06:

root@cmp06:/# ctokens
Tokens held by the Cache Manager for root:
     @nkh.spup.net
         Coda user id:    10001
         Expiration time: Sat Apr  2 21:37:02 2011
root@cmp06:/# cfs cs
Contacting servers .....
All servers up
root@cmp06:/# cfs lv /coda/nkh.spup.net/cmpprod
   Status of volume 7f000004 (2130706436) named "cmpprod"
   Volume type is ReadWrite
   Connection State is Reachable
   Reintegration age: 0 sec, time 15.000 sec
   Minimum quota is 0, maximum quota is unlimited
   Current blocks used are 2965098
   The partition has 7823104 blocks available out of 11756312
   *** There are pending conflicts in this volume ***
   There are 30 CML entries pending for reintegration (3617288 bytes)
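
For reference, our monitoring check boils down to something like the following. This is only a minimal sketch, not our actual check: the function names are made up for illustration, the sample text is copied from the output above, and treating a missing CML line as zero entries is an assumption on my part.

```python
import re

# Sample text copied from the `cfs lv` output shown above.
SAMPLE = """\
   Status of volume 7f000004 (2130706436) named "cmpprod"
   Volume type is ReadWrite
   Connection State is Reachable
   *** There are pending conflicts in this volume ***
   There are 30 CML entries pending for reintegration (3617288 bytes)
"""

def parse_cfs_lv(text):
    """Return (cml_entries, has_conflicts) from `cfs lv` output text."""
    match = re.search(r"There are (\d+) CML entries pending", text)
    # Assumption: no matching line means nothing is pending.
    entries = int(match.group(1)) if match else 0
    conflicts = "pending conflicts" in text
    return entries, conflicts

def should_alert(text, threshold=25):
    """Alert when the CML count exceeds the threshold or a conflict is flagged."""
    entries, conflicts = parse_cfs_lv(text)
    return entries > threshold or conflicts

if __name__ == "__main__":
    print(parse_cfs_lv(SAMPLE), should_alert(SAMPLE))
```

The threshold of 25 matches the limit mentioned earlier; alerting on the conflict line as well is an extra I added in this sketch.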

The command "cfs listlocal /coda/nkh.spup.net/cmpprod" never returns and
gives no output at all (I waited a little over 30 minutes).

The directory containing the conflict shows:
root@cmp06:/coda/nkh.spup.net/cmpprod/voicemail/company8184892/217684# ls -alFh 20110401-130733-31546631891-1301656053.482641-386.wav
lrw-r--r-- 1 root nogroup 29 Apr  1 20:41 20110401-130733-31546631891-1301656053.482641-386.wav -> @7f000004.000035ce.00002610@n

The client has coda-client 6.9.5 installed from your Debian package; the
servers run the coda-server and coda-update Debian packages, version 6.9.4.

The /var/log/coda/venus.log is filled with entries like these:

[ W(177) : 0000 : 21:08:54 ] WAIT OVER, elapsed = 5005.9
[ W(177) : 0000 : 21:08:54 ] WAITING(VOL): cmpprod, state = Reachable, [0, 0], counts = [0 0 5 0]
[ W(177) : 0000 : 21:08:54 ] CML= [30, 103], Res = 0
[ W(177) : 0000 : 21:08:54 ] WAITING(VOL): shrd_count = 0, excl_count = 0, excl_pgid = 0

And the /var/log/coda/venus.err contains:
21:00:02 volume cmpprod has unrepaired local subtree(s), skip checkpointing CML!
21:02:27 DispatchWorker: signal received (seq = 654736)
21:10:02 volume cmpprod has unrepaired local subtree(s), skip checkpointing CML!
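
To see at a glance which volumes venus refuses to checkpoint, the messages above can be counted per volume. Again just a rough sketch: the function name is hypothetical and the pattern is based only on the three lines quoted here, so other message variants may exist that it does not catch.

```python
import re
from collections import Counter

# Sample lines copied from the venus.err excerpt above.
SAMPLE_ERR = """\
21:00:02 volume cmpprod has unrepaired local subtree(s), skip checkpointing CML!
21:02:27 DispatchWorker: signal received (seq = 654736)
21:10:02 volume cmpprod has unrepaired local subtree(s), skip checkpointing CML!
"""

# Pattern derived from the quoted messages only (an assumption about the format).
PATTERN = re.compile(r"volume (\S+) has unrepaired local subtree\(s\)")

def unrepaired_volumes(text):
    """Count 'unrepaired local subtree(s)' messages per volume name."""
    return Counter(m.group(1) for m in PATTERN.finditer(text))

if __name__ == "__main__":
    for volume, count in unrepaired_volumes(SAMPLE_ERR).items():
        print(volume, count)
```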

So I executed repair with the following transcript:
root@cmp06:/coda/nkh.spup.net/cmpprod/voicemail/company8184892/217684# repair
This repair tool ... <cropped> ... the current repair session.
repair > beginrepair
Pathname of object in conflict? []: /coda/nkh.spup.net/cmpprod/voicemail/company8184892/217684/20110401-130733-31546631891-1301656053.482641-386.wav

It does not give any results; I have already waited for over 10 minutes
now. The directory listing doesn't show any expanded replicas, only the
broken symlink. The other clients all show the above-mentioned file with
a size of 0 bytes.

I'm not sure whether this is too much, too little or "sufficient" debug 
info. If anyone needs more info, please let me know so I can provide it.

Thank you very much in advance for your effort.

Kind regards,
Simon de Hartog
Special Technical Services
SpeakUp B.V.
http://www.speakup.nl/
Received on 2011-04-01 15:48:33