Coda File System

replicated servers freezing under load

From: Jim Page - emailsystems.com <jim_at_emailsystems.com>
Date: Tue, 1 Jun 2004 11:51:39 +0200
Morning all

I believe I have a similar issue to that described by Steve in the post
http://www.coda.cs.cmu.edu/maillists/codalist/codalist-2003/5878.html -
which I don't think was ever resolved, so I'd like to take up the torch.

I am trying to use coda as the basis of a replicated, highly available file
store shared between 2 mail servers in a cluster. To summarise: if I send
fewer than 10 mails per sec to the cluster, it seems able to handle the
load pretty much indefinitely. At more than 10 per sec, after a couple of
thousand mails or so, I get warnings in SrvLog and the MTA freezes while
accessing files via the /coda/<realm> mountpoint. Restarting codasrv seems to
fix it, but it goes wrong again in the same way when I restart the test. This
cluster should be able to handle at least 40-50 messages per sec.
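
A minimal sketch of a load generator for reproducing this without the MTA in
the loop, assuming the hang depends on the file-creation rate rather than on
anything MTA-specific. The function name, file naming, payload size and the
5-second slow-write threshold are all made up for illustration; the demo
writes to a temporary directory, and pointing it at a directory under
/coda/<realm> is left to whoever runs it.

```python
import os
import tempfile
import time


def write_messages(target_dir, rate_per_sec, count, payload=b"x" * 2048):
    """Create `count` files in target_dir at roughly rate_per_sec files/sec.

    Returns the list of paths written. If the file system hangs, the
    script simply stalls inside open(), which is the symptom of interest.
    """
    os.makedirs(target_dir, exist_ok=True)
    interval = 1.0 / rate_per_sec
    written = []
    next_due = time.monotonic()
    for i in range(count):
        path = os.path.join(target_dir, "msg-%06d.eml" % i)
        started = time.monotonic()
        with open(path, "wb") as f:
            f.write(payload)
        elapsed = time.monotonic() - started
        # Hypothetical threshold: report writes that completed but were slow.
        if elapsed > 5.0:
            print("slow write: %s took %.1fs" % (path, elapsed))
        written.append(path)
        # Pace against a fixed schedule so slow writes do not lower the rate.
        next_due += interval
        delay = next_due - time.monotonic()
        if delay > 0:
            time.sleep(delay)
    return written


if __name__ == "__main__":
    # Assumption: a scratch directory stands in for the Coda mountpoint here.
    demo_dir = tempfile.mkdtemp(prefix="coda-load-")
    paths = write_messages(demo_dir, rate_per_sec=20, count=20)
    print("wrote %d files to %s" % (len(paths), demo_dir))
```

Running it at a rate below and then above 10 files per sec against the shared
directory would show whether the freeze follows the rate alone.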

o I am running coda (codasrv, venus et al) 6.0.3 (built by me with gcc 3.2.2),
slackware linux 9.0, kernel 2.6.3 (with whatever standard coda sources come
with the kernel).
o The servers are Dell 1750 dual Xeons, 4GB RAM.
o I have set up a realm which includes the 2 servers, and I authenticate
using a cron script that calls clog. Both machines are running both codasrv
and venus.
o Both clients are connected read-write and both servers are apparently up.
o The MTA accesses the shared directories via /coda/<realm>/
o I set up coda and venus using default paths, and the largest default
options available in the setup scripts.
o It is extremely unlikely (I don't think possible actually) that both
clients will attempt to access the same file simultaneously, though it's
entirely possible that one system may attempt to delete a non-empty dir that
contains files open on the other system (clearly I anticipate failure here!)

The failure mode is always the same. Here is a typical entry from the SrvLog:

07:34:34 ****** WARNING entry at 0x8122320 already has deqing set!

Here is where codasrv is at:

(gdb) where
#0  0x4024b44e in select () from /lib/libc.so.6
#1  0x400812fc in __JCR_LIST__ () from /usr/lib/liblwp.so.2
#2  0x4007d130 in IOMGR (dummy=0x0) at iomgr.c:354
#3  0x4007ef16 in Create_Process_Part2 () at lwp.c:796

The MTA is stuck in an open() call.

I am pretty new to coda so I'm not too sure where to go with this beyond
trawling the coda ML and google and trying anything that seems remotely
related - which I have done. I have copies of all logs, and I will post
anything that anyone thinks would be useful.

I have tried (clutching at straws) setting serverprobe=120, and 60, no
difference. I have iptables loaded on both servers but not doing any
masquerading (not doing anything actually beyond the defaults) - so I'm
inclined to think that Jan's idea that it is related to socket routing and
masq is not so in my case.

This is eminently reproducible. I can reproduce it within 2 minutes on
demand. This is a showstopper for us regarding our use of coda. I have some
limited time however (a couple of days before I am forced to abandon coda
for some less satisfactory alternative) and I am happy and keen to try and
assist in debugging in any way I can in that period. I have not delved into the
coda source yet but I'm open to suggestions and I am a reasonably competent
programmer. I'm hoping to entice someone from the coda core team to help out
here as it seems like there is a serious fundamental bug which, if fixed,
would greatly benefit the coda community, especially those wanting to load
coda up a bit. It happens so quickly and regularly in my case that I can't
believe others aren't in the same boat.

Trying not to sound too desperate ... :)

Cheers
Jim Page


Received on 2004-06-01 05:55:35