Coda File System

a new view of the old locking issue in the Coda server

From: <u-codalist-s7mc_at_aetey.se>
Date: Fri, 25 Nov 2011 12:40:48 +0100
Hello,

During the last years we had sometimes Coda servers becoming
unresponsive and causing their clients to hang indefinitely.

For a long time we believed that the problems were caused by
locking races at updates. Despite all the efforts and the help
of Jan to analyse and trace down the problems we did not find
the reason.

Recently we have made two observations which possibly sched a new light
on this.

1. The logs showed that the phenomenon has faded away. The only
plausible reason seemed to be the reduction of peak load previously
caused by other processes on the server machine.

2. We observed also the same (or a very similar?) behaviour in another
realm, in the absense of any updates. The triggering event seems to be
flaky network connectivity. In the actual case there were several
hundred client hosts but no write traffic.

So the breakage seems to happen when the server either misses incoming
packets or its sent packets get lost - both are likely to happen when
the server process does not get enough CPU or/and when the network
is temporarily cut off.

The server process becomes mostly idle but it sends BUSY to the clients.
This should mean that it waits for some condition which never happens.
I wonder if there is a missing/unreliable timeout check in some codepath.

The server logs in that state look like
---------------------------------------------------------------------
 ...
16:04:45 There are currently 921 connections in use
16:04:45 There are 540 workstations and 5 are active (req in < 15 mins)
 ...
16:05:20 AuthLWP-0 received new connection 753826942 from NN.NN.NN.NN:42411
16:05:20 client_GetVenusId: got new host NN.NN.NN.NN:42411
16:05:20 Building callback conn.
16:09:34 New Data Base received
 [the network failure (there was a failure around that time)] <===========
16:14:28 Callback failed RPC2_DEAD (F) for ws NN.NN.NN.NN:35910
16:14:28 Deleting client entry for user System:AnyUser at NN.NN.NN.NN.35910 rpcid 1472989626
16:18:16 AuthLWP-4 received new connection 81613927 from NN.NN.NN.NN:47196
16:18:16 AuthLWP-3 received new connection 335328897 from NN.NN.NN.NN:50819
 ...
 [> 500 "new connection" here] <==========================================
 ...
16:21:00 AuthLWP-0 received new connection 1256344367 from NN.NN.NN.NN:54369
16:21:00 client_GetVenusId: got new host NN.NN.NN.NN:54369
16:22:31 Building callback conn.
16:22:31 Building callback conn.
16:22:31 Building callback conn.
16:22:31 Building callback conn.
16:22:31 Building callback conn.
16:22:31 Building callback conn.
16:22:31 Building callback conn.
16:22:31 Building callback conn.
16:22:31 Building callback conn.
16:22:31 Building callback conn.
16:22:31 AuthLWP-2 received new connection 1127630056 from NN.NN.NN.NN:46101
16:22:31 client_GetVenusId: got new host NN.NN.NN.NN:46101
 [the new connections (apparently reconnections) end here] <================
 [ there are much fewer "Building callback conn" than
   "received new connection" between 16:18:16 and 16:22:31 above ]
16:45:49 Callback failed RPC2_DEAD (F) for ws NN.NN.NN.NN:37758
 ...
19:22:41 Deleting client entry for user System:AnyUser at NN.NN.NN.NN.58732 rpcid 1573500488
19:22:41 Deleting client entry for user System:AnyUser at NN.NN.NN.NN.35548 rpcid 1800508411
19:22:41 Building callback conn.
19:22:41 AuthLWP-4 received new connection 80554604 from NN.NN.NN.NN:59457
19:22:42 client_GetVenusId: got new host NN.NN.NN.NN:59457
19:23:43 AuthLWP-3 received new connection 743361068 from NN.NN.NN.NN:58732
19:23:43 client_GetVenusId: got new host NN.NN.NN.NN:58732
19:29:55 AuthLWP-0 received new connection 1672927627 from NN.NN.NN.NN:38592
19:29:55 client_GetVenusId: got new host NN.NN.NN.NN:38592
19:43:48 AuthLWP-2 received new connection 1433944362 from NN.NN.NN.NN:49381
19:43:48 client_GetVenusId: got new host NN.NN.NN.NN:49381
19:52:32 SmonDaemon timer expired
19:52:32 Entered CheckRVMResStat
19:52:32 Starting SmonDaemon timer
20:22:35 Building callback conn.
20:22:35 Building callback conn.
20:22:35 Building callback conn.
20:22:35 Building callback conn.
20:22:35 Building callback conn.
20:22:35 Building callback conn.
20:22:35 Building callback conn.
20:22:35 Building callback conn.
20:22:35 AuthLWP-1 received new connection 173156798 from NN.NN.NN.NN:49932
20:22:35 client_GetVenusId: got new host NN.NN.NN.NN:49932
20:40:00 RPC2_Bind to NN.NN.NN.NN port 52289 for callback failed RPC2_NOBINDING (F)
20:40:00 Deleting client entry for user System:AnyUser at NN.NN.NN.NN.52289 rpcid 1412666908
20:40:00 Deleting client entry for user System:AnyUser at NN.NN.NN.NN.52289 rpcid 624175938
20:40:00 Deleting client entry for user System:AnyUser at NN.NN.NN.NN.52289 rpcid 2023617723
20:40:01 Worker9: Unbinding RPC connection 29215551
20:40:01 Deleting client entry for user System:AnyUser at 0.0.0.0.0 rpcid 29215551
20:40:01 Building callback conn.
20:40:01 Callback failed RPC2_DEAD (F) for ws NN.NN.NN.NN:41861
20:40:01 Deleting client entry for user System:AnyUser at NN.NN.NN.NN.41861 rpcid 1653698504
20:40:01 Deleting client entry for user System:AnyUser at NN.NN.NN.NN.41861 rpcid 174909523
20:40:01 Deleting client entry for user System:AnyUser at NN.NN.NN.NN.41861 rpcid 1135798093
20:40:01 Deleting client entry for user System:AnyUser at NN.NN.NN.NN.41861 rpcid 1323397277
20:40:01 Building callback conn.
20:40:01 AuthLWP-4 received new connection 1963123312 from NN.NN.NN.NN:51701
20:40:01 client_GetVenusId: got new host NN.NN.NN.NN:51701
20:40:45 AuthLWP-3 received new connection 1810381824 from NN.NN.NN.NN:41861
20:40:46 client_GetVenusId: got new host NN.NN.NN.NN:41861
20:41:01 Shutdown received     <===========================================
 ...
20:41:01 There are currently 1456 connections in use
20:41:01 There are 528 workstations and 3 are active (req in < 15 mins)
 ...
---------------------------------------------------------------------

It looks like there is a chance for a deadlock when a client reconnects
while the server is "too slow" to answer. Somehow they wait on each other
(?), sending BUSY and keeping each other waiting. Can the timeouts for
old connections be reset/kept from expiring by new packets coming from
the same client?

Note that the server got over 500 new connections while apparently
keeping most of the old ones (which should be non-functional at that time).

Hope Jan might take a quick look at this.

Thanks for Coda, it is otherwise serving us well and is extremely useful.

Regards,
Rune
Received on 2011-11-25 07:01:03