Coda File System

Re: slow write performance on linux

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Fri, 28 Mar 2003 10:28:15 -0500
On Fri, Mar 28, 2003 at 04:58:19AM -0500, Steve Simitzis wrote:
> i'm running a 2.4.18 linux kernel, coda 5.3.20, and the latest rpc2
> 1.15. from what i understand, the latest rpc2 was meant to fix
> problems related to slow write performance, but it hasn't helped my
> installation at all.

Well, it fixes the extreme packet loss during the client->server bulk
transfers. That is only one part of what you experience as write
performance.

* I just reread this email before sending it off and it is a bit
* technical. But there are some interesting numbers hidden in there.

Coda works a lot like a (classic) BSD FFS filesystem, i.e. all metadata
updates are synchronous. Every single create, chmod, or chown won't
'complete' until the server is absolutely positive that the update has
hit the disk. This is already very different (and a lot slower) than
what you are probably used to: ext2 asynchronously writes back
modifications and relies on fsck to fix things up if the power fails
before everything is written back. BSD has embraced 'softupdates',
which orders 'dependent writes' and as such can do a similar async
writeback while keeping the filesystem in a consistent state at all
times.
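
By hand, a fully synchronous create looks roughly like this (just an
illustration, not the actual server code; the path is made up):

    import os

    # A create doesn't 'complete' until both the new file and the directory
    # entry naming it are known to be on disk. Illustrative only: /exportdir
    # is a made-up path and this is not the Coda server code.
    dirpath = "/exportdir"
    fd = os.open(os.path.join(dirpath, "newfile"),
                 os.O_CREAT | os.O_WRONLY, 0o644)
    os.fsync(fd)                  # make sure the file itself is on disk
    os.close(fd)

    dfd = os.open(dirpath, os.O_RDONLY)
    os.fsync(dfd)                 # ... and the directory entry as well
    os.close(dfd)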

If we look at things performance-wise: an application performs a
mutating operation, then we get a context switch to the userspace
cachemanager, which commits the operation locally. It then sends the
operation to the server (network latency). The server performs a
transaction to validate and apply the operation, relying heavily on
fsync to make sure all updates have been written to the disk. The server
then returns a response (network latency again), and the client
registers success of the operation. Only then can we return the result
to userspace (context switch again).
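
To put some numbers on that path (every number below is a guess, purely
to show where the time goes; the fsync estimate is basically one disk
seek/rotation, which is what ends up dominating):

    costs_ms = {
        "context switch into venus":  0.05,
        "local commit in venus":      0.10,
        "network RTT (100Base-T)":    0.30,
        "server transaction + fsync": 10.0,
        "context switch back to app": 0.05,
    }
    total = sum(costs_ms.values())
    print("%.1f ms per operation, so at most ~%d ops/sec"
          % (total, 1000 / total))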

So we have at least 2 context switches, the network RTT, and the time it
takes for the server to perform the transaction. Now on Linux, fsync
should probably be called fsuck. It is extremely slow: it has to walk
through the page tables to find dirty pages and schedule the writes, and
it seems to sync not just the file we called fsync on, but all pending
writes on the server. This includes updates to logfiles etc. And because
all the filesystems on the server are typically doing async writeback,
there often is a lot of data that needs to go to disk, and our process
is the one that pays for it because we actually care about consistency.
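
You can get a feel for this with something along these lines (a rough
experiment, not Coda code; paths and sizes are arbitrary):

    import os, time

    # Dirty a pile of unrelated data in the page cache, then time an fsync
    # on a tiny separate file. On the kernels described above the small
    # fsync ends up waiting for the unrelated data as well.
    with open("/tmp/unrelated-dirty-data", "wb") as big:
        big.write(b"\0" * (64 << 20))     # 64MB written, not yet synced

    with open("/tmp/tiny-file", "wb") as tiny:
        tiny.write(b"hello")
        start = time.time()
        os.fsync(tiny.fileno())           # the call the server depends on
        print("fsync of 5 bytes took %.3f seconds" % (time.time() - start))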

Peter Braam once calculated that it was not possible to perform more
than about 100 of these synchronous transactions per second on one of
our servers. Now a typical 'file write' with tar involves at least about
5 or 6 remote calls (create, store, chown, chmod, set timestamps,
rename). So if we're dealing with the creation of files with no data, we
would probably be able to handle about 20 files per second. And even if
we took out the consistency guarantees, we would still have an overhead
of 6 times the network latency per file, which on a 100Base-T network is
on the order of a millisecond or two, but on a PPP link is probably more
than a second.
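
The same back-of-the-envelope calculation, spelled out (the RTT values
are rough guesses):

    sync_txns_per_sec = 100   # roughly what one server can do synchronously
    rpcs_per_file     = 5     # create, store, chown, chmod, setattr, rename

    print("files/sec with full consistency:",
          sync_txns_per_sec // rpcs_per_file)

    for link, rtt_ms in [("100Base-T", 0.2), ("PPP", 200.0)]:
        # even with no server-side work, each file still pays ~6 round trips
        print("%-10s latency floor per file: %.1f ms" % (link, 6 * rtt_ms))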

> my coda server is an unloaded, 4 CPU machine, and i've been testing
> it against a single dual CPU client machine.

Because all Coda programs are single-threaded, adding more CPUs won't
help. In the long run we want to slim down the Coda server process and
make it easier (or trivial ;) to run multiple server processes on a
single machine.

> what i've observed is that writes are very slow, and seem to be
> hanging on the close() (at least, on the client side). untarring
> tar archives is also very slow.

The slowness during untar is totally dependent on how fast we can make
our RPCs, and as you can see from the description above, there is no
trivial way to speed that up. Hanging on close() typically points at the
ViceStore RPC. We don't write back data until the file is closed, so if
you have a large file, close() will block until all the data has been
sent to the server.
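
An easy way to see this on the client is to time the write() calls
separately from the close() (a sketch; the path is just an example):

    import os, time

    # The writes only go to the locally cached container file; the
    # ViceStore, and thus the wait for the server, happens inside close().
    path = "/coda/testfile"               # example path, use your own volume
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    chunk = b"\0" * (1 << 20)

    t0 = time.time()
    for _ in range(32):                   # write 32MB
        os.write(fd, chunk)
    t1 = time.time()
    os.close(fd)                          # blocks until the store is done
    t2 = time.time()

    print("writes %.2fs, close %.2fs" % (t1 - t0, t2 - t1))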

Also, even with the reduced packet loss, our bulk transfers don't have
the 'sophistication' of current TCP; we run the whole thing from
userspace, which costs us some. There is a fixed window, and it is tuned
towards wireless networks where packet loss is typically caused by
corruption and not congestion. I.e. we don't use 'slow start' and ramp
up, but simply stop and wait until all outstanding packets are
acknowledged, then kick back to pushing data at the 'estimated
bandwidth' of the link until something goes wrong and we fall back to
stop-and-wait again.
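
In rough pseudo-code the idea is something like this (only a sketch of
the strategy, not the actual sftp implementation, which is C and has a
lot more to it):

    WINDOW = 32                        # fixed window, no TCP-style slow start

    class SftpLikeSender:
        def __init__(self, estimated_bandwidth):
            self.rate = estimated_bandwidth  # how fast we think the link is
            self.streaming = False           # start out in stop-and-wait
            self.in_flight = 0

        def on_send(self):
            self.in_flight += 1

        def on_ack(self, acked=1):
            self.in_flight -= acked
            if not self.streaming and self.in_flight == 0:
                self.streaming = True  # all outstanding packets acked: resume

        def on_timeout(self):
            self.streaming = False     # something went wrong: stop-and-wait

        def may_send(self):
            if self.in_flight >= WINDOW:   # never exceed the fixed window
                return False
            # while streaming, packets would be paced at self.rate (pacing is
            # omitted here); in stop-and-wait we only send when nothing is
            # outstanding anymore
            return self.streaming or self.in_flight == 0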

Between a client and a single server, I'm seeing about 3.9MB/s. When
talking to 2 servers it goes up to about 4.2MB/s, and because the data
is sent to both servers at the same time, this is in fact a little more
than 8.4MB/s on the wire. With 3 servers I seem to hit a limit at about
2.6MB/s on average, but that is still more than 7.8MB/s going over the
wire.

Some things that can be useful: running a codacon process on the client
should give some indication of what RPC operations it is performing.
Then there are the rpc2 and sftp packet statistics. The server should
dump these to the log once in a while, but you can force them with
'volutil printstats'. Half of the statistics seem to go to stdout, but
the interesting ones in this case are dumped to /vice/srv/SrvLog.

10:00:28 RPC Packets retried = 25, Invalid packets received = 158, Busies sent = 335

* Retries: how many operations the server had to resend because a
  client or server wasn't responding.
* Invalid: the number of packets that could not be decoded as 'proper'
  RPC2 packets.
* Busies: a client retried an operation before we were done, and the
  server told it to back off for a while.

10:00:28 SFTP:  datas 915, datar 196824, acks 25605, ackr 152, retries 0, duplicates 0
10:00:28 SFTP:  timeouts 1, windowfulls 0, bogus 0, didpiggy 279

* datas/datar/acks/ackr: sftp data packets sent, data packets received,
  acks sent, and acks received. If everything goes right, we should be
  sending fewer acks than the number of received data packets divided by
  8, and the same goes for data packets sent versus acks received.

* retries and duplicates: typically not good, it means that we are
  sending or receiving data twice.

* timeouts, windowfulls, bogus.
  Before rpc2-1.15, bogus would be enormous: we were dropping packets
  because the sftp thread hadn't been scheduled, so 'nobody' was waiting
  for the incoming packet. Windowfulls should really be more than '0';
  the bandwidth estimate is probably too conservative, so we never
  really have a full window of 32 packets on the wire.

* didpiggy: the amount of data the client requested was so small that
  we just stuck it on the back of the rpc2 reply and saved ourselves the
  overhead of transferring the data separately.
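
If you want to track these counters over time, a quick-and-dirty parser
for the stat lines shown above could look something like this (just a
sketch, nothing official):

    import re

    def srvlog_counters(path="/vice/srv/SrvLog"):
        # Yield (timestamp, kind, {counter: value}) for the RPC and SFTP
        # stat lines, so the numbers can be compared between runs.
        stat_line = re.compile(r"^(\d\d:\d\d:\d\d) (RPC|SFTP)\b(.*)")
        with open(path) as log:
            for line in log:
                m = stat_line.match(line)
                if not m:
                    continue
                stamp, kind, rest = m.groups()
                counters = dict(
                    (name.strip(), int(value)) for name, value
                    in re.findall(r"([A-Za-z ]+?)\s*=?\s*(\d+)", rest))
                yield stamp, kind, counters

    if __name__ == "__main__":
        for stamp, kind, counters in srvlog_counters():
            print(stamp, kind, counters)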

I also monitor all servers with 'smon', which generates data for
rrdtool. This data contains similar numbers, so I can look at graphs of
the average number of rpc2 operations or server CPU load over the past
year for any server.

Jan
Received on 2003-03-28 10:32:03