Coda File System

Re: Files Bigger Than Cache Size

From: Jan Harkes <>
Date: Sat, 2 Feb 2008 01:51:28 -0500
On Fri, Feb 01, 2008 at 01:29:18PM +0000, wrote:
> Is there a particular reason why 256K directory size is used? What code  
> changes would be required to increase this to, for example, 64M?

Directory data is allocated in units of 2KB pages. There is one page
which just stores pointers to the other pages, so a directory will use
at least 2 pages, one for the pointers and one page that holds the data
for the '.' and '..' entries. The 2KB page of pointers can contain 128
pointers; I am not even sure whether this page of pointers is allocated
as 4KB on a 64-bit system. But 128 pointers to 2KB pages results in a
256KB directory size limit.
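
As a rough sketch of that arithmetic (the macro and function names here
are made up for illustration, they are not the identifiers used in the
Coda sources):

```c
#include <assert.h>
#include <stddef.h>

/* Values from the description above; the names are illustrative only. */
#define DIR_PAGE_SIZE 2048u  /* bytes per directory data page */
#define DIR_MAX_PAGES 128u   /* pointers that fit in the pointer page */

/* Upper bound on directory data, in bytes, for a given layout. */
static size_t dir_size_limit(size_t page_size, size_t max_pages)
{
    return page_size * max_pages;
}

/* Pages a directory occupies: the pointer page plus its data pages. */
static size_t dir_pages_in_use(size_t data_pages)
{
    assert(data_pages >= 1);  /* '.' and '..' always need one data page */
    return data_pages + 1;
}
```

With these values dir_size_limit(DIR_PAGE_SIZE, DIR_MAX_PAGES) comes
out at 262144 bytes, i.e. 256KB, and an empty directory occupies
dir_pages_in_use(1) == 2 pages.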

There are a bunch of other assumptions, such as that directory entries
will never cross one of these page boundaries and are allocated in
multiples of 32 bytes. At some point I think the kernel modules were
reading the data out of directories in 2KB units as well. There is also
a very similar variant without the pointer-page which is used to send
the directory data between servers and from the servers to the client.
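
To illustrate the two allocation assumptions, here is a minimal sketch
of how an entry could be sized in 32-byte units and checked against the
page boundary. The 12-byte per-entry header is a placeholder I picked
for the example, not the actual on-disk layout, and the names are
hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DIR_PAGE_SIZE 2048u  /* bytes per directory data page */
#define DIR_BLOB_SIZE 32u    /* allocation unit for directory entries */
#define DIR_ENTRY_HDR 12u    /* ASSUMED fixed per-entry overhead */

/* Number of 32-byte units needed for an entry holding this name. */
static size_t entry_blobs(const char *name)
{
    size_t bytes = DIR_ENTRY_HDR + strlen(name) + 1;  /* +1 for the NUL */
    return (bytes + DIR_BLOB_SIZE - 1) / DIR_BLOB_SIZE;
}

/* Nonzero if an entry of 'blobs' units starting at byte 'offset'
 * stays within a single 2KB page. */
static int fits_in_page(size_t offset, size_t blobs)
{
    assert(blobs >= 1);
    size_t last = offset + blobs * DIR_BLOB_SIZE - 1;  /* last byte used */
    return offset / DIR_PAGE_SIZE == last / DIR_PAGE_SIZE;
}
```

An allocator following these rules would skip to the next page whenever
fits_in_page() fails, which is also why an entry can never be larger
than one page.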

Technically it should be possible to double the basic page size, which
would double the limit, or even quadruple it if we also double the size
of the pointer page. However, this breaks the way servers store
directory data, so all servers would need to be rebuilt from scratch. It
also affects clients: a client that is still using the old page size
will, after unpacking even a smaller directory, be confronted with
entries that cross the former 2KB boundary, and possibly get more
directory data than it can handle. Also, when a directory is serialized
to be sent to the client, the data is not collected via scatter-gather
but copied into a single large buffer. Finally, it may affect the
ability of various kernel modules to read the directory contents.

And really that is just too much trouble for a change that doesn't
really fix anything. We would then have a 1MB directory size limit,
which would be around 16000 maildir messages: enough to store one month
of linux-kernel mail, but not any more than that.
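
The numbers can be checked with a small sketch. The assumption that a
maildir entry takes two 32-byte units is mine, inferred from the
estimate of roughly 16000 messages per 1MB; it also ignores the '.' and
'..' entries and any space lost at page boundaries. The names are
illustrative:

```c
#include <assert.h>
#include <stddef.h>

#define BASE_PAGE_SIZE 2048u  /* current directory page size in bytes */
#define BASE_POINTERS  128u   /* current pointers per pointer page */
#define BLOB_SIZE      32u    /* allocation unit for directory entries */

/* Directory size limit after scaling the page size and the number of
 * pointers in the pointer page by the given factors. */
static size_t scaled_limit(size_t page_factor, size_t pointer_factor)
{
    return (BASE_PAGE_SIZE * page_factor) * (BASE_POINTERS * pointer_factor);
}

/* Rough number of entries that fit, given units used per entry. */
static size_t max_entries(size_t limit_bytes, size_t blobs_per_entry)
{
    assert(blobs_per_entry >= 1);
    return limit_bytes / (blobs_per_entry * BLOB_SIZE);
}
```

Here scaled_limit(2, 1) is 512KB, scaled_limit(2, 2) is 1MB, and
max_entries(scaled_limit(2, 2), 2) is 16384.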

A real solution would not only have direct pointers to data pages, but
also indirect, double indirect and possibly even triple indirect
pointers. Or a B-tree layout. It would definitely also need a better way
to send data across the net that doesn't involve copying everything into
a large memory buffer. The same goes for passing directory contents to
the kernel: we should avoid having to copy/convert the directory data
every time a directory is opened. It may even be possible to just store
the directory contents in an on-disk container file instead of in RVM.
From some statistics and calculations, I estimate that about 50% of the
RVM data allocated on my servers is used to store directory contents.
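
To give a feel for what extra levels of indirection would buy, here is
a sketch that treats the pointer pages as a uniform tree. This layout
is hypothetical: it assumes every pointer page keeps 128 pointers and
every data page stays 2KB, and nothing like it exists in the current
code:

```c
#include <assert.h>

/* Data pages reachable through 'depth' levels of pointer pages, each
 * holding 'ptrs_per_page' pointers. depth 1 is the current layout
 * (one pointer page pointing directly at data pages). */
static unsigned long long pages_at_depth(unsigned long long ptrs_per_page,
                                         unsigned depth)
{
    unsigned long long pages = 1;
    assert(depth >= 1);
    for (unsigned i = 0; i < depth; i++)
        pages *= ptrs_per_page;
    return pages;
}

/* Directory capacity in bytes for that hypothetical tree. */
static unsigned long long capacity_bytes(unsigned long long ptrs_per_page,
                                         unsigned long long page_size,
                                         unsigned depth)
{
    return pages_at_depth(ptrs_per_page, depth) * page_size;
}
```

Under those assumptions one level gives the existing 256KB, two levels
give 32MB, and three levels give 4GB.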

Of course, there are advantages to storing directories in RVM: the
transactional and recovery guarantees, as well as the performance
benefit of having everything in memory and accessible with just one
pointer dereference.

Received on 2008-02-02 01:54:06