Coda File System

Re: [kind of offtopic] Re: CODA Scalability

From: Nick Andrew <>
Date: Fri, 23 Aug 2002 01:25:14 +1000
On Thu, Aug 22, 2002 at 12:17:39PM +0200, Ivan Popov wrote:
> I suspect the earlier projects were abandoned because it is hard to use
> consequently. It may be my fault, but I am not aware of people actually
> deploying overlay filesystems, on any platform.

It seems like a logical way to extend the apparent size of a filesystem
beyond the storage capacity of a single host, without having to
create or rearrange mountpoints.

It's easy to add capacity from remote servers, simply by NFS-mounting
their available storage.
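For example (the server name and export path here are made up, not real hosts), it's just the usual:

```shell
# Mount a remote server's spare storage locally over NFS.
# "fileserver" and both paths are illustrative.
mount -t nfs fileserver:/export/spare /mnt/spare
df -h /mnt/spare    # the remote capacity now shows up locally
```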

Mountpoints are a nuisance, though, because they separate the total
filespace into two or more (large) chunks, and as the administrator
I have to cut my filesystem: I have to find a single directory which
is big enough (including its subdirectories) to make a significant
difference to the free space on my original filesystem. Moving the
files from the cut point to the new server requires manual effort
and possibly downtime.

It's also possible that there is no appropriate directory at which
to cut the filesystem in two in a way that makes a difference. If I
had a 100 gig filesystem which was full, and it contained 100
top-level directories each of approx. 1 gig utilisation, and I
wanted to add an additional 50 gigs on a second server, I could
balance the utilisation only by moving 33 directories across
(leaving each server about two-thirds full) and setting up 33
symlinks from the original names into the new mountpoint. That's
not desirable.
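The "symlink farm" workaround above can be sketched roughly as follows. The paths are illustrative stand-ins: $OLD for the full filesystem, $NEW for the freshly mounted space on the second server, and only two of the hundred directories are shown.

```shell
# Minimal sketch of the "symlink farm" approach (paths illustrative).
OLD=$(mktemp -d); NEW=$(mktemp -d)
mkdir -p "$OLD/dir01" "$OLD/dir02"      # two of the 100 top-level dirs

for d in dir01 dir02; do
    mv "$OLD/$d" "$NEW/$d"              # move the data to the new storage
    ln -s "$NEW/$d" "$OLD/$d"           # leave a symlink at the old name
done
```

Every moved directory keeps its old pathname, but the administrator now carries the burden of those symlinks forever.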

So, if I have explained the problem well, perhaps you can see why
I think a union filesystem would be a useful tool: it can extend
the amount of storage available without any need for the
administrator to make artificial cuts in the filespace or to move
files around to balance utilisation.

Now Coda solves some of the problems I have outlined above. If
my understanding of it is correct, one can add a new server to
a cell, and the new server's available disk becomes equivalent
to the existing server's disk from the point of view of
clients. One must still move volumes from server to server to
balance utilisation, but at least this can be done without
downtime. The balancing problem becomes easier because the
volumes themselves are typically smaller, and one does not
have to set up a "symlink farm" after moving them. The only
proviso is that each volume must reside completely on one
server (i.e. it can't be split between two servers). That
would become an issue only when the volumes grow _too big_,
for example bigger than 50% of the physical drive capacity.

So I think that Coda's architecture, as I described above, is
one which permits an "infinitely scaleable filespace",
probably far better than a union filesystem. The only problem
is that Coda's data structures and resource requirements don't
scale: a cheap sub-$1000 fileserver can attach a quantity of
disk far beyond the ability of the Coda server software to
serve (due to the physical memory and VM requirements, and I
guess the x86 memory architecture).

So I'm still stuck in a dilemma, sorta. I might end up sharing
the bulk of my data over NFS ... that at least scales to
any capacity on a single server, and quite well with multiple
servers; it just doesn't scale with multiple clients. I don't
have dozens of clients, only a few PeeCees around my home, so
the ability to handle raw capacity definitely ranks higher than
local (client) caching.

By the way, I joined the InterMezzo list and asked them whether
their client was really a proper "cache". In other words, can the
server serve 50 gigs while the client is configured with only 1
gig, and still access the entire 50 gigs, although obviously not
all at the same time? The answer was that they want it to work
like that - the client should be evicting unused files from its
cache - but it just doesn't do that yet, so the possibility
remains that the client will exhaust its local backing store and
fail.

Finally, here is one possible answer to the dilemma I raised above,
namely: "what do you do when your server is full and cannot take
any more physical devices, so that you have to move to a
multi-server model?"

The Network Block Device (NBD) driver allows a remote host's block
devices to be accessed by the local host as if they were locally
attached storage. If NBD devices are supported by LVM, then
extending a filesystem would be a matter of adding the NBD to
the volume group containing the local host's logical volumes,
extending any full logical volume onto the NBD, and then
resizing the filesystem on that logical volume. I believe this
solution scales to at least 256 gigs using the current Linux
LVM implementation, and possibly further if additional volume
groups are defined.
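The sequence might look something like this. The host, port, device names, the volume group "vg0" and the logical volume "data" are all hypothetical, and the whole thing needs root on real hardware; it's a sketch of the idea, not a tested procedure.

```shell
# Hypothetical sketch: grow a local filesystem onto remote disk
# exported via NBD, using the Linux LVM tools.
modprobe nbd                            # load the NBD client driver
nbd-client remotehost 2000 /dev/nbd0    # attach the remote export
pvcreate /dev/nbd0                      # make it an LVM physical volume
vgextend vg0 /dev/nbd0                  # add it to the existing volume group
lvextend -L +50G /dev/vg0/data          # grow the full logical volume
resize2fs /dev/vg0/data                 # grow the filesystem to match
```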

Considering the local host (the one with the direct attached disk
plus lots of NBD storage) as a server then, from a client's point
of view they continue to talk to only one server, which just
appears to have more space. This is inefficient of course, because
the data might have to travel over the network link twice; ideally
the client should transmit its request directly to the second
server ... but that sort of smarts requires a whole different
architecture from the simple sharing arrangement I started out
describing.

That idea brings me back to the Berkeley xFS concept of serverless
sharing, where hosts provide disk resource as peers, and clients
somehow locate and access their files. I wish their website were
working; somehow I think this is a project which will never be
completed. This is more of a cluster filesystem idea ... it would
be nice if we had a cluster filesystem to go with cluster processing,
so each additional CPU or storage device increases the capacity of
the cluster incrementally, with no real effort on an administrator's
part. MOSIX and OpenMOSIX claim to balance CPU and VM utilisation
across a cluster, but they don't balance filesystem utilisation; any
migrated process which does file or device I/O has its I/O
transparently redirected to the home system. There's no unified
filespace. Perhaps this could be achieved via some co-operation
between *MOSIX and one of the distributed filesystem projects.
Has anybody here played with either of the two MOSIXs?

Received on 2002-08-22 11:27:13