Coda File System

Re: Help with rvm_malloc's error -9

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Thu, 17 Mar 2005 12:25:55 -0500
On Thu, Mar 17, 2005 at 10:02:35AM -0300, Gabriel B. wrote:
> I let the client copying some files overnight. The logs are below.
> 
> My first tought was that it was a lack of inod (already hapened 3
> times here) but when the server tries to restart, it mentions that
> /vicepa has 200k inodes used, and most of the 16M inodes free. Also
> has lots of free space.
> 
> the full error log:
> No waiters, dropped incoming sftp packet (some 30 lines of that)
> No waiters, dropped incoming sftp packet
> rvm_malloc: error -9 size c7c00 file recovc.cc line 365
> RVMLIB_ASSERT: error in rvmlib_malloc

Ah, growvnodes... haven't seen that one in a while.

The problem is that the server is using an array for all vnodes (file
system object) in the volume, and when this array fills up it does the
something quite similar to the following.

    BeginTransaction();
    new = rvm_malloc((size+128) * sizeof(vnode *));
    memcpy(array, new, size * sizeof(vnode *));
    memset(&new[size], 0, 128 * sizeof(vnode *));
    rvm_free(array);
    array = new;
    EndTransaction();

Now there are 2 things happening here. First of all, everything happens
in a single transaction, which means we records the old and the new
state of all modifications so that we can roll back an aborted (or
failed) operation, or reapply a completed transaction that hasn't
actually been committed to RVM data yet. This is why the server is able
to restart without a problem after the crash. But this means that our
worst case log usage is the size of the newly allocated memory + the
size of the released memory, possibly even including the changes made by
the memcpy operation, because I believe we only optimize when applying the
logged operations to the RVM data.

So if we were growing the array from 204288 to 204544 vnodes, we'd need
the RVM log to be able to contain at least (204544/malloc/ +
204544/state before memcpy+memset/ + 204288/free/ + 204544/committed
state/) * 4 bytes + whatever overhead the logged RVM operations have.
So that would be more than 3MB. However the peak usage would only hit
around the EndTransaction operation, so I'm guessing you're not running
out of log space in this situation.

The other problem we hit here is that we need to be able to allocate a
single contiguous chunk of (size+128) * 4 bytes to satisfy the
rvm_malloc request. But we're allocating before we free the previously
used space, and the available free space in RVM is probably fragmented
by now.

RVM tends to fragment over time, and unlike normal (non-persistent)
allocators this fragmentation remains even when we a restart. So even if
there is still more than 100MB of RVM available it might be that the
largest available chunk is only a hundred kilobytes or less. The
fragmentation is worse when we are only filling a single volume, if
files never get deleted, any of space we previously used for the vnode
array is per definition too small.

The allocation over time would go something like this

(A=array, v=vnode, .=unused)

    Avv............ (let's say we filled the array and need to grow)
    .vvAAvv........
    .vv..vvAAAvv...

At this point we're stuck, even though there is enough space to allocate
the new array there is no contiguous area of 4 pages so we can't
actually use any of the free space.

Now if we did the same, but using several volumes instead of just one
we'd see something more like,

    Avv............
    .vvAAvv........ (let's say we are starting to fill the second volume)
    AvvAAvvvv......

At this point we can already store as many vnodes as we had in the
previous case, but we still have a large consecutive chunk of
allocatable memory. If the first volume happens to grow first, the space
it leaves behind can be used to fit the array of the second volume.

Now the actual allocation has been made a bit smarter in avoiding
fragmentation, which is probably why we haven't seen a growvnodes report
in a while, but this is most likely your problem.

In any case, enough of the details, what can be done about it.

Well, the real solution to the problem in a way already exists, it was
implemented by Michael German, who got a server running with a million
files or more, the time it took to fill a server up became a bottle neck
and by that time he also started to get similar fragmentation problems
on his clients.

Instead of using a growing array of vnodes, it relies on a fixed size
hash table off of which we have chains of vnodes hanging. The size of
the hash table can be adjusted by the administrator at run time so that
he can keep the vnode chains on frequently used volumes short and quick.
The reason(s) why these fixes haven't been committed into the main
codebase are,

  - Some code walks through the list of vnodes and expects to see them
    in a steadily increasing order. Volume salvage during startup was
    one, but that one got fixed pretty quickly. The other is when we try
    to incrementally update an existing backup volume during a backup
    clone operation, which still remains broken.
    Resolution might have some problems, i.e. it hasn't even been tested
    yet, but I don't really expect serious issues in that area.
  - It is totally incompatible with the RVM data layout of existing
    servers. There is no way to smoothly upgrade except by copying
    everything off of the existing server and building a new server from
    scratch. And since it requires such an invasive upgrade it might be
    interesting to consider what other major (future) changes we could
    anticipate and allow for during the upgrade.
  - The new code might not even be able to restore existing Coda format
    backup dumps, luckily the impact of this problem is greatly avoided
    by Satya's codadump2tar conversion tool which will allow us to at
    least convert the old dumps to a tarball and restore data that way.

Maybe this code is interesting enough to start an experimental Coda
branch which may completely break compatibility a couple of times as
needs arise (i.e. don't assume that you can actually restart your
servers after a 'cvs update; make; make install').

The second option, which is probably more realistic at the moment, is to
increase the size of your RVM data segment,

    http://www.coda.cs.cmu.edu/misc/resize-rvm.html

(and in the long term to store data on more than just a single volume)

Jan
Received on 2005-03-17 12:27:29