Coda File System

Re: Real life lessons of disconnected mode

From: M. Satyanarayanan <satya_at_cs.cmu.edu>
Date: Tue, 27 Jun 2006 08:09:42 -0400
Hi Rune,
 Thank you for the valuable feedback!  Most importantly, thank you
for taking the time to write about your experience in a
constructive manner and in sufficient detail to help us
improve Coda.  It is gratifying that the system was helpful
to you under real-world stress, except at the very end when
it collapsed.  Real-world experience and thoughtful feedback of
this kind are priceless.

The  hoarding/caching area of Venus is definitely one that we
have had an eye on for some time now for a complete redesign
and rewrite.    What's difficult is to strike the right balance 
between precisely tracking changes on servers  and presenting a
frozen snapshot view.  How automated versus manual the synchronization
process should be is a delicate balance.  Another question is how
"atomic" the resync process should be when failures happen partway
through resync (as is very possible with flaky wireless networks).
The current code is biased towards the fully automated end of 
the spectrum, with no atomicity.  So the cache management policy
is roughly "as fresh a state as possible, without user interaction
or atomicity guarantees." 
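
To make that bias concrete, here is a rough Python sketch of what a
fully automated, non-atomic hoard walk amounts to.  All the names here
(Cache, Server, current_version, refetch) are invented for illustration;
this is not the actual Venus code.

    # Illustrative sketch only: hypothetical names, not Venus internals.
    class Cache:
        def __init__(self, objects):
            self.objects = objects            # fid -> version we have cached

        def refetch(self, fid, version):
            self.objects[fid] = version       # stand-in for fetching new data

    def hoard_walk(cache, server):
        """Refresh every cached object to the latest server version."""
        for fid, have in list(cache.objects.items()):
            try:
                latest = server.current_version(fid)   # assumed RPC
            except ConnectionError:
                # A failure mid-walk leaves the cache partially refreshed:
                # no rollback to the pre-walk snapshot, no prompt to the
                # user; freshness wins over atomicity.
                return
            if latest != have:
                cache.refetch(fid, latest)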

The resulting caching/hoarding code is complex and buggy, and can also have
counterintuitive behavior, as you experienced.  But the obvious
alternatives have their own problems.  Any redesign is going to
have to make some hard choices on very important corner cases.
It would help to hear thoughts from you and other experienced Coda 
users on the points below.

One very early design choice we considered (but rejected) was to simply
pin  objects in the cache via hoarding.  Hoard priority is then not a 
useful concept, but explicit hoard walks are still important (that's
the resync step).   Hoarded objects are "sticky" --- they 
never get thrown out, but new versions of them get fetched on
hoard walks.
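
In rough Python pseudocode (hypothetical names, just to pin down the
semantics under discussion, not a proposal for actual code), the sticky
rule would be something like:

    from dataclasses import dataclass

    @dataclass
    class CacheEntry:
        fid: str
        version: int
        pinned: bool        # set when a hoard entry covers this object
        last_used: float

    def pick_victim(entries):
        """Eviction under the sticky scheme: pinned entries are untouchable."""
        unpinned = [e for e in entries if not e.pinned]
        if not unpinned:
            # The hard case: nothing evictable remains in the cache.
            raise RuntimeError("cache is entirely pinned")
        return min(unpinned, key=lambda e: e.last_used)   # plain LRU otherwise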

One reason for rejecting the "sticky" approach was that we 
didn't have a good answer to the question of  what to do if the 
resync step would cause a pinned subtree to expand greatly
(beyond cache size limits).  E.g. you disconnect after hoarding
a 1-byte subtree; a later "hoard walk" discovers that the
1-byte subtree has grown to 10 GB, which is bigger than the cache.
What does Venus do now?   Currently, Coda tries to use the hoard priority
information to figure out what to throw out.  A different approach 
would be to ask for user help at this point or just to give
an error message at the hoard walk.  User interaction at this point
is questionable in the Unix design philosophy (unlike Windows or Mac),
because there may not be a GUI or a user to interact with.   That
design philosophy is the reason why conflicts are represented as
dangling sym links --- it is out of band communication to even
non-interactive programs.   The ASR (application-specific resolution)
mechanism can pop up a dialog box, but Coda views that as an
application-specific resolver and not as part of Venus.  Should we
do something similar here?  i.e. an upcall to application-specific
code for exception handling in cache management?   Is there an
analog of the dangling sym link for the rock-bottom fallback?
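
Purely as a strawman for discussion, such an upcall might look like the
following Python sketch.  Everything in it (the per-volume handler
registration, the handler's arguments) is invented for illustration; it
is not an existing Coda mechanism.

    import subprocess
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Volume:
        name: str
        cache_exception_handler: Optional[str]   # path to user-supplied program

    def handle_hoard_overflow(volume, needed_bytes, cache_limit_bytes):
        """Hand the cache-management exception to application-specific
        code, in the spirit of an ASR, instead of deciding inside Venus."""
        if volume.cache_exception_handler is None:
            # Rock-bottom fallback: fail the walk with an ordinary error,
            # much as a dangling sym link signals a conflict out of band.
            raise OSError("hoard walk would exceed the cache size limit")
        # Upcall: the handler decides what to unhoard, skip, or shrink.
        subprocess.run([volume.cache_exception_handler, volume.name,
                        str(needed_bytes), str(cache_limit_bytes)],
                       check=True)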

The deeper issue is static partitioning of the cache versus dynamic
partitioning.   Even without growth of hoarded subtrees, there
could be cache pressure to throw things out.  E.g. you hoard 
critical objects, then start crawling some big tree while still
connected.   The cache misses during the crawl will eventually
force a hard decision:  to throw out a hoarded object or not.
The "sticky" approach would never throw out a hoarded object
to relieve cache pressure.    But it would make the apparent
cache size smaller for non-hoarded objects.  
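
A toy calculation (the numbers are made up) shows the trade-off:

    # Purely illustrative numbers.
    cache_limit_mb = 2000
    hoarded_mb     = 1500          # pinned under the "sticky" scheme

    # Sticky: hoarded objects survive the crawl, but the crawl itself
    # runs against only the space the hoard set leaves over.
    crawl_space_mb = cache_limit_mb - hoarded_mb     # 500 MB

    # Current priority scheme: the crawl competes for the full 2000 MB,
    # so a long enough crawl can displace hoarded objects whose priority
    # has decayed, exactly the surprise a user about to disconnect
    # least wants.
    print(crawl_space_mb)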

This is similar to the problem faced by a VM system that dynamically
balances between use of physical memory for VM pages versus I/O
buffer cache.  The difference is that in our case we don't just face
a performance penalty.  We face the much more difficult problem
of failure semantics and user distraction, not just for planned failures
(voluntary disconnections) but also for unplanned ones (involuntary
disconnections, such as those caused by RF signal loss when mobile).

Usage-based insights and ideas from the Coda user community on 
these issues would be very helpful --- please contribute.

                          -- Satya
Received on 2006-06-27 08:49:39