Reimplementing the Cedar File System using Logging and Group Commit
Hagmann


A lot of activity at Xerox PARC from 1979 to 1987:  Alto, Pilot, Cedar.
D-machines for Cedar:  Dorado, Dandelion, Dragon.

Cedar File System (CFS)

- Uniform, hierarchical namespace

- Complete file name consists of a server, root directory, zero or
  more subdirectories, a simple name, and a version.

- File Name Table is a B-tree mapping names to File Headers (similar
  to inodes).  The File Name Table is an index to the entire name
  space.

Labels (per-page) contain:
  - uid;  a unique identifier for the file that this page belongs to
  - a page number;  this is the logical page number within the file
  - page type (header, free, data)

The page is completely defined by its label.  An actual physical disk
address is only a hint that has to be checked against the label to make
sure the disk sector accessed is actually the intended one.

One can access a page by its label;  this is slow though, as it may
require scanning the entire disk.  Another way is to use the physical
disk address, if the system knows one.  The label still has to be
checked to make sure the page is the right one.  One cannot access a
page by its physical disk location alone.
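
A minimal sketch of the hint-then-verify access described above.  The
`Label`, `Sector`, and `read_page` names, and the Python list standing
in for the disk, are assumptions made for illustration and not CFS's
actual interfaces.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Label:
    uid: int       # file this page belongs to
    page_no: int   # logical page number within the file
    kind: str      # "header", "data", or "free"


@dataclass
class Sector:
    label: Label
    data: bytes


def read_page(disk: List[Sector], want: Label,
              hint: Optional[int] = None) -> bytes:
    """Return the page identified by `want`, trying the address hint first."""
    # Fast path: the hint is trusted only if the label found there matches.
    if hint is not None and 0 <= hint < len(disk) and disk[hint].label == want:
        return disk[hint].data
    # Slow path: the hint was missing or wrong, so scan every label.
    for sector in disk:
        if sector.label == want:
            return sector.data
    raise KeyError(f"no page with label {want}")
```

A wrong hint only costs time: the label mismatch is detected and the
lookup falls back to the scan.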

CFS formats:

FNT -- 
	text name
	version
	keep
	uid (unique file identifier)
	header page 0 disk address (hint -- needs uid and page number to
				    succeed)

Header
	text name
	version
	keep
	create time
	byte size
	run table; describes the extents in the file.

Label
	uid
	page number
	page type (data, header, free)
 
Labels + Headers contain all file system information.  FNT is
redundant -- used only for fast lookup.

Can reconstruct FNT by looking at all labels and identifying all
headers. 
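
The reconstruction can be sketched as a single pass over the disk.
This is a simplified illustration: the `Page` tuple merges label and
header fields, `keep` is omitted, and `rebuild_fnt` is a hypothetical
name, not the scavenger's real interface.

```python
from typing import Dict, List, NamedTuple, Tuple


class Page(NamedTuple):
    uid: int
    page_no: int
    kind: str          # "header", "data", or "free"
    name: str = ""     # text name, meaningful only on header pages
    version: int = 0   # version, meaningful only on header pages


def rebuild_fnt(disk: List[Page]) -> Dict[Tuple[str, int], Tuple[int, int]]:
    """Map (name, version) -> (uid, disk address of header page 0)."""
    fnt: Dict[Tuple[str, int], Tuple[int, int]] = {}
    for addr, page in enumerate(disk):
        # Only header page 0 of each file contributes an FNT entry.
        if page.kind == "header" and page.page_no == 0:
            fnt[(page.name, page.version)] = (page.uid, addr)
    return fnt
```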

Notice that headers contain file names, unlike UFS i-nodes.  This is
because in CFS there can be *only* one name for a file.


How does CFS make the disk a robust medium?
-------------------------------------------

Classes of errors:
1. Software errors: Memory smashes and wild writes
2. Hardware errors: Disk page errors

- Scavenging -- can reconstruct file system structure if
  software bugs corrupt memory state
- No wild writes -- cannot overwrite a page by accident, which
  could happen if you had a software bug.
- Some replication -- state can survive disk page errors
  e.g., FNT, headers, leaders

Mechanism:

- Disk pages are always accessed with (label, physical disk address)
  pair
- Label is checked each time.

Accidental overwriting of a page is quite unlikely!

Hints, such as the Volume Allocation Map (VAM) can be constructed from
the labels on the disk (recovery process).  If hints are wrong, they
do not jeopardize correctness.  

The File Name Table can also be considered to be a set of hints as it 
can be reconstructed by looking up all headers.

However, inconsistency is possible in CFS because of lack of atomicity.  
Multi-page B-tree updates may corrupt the structure.  However, we 
don't lose any files.  Looking at pages, we can find all header pages.

CFS does not verify the run table during scavenging.  It could run a
check on each run table segment to make sure that all extents are
valid (in case a software error has corrupted the run table).


Design decisions in FSD
-----------------------

1. FNT is now atomically updated, stored on disk, and replicated.

   - No longer a hint, it has to survive failures.

   - Locality (names and properties in the same structure)

   - FNT is written *twice*.  On a read, both copies are read and checked.
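
A rough sketch of a replicated read.  The notes only say both copies
are read and checked; what happens when the copies disagree, the
`read_fnt_page` name, and the use of `None` to model an unreadable
sector are assumptions made here.

```python
from typing import Optional


class FNTReadError(Exception):
    """Raised when no trustworthy copy of an FNT page is available."""


def read_fnt_page(copy_a: Optional[bytes], copy_b: Optional[bytes]) -> bytes:
    """Each argument is one copy's contents, or None if that read failed."""
    if copy_a is not None and copy_b is not None:
        # Both copies are readable: they must agree (assumed policy).
        if copy_a != copy_b:
            raise FNTReadError("copies disagree")
        return copy_a
    # At most one copy is readable: fall back to the survivor.
    survivor = copy_a if copy_a is not None else copy_b
    if survivor is None:
        raise FNTReadError("both copies unreadable")
    return survivor
```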

2. Leader pages are used to check FNT information.  Can guarantee
   FNT integrity.

3. Log-based recovery

   - Data spread over the disk can be logically and atomically updated
     with a "single" disk write to the log.

   - Writes to the buffers are delayed in anticipation of a further
     update to the page (a "hot spot") -- another benefit is that the
     writes can be done at a more convenient time.

   - Synchronous writes (as in UFS) result in
	- More writes
	- Little locality

   - Log in FSD is used for FNT and leader page updates

   - Redo log.  Logs entire pages.

   - Log not used for VAM or data.
	- wouldn't result in benefit for data -- files are versioned

   - Error model.  At most one or two adjacent sectors at a time (per-I/O)

   - Log record format

     | Header | Blank | Header | ...data... |  end  | ...data... |  end  |

     Replication guards against disk errors while writing to the log.

   - Log is separated into thirds.

     When entering a new third, we need to look at the cache and see 
     whether any of the dirty pages has its latest update in the 
     about-to-be-overwritten third.  If so, we need to force those to
     disk.

     Due to high locality in the FNT, the number of FNT pages that
     must be written at this point is usually nearly zero.  Dirty FNT
     pages have most likely been logged again in a newer third.

     The only times pages are actually written to their disk locations
     are when entering a new third and during crash recovery.

   - Write-ahead logging property?  (Necessary with group commit!)
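
The log-in-thirds bookkeeping can be sketched with an in-memory model.
Everything here (the `ThirdsLog` class, dictionaries standing in for
the cache and home disk locations, a fixed record count per third) is
an assumption for illustration; the real log is sized in sectors and
replicates its pages.

```python
from typing import Dict, List


class ThirdsLog:
    """Toy redo log split into three regions that are reused in rotation."""

    def __init__(self, records_per_third: int) -> None:
        self.capacity = records_per_third
        self.thirds: List[List[Dict[int, bytes]]] = [[], [], []]
        self.current = 0
        self.cache: Dict[int, bytes] = {}    # page id -> latest contents
        self.dirty_in: Dict[int, int] = {}   # page id -> third with newest image
        self.home: Dict[int, bytes] = {}     # pages at their home locations

    def commit(self, pages: Dict[int, bytes]) -> None:
        """Append one log record holding full images of the given pages."""
        if len(self.thirds[self.current]) >= self.capacity:
            self._enter_next_third()
        self.thirds[self.current].append(dict(pages))
        for pid, image in pages.items():
            self.cache[pid] = image
            self.dirty_in[pid] = self.current

    def _enter_next_third(self) -> None:
        nxt = (self.current + 1) % 3
        # Pages whose newest logged image lives in the third about to be
        # overwritten must be forced to their home locations first.
        for pid in [p for p, t in self.dirty_in.items() if t == nxt]:
            self.home[pid] = self.cache[pid]
            del self.dirty_in[pid]
        self.thirds[nxt] = []
        self.current = nxt

    def recover(self) -> Dict[int, bytes]:
        """Redo surviving records, oldest third first, over the home pages."""
        recovered = dict(self.home)
        for i in (1, 2, 3):
            for record in self.thirds[(self.current + i) % 3]:
                recovered.update(record)
        return recovered
```

A page that keeps being updated is re-logged in newer thirds, so it is
never forced home by the rotation; only pages that have gone cold are.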

4. Group commit

   - All metadata updates within a group commit period make up one
     log record, which can potentially be pretty long.

   - Improves performance at the cost of some uncertainty as to when
     modifications to the file system become permanent.

   - Helps with "hot spots".  FNT updates concern only a few pages.
     Each group commit period is a single log record.  If committing
     often, each update on the same page would be a separate log
     record entry -- so too many I/Os and too fast log consumption.
     By increasing the group commit period, you can summarize many
     updates on the same pages into a small log record and write it
     once ==>> Slow log consumption and few I/Os.

   - Combination of logging and group commit reduces the number of
     I/Os for metadata by a factor of about 3 -- or 2.34 for all I/Os.

   - Log records vary in size depending on activity and the group
     commit period.
	- Smallest record (one logged page) is 7 sectors long (5 + 2*1).
	- Typical log record under high load:  33 sectors (5 + 2*14)
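
A small worked example of why coalescing helps.  The size formula
5 + 2*n is read off the figures above (7 for one page, 33 for 14
pages); the function names and the update stream are hypothetical.

```python
from typing import Dict, List, Tuple

LOG_OVERHEAD_SECTORS = 5   # fixed per-record cost, from 5 + 2*14 = 33
COPIES_PER_PAGE = 2        # each logged page is written twice


def record_sectors(pages_logged: int) -> int:
    """Sectors consumed by one log record carrying `pages_logged` pages."""
    return LOG_OVERHEAD_SECTORS + COPIES_PER_PAGE * pages_logged


def group_commit(updates: List[Tuple[int, bytes]]) -> Dict[int, bytes]:
    """Coalesce one period's updates: the last image of each page wins."""
    record: Dict[int, bytes] = {}
    for page_id, image in updates:
        record[page_id] = image
    return record


# Six updates that hit only two "hot" FNT pages (made-up workload).
updates = [(10, b"a"), (10, b"b"), (11, b"c"), (10, b"d"), (11, b"e"), (10, b"f")]
one_record_each = len(updates) * record_sectors(1)
grouped = record_sectors(len(group_commit(updates)))
print(one_record_each, grouped)
```

Under these assumptions, committing each update separately consumes
6 * 7 = 42 log sectors in six writes, while one group commit consumes
5 + 2*2 = 9 sectors in a single write.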

5. Free pages

   - VAM is a bitmap kept in volatile memory.  It can be reconstructed
     from the FNT at any time.  If properly saved, it can be just
     read from disk at boot time.

   - One issue is that we cannot mark pages free in the VAM before
     the file delete is committed in the FNT on disk.
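
A sketch of both points, assuming each FNT entry carries the file's
run table as (start, length) pairs.  The `Volume` class, its method
names, and the `pending_free` list are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Run = Tuple[int, int]   # (start page, length in pages)


def rebuild_vam(fnt: Dict[str, List[Run]], disk_pages: int) -> List[bool]:
    """Return a bitmap in which True means the page is free."""
    free = [True] * disk_pages
    for runs in fnt.values():
        for start, length in runs:
            for page in range(start, start + length):
                free[page] = False
    return free


class Volume:
    def __init__(self, fnt: Dict[str, List[Run]], disk_pages: int) -> None:
        self.fnt = dict(fnt)
        self.vam = rebuild_vam(self.fnt, disk_pages)
        self.pending_free: List[Run] = []

    def delete(self, name: str) -> None:
        # The delete is only in the cached FNT so far; the pages must
        # not be reused until the delete has been committed.
        self.pending_free.extend(self.fnt.pop(name))

    def on_commit(self) -> None:
        # The delete is now durable, so the pages can be marked free.
        for start, length in self.pending_free:
            for page in range(start, start + length):
                self.vam[page] = True
        self.pending_free.clear()
```

If pages were freed at delete time, a crash before the commit could
leave a still-existing file pointing at pages already handed to
another file.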

6. Page allocator

   - Splits the disk into small-file and large-file areas to avoid
     fragmentation.

7. File open

   - File open does not usually require an I/O (since the FNT is 
     cached).

   - First disk access is most often for the first data page -- leader
     page comes in as well.

8. Robustness

   - Labels are checked on every I/O, so they are more effective.  However,
	- they were not used in their full potential, and
	- rarely were there label errors due to incorrect software

     In short, getting rid of them doesn't cost much.

   - FSD is robust against six additional types of errors, thanks to:
	- Atomic B-tree updates
	- Log writes two copies of pages
 	- FNT replicated

9. Recovery

   - VAM can be reconstructed from FNT.

   - FNT and leaders reconstructed from the log.

   - No need to scavenge, which is time consuming.

Performance Analysis
--------------------

   - Compute expected average time for typical file operations
	- create
	- delete
	- list
	- open
	- recover from disk error

   - Incorporates caching
	- Cache always hits if the information is small
 	- Always hits except for leaves of the FNT B-tree
	- Hits for leaves modeled with a probability distribution

   - Capture rotational delays, seeks, etc.
	- Script each individual operation
	- Validated by estimating performance for CFS, 4.3BSD

   - Ignores CPU time
        - Dragon will have fast CPU

Performance
-----------

   - FSD offers speedups from 1.0 (for reading a page, which is as
     fast as in the CFS case) to 100+ (for recovery) over CFS

   - FSD reduces number of I/Os by factor 3-6 over CFS

   - FSD reduces number of I/Os by factor 2-3 over 4.3BSD

   - What can we say about the FSD to BSD comparison over %CPU
     and %bandwidth?