Reimplementing the Cedar File System using Logging and Group Commit
Hagmann

A lot of activity at Xerox PARC from 1979 to 1987: Alto, Pilot, Cedar.
D-machines for Cedar: Dorado, Dandelion, Dragon.

Cedar File System (CFS)
- Uniform, hierarchical namespace
- Complete file name consists of a server, root directory, zero or more
  subdirectories, a simple name, and a version.
- File Name Table (FNT) is a B-tree mapping names to File Headers (similar to
  inodes). The File Name Table is an index to the entire name space.

Labels (per-page) contain:
- uid: a unique identifier for the file that this page belongs to
- page number: the logical page number within the file
- page type (header, free, data)

The page is completely defined by its label. A physical disk address is only
a hint; it has to be checked against the label to make sure the disk sector
accessed is actually the intended one.

One can access a page by its label alone; this is slow, though, as it may
require searching the entire disk. The other way is to use the physical disk
address, if the system knows one. The label still has to be checked to make
sure the page is the right one. One cannot access a page just by its physical
disk location.

CFS formats:

FNT
  text name
  version
  keep
  uid (unique file identifier)
  header page 0 disk address (hint -- needs uid and page number to succeed)

Header
  text name
  version
  keep
  create time
  byte size
  run table (describes the extents in the file)

Label
  uid
  page number
  page type (data, header, free)

Labels + Headers contain all file system information. FNT is redundant --
used only for fast lookup. Can reconstruct FNT by looking at all labels and
identifying all headers.

Notice that headers contain file names, which is different from UFS i-nodes.
This is because in CFS there can be *only* one name for a file.

How does CFS make the disk a robust medium?
-------------------------------------------

Classes of errors:
1. Software errors: Memory smashes and wild writes
2. Hardware errors: Disk page errors

- Scavenging -- can reconstruct file system structure if software bugs
  corrupt memory state
- No wild writes -- a software bug cannot overwrite a page by accident.
- Some replication -- state can survive disk page errors
  e.g., FNT, headers, leaders

Mechanism:
- Disk pages are always accessed with a (label, physical disk address) pair
- Label is checked each time. Accidental overwriting of a page is quite
  unlikely!

Hints, such as the Volume Allocation Map (VAM), can be reconstructed from the
labels on the disk (recovery process). If hints are wrong, they do not
jeopardize correctness. The File Name Table can also be considered a set of
hints, as it can be reconstructed by looking up all headers.

However, inconsistency is possible in CFS because of lack of atomicity.
Multi-page B-tree updates may corrupt the structure. Even so, we don't lose
any files: by looking at the pages, we can find all header pages.

CFS does not verify the run table during scavenging. It could run a check on
each run table segment to make sure that all extents are valid (in case a
software error has corrupted the run table).

Design decisions in FSD
-----------------------

1. FNT is now atomically updated, stored on disk, and replicated.
   - No longer a hint; it has to survive failures.
   - Locality (names and properties in the same structure)
   - FNT written *twice*. On a read, both copies are read and checked.

2. Leader pages are used to check FNT information. Can guarantee FNT
   integrity.

3. Log-based recovery
   - Data spread over the disk can be logically and atomically updated with
     a "single" disk write to the log.
   - Writes to the buffers are delayed in anticipation of a further update
     to the page (a "hot spot") -- another benefit is that the writes can be
     done at a more convenient time.
   - Synchronous writes (as in UFS) result in
     - More writes
     - Little locality
   - Log in FSD is used for FNT and leader updates
     - Redo log. Logs entire pages.
   - Log is not used for VAM or data.
     - Wouldn't result in benefit for data -- files are versioned
   - Error model: at most one or two adjacent sectors are lost at a time
     (per I/O)
   - Log record format:

       | Header | Blank | Header | ...data... | end | ...data... | end |

     Replication guards against disk errors while writing to the log.
   - Log is separated into thirds. When entering a new third, we need to
     look at the cache and see whether any of the dirty pages has its latest
     update in the about-to-be-overwritten third. If so, we need to force
     those pages to disk. Due to high locality in the FNT, the number of FNT
     entries that have to be written this way is usually nearly zero: dirty
     FNT entries are most likely logged in the newer thirds. The only times
     pages are actually written to their disk locations are when entering a
     new third and during crash recovery.
   - Write-ahead logging property? (Necessary with group commit!)

4. Group commit
   - All metadata updates within a group commit period make up one log
     record, which can potentially be pretty long.
   - Improves performance, at the price of uncertainty as to when some
     modifications to the file system become permanent.
   - Helps with "hot spots". FNT updates concern only a few pages. Each
     group commit period is a single log record. If committing often, each
     update on the same page would be a separate log record entry -- so too
     many I/Os and too fast log consumption. By increasing the group commit
     period, you can summarize many updates on the same pages into a small
     log record and write it once
     ==>> Slow log consumption and few I/Os.
   - Combination of logging and group commit reduces the number of I/Os for
     metadata by a factor of about 3 -- or 2.34 for all I/Os.
   - Log records vary in size depending on activity and the group commit
     period.
   - Smallest record (one page) is 7 pages long.
   - Typical log record under high load: 33 sectors (5 + 2*14)

5. Free pages
   - VAM is a bitmap kept in volatile memory. It can be reconstructed from
     the FNT at any time. If properly saved, it can simply be read from disk
     at boot time.
   - One issue is that we cannot mark pages free in the VAM before the file
     delete is committed in the FNT on disk.

6. Page allocator
   - Separates the disk into small and large areas to avoid fragmentation.

7. File open
   - File open does not usually require an I/O (since the FNT is cached).
   - First disk access is most often for the first data page -- leader page
     comes in as well.

8. Robustness
   - Labels are checked on every I/O, which makes them more effective.
     However,
     - they were not used to their full potential, and
     - label errors due to incorrect software were rare.
     In short, getting rid of them doesn't cost much.
   - FSD is robust against six additional types of errors:
     - Atomic B-tree updates
     - Log writes two copies of pages
     - FNT replicated

9. Recovery
   - VAM can be reconstructed from FNT.
   - FNT and leaders reconstructed from the log.
   - No need to scavenge, which is time consuming.

-- Performance Analysis
- Compute expected average time for typical file operations
  - create
  - delete
  - list
  - open
  - recover from disk error
- Incorporates caching
  - Caches hit if information is small
  - Hit except for leaves of FNT B-tree
  - Hits for leaves modeled after probability distribution
- Capture rotational delays, seeks, etc.
- Script each individual operation
- Validated by estimating performance for CFS, 4.3BSD
- Ignores CPU time
  - Dragon will have fast CPU

-- Performance
- FSD offers speedups from 1.0 (for reading a page, which is as fast as in
  the CFS case) to 100+ (for recovery) over CFS
- FSD reduces number of I/Os by a factor of 3-6 over CFS
- FSD reduces number of I/Os by a factor of 2-3 over 4.3BSD
- What can we say about the FSD to BSD comparison over %CPU and %bandwidth?
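
Illustration (not from the paper): the "hot spot" argument in the group commit
notes can be made concrete with a small Python sketch. It is a toy model under
stated assumptions, not FSD's actual code. The class name GroupCommitLog, its
methods, and the page names are hypothetical. The record size of 5 + 2*n
sectors for n logged pages is read off the record format and the
"33 sectors (5 + 2*14)" figure in these notes.

```python
from dataclasses import dataclass, field


@dataclass
class GroupCommitLog:
    """Toy redo log that coalesces page updates within one commit period."""

    pending: dict = field(default_factory=dict)  # page id -> latest page image
    records: list = field(default_factory=list)  # committed log records

    def update(self, page_id, image):
        # A later update to the same page in the same commit period simply
        # replaces the earlier image; nothing is written yet.
        self.pending[page_id] = image

    def commit(self):
        # End of the commit period: one log record holding the latest image
        # of each dirty page. Returns the record size in sectors.
        if not self.pending:
            return 0
        record = dict(self.pending)
        self.records.append(record)
        self.pending.clear()
        # Assumed layout: 5 overhead sectors plus two copies of each page.
        return 5 + 2 * len(record)


log = GroupCommitLog()
for i in range(10):
    log.update("fnt-leaf", f"image {i}")   # ten updates to one hot FNT page
log.update("leader", "leader image")       # one update to a second page
print(log.commit())                         # 9 sectors: 5 + 2*2
```

Under the same assumed layout, committing each of the eleven updates on its
own would write eleven 7-sector records (77 sectors), which is the "too many
I/Os and too fast log consumption" case described under group commit.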