42 Crash Consistency: FSCK and Journaling

One major challenge faced by a file system is how to update persistent data structures despite the presence of a power loss or system crash.

Crash-consistency problem: the on-disk state may be only partially updated when a system crash or power loss interrupts a sequence of writes

42.1 A Detailed Example

Crash Scenarios

Possible outcomes if a crash occurs while a data block is appended to a file, an operation that requires three writes: the data block itself (with its contents), the data bitmap, and the file’s inode (a small classifier sketch follows the list):

  • only the data block is written: no problem at all (from the perspective of file-system crash consistency, the write is simply lost)
  • only the updated inode is written: reading the new block returns garbage data; also a file-system inconsistency, since the inode and the bitmap disagree about whether the block is allocated
  • only the bitmap is written: space leak (the block is marked used, but no inode points to it)
  • only the inode and the bitmap are written: metadata is consistent, but reading the file returns garbage data
  • only the inode and the data block are written: inconsistency between the inode and the bitmap
  • only the bitmap and the data block are written: space leak again
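
A minimal sketch (not from the text; all names are made up) that enumerates which subset of the three writes reaches the disk before a crash and classifies each resulting on-disk state, mirroring the list above:

```python
from itertools import combinations

# The three writes needed to append a data block to a file.
WRITES = ("data block", "inode", "bitmap")

def classify(survived):
    """Classify the on-disk state given which writes completed before the crash."""
    data, inode, bitmap = (w in survived for w in WRITES)
    if inode and not bitmap:
        return ("inconsistent: inode uses a block the bitmap says is free"
                + ("" if data else "; reading the file returns garbage"))
    if bitmap and not inode:
        return "space leak: bitmap marks the block used, but no inode points to it"
    if inode and bitmap and not data:
        return "metadata consistent, but the file points at garbage data"
    if data and not inode:
        return "no problem: the data write is simply lost"
    return "fully consistent"

for k in range(len(WRITES) + 1):
    for survived in combinations(WRITES, k):
        print(f"{survived or '(nothing)'}: {classify(set(survived))}")
```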

The Crash Consistency Problem

Crash-consistency problem (or consistent-update problem): the disk commits only one write at a time, so an update that spans several structures cannot be applied atomically; a crash or power loss can leave the on-disk state in any intermediate combination

42.2 Solution #1: The File System Checker

Early systems’ approach: let inconsistencies happen and fix them later, when rebooting (using a tool such as fsck)

fsck’s phases:

  • check superblock
  • scan inodes, indirect blocks, etc. to build a consistent version of the allocation bitmaps (the inodes are trusted)
  • check inodes for corruption and other problems
  • verify link count
  • check for duplicate pointers (two different inodes referring to the same block); resolved by clearing one of the inodes or by giving each inode its own copy of the block
  • check for bad block pointers (and remove them)
  • check directories

Fundamental problem with fsck: it is too slow; scanning an entire large disk volume can take many minutes or hours

At a higher level, the basic premise of fsck seems just a tad irrational. Consider our example above, where just three blocks are written to the disk; it is incredibly expensive to scan the entire disk to fix problems that occurred during an update of just three blocks.

42.3 Solution #2: Journaling (or Write-Ahead Logging)

Write-ahead logging (or journaling): before overwriting the structures in place, first write a note (the log entry) describing the update, so that after a crash the update can be replayed (or safely ignored) instead of scanning the whole disk

Data Journaling

Journal write:

  • transaction begin block
  • blocks containing the exact contents of the blocks being updated (physical logging), or a more compact description of the update (logical logging)
  • transaction end block

Checkpointing: overwrite the old structures at their final locations in the file system (done once the transaction is safely in the journal)

Problem if a crash occurs during the write to the journal: if all the log blocks are issued at once, the disk may schedule and complete them in any order, so a crash can leave a seemingly valid transaction (TxB and TxE present) whose middle blocks are missing or garbage (e.g., an invalid data block sandwiched between a valid TxB and TxE).

Solution: write all blocks except TxE first, and issue the TxE write only after they complete; the disk guarantees that the single-sector (512-byte) TxE write is atomic, so the transaction is either entirely in the log or not committed at all.

Performance problem: the transaction-end block can only be issued after all the previous journal writes have completed, adding an extra wait (and often an extra rotation) to every transaction.

Optimizing log writes: include a checksum over the transaction’s contents in the TxB/TxE blocks, so the whole transaction can be issued as a single write; during recovery, a transaction whose checksum does not match is discarded (this optimization is used in Linux ext4).

Three phases:

  • journal write (TxB, metadata and data)
  • journal commit (TxE)
  • checkpoint (write the contents to final on-disk locations)
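
A minimal sketch of the three phases, assuming a hypothetical block-device object with write_block(addr, bytes) and flush() (write barrier) methods; none of these names come from the chapter:

```python
def encode_txb(tid, final_addrs):
    # Hypothetical encoding of the transaction-begin block: the transaction
    # id plus the final on-disk addresses of the logged blocks.
    return f"TxB tid={tid} addrs={final_addrs}".encode()

def encode_txe(tid):
    return f"TxE tid={tid}".encode()

def journal_transaction(disk, journal_addr, tid, updates):
    """Data journaling: log the new contents, commit, then checkpoint.

    `updates` maps final on-disk block addresses to their new contents.
    """
    # Phase 1: journal write -- TxB plus the exact contents of every
    # updated block (physical logging).
    disk.write_block(journal_addr, encode_txb(tid, list(updates)))
    for i, data in enumerate(updates.values(), start=1):
        disk.write_block(journal_addr + i, data)
    disk.flush()  # barrier: all log blocks must be durable before TxE

    # Phase 2: journal commit -- TxE fits in one sector, which the disk
    # writes atomically, so the transaction is either committed or not.
    disk.write_block(journal_addr + len(updates) + 1, encode_txe(tid))
    disk.flush()

    # Phase 3: checkpoint -- overwrite the real file-system structures.
    for addr, data in updates.items():
        disk.write_block(addr, data)
    disk.flush()
```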

Recovery

Replay the transactions that are committed to the disk (redo logging); uncommitted (torn) transactions are simply skipped.
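
A matching recovery sketch (the journal-scanning step is assumed to have already produced transaction records with their logged updates and commit status):

```python
def recover(disk, transactions):
    """Redo logging: replay committed transactions in log order."""
    for tx in transactions:
        if not tx.committed:              # no valid TxE: the crash hit the
            continue                      # journal write, so skip the update
        for addr, data in tx.updates.items():
            disk.write_block(addr, data)  # redo the checkpoint writes
    disk.flush()
```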

Batching Log Updates

Buffer updates in memory and commit them as a single global transaction, to avoid writing the same blocks (e.g., a shared directory’s inode and bitmap) over and over.

Making The Log Finite

Journaling file systems treat the log as a circular data structure (the journal is referred to as circular log).

Some time after a checkpoint completes, mark the corresponding transaction free in the journal by updating the journal superblock (which records the oldest and newest non-freed transactions).

Now the four phases:

  • journal write
  • journal commit
  • checkpoint
  • free

Metadata journaling

Ordered journaling (or metadata journaling): the user data is not written to the journal

One possible order (to avoid garbage data):

  • data write (1)
  • journal metadata write (2)
  • journal commit
  • checkpoint metadata
  • free

(1) and (2) can be issued concurrently, but both must complete before the journal commit is issued
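
A sketch of ordered (metadata) journaling under the same hypothetical disk interface, reusing the encode helpers from the data-journaling sketch above; only metadata enters the journal, and the commit waits for both the data writes and the journal writes:

```python
def ordered_journal_transaction(disk, journal_addr, tid, data_writes, metadata):
    """Ordered journaling: user data bypasses the journal.

    `data_writes` and `metadata` map final on-disk addresses to new contents.
    """
    # (1) Data write: user data goes straight to its final location.
    for addr, data in data_writes.items():
        disk.write_block(addr, data)

    # (2) Journal metadata write: may be issued concurrently with (1).
    disk.write_block(journal_addr, encode_txb(tid, list(metadata)))
    for i, block in enumerate(metadata.values(), start=1):
        disk.write_block(journal_addr + i, block)

    disk.flush()  # both (1) and (2) must be durable before the commit

    # Journal commit.
    disk.write_block(journal_addr + len(metadata) + 1, encode_txe(tid))
    disk.flush()

    # Checkpoint the metadata to its final locations (later: free).
    for addr, block in metadata.items():
        disk.write_block(addr, block)
    disk.flush()
```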

43 Log-structured File Systems

The motivation

  • system memories are growing, so more data can be cached and disk traffic increasingly consists of writes
  • there is a large gap between random and sequential I/O performance
  • existing file systems perform poorly on many common workloads
  • file systems are not RAID-aware

An ideal file system would thus focus on write performance, and try to make use of the sequential bandwidth of the disk. Further, it would perform well on common workloads that not only write out data but also update on-disk metadata structures frequently. Finally, it would work well on RAIDs as well as single disks.

43.1 Writing To Disk Sequentially

The basic idea: writing all updates (such as data blocks, inodes, etc.) to the disk sequentially

43.2 Writing Sequentially And Effectively

Writing sequentially is not enough; to achieve good performance, the file system must issue contiguous writes (or one large write) to the disk.

Write buffering: keep track of updates in memory and, once a sufficient number has accumulated, write them to disk all at once in one long sequential write; the unit written is called a segment.

43.3 How Much To Buffer?

$$ D=\cfrac{F}{1-F}\times R_\mathrm{peak}\times T_\mathrm{position} $$
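
Here $D$ is the amount of data to buffer before writing a segment, $F$ the desired fraction of the peak transfer rate $R_\mathrm{peak}$, and $T_\mathrm{position}$ the positioning (seek plus rotation) overhead paid per write. As an illustrative calculation (the specific numbers are assumptions), with $T_\mathrm{position}=10\,\mathrm{ms}$, $R_\mathrm{peak}=100\,\mathrm{MB/s}$, and a target of $F=0.9$ (90% of peak bandwidth):

$$ D=\cfrac{0.9}{1-0.9}\times 100\,\mathrm{MB/s}\times 0.01\,\mathrm{s}=9\,\mathrm{MB} $$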

43.4 Problem: Finding Inodes

In LFS, inodes are scattered throughout the log and their latest versions keep moving with every update, so they cannot be found at a fixed, well-known location the way FFS finds them.

43.5 Solution Through Indirection: The Inode Map

The inode map: takes an inode number as input and returns the disk address of the most recent version of the inode

LFS places chunks of the inode map right next to where it is writing all of the other new information.

Inode maps are compact enough to keep the active portions cached in main memory: inode map lookups rarely require disk accesses.

(From The Design and Implementation of a Log-Structured File System)
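
A minimal in-memory model (hypothetical; it ignores on-disk layout, segments, and the checkpoint region) of how the imap indirection lets inodes move on every write:

```python
class TinyLFS:
    """Toy model: every update is appended to an in-memory 'log'; the imap
    records where the newest version of each inode lives."""

    def __init__(self):
        self.log = []    # append-only log; the list index acts as a disk address
        self.imap = {}   # inode number -> log address of the newest inode version

    def append(self, block):
        self.log.append(block)
        return len(self.log) - 1              # "disk address" of the new block

    def write_file_block(self, inum, offset, data):
        data_addr = self.append(("data", data))
        # Build a new inode version: copy the old block map (if any) and
        # point `offset` at the freshly written data block.
        old_map = self.log[self.imap[inum]][2] if inum in self.imap else {}
        inode_addr = self.append(("inode", inum, {**old_map, offset: data_addr}))
        self.imap[inum] = inode_addr          # the old inode version is now garbage

    def read_inode(self, inum):
        return self.log[self.imap[inum]]      # one imap lookup, no fixed inode table

fs = TinyLFS()
fs.write_file_block(inum=42, offset=0, data=b"hello")
fs.write_file_block(inum=42, offset=1, data=b"world")
print(fs.read_inode(42))   # ('inode', 42, {0: 0, 1: 2})
```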

43.6 Completing The Solution: The Checkpoint Region

The checkpoint region (or CR) contains pointers to the latest pieces of the inode map; it is only updated periodically for performance

43.7 Reading A File From Disk: A Recap

The steps to read a file (assuming nothing is cached in memory yet):

  • read the checkpoint region, then read and cache the entire inode map it points to
  • when a file is accessed, look up its inode number in the imap to find the inode’s disk address, and read the inode
  • (the rest proceeds much as in a typical UNIX file system, following direct/indirect pointers)

43.8 What About Directories?

Steps:

  • look in the imap for the location of the directory’s inode, and read it
  • read the directory’s data to find the (name, inode number) entry for the file
  • look in the imap for the location of the file’s inode, and read it
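
Continuing the toy model above (still hypothetical), the same lookup expressed in code; note that the directory’s data stores an inode number, not a disk address:

```python
def lookup(fs, dir_inum, name):
    """Resolve `name` inside directory `dir_inum` via two imap lookups."""
    # 1. imap -> location of the directory's inode; read it.
    dir_inode = fs.read_inode(dir_inum)
    # 2. Read the directory's data; assume (for simplicity) that its first
    #    data block holds a {name: inode number} dictionary.
    _, entries = fs.log[dir_inode[2][0]]
    file_inum = entries[name]
    # 3. imap -> location of the file's inode; read it.
    return fs.read_inode(file_inum)

fs = TinyLFS()
fs.write_file_block(inum=7, offset=0, data=b"file contents")  # the file "foo"
fs.write_file_block(inum=2, offset=0, data={"foo": 7})        # the directory
print(lookup(fs, dir_inum=2, name="foo"))                     # ('inode', 7, {0: 0})
```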

The imap also solves the recursive update problem through indirection: because directories map names to inode numbers rather than disk addresses, moving a file’s inode only requires updating the imap entry, never the directories above it.

43.9 A New Problem: Garbage Collection

The cleaner preserves large contiguous free regions by reading in a number of partially-dead old segments, compacting their live blocks into a smaller number of new segments, and then freeing the old segments for reuse.

43.10 Determining Block Liveness

Use a segment summary block at the head of each segment to record, for every data block in the segment, its inode number and offset (which block of the file it is).

To determine whether a block is dead:

  • look in the segment summary block and find its inode number and offset
  • look in the imap to find and read the inode
  • look in the inode using the offset to compare the disk location
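
A sketch of this liveness check against the toy model above; the segment summary is represented as a plain dictionary from block address to (inode number, offset):

```python
def block_is_live(fs, segment_summary, block_addr):
    """Check whether the data block at `block_addr` is still in use."""
    # Step 1: the segment summary records (inode number, offset)
    # for every data block in the segment.
    inum, offset = segment_summary[block_addr]
    # Step 2: look in the imap to find and read the latest inode.
    inode = fs.read_inode(inum)
    # Step 3: does that inode still point at this address for that offset?
    return inode[2].get(offset) == block_addr

fs = TinyLFS()
fs.write_file_block(inum=9, offset=0, data=b"v1")   # data block lands at address 0
fs.write_file_block(inum=9, offset=0, data=b"v2")   # overwrite: address 0 is now dead
summary = {0: (9, 0), 2: (9, 0)}
print(block_is_live(fs, summary, 0), block_is_live(fs, summary, 2))  # False True
```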

LFS can also record a version number (in the imap and in the segment) so that blocks of deleted or truncated files can be identified as dead without these extra reads.

43.11 A Policy Question: Which Blocks To Clean, And When?

The time to clean can be:

  • periodically
  • during idle time
  • when the disk is full

43.12 Crash Recovery And The Log

Use two CRs, updated alternately, each bracketed by timestamps, so that recovery can detect a crash that occurred in the middle of a CR update and fall back to the other copy.

The last checkpoint may be quite old, as LFS writes the CR only every 30 seconds or so; roll forward: start from the last checkpointed segment, read through the subsequent segments at the end of the log, and apply any valid updates found there to bring the file system up to date.
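
A sketch of how recovery can choose between the two checkpoint regions, assuming (as a simplification) that each CR update writes a timestamp before and after its body, so a torn CR write is detectable and the newest intact copy wins:

```python
def choose_checkpoint_region(cr_a, cr_b):
    """Pick the newest CR whose bracketing timestamps match (i.e., whose
    last update was not interrupted by a crash)."""
    def intact(cr):
        return cr is not None and cr["ts_head"] == cr["ts_tail"]
    candidates = [cr for cr in (cr_a, cr_b) if intact(cr)]
    # The two CRs are updated alternately, so at least one should be intact.
    return max(candidates, key=lambda cr: cr["ts_head"])

old = {"ts_head": 100, "ts_tail": 100, "imap_ptrs": []}
torn = {"ts_head": 130, "ts_tail": 100, "imap_ptrs": []}   # crash mid-update
print(choose_checkpoint_region(old, torn) is old)          # True
```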

44 Flash-based SSDs

Solid-state storage: no mechanical or moving parts, random access, and retains information despite power loss

44.1 Storing a Single Bit

  • single-level cell (SLC) flash: a single bit per transistor
  • multi-level cell (MLC) flash: two bits per transistor (00, 01, 10, 11 represented by different levels of charge)
  • triple-level cell (TLC) flash: three bits per cell

SLC chips achieve higher performance and are more expensive.

44.2 From Bits to Banks/Planes

Flash chips are organized into banks or planes which consist of a large number of cells.

A bank is accessed in two different sized units:

  • block (or erase block): typically of size 128KB or 256KB
  • page: a few KB in size (e.g. 4KB)

44.3 Basic Flash Operations

Three low-level operations:

  • read (a page): random access; fast, 10s of $\mathrm{\mu s}$ or so
  • erase (a block): sets every bit in the block to 1; quite expensive, taking a few ms; any live data the client cares about in the block must be copied elsewhere before the erase
  • program (a page): After a block is erased, change some of the 1’s within a page to 0’s; usually taking around $100\mathrm{\mu s}$
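
A toy model (not a real device interface) that captures the rules these operations impose: reads are page-granular and cheap, a page can only be programmed after its block has been erased, and erasing always wipes a whole block:

```python
PAGES_PER_BLOCK = 4   # real blocks hold far more pages; kept small for illustration

class FlashBlock:
    """Toy erase block: each page is INVALID, ERASED, or VALID (programmed)."""

    def __init__(self):
        self.state = ["INVALID"] * PAGES_PER_BLOCK   # unknown initial contents
        self.data = [None] * PAGES_PER_BLOCK

    def erase(self):
        # Expensive (a few ms): sets the whole block to all 1s.
        self.state = ["ERASED"] * PAGES_PER_BLOCK
        self.data = [None] * PAGES_PER_BLOCK

    def program(self, page, data):
        # ~100us: allowed only once per page after an erase.
        if self.state[page] != "ERASED":
            raise RuntimeError("program requires an erased page")
        self.state[page] = "VALID"
        self.data[page] = data

    def read(self, page):
        # Fast (tens of microseconds), random access.
        return self.data[page]
```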

44.4 Flash Performance And Reliability

Wear out: each erase/program cycle accrues a little extra charge in the cells, making it increasingly difficult to differentiate a 0 from a 1; eventually the block becomes unusable.

The typical lifetime:

  • MLC: 10,000 P/E (Program/Erase) cycles
  • SLC: 100,000 P/E cycles
  • actual lifetimes observed in practice tend to be much longer than the rated figures

Disturbance: some bits get flipped in neighboring pages when accessing (read/program) a particular page

44.5 From Raw Flash to Flash-Based SSDs

Flash translation layer (or FTL): translates read and write requests on logical blocks into low-level read, erase and program commands on underlying physical blocks and physical pages

The FTL should accomplish this task with the goal of delivering excellent performance and high reliability.

To achieve performance:

  • exploit parallelism (e.g., use multiple flash chips in parallel)
  • reduce write amplification

Write amplification: total write traffic issued to the flash chips (by the FTL) divided by the total write traffic issued to the SSD (by the client)
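
Restated as a formula:

$$ \text{write amplification}=\cfrac{\text{total write traffic issued to the flash chips (by the FTL)}}{\text{total write traffic issued to the SSD (by the client)}} $$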

To achieve high reliability:

  • for wear out: wear leveling
  • for program disturbance: a sequential-programming approach

44.6 FTL Organization: A Bad Approach

Direct mapped organization: a read to a logical page is mapped directly to the physical page with the same page number

The problems with the direct-mapped FTL:

  • performance: each write needs to
    • read in the entire block (costly)
    • erase the entire block (quite costly)
    • program the block (costly)
  • reliability: the same block can be erased and programmed over and over (e.g., blocks holding file-system metadata are overwritten repeatedly), so it wears out quickly

The direct mapped approach simply gives too much control over wear out to the client workload; if the workload does not spread write load evenly across its logical blocks, the underlying physical blocks containing popular data will quickly wear out.

44.7 A Log-Structured FTL

Use a mapping table to map the logical block to the physical address
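
A minimal sketch of a page-mapped, log-structured FTL built on the toy FlashBlock above; free-space accounting and garbage collection are omitted, and all names are hypothetical:

```python
class LogStructuredFTL:
    """Writes always go to the next free page of the current log block;
    the mapping table records where each logical block last landed."""

    def __init__(self, num_blocks):
        self.blocks = [FlashBlock() for _ in range(num_blocks)]
        for b in self.blocks:
            b.erase()                        # start with erased blocks
        self.mapping = {}                    # logical block -> (block, page)
        self.cur_block, self.cur_page = 0, 0

    def write(self, logical, data):
        if self.cur_page == PAGES_PER_BLOCK:     # current log block is full
            self.cur_block += 1                  # claim the next erased block
            self.cur_page = 0                    # (a real FTL would GC here)
        self.blocks[self.cur_block].program(self.cur_page, data)
        # Any previous physical location of `logical` becomes garbage
        # that the cleaner must eventually reclaim.
        self.mapping[logical] = (self.cur_block, self.cur_page)
        self.cur_page += 1

    def read(self, logical):
        block, page = self.mapping[logical]
        return self.blocks[block].read(page)

ftl = LogStructuredFTL(num_blocks=8)
ftl.write(100, b"a1")
ftl.write(100, b"a2")        # overwrite goes to a new page; the old page is garbage
print(ftl.read(100))         # b'a2'
```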

The log-based approach improves performance because:

  • expensive erases are required only once in a while (when claiming a fresh block of log space), not on every write
  • writes are spread evenly across pages and blocks, which also provides wear leveling

Downsides:

  • garbage collection of old, dead versions drives up write amplification
  • in-memory mapping tables can be large (one entry per page)

44.8 Garbage Collection

To reduce GC costs, SSDs overprovision: extra flash capacity is added so that cleaning can be delayed and pushed to the background, performed when the device is less busy.

44.9 Mapping Table Size

Block-Based Mapping

Keep one mapping entry per (much larger) block instead of per page, shrinking the table; the downside is that a small write forces the FTL to read the live pages of the old block and copy them into a newly written block.

Hybrid Mapping

Two types of mapping table:

  • log table (per-page mappings)
  • data table (per-block mappings)

To read:

  • first find in the log table
  • if not found, find in the data table
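
A sketch of that read path (the table contents are hypothetical; PAGES_PER_BLOCK reuses the constant from the toy flash model above):

```python
def hybrid_read(log_table, data_table, logical_page):
    """Hybrid FTL lookup: exact per-page mappings first, then per-block."""
    if logical_page in log_table:           # recently written pages
        return log_table[logical_page]      # -> exact physical page number
    # Fall back to the data table: map the containing logical block to a
    # physical block and keep the page offset within that block.
    logical_block, offset = divmod(logical_page, PAGES_PER_BLOCK)
    return data_table[logical_block] * PAGES_PER_BLOCK + offset

log_table = {2001: 55}                      # logical page -> physical page
data_table = {500: 12}                      # logical block -> physical block
print(hybrid_read(log_table, data_table, 2001))  # 55 (hit in the log table)
print(hybrid_read(log_table, data_table, 2002))  # 50 = 12 * 4 + 2 (via the data table)
```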

The key to the hybrid mapping strategy is keeping the number of log blocks small. To do that, the FTL has to periodically examine log blocks and switch them into blocks that can be pointed to by only a single block pointer.

Three types of merge:

  • switch merge: the log block ends up containing all the pages of one logical block, in order, so its per-page pointers can be replaced by a single block pointer (the cheapest case)
  • partial merge: the remaining live pages are copied from the old data block into the log block, which is then switched in; the extra I/O increases write amplification
  • full merge: the pages of a logical block are scattered across several log blocks, so the FTL must gather them together into a fresh data block (the most expensive case)

Page Mapping Plus Caching

Keep the full page-level mapping but cache only its active parts in memory; if the workload has locality, this achieves page-mapped performance with a small memory footprint, at the cost of extra flash reads (and possible evictions of dirty entries) on a miss.

44.10 Wear Leveling

One problem with the log-structured approach: blocks filled with long-lived, cold data never get overwritten, so they never get erased while other blocks absorb all the wear; solution: periodically read such blocks and rewrite their contents elsewhere so that all blocks age evenly (at the cost of some extra write amplification).