42 Crash Consistency: FSCK and Journaling

One major challenge faced by a file system is how to update persistent data structures despite the presence of a power loss or system crash.

Crash-consistency problem: the on-disk state may be only partially updated when a system crash or power loss interrupts a sequence of writes

42.1 A Detailed Example

Crash Scenarios

Possible outcomes if a crash occurs while a data block is appended to a file, an operation that requires three writes: the data block itself (with its contents), the data bitmap, and the file’s inode (a small classifier sketch follows the list):

  • only the data block is written: no problem at all (from the perspective of file-system crash consistency, the write is simply lost)
  • only the updated inode is written: reading the new block returns garbage data; also a file-system inconsistency, since the inode and the bitmap disagree about whether the block is allocated
  • only the bitmap is written: space leak (the block is marked used, but no inode points to it)
  • only the inode and the bitmap are written: metadata is consistent, but reading the file returns garbage data
  • only the inode and the data block are written: inconsistency between the inode and the bitmap
  • only the bitmap and the data block are written: space leak again
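
A minimal sketch (not from the text; all names are made up) that enumerates which subset of the three writes reaches the disk before a crash and classifies each resulting on-disk state, mirroring the list above:

```python
from itertools import combinations

# The three writes needed to append a data block to a file.
WRITES = ("data block", "inode", "bitmap")

def classify(survived):
    """Classify the on-disk state given which writes completed before the crash."""
    data, inode, bitmap = (w in survived for w in WRITES)
    if inode and not bitmap:
        return ("inconsistent: inode uses a block the bitmap says is free"
                + ("" if data else "; reading the file returns garbage"))
    if bitmap and not inode:
        return "space leak: bitmap marks the block used, but no inode points to it"
    if inode and bitmap and not data:
        return "metadata consistent, but the file points at garbage data"
    if data and not inode:
        return "no problem: the data write is simply lost"
    return "fully consistent"

for k in range(len(WRITES) + 1):
    for survived in combinations(WRITES, k):
        print(f"{survived or '(nothing)'}: {classify(set(survived))}")
```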

The Crash Consistency Problem

Crash-consistency problem (or consistent-update problem): the disk commits only one write at a time, so an update that spans several structures cannot be applied atomically; a crash or power loss can leave the on-disk state in any intermediate combination

42.2 Solution #1: The File System Checker

Early systems’ approach: let inconsistencies happen and fix them later, when rebooting (using a tool such as fsck)

fsck’s phases:

  • check superblock
  • scan inodes, indirect blocks, etc. to build a consistent version of the allocation bitmaps (the inodes are trusted)
  • check inodes for corruption and other problems
  • verify link count
  • check for duplicate pointers (two different inodes referring to the same block); resolved by clearing one of the inodes or by giving each inode its own copy of the block
  • check for bad block pointers (and remove them)
  • check directories

Fundamental problem with fsck: it is too slow; scanning an entire large disk volume can take many minutes or hours

At a higher level, the basic premise of fsck seems just a tad irrational. Consider our example above, where just three blocks are written to the disk; it is incredibly expensive to scan the entire disk to fix problems that occurred during an update of just three blocks.

42.3 Solution #2: Journaling (or Write-Ahead Logging)

Write-ahead logging (or journaling): before overwriting the structures in place, first write a note (the log entry) describing the update, so that after a crash the update can be replayed (or safely ignored) instead of scanning the whole disk

Data Journaling

Journal write:

  • transaction begin block
  • blocks containing the exact contents of the blocks being updated (physical logging), or a more compact description of the update (logical logging)
  • transaction end block

Checkpointing: overwrite the old structures at their final locations in the file system (done once the transaction is safely in the journal)

Problem if a crash occurs during the write to the journal: if all the log blocks are issued at once, the disk may schedule and complete them in any order, so a crash can leave a seemingly valid transaction (TxB and TxE present) whose middle blocks are missing or garbage (e.g., an invalid data block sandwiched between a valid TxB and TxE).

Solution: write all blocks except TxE first, and issue the TxE write only after they complete; the disk guarantees that the single-sector (512-byte) TxE write is atomic, so the transaction is either entirely in the log or not committed at all.

Performance problem: the transaction-end block can only be issued after all the previous journal writes have completed, adding an extra wait (and often an extra rotation) to every transaction.

Optimizing log writes: include a checksum over the transaction’s contents in the TxB/TxE blocks, so the whole transaction can be issued as a single write; during recovery, a transaction whose checksum does not match is discarded (this optimization is used in Linux ext4).

Three phases:

  • journal write (TxB, metadata and data)
  • journal commit (TxE)
  • checkpoint (write the contents to final on-disk locations)
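
A minimal sketch of the three phases, assuming a hypothetical block-device object with write_block(addr, bytes) and flush() (write barrier) methods; none of these names come from the chapter:

```python
def encode_txb(tid, final_addrs):
    # Hypothetical encoding of the transaction-begin block: the transaction
    # id plus the final on-disk addresses of the logged blocks.
    return f"TxB tid={tid} addrs={final_addrs}".encode()

def encode_txe(tid):
    return f"TxE tid={tid}".encode()

def journal_transaction(disk, journal_addr, tid, updates):
    """Data journaling: log the new contents, commit, then checkpoint.

    `updates` maps final on-disk block addresses to their new contents.
    """
    # Phase 1: journal write -- TxB plus the exact contents of every
    # updated block (physical logging).
    disk.write_block(journal_addr, encode_txb(tid, list(updates)))
    for i, data in enumerate(updates.values(), start=1):
        disk.write_block(journal_addr + i, data)
    disk.flush()  # barrier: all log blocks must be durable before TxE

    # Phase 2: journal commit -- TxE fits in one sector, which the disk
    # writes atomically, so the transaction is either committed or not.
    disk.write_block(journal_addr + len(updates) + 1, encode_txe(tid))
    disk.flush()

    # Phase 3: checkpoint -- overwrite the real file-system structures.
    for addr, data in updates.items():
        disk.write_block(addr, data)
    disk.flush()
```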

Recovery

Replay the transactions that are committed to the disk (redo logging); uncommitted (torn) transactions are simply skipped.
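
A matching recovery sketch (the journal-scanning step is assumed to have already produced transaction records with their logged updates and commit status):

```python
def recover(disk, transactions):
    """Redo logging: replay committed transactions in log order."""
    for tx in transactions:
        if not tx.committed:              # no valid TxE: the crash hit the
            continue                      # journal write, so skip the update
        for addr, data in tx.updates.items():
            disk.write_block(addr, data)  # redo the checkpoint writes
    disk.flush()
```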

Batching Log Updates

Buffer updates in memory and commit them as a single global transaction, to avoid writing the same blocks (e.g., a shared directory’s inode and bitmap) over and over.

Making The Log Finite

Journaling file systems treat the log as a circular data structure (the journal is referred to as circular log).

Some time after a checkpoint completes, mark the corresponding transaction free in the journal by updating the journal superblock (which records the oldest and newest non-freed transactions).

Now the four phases:

  • journal write
  • journal commit
  • checkpoint
  • free

Metadata journaling

Ordered journaling (or metadata journaling): the user data is not written to the journal

One possible order (to avoid garbage data):

  • data write (1)
  • journal metadata write (2)
  • journal commit
  • checkpoint metadata
  • free

(1) and (2) can be issued concurrently, but both must complete before the journal commit is issued
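
A sketch of ordered (metadata) journaling under the same hypothetical disk interface, reusing the encode helpers from the data-journaling sketch above; only metadata enters the journal, and the commit waits for both the data writes and the journal writes:

```python
def ordered_journal_transaction(disk, journal_addr, tid, data_writes, metadata):
    """Ordered journaling: user data bypasses the journal.

    `data_writes` and `metadata` map final on-disk addresses to new contents.
    """
    # (1) Data write: user data goes straight to its final location.
    for addr, data in data_writes.items():
        disk.write_block(addr, data)

    # (2) Journal metadata write: may be issued concurrently with (1).
    disk.write_block(journal_addr, encode_txb(tid, list(metadata)))
    for i, block in enumerate(metadata.values(), start=1):
        disk.write_block(journal_addr + i, block)

    disk.flush()  # both (1) and (2) must be durable before the commit

    # Journal commit.
    disk.write_block(journal_addr + len(metadata) + 1, encode_txe(tid))
    disk.flush()

    # Checkpoint the metadata to its final locations (later: free).
    for addr, block in metadata.items():
        disk.write_block(addr, block)
    disk.flush()
```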

43 Log-structured File Systems

The motivation

  • system memories are growing, so more data can be cached and disk traffic increasingly consists of writes
  • there is a large gap between random and sequential I/O performance
  • existing file systems perform poorly on many common workloads
  • file systems are not RAID-aware

An ideal file system would thus focus on write performance, and try to make use of the sequential bandwidth of the disk. Further, it would perform well on common workloads that not only write out data but also update on-disk metadata structures frequently. Finally, it would work well on RAIDs as well as single disks.

43.1 Writing To Disk Sequentially

The basic idea: writing all updates (such as data blocks, inodes, etc.) to the disk sequentially

43.2 Writing Sequentially And Effectively

Writing sequentially is not enough; to achieve good performance, the file system must issue contiguous writes (or one large write) to the disk.

Write buffering: keep track of updates in memory and, once a sufficient number has accumulated, write them to disk all at once in one long sequential write; the unit written is called a segment.

43.3 How Much To Buffer?

$$ D=\cfrac{F}{1-F}\times R_\mathrm{peak}\times T_\mathrm{position} $$
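
Here $D$ is the amount of data to buffer before writing a segment, $F$ the desired fraction of the peak transfer rate $R_\mathrm{peak}$, and $T_\mathrm{position}$ the positioning (seek plus rotation) overhead paid per write. As an illustrative calculation (the specific numbers are assumptions), with $T_\mathrm{position}=10\,\mathrm{ms}$, $R_\mathrm{peak}=100\,\mathrm{MB/s}$, and a target of $F=0.9$ (90% of peak bandwidth):

$$ D=\cfrac{0.9}{1-0.9}\times 100\,\mathrm{MB/s}\times 0.01\,\mathrm{s}=9\,\mathrm{MB} $$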

43.4 Problem: Finding Inodes

In LFS, inodes are scattered throughout the log and their latest versions keep moving with every update, so they cannot be found at a fixed, well-known location the way FFS finds them.

43.5 Solution Through Indirection: The Inode Map

The inode map: takes an inode number as input and returns the disk address of the most recent version of the inode

LFS places chunks of the inode map right next to where it is writing all of the other new information.

Inode maps are compact enough to keep the active portions cached in main memory: inode map lookups rarely require disk accesses.

(From The Design and Implementation of a Log-Structured File System)
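
A minimal in-memory model (hypothetical; it ignores on-disk layout, segments, and the checkpoint region) of how the imap indirection lets inodes move on every write:

```python
class TinyLFS:
    """Toy model: every update is appended to an in-memory 'log'; the imap
    records where the newest version of each inode lives."""

    def __init__(self):
        self.log = []    # append-only log; the list index acts as a disk address
        self.imap = {}   # inode number -> log address of the newest inode version

    def append(self, block):
        self.log.append(block)
        return len(self.log) - 1              # "disk address" of the new block

    def write_file_block(self, inum, offset, data):
        data_addr = self.append(("data", data))
        # Build a new inode version: copy the old block map (if any) and
        # point `offset` at the freshly written data block.
        old_map = self.log[self.imap[inum]][2] if inum in self.imap else {}
        inode_addr = self.append(("inode", inum, {**old_map, offset: data_addr}))
        self.imap[inum] = inode_addr          # the old inode version is now garbage

    def read_inode(self, inum):
        return self.log[self.imap[inum]]      # one imap lookup, no fixed inode table

fs = TinyLFS()
fs.write_file_block(inum=42, offset=0, data=b"hello")
fs.write_file_block(inum=42, offset=1, data=b"world")
print(fs.read_inode(42))   # ('inode', 42, {0: 0, 1: 2})
```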

43.6 Completing The Solution: The Checkpoint Region

The checkpoint region (or CR) contains pointers to the latest pieces of the inode map; it is only updated periodically for performance

43.7 Reading A File From Disk: A Recap

The steps to read a file (assuming nothing is cached in memory yet):

  • read the checkpoint region, then read and cache the entire inode map it points to
  • when a file is accessed, look up its inode number in the imap to find the inode’s disk address, and read the inode
  • (the rest proceeds much as in a typical UNIX file system, following direct/indirect pointers)

43.8 What About Directories?

Steps:

  • look in the imap for the location of the directory’s inode, and read it
  • read the directory’s data to find the (name, inode number) entry for the file
  • look in the imap for the location of the file’s inode, and read it
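
Continuing the toy model above (still hypothetical), the same lookup expressed in code; note that the directory’s data stores an inode number, not a disk address:

```python
def lookup(fs, dir_inum, name):
    """Resolve `name` inside directory `dir_inum` via two imap lookups."""
    # 1. imap -> location of the directory's inode; read it.
    dir_inode = fs.read_inode(dir_inum)
    # 2. Read the directory's data; assume (for simplicity) that its first
    #    data block holds a {name: inode number} dictionary.
    _, entries = fs.log[dir_inode[2][0]]
    file_inum = entries[name]
    # 3. imap -> location of the file's inode; read it.
    return fs.read_inode(file_inum)

fs = TinyLFS()
fs.write_file_block(inum=7, offset=0, data=b"file contents")  # the file "foo"
fs.write_file_block(inum=2, offset=0, data={"foo": 7})        # the directory
print(lookup(fs, dir_inum=2, name="foo"))                     # ('inode', 7, {0: 0})
```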

The imap also solves the recursive update problem through indirection: because directories map names to inode numbers rather than disk addresses, moving a file’s inode only requires updating the imap entry, never the directories above it.

43.9 A New Problem: Garbage Collection

The cleaner preserves large contiguous free regions by reading in a number of partially-dead old segments, compacting their live blocks into a smaller number of new segments, and then freeing the old segments for reuse.

43.10 Determining Block Liveness

Use a segment summary block at the head of each segment to record, for every data block in the segment, its inode number and offset (which block of the file it is).

To determine whether a block is dead:

  • look in the segment summary block and find its inode number and offset
  • look in the imap to find and read the inode
  • look in the inode using the offset to compare the disk location
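
A sketch of this liveness check against the toy model above; the segment summary is represented as a plain dictionary from block address to (inode number, offset):

```python
def block_is_live(fs, segment_summary, block_addr):
    """Check whether the data block at `block_addr` is still in use."""
    # Step 1: the segment summary records (inode number, offset)
    # for every data block in the segment.
    inum, offset = segment_summary[block_addr]
    # Step 2: look in the imap to find and read the latest inode.
    inode = fs.read_inode(inum)
    # Step 3: does that inode still point at this address for that offset?
    return inode[2].get(offset) == block_addr

fs = TinyLFS()
fs.write_file_block(inum=9, offset=0, data=b"v1")   # data block lands at address 0
fs.write_file_block(inum=9, offset=0, data=b"v2")   # overwrite: address 0 is now dead
summary = {0: (9, 0), 2: (9, 0)}
print(block_is_live(fs, summary, 0), block_is_live(fs, summary, 2))  # False True
```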

LFS can also record a version number (in the imap and in the segment) so that blocks of deleted or truncated files can be identified as dead without these extra reads.

43.11 A Policy Question: Which Blocks To Clean, And When?

The time to clean can be:

  • periodically
  • during idle time
  • when the disk is full

43.12 Crash Recovery And The Log

Use two CRs, updated alternately, each bracketed by timestamps, so that recovery can detect a crash that occurred in the middle of a CR update and fall back to the other copy.

The last checkpoint may be quite old, as LFS writes the CR only every 30 seconds or so; roll forward: start from the last checkpointed segment, read through the subsequent segments at the end of the log, and apply any valid updates found there to bring the file system up to date.
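
A sketch of how recovery can choose between the two checkpoint regions, assuming (as a simplification) that each CR update writes a timestamp before and after its body, so a torn CR write is detectable and the newest intact copy wins:

```python
def choose_checkpoint_region(cr_a, cr_b):
    """Pick the newest CR whose bracketing timestamps match (i.e., whose
    last update was not interrupted by a crash)."""
    def intact(cr):
        return cr is not None and cr["ts_head"] == cr["ts_tail"]
    candidates = [cr for cr in (cr_a, cr_b) if intact(cr)]
    # The two CRs are updated alternately, so at least one should be intact.
    return max(candidates, key=lambda cr: cr["ts_head"])

old = {"ts_head": 100, "ts_tail": 100, "imap_ptrs": []}
torn = {"ts_head": 130, "ts_tail": 100, "imap_ptrs": []}   # crash mid-update
print(choose_checkpoint_region(old, torn) is old)          # True
```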

44 Flash-based SSDs

Solid-state storage: no mechanical or moving parts, random access, and retains information despite power loss

44.1 Storing a Single Bit

  • single-level cell (SLC) flash: a single bit per transistor
  • multi-level cell (MLC) flash: two bits per transistor (00, 01, 10, 11 represented by different levels of charge)
  • triple-level cell (TLC) flash: three bits per cell

SLC chips achieve higher performance and are more expensive.

44.2 From Bits to Banks/Planes

Flash chips are organized into banks or planes which consist of a large number of cells.

A bank is accessed in two different sized units:

  • block (or erase block): typically of size 128KB or 256KB
  • page: a few KB in size (e.g. 4KB)

44.3 Basic Flash Operations

Three low-level operations:

  • read (a page): random access; fast, 10s of $\mathrm{\mu s}$ or so
  • erase (a block): sets every bit in the block to 1; quite expensive, taking a few ms; any live data the client cares about in the block must be copied elsewhere before the erase
  • program (a page): After a block is erased, change some of the 1’s within a page to 0’s; usually taking around $100\mathrm{\mu s}$
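
A toy model (not a real device interface) that captures the rules these operations impose: reads are page-granular and cheap, a page can only be programmed after its block has been erased, and erasing always wipes a whole block:

```python
PAGES_PER_BLOCK = 4   # real blocks hold far more pages; kept small for illustration

class FlashBlock:
    """Toy erase block: each page is INVALID, ERASED, or VALID (programmed)."""

    def __init__(self):
        self.state = ["INVALID"] * PAGES_PER_BLOCK   # unknown initial contents
        self.data = [None] * PAGES_PER_BLOCK

    def erase(self):
        # Expensive (a few ms): sets the whole block to all 1s.
        self.state = ["ERASED"] * PAGES_PER_BLOCK
        self.data = [None] * PAGES_PER_BLOCK

    def program(self, page, data):
        # ~100us: allowed only once per page after an erase.
        if self.state[page] != "ERASED":
            raise RuntimeError("program requires an erased page")
        self.state[page] = "VALID"
        self.data[page] = data

    def read(self, page):
        # Fast (tens of microseconds), random access.
        return self.data[page]
```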

44.4 Flash Performance And Reliability

Wear out: each erase/program cycle accrues a little extra charge in the cells, making it increasingly difficult to differentiate a 0 from a 1; eventually the block becomes unusable.

The typical lifetime:

  • MLC: 10,000 P/E (Program/Erase) cycles
  • SLC: 100,000 P/E cycles
  • actual lifetimes observed in practice tend to be much longer than the rated figures

Disturbance: some bits get flipped in neighboring pages when accessing (read/program) a particular page

44.5 From Raw Flash to Flash-Based SSDs

Flash translation layer (or FTL): translates read and write requests on logical blocks into low-level read, erase and program commands on underlying physical blocks and physical pages

The FTL should accomplish this task with the goal of delivering excellent performance and high reliability.

To achieve performance:

  • exploit parallelism (e.g., use multiple flash chips in parallel)
  • reduce write amplification

Write amplification: total write traffic issued to the flash chips (by the FTL) divided by the total write traffic issued to the SSD (by the client)
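
Restated as a formula:

$$ \text{write amplification}=\cfrac{\text{total write traffic issued to the flash chips (by the FTL)}}{\text{total write traffic issued to the SSD (by the client)}} $$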

To achieve high reliability:

  • for wear out: wear leveling
  • for program disturbance: a sequential-programming approach

44.6 FTL Organization: A Bad Approach

Direct mapped organization: a read to a logical page is mapped directly to the physical page with the same page number

The problems with the direct-mapped FTL:

  • performance: each write needs to
    • read in the entire block (costly)
    • erase the entire block (quite costly)
    • program the block (costly)
  • reliability: the same block can be erased and programmed over and over (e.g., blocks holding file-system metadata are overwritten repeatedly), so it wears out quickly

The direct mapped approach simply gives too much control over wear out to the client workload; if the workload does not spread write load evenly across its logical blocks, the underlying physical blocks containing popular data will quickly wear out.

44.7 A Log-Structured FTL

Use a mapping table to map the logical block to the physical address
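
A minimal sketch of a page-mapped, log-structured FTL built on the toy FlashBlock above; free-space accounting and garbage collection are omitted, and all names are hypothetical:

```python
class LogStructuredFTL:
    """Writes always go to the next free page of the current log block;
    the mapping table records where each logical block last landed."""

    def __init__(self, num_blocks):
        self.blocks = [FlashBlock() for _ in range(num_blocks)]
        for b in self.blocks:
            b.erase()                        # start with erased blocks
        self.mapping = {}                    # logical block -> (block, page)
        self.cur_block, self.cur_page = 0, 0

    def write(self, logical, data):
        if self.cur_page == PAGES_PER_BLOCK:     # current log block is full
            self.cur_block += 1                  # claim the next erased block
            self.cur_page = 0                    # (a real FTL would GC here)
        self.blocks[self.cur_block].program(self.cur_page, data)
        # Any previous physical location of `logical` becomes garbage
        # that the cleaner must eventually reclaim.
        self.mapping[logical] = (self.cur_block, self.cur_page)
        self.cur_page += 1

    def read(self, logical):
        block, page = self.mapping[logical]
        return self.blocks[block].read(page)

ftl = LogStructuredFTL(num_blocks=8)
ftl.write(100, b"a1")
ftl.write(100, b"a2")        # overwrite goes to a new page; the old page is garbage
print(ftl.read(100))         # b'a2'
```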

The log-based approach improves performance because:

  • expensive erases are required only once in a while (when claiming a fresh block of log space), not on every write
  • writes are spread evenly across pages and blocks, which also provides wear leveling

Downsides:

  • garbage collection of old, dead versions drives up write amplification
  • in-memory mapping tables can be large (one entry per page)

44.8 Garbage Collection

To reduce GC costs, SSDs overprovision: extra flash capacity is added so that cleaning can be delayed and pushed to the background, performed when the device is less busy.

44.9 Mapping Table Size

Block-Based Mapping

Keep one mapping entry per (much larger) block instead of per page, shrinking the table; the downside is that a small write forces the FTL to read the live pages of the old block and copy them into a newly written block.

Hybrid Mapping

Two types of mapping table:

  • log table (per-page mappings)
  • data table (per-block mappings)

To read:

  • first find in the log table
  • if not found, find in the data table
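
A sketch of that read path (the table contents are hypothetical; PAGES_PER_BLOCK reuses the constant from the toy flash model above):

```python
def hybrid_read(log_table, data_table, logical_page):
    """Hybrid FTL lookup: exact per-page mappings first, then per-block."""
    if logical_page in log_table:           # recently written pages
        return log_table[logical_page]      # -> exact physical page number
    # Fall back to the data table: map the containing logical block to a
    # physical block and keep the page offset within that block.
    logical_block, offset = divmod(logical_page, PAGES_PER_BLOCK)
    return data_table[logical_block] * PAGES_PER_BLOCK + offset

log_table = {2001: 55}                      # logical page -> physical page
data_table = {500: 12}                      # logical block -> physical block
print(hybrid_read(log_table, data_table, 2001))  # 55 (hit in the log table)
print(hybrid_read(log_table, data_table, 2002))  # 50 = 12 * 4 + 2 (via the data table)
```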

The key to the hybrid mapping strategy is keeping the number of log blocks small. To do that, the FTL has to periodically examine log blocks and switch them into blocks that can be pointed to by only a single block pointer.

Three types of merge:

  • switch merge: the log block ends up containing all the pages of one logical block, in order, so its per-page pointers can be replaced by a single block pointer (the cheapest case)
  • partial merge: the remaining live pages are copied from the old data block into the log block, which is then switched in; the extra I/O increases write amplification
  • full merge: the pages of a logical block are scattered across several log blocks, so the FTL must gather them together into a fresh data block (the most expensive case)

Page Mapping Plus Caching

Keep the full page-level mapping but cache only its active parts in memory; if the workload has locality, this achieves page-mapped performance with a small memory footprint, at the cost of extra flash reads (and possible evictions of dirty entries) on a miss.

44.10 Wear Leveling

One problem with the log-structured approach: blocks filled with long-lived, cold data never get overwritten, so they never get erased while other blocks absorb all the wear; solution: periodically read such blocks and rewrite their contents elsewhere so that all blocks age evenly (at the cost of some extra write amplification).