Operating Systems: Three Easy Pieces Study Notes (2)
42 Crash Consistency: FSCK and Journaling⌗
One major challenge faced by a file system is how to update persistent data structures despite the presence of a power loss or system crash.
Crash-consistency problem: the on-disk state may be only partially updated when a system crash or power loss occurs
42.1 A Detailed Example⌗
Crash Scenarios⌗
Appending to a file involves three writes: the data block itself (with the new data), the data bitmap, and the file’s inode. If a crash occurs after only some of these writes reach disk, the possible outcomes are (see the sketch after this list):
- only the data block is written: no problem at all (from the perspective of file-system crash consistency)
- only the updated inode is written: garbage data will be read; file-system inconsistency: the allocated status of the data block is different in the bitmap and the inode
- only the bitmap is written: space leak
- only the inode and the bitmap are written: garbage data
- only the inode and the data block: inconsistency between the inode and the bitmap
- only the bitmap and the data block: space leak again
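A toy sketch (not from the book) that enumerates which subset of the three writes survives a crash and classifies the outcome; all structs and names below are made up for illustration:

```c
#include <stdio.h>
#include <stdbool.h>

/* Toy model: the three on-disk writes involved in appending to a file.
 * Any subset of them may have reached disk when the crash hits. */
struct survived {
    bool data;    /* the new data block      */
    bool inode;   /* the updated inode       */
    bool bitmap;  /* the updated data bitmap */
};

/* Classify the post-crash state, mirroring the six cases above. */
static const char *outcome(struct survived s) {
    if (s.data && !s.inode && !s.bitmap)  return "data only: no consistency problem (the data is simply lost)";
    if (!s.data && s.inode && !s.bitmap)  return "inode only: reads return garbage; inode/bitmap disagree";
    if (!s.data && !s.inode && s.bitmap)  return "bitmap only: space leak (block marked used, never referenced)";
    if (!s.data && s.inode && s.bitmap)   return "inode+bitmap: consistent metadata, but the block holds garbage";
    if (s.data && s.inode && !s.bitmap)   return "inode+data: inode/bitmap inconsistency";
    if (s.data && !s.inode && s.bitmap)   return "bitmap+data: space leak (no inode points to the block)";
    return "all or nothing: the update is trivially consistent";
}

int main(void) {
    /* Enumerate all 2^3 subsets of surviving writes. */
    for (int m = 0; m < 8; m++) {
        struct survived s = { m & 1, (m >> 1) & 1, (m >> 2) & 1 };
        printf("data=%d inode=%d bitmap=%d -> %s\n",
               s.data, s.inode, s.bitmap, outcome(s));
    }
    return 0;
}
```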
The Crash Consistency Problem⌗
Crash-consistency problem (or consistent-update problem): the disk commits only one write at a time, so a multi-block update cannot be applied atomically
42.2 Solution #1: The File System Checker⌗
Early systems’ approach: fix inconsistencies later (when rebooting)
fsck’s phases:
- check superblock
- scans inodes, indirect blocks, etc. to learn which blocks are allocated, and makes the bitmaps consistent with that knowledge (the inodes are trusted)
- check inodes for corruption and other problems
- verify link counts (see the sketch after this list)
- check for duplicate pointers (two different inodes refer to the same block) (clear some inodes or make copies of the duplicate data)
- check for bad block pointers (and remove them)
- check directories
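A heavily simplified, in-memory sketch of the link-count phase: count directory references per inode and compare with the stored link counts. A real fsck works on the raw on-disk image; all structures below are invented:

```c
#include <stdio.h>

#define NINODES 8

struct inode { int link_count; };          /* on-disk link count */
struct dir_entry { int inum; };            /* a directory entry  */

int main(void) {
    struct inode inodes[NINODES] = {
        [2] = { .link_count = 2 },         /* correct                 */
        [3] = { .link_count = 1 },         /* too low: two refs exist */
        [4] = { .link_count = 1 },         /* orphan: no refs at all  */
    };
    struct dir_entry entries[] = { {2}, {2}, {3}, {3} };  /* the whole tree */
    int nentries = sizeof(entries) / sizeof(entries[0]);

    int counted[NINODES] = { 0 };
    for (int i = 0; i < nentries; i++)     /* scan every directory entry */
        counted[entries[i].inum]++;

    for (int i = 0; i < NINODES; i++) {
        if (inodes[i].link_count == 0 && counted[i] == 0)
            continue;                       /* unused inode */
        if (inodes[i].link_count != counted[i]) {
            printf("inode %d: stored count %d, counted %d -> fixing\n",
                   i, inodes[i].link_count, counted[i]);
            inodes[i].link_count = counted[i];
            /* A real fsck would move unreferenced inodes to lost+found. */
        }
    }
    return 0;
}
```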
Fundamental problem of fsck: it is too slow.
At a higher level, the basic premise of fsck seems just a tad irrational. Consider our example above, where just three blocks are written to the disk; it is incredibly expensive to scan the entire disk to fix problems that occurred during an update of just three blocks.
42.3 Solution #2: Journaling (or Write-Ahead Logging)⌗
Write-ahead logging (or journaling): write logs before overwriting the structures to be able to recover from errors
Data Journaling⌗
Journal write: the transaction written to the journal consists of
- a transaction begin block (TxB)
- blocks containing the exact contents of the blocks to be updated (physical logging), or a more compact representation of them (logical logging)
- a transaction end block (TxE)
Checkpointing: overwrite the old structures at their final locations in the file system
Problem if a crash occurs during the write to the journal: if all the log blocks are issued at once, the disk may internally schedule and complete them out of order, so the journal can end up with, e.g., an invalid data block sitting between a valid TxB and a valid TxE.
Solution: write all blocks except TxE at once, and issue the TxE write only after they complete; the disk guarantees that the single-sector (512-byte) TxE write is atomic.
Performance problem: the transaction-end block can only be issued after all the previous writes complete, adding an extra wait per transaction.
Optimizing log writes: include a checksum of the transaction’s contents in the TxB and TxE blocks so the whole transaction can be written at once; during recovery, a mismatched checksum means the transaction is discarded.
Three phases (see the sketch after this list):
- journal write (TxB, metadata and data)
- journal commit (TxE)
- checkpoint (write the contents to final on-disk locations)
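A minimal sketch of this ordering with the disk I/O stubbed out as prints; the helpers (journal_write, checkpoint_write, disk_flush) and the journal layout are invented for illustration and are not any real file system’s API:

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical I/O helpers, stubbed out so the sketch runs; a real
 * journaling FS would issue actual disk writes and write barriers here. */
static void journal_write(uint64_t off, const char *what) {
    printf("journal write    @%llu: %s\n", (unsigned long long)off, what);
}
static void checkpoint_write(uint64_t addr, const char *what) {
    printf("checkpoint write @%llu: %s\n", (unsigned long long)addr, what);
}
static void disk_flush(void) {
    printf("-- flush: wait for previous writes to become durable --\n");
}

int main(void) {
    uint64_t joff = 1000;                  /* next free slot in the journal */
    const char *blocks[] = { "inode", "bitmap", "data block" };
    uint64_t    final[]  = { 12, 40, 521 };   /* made-up final locations    */
    int n = 3;

    /* 1. Journal write: TxB plus the logged block contents. */
    journal_write(joff, "TxB (transaction id, addresses of the 3 blocks)");
    for (int i = 0; i < n; i++)
        journal_write(joff + 1 + i, blocks[i]);
    disk_flush();                          /* body must be durable before TxE */

    /* 2. Journal commit: a single 512-byte TxE write, which the disk
     *    guarantees to be atomic, marks the transaction committed.      */
    journal_write(joff + 1 + n, "TxE (transaction id)");
    disk_flush();

    /* 3. Checkpoint: only now overwrite the final on-disk structures.   */
    for (int i = 0; i < n; i++)
        checkpoint_write(final[i], blocks[i]);
    disk_flush();
    return 0;
}
```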
Recovery⌗
Replay the transactions that are committed on disk (redo logging)
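A toy sketch of the replay loop under the same made-up journal representation as above: committed transactions are replayed in order, and a transaction without a valid TxE is skipped:

```c
#include <stdio.h>
#include <stdbool.h>

/* Toy redo-logging recovery: scan the journal; fully committed
 * transactions (TxE present) are replayed to their final locations,
 * anything after the last committed transaction is discarded. */
struct jtxn {
    int  tid;
    bool has_txe;        /* did the TxE block make it to disk?       */
    int  nblocks;
    int  final_addr[4];  /* where each logged block ultimately lives */
};

static void replay_block(int tid, int addr) {
    printf("replay txn %d: rewrite block at final address %d\n", tid, addr);
}

int main(void) {
    struct jtxn journal[] = {
        { .tid = 7, .has_txe = true,  .nblocks = 3, .final_addr = {12, 40, 521} },
        { .tid = 8, .has_txe = true,  .nblocks = 1, .final_addr = {40} },
        { .tid = 9, .has_txe = false, .nblocks = 2, .final_addr = {13, 600} },
    };
    int n = sizeof(journal) / sizeof(journal[0]);

    for (int i = 0; i < n; i++) {
        if (!journal[i].has_txe) {
            printf("txn %d not committed: skip it (and stop)\n", journal[i].tid);
            break;   /* crash happened mid-journal-write; later data is junk */
        }
        for (int b = 0; b < journal[i].nblocks; b++)
            replay_block(journal[i].tid, journal[i].final_addr[b]);
    }
    return 0;
}
```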
Batching Log Updates⌗
Buffer updates in memory and commit them to the journal as one global transaction, to avoid writing the same blocks (e.g., a heavily used directory’s inode) to the journal over and over.
Making The Log Finite⌗
Journaling file systems treat the log as a circular data structure (the journal is referred to as circular log).
Some time after a checkpoint completes, mark the corresponding transaction free in the journal by updating the journal superblock (which records which transactions are still outstanding).
Now the four phases:
- journal write
- journal commit
- checkpoint
- free
Metadata Journaling⌗
Ordered journaling (or metadata journaling): the user data is not written to the journal
One possible ordering (to avoid the inode ending up pointing at garbage data):
- data write (1)
- journal metadata write (2)
- journal commit
- checkpoint metadata
- free
(1) and (2) can be issued concurrently; both must complete before the journal commit
43 Log-structured File Systems⌗
The motivation:
- system memories are growing, so more data can be cached and disk traffic increasingly consists of writes
- there is a large gap between random and sequential I/O performance
- existing file systems perform poorly on many common workloads
- file systems are not RAID-aware
An ideal file system would thus focus on write performance, and try to make use of the sequential bandwidth of the disk. Further, it would perform well on common workloads that not only write out data but also update on-disk metadata structures frequently. Finally, it would work well on RAIDs as well as single disks.
43.1 Writing To Disk Sequentially⌗
The basic idea: writing all updates (such as data blocks, inodes, etc.) to the disk sequentially
43.2 Writing Sequentially And Effectively⌗
Writing sequentially is not enough; to achieve good performance we must issue contiguous writes (or one large write)
Write buffering: LFS keeps track of updates in memory and, once it has received a sufficient number, writes them to disk all at once (the buffered chunk is called a segment)
43.3 How Much To Buffer?⌗
$$ D=\cfrac{F}{1-F}\times R_\mathrm{peak}\times T_\mathrm{position} $$
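Here $D$ is the amount of data to buffer before each write, $T_\mathrm{position}$ the positioning (seek plus rotation) overhead paid per write, $R_\mathrm{peak}$ the peak sequential transfer rate, and $F$ the fraction of peak bandwidth we want to achieve effectively. A worked example with assumed numbers, $T_\mathrm{position}=10\,\mathrm{ms}$, $R_\mathrm{peak}=100\,\mathrm{MB/s}$, $F=0.9$:

$$ D=\cfrac{0.9}{1-0.9}\times 100\,\mathrm{MB/s}\times 0.01\,\mathrm{s}=9\,\mathrm{MB} $$

i.e. buffering about 9 MB per segment write keeps the effective bandwidth at $0.9\times R_\mathrm{peak}=90\,\mathrm{MB/s}$.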
43.4 Problem: Finding Inodes⌗
43.5 Solution Through Indirection: The Inode Map⌗
The inode map: takes an inode number as input and returns the disk address of the most recent version of the inode
LFS places chunks of the inode map right next to where it is writing all of the other new information.
Inode maps are compact enough to keep the active portions cached in main memory: inode map lookups rarely require disk accesses.
(From The Design and Implementation of a Log-Structured File System)
43.6 Completing The Solution: The Checkpoint Region⌗
The checkpoint region (or CR) contains pointers to the latest pieces of the inode map; it is only updated periodically for performance
43.7 Reading A File From Disk: A Recap⌗
The steps of reading a file (starting with nothing in memory; see the sketch after this list):
- read the CR and cache it in memory
- lookup the CR to find the address of a given inode
- (the rest is pretty much the same as a typical FS)
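A toy sketch of this lookup chain on a pretend disk (just an array of ints); the layout is invented, the point is the CR → imap → inode → data indirection:

```c
#include <stdio.h>

#define DISK_SIZE 64

int disk[DISK_SIZE];          /* pretend every block holds one int */

int read_block(int addr) { return disk[addr]; }

int main(void) {
    /* Set up a tiny "file system": one file with inode #3, data 42.
     * Invented layout: the CR at block 0 holds the address of the imap
     * chunk; the imap chunk holds the inode address; the inode holds
     * the address of its (single) data block.                         */
    disk[50] = 42;            /* the data block                        */
    disk[30] = 50;            /* inode 3: points at the data block     */
    disk[20] = 30;            /* imap chunk: imap[3] = 30              */
    disk[0]  = 20;            /* checkpoint region: address of imap    */

    int imap_addr  = read_block(0);          /* 1. read the CR             */
    int inode_addr = read_block(imap_addr);  /* 2. imap lookup for inode 3 */
    int data_addr  = read_block(inode_addr); /* 3. read the inode          */
    int data       = read_block(data_addr);  /* 4. read the data block     */

    printf("inode 3 -> imap@%d -> inode@%d -> data@%d = %d\n",
           imap_addr, inode_addr, data_addr, data);
    return 0;
}
```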
43.8 What About Directories?⌗
Steps:
- look in the imap for the location of the inode of the directory
- find inode number of the file in the inode of the directory
- look in the imap for the location of the inode of the file
The imap also solves the recursive update problem through indirection.
43.9 A New Problem: Garbage Collection⌗
The cleaner preserves large contiguous free regions by reading in a number of old, partially used segments, compacting their live blocks into new segments, and freeing the old ones
43.10 Determining Block Liveness⌗
Use a segment summary block at the head of each segment to record, for each data block, its inode number and offset (which block of the file it is)
To determine whether a block is dead (see the sketch after this list):
- look in the segment summary block and find its inode number and offset
- look in the imap to find and read the inode
- use the offset to look in the inode (or indirect block): if it still points to this disk address, the block is live; otherwise it is dead
LFS can also use version numbers (kept in the imap and recorded in the segment) to shortcut this check, avoiding the extra reads.
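A toy sketch of the liveness check described above; the on-disk layout and structures are invented:

```c
#include <stdio.h>
#include <stdbool.h>

#define NFILES  8
#define NBLOCKS 16

struct summary_entry { int inum; int offset; };  /* who owns this block?   */

struct inode { int ptrs[NBLOCKS]; };             /* direct pointers only   */

struct inode inodes[NFILES];                     /* pretend "disk" inodes  */
int imap[NFILES];                                /* inum -> index of inode */

/* Segment summary -> imap -> inode -> compare pointers. */
bool block_is_live(int A, struct summary_entry se) {
    struct inode *ino = &inodes[imap[se.inum]];  /* imap lookup + inode read  */
    return ino->ptrs[se.offset] == A;            /* does it still point here? */
}

int main(void) {
    imap[3] = 3;
    inodes[3].ptrs[0] = 100;   /* file 3, block 0 now lives at address 100 */

    struct summary_entry se = { .inum = 3, .offset = 0 };
    printf("block @100 live? %d\n", block_is_live(100, se));  /* 1: live           */
    printf("block @ 60 live? %d\n", block_is_live(60,  se));  /* 0: dead, old copy */
    return 0;
}
```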
43.11 A Policy Question: Which Blocks To Clean, And When?⌗
Cleaning can happen:
- periodically
- during idle time
- when the disk is full
43.12 Crash Recovery And The Log⌗
Use two CRs, each written with begin/end timestamps, to recover from a crash during a CR update: recovery uses the most recent CR whose timestamps are consistent
The last consistent snapshot of the file system may be quite old, as LFS only writes a CR every 30 seconds or so; roll forward: starting from the end of the log recorded in the CR, read through the following segments to see if there are any valid updates, and update the file system accordingly.
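A toy sketch of the two-CR trick: each CR write is bracketed by timestamps, a crash mid-update leaves them mismatched, and recovery picks the newest consistent CR (struct and values are invented):

```c
#include <stdio.h>
#include <stdint.h>

struct cr {
    uint64_t ts_begin;    /* written first        */
    uint64_t imap_addrs;  /* ...body of the CR... */
    uint64_t ts_end;      /* written last         */
};

static int consistent(const struct cr *c) { return c->ts_begin == c->ts_end; }

int main(void) {
    /* CR A was fully written at time 100; CR B's update (time 107)
     * was interrupted by a crash, so its end timestamp is stale.   */
    struct cr a = { .ts_begin = 100, .imap_addrs = 20, .ts_end = 100 };
    struct cr b = { .ts_begin = 107, .imap_addrs = 24, .ts_end = 93  };

    const struct cr *use = NULL;
    if (consistent(&a)) use = &a;
    if (consistent(&b) && (!use || b.ts_begin > use->ts_begin)) use = &b;

    if (!use) { printf("no consistent CR found\n"); return 1; }
    printf("recover from CR with timestamp %llu (imap chunk @%llu)\n",
           (unsigned long long)use->ts_begin,
           (unsigned long long)use->imap_addrs);
    return 0;
}
```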
44 Flash-based SSDs⌗
Solid-state storage: built out of transistors with no moving parts; allows random access and retains data even when power is lost
44.1 Storing a Single Bit⌗
- single-level cell (SLC) flash: a single bit per transistor
- multi-level cell (MLC) flash: two bits per transistor (00, 01, 10, 11 represented by different levels of charge)
- triple-level cell (TLC) flash: three bits per cell
SLC chips achieve higher performance and are more expensive.
44.2 From Bits to Banks/Planes⌗
Flash chips are organized into banks or planes which consist of a large number of cells.
A bank is accessed in two different sized units:
- block (or erase block): typically of size 128KB or 256KB
- page: a few KB in size (e.g. 4KB)
44.3 Basic Flash Operations⌗
Three low-level operations:
- read (a page): random access; fast, 10s of $\mathrm{\mu s}$ or so
- erase (a block): sets every bit in the block to 1; quite expensive, takes a few ms; any data in the block that is still needed must be copied elsewhere first
- program (a page): after a block has been erased, change some of the 1’s within a page to 0’s; usually takes around $100\mathrm{\mu s}$
44.4 Flash Performance And Reliability⌗
Wear out: each erase/program cycle accrues a little extra charge in the cell; over time it becomes increasingly difficult to differentiate a 0 from a 1.
The typical lifetime:
- MLC: 10,000 P/E (Program/Erase) cycles
- SLC: 100,000 P/E cycles
- actual lifetimes are much longer
Disturbance: some bits get flipped in neighboring pages when accessing (read/program) a particular page
44.5 From Raw Flash to Flash-Based SSDs⌗
Flash translation layer (or FTL): translates read and write requests on logical blocks into low-level read, erase and program commands on underlying physical blocks and physical pages
The FTL should accomplish this task with the goal of delivering excellent performance and high reliability.
To achieve performance:
- use multiple flash chips in parallel
- reduce write amplification
Write amplification: total write traffic issued to the flash chips (by the FTL) divided by the total write traffic issued to the SSD (by the client)
To achieve high reliability:
- for wear out: wear leveling
- for program disturbance: a sequential-programming approach
44.6 FTL Organization: A Bad Approach⌗
Direct mapped organization: a read to a logical page is mapped directly to the physical page with the same page number
The problems with the direct-mapped FTL:
- performance: each write needs to
  - read in the entire block (costly)
  - erase the entire block (quite costly)
  - program the block (costly)
- reliability: the same block can be erased and programmed over and over (e.g., file-system metadata that is overwritten repeatedly)
The direct mapped approach simply gives too much control over wear out to the client workload; if the workload does not spread write load evenly across its logical blocks, the underlying physical blocks containing popular data will quickly wear out.
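As a rough illustration of the performance cost (sizes assumed from 44.2: 4KB pages, 256KB blocks): overwriting a single 4KB page forces the direct-mapped FTL to read the block, erase it, and reprogram all of its pages, so

$$ \text{write amplification}=\cfrac{\text{data programmed to flash}}{\text{data written by the client}}=\cfrac{256\,\mathrm{KB}}{4\,\mathrm{KB}}=64 $$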
44.7 A Log-Structured FTL⌗
Use a mapping table to map each logical block to its current physical page; writes are appended, log-style, to the next free page (see the sketch after the lists below)
The log-based approach improves things in two ways:
- erases are only needed once in a while (when a fresh block must be opened), not on every write
- writes are spread across pages, which naturally provides wear leveling
Downsides:
- garbage collection may drive up write amplification
- the in-memory mapping table might be large
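A toy sketch of the log-structured FTL idea: a mapping table from logical block to physical page, with every write appended to the next free page (sizes and the in-memory "flash" are invented):

```c
#include <stdio.h>

#define NLOGICAL  16
#define NPHYSICAL 32

int flash[NPHYSICAL];            /* pretend each page stores one int   */
int map[NLOGICAL];               /* logical block -> physical page     */
int next_free = 0;               /* head of the log                    */

void ftl_write(int lba, int value) {
    int page = next_free++;      /* append at the log head (program a page) */
    flash[page] = value;
    map[lba] = page;             /* remember where the latest copy lives;   */
                                 /* the old page (if any) is now garbage    */
}

int ftl_read(int lba) {
    return flash[map[lba]];      /* one table lookup, then a page read */
}

int main(void) {
    for (int i = 0; i < NLOGICAL; i++) map[i] = -1;

    ftl_write(5, 111);   /* first write of logical block 5               */
    ftl_write(5, 222);   /* overwrite: the new copy goes to a new page,  */
                         /* the old page becomes garbage for the cleaner */

    printf("logical 5 -> physical page %d, value %d\n", map[5], ftl_read(5));
    return 0;
}
```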
44.8 Garbage Collection⌗
To reduce GC costs, SSDs overprovision: extra flash capacity is added so that cleaning can be delayed and performed in the background, ideally during idle time.
44.9 Mapping Table Size⌗
Block-Based Mapping⌗
Hybrid Mapping⌗
Two types of mapping table:
- log table (per-page mappings)
- data table (per-block mappings)
To read (see the sketch after this list):
- first look up the page in the log table
- if not found there, fall back to the data table
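A toy sketch of the hybrid read path: check the per-page log table first, then fall back to the per-block data table (sizes and structures are invented):

```c
#include <stdio.h>

#define PAGES_PER_BLOCK 4
#define NLOG            8      /* capacity of the per-page log table     */
#define NBLOCKS         16

struct log_entry { int lpn; int ppn; };   /* logical page -> physical page */

struct log_entry log_table[NLOG];
int nlog = 0;
int data_table[NBLOCKS];                  /* logical block -> physical block */

int translate(int lpn) {
    for (int i = 0; i < nlog; i++)        /* 1. recent, page-mapped writes   */
        if (log_table[i].lpn == lpn)
            return log_table[i].ppn;
    int lblock = lpn / PAGES_PER_BLOCK;   /* 2. fall back to block mapping   */
    return data_table[lblock] * PAGES_PER_BLOCK + lpn % PAGES_PER_BLOCK;
}

int main(void) {
    data_table[0] = 5;                    /* logical block 0 lives in physical block 5 */
    log_table[nlog++] = (struct log_entry){ .lpn = 2, .ppn = 33 };  /* page 2 rewritten */

    printf("lpn 1 -> ppn %d (via data table)\n", translate(1));  /* 5*4+1 = 21 */
    printf("lpn 2 -> ppn %d (via log table)\n",  translate(2));  /* 33         */
    return 0;
}
```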
The key to the hybrid mapping strategy is keeping the number of log blocks small. To do that, the FTL has to periodically examine log blocks and switch them into blocks that can be pointed to by only a single block pointer.
Three types of merge:
- switch merge: the log block contains exactly the pages of one data block, in order, so the per-page pointers can be replaced by a single block pointer (the cheapest case)
- partial merge: some live pages must first be copied from the old data block into the log block before the pointers can be switched
- full merge: the live pages of a block are scattered across several blocks and must all be gathered into one place first (the most expensive case)
Page Mapping Plus Caching⌗
44.10 Wear Leveling⌗
One problem of the log-structured approach: blocks filled with long-lived, cold data never get overwritten and thus never receive their share of erase/program cycles; solution: periodically read such cold data and rewrite it elsewhere.