Operating Systems: Three Easy Pieces, Study Notes (Part 1)
36 I/O Devices⌗
36.1 System Architecture⌗
Devices are connected to the system via:
- memory bus: connects the CPU to main memory
- general I/O bus: PCI (graphics and other high-performance devices)
- peripheral bus: SCSI, SATA, USB (slow devices)
In modern systems, the interconnect is more specialized:
- PCIe (PCI Express) for graphics
- a dedicated memory interconnect between the CPU and main memory
- DMI (Direct Media Interface) connecting the CPU to an I/O chip
- PCIe and eSATA hanging off the I/O chip for network and storage devices
36.2 A Canonical Device⌗
A device has two important components:
- interface, typically presented as registers:
  - status
  - command
  - data
- internal structure
36.3 The Canonical Protocol⌗
Programmed I/O (PIO): the main CPU is involved with the data movement
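A minimal sketch of the canonical four-step protocol, assuming hypothetical register accessors and a BUSY bit (names are illustrative, not from any real device):

```c
// Hypothetical register accessors; a real driver would use port or
// memory-mapped I/O (see 36.6).
#define BUSY 0x1

extern unsigned int read_status(void);
extern void write_data(const void *buf, int len);
extern void write_command(unsigned int cmd);

void canonical_write(const void *buf, int len, unsigned int cmd) {
    while (read_status() & BUSY)   // 1. poll until the device is not busy
        ;
    write_data(buf, len);          // 2. PIO: the CPU itself moves the data
    write_command(cmd);            // 3. start the device
    while (read_status() & BUSY)   // 4. poll until the request completes
        ;
}
```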
36.4 Lowering CPU Overhead With Interrupts⌗
Ways to better utilize the CPU instead of spin-polling:
- interrupts: put the calling process to sleep and context-switch; the device raises a hardware interrupt on completion
- a two-phased (hybrid) approach: poll first for a while, then use interrupts if the request is not yet finished
- interrupt coalescing: the device batches multiple completions into a single interrupt, trading delivery latency for less interrupt overhead
36.5 More Efficient Data Movement With DMA⌗
Direct Memory Access (DMA): transfer data between devices and main memory without much CPU intervention
36.6 Methods Of Device Interaction⌗
- explicit I/O instructions (on x86, the in and out instructions), usually privileged
- memory-mapped I/O (no new instructions are needed)
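A minimal sketch of the memory-mapped style, assuming a hypothetical device whose registers appear at fixed physical addresses (the addresses and the busy bit are made up for illustration):

```c
// With memory-mapped I/O the hardware maps device registers into the
// address space; ordinary loads and stores are routed to the device.
#define DEV_STATUS ((volatile unsigned int *) 0xFEDC0000) // hypothetical
#define DEV_CMD    ((volatile unsigned int *) 0xFEDC0004) // hypothetical
#define DEV_BUSY   0x1

void dev_issue(unsigned int cmd) {
    while (*DEV_STATUS & DEV_BUSY)  // a regular load reads the status register
        ;
    *DEV_CMD = cmd;                 // a regular store issues the command
}
```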
36.7 Fitting Into The OS: The Device Driver⌗
Read/write blocks:
- file system
- raw interface (support low-level storage management)
36.8 Case Study: A Simple IDE Disk Driver⌗
Registers are made available by reading or writing to specific “I/O addresses” using (on x86) the in and out I/O instructions.
- wait for drive to be ready
- write parameters to command registers
- start the I/O: issuing read/write to command register
- data transfer (for writes): write data to data port
- handle interrupts
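The chapter illustrates these steps with xv6's IDE driver; below is a condensed sketch modeled on xv6's ide.c (inb/outb/outsl are x86 port-I/O helpers, struct buf is xv6's buffer type, and 0x1f0-0x1f7 are the standard IDE ports):

```c
#define IDE_BSY       0x80
#define IDE_DRDY      0x40
#define IDE_CMD_READ  0x20
#define IDE_CMD_WRITE 0x30

// wait for drive to be ready
static int ide_wait_ready(void) {
    while ((inb(0x1f7) & (IDE_BSY | IDE_DRDY)) != IDE_DRDY)
        ;  // spin until BUSY clears and READY is set
    return 0;
}

// write parameters and start one request for buffer b
static void ide_start_request(struct buf *b) {
    ide_wait_ready();
    outb(0x3f6, 0);                        // generate interrupt on completion
    outb(0x1f2, 1);                        // sector count: 1
    outb(0x1f3, b->sector & 0xff);         // LBA bits 0-7
    outb(0x1f4, (b->sector >> 8) & 0xff);  // LBA bits 8-15
    outb(0x1f5, (b->sector >> 16) & 0xff); // LBA bits 16-23
    outb(0x1f6, 0xe0 | ((b->dev & 1) << 4) | ((b->sector >> 24) & 0x0f));
    if (b->flags & B_DIRTY) {
        outb(0x1f7, IDE_CMD_WRITE);        // start the I/O: a write
        outsl(0x1f0, b->data, 512 / 4);    // data transfer: copy to data port
    } else {
        outb(0x1f7, IDE_CMD_READ);         // start the I/O: a read
    }                                      // the interrupt handler finishes up
}
```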
37 Hard Disk Drives⌗
37.1 The Interface⌗
The drive consists of a large number of sectors (512-byte blocks), each of which can be read or written.
View the disk as an array of sectors.
- multi-sector operations are possible
- only a single 512-byte write is guaranteed to be atomic
Efficiency:
- accessing blocks close to one another is faster than accessing blocks far apart
- sequential reads or writes are faster than random access
37.2 Basic Geometry⌗
- platter: a circular hard surface on which data is stored magnetically
- surface: each platter has two of them (its two sides)
- spindle: the shaft to which the platters are bound, spun by a motor
- track: a concentric circle of sectors
- disk head: performs the reading and writing, one per surface
- disk arm: moves the head across the surface to the desired track
37.3 A Simple Disk Drive⌗
I/O time components:
- rotational delay
- seek time, which itself has several phases:
  - acceleration
  - coasting
  - deceleration
  - settling (significant)
- transfer
Track skew: sectors on adjacent tracks are offset so that sequential reads can still be serviced, without waiting an extra full rotation, when crossing a track boundary
Multi-zoned disk drives:
- a zone is a consecutive set of tracks on a surface
- within a zone, each track has the same number of sectors
- outer zones have more sectors per track
Cache (track buffer):
- read: the drive may read in all of the sectors on a track and cache them in its memory
- write, two ways to acknowledge:
  - write back: acknowledge as soon as the data is in the cache (faster, but risky)
  - write through: acknowledge only after the write has actually reached the disk
37.4 I/O Time: Doing The Math⌗
I/O time, and the resulting rate of I/O:
$$ T_\mathrm{I/O} = T_\mathrm{seek}+T_\mathrm{rotation}+T_\mathrm{transfer} $$
$$ R_\mathrm{I/O}=\cfrac{Size_\mathrm{transfer}}{T_\mathrm{I/O}} $$
- a huge gap in drive performance between random and sequential workloads
- a large difference in performance between high-end “performance” drives and low-end “capacity” drives
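A quick worked example with assumed numbers (roughly a high-end 15,000 RPM drive: average seek 4 ms, peak transfer 125 MB/s), for a random 4 KB read:

$$ T_\mathrm{rotation} \approx \frac{1}{2}\cdot\frac{60}{15000}\,\mathrm{s} = 2\,\mathrm{ms},\qquad T_\mathrm{transfer} = \frac{4\,\mathrm{KB}}{125\,\mathrm{MB/s}} \approx 30\,\mathrm{\mu s} $$

$$ T_\mathrm{I/O} \approx 4+2+0.03 \approx 6\,\mathrm{ms},\qquad R_\mathrm{I/O} = \frac{4\,\mathrm{KB}}{6\,\mathrm{ms}} \approx 0.66\,\mathrm{MB/s} $$

For a long sequential read the transfer term dominates and $R_\mathrm{I/O}$ approaches the 125 MB/s peak, which is the huge random-vs-sequential gap the first bullet refers to.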
37.5 Disk Scheduling⌗
Principle: shortest job first, SJF
Shortest-seek-time-first (SSTF, also shortest-seek-first or SSF): picking requests on nearest track to complete first
problems:
- the drive geometry is not available to the OS (fix: nearest-block-first, NBF)
- starvation: a steady stream to the closest track will block the access to other tracks
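A minimal SSTF sketch, assuming the scheduler sees only a queue of pending requests tagged with track (or, for NBF, block) numbers:

```c
#include <stdlib.h>  // abs()

// Pick the pending request on the track nearest the current head position.
// Note the starvation problem: a steady stream of nearby requests keeps
// winning, and far-away requests may never be chosen.
int pick_sstf(int head_track, const int track[], int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (abs(track[i] - head_track) < abs(track[best] - head_track))
            best = i;
    return best;  // index of the chosen request
}
```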
Elevator algorithm⌗
SCAN: simply moves back and forth across the disk, servicing requests in track order
F-SCAN: freezes the queue when doing a sweep (avoid starvation of far-away requests)
C-SCAN (Circular SCAN): only sweeps from outer-to-inner (more fair to the inner and outer tracks)
Prevent fights from taking place on elevators!
SCAN and SSTF do not adhere closely to SJF: they ignore rotation
SPTF: Shortest positioning time first⌗
Also called shortest access time first, SATF
In engineering, it turns out “it depends” is almost always the answer, reflecting that trade-offs are part of the life of the engineer.
On modern drives, seek and rotation are roughly equivalent in cost
SPTF is usually performed inside a drive
Other scheduling issues⌗
Where to perform scheduling: the OS scheduler usually picks what it thinks are the best few requests (say 16) and issues them all to the disk; the disk then uses its internal knowledge to service those requests in the best possible (SPTF) order.
I/O merging: merge I/O requests by the OS
Non-work-conserving approaches: the disk may wait a while instead of servicing a request immediately, in anticipation of a better request arriving
38 Redundant Arrays of Inexpensive Disks (RAIDs)⌗
RAID: a technique to use multiple disks in concert to build a faster, bigger, and more reliable disk system
- externally: looks like a single disk
- internally: consists of multiple disks, memory, and one or more processors
38.1 Interface And RAID Internals⌗
To perform a logical I/O:
- calculate the actual disk(s)
- perform one or more physical I/Os
38.2 Fault Model⌗
RAIDs are designed to detect and recover from certain kinds of disk faults.
The fail-stop fault model: a disk is either working or failed, failures are easily detected, and a failed disk is assumed permanently lost
38.3 How To Evaluate A RAID⌗
Characteristics:
- capacity
- reliability
- performance
38.4 RAID Level 0: Striping⌗
RAID level 0, or striping as it is better known, serves as an excellent upper-bound on performance and capacity.
Striping: no redundancy; spread the blocks of the array across the disks in a round-robin fashion, to extract the most parallelism.
Simple striping:
Disk 0 | Disk 1 | Disk 2 | Disk 3 |
---|---|---|---|
0 | 1 | 2 | 3 |
4 | 5 | 6 | 7 |
8 | 9 | 10 | 11 |
12 | 13 | 14 | 15 |
Stripe: blocks in the same row (0, 1, 2, 3)
Striping with a bigger chunk size (8KB chunks, i.e. 2 blocks each; a stripe now consists of 32KB of data):
Disk 0 | Disk 1 | Disk 2 | Disk 3 |
---|---|---|---|
0 | 2 | 4 | 6 |
1 | 3 | 5 | 7 |
8 | 10 | 12 | 14 |
9 | 11 | 13 | 15 |
Chunk sizes⌗
- small chunk size: a single file is striped across more disks, increasing the parallelism of reads and writes to it; but positioning time grows, since it is determined by the maximum positioning time across the disks involved
- large chunk size: less intra-file parallelism, but also less positioning time
Thus, determining the “best” chunk size is hard to do, as it requires a great deal of knowledge about the workload presented to the disk system.
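A sketch of the logical-block-to-physical-location mapping that striping implies (chunk_blocks = 1 gives the simple layout above, chunk_blocks = 2 the bigger-chunk layout):

```c
// Map logical block A to (disk, block offset on that disk) under striping.
void map_stripe(int A, int ndisks, int chunk_blocks, int *disk, int *offset) {
    int chunk  = A / chunk_blocks;      // which chunk holds block A
    int stripe = chunk / ndisks;        // which row of chunks (stripe)
    *disk   = chunk % ndisks;           // chunks go round-robin across disks
    *offset = stripe * chunk_blocks + A % chunk_blocks;
}
```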
RAID-0 Analysis⌗
- capacity: perfect ($N\cdot B$)
- reliability: any disk failure will lead to data loss
- performance: all disks are utilized, often in parallel
Evaluating RAID Performances⌗
Two performance metrics:
- single-request latency
- steady-state throughput (critical in high-performance environments)
Two types of workloads (considered here):
- sequential: large contiguous chunks
- searching through a large file for a keyword
- random: small requests, each to a different (random) location on disk
- transactional workloads on a database management system
Denote the sequential and random workload’s throughput as:
- $S$ for sequential
- $R$ for random
38.5 RAID Level 1: Mirroring⌗
RAID-10 (or RAID 1+0, stripe of mirrors):
Disk 0 | Disk 1 | Disk 2 | Disk 3 |
---|---|---|---|
0 | 0 | 1 | 1 |
2 | 2 | 3 | 3 |
4 | 4 | 5 | 5 |
6 | 6 | 7 | 7 |
RAID-01 (or RAID 0+1, mirror of stripes):
Disk 0 | Disk 1 | Disk 2 | Disk 3 |
---|---|---|---|
0 | 1 | 0 | 1 |
2 | 3 | 2 | 3 |
4 | 5 | 4 | 5 |
6 | 7 | 6 | 7 |
- read: either copy
- write: update both
Analysis⌗
Consistent-update problem: an untimely failure between the two writes leaves the copies inconsistent.
Solution: a write-ahead log plus a recovery procedure.
- capacity: expensive; useful capacity is $N\cdot B/2$ with mirroring level 2
- reliability: tolerates the failure of any one disk; with luck, up to $N/2$ disks can fail (one per mirror pair)
- single-request latency
  - read: same as on a single disk
  - write: slightly higher than on a single disk, since it must wait for both physical writes and thus suffers the worst-case seek and rotational delay of the two
- steady-state sequential throughput
  - write: $\frac{N}{2}\cdot S$ (each logical write turns into two physical writes)
  - read: also $\frac{N}{2}\cdot S$ (each disk only delivers every other block to the client, wasting half its bandwidth rotating over skipped blocks)
- steady-state random throughput
  - read: $N\cdot R$ (reads are distributed across all disks)
  - write: $\frac{N}{2}\cdot R$ (two physical writes per logical write, performed in parallel)
38.6 RAID Level 4: Saving Space With Parity⌗
Parity-based approaches attempt to use less capacity and thus overcome the huge space penalty paid by mirrored systems. They do so at a cost, however: performance.
Disk 0 | Disk 1 | Disk 2 | Disk 3 | Disk 4 |
---|---|---|---|---|
0 | 1 | 2 | 3 | P0 |
4 | 5 | 6 | 7 | P1 |
8 | 9 | 10 | 11 | P2 |
12 | 13 | 14 | 15 | P3 |
The parity function is a simple XOR: the parity bit is chosen so that each row has an even number of 1s
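A minimal sketch of XOR parity over the data blocks of one stripe; the same XOR run over the survivors (parity included) reconstructs any single lost block:

```c
#include <stddef.h>

#define BLOCK_SIZE 4096

// out[i] = XOR over all data blocks at byte i; after one data block is
// lost, running this over the surviving blocks plus parity recovers it.
void xor_parity(unsigned char *const blocks[], size_t nblocks,
                unsigned char out[BLOCK_SIZE]) {
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        unsigned char p = 0;
        for (size_t b = 0; b < nblocks; b++)
            p ^= blocks[b][i];
        out[i] = p;
    }
}
```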
Analysis⌗
- capacity: $(N-1)\cdot B$
- reliability: tolerates 1 disk failure
- performance
- steady-state sequential: $(N-1)\cdot S$ (a full-stripe write computes the parity block first, then writes all disks in parallel)
- random
- read: $(N-1)\cdot R$
- write: $R/2$ (the parity disk is the bottleneck: every write must read and then write it)
  - additive parity: read the other data blocks in parallel, XOR them with the new block, then write data and parity in parallel (larger RAIDs require a high number of reads)
  - subtractive parity: $P_\mathrm{new}=(C_\mathrm{old}\oplus C_\mathrm{new})\oplus P_\mathrm{old}$ (not always better: subtractive always costs two reads and two writes, while the additive method's read count scales with the number of disks, so for small arrays additive can be cheaper)
Small-write problem: under small writes, the parity disk becomes a bottleneck that prevents parallelism even across unrelated writes
38.7 RAID Level 5: Rotating Parity⌗
Disk 0 | Disk 1 | Disk 2 | Disk 3 | Disk 4 |
---|---|---|---|---|
0 | 1 | 2 | 3 | P0 |
5 | 6 | 7 | P1 | 4 |
10 | 11 | P2 | 8 | 9 |
15 | P3 | 12 | 13 | 14 |
P4 | 16 | 17 | 18 | 19 |
Analysis⌗
random write: $\frac{N}{4}\cdot R$: each logical write still costs four I/Os (read old data, read old parity, write new data, write new parity), but with rotated parity those I/Os spread across all disks and proceed in parallel, so total throughput is the full $N\cdot R$ bandwidth divided by 4
39 Interlude: Files and Directories⌗
Thus, the OS must take extra care with such a device: this is where users keep data that they really care about.
39.1 Files And Directories⌗
- file: a linear array of bytes, with a low-level name (its inode number)
- directory: also has a low-level name (inode number); its contents are a list of (user-readable name, low-level name) pairs
39.3 Creating Files⌗
Create a new file: call open() and pass the O_CREAT flag:
int fd = open("foo", O_CREAT|O_WRONLY|O_TRUNC, S_IRUSR|S_IWUSR);
- O_CREAT: create the file if it does not exist
- O_WRONLY: the file can only be written to
- O_TRUNC: if the file already exists, truncate it to a size of zero bytes
39.5 Reading And Writing, But Not Sequentially⌗
Reading or writing at some random offset within the file:
off_t lseek(int fildes, off_t offset, int whence);
- fildes: file descriptor
- offset: positions the file offset to a particular location
- whence: how the offset is interpreted
  - SEEK_SET: the offset is set to offset bytes
  - SEEK_CUR: the offset is set to its current location plus offset bytes
  - SEEK_END: the offset is set to the size of the file plus offset bytes
Calling lseek() does not perform a disk seek; it only changes the “current offset” that the OS tracks for each open file:
- implicitly updated by every read or write
- explicitly updated with lseek()
The file structure (as in xv6):
struct file {
    int ref;            // reference count
    char readable;
    char writable;
    struct inode * ip;  // the underlying inode
    uint off;           // the current offset
};
The open file table:
struct {
struct spinlock lock;
struct file file[NFILE];
} ftable;
39.6 Shared File Table Entries: fork() And dup()⌗
An open file table entry is shared, with its reference count incremented, by: (1) a parent and the child it creates with fork(); (2) a file descriptor and its duplicates created with dup().
39.7 Writing Immediately With fsync()⌗
Call fsync() to force all dirty (not yet written) data to disk.
int rc = write(fd, buffer, size); assert(rc == size);
rc = fsync(fd); assert(rc == 0);
In some cases, it is also needed to fsync()
the containing directory to ensure that the file is part of the directory.
Not surprisingly, this type of detail is often overlooked, leading to many application-level bugs.
mmap() and persistent memory: by combining the persistence of files with the access semantics of memory, file-backed memory mappings support a software abstraction called persistent memory.
39.8 Renaming Files⌗
The rename(char *old, char *new) call is usually implemented as an atomic call.
An example to use this kind of guarantee:
int fd = open("foo.txt.tmp", O_WRONLY|O_CREAT|O_TRUNC, S_IRUSR|S_IWUSR);
write(fd, buffer, size);
fsync(fd);
close(fd);
rename("foo.txt.tmp", "foo.txt");
39.9 Getting Information About Files⌗
Use stat() or fstat() to access the metadata of a file or directory.
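A minimal sketch (the file name "foo" is just an example):

```c
#include <stdio.h>
#include <sys/stat.h>

int main(void) {
    struct stat s;
    if (stat("foo", &s) == 0)   // fstat() does the same given a descriptor
        printf("size=%lld links=%ld inode=%llu\n",
               (long long) s.st_size, (long) s.st_nlink,
               (unsigned long long) s.st_ino);
    return 0;
}
```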
39.10 Removing Files⌗
Use unlink() to remove a file.
39.11 Making Directories⌗
Use mkdir() to create a directory.
An empty directory has two entries:
- "." refers to itself
- ".." refers to its parent
39.12 Reading Directories⌗
Use opendir(), readdir() and closedir() to read the contents of a directory:
DIR * dp = opendir(".");
assert(dp != NULL);
struct dirent * d;
while ((d = readdir(dp)) != NULL) {
printf("%lu %s\n", (unsigned long) d->d_ino, d->d_name);
}
closedir(dp);
39.13 Deleting Directories⌗
Call rmdir() to delete a directory (the directory must be empty).
39.14 Hard Links⌗
Call link() to make a new entry in the file system tree: it simply creates another name in a directory and refers it to the same inode number.
ln file file2
When creating a file, two things happen:
- a structure (the inode) is made to track relevant information about the file
- a human-readable name is linked to that file, and that link is put into a directory
When the file system unlinks a file, it checks the reference count (link count) within the inode, and only frees the inode and related data blocks when the count reaches zero.
39.15 Symbolic Links⌗
Call symlink() to make a symbolic link (soft link) to a file.
ln -s file file2
Differences from hard links:
- a symbolic link is actually a file itself, of a different type, holding the pathname of the linked-to file
- dangling references become possible (the link survives the removal of its target)
39.16 Permission Bits And Access Control Lists⌗
Permission bits consist of three groupings:
- owner
- group
- other (anyone)
The owner of the file can change the permissions.
But now you have learned an even more important lesson: there can be no secrets in a good marriage, even within the file system.
The Time Of Check To Time Of Use (TOCTTOU) problem: caused by the gap between checking a condition and using the result of that check.
39.17 Making And Mounting A File System⌗
And mkfs said, let there be a file system!
Mount: to take an existing directory as a target mount point and paste a new file system onto the directory tree at that point.
Mounting unifies all file systems into one tree, making naming uniform and convenient.
40 File System Implementation⌗
vsfs (the Very Simple File System)
The file system is pure software; unlike our development of CPU and memory virtualization, we will not be adding hardware features to make some aspect of the file system work better (though we will want to pay attention to device characteristics to make sure the file system works well).
40.2 Overall Organization⌗
The disk is divided into:
- data blocks: user data
- inodes
- allocation structures: to track whether inodes or data blocks are free or allocated
- superblock: contains information about the file system
40.3 File Organization: The Inode⌗
Inode (index node): the data structure that holds the metadata of a given file
i-number (low-level name): the number by which an inode is implicitly referred to; given an i-number, the inode's location on disk can be calculated directly
Calculation:
$$ block = (inumber\times sizeof(inode)) / blockSize $$
$$ sector = ((block\times blockSize) + inodeStartAddr) / sectorSize $$
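A quick check with assumed vsfs-like numbers (256-byte inodes, 4 KB blocks, 512-byte sectors, inode table starting at 12 KB), locating inode 32:

$$ block = (32\times 256)/4096 = 2 $$

$$ sector = ((2\times 4096)+12288)/512 = 40 $$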
Metadata in an inode:
- type: regular file, directory, etc.
- size
- number of blocks
- protection information
- time
- data block position on disk
The Multi-Level Index⌗
Indirect pointer: points to a block that contains pointers, each of which points to a block of user data
An inode may have some fixed number of direct pointers and a single indirect pointer.
If a file grows large enough, an indirect block is allocated (from the data block) and the inode’s indirect pointer is set to point to it.
Extent-based approaches: a disk pointer plus a length; allow for more than one extent
Linked-list approaches: use a linked list to keep track of all the blocks of a file
Double indirect pointer: refer to a block that contains indirect pointers
Most files are small: a small number of direct pointers is sufficient
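How large can a file get under this scheme? A worked example with assumed parameters (4 KB blocks, 4-byte disk addresses, 12 direct pointers, one indirect and one double-indirect pointer); each 4 KB indirect block holds $4096/4 = 1024$ pointers, so:

$$ (12 + 1024 + 1024^2)\times 4\,\mathrm{KB} = 1049612\times 4\,\mathrm{KB} \approx 4\,\mathrm{GB} $$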
40.4 Directory Organization⌗
Deleting a file can leave an empty space in the middle of the directory; a new entry may reuse an old, bigger, deleted entry.
File systems often treat directories as a special type of file, and thus they have inodes in the inode table.
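A sketch of what an on-disk entry might look like, in the style the chapter describes (the struct name is illustrative); the record length is what lets a new entry reuse an old, bigger, deleted one:

```c
struct dirent_ondisk {
    int  inum;     // inode number (low-level name); 0 can mark an unused entry
    int  reclen;   // total bytes of this record, including any slack space
    int  strlen;   // length of the name actually stored
    char name[];   // the user-readable name
};
```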
40.5 Free Space Management⌗
To track free inode/data blocks:
- use a bitmap
- use a free list
- use more sophisticated data structures, e.g. B-trees
Pre-allocation policy: reserve a sequence of blocks for a newly-created file, to guarantee that a portion of the file is contiguous on the disk.
40.6 Access Paths: Reading and Writing⌗
Reading A File From Disk⌗
The timeline of reading a file /foo/bar:

open("/foo/bar", O_RDONLY):
- read the root inode (/, whose i-number is well known: 2 in most UNIX file systems)
- read / data (to look for an entry for foo)
- read foo inode
- read foo data (to look for an entry for bar)
- read bar inode

read(...):
- read bar inode
- read the relevant block of bar's data
- write bar inode (to update the last access time)
Yes, life can get pretty bad when reading a file; as you’re about to find out, writing out a file (and especially, creating a new one) is even worse.
Reads don’t access allocation structures (e.g. bitmaps); they are only accessed when allocation is needed. There is no need to make sure a block is allocated when the inode already points to it
Writing A File To Disk⌗
Each write to a file that allocates a new block logically generates five I/Os:
- read file inode
- read data bitmap
- write data bitmap (to mark the newly-allocated block as used)
- write file data
- write file inode
For file creation:
- traverse the pathname (reading the inode and then the data of each directory along the way) to find the containing directory
- read containing directory data
- read inode bitmap (find a free inode)
- write inode bitmap (to mark the inode allocated)
- write directory data (link)
- read/write directory inode
- read/write inode of the new file (initialize)
- additional I/O might be needed if the directory needs to grow
40.7 Caching and Buffering⌗
Most file systems aggressively use system memory (DRAM) to cache important blocks.
Early file systems used a fixed-size cache to hold popular blocks; this static partitioning method can be wasteful.
Modern systems use a dynamic partitioning approach.
Static vs. dynamic partitioning
- static: predictable performance
- dynamic: better utilization; more complex to implement
Buffering: delay writes to reduce I/O
In this case, laziness (in writing blocks to disk) is a virtue.
41 Locality and The Fast File System⌗
41.1 The Problem: Poor Performance⌗
Main issue of the old UNIX file system: data was spread all over the place, leading to expensive positioning costs
Some problems:
- an inode and the data blocks it points to are often far apart
- the file system becomes more and more fragmented over time
- the block size was too small (512 bytes), so transferring data was inefficient
41.3 Organizing Structure: The Cylinder Group⌗
Cylinder: the set of tracks on different surfaces that are the same distance from the center of the drive
FFS divides the disk into a number of cylinder groups.
As modern drives only export a logical address space of blocks, modern file systems instead organize the drive into block groups, each of which is a consecutive portion of the disk’s address space. By placing two files within the same group, FFS can ensure that accessing one after the other will not result in long seek time.
FFS needs the ability to place files and directories into a group and to track relevant information about them, so each group includes all the structures a file system is expected to have.
The components of a single cylinder group:
- a copy of the super block
- a per-group inode bitmap and data bitmap
- inode region
- data block region
41.4 Policies: How To Allocate Files and Directories⌗
The basic mantra is simple: keep related stuff together (and its corollary, keep unrelated stuff far apart).
A few placement heuristics:
- directories: find the group
- with a low number of allocated directories (to balance directories across groups)
- with a high number of free inodes (to be able to allocate a bunch of files)
- files:
- allocate the data blocks in the same group as its inode
- places all files that are in the same directory in the cylinder group of the directory they are in
41.6 The Large-File Exception⌗
Filling an entire block group with one large file prevents related files from being placed within the same group.
FFS' policy for large files: after some number of blocks are allocated in the first block group, place the next large chunk of the file in another block group, and so on.
To address the resulting performance problem, choose the chunk size carefully: by amortization, if each chunk is large enough, the time spent seeking between chunks becomes relatively small.
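A worked example of that amortization, with assumed numbers: with an average positioning time of 10 ms and a transfer rate of 40 MB/s, reaching 50% of peak bandwidth means spending as much time transferring as positioning, so each chunk must be at least

$$ 40\,\mathrm{MB/s}\times 10\,\mathrm{ms} = 409.6\,\mathrm{KB} $$

and reaching 90% of peak (transferring nine times as long as positioning) would require chunks of about 3.6 MB.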
FFS’s approach
- the first twelve blocks (the direct pointers) are placed in the same group as the inode
- each subsequent indirect block, and the blocks it points to, is placed in a different group
Note that the trend in disk drives is that transfer rate improves fairly rapidly, as disk manufacturers are good at cramming more bits into the same surface, but the mechanical aspects of drives related to seeks (disk arm speed and the rate of rotation) improve rather slowly.
41.7 A Few Other Things About FFS⌗
Internal fragmentation: with 4KB blocks, many 2KB files waste half a block.
Sub-blocks: 512-byte sub-blocks can be allocated to a small file; its data is copied into a full 4KB block once the file grows beyond 4KB.
To address the performance cost of those copies: buffer writes in the library, avoiding the sub-block path in most cases.
A refined disk layout (parameterization): lay out every other block (e.g. 0 6 1 7 2 8 3 9 4 10 5 11) so that a sequential read does not miss the next block while the head passes over it; the modern solution to the same problem is the track buffer.
FFS also supports:
- long file names
- symbolic links
Certainly all modern systems account for the main lesson of FFS: treat the disk like it’s a disk.