Operating Systems: Three Easy Pieces, Study Notes (Part 1)
36 I/O Devices⌗
36.1 System Architecture⌗
Devices are connected to the system via:
- memory bus: connects the CPU to main memory
- general I/O bus: PCI (graphics and other high-performance devices)
- peripheral bus: SCSI, SATA, USB (slow devices)
In modern systems, the interconnect is more specialized:
- PCIe (PCI Express) for graphics
- a dedicated memory interconnect between the CPU and main memory
- DMI (Direct Media Interface) connecting the CPU to an I/O chip
- PCIe and eSATA hanging off the I/O chip for network and storage devices
36.2 A Canonical Device⌗
A device has two important components:
- interface, typically presented as registers:
  - status
  - command
  - data
- internal structure
36.3 The Canonical Protocol⌗
Programmed I/O (PIO): the main CPU is involved with the data movement
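A minimal sketch of the canonical four-step protocol, assuming hypothetical register accessors and a BUSY bit (names are illustrative, not from any real device):

```c
// Hypothetical register accessors; a real driver would use port or
// memory-mapped I/O (see 36.6).
#define BUSY 0x1

extern unsigned int read_status(void);
extern void write_data(const void *buf, int len);
extern void write_command(unsigned int cmd);

void canonical_write(const void *buf, int len, unsigned int cmd) {
    while (read_status() & BUSY)   // 1. poll until the device is not busy
        ;
    write_data(buf, len);          // 2. PIO: the CPU itself moves the data
    write_command(cmd);            // 3. start the device
    while (read_status() & BUSY)   // 4. poll until the request completes
        ;
}
```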
36.4 Lowering CPU Overhead With Interrupts⌗
Ways to better utilize the CPU instead of spin-polling:
- interrupts: put the calling process to sleep and context-switch; the device raises a hardware interrupt on completion
- a two-phased (hybrid) approach: poll first for a while, then use interrupts if the request is not yet finished
- interrupt coalescing: the device batches multiple completions into a single interrupt, trading delivery latency for less interrupt overhead
36.5 More Efficient Data Movement With DMA⌗
Direct Memory Access (DMA): transfer data between devices and main memory without much CPU intervention
36.6 Methods Of Device Interaction⌗
- explicit I/O instructions (on x86, the in and out instructions), usually privileged
- memory-mapped I/O (no new instructions are needed)
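A minimal sketch of the memory-mapped style, assuming a hypothetical device whose registers appear at fixed physical addresses (the addresses and the busy bit are made up for illustration):

```c
// With memory-mapped I/O the hardware maps device registers into the
// address space; ordinary loads and stores are routed to the device.
#define DEV_STATUS ((volatile unsigned int *) 0xFEDC0000) // hypothetical
#define DEV_CMD    ((volatile unsigned int *) 0xFEDC0004) // hypothetical
#define DEV_BUSY   0x1

void dev_issue(unsigned int cmd) {
    while (*DEV_STATUS & DEV_BUSY)  // a regular load reads the status register
        ;
    *DEV_CMD = cmd;                 // a regular store issues the command
}
```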
36.7 Fitting Into The OS: The Device Driver⌗
Read/write blocks:
- file system
- raw interface (support low-level storage management)
36.8 Case Study: A Simple IDE Disk Driver⌗
Registers are made available by reading or writing to specific “I/O addresses” using (on x86) the in and out I/O instructions.
- wait for drive to be ready
- write parameters to command registers
- start the I/O: issuing read/write to command register
- data transfer (for writes): write data to data port
- handle interrupts
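The chapter illustrates these steps with xv6's IDE driver; below is a condensed sketch modeled on xv6's ide.c (inb/outb/outsl are x86 port-I/O helpers, struct buf is xv6's buffer type, and 0x1f0-0x1f7 are the standard IDE ports):

```c
#define IDE_BSY       0x80
#define IDE_DRDY      0x40
#define IDE_CMD_READ  0x20
#define IDE_CMD_WRITE 0x30

// wait for drive to be ready
static int ide_wait_ready(void) {
    while ((inb(0x1f7) & (IDE_BSY | IDE_DRDY)) != IDE_DRDY)
        ;  // spin until BUSY clears and READY is set
    return 0;
}

// write parameters and start one request for buffer b
static void ide_start_request(struct buf *b) {
    ide_wait_ready();
    outb(0x3f6, 0);                        // generate interrupt on completion
    outb(0x1f2, 1);                        // sector count: 1
    outb(0x1f3, b->sector & 0xff);         // LBA bits 0-7
    outb(0x1f4, (b->sector >> 8) & 0xff);  // LBA bits 8-15
    outb(0x1f5, (b->sector >> 16) & 0xff); // LBA bits 16-23
    outb(0x1f6, 0xe0 | ((b->dev & 1) << 4) | ((b->sector >> 24) & 0x0f));
    if (b->flags & B_DIRTY) {
        outb(0x1f7, IDE_CMD_WRITE);        // start the I/O: a write
        outsl(0x1f0, b->data, 512 / 4);    // data transfer: copy to data port
    } else {
        outb(0x1f7, IDE_CMD_READ);         // start the I/O: a read
    }                                      // the interrupt handler finishes up
}
```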
37 Hard Disk Drives⌗
37.1 The Interface⌗
The drive consists of a large number of sectors (512-byte blocks), each of which can be read or written.
View the disk as an array of sectors.
- multi-sector operations are possible
- only a single 512-byte write is guaranteed to be atomic
Efficiency:
- accessing blocks close to one another is faster than accessing blocks far apart
- sequential reads or writes are faster than random access
37.2 Basic Geometry⌗
- platter: a circular hard surface on which data is stored magnetically
- surface: each platter has two of them (its two sides)
- spindle: the shaft to which the platters are bound, spun by a motor
- track: a concentric circle of sectors
- disk head: performs the reading and writing, one per surface
- disk arm: moves the head across the surface to the desired track
37.3 A Simple Disk Drive⌗
I/O time components:
- rotational delay
- seek time, which itself has several phases:
  - acceleration
  - coasting
  - deceleration
  - settling (significant)
- transfer
Track skew: sectors on adjacent tracks are offset so that sequential reads can still be serviced, without waiting an extra full rotation, when crossing a track boundary
Multi-zoned disk drives:
- a zone is a consecutive set of tracks on a surface
- within a zone, each track has the same number of sectors
- outer zones have more sectors per track
Cache (track buffer):
- read: the drive may read in all of the sectors on a track and cache them in its memory
- write, two ways to acknowledge:
  - write back: acknowledge as soon as the data is in the cache (faster, but risky)
  - write through: acknowledge only after the write has actually reached the disk
37.4 I/O Time: Doing The Math⌗
I/O time, and the resulting rate of I/O:
$$ T_\mathrm{I/O} = T_\mathrm{seek}+T_\mathrm{rotation}+T_\mathrm{transfer} $$
$$ R_\mathrm{I/O}=\cfrac{Size_\mathrm{transfer}}{T_\mathrm{I/O}} $$
- a huge gap in drive performance between random and sequential workloads
- a large difference in performance between high-end “performance” drives and low-end “capacity” drives
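A quick worked example with assumed numbers (roughly a high-end 15,000 RPM drive: average seek 4 ms, peak transfer 125 MB/s), for a random 4 KB read:

$$ T_\mathrm{rotation} \approx \frac{1}{2}\cdot\frac{60}{15000}\,\mathrm{s} = 2\,\mathrm{ms},\qquad T_\mathrm{transfer} = \frac{4\,\mathrm{KB}}{125\,\mathrm{MB/s}} \approx 30\,\mathrm{\mu s} $$

$$ T_\mathrm{I/O} \approx 4+2+0.03 \approx 6\,\mathrm{ms},\qquad R_\mathrm{I/O} = \frac{4\,\mathrm{KB}}{6\,\mathrm{ms}} \approx 0.66\,\mathrm{MB/s} $$

For a long sequential read the transfer term dominates and $R_\mathrm{I/O}$ approaches the 125 MB/s peak, which is the huge random-vs-sequential gap the first bullet refers to.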
37.5 Disk Scheduling⌗
Principle: shortest job first, SJF
Shortest-seek-time-first (SSTF, also shortest-seek-first or SSF): picking requests on nearest track to complete first
problems:
- the drive geometry is not available to the OS (fix: nearest-block-first, NBF)
- starvation: a steady stream to the closest track will block the access to other tracks
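A minimal SSTF sketch, assuming the scheduler sees only a queue of pending requests tagged with track (or, for NBF, block) numbers:

```c
#include <stdlib.h>  // abs()

// Pick the pending request on the track nearest the current head position.
// Note the starvation problem: a steady stream of nearby requests keeps
// winning, and far-away requests may never be chosen.
int pick_sstf(int head_track, const int track[], int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (abs(track[i] - head_track) < abs(track[best] - head_track))
            best = i;
    return best;  // index of the chosen request
}
```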
Elevator algorithm⌗
SCAN: simply moves back and forth across the disk, servicing requests in track order
F-SCAN: freezes the queue when doing a sweep (avoid starvation of far-away requests)
C-SCAN (Circular SCAN): only sweeps from outer-to-inner (more fair to the inner and outer tracks)
Prevent fights from taking place on elevators!
SCAN and SSTF do not adhere closely to SJF: they ignore rotation
SPTF: Shortest positioning time first⌗
Also called shortest access time first, SATF
In engineering, it turns out “it depends” is almost always the answer, reflecting that trade-offs are part of the life of the engineer.
On modern drives, seek and rotation are roughly equivalent in cost
SPTF is usually performed inside a drive
Other scheduling issues⌗
Where to perform scheduling: the OS scheduler usually picks what it thinks are the best few requests (say 16) and issues them all to the disk; the disk then uses its internal knowledge to service those requests in the best possible (SPTF) order.
I/O merging: merge I/O requests by the OS
Non-work-conserving approaches: the disk may wait a while instead of servicing a request immediately, in anticipation of a better request arriving
38 Redundant Arrays of Inexpensive Disks (RAIDs)⌗
RAID: a technique to use multiple disks in concert to build a faster, bigger, and more reliable disk system
- externally: looks like a single disk
- internally: consists of multiple disks, memory, and one or more processors
38.1 Interface And RAID Internals⌗
To perform a logical I/O:
- calculate the actual disk(s)
- perform one or more physical I/Os
38.2 Fault Model⌗
RAIDs are designed to detect and recover from certain kinds of disk faults.
The fail-stop fault model: a disk is either working or failed, failures are easily detected, and a failed disk is assumed permanently lost
38.3 How To Evaluate A RAID⌗
Characteristics:
- capacity
- reliability
- performance
38.4 RAID Level 0: Striping⌗
RAID level 0, or striping as it is better known, serves as an excellent upper-bound on performance and capacity.
Striping: no redundancy; spread the blocks of the array across the disks in a round-robin fashion, to extract the most parallelism.
Simple striping:
Disk 0 | Disk 1 | Disk 2 | Disk 3 |
---|---|---|---|
0 | 1 | 2 | 3 |
4 | 5 | 6 | 7 |
8 | 9 | 10 | 11 |
12 | 13 | 14 | 15 |
Stripe: blocks in the same row (0, 1, 2, 3)
Striping with a bigger chunk size (8KB chunks, i.e. 2 blocks each; a stripe now consists of 32KB of data):
Disk 0 | Disk 1 | Disk 2 | Disk 3 |
---|---|---|---|
0 | 2 | 4 | 6 |
1 | 3 | 5 | 7 |
8 | 10 | 12 | 14 |
9 | 11 | 13 | 15 |
Chunk sizes⌗
- small chunk size: a single file is striped across more disks, increasing the parallelism of reads and writes to it; but positioning time grows, since it is determined by the maximum positioning time across the disks involved
- large chunk size: less intra-file parallelism, but also less positioning time
Thus, determining the “best” chunk size is hard to do, as it requires a great deal of knowledge about the workload presented to the disk system.
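A sketch of the logical-block-to-physical-location mapping that striping implies (chunk_blocks = 1 gives the simple layout above, chunk_blocks = 2 the bigger-chunk layout):

```c
// Map logical block A to (disk, block offset on that disk) under striping.
void map_stripe(int A, int ndisks, int chunk_blocks, int *disk, int *offset) {
    int chunk  = A / chunk_blocks;      // which chunk holds block A
    int stripe = chunk / ndisks;        // which row of chunks (stripe)
    *disk   = chunk % ndisks;           // chunks go round-robin across disks
    *offset = stripe * chunk_blocks + A % chunk_blocks;
}
```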
RAID-0 Analysis⌗
- capacity: perfect ($N\cdot B$)
- reliability: any disk failure will lead to data loss
- performance: all disks are utilized, often in parallel
Evaluating RAID Performances⌗
Two performance metrics:
- single-request latency
- steady-state throughput (critical in high-performance environments)
Two types of workloads (considered here):
- sequential: large contiguous chunks
- searching through a large file for a keyword
- random: small requests, each to a different (random) location on disk
- transactional workloads on a database management system
Denote the sequential and random workload’s throughput as:
- $S$ for sequential
- $R$ for random
38.5 RAID Level 1: Mirroring⌗
RAID-10 (or RAID 1+0, stripe of mirrors):
Disk 0 | Disk 1 | Disk 2 | Disk 3 |
---|---|---|---|
0 | 0 | 1 | 1 |
2 | 2 | 3 | 3 |
4 | 4 | 5 | 5 |
6 | 6 | 7 | 7 |
RAID-01 (or RAID 0+1, mirror of stripes):
Disk 0 | Disk 1 | Disk 2 | Disk 3 |
---|---|---|---|
0 | 1 | 0 | 1 |
2 | 3 | 2 | 3 |
4 | 5 | 4 | 5 |
6 | 7 | 6 | 7 |
- read: either copy
- write: update both
Analysis⌗
Consistent-update problem: an untimely failure between the two writes leaves the copies inconsistent.
Solution: a write-ahead log plus a recovery procedure.
- capacity: expensive; useful capacity is $N\cdot B/2$ with mirroring level 2
- reliability: tolerates the failure of any one disk; with luck, up to $N/2$ disks can fail (one per mirror pair)
- single-request latency
  - read: same as on a single disk
  - write: slightly higher than on a single disk, since it must wait for both physical writes and thus suffers the worst-case seek and rotational delay of the two
- steady-state sequential throughput
  - write: $\frac{N}{2}\cdot S$ (each logical write turns into two physical writes)
  - read: also $\frac{N}{2}\cdot S$ (each disk only delivers every other block to the client, wasting half its bandwidth rotating over skipped blocks)
- steady-state random throughput
  - read: $N\cdot R$ (reads are distributed across all disks)
  - write: $\frac{N}{2}\cdot R$ (two physical writes per logical write, performed in parallel)
38.6 RAID Level 4: Saving Space With Parity⌗
Parity-based approaches attempt to use less capacity and thus overcome the huge space penalty paid by mirrored systems. They do so at a cost, however: performance.
Disk 0 | Disk 1 | Disk 2 | Disk 3 | Disk 4 |
---|---|---|---|---|
0 | 1 | 2 | 3 | P0 |
4 | 5 | 6 | 7 | P1 |
8 | 9 | 10 | 11 | P2 |
12 | 13 | 14 | 15 | P3 |
The parity function is a simple XOR: the parity bit is chosen so that each row has an even number of 1s
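A minimal sketch of XOR parity over the data blocks of one stripe; the same XOR run over the survivors (parity included) reconstructs any single lost block:

```c
#include <stddef.h>

#define BLOCK_SIZE 4096

// out[i] = XOR over all data blocks at byte i; after one data block is
// lost, running this over the surviving blocks plus parity recovers it.
void xor_parity(unsigned char *const blocks[], size_t nblocks,
                unsigned char out[BLOCK_SIZE]) {
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        unsigned char p = 0;
        for (size_t b = 0; b < nblocks; b++)
            p ^= blocks[b][i];
        out[i] = p;
    }
}
```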
Analysis⌗
- capacity: $(N-1)\cdot B$
- reliability: tolerates 1 disk failure
- performance
- steady-state sequential: $(N-1)\cdot S$ (a full-stripe write computes the parity block first, then writes all disks in parallel)
- random
- read: $(N-1)\cdot R$
- write: $R/2$ (the parity disk is the bottleneck: every write must read and then write it)
  - additive parity: read the other data blocks in parallel, XOR them with the new block, then write data and parity in parallel (larger RAIDs require a high number of reads)
  - subtractive parity: $P_\mathrm{new}=(C_\mathrm{old}\oplus C_\mathrm{new})\oplus P_\mathrm{old}$ (not always better: subtractive always costs two reads and two writes, while the additive method's read count scales with the number of disks, so for small arrays additive can be cheaper)
Small-write problem: under small writes, the parity disk becomes a bottleneck that prevents parallelism even across unrelated writes
38.7 RAID Level 5: Rotating Parity⌗
Disk 0 | Disk 1 | Disk 2 | Disk 3 | Disk 4 |
---|---|---|---|---|
0 | 1 | 2 | 3 | P0 |
5 | 6 | 7 | P1 | 4 |
10 | 11 | P2 | 8 | 9 |
15 | P3 | 12 | 13 | 14 |
P4 | 16 | 17 | 18 | 19 |
Analysis⌗
random write: $\frac{N}{4}\cdot R$: each logical write still costs four I/Os (read old data, read old parity, write new data, write new parity), but with rotated parity those I/Os spread across all disks and proceed in parallel, so total throughput is the full $N\cdot R$ bandwidth divided by 4
39 Interlude: Files and Directories⌗
Thus, the OS must take extra care with such a device: this is where users keep data that they really care about.
39.1 Files And Directories⌗
- file: a linear array of bytes, with a low-level name (its inode number)
- directory: also has a low-level name (inode number); its contents are a list of (user-readable name, low-level name) pairs
39.3 Creating Files⌗
Create a new file: call open() and pass the O_CREAT flag:
int fd = open("foo", O_CREAT|O_WRONLY|O_TRUNC, S_IRUSR|S_IWUSR);
- O_CREAT: create the file if it does not exist
- O_WRONLY: the file can only be written to
- O_TRUNC: if the file already exists, truncate it to a size of zero bytes
39.5 Reading And Writing, But Not Sequentially⌗
Reading or writing at some random offset within the file:
off_t lseek(int fildes, off_t offset, int whence);
- fildes: file descriptor
- offset: positions the file offset to a particular location
- whence: how the offset is interpreted
  - SEEK_SET: the offset is set to offset bytes
  - SEEK_CUR: the offset is set to its current location plus offset bytes
  - SEEK_END: the offset is set to the size of the file plus offset bytes
Calling lseek() does not perform a disk seek; it only changes the “current offset” that the OS tracks for each open file:
- implicitly updated by every read or write
- explicitly updated with lseek()
The file structure (as in xv6):
struct file {
    int ref;            // reference count
    char readable;
    char writable;
    struct inode * ip;  // the underlying inode
    uint off;           // the current offset
};
The open file table:
struct {
struct spinlock lock;
struct file file[NFILE];
} ftable;
39.6 Shared File Table Entries: fork() And dup()⌗
An open file table entry is shared, with its reference count incremented, by: (1) a parent and the child it creates with fork(); (2) a file descriptor and its duplicates created with dup().
39.7 Writing Immediately With fsync()⌗
Call fsync() to force all dirty (not yet written) data to disk.
int rc = write(fd, buffer, size); assert(rc == size);
rc = fsync(fd); assert(rc == 0);
In some cases, it is also needed to fsync()
the containing directory to ensure that the file is part of the directory.
Not surprisingly, this type of detail is often overlooked, leading to many application-level bugs.
mmap() and persistent memory: by combining the persistence of files with the access semantics of memory, file-backed memory mappings support a software abstraction called persistent memory.
39.8 Renaming Files⌗
The rename(char *old, char *new) call is usually implemented as an atomic call.
An example to use this kind of guarantee:
int fd = open("foo.txt.tmp", O_WRONLY|O_CREAT|O_TRUNC, S_IRUSR|S_IWUSR);
write(fd, buffer, size);
fsync(fd);
close(fd);
rename("foo.txt.tmp", "foo.txt");
39.9 Getting Information About Files⌗
Use stat() or fstat() to access the metadata of a file or directory.
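A minimal sketch (the file name "foo" is just an example):

```c
#include <stdio.h>
#include <sys/stat.h>

int main(void) {
    struct stat s;
    if (stat("foo", &s) == 0)   // fstat() does the same given a descriptor
        printf("size=%lld links=%ld inode=%llu\n",
               (long long) s.st_size, (long) s.st_nlink,
               (unsigned long long) s.st_ino);
    return 0;
}
```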
39.10 Removing Files⌗
Use unlink() to remove a file.
39.11 Making Directories⌗
Use mkdir() to create a directory.
An empty directory has two entries:
- "." refers to itself
- ".." refers to its parent
39.12 Reading Directories⌗
Use opendir(), readdir() and closedir() to read the contents of a directory:
DIR * dp = opendir(".");
assert(dp != NULL);
struct dirent * d;
while ((d = readdir(dp)) != NULL) {
printf("%lu %s\n", (unsigned long) d->d_ino, d->d_name);
}
closedir(dp);
39.13 Deleting Directories⌗
Call rmdir() to delete a directory (the directory must be empty).
39.14 Hard Links⌗
Call link() to make a new entry in the file system tree: it simply creates another name in a directory and refers it to the same inode number.
ln file file2
When creating a file, two things happen:
- a structure (the inode) is made to track relevant information about the file
- a human-readable name is linked to that file, and that link is put into a directory
When the file system unlinks a file, it checks the reference count (link count) within the inode, and only frees the inode and related data blocks when the count reaches zero.
39.15 Symbolic Links⌗
Call symlink() to make a symbolic link (soft link) to a file.
ln -s file file2
Differences from hard links:
- a symbolic link is actually a file itself, of a different type, holding the pathname of the linked-to file
- dangling references become possible (the link survives the removal of its target)
39.16 Permission Bits And Access Control Lists⌗
Permission bits consist of three groupings:
- owner
- group
- other (anyone)
The owner of the file can change the permissions.
But now you have learned an even more important lesson: there can be no secrets in a good marriage, even within the file system.
The Time Of Check To Time Of Use (TOCTTOU) problem: caused by the gap between checking a condition and using the result of that check.
39.17 Making And Mounting A File System⌗
And mkfs said, let there be a file system!
Mount: to take an existing directory as a target mount point and paste a new file system onto the directory tree at that point.
Mounting unifies all file systems into one tree, making naming uniform and convenient.
40 File System Implementation⌗
vsfs (the Very Simple File System)
The file system is pure software; unlike our development of CPU and memory virtualization, we will not be adding hardware features to make some aspect of the file system work better (though we will want to pay attention to device characteristics to make sure the file system works well).
40.2 Overall Organization⌗
The disk is divided into:
- data blocks: user data
- inodes
- allocation structures: to track whether inodes or data blocks are free or allocated
- superblock: contains information about the file system
40.3 File Organization: The Inode⌗
Inode (index node): the data structure that holds the metadata of a given file
i-number (low-level name): the number by which an inode is implicitly referred to; given an i-number, the inode's location on disk can be calculated directly
Calculation:
$$ block = (inumber\times sizeof(inode)) / blockSize $$
$$ sector = ((block\times blockSize) + inodeStartAddr) / sectorSize $$
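A quick check with assumed vsfs-like numbers (256-byte inodes, 4 KB blocks, 512-byte sectors, inode table starting at 12 KB), locating inode 32:

$$ block = (32\times 256)/4096 = 2 $$

$$ sector = ((2\times 4096)+12288)/512 = 40 $$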
Metadata in an inode:
- type: regular file, directory, etc.
- size
- number of blocks
- protection information
- time
- data block position on disk
The Multi-Level Index⌗
Indirect pointer: points to a block that contains pointers, each of which points to a block of user data
An inode may have some fixed number of direct pointers and a single indirect pointer.
If a file grows large enough, an indirect block is allocated (from the data block) and the inode’s indirect pointer is set to point to it.
Extent-based approaches: a disk pointer plus a length; allow for more than one extent
Linked-list approaches: use a linked list to keep track of all the blocks of a file
Double indirect pointer: refer to a block that contains indirect pointers
Most files are small: a small number of direct pointers is sufficient
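How large can a file get under this scheme? A worked example with assumed parameters (4 KB blocks, 4-byte disk addresses, 12 direct pointers, one indirect and one double-indirect pointer); each 4 KB indirect block holds $4096/4 = 1024$ pointers, so:

$$ (12 + 1024 + 1024^2)\times 4\,\mathrm{KB} = 1049612\times 4\,\mathrm{KB} \approx 4\,\mathrm{GB} $$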
40.4 Directory Organization⌗
Deleting a file can leave an empty space in the middle of the directory; a new entry may reuse an old, bigger, deleted entry.
File systems often treat directories as a special type of file, and thus they have inodes in the inode table.
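A sketch of what an on-disk entry might look like, in the style the chapter describes (the struct name is illustrative); the record length is what lets a new entry reuse an old, bigger, deleted one:

```c
struct dirent_ondisk {
    int  inum;     // inode number (low-level name); 0 can mark an unused entry
    int  reclen;   // total bytes of this record, including any slack space
    int  strlen;   // length of the name actually stored
    char name[];   // the user-readable name
};
```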
40.5 Free Space Management⌗
To track free inode/data blocks:
- use a bitmap
- use a free list
- use more sophisticated data structures, e.g. B-trees
Pre-allocation policy: reserve a sequence of blocks for a newly-created file, to guarantee that a portion of the file is contiguous on the disk.
40.6 Access Paths: Reading and Writing⌗
Reading A File From Disk⌗
The timeline of reading a file /foo/bar:

open("/foo/bar", O_RDONLY):
- read the root inode (/, whose i-number is well known: 2 in most UNIX file systems)
- read / data (to look for an entry for foo)
- read foo inode
- read foo data (to look for an entry for bar)
- read bar inode

read(...):
- read bar inode
- read the relevant block of bar's data
- write bar inode (to update the last access time)
Yes, life can get pretty bad when reading a file; as you’re about to find out, writing out a file (and especially, creating a new one) is even worse.
Reads don’t access allocation structures (e.g. bitmaps); they are only accessed when allocation is needed. There is no need to make sure a block is allocated when the inode already points to it
Writing A File To Disk⌗
Each write to a file that allocates a new block logically generates five I/Os:
- read file inode
- read data bitmap
- write data bitmap (to mark the newly-allocated block as used)
- write file data
- write file inode
For file creation:
- traverse the pathname (reading the inode and then the data of each directory along the way) to find the containing directory
- read containing directory data
- read inode bitmap (find a free inode)
- write inode bitmap (to mark the inode allocated)
- write directory data (link)
- read/write directory inode
- read/write inode of the new file (initialize)
- additional I/O might be needed if the directory needs to grow
40.7 Caching and Buffering⌗
Most file systems aggressively use system memory (DRAM) to cache important blocks.
Early file systems used a fixed-size cache to hold popular blocks; this static partitioning method can be wasteful.
Modern systems use a dynamic partitioning approach.
Static vs. dynamic partitioning
- static: predictable performance
- dynamic: better utilization; more complex to implement
Buffering: delay writes to reduce I/O
In this case, laziness (in writing blocks to disk) is a virtue.
41 Locality and The Fast File System⌗
41.1 The Problem: Poor Performance⌗
Main issue of the old UNIX file system: data was spread all over the place, leading to expensive positioning costs
Some problems:
- an inode and the data blocks it points to are often far apart
- the file system becomes more and more fragmented over time
- the block size was too small (512 bytes), so transferring data was inefficient
41.3 Organizing Structure: The Cylinder Group⌗
Cylinder: the set of tracks on different surfaces that are the same distance from the center of the drive
FFS divides the disk into a number of cylinder groups.
As modern drives only export a logical address space of blocks, modern file systems instead organize the drive into block groups, each of which is a consecutive portion of the disk’s address space. By placing two files within the same group, FFS can ensure that accessing one after the other will not result in long seek time.
FFS needs the ability to place files and directories into a group and to track relevant information about them, so each group includes all the structures a file system is expected to have.
The components of a single cylinder group:
- a copy of the super block
- a per-group inode bitmap and data bitmap
- inode region
- data block region
41.4 Policies: How To Allocate Files and Directories⌗
The basic mantra is simple: keep related stuff together (and its corollary, keep unrelated stuff far apart).
A few placement heuristics:
- directories: find the group
- with a low number of allocated directories (to balance directories across groups)
- with a high number of free inodes (to be able to allocate a bunch of files)
- files:
- allocate the data blocks in the same group as its inode
- places all files that are in the same directory in the cylinder group of the directory they are in
41.6 The Large-File Exception⌗
Filling an entire block group with one large file prevents related files from being placed within the same group.
FFS' policy for large files: after some number of blocks are allocated in the first block group, place the next large chunk of the file in another block group, and so on.
To address the resulting performance problem, choose the chunk size carefully: by amortization, if each chunk is large enough, the time spent seeking between chunks becomes relatively small.
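A worked example of that amortization, with assumed numbers: with an average positioning time of 10 ms and a transfer rate of 40 MB/s, reaching 50% of peak bandwidth means spending as much time transferring as positioning, so each chunk must be at least

$$ 40\,\mathrm{MB/s}\times 10\,\mathrm{ms} = 409.6\,\mathrm{KB} $$

and reaching 90% of peak (transferring nine times as long as positioning) would require chunks of about 3.6 MB.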
FFS’s approach
- the first twelve blocks (the direct pointers) are placed in the same group as the inode
- each subsequent indirect block, and the blocks it points to, is placed in a different group
Note that the trend in disk drives is that transfer rate improves fairly rapidly, as disk manufacturers are good at cramming more bits into the same surface, but the mechanical aspects of drives related to seeks (disk arm speed and the rate of rotation) improve rather slowly.
41.7 A Few Other Things About FFS⌗
Internal fragmentation: with 4KB blocks, many 2KB files waste half a block.
Sub-blocks: 512-byte sub-blocks can be allocated to a small file; its data is copied into a full 4KB block once the file grows beyond 4KB.
To address the performance cost of those copies: buffer writes in the library, avoiding the sub-block path in most cases.
A refined disk layout (parameterization): lay out every other block (e.g. 0 6 1 7 2 8 3 9 4 10 5 11) so that a sequential read does not miss the next block while the head passes over it; the modern solution to the same problem is the track buffer.
FFS also supports:
- long file names
- symbolic links
Certainly all modern systems account for the main lesson of FFS: treat the disk like it’s a disk.