6 I/O¶
6.1 Introduction¶
Typical I/O Devices¶
I/O devices can be classified along several dimensions:
Behavior: Input (read once), output (write only, cannot read),
or storage (can be reread and usually rewritten)
Partner: Either a human or a machine is at the other end of the I/O device, either feeding data to input or reading data from output.
Data rate: The peak rate at which data are transferred between the I/O device and the main memory or processor.
How do we evaluate I/O performance?¶
Throughput (e.g., data centers):
I/O bandwidth can be measured in two different ways according to different situations:
(1) How much data can be transferred in a certain time?
For example, in many supercomputer applications, most I/O requests are for long streams of data, and transfer bandwidth is the important characteristic.
(2) How many I/O operations can be performed per unit time?
For example, the National Income Tax Service mainly processes a large number of small files, so the number of I/O operations per second matters most.
Response time (e.g., workstation and PC)
By Amdahl's Law, the sequential portion of a program limits the achievable speedup.
For example, suppose we want to use 100 processors to speed a program up by a factor of 90. The new execution time is \(\text{T}_{\text{new}}=\text T_{\text{parallelizable}}/100+\text T_{\text{sequential}}\)
To reach a 90× speedup, the parallelizable fraction must be about 0.999, which is clearly very hard to achieve.
Likewise, when speeding up a program we cannot ignore the I/O portion: improving only the CPU yields a much smaller overall gain because the I/O time remains.
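The Amdahl's Law arithmetic above can be checked with a short sketch (hypothetical helper names, assuming the simple two-term time model \(\text{T}_{\text{new}}=\text T_{\text{par}}/n+\text T_{\text{seq}}\)):

```python
def speedup(parallel_fraction: float, n: int) -> float:
    """Overall speedup when `parallel_fraction` of the work runs on n processors."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

def required_fraction(target_speedup: float, n: int) -> float:
    """Parallelizable fraction needed to reach `target_speedup` on n processors."""
    # Solve 1 / ((1 - f) + f/n) = s for f:
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / n)

f = required_fraction(90, 100)
print(f)   # ≈ 0.9989: about 99.9% of the work must be parallelizable
```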
6.2 Disk Storage and Dependability¶
A closer look at magnetic disks:
platters: disk consists of a collection of platters, each of which has two recordable disk surfaces
tracks: each disk surface is divided into concentric circles
sectors: each track is in turn divided into sectors, the smallest unit that can be read or written
How is data read from a disk?
Seek: position read/write head over the proper track
Rotational latency: wait for desired sector
For 5400 RPM: average latency = 0.5 rotation / (5400/60 rotations/sec) ≈ 5.6 ms
For 15000 RPM: average latency = 0.5 rotation / (15000/60 rotations/sec) = 2 ms
Transfer time:
time to transfer a sector (1 \(\text{KB/sector}\)) of data; determined by the density, rotation speed, etc.
Transfer rate of today's drives: 100 to 150 \(\text{MBytes/second}\)
Disk controller: controls the transfer between the disk and the memory
Example
Transfer rate: 50 MB/sec; seek time: 6 ms; controller time: 0.2 ms; rotation speed: 10000 RPM
Access Time = Seek time + Rotational Latency + Transfer time + Controller Time
Assuming the measured seek time is 25% of the calculated average, the seek time becomes 0.25 × 6 ms = 1.5 ms.
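The example can be worked through with a short script; a 1 KB sector is assumed from the transfer-time definition above, and decimal units (1 KB = 1000 B) keep the numbers close to the hand calculation:

```python
def disk_access_time_ms(seek_ms, rpm, sector_bytes, rate_bytes_per_s, controller_ms):
    rotational_ms = 0.5 * 60_000 / rpm                    # average: half a rotation
    transfer_ms = sector_bytes / rate_bytes_per_s * 1000  # 1 KB at 50 MB/s ≈ 0.02 ms
    return seek_ms + rotational_ms + transfer_ms + controller_ms

# With the quoted average seek time (6 ms):
full = disk_access_time_ms(6.0, 10_000, 1000, 50e6, 0.2)             # ≈ 9.22 ms
# With the measured seek time at 25% of the average (1.5 ms):
measured = disk_access_time_ms(0.25 * 6.0, 10_000, 1000, 50e6, 0.2)  # ≈ 4.72 ms
```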
Disk Performance Issues¶
\(\text{MTTF}\) (Mean Time to Failure)
\(\text{MTTR}\) (Mean Time to Repair)
\(\text{MTBF}\) (Mean Time Between Failures) = \(\text{MTTF}\) + \(\text{MTTR}\)
Availability: \(\text{Availability}=\frac{\text{MTTF}}{\text{MTTF+MTTR}}\)
Reliability: measured by \(\text{MTTF}\), i.e., how long the system runs before a failure occurs
Array of Disks¶
Once a disk fills up, we need more capacity. One option is to build a larger disk, but that is costly and difficult; instead, we add more disks to form an array. Consider the reliability of such an array: assuming independent failures, the MTTF of the array is the single-disk MTTF divided by the number of disks.
As the number of disks grows, reliability drops proportionally, and such a failure rate is unacceptable. The task, then, is to make this approach dependable!
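A minimal sketch of the scaling argument, assuming disks fail independently at a constant rate (the 50,000-hour single-disk MTTF is an assumed figure, not from the text):

```python
def array_mttf(disk_mttf_hours: float, n_disks: int) -> float:
    # With independent, constant failure rates, the rates add:
    # an N-disk array fails N times as often as one disk.
    return disk_mttf_hours / n_disks

single = 50_000                  # assumed single-disk MTTF (~5.7 years)
print(array_mttf(single, 1))    # 50000 hours
print(array_mttf(single, 70))   # ≈ 714 hours: roughly one failure per month
```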
Three Ways to Improve MTTF
Fault avoidance: preventing fault occurrence by construction
Fault tolerance: using redundancy to allow the service to comply with the service specification despite faults occurring, which applies primarily to hardware faults
Fault forecasting: predicting the presence and creation of faults, which applies to hardware and software faults
6.3 Redundant Arrays of (Inexpensive) Disks¶
Characteristics:
(1) Files are "striped" across multiple disks
(2) Redundancy improves data availability
(3) Lost data can be recovered from the data on the other disks
Drawbacks:
(1) Capacity penalty to store the redundant information
(2) Bandwidth penalty to keep the redundant information up to date
The third row of the table shows, taking eight disks as an example, how many redundant disks each RAID level requires.
RAID 0: No Redundancy¶
Data is striped across multiple disks, but there is no redundancy to tolerate disk failure.
Although it cannot tolerate a disk failure, striping still improves performance for large accesses, because many disks can operate in parallel.
Our laptops essentially all use RAID 0.
RAID 1: Disk Mirroring/Shadowing¶
Every disk is mirrored one-to-one: the same data must be written twice, which slows writes, but reads become faster (either copy can be read).
RAID 3: Bit-Interleaved Parity Disk¶
One disk serves as the parity disk; if any single disk fails, its contents can be recovered using the parity disk.
P contains sum of other disks per stripe mod 2 (“parity”)
If a disk fails, subtract P from the sum of the other disks to find the missing information. Recovery, however, is still fairly expensive.
Because the data is bit-striped across the disks, nearly every read must touch every disk, so requests from different users contend with each other. To let small reads go to individual disks, we stripe the data in blocks rather than bits.
RAID 4: Block-Interleaved Parity¶
Small reads, say of D0 and D5, can now proceed in parallel; a write of a full stripe, say D12, D13, D14, D15, is also fast.
Note
My understanding: RAID 3 stripes data across the disks bit by bit, while RAID 4 stripes it block by block. A small read in RAID 4 may therefore touch only one disk, whereas in RAID 3 it touches many disks and cannot be parallelized.
For large reads, RAID 4 faces the same situation as RAID 3.
For small writes, the parity disk can be updated with the algorithm below (new parity = old data XOR new data XOR old parity), so there is no need to read the other disks to recompute the parity.
However, writing D0 and D5 at the same time still conflicts on the single parity disk, so the writes cannot proceed in parallel. The solution is to distribute the parity blocks across all the disks instead of dedicating one disk to parity.
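The small-write parity update can be sketched as follows; the stripe contents are made-up bytes, and the point is only that the update touches the old data, the new data, and the old parity:

```python
def update_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """RAID 4/5 small-write rule: new parity = old data XOR new data XOR old parity."""
    return bytes(o ^ n ^ p for o, n, p in zip(old_data, new_data, old_parity))

# A tiny stripe: three one-byte data blocks plus their parity.
stripe = [b"\x0f", b"\xf0", b"\x33"]
parity = bytes(a ^ b ^ c for a, b, c in zip(*stripe))

# Overwrite block 0 without reading blocks 1 and 2:
new_block = b"\xaa"
parity = update_parity(stripe[0], new_block, parity)
stripe[0] = new_block

# The invariant still holds: parity equals the XOR of all data blocks.
assert parity == bytes(a ^ b ^ c for a, b, c in zip(*stripe))
```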
RAID 5: High I/O Rate Interleaved Parity¶
Now, writing D0 and D5 uses disks 0, 1, 3, and 4, so the two small writes can proceed in parallel.
Summary¶
Question
The first statement is correct: adding redundancy increases availability.
The second is correct: RAID 1 uses 100% redundancy.
The third is correct: every write involves all the disks, so congestion on any one disk affects the whole array.
The fourth is correct: for large writes, RAID 3, 4, and 5 all write to every disk; what RAID 5 improves is small writes.
6.4 Buses and Other Connections between Processors, Memory, and I/O Devices¶
The memory controller at the top is called the north bridge; the devices attached to it all demand high data bandwidth.
The I/O controller hub below is called the south bridge; it hangs off the north bridge and typically hosts slower devices.
Bus Basics¶
A bus contains two types of lines:
- Control lines: signal requests and acknowledgments, and indicate what type of information is on the data lines.
- Data lines: carry information (e.g., data, addresses, and complex commands) between the source and the destination.
A bus transaction includes two parts:
(1) sending the address
(2) receiving or sending the data
Two operations:
- input: receiving data from the device to memory
- output: sending data to a device from memory
Types of Buses¶
- Processor-memory: short, high speed, custom design
- Backplane: high speed, often standardized, e.g., PCI
- I/O: lengthy, different devices, standardized, e.g., SCSI
Synchronous vs. Asynchronous¶
Synchronous bus: uses a clock and a fixed protocol; fast and cheap, but every device must operate at the same rate, and clock skew requires the bus to be short.
Asynchronous bus: no clock; uses handshaking instead.
Handshaking protocol: a series of steps used to coordinate asynchronous bus transfers.
Asynchronous Example
Suppose a device wants to request data from memory (in the figure below, the orange portion is the I/O side).
- When memory sees the ReadReq line, it reads the address from the data bus, begins the memory read operation, then raises Ack to tell the device that the ReadReq signal has been seen.
- The I/O device sees the Ack line high and releases the ReadReq and data lines.
- Memory sees that ReadReq is low and drops the Ack line.
- When the memory has the data ready, it places the data on the data lines and raises DataRdy.
- The I/O device sees DataRdy, reads the data from the bus, and signals that it has the data by raising Ack.
- The memory sees the Ack signal, drops DataRdy, and releases the data lines.
- Finally, the I/O device, seeing DataRdy go low, drops the Ack line, which indicates that the transmission is complete.
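The protocol can be replayed as a plain event list (a toy sketch; the signal names follow the figure, and the device's initial request is included as an explicit first step):

```python
def handshake_read():
    """The asynchronous read handshake as (who, action) steps, in order."""
    return [
        ("device", "raise ReadReq, drive the address on the data lines"),
        ("memory", "latch the address, start the read, raise Ack"),
        ("device", "see Ack high: release ReadReq and the data lines"),
        ("memory", "see ReadReq low: drop Ack"),
        ("memory", "data ready: drive the data lines, raise DataRdy"),
        ("device", "see DataRdy: latch the data, raise Ack"),
        ("memory", "see Ack: drop DataRdy, release the data lines"),
        ("device", "see DataRdy low: drop Ack (transfer complete)"),
    ]

for who, action in handshake_read():
    print(f"{who:>6}: {action}")
```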
Comparing synchronous and asynchronous bus performance:
Assume:
- The synchronous bus has a clock cycle time of 50 ns, and each bus transmission takes 1 clock cycle.
- The asynchronous bus requires 40 ns per handshake.
- The data portion of both buses is 32 bits wide.
Question: Find the bandwidth for each bus when reading one word from a 200-ns memory.
Since the synchronous bus has a 50 ns clock cycle and each bus transmission takes one cycle:
- Send the address to memory: 1 cycle, 50 ns
- Read the memory: 200 ns
- Send the data to the device: 50 ns
The total time is 300 ns, so transferring one 4-byte word takes 300 ns and the bandwidth is \(4\text{ bytes}/300\text{ ns}=13.3\text{MB}/\text{second}\).
Note that several handshake steps can overlap with the memory read: as soon as step 1 completes and Ack goes high, the memory can begin reading. Step 4 finishes when the later of steps 2-3 and the memory read completes; once the data is ready, step 5 places it on the bus. The time therefore works out to:
- step 1: 40 ns
- steps 2-4: max(2 × 40 ns, 200 ns) = 200 ns
- steps 5-7: 3 × 40 ns = 120 ns
The total time is 360 ns, giving a bandwidth of \(4\text{ bytes}/360\text{ ns}=11.1\text{MB}/\text{second}\).
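The two bandwidth figures can be reproduced with the example's assumed constants (50 ns synchronous cycle, 40 ns per handshake step, 200 ns memory, 4-byte word):

```python
def sync_time_ns(cycle_ns=50, mem_ns=200):
    # address cycle + memory read + data cycle
    return cycle_ns + mem_ns + cycle_ns

def async_time_ns(step_ns=40, mem_ns=200):
    # step 1, then steps 2-4 overlapped with the memory read, then steps 5-7
    return step_ns + max(2 * step_ns, mem_ns) + 3 * step_ns

def bandwidth_mb_s(n_bytes, time_ns):
    return n_bytes / time_ns * 1000          # bytes/ns scaled to MB/second

sync_bw = bandwidth_mb_s(4, sync_time_ns())    # ≈ 13.3 MB/second
async_bw = bandwidth_mb_s(4, async_time_ns())  # ≈ 11.1 MB/second
```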
Bus Arbitration¶
Without any control, simultaneous requests from multiple devices cannot be handled, so the bus needs a bus master.
Four arbitration schemes:
- daisy chain arbitration (not very fair)
- centralized, parallel arbitration (requires an arbiter), e.g., PCI
- self selection, e.g., NuBus used in Macintosh
- collision detection, e.g., Ethernet
Choosing which I/O device gets the bus must balance bus priority and fairness.
Increasing the Bus Bandwidth¶
- Increase the data bus width (the most direct approach)
- Use separate address and data lines
- Transfer multiple words per transaction: increasing the number of words in a block raises the bus bandwidth
Suppose we have a system with the following characteristics:
- A memory and bus system supporting block access of 4 to 16 32-bit words.
- A 64-bit synchronous bus clocked at 200 MHz, with each 64-bit transfer taking 1 clock cycle, and 1 clock cycle required to send an address to memory.
- Two clock cycles needed between each bus operation.
- A memory access time for the first four words of 200 ns; each additional set of four words can be read in 20 ns. Assume that a bus transfer of the most recently read data and a read of the next four words can be overlapped.
QUESTION: Find the sustained bandwidth and the latency for a read of 256 words for transfers that use 4-word blocks and for transfers that use 16-word blocks. Also compute effective number of bus transactions per second for each case.
For 4-word blocks, each block requires:
- 1 clock cycle to send the address to memory
- 200 ns / (5 ns/cycle) = 40 clock cycles to read the memory
- 2 clock cycles to send the data from memory (4 words are 128 bits, and each transfer moves only 64 bits)
- 2 idle clock cycles between adjacent bus operations
Each block therefore takes 45 cycles, and 256 words make 64 blocks, so the whole transfer needs \(45\times 64 =2880 \text{ clock cycles}\).
Transferring 256 words thus takes \(2880 \text{ cycles}\times (5\text{ns}/\text{cycle})=14400\text{ ns}\).
The number of bus transactions per second is \(\frac{64}{14400\times 10^{-9}\text{ s}}\approx 4.44\text{M}\) transactions/second.
The bus bandwidth is \(\frac{1024\text{ bytes}}{14400\text{ ns}}\approx 71.11\text{ MB/second}\).
For 16-word blocks, each block requires:
- 1 clock cycle to send the address to memory
- 40 cycles to read the first four words
- 2 cycles to transfer each set of four words, during which the read of the next set (20 ns = 4 cycles) proceeds in parallel
- 2 idle clock cycles between adjacent bus operations
The timing is sketched in the figure below.
Transferring one 16-word block thus takes \(1+40+4\times(2+2)=57 \text{ cycles}\).
There are 16 blocks in total, needing \(57\times 16=912 \text{ cycles}\), or \(912 \text{ cycles}\times 5 \text{ ns/cycle}=4560 \text{ ns}\).
The number of bus transactions per second is \(\frac{16}{4560\times 10^{-9}\text{ s}}\approx 3.51\text{M}\) transactions/second.
The bus bandwidth is \(\frac{1024\text{ bytes}}{4560\text{ ns}}\approx 224.56\text{ MB/second}\).
The 16-word bandwidth is therefore about 3.16 times the 4-word bandwidth.
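Both block sizes can be run through one helper, using the example's figures (5 ns cycle, 64-bit transfers, 2 idle cycles between operations, 200 ns first read, overlapped 20 ns reads thereafter):

```python
def cycles_per_block(words):
    address = 1
    first_read = 200 // 5        # 40 cycles for the first four words
    sets = words // 4            # each set: 2 transfer + 2 idle cycles,
                                 # with the next read overlapped
    return address + first_read + sets * (2 + 2)

def transfer_stats(total_words, block_words, cycle_ns=5):
    blocks = total_words // block_words
    time_ns = blocks * cycles_per_block(block_words) * cycle_ns
    bandwidth_mb_s = total_words * 4 / time_ns * 1000
    return time_ns, bandwidth_mb_s

print(transfer_stats(256, 4))    # (14400 ns, ≈ 71.1 MB/s)
print(transfer_stats(256, 16))   # (4560 ns, ≈ 224.6 MB/s)
```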
6.5 Interfacing I/O Devices to the Memory, Processor, and Operating System¶
Three characteristics of I/O systems:
- They are shared by multiple programs using the processor.
- They often use interrupts to communicate information about I/O operations.
- The low-level control of an I/O device is complex.
Three types of communication are needed:
- The OS must be able to give commands to the I/O devices.
- The device must be able to notify the OS when it has completed an operation or encountered an error.
- Data must be transferred between memory and an I/O device.
For the first type of communication, one approach is memory-mapped I/O: a portion of the address space is assigned to I/O devices, so ordinary `lw`
and `sw`
instructions can access them. Alternatively, dedicated I/O instructions such as `in al,port`
and `out port,al`
can be used, together with separate control and data ports.
For communication between the processor and I/O there are two approaches: polling and interrupts. With polling, the processor periodically checks the device's status bits to decide whether an I/O operation is pending; with interrupts, the device signals the processor whenever it needs service.
For the third type, data transfer, we introduce DMA (direct memory access): the device controller transfers data directly to or from memory without involving the processor.
A DMA transfer proceeds in three steps:
- The processor sets up the DMA by supplying some information, including the identity of the device, the operation, the memory address that is the source or destination of the data to be transferred, and the number of bytes to transfer.
- The DMA unit starts the operation on the device and arbitrates for the bus. If the request requires more than one transfer on the bus, the DMA unit generates the next memory address and initiates the next transfer.
- Once the DMA transfer is complete, the controller interrupts the processor, which then examines whether any errors occurred.
Compare Polling, Interrupts, DMA
The disadvantage of polling is that it wastes a lot of processor time: when the CPU polls the I/O devices periodically, the devices may have no request pending or may not yet be ready.
If I/O is interrupt driven, the OS can work on other tasks while data is being read from or written to the device.
Because DMA does not need the processor's control, it consumes very little processor time.
Overhead of Polling in an I/O System
Assume that a polling operation takes 400 clock cycles and that the processor executes with a 500-MHz clock.
We also assume that you poll often enough that no data is ever lost and that the devices are potentially always busy.
We assume again that:
- The mouse must be polled 30 times per second to ensure that we do not miss any movement made by the user.
- The floppy disk transfers data to the processor in 16-bit units and has a data rate of 50 KB/sec. No data transfer can be missed.
- The hard disk transfers data in four-word chunks and can transfer at 4 MB/sec. Again, no transfer can be missed.
QUESTION: Determine the fraction of CPU time consumed for the mouse, floppy disk, and hard disk.
The Mouse
Clock cycles per second for polling: \(30\times400= 12000 \text{ cycles}\)
Fraction of the processor clock cycles consumed: \(\frac{12000}{500\times 10^6}\approx 0.002\%\)
The Floppy Disk
Number of polling accesses per second: \(\frac{50 \text{KB}}{2 \text{B}} = 25 \text{K}\) (at minimum)
Clock cycles per second for polling: \(25 \text{K} \times 400 = 10\times 10^6 \text{ cycles}\)
Fraction of the processor clock cycles consumed: \(\frac{10\times 10^6}{500\times 10^6}=2\%\)
The Hard Disk
Number of polling accesses per second: \(4 \text{MB} / 16 \text{B} = 250 \text{K}\)
Clock cycles per second for polling: \(250 \text{K} \times 400 = 100\times 10^6 \text{ cycles}\)
Fraction of the processor clock cycles consumed: \(\frac{100\times 10^6}{500\times 10^6}=20\%\)
Summary of Fractions
- Mouse: 0.002%
- Floppy Disk: 2%
- Hard Disk: 20%
Clearly, polling can be used for the mouse without much performance impact on the processor, but it is unacceptable for a hard disk on this machine.
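The three fractions come from one formula; the constants (400 cycles per poll, 500 MHz clock) are those of the example:

```python
CLOCK_HZ = 500e6
POLL_CYCLES = 400

def polling_fraction(polls_per_second):
    """Fraction of CPU cycles spent polling at the given rate."""
    return polls_per_second * POLL_CYCLES / CLOCK_HZ

mouse  = polling_fraction(30)         # 30 polls/sec         → ≈ 0.002%
floppy = polling_fraction(50e3 / 2)   # 50 KB/s in 2 B units  → 2%
disk   = polling_fraction(4e6 / 16)   # 4 MB/s in 16 B chunks → 20%
```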
Overhead of Interrupt-Driven I/O
Suppose we have the same hard disk and processor as in the previous example, but now use interrupt-driven I/O. The overhead for each transfer, including the interrupt, is 500 clock cycles.
QUESTION: Find the fraction of the processor consumed if the hard disk is only transferring data 5% of the time.
First, we assume the disk is transferring data 100% of the time, so the interrupt rate is the same as the polling rate.
Cycles per Second for Disk \(250 \text{K} \times 500 = 125 \times 10^6 \text{ cycles per second}\)
Fraction of the Processor Consumed during a Transfer: \(\frac{125\times 10^6}{500\times 10^6}=25\%\)
Now, we assume that the disk is only transferring data 5% of the time. The fraction of the processor time consumed on average is \(25\%\times 5\%=1.25\%\).
As we can see, no CPU time is needed when an interrupt-driven I/O device is not actually transferring. This is the major advantage of an interrupt-driven interface versus polling, which must keep checking the device regardless.
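The interrupt-driven figures follow the same pattern as the polling ones, with the key difference captured by a duty-cycle factor: an idle device costs nothing.

```python
CLOCK_HZ = 500e6

def interrupt_fraction(transfers_per_second, cycles_per_transfer=500, busy=1.0):
    """Fraction of CPU cycles spent servicing interrupts, scaled by how
    often the device is actually transferring."""
    return busy * transfers_per_second * cycles_per_transfer / CLOCK_HZ

always_busy  = interrupt_fraction(250e3)              # 25% if transferring 100% of the time
five_percent = interrupt_fraction(250e3, busy=0.05)   # 1.25% at a 5% duty cycle
```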
Overhead of I/O Using DMA
Suppose we have the same hard disk and processor we used in the former example.
Assume that the initial setup of a DMA transfer takes 1000 clock cycles for the processor, and assume the handling of the interrupt at DMA completion requires 500 clock cycles for the processor.
The hard disk has a transfer rate of 4MB/sec and uses DMA. The average transfer from disk is 8 KB. Assume the disk is actively transferring 100% of the time.
QUESTION: Please find what fraction of the processor time is consumed.
Time for each 8KB transfer is: \(8 \text{KB} / (4 \text{MB/second}) = 2 \times 10^{-3} \text{seconds}\)
Cycles the processor must devote per second: \(\frac{(1000+500)\text{ cycles}}{2\times 10^{-3}\text{ s}}=750\times 10^3 \text{ cycles per second}\)
Fraction of Processor Time: \(\frac{750\times 10^3}{500\times 10^6}=0.15\%\)
Unlike either polling or interrupt-driven I/O, DMA can be used to interface a hard disk without consuming most of the processor cycles for a single I/O: with DMA, the processor is involved only at the start and the end of each transfer.
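The DMA overhead calculation, with the example's constants (1000-cycle setup, 500-cycle completion interrupt, 8 KB transfers at 4 MB/sec, decimal units as in the text):

```python
CLOCK_HZ = 500e6

def dma_fraction(setup_cycles=1000, completion_cycles=500,
                 transfer_bytes=8e3, rate_bytes_per_s=4e6):
    transfer_s = transfer_bytes / rate_bytes_per_s   # 2 ms per 8 KB transfer
    transfers_per_s = 1 / transfer_s                 # 500 transfers/second
    cycles_per_s = (setup_cycles + completion_cycles) * transfers_per_s
    return cycles_per_s / CLOCK_HZ

print(dma_fraction())    # 0.0015, i.e. 0.15% of the processor
```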
6.6 Designing an I/O system¶
The general approach to designing an I/O system:
(1) Find the weakest link in the I/O system, which is the component in the I/O path that will constrain the design. Both the workload and configuration limits may dictate where the weakest link is located.
(2) Configure this component to sustain the required bandwidth.
(3) Determine the requirements for the rest of the system and configure them to support this bandwidth.
Example
Consider the following computer system:
- CPU performance: the CPU sustains 3 billion instructions per second, and each I/O operation takes an average of 100,000 instructions in the operating system.
- Memory backplane bus: capable of sustaining a transfer rate of 1000 MB/sec.
- SCSI-Ultra320 controllers: transfer rate of 320 MB/sec; each can accommodate up to 7 disks.
- Disk drives: read/write bandwidth of 75 MB/sec, with an average seek plus rotational latency of 6 ms.
The workload consists of 64-KB reads (assuming each data block is sequential on a track), and the user program needs 200,000 instructions per I/O operation.
QUESTION: please find the maximum sustainable I/O rate and the number of disks and SCSI controllers required.
The two fixed components of the system are the memory bus and the CPU. Let's first find the I/O rate that these two components can sustain and determine which of them is the bottleneck.
Maximum I/O Rate of CPU: \(\frac{3\times 10^9\text{ instructions/sec}}{(100+200)\times 10^3\text{ instructions per I/O}}=10000\text{ I/Os per second}\)
Maximum I/O Rate of Bus: \(\frac{1000\text{ MB/sec}}{64\text{ KB per I/O}}=15625\text{ I/Os per second}\)
Conclusion: The CPU is the bottleneck, so we can now configure the rest of the system to perform at the level dictated by the CPU, 10000 I/Os per second.
Now, let's determine how many disks we need to be able to accommodate 10000 I/Os per second. To find the number of disks, we first find the time per I/O operation at the disk:
Time per I/O at Disk: \(6\text{ ms}+\frac{64\text{ KB}}{75\text{ MB/sec}}\approx 6.9\text{ ms}\)
This means each disk can complete \(\frac{1000 \text{ms}}{6.9 \text{ms}}\), or 146 I/Os per second. To sustain 10000 I/Os per second, the system needs \(\frac{10000}{146} \approx 69\) disks.
To compute the number of SCSI buses, we need to know the average transfer rate per disk, which is \(\frac{64\text{ KB}}{6.9\text{ ms}}\approx 9.56\text{ MB/sec}\).
Since \(7 \times 9.56 \text{ MB/sec} \approx 67\text{ MB/sec} < 320 \text{ MB/sec}\), each controller can host its full complement of 7 disks. Assuming the disk accesses are not clustered, so that we can use all the bus bandwidth, we can place 7 disks per bus and controller. This means we will need \(\lceil 69/7\rceil\), or 10, SCSI buses and controllers.
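The whole sizing exercise condenses into a dozen lines; all constants are the example's, with decimal units (1 MB = 1000 KB) as in the hand calculation:

```python
import math

CPU_IPS = 3e9                   # instructions per second
INSTR_PER_IO = 100e3 + 200e3    # OS + user instructions per I/O
BUS_MB_S = 1000                 # backplane bus bandwidth
IO_KB = 64                      # workload: 64 KB reads
SEEK_ROT_MS = 6                 # average seek + rotational latency
DISK_MB_S = 75                  # disk read/write bandwidth
DISKS_PER_CONTROLLER = 7

cpu_rate = CPU_IPS / INSTR_PER_IO      # 10,000 I/Os per second
bus_rate = BUS_MB_S * 1000 / IO_KB     # 15,625 I/Os per second
target = min(cpu_rate, bus_rate)       # the CPU is the weakest link

time_per_io_ms = SEEK_ROT_MS + IO_KB / (DISK_MB_S * 1000) * 1000  # ≈ 6.9 ms
ios_per_disk = 1000 / time_per_io_ms                              # ≈ 146 per second
disks = math.ceil(target / ios_per_disk)                          # 69 disks
controllers = math.ceil(disks / DISKS_PER_CONTROLLER)             # 10 controllers
```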