15 Recovery System¶
Info
15.1 Failure Classification¶
Transaction failure :
- Logical errors: transaction cannot complete due to some internal error condition 逻辑错误,事务因为一些原因未能完全执行
- System errors: the database system must terminate an active transaction due to an error condition (e.g., deadlock) 系统错误,比如出现死锁问题,需要将一些事务回退
System crash: a power failure or other hardware or software failure causes the system to crash.
- 所有运行的程序都会受到影响。分为两类:一类是事务已经提交(但是数据还在缓冲区),另一类是正在执行的事务(还没有提交)。已经提交的事务要 redo(数据可能没写回去),没有完成的事务要 undo.
Disk failure: a head crash or similar disk failure destroys all or part of disk storage
- 介质故障,要防止介质故障,需要做备份(拷贝或者远程)
Storage Structure
- Volatile storage
- Nonvolatile storage survives system crashes
- Stable storage:
- a mythical(虚拟的) form of storage that survives all failures
- approximated by maintaining multiple copies on distinct nonvolatile media 可以近似实现
15.2 Log-Based Recovery¶
A log is kept on stable storage(稳定存储器).
The log is a sequence of log records, and maintains a record of update activities on the database.
- When transaction Ti starts, it registers itself by writing a “start” log record:
<Ti start>
事务开始. Ti 表示事务的 id. - Before Ti executes write(X), writing “update” log record
<Ti,X,V1,V2>
事务把 X 数据项的值从 V1(old value) 改为 V2(new value). 这个就是恢复的基础. undo 就用 old value, redo 用 new value. Insert 就是 old 为空, Delete 就是 new 为空。 - When Ti finishes it last statement, writing “commit” log record:
<Ti commit>
- When Ti complete rollback, writing “abort” log record:
Log Example
这里当执行到 T2 回滚的时候我们会进行恢复(绿色的行表示补偿日志)比如 T2 把 C 恢复为 500, T3 把 B 恢复为 300, 最后 T2 abort. (undo 操作也会记录到日志中)
发生 crash 的时候 repeat history(undo 正常的操作也会重复), 随后得到并执行 undo list.(事务开始后先把事务放进去,如果提交或者回滚了就把事务移除) 只需要把 T4 undo.(假设故障前只执行到 15 行)
Write-Ahead Logging (先写日志原则)
Before a data in main memory is output to the database, the log records pertaining to data must have been output to stable storage.
Transaction Commit
A transaction is said to have committed when its commit log record is output to stable storage 日志已经记录 commit, 说明事务已经提交。(因为后续可以根据这个恢复状态了)但此时数据不一定已经写回到数据库里(不一定高效)
Writes performed by a transaction may still be in the buffer when the transaction commits, and may be output later 不一定在磁盘。如果立刻将 block 写回磁盘可能引起大量 I/O 操作
When recovering after failure:
-
Transaction Ti needs to be undone if the log
-
contains the record
<Ti start>
, -
but does not contain either the record
<Ti commit>
or<Ti abort>
. -
Transaction Ti needs to be redone if the log
- contains the records
<Ti start>
- and contains the record
<Ti commit>
or<Ti abort>
Checkpoint¶
Redoing/undoing all transactions recorded in the log can be very slow. Streamline recovery procedure by periodically performing check- pointing.
重演历史可能很长。checkpoint 是确认之前的操作都已经反映到数据库里去了,这样重演的时候就可以直接从 checkpoint 开始。
可以定期执行检查点操作。具体步骤如下:
- 将内存中的日志记录输出到稳定存储:
将当前驻留在主存中的所有日志记录写入稳定的存储介质(如磁盘)。这一步确保了日志信息的安全性,即使发生故障也能从稳定存储中恢复。【日志不是生成就往内存写,而是有一个日志缓冲区。】
- 将修改过的缓冲区块输出到磁盘:
将所有已修改但尚未写入磁盘的缓冲区块强制写入磁盘。这样可以保证数据的一致性和持久性。
- 写入检查点日志记录:
在稳定存储上写入一个特殊的日志记录 <checkpoint L>
,其中 L
是一个列表,包含在创建检查点时所有活跃事务的标识。这个记录用于标记检查点的位置和状态。
All updates are stopped while doing check-pointing !!! 做 checkpoint 的时候其他活跃事务都要停下来
Recovery Algorithm¶
Logging (during normal operation):
<Ti start>
at transaction start<Ti, Xj, V1, V2>
for each update, and<Ti commit>
at transaction end
Transaction rollback (during normal operation)
-
Let \(T_i\) be the transaction to be rolled back
-
Scan log backwards from the end, and for each log record of \(T_i\) of the form
<Ti, Xj, V1, V2>
-
perform the undo by writing \(V_1\) to \(X_j\),
-
write a log record
<Ti , Xj, V1>
[ such log records are called compensation log records(补偿记录)] -
Once the record
<Ti start>
is found stop the scan and write the log record<Ti abort>
Recovery from failure: Two phases
- Redo phase: replay updates of all transactions, whether they committed, aborted, or are incomplete –repeating history !!!
- Undo phase: undo all incomplete transactions
Redo phase:
-
Find last
<checkpoint L>
record, and set undo-list to L. -
Scan forward from above
<checkpoint L>
record - Whenever a record
<Ti, Xj, V1, V2>
is found, redo it by writing \(V_2\) to \(X_j\) - Whenever a (compensation) log record
<Ti, Xj, V2>
is found, redo it by writing \(V_2\) to \(X_j\) - Whenever a log record
<Ti start>
is found, add \(T_i\) to undo-list - Whenever a log record
<Ti commit>
or<Ti abort>
is found, remove Ti from undo-list
Undo phase: Scan log backwards from end
-
Whenever a log record
<Ti, Xj, V1, V2>
is found where \(T_i\) is in undo-list perform same actions as for transaction rollback: -
perform undo by writing \(V_1\) to \(X_j\).
-
write a log record
<Ti , Xj, V1>
-
Whenever a log record
<Ti start>
is found where \(T_i\) is in undo-list, - Write a log record
<Ti abort>
-
Remove \(T_i\) from undo-list
-
Stop when undo-list is empty
After undo phase completes, normal transaction processing can commence
Log Record Buffering¶
Log record buffering: log records are buffered in main memory, instead of of being output directly to stable storage.
- Log records are output to stable storage when a block of log records in the buffer is full, or a log force operation is executed.
Log force is performed to commit a transaction by forcing all its log records (including the commit record) to stable storage.
Group commit
several log records can be output using a single output operation, reducing the I/O cost.
Warning
Transaction Ti enters the commit state only when the log record <Ti commit>
has been output to stable storage.
Before a block of data in main memory is output to the database, all log records pertaining to data in that block must have been output to stable storage.
- This rule is called the write-ahead logging or WAL rule
Database Buffering¶
The recovery algorithm supports the no-force policy(非强制): i.e., updated blocks need not be written to disk when transaction commits.
事务 commit 了但不强制 blocks 刷写出去。
The recovery algorithm supports the steal policy(窃取策略):i.e., blocks containing updates of uncommitted transactions can be written to disk, even before the transaction commits.
事务提交之前脏数据能不能被写到磁盘里去?(同样地需要先把日志写出去)
Fuzzy Check-pointing¶
To avoid long interruption of normal processing during check-pointing, allow updates to happen during check-pointing
为避免做 check-pointing 的时候长时间停顿,引入 fuzzy check-pointing 的概念
Fuzzy check-pointing is done as follows:
- Temporarily stop all updates by transactions 暂时停下所有事务的更新
-
Write a
<checkpoint L>
log record and force log to stable storage -
Note list M of modified buffer blocks
- Now permit transactions to proceed with their actions 重新启动
-
Output to disk all modified buffer blocks in list M
-
blocks should not be updated while being output
-
Follow WAL: all log records pertaining to a block must be output before the block is output
-
-
Store a pointer to the checkpoint record in a fixed position last_checkpoint on disk
When recovering using a fuzzy checkpoint, start scan from the checkpoint record pointed to by last_checkpoint
通过 last_checkpoint 找到 checkpoint,因为我们是在脏数据写回硬盘之后才改变的 last_checkpoint,所以即使在执行checkpoint的过程中发生 crash,便会找到上一个 checkpoint,是安全的。
15.3 Recovery with Early Lock Release and Logical Undo Operations¶
在执行 B+ 树的插入和删除操作时,系统会尽早释放锁。于是我们无法通过恢复旧值来撤销操作(物理撤销),因为一旦锁被释放,其他事务可能已经更新了 B+ 树,因此不能简单地通过恢复旧值来撤销这些操作。这时候我们应该执行的就是逻辑撤销。
【感觉目的就在于处理那些被认为无法恢复的情况】
Transaction Rollback with Logical Undo
需要把每个操作的日志项记录下来(开始和结束). 这里在 end 时会记录 logical undo 的操作(减法撤销对应加法) 注意我们是在 end 的时候记录逻辑撤销的方法,如果这个操作还没有结束,那么我们只能物理撤销。
这里我们早放锁了,没有遵循严格两阶段放锁协议。在 \(T_0\) 还没有提交的时候 \(T_1\) 就对数据进行了修改.
恢复中做的是物理撤销(old+new), begin/end 这些日志就不需要记录了。
15.4 ARIES Recovery Algorithm¶
可以参考这篇文章:万字长文解析最常见的数据库恢复算法: ARIES_数据库具体的恢复算法-CSDN博客
关于 ARIES 算法,主要介绍一下三个步骤
Analysis¶
Starts from last complete checkpoint log record
- Reads DirtyPageTable from log record 从 log record 把 DPT 读出来
- Sets RedoLSN = min of RecLSNs of all pages in DirtyPageTable (In case no pages are dirty, RedoLSN = checkpoint record’s LSN) 设置 RedoLSN 为 DPT 中最小的 LSN(因为在这个LSN之前的数据全部都已经写到磁盘里面去了)
- Sets undo-list = list of transactions in checkpoint log record
- Reads LSN of last log record for each transaction in undo-list from checkpoint log record 找到 undo-list 中各个事务最近的那条 log 记录
Scans forward from checkpoint
- If any log record found for transaction not in undo-list, adds transaction to undo-list, 在 checkpoint 之后启动的事务,加入到 undo-list
- If transaction end log record found, delete transaction from undo-list 在checkpoint 之后结束的事务,从 undo-list 删掉
-
Keeps track of last log record for each transaction in undo-list 同样的,也要跟踪 undo-list 中每个事务最后的那个记录
-
Whenever an update log record is found, If page is not in DirtyPageTable, it is added with RecLSN set to LSN of the update log record 在 checkpoint 之后更新的记录,如果相应的页面不在 DPT 中,那么加入相应的页面,设置 RecLSN 为该LSN
Redo¶
Repeats history by replaying every action not already reflected in the page on disk, as follows:
Scans forward from RedoLSN. Whenever an update log record is found:
- If the page is not in DirtyPageTable or the LSN of the log record is less than the RecLSN of the page in DirtyPageTable, then skip the log record
- Otherwise fetch the page from disk. If the PageLSN of the page fetched from disk is less than the LSN of the log record, redo the log record
首先先检查这个页面在不在 DPT中,如果不在,那么说明这个页面不是脏页面,已经写到盘里面去了,不用再做一遍。如果这个页面在DPT中,但是相应的 RecLSN 比当前操作的 LSN 要大,那么说明当前操作执行的结果已经被刷到盘里面去了,所以也不用做。都不满足的话,就要从磁盘中调用相应的页面【执行前两步就是为了不调磁盘】如果相应的页面的 PageLSN(即最近刷到盘里的LSN)比当前操作的LSN要大,那么当前操作也不用做。
所以只有在该页面在 DPT 中,在 DPT 中记录的 RecLSN 比当前LSN 小,并且从磁盘中读出的 PageLSN 也比当前的 LSN 要小时,那么需要执行该更新操作。
为什么需要检查磁盘页面的 PageLSN
因为在 checkpoint 之后,可能将记录的 DPT 的一些页面刷入磁盘中去了,实际上已经不是脏的了,但是 DPT 并没有更新,所以需要将磁盘中的页面再调出来。注意我们采用的是 steal 的策略,所以在事务 commit 之前,也可能会进行刷盘。
Undo¶
Performs backward scan on log undoing all transaction in undo-list. Backward scan optimized by skipping unneeded log records as follows:
-
Next LSN to be undone for each transaction set to LSN of last log record for transaction found by analysis pass. 将每个事务(undo-list) 的 NextLSN 设置为在 analysis 中设置的最后一条记录
-
At each step pick largest of these LSNs to undo, skip back to it and undo it 对于每一步选择所有事务中 NextLSN 最大的 LSN 来做
-
After undoing a log record
- For ordinary log records, set next LSN to be undone for transaction to PrevLSN noted in the log record
- For compensation log records (CLRs) set next LSN to be undo to UndoNextLSN noted in the log record
undo 完相应的 log 之后如何设置下一条需要 undo 的指令,如果是正常日志,那么设置为 PrevLSN; 如果是 CLR 日志,那么就设置为 UndoNextLSN
partial rollback
Partial Rollback(部分回滚) 是数据库系统中的一种事务撤销操作,它指的是 只撤销事务执行过程中的部分操作,而不是整个事务的所有操作。
CLR 日志
When an undo is performed for an update log record
- Generate a CLR containing the undo action performed (actions performed during undo are logged physicaly or physiologically). CLR for record \(n\) noted as \(n’\) in figure below
- Set UndoNextLSN of the CLR to the PrevLSN value of the update log record