Chapter 12 Query Optimization¶

文本统计：约 2611 个字 • 21 行代码

Info

本文参考查询优化 - HobbitQia的笔记本

12.1 Introduction¶

Alternative ways of evaluating a given query

Equivalent expressions 逻辑优化：关系代数表达式（尽量先做选择，投影）

Different algorithms for each operation 物理层面：每个算子选择不同的算法

Estimation of plan cost based on:

Statistical information about relations. Examples: number of tuples, number of distinct values for an attribute
Statistics estimation for intermediate results（Cardinality Estimation） to compute cost of complex expressions 估计中间结果的大小，现在有基于深度学习的估计方法
Cost formulae for algorithms, computed using statistics

12.2 Generating Equivalent Expressions¶

Two relational algebra expressions are said to be equivalent if the two expressions generate the same set of tuples on every legal database instance 形式上不一样，但是结果（输出）是一样的，产生了相同的集合。

12.2.1 Equivalence Rules¶

selection

可以把算子拆分, 如果某些属性有索引，那么可以先拆分，在索引 select 之后再执行其他算子，否则不如不拆分。
算子可交换，先执行有索引的算子。
投影的属性可以只保留最后一次的
选择算子可以和合并结合

join

自然连接是结合的且交换的（先连接中间结果小的）
如果选择算子只和一个关系有关，那么我们可以先执行选择。（选择算子要早进行，推到叶子上）

projection

同理投影也要早做。如果连接要用到投影后不保留的属性，我们在第一次投影时要把连接用的属性也保留下来。

set operation

这里的减法，减数关系就不用做选择了（减去多的总是没问题的）对交集也适用

12.2.2 Enumeration of Equivalent Expressions¶

Repeat
apply all applicable equivalence rules on every subexpression of every equivalent expression found so far
add newly generated expressions to the set of equivalent expressions
Until no new equivalent expressions are generated above

可以这样找到所有的等价表达式。

但是实际中我们基于一些经验规则进行启发式的优化

12.3 Statistics for Cost Estimation¶

代价估算需要统计信息

\(n_r\): number of tuples in a relation \(r\).
\(b_r\): number of blocks containing tuples of \(r\).
\(l_r\): size of a tuple of \(r\).
\(f_r\): blocking factor of \(r\) — i.e., the number of tuples of \(r\) that fit into one block. 一个块可以放多少个元组
\(V(A,r)\): number of distinct values that appear in \(r\) for attribute \(A\); same as the size of \(Π(r)\).
If tuples of \(r\) are stored together physically in a file, then: \(b_r=⌈\frac{n_r}{f_r}⌉\)
Histograms

12.3.1 Selection Size Estimation¶

\(σ_{A=v}(r)\)

\(n_r/V(A,r)\) : number of records that will satisfy the selection. 这样的估算基于值是平均分布的如果要找的是一个 key, 那么 size estimate=1

\(σ_{A≤v}(r)\)

Let \(c\) denote the estimated number of tuples satisfying the condition.
\(c=0\) if \(v<\min(A,r)\), \(v\) 比属性 A 的最小值还要小
c=\(n_r⋅\frac{v−\min(A,r)}{\max(A,r)−\min A(A,r)}\)
In absence of statistical information c is assumed to be \(n_r/2\) (没有最大、最小统计信息时).

对于复杂情况的选择，只需要用一点概率论的知识，假设每次选择都是独立的

\(\theta_i\) 条件的中选率为 \(\frac{s_i}{n_r}\), 其中 \(s_i\) 为符合条件的元组数目

Conjunction: \(\sigma_{\theta_1 \land \theta_2 \land \ldots \land \theta_n}(r)\)

Assuming independence, the estimate of tuples in the result is:

\[ \frac{n_r * s_1 * s_2 * \ldots * s_n}{n_r^n} \]

Disjunction: \(\sigma_{\theta_1 \lor \theta_2 \lor \ldots \lor \theta_n}(r)\)

Estimated number of tuples:

\[ n_r * \left( 1 - (1 - \frac{s_1}{n_r}) * (1 - \frac{s_2}{n_r}) * \ldots * (1 - \frac{s_n}{n_r}) \right) \]

Negation: \(\sigma_{\neg \theta}(r)\)

Estimated number of tuples:

\[ n_r - \text{size}(\sigma_\theta(r)) \]

12.3.2 Join¶

The Cartesian product \(r×s\) contains \(n_r⋅n_s\) tuples; each tuple occupies \(s_r+s_s\) bytes.

\(R∩S=∅\) 没有公共属性，\(r\bowtie s\) 等价于 \(r×s\)
\(R∩S\) is a key for \(R\), then a tuple of \(s\) will join with at most one tuple from \(r\), therefore, (the number of tuples in \(r\bowtie s\)) \(≤ n_s\)
If \(R \cap S\) in \(S\) is a foreign key in \(S\) referencing \(R\), then the number of tuples in \(r\bowtie s\) = the number of tuples in \(s\)
If \(R∩S=\{A\}\) is not a key for \(R\) or \(S\). the number of tuples in \(R\bowtie S\) is estimated to be \(\frac{n_r\times n_s}{V(A,s)}\), If the reverse is true, the estimate obtained will be: \(\frac{n_r\times n_s}{V(A,r)}\) The lower of these two estimates is probably the more accurate one.

Example

Can improve on above if histograms are available，Use formula similar to above, for each cell of histograms on the two relations 对每一格进行如上的操作

12.3.3 Other Operations¶

Projection: Estimated size of \(\Pi_A(r) = V(A, r)\)

Aggregation: Estimated size of \(_A\mathcal{G}_F(r) = V(A, r)\)

Set Operations:

For unions/intersections of selections on the same relation: Rewrite and use size estimate for selections. E.g. \(\sigma_{\theta_1}(r) \cup \sigma_{\theta_2}(r)\) can be rewritten as \(\sigma_{\theta_1 \lor \theta_2}(r)\).
For operations on different relations:
Estimated size of \(r \cup s = \text{size of } r + \text{size of } s\).
Estimated size of \(r \cap s = \min(\text{size of } r, \text{size of } s)\).
Estimated size of \(r - s = \text{size of } r\).
All three estimates may be quite inaccurate, but provide upper bounds on the sizes.

Outer join:

Estimated size of \(r ⟕ s\) = size of \(r \bowtie s\) + size of \(r\)
Estimated size of \(r ⟗ s\) = size of \(r \bowtie s\) + size of \(s\)

12.3.4 Estimation of Number of Distinct Values¶

估算 \(V(A,r)\).

Selections \(σ_θ(r)\), estimate \(V(A,σ_θ(r))\)

If \(θ\) forces \(A\) to take a specified value, \(V(A,σ_θ(r))=1\) e.g., A = 3
If \(θ\) forces \(A\) to take on one of a specified set of values: \(V(A,σ_θ(r))\)= number of specified values e.g., (\(A = 1 \lor A = 3 \lor A = 4\))
If the selection condition \(θ\) is of the form A op v, \(V(A,σ_θ(r))=V(A,r)∗s\) 利用选择率 \(s\) 计算
In all the other cases, use approximate estimate \(V(A,σ_θ(r))=\min(V(A,r), n_{σ_θ(r)})\)

joins \(r⋈s\), estimate \(V(A,r⋈s)\)

If all attributes in \(A\) are from \(r\), the estimated \(V(A,r⋈s)=\min(V(A,r),n_{r⋈s})\)
If \(A\) contains attributes A1 from r and A2 from s, then estimated \(V(A,r⋈s)=\min(V(A1,r)\times V(A2−A1,s),V(A1−A2,r)∗V(A2,s),n_{r⋈s})\)

Estimation of distinct values are straightforward for projections.

They are the same in \(\Pi_{A(r)}\) as in \(r\).

The same holds for grouping attributes of aggregation.

For aggregated values

For \(\min(A)\) and \(\max(A)\), the number of distinct values can be estimated as \(\min(V(A,r), V(G,r))\) where \(G\) denotes grouping attributes

12.4 Choice of Evaluation Plans¶

Must consider the interaction of evaluation techniques when choosing evaluation plans

choosing the cheapest algorithm for each operation independently may not yield best overall algorithm. e.g. merge-join may be costlier than hash-join, but may provide a sorted output which reduces the cost for an outer level aggregation. Merge-join 代价高，但是有个好处是 join 后是有次序的，对上层操作有利。【可能是非贪心的】

如果要找最优的执行计划，可能需要很长时间。通常按照经验规则。

我们主要考虑连接操作的优化。

12.4.1 Cost-Based Join-Order Selection¶

Consider finding the best join-order for \(r_1⋈r_2⋈…r_n\).

There are \((2(n–1))!/(n–1)!\) different join orders for above expression.

TOO MUCH!

Using dynamic programming, the least-cost join order for any subset of \(\{r_1,r_2,…r_n\}\) is computed only once and stored for future use.

Join Order Optimization Algorithm，相应的伪代码如下

利用动态规划所需的时空复杂度仍然很高，可以略微放宽，引入Left Deep Join Trees

In left-deep join trees, the right-hand-side input for each join is a relation, not the result of an intermediate join.

左边可以是中间结果，右边必须是一个关系。

这样我们将 “for each non-empty subset \(S1\) of \(S\) such that \(S1 \ne S\)” 改为 "for each relation \(r\) in \(S\) let \(S1 = S – r\) ." 虽然不一定是最佳的，但是可以大大降低时空复杂度

With dynamic programming

time complexity of optimization with bushy trees is \(O(3^n)\).
Space complexity is \(O(2^n)\)

left-deep join tree

Time complexity of finding best join order is \(O(n2^n)\)
Space complexity remains at \(O(2^n)\)

12.4.2 Heuristic Optimization¶

Cost-based optimization is expensive. 可以用启发式优化

Heuristic optimization transforms the query-tree by using a set of rules that typically (but not in all cases) improve execution performance:

Perform selection early (reduces the number of tuples) 更早做选择
Perform projection early (reduces the number of attributes) 更早做投影
Perform most restrictive selection and join operations (i.e. with smallest result size) before other similar operations. 先做选择率小的
Perform left-deep join order

12.5 Additional Optimization Techniques¶

12.5.1 Nested Subqueries¶

Nested query example:

select name from instructor 
where exists 
    (select * from teaches
    where instructor.ID = teaches.ID and teaches.year = 2022)

找出 2022 开课的老师的名字。

两重循环，但是低效。

Parameters are variables from outer level query that are used in the nested subquery; such variables are called correlation variables（相关变量） 即来自外循环的变量。如果没有相关变量，我们可以先执行内部，然后再执行外部，变成两个查询。

把刚刚那个例子改为一个 select 语句

select  name
from   instructor, teaches
where instructor.ID = teaches.ID and teaches.year = 2022

那么一个老师如果开了很多门课就会出现很多个名字。但是加上 distinct 关键词后又无法区分同名情况。

半连接 \(⋉_\theta s\)，检验 \(r\) 是否满足某个关系。【回忆半连接的定义，得到的结果是前面关系的子集】

If a tuple \(r_i\) appears \(n\) times in \(r\), it appears \(n\) times in the result of \(r⋉_θs\) , if there is at least one tuple \(s_i\) in \(s\) matching with \(r_i\).

相应的关系代数语句：\(\Pi_{\text{name}}(\text{instructor} \ltimes_{\text{instructor}.\text{ID}=\text{teaches}.\text{ID} ∧ \text{teaches}.\text{year}=2022} \text{teaches})\)

这样一个老师开多门课的情况或者有同名老师的情况都解决了

decorrelation(去除相关)

The process of replacing a nested query by a query with a join/semijoin (possibly with a temporary relation) is called decorrelation(去除相关)

Decorrelation of scalar aggregate subqueries can be done using groupby/aggregation in some cases

举个例子，下列 sql 语句

SELECT e.name, e.salary,
       (SELECT AVG(salary) 
        FROM employees 
        WHERE dept_id = e.dept_id) AS avg_salary
FROM employees e;

对于 e 中每一行都要进行一次子查询，费用很高。应该怎么通过聚合的方法进行去相关？

SELECT e.name, e.salary, a.avg_salary
FROM employees e
JOIN (
    SELECT dept_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY dept_id
) a ON e.dept_id = a.dept_id;

12.5.2 Materialized Views¶

A materialized view is a view whose contents are computed and stored.

有些数据库里把 view 实例化了，真正存储在内部的临时表。

create view department_total_salary(dept_name, total_salary) 
as select dept_name, sum(salary) from instructor group by dept_name

Saves the effort of finding multiple tuples and adding up their amounts. 但是需要时刻保持这个视图和原表一致。

use incremental view maintenance(增量视图维护) The changes (inserts and deletes) to a relation or expressions are referred to as its differential(差分)

join: \(V^{new}=V^{old}∪(i_r⋈s)\),\(V^{new}=V^{old}−(d_r⋈s)\)
select: \(V^{new}=V^{old}∪σ_θ(i_r)\), \(V^{new}=V^{old}−σ_θ(d_r)\)
projection: For each tuple in a projection \(\Pi_A(r)\), we will keep a count of how many times it was derived.
- On insert of a tuple to \(r\), if the resultant tuple is already in \(\Pi_A(r)\) we increment its count, else we add a new tuple with count = 1
- On delete of a tuple from \(r\), we decrement the count of the corresponding tuple in \(\Pi_A(r)\) if the count becomes 0, we delete the tuple from \(\Pi_A(r)\)
count \(v=_A\mathcal{G}_{count}(B)\)
- insert: For each tuple \(r\) in \(i_r\), if the corresponding group is already present in \(v\), we increment its count, else we add a new tuple with count = 1
- delete: for each tuple t in ir.we look for the group t.A in \(v\), and subtract 1 from the count for the group. If the count becomes 0, we delete from v the tuple for the group t.A
sum \(v=_A\mathcal{G}_{sum}(B)\)
min, max

怎么利用这些 view?

Rewriting queries to use materialized views

Example

Replacing a use of a materialized view by the view definition

Example

这些不是一尘不变的，不一定需要使用物化视图，要结合实际情况来做

物化视图的选择是很重要的，比如说给查询次数频繁的增加物化视图。

评论区

对你有帮助的话请给我个赞和 star =>

欢迎跟我探讨！！！