- 快召唤伙伴们来围观吧
- 微博 QQ QQ空间 贴吧
- 文档嵌入链接
- 复制
- 微信扫一扫分享
- 已成功复制到剪贴板
discovering statistical properties for query optimization
展开查看详情
1 .Discovering Statistical Properties for Query Optimization Presented by Han Fei
2 .Part I - Introduction to Query Optimization
3 .The Process of Query Optimization ● Enumerate all possible plans by transformation rules (Cascade Optimzier) ● Get the ordinary table/columns cardinality by simple predicate ● Derive cardinality for every subplan ● Estimate the cost of plan and choose the best one
4 .
5 .Most Common Statistics : Histogram ● Bucket Scheme ○ Equi-Width ○ Equi-Depth ○ Max Diff ○ V-Optimial ● Estimation Scheme ○ Continuous Spread Assumption ○ Four Level Tree
6 .What do We Need More for Modern Databases? ● Correlation Discovery ● ‘soft’ Functional Dependency ● Algebraic constraints ● Holes in joins ● Workload-aware Methods ● Incremental Maintance
7 .Part II - Proactive Methods
8 .How do we estimate multiple predicates? ● attribute value independece assumption ○ the distributions of individual attributes are independent of each other ● join uniformity assumption ○ a tuple from one relation is equally likely to join with any tuple from the second relation
9 .Automated Correlation Discovery ● Consider a table with tree attributes: ○ Education ○ Income ○ Home-holder ● some of the correlations between attributes might be indirect ones, mediate by others. ○ a high-school dropout who owns a successful Internet startup is more likely to own a home than a highly educated beach bum.
10 .Conditionally Independent ● P(H = h | E = e, I = i) = P (H = h | I = i) ● We just need hold some marginal distribution: ○ P(E) ○ P(I | E) ○ P(H | I) ● Then P (H , E, I) = P (E) P (I | E) P (H | I)
11 .Use Graphical Model to detect correlation ● We can build a Bayesian network which consists of two component ○ a DAG whose nodes correspond to A_1, ... , A_n, edges denote a direct dependency of A_i on its parents(A_i) ○ conditional probability distribution
12 .
13 .Part III - Reactive Methods
14 .Feedback based Histogram ● monitor queries on specified column gathering estimated and actual cardinalities ● create/refine maximum-entropy distribution at given condition PingCAP.com
15 .Max Entropy Principle ● We define every bucket as {l_i, r_i, f_i}, of which f_i is relative frequency. ● H(R) = - sum(m_i / ln (m_i/ h_i)) ● Use this principle to determine the boundaries PingCAP.com
16 .Meeting a Space Budget - Prunning ● PingCAP.com
17 .Meeting a Space Budget - Prunning ● PingCAP.com
18 .Thank You !