- 快召唤伙伴们来围观吧
- 微博 QQ QQ空间 贴吧
- 文档嵌入链接
- 复制
- 微信扫一扫分享
- 已成功复制到剪贴板
深入研究Spark SQL高级性能调整
展开查看详情
1 .Deep Dive Into SQL with Advanced Performance Tuning Xiao Li & Wenchen Fan Spark Summit | SF | Jun 2018 1
2 .About US • Software Engineers at • Apache Spark Committers and PMC Members Xiao Li (Github: gatorsmile) Wenchen Fan (Github: cloud-fan)
3 .Databricks’ Unified Analytics Platform Unifies Data Engineers COLLABORATIVE NOTEBOOKS and Data Scientists Data Engineers Data Scientists Unifies Data and AI DATABRICKS RUNTIME Technologies Powered by Delta SQL Streaming Eliminates infrastructure complexity CLOUD NATIVE SERVICE
4 .Spark SQL A highly scalable and efficient relational processing engine with ease-to-use APIs and mid-query fault tolerance. 4
5 .Run Everywhere Processes, integrates and analyzes the data from diverse data sources (e.g., Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON) 5
6 .The not-so-secret truth... SQL is not only SQL. 6
7 .Spark SQL 7
8 .Not Only SQL Powers and optimizes the other Spark applications and libraries: • Structured streaming for stream processing • MLlib for machine learning • GraphFrame for graph-parallel computation • Your own Spark applications that use SQL, DataFrame and Dataset APIs 8
9 .Lazy Evaluation Optimization happens as late as possible, therefore Spark SQL can optimize across functions and libraries Holistic optimization when using these libraries and SQL/DataFrame/Dataset APIs in the same Spark application. 9
10 .New Features of Spark SQL in Spark 2.3 • PySpark Pandas UDFs [SPARK-22216] [SPARK-21187] • Stable Codegen [SPARK-22510] [SPARK-22692] • Advanced pushdown for partition pruning predicates [SPARK-20331] • Vectorized ORC reader [SPARK-20682] [SPARK-16060] • Vectorized cache reader [SPARK-20822] • Histogram support in cost-based optimizer [SPARK-21975] • Better Hive compatibility [SPARK-20236] [SPARK-17729] [SPARK-4131] • More efficient and extensible data source API V2 10
11 .Spark SQL A compiler from queries to RDDs. 11
12 .Performance Tuning for Optimal Plans Run EXPLAIN Plan. Interpret Plan. Tune Plan. 12
13 .Get the plans by running Explain command/APIs, or the SQL tab in either Spark UI or Spark History Server 13
14 .More statistics from the Job page 14
15 .Declarative APIs 15
16 .Declarative APIs Declare your intentions by • SQL API: ANSI SQL:2003 and HiveQL. • Dataset/DataFrame APIs: richer, language- integrated and user-friendly interfaces 16
17 .Declarative APIs When should I use SQL, DataFrames or Datasets? • The DataFrame API provides untyped relational operations • The Dataset API provides a typed version, at the cost of performance due to heavy reliance on user-defined closures/lambdas. [SPARK-14083] • http://dbricks.co/29xYnqR 17
18 .Metadata Catalog 18
19 .Metadata Catalog • Persistent Hive metastore [Hive 0.12 - Hive 2.3.3] • Session-local temporary view manager • Cross-session global temporary view manager • Session-local function registry 19
20 .Metadata Catalog Session-local function registry • Easy-to-use lambda UDF • Vectorized PySpark Pandas UDF • Native UDAF interface • Support Hive UDF, UDAF and UDTF • Almost 300 built-in SQL functions • Next, SPARK-23899 adds 30+ high-order built-in functions. • Blog for high-order functions: https://dbricks.co/2rR8vAr 20
21 .Performance Tips - Catalog Time costs of partition metadata retrieval: - Upgrade your Hive metastore - Avoid very high cardinality of partition columns - Partition pruning predicates (improved in [SPARK-20331]) 21
22 .Cache Manager 22
23 .Cache Manager • Automatically replace by cached data when plan matching • Cross-session • Dropping/Inserting tables/views invalidates all the caches that depend on it • Lazy evaluation 23
24 .Performance Tips Cache: not always fast if spilled to disk. - Uncache it, if not needed. Next releases: - A new cache mechanism for building the snapshot in cache. Querying stale data. Resolved by names instead of by plans. [SPARK-24461] 24
25 .Optimizer 25
26 .Optimizer Rewrites the query plans using heuristics and cost. • Column pruning • Outer join elimination • Predicate push down• Constraint propagation • Constant folding • Join reordering and many more. 26
27 .Performance Tips Roll your own Optimizer and Planner Rules • In class ExperimentalMethods • var extraOptimizations: Seq[Rule[LogicalPlan]] = Nil • var extraStrategies: Seq[Strategy] = Nil • Examples in the Herman’s talk Deep Dive into Catalyst Optimizer • Join two intervals: http://dbricks.co/2etjIDY 27
28 .Planner 28
29 .Planner • Turn logical plans to physical plans. (what to how) • Pick the best physical plan according to the cost broadcast sort merge Join hash join join OR table1 table2 table1 table2 table1 table2 broadcast join has lower cost if one table can fit in memory 29