Databricks: What We Have Learned by Eating Our Dog Food
2. What We Have Learned By Eating Our Dog Food (Data Engineering at Databricks Using Databricks)
Suraj Acharya, Databricks
Xuan Wang, Databricks
#UnifiedAnalytics #SparkAISummit
3. What's This Talk About?
● Data engineering at Databricks using Databricks
● Sharing our approaches and lessons learned
● Hopefully there are a few things that are helpful to your organization and environment
● Starting a discussion and learning
4. Data Team at Databricks
● Mission: create datasets, tools, and analyses to inform decision making
● Data Eng + Data Science
● Pipelines for product analytics
5. Data Engineering Playing Field
[Diagram: the components of the data engineering playing field: Orchestration and Workflow, Sandbox, CI/CD, Data Quality, Dashboarding/Reporting/BI, Compute (ETL, analytics, ML), Message Log, Data Catalog/Lineage, Data Model, Storage]
https://pages.databricks.com/wb-data-engineering-best-practices.html
6. Data Engineering at Databricks
● ETL jobs written in Spark running on Databricks
● Azure and AWS
● Structured log events (mainly)
● Scale:
○ 100s of tables
○ 100s of billions of records processed every day
○ 100s of jobs x 10s of cloud regions
7-8. Challenges
● Common pitfalls
○ Overwrite table partition
○ Optimize read performance
● Testing
● Deployment
● Configuration management
● Monitoring
9-12. Data Pipelines
[Diagram, built up across four slides: each deployment (Deployment 1, Deployment 2, ...) runs services (service 0, service 1, ...) that send raw logs to a centralized messaging system; hourly and nightly jobs (notebooks, clusters, jobs) turn the raw logs into processed logs; secrets are managed alongside the jobs]
13. Overwrite Table Partition
● Problems
○ Nightly jobs vs. hourly jobs
○ Backfill jobs
● df.write.mode("overwrite")
○ What if the job crashes in the middle?
○ What does a reader see if it starts before the write completes?
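The deck doesn't show the job code itself; a minimal PySpark sketch of the naive pattern and why it is unsafe (the paths are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical backfill: recompute one day of processed logs.
df = spark.read.json("s3://raw-logs/date=2019-05-01/")  # illustrative path

# Naive approach: overwrite the partition directory in place.
# 1) Spark deletes the existing output before writing new files, so a
#    crash mid-write leaves the partition empty or partial.
# 2) A concurrent hourly reader can list the directory between the
#    delete and the final commit and see incomplete data.
(df.write
   .mode("overwrite")
   .parquet("s3://processed-logs/date=2019-05-01/"))
```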
14-16. Overwrite Table Partition
● Solution 1: partition swap
[Diagram, built up across three slides: the job writes the new output alongside the old output, then swaps the table partition to point at the new output; readers keep seeing the old output until the swap, and the old data remains available for rollback]
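The slides don't spell out the swap mechanism; one common way to implement it on a Hive-style partitioned table is to write to a fresh directory and then repoint the partition location, sketched below (the table name and paths are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

date = "2019-05-01"
new_path = f"s3://processed-logs/date={date}_v2/"  # fresh, versioned directory

# 1) Write the new output to a directory no reader knows about yet.
df = spark.read.json(f"s3://raw-logs/date={date}/")
df.write.parquet(new_path)

# 2) Atomically repoint the partition at the new directory.
#    Readers see either the old or the new output, never a mix.
spark.sql(f"""
  ALTER TABLE processed_logs
  PARTITION (date = '{date}')
  SET LOCATION '{new_path}'
""")

# 3) The old directory can be garbage-collected later, once no
#    long-running query still references it.
```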
17. Overwrite Table Partition
● Solution 2: Delta Lake
○ A transactional storage layer for Apache Spark + Parquet
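With Delta Lake the same backfill becomes a single transactional overwrite; a minimal sketch using Delta's replaceWhere option (the paths are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("s3://raw-logs/date=2019-05-01/")  # illustrative input

# Delta commits the rewrite as one atomic transaction: readers see the
# previous table version until the commit, and a crash changes nothing.
(df.write
   .format("delta")
   .mode("overwrite")
   .option("replaceWhere", "date = '2019-05-01'")
   .save("s3://processed-logs/"))
```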
18. Optimize Read Performance
● Problem
○ Efficiently filter billions of records
[Diagram: different users query the processed logs with different filters: User 1 reads metric=login, User 2 reads metric=clusterEvent, and so on]
19. Optimize Read Performance
● Solution 1: partitioning
● df.write.partitionBy("date", "metric")
○ Too many partitions => small files
○ Uneven key distribution => skew
● Choosing partition keys
○ Good: partition by date/hour
○ Bad: partition by customerId
○ Rule of thumb: > 1 GB per partition
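A minimal sketch of the layout this produces (paths hypothetical): partitioning by low-cardinality, evenly distributed keys lets Spark prune entire directories at read time.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("s3://raw-logs/")  # illustrative input

# Partitioning by date and metric produces one directory per key pair:
#   s3://processed-logs/date=2019-05-01/metric=login/...
#   s3://processed-logs/date=2019-05-01/metric=clusterEvent/...
df.write.partitionBy("date", "metric").parquet("s3://processed-logs/")

# A filter on the partition columns prunes whole directories at read
# time instead of scanning every file in the table.
logins = (spark.read.parquet("s3://processed-logs/")
               .where("date = '2019-05-01' AND metric = 'login'"))
```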
20-23. Optimize Read Performance
● Solution 2: OPTIMIZE and data skipping
● Example, built up across four slides:
○ Before OPTIMIZE: many small files
○ After OPTIMIZE: compacted into File1, File2, File3
○ Compute statistics per file: File1: min='a', max='g'; File2: min='g', max='n'; File3: min='o', max='z'
○ SELECT * FROM logs WHERE metric = 'login'
○ Since 'login' falls only within File2's [min, max] range, File1 and File3 are skipped entirely
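On Databricks with Delta Lake, the compaction plus statistics step can be triggered as below; ZORDER is optional but clusters the data so that per-file min/max ranges overlap less (the table name is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact small files; ZORDER co-locates rows with similar metric
# values so each file covers a narrow [min, max] range of the column.
spark.sql("OPTIMIZE logs ZORDER BY (metric)")

# Delta tracks per-file min/max statistics, so this query only opens
# files whose [min, max] range can contain 'login'.
hits = spark.sql("SELECT * FROM logs WHERE metric = 'login'")
```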
24. Challenges
● Common pitfalls
● Testing
○ Unit tests & integration tests
● Deployment
● Configuration management
● Monitoring
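The deck doesn't show their test harness; a minimal sketch of what a PySpark unit test for an ETL transformation could look like (the transformation and all names are hypothetical):

```python
import unittest
from pyspark.sql import SparkSession

def parse_events(df):
    """Hypothetical transformation under test: keep only login events."""
    return df.where(df.metric == "login").select("userId", "timestamp")

class ParseEventsTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # A small local Spark session is enough for unit tests.
        cls.spark = SparkSession.builder.master("local[2]").getOrCreate()

    def test_keeps_only_login_events(self):
        df = self.spark.createDataFrame(
            [("u1", 1, "login"), ("u2", 2, "clusterEvent")],
            ["userId", "timestamp", "metric"],
        )
        result = parse_events(df).collect()
        self.assertEqual([r.userId for r in result], ["u1"])

if __name__ == "__main__":
    unittest.main()
```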
25-26. Dev & Deployment Workflow
[Diagram, built up across two slides: the dev & deployment workflow, covering Testing, Deployment, Configuration Management, and Monitoring]
27-29. A Simple ETL Example
[Diagram, repeated across three slides: JSON messages from the message log flow through an ETL job and land as Parquet]
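A minimal sketch of that JSON-to-Parquet job, with hypothetical paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read raw JSON events from the message log landing zone, then write
# them out as columnar Parquet for downstream analytics.
raw = spark.read.json("s3://message-log/raw/")            # illustrative path
raw.write.mode("append").parquet("s3://processed-logs/")  # illustrative path
```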