Which Data Broke My Code? Inspecting Spark Transformations
1 .Which Data Broke My Code? Inspecting Spark Transformations Vinod K. Nair, Director of Product Management @ Pepperdata #DevSAIS12
2 .Talk Outline
• Introduction
• Problem: ‘laziness’ makes debugging hard
• Solution: interactive inspection of RDDs
• Demo
• Q&A
3 .Introduction to Pepperdata
Application Performance Management (APM) for Spark (& Hadoop)
6 .Problem: ‘laziness’ makes debugging hard
RDD data is unavailable until an ‘action’ triggers execution
7 .Transformations are invisible
RDDs support two types of operations:
1. transformations, which create a new dataset from an existing one
2. actions, which return a value to the driver program after running a computation on the dataset
Transformations in Spark are lazy. They are only computed when an action requires a result to be returned to the driver program.
- https://spark.apache.org/docs/latest/rdd-programming-guide.html
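The laziness described above can be observed directly. The following is a minimal sketch, assuming a local Spark installation; the object name, the accumulator name, and the `local[2]` master are illustrative choices, not part of the talk. An accumulator counts how often the `map` function runs, before and after an action.

```scala
import org.apache.spark.sql.SparkSession

object LazyTransformations {
  /** Returns (map calls before the action, records counted, map calls after the action). */
  def run(): (Long, Long, Long) = {
    // Local master and app name are illustrative assumptions.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("lazy-transformations")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      val mapCalls = sc.longAccumulator("mapCalls")

      // Transformation: only recorded in the lineage, nothing executes yet.
      val doubled = sc.parallelize(1 to 100).map { n =>
        mapCalls.add(1L)
        n * 2
      }
      val callsBefore: Long = mapCalls.value

      // Action: triggers a job, so the map function finally runs.
      val total = doubled.count()
      val callsAfter: Long = mapCalls.value

      (callsBefore, total, callsAfter)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = {
    val (before, total, after) = run()
    println(s"map calls before action: $before")
    println(s"records counted: $total")
    println(s"map calls after action: $after")
  }
}
```

Accumulator updates made inside transformations can be applied more than once if Spark re-executes a task, so the counter is a demonstration aid here rather than a reliable metric.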
8 .filteredData.take(10).foreach(println)
9 .Solution available today
• Sprinkle your code with print statements
• ‘Hopefully’ catch the transformation that is causing the problem
• If you don’t catch it, repeat the process
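The print-statement approach can be sketched as follows. This is an illustration only: `peek` is a hypothetical helper, and the input lines (including the malformed `c,oops` record) are made up. Each `peek` call runs `take`, which is an action, so every inspection point launches its own Spark job and requires editing the application code.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object PrintDebugging {
  // Hypothetical helper: prints the first `n` records of an RDD and returns it unchanged.
  // take(n) is an action, so every call launches a separate job.
  def peek[T](label: String, rdd: RDD[T], n: Int = 10): RDD[T] = {
    println(s"--- $label ---")
    rdd.take(n).foreach(println)
    rdd
  }

  def run(): Seq[(String, Int)] = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("print-debugging")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      // Made-up input; the last record has a non-numeric value.
      val raw = peek("raw", sc.parallelize(Seq("a,1", "b,2", "a,3", "c,oops")))
      val pairs = peek("pairs", raw.map { line =>
        val Array(key, value) = line.split(",")
        (key, value)
      })
      // Printing after this step shows that the malformed record was dropped here.
      val numeric = peek("numeric", pairs.filter { case (_, v) =>
        v.nonEmpty && v.forall(_.isDigit)
      })
      val totals = numeric.map { case (k, v) => (k, v.toInt) }.reduceByKey(_ + _)
      totals.collect().sortBy(_._1).toSeq
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```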
10 .Our solution: interactive inspection of data in flight
Trigger an ‘action’ to enable inspection of any RDD in the DAG
11 .Solution requirements
• No user code changes required
• Work with any standard Spark distribution
• Provide a familiar interactive debugger interface
12 .Solution overview
[Architecture diagram: Driver (User Code, Spark Context, RDD Graph, DAG Scheduler, Task Scheduler); Cluster Manager; Workers running Tasks]
13 .Solution overview
[Architecture diagram: the same Driver, Cluster Manager, and Worker layout, with added components labeled RDD Inspector, RDD Metadata, UI, REST API, and CLI]
14 .New Spark job to display RDD[10]
Original stage with RDD[10] as an intermediate transformation
Debug stage with RDD[10] as output
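The idea of a separate job whose output is an intermediate RDD can be approximated with public RDD APIs. The sketch below is not Pepperdata's implementation; it assumes the target RDD is identified by its `id`, locates it by walking `dependencies` from the final RDD, and runs `take` on it. The names `findInLineage` and `firstRecords` and the sample words are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object InspectIntermediateRdd {
  // Depth-first search of the lineage for the RDD with the given id.
  def findInLineage(root: RDD[_], targetId: Int): Option[RDD[_]] =
    if (root.id == targetId) Some(root)
    else root.dependencies.iterator
      .map(dep => findInLineage(dep.rdd, targetId))
      .collectFirst { case Some(found) => found }

  // take(n) launches a new job that computes only what the chosen RDD needs.
  def firstRecords(rdd: RDD[_], n: Int = 10): Seq[String] =
    rdd.take(n).toSeq.map(_.toString)

  def run(): Option[Seq[String]] = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("inspect-intermediate-rdd")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      val words = sc.parallelize(Seq("spark", "rdd", "a", "dag", "rdd"))
      val filtered = words.filter(_.length > 1) // the intermediate RDD to inspect
      val counts = filtered.map(w => (w, 1)).reduceByKey(_ + _)

      // A tool would receive this id from a UI or CLI; here it is captured directly.
      val targetId = filtered.id
      findInLineage(counts, targetId).map(rdd => firstRecords(rdd))
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit =
    println(run().getOrElse(Seq("RDD not found in lineage")).mkString("\n"))
}
```

Because the debug job ends at the intermediate RDD, the shuffle and anything downstream of it are not computed for the inspection.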
15 .CLI command to inspect RDD[10]
[Screenshot of the CLI; callout label: Action]
16 .Demo
Interactive ‘debugger’ experience to inspect any RDD in the DAG
17 .Analyze CDC’s census data
• CDC’s 500 cities project (data.gov: 500 cities local data for better health)
• Input: Major chronic diseases by city
• Output: Diabetes in adults by US region
18 .Distribution of diabetics by region
[Chart with four regional totals: 3,448,595; 1,532,856; 2,096,619; 3,591,010]
19 .
Region        Adult Population with Diabetes
WEST          3,448,595
MID WEST      1,532,856
NORTH EAST    2,096,619
SOUTH         3,407,528
UNKNOWN       183,482
20 .RDD transformations through the app
RDD [5] (input record): 2014,CA,California,Alameda,City,BRFSS,Health Outcomes,0600562,Diagnosed diabetes among adults aged >=18 Years,%,AgeAdjPrv,Age-adjusted prevalence,8.1,7.9,8.2,,,73812,"(37.7650849031, -122.266489842)",HLTHOUT,DIABETES,0600562,,Diabetes …
Filter (“DIABETES”) & Map (State, Impacted pop.)
RDD [11] (State, Impacted pop.): Alabama 205,764; Alabama 90,468; Alaska 291,826; Arizona 76,238; …
Map (Region, Impacted pop.)
RDD [12] (Region, Impacted pop.): South 27,161; South 9,228; West 21,596; West 6,633; …
Reduce (Region)
RDD [16] (Region, Impacted pop.): West; Mid West; North East; South
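The filter, map, map, reduce chain on this slide can be sketched as below. This is an illustration under stated assumptions, not the demo application's code: `Record` is a simplified stand-in for a parsed CSV row, the state-to-region lookup is hypothetical and deliberately incomplete, and the misspelled state and the `OBESITY` row are made-up inputs. The impacted-population values are borrowed from the slide's RDD [12] table. Routing unmatched states to `UNKNOWN` is one way a row like the UNKNOWN entry in the results table could arise.

```scala
import org.apache.spark.sql.SparkSession

object DiabetesByRegion {
  // Simplified stand-in for a parsed CSV row; the real file has many more columns.
  case class Record(state: String, measureId: String, impactedPop: Long)

  // Hypothetical, deliberately incomplete state-to-region lookup.
  val regionOf: Map[String, String] =
    Map("Alabama" -> "South", "Alaska" -> "West", "Arizona" -> "West")

  def run(): Map[String, Long] = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("diabetes-by-region")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      val lookup = regionOf // local copy captured by the closures below

      val records = sc.parallelize(Seq(
        Record("Alabama", "DIABETES", 27161L),
        Record("Alabama", "DIABETES", 9228L),
        Record("Alaska", "DIABETES", 21596L),
        Record("Arizona", "DIABETES", 6633L),
        Record("Alabama", "OBESITY", 50L), // made-up row, removed by the filter
        Record("Alabma", "DIABETES", 10L)  // made-up misspelling, not in the lookup
      ))

      // Filter ("DIABETES") & Map (State, Impacted pop.)
      val byState = records
        .filter(_.measureId == "DIABETES")
        .map(r => (r.state, r.impactedPop))

      // Map (Region, Impacted pop.); unmatched states fall into UNKNOWN.
      val byRegion = byState.map { case (state, pop) =>
        (lookup.getOrElse(state, "UNKNOWN"), pop)
      }

      // Reduce (Region)
      byRegion.reduceByKey(_ + _).collect().toMap
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit =
    run().toSeq.sortBy(_._1).foreach { case (region, pop) => println(s"$region $pop") }
}
```

Inspecting the intermediate (State, Impacted pop.) data is what would reveal the misspelled state here; the final totals alone only show that an UNKNOWN bucket exists.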
21 .Spark Debugger UI
[Screenshot callouts: URL; Spark Web UI; List of RDDs; RDD Filter; ‘Breakpoint’ on any transformation; First 10 records matching filter]
22 .Spark Debugger Demo
23 .To recap, you can…
• View data sets as they are transformed in your Spark App
– no code changes are required
– it works with any Spark distribution, and
– it uses a familiar debugger interface to ‘set’ breakpoints and view RDDs in flight
24 .What’s next?
Roadmap for Pepperdata’s Spark Debugger
25 .Areas of focus going forward
• UX Improvements
• Attach to a running app (streaming use case)
• Pause a job on hitting a condition
• Spark SQL support
To learn more, visit the booth (#407)