Parallelizing with Apache Spark in Unexpected Ways
1 .WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
2 .Parallelizing With Apache Spark In Unexpected Ways Anna Holschuh, Target #UnifiedAnalytics #SparkAISummit
3 .What This Talk is About • Tips for parallelizing with Spark • Lots of (Scala) code examples • Focus on Scala programming constructs
4 .Who am I • Lead Data Engineer/Scientist at Target since 2016 • Deep love of all things Target • Other Spark Summit talks: o 2018: Extending Apache Spark APIs Without Going Near Spark Source Or A Compiler o 2019: Lessons In Linear Algebra At Scale With Apache Spark: Let's Make The Sparse Details A Bit More Dense
5 .Agenda • Introduction • Parallel Job Submission and Schedulers • Partitioning Strategies • Distributing More Than Just Data
7 .Introduction > Hello, Spark!_ Key terms: Driver, Application, Partition, Executor, Job, Action, Dataset, DataFrame, Stage, Transformation, RDD, Task, Shuffle
8 .Agenda • Introduction • Parallel Job Submission and Schedulers • Partitioning Strategies • Distributing More Than Just Data
9 .Parallel Job Submission and Schedulers Let's do some data exploration • We have a system of Authors, Articles, and Comments on those Articles • We would like to do some simple data exploration as part of a batch job • We execute this code in a built jar through spark-submit on a cluster with 100 executors, 5 executor cores, 10 GB of driver memory, and 10 GB per executor • What happens in Spark when we kick off the exploration?
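The cluster setup described on this slide could be expressed as a spark-submit invocation along these lines (a sketch; the class and jar names are hypothetical placeholders, not from the talk):

```shell
# Sketch of the described cluster configuration.
# Class and jar names are hypothetical placeholders.
spark-submit \
  --class com.example.ExplorationJob \
  --num-executors 100 \
  --executor-cores 5 \
  --driver-memory 10g \
  --executor-memory 10g \
  exploration-job.jar
```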
10 .Parallel Job Submission and Schedulers The Execution Starts • One job is kicked off at a time. • We asked a few independent questions in our exploration. Why can't they be running at the same time?
11 .Parallel Job Submission and Schedulers The Execution Completes • All of our questions run as separate jobs. • Examining the timing demonstrates that these jobs run serially.
12 .Parallel Job Submission and Schedulers One more sanity check • All of our questions, running serially.
13 .Parallel Job Submission and Schedulers Can we potentially speed up our exploration? • Spark turns our questions into 3 Jobs • The Jobs run serially • We notice that some of our questions are independent. Can they be run at the same time? • The answer is yes. We can leverage Scala Concurrency features and the Spark Scheduler to achieve this…
14 .Parallel Job Submission and Schedulers Scala Futures • A placeholder for a value that may not exist. • Asynchronous • Requires an ExecutionContext • Use Await to block • Extremely flexible syntax. Supports for-comprehension chaining to manage dependencies.
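A minimal sketch of the Future features named above, in pure Scala with no Spark; the work inside the Future block is a stand-in for where a Spark action would go:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// A Future is a placeholder for a value that may not exist yet.
// Creating one requires an (often implicit) ExecutionContext to run on.
val answer: Future[Int] = Future {
  // In a real Spark job this block would hold an action, e.g. ds.count()
  21 * 2
}

// Await blocks the current thread until the Future completes.
val result: Int = Await.result(answer, 10.seconds)
```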
15 .Parallel Job Submission and Schedulers Let's rework our original code using Scala Futures to parallelize Job Submission • We pull in a reference to an implicit ExecutionContext • We wrap each of our questions in a Future block to be run asynchronously • We block on our asynchronous questions all being completed • (Not seen) We properly shut down the ExecutorService when the job is complete
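The rework described above can be sketched like this. The three "question" functions are hypothetical stand-ins for independent Spark actions (in the real job they would be calls like `articles.count()`), and the fixed thread pool plays the role of the ExecutionContext backing job submission:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Dedicated pool for submitting jobs from the driver; each thread
// would fire off a Spark action in the real job.
val pool = Executors.newFixedThreadPool(3)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

// Hypothetical stand-ins for independent Spark actions.
def countArticles(): Long = 1000L
def countAuthors(): Long  = 50L
def countComments(): Long = 20000L

// Wrap each question in a Future so the three jobs are submitted concurrently.
val questions: Future[Seq[Long]] = Future.sequence(Seq(
  Future(countArticles()),
  Future(countAuthors()),
  Future(countComments())
))

// Block until all questions complete, then shut the pool down
// (the "not seen" cleanup step the slide mentions).
val counts: Seq[Long] = Await.result(questions, 1.minute)
pool.shutdown()
```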
16 .Parallel Job Submission and Schedulers Our questions are now asked concurrently • All of our questions run as separate jobs. • Examining the timing demonstrates that these jobs are now running concurrently.
17 .Parallel Job Submission and Schedulers One more sanity check • All of our questions, running concurrently.
18 .Parallel Job Submission and Schedulers A note about Spark Schedulers • The default scheduler is FIFO • Starting in Spark 0.8, Fair sharing became available, aka the Fair Scheduler • Fair Scheduling makes resources available to all queued Jobs • Turn on Fair Scheduling through SparkSession config and supporting allocation pool config • Threads that submit Spark Jobs should specify what scheduler pool to use if it's not the default Reference: https://spark.apache.org/docs/2.2.0/job-scheduling.html
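Per the linked job-scheduling docs, enabling fair scheduling takes two pieces of config: set `spark.scheduler.mode` to `FAIR` (and optionally `spark.scheduler.allocation.file`) on the SparkSession, and define pools in an allocation file. A sketch of such a file follows; the pool name is illustrative:

```xml
<!-- fairscheduler.xml: illustrative pool definition -->
<allocations>
  <pool name="exploration">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>
```

A submitting thread then selects its pool with `sc.setLocalProperty("spark.scheduler.pool", "exploration")`, as described in the docs.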
19 .Parallel Job Submission and Schedulers The Fair Scheduler is enabled
20 .Parallel Job Submission and Schedulers Creating a DAG of Futures on the Driver • Scala Futures syntax enables for-comprehensions to represent dependencies in asynchronous operations • Spark code can be structured with Futures to represent a DAG of work on the Driver • When reworking all code into Futures, there will be some redundancy with Spark's own role in planning and optimizing; Spark handles all of this without issue
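The DAG-of-Futures idea can be sketched in pure Scala. Each def is a hypothetical stand-in for a piece of Spark work (real code would return Datasets); the two loads are started before the for-comprehension so they run concurrently, while the combining step waits on both:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical stand-ins for independent Spark stages.
def loadArticles(): Future[Seq[Int]] = Future(Seq(1, 2, 3))
def loadAuthors():  Future[Seq[Int]] = Future(Seq(1, 2))

// Kick off both loads first so they run concurrently; a for-comprehension
// over already-started Futures only expresses the dependency, not ordering.
val articlesF = loadArticles()
val authorsF  = loadAuthors()

val joined: Future[Int] = for {
  articles <- articlesF
  authors  <- authorsF
} yield articles.size + authors.size // stand-in for a dependent join/aggregation

val total = Await.result(joined, 10.seconds)
```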
21 .Parallel Job Submission and Schedulers Takeaways • Actions trigger Spark to do things (i.e. create Jobs) • Spark can certainly handle running multiple Jobs at once, you just have to tell it to • This can be accomplished by multithreading the driver. In Scala, this can be accomplished using Futures. • The way tasks are executed when multiple jobs are running at once can be further configured through either Spark's FIFO or Fair Scheduler with configured supporting pools. Why use this strategy? • To maximize resource utilization in your cluster • To maximize the concurrency potential of your job (and thus speed/efficiency) • Fair Scheduling pools can support different notions of priority of work in jobs • Fair Scheduling pools can support multi-user environments to enable more even resource allocation in a shared cluster
22 .Agenda • Introduction • Parallel Job Submission and Schedulers • Partitioning Strategies • Distributing More Than Just Data
23 .Partitioning Strategies A first experience with partitioning
24 .Partitioning Strategies Getting started with partitioning • .repartition() vs .coalesce() • Custom partitioning is supported with the RDD API only (specifically through implicitly added PairRDDFunctions) • Spark supports the HashPartitioner and RangePartitioner out of the box • One can create custom partitioners by extending Partitioner to enable custom strategies in grouping data
25 .Partitioning Strategies How can non-standard partitioning be useful? #1 : Collocating data for joins • We are joining datasets of Articles and Authors together by the Author's id. • When we pull the raw Article dataset, author ids are likely to be distributed somewhat randomly throughout partitions. • Joins can be considered wide transformations depending on underlying data and could result in full shuffles. • We can cut down on the impact of the shuffle stage by collocating data by the id to join on within partitions so there is less cross chatter during this phase.
26 .Partitioning Strategies #1: Collocating data for joins [Diagram: Articles and Authors records (author_ids 1–5) scattered randomly across partitions before partitioning; after partitioning on author_id, the matching Articles and Authors records sit in the same partitions.]
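The co-location idea in the diagram can be simulated without Spark: if both datasets are partitioned with the same key-based function, matching keys land in the same partition index, so the join for a given key needs no cross-partition traffic. A sketch in plain Scala, with hypothetical record shapes:

```scala
// Route (key, value) records to partition indices the way a
// key-based (hash) partitioner would.
def partitionByKey[T](records: Seq[(Int, T)], numPartitions: Int): Map[Int, Seq[(Int, T)]] =
  records.groupBy { case (key, _) => math.abs(key.hashCode) % numPartitions }

val articles = Seq((1, "article-a"), (2, "article-b"), (5, "article-c"))
val authors  = Seq((1, "author-1"), (2, "author-2"), (5, "author-5"))

val articleParts = partitionByKey(articles, 4)
val authorParts  = partitionByKey(authors, 4)

// Every author_id lands at the same partition index in both datasets,
// so the join for that id stays local to one partition.
val colocated = Seq(1, 2, 5).forall { id =>
  val idx = math.abs(id.hashCode) % 4
  articleParts(idx).exists(_._1 == id) && authorParts(idx).exists(_._1 == id)
}
```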
27 .Partitioning Strategies How can non-standard partitioning be useful? #2 : Grouping data to operate on partitions as a whole • We need to calculate an Author Summary report that needs to have access to all Articles for an Author to generate meaningful overall metrics • We could leverage .map and .reduceByKey to combine Articles for analysis in a pairwise fashion or by gathering groups for processing • Operating on a whole partition grouped by an Author also accomplishes this goal
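With Articles for an Author grouped into one partition, the summary can be computed from an iterator over the whole partition, the way `.mapPartitions` would hand it to you. A sketch with plain Scala iterators; the record shapes are hypothetical:

```scala
case class Article(authorId: Int, words: Int)
case class AuthorSummary(authorId: Int, articleCount: Int, totalWords: Int)

// Mirrors the shape of a function passed to .mapPartitions: it receives
// an Iterator over one partition and returns an Iterator of results.
def summarizePartition(articles: Iterator[Article]): Iterator[AuthorSummary] = {
  val byAuthor = articles.toSeq.groupBy(_.authorId)
  byAuthor.map { case (id, as) =>
    AuthorSummary(id, as.size, as.map(_.words).sum)
  }.iterator
}

// One partition holding all of a single author's articles.
val partition = Iterator(Article(1, 500), Article(1, 750), Article(1, 250))
val summaries = summarizePartition(partition).toSeq
```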
28 .Partitioning Strategies Implementing a Custom Partitioner
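The code from this slide is not preserved in the transcript; the following is a minimal sketch of what such a partitioner might look like. In real Spark code the class would extend `org.apache.spark.Partitioner` (whose contract is exactly the two members shown) and be passed to `partitionBy` on a pair RDD; it is written here as a standalone class so the routing logic is self-contained, and the author-id keying is illustrative:

```scala
// In Spark: class AuthorIdPartitioner(...) extends org.apache.spark.Partitioner.
// The Partitioner contract is these two members: numPartitions and getPartition.
class AuthorIdPartitioner(val numPartitions: Int) {
  // Route every key (an author id) deterministically to one partition,
  // so that all Articles for a given author are collocated.
  def getPartition(key: Any): Int = key match {
    case authorId: Int => math.abs(authorId.hashCode) % numPartitions
    case other         => math.abs(other.hashCode) % numPartitions
  }
}

val partitioner = new AuthorIdPartitioner(8)
val p1 = partitioner.getPartition(42)
val p2 = partitioner.getPartition(42) // deterministic: same key, same partition
```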
29 .Partitioning Strategies Takeaways • Partitioning can help even out data skew for more reliable and performant processing. • The RDD API supports more fine-grained partitioning with Hash and Range Partitioners. • One can implement a custom partitioner to have even more control over how data is grouped, which creates opportunity for more performant joins and operations on partitions as a whole. • There is expense involved in repartitioning that has to be balanced against the cost of an operation on less organized data.