Tangram: Distributed Scheduling Framework for Apache Spark at Facebook
2. Tangram: Distributed Scheduling Framework for Apache Spark at Facebook
Rui Jian, Hao Lin, Facebook Inc.
rjian@fb.com, hlin@fb.com
3. About Us
• Rui Jian
  – Software Engineer at Facebook (Data Warehouse & Graph Indexing)
  – Master of Computer Science (Shanghai Jiao Tong University)
• Hao Lin
  – Research Scientist at Facebook (Data Warehouse Batch Scheduling)
  – PhD in Parallel Computing (Purdue ECE)
4. Agenda
• Overview
• Tangram Architecture
• Scheduling Policies & Resource Allocation
• Future Work
5. What is Tangram?
The scheduling platform for
• reliably running various batch workloads
• with efficient heterogeneous resource management
• at scale
6. Tangram Scheduling Targets
• Single jobs: ad hoc/periodic
• Batch jobs: ad hoc/periodic, malleable
• Gang jobs: ad hoc/periodic, rigid
• Long-running jobs: steady and regular, e.g. online training
7. Why Tangram?
• Varied workload characteristics
  – ML
  – Apache Spark
  – Apache Giraph
  – Single jobs
• Customized scheduling policies
• Scalability
  – Fleet size: hundreds of thousands of worker nodes
  – Job scheduling throughput: hundreds of millions of jobs per day
8. Overview
• What is Tangram?
[Architecture diagram: clients (ML, SQL query, Admin DB, Job Manager) interact with the Resource Manager, which allocates resources to workloads such as Spark Master, Gang Job, Giraph, Single Job, and ML Elastic Scheduler; containers run on Agents on worker hosts.]
9. Client Library
• Job management
• Request/release resources
• Resource grant
• Preemption notification
• Launch containers
• Container status change event
[Diagram: the Application uses the Tangram client to carry out these six interactions with the Resource Manager and Agents.]
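The client library sits between the application and the Resource Manager/Agents and drives the six interactions above. A minimal Scala sketch of what such an interface could look like; all type and method names here are illustrative, not Tangram's actual (non-public) API:

```scala
// Hypothetical sketch of a Tangram-style client interface; names are illustrative only.
case class ResourceRequest(cpuMilliCores: Long, memoryBytes: Long, count: Int)
case class ResourceGrant(grantId: String, host: String, cpuMilliCores: Long, memoryBytes: Long)

sealed trait ContainerEvent
case class ContainerStarted(containerId: String)               extends ContainerEvent
case class ContainerFinished(containerId: String, code: Int)   extends ContainerEvent
case class PreemptionNotice(grantId: String, deadlineMs: Long) extends ContainerEvent

trait TangramClient {
  // Ask the Resource Manager for resources; grants arrive asynchronously via onEvent.
  def requestResources(req: ResourceRequest): Unit
  def releaseResources(grantIds: Seq[String]): Unit

  // Use a grant to launch a container on the granted agent; returns a container id.
  def launchContainer(grant: ResourceGrant, command: Seq[String]): String

  // The application registers a callback for grant, preemption, and status-change events.
  def onEvent(handler: ContainerEvent => Unit): Unit
}
```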
10. Agent
• Report schedulable resources and runtime usage
• Health check reports
• Detect labels
• Launch/kill containers
• Container recovery
• Resource isolation with cgroup v2
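Resource isolation with cgroup v2 amounts to writing limits into the container's cgroup interface files. A hedged sketch, assuming the agent owns a cgroup subtree per container; the helper names are made up, and only `cpu.max` and `memory.max` are real cgroup v2 files:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object CgroupV2Isolation {
  // Write a cgroup v2 interface file (e.g. cpu.max, memory.max) under a container's cgroup directory.
  private def writeControl(cgroupDir: Path, file: String, value: String): Unit =
    Files.write(cgroupDir.resolve(file), value.getBytes(StandardCharsets.UTF_8))

  // cpu.max takes "<quota> <period>" in microseconds:
  // 3000 milli-cores over a 100 ms period becomes "300000 100000".
  def applyLimits(cgroupDir: Path, cpuMilliCores: Long, memoryBytes: Long): Unit = {
    val periodUs = 100000L
    val quotaUs  = cpuMilliCores * periodUs / 1000
    writeControl(cgroupDir, "cpu.max", s"$quotaUs $periodUs")
    writeControl(cgroupDir, "memory.max", memoryBytes.toString)
  }
}
```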
11. Failure Recovery
• Agent failure
  – Scan the recovery directory and recover the running containers
• RM failure
  – Both agents and clients hold off communication to the RM until the new master shows up
  – Clients sync session info to the new master to help it rebuild its state
  – Agents register themselves with the new master
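One way the agent-side recovery described above can work is to persist a small record per launched container and replay it after a restart. A sketch under that assumption; the file layout (one "containerId pid" record per file) is hypothetical:

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

object AgentRecovery {
  case class RecoveredContainer(containerId: String, pid: Long)

  // Scan the recovery directory; assume each file holds "<containerId> <pid>" written at launch time.
  def recover(recoveryDir: Path): Seq[RecoveredContainer] =
    Files.list(recoveryDir).iterator().asScala.toSeq.flatMap { file =>
      val Array(id, pid) = Files.readString(file).trim.split("\\s+")
      val handle = ProcessHandle.of(pid.toLong)
      if (handle.isPresent && handle.get.isAlive) {
        Some(RecoveredContainer(id, pid.toLong)) // still running: re-adopt the container
      } else {
        Files.delete(file)                       // stale record: clean it up
        None
      }
    }
}
```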
12. Scheduling Policies
• Hierarchical queue structure
• Jobs are queued on leaf queues
• Queue configs:
  – min/max resources
  – Policy:
    • FIFO
    • Dominant Resource Fairness (DRF)
    • User fairness
    • Global
    • …
[Diagram: example hierarchy where the root splits 80%/20% between "ads" and "feed" under DRF; "ads" splits 50%/50% into "pipelines" (FIFO) and "interactive" (user fairness), with user1 and user2 each getting 50% and FIFO job ordering on the leaves.]
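For the DRF policy above, each queue's dominant share is the largest fraction of any single resource it uses out of the cluster total, and the next allocation goes to the entity with the smallest dominant share. A small illustrative sketch that ignores weights, min/max limits, and the queue hierarchy:

```scala
object DrfOrdering {
  case class Usage(cpuMilliCores: Long, memoryBytes: Long)

  // Dominant share = the larger of the CPU fraction and memory fraction of the cluster total.
  def dominantShare(used: Usage, total: Usage): Double =
    math.max(used.cpuMilliCores.toDouble / total.cpuMilliCores,
             used.memoryBytes.toDouble / total.memoryBytes)

  // DRF gives the next allocation to the entity with the smallest dominant share.
  def pickNext[A](entities: Seq[(A, Usage)], total: Usage): A =
    entities.minBy { case (_, used) => dominantShare(used, total) }._1

  def main(args: Array[String]): Unit = {
    val total  = Usage(cpuMilliCores = 1000000, memoryBytes = 4096L << 30) // 1000 cores, 4 TiB
    val queues = Seq("ads"  -> Usage(300000, 512L << 30),                  // dominant share 0.30 (CPU)
                     "feed" -> Usage(100000, 1024L << 30))                 // dominant share 0.25 (memory)
    println(pickNext(queues, total))                                       // prints "feed"
  }
}
```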
13. Scheduling Policies
• Jobs ordered by priority, then submission time, within a queue
• Gang jobs are first class in scheduling and resource allocation
• Lookahead scheduling for better throughput and utilization
• Job starvation prevention
[Diagram: a queue containing Gang 200, Gang 20, Single, Gang 4, and Single jobs, illustrating lookahead past a large gang at the head.]
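A sketch of intra-queue ordering and a naive form of lookahead: jobs sort by priority and then submission time, and the scheduler walks past a gang that cannot currently be placed in full to admit smaller jobs behind it. Field names are illustrative, and starvation prevention is only hinted at in a comment:

```scala
object JobOrdering {
  case class Job(id: String, priority: Int, submitTimeMs: Long, gangSize: Int)

  // Higher priority first; ties broken by earlier submission time.
  implicit val byPriorityThenTime: Ordering[Job] =
    Ordering.by[Job, (Int, Long)](j => (-j.priority, j.submitTimeMs))

  // Naive lookahead: walk the ordered queue and greedily admit jobs whose gang fits in the
  // remaining slots, so a large gang at the head does not leave the cluster idle. A real
  // scheduler bounds how far it looks ahead (or reserves capacity) to prevent starvation.
  def lookahead(queue: Seq[Job], freeSlots: Int): Seq[Job] =
    queue.sorted.foldLeft((Vector.empty[Job], freeSlots)) { case ((picked, free), job) =>
      if (job.gangSize <= free) (picked :+ job, free - job.gangSize) else (picked, free)
    }._1
}
```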
14. Resource Allocation
• Fine-grained resource specification:
  – {cpuMilliCores: 3000, memoryBytes: 200GB}
• Constraints:
  – "dataCenter = dc1 & type in [1,2] & kernelVersion > 4.10"
• Job affinity:
  – inSameDatacenter
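The resource specification and constraint expression above compose into a per-job request. A sketch of how such a request might be modeled and evaluated against host labels; the constraint AST is an assumption inferred from the example expression, not Tangram's actual representation:

```scala
object ResourceSpec {
  case class Resources(cpuMilliCores: Long, memoryBytes: Long)

  // A tiny constraint AST mirroring expressions like
  // "dataCenter = dc1 & type in [1,2] & kernelVersion > 4.10".
  sealed trait Constraint { def matches(labels: Map[String, String]): Boolean }
  case class Eq(key: String, value: String) extends Constraint {
    def matches(labels: Map[String, String]) = labels.get(key).contains(value)
  }
  case class In(key: String, values: Set[String]) extends Constraint {
    def matches(labels: Map[String, String]) = labels.get(key).exists(values.contains)
  }
  case class Gt(key: String, value: Double) extends Constraint {
    // Numeric comparison only; real kernel-version ordering would need semantic parsing.
    def matches(labels: Map[String, String]) = labels.get(key).exists(_.toDouble > value)
  }
  case class And(clauses: Constraint*) extends Constraint {
    def matches(labels: Map[String, String]) = clauses.forall(_.matches(labels))
  }

  case class Request(resources: Resources, constraint: Constraint)

  // The slide's example: {cpuMilliCores: 3000, memoryBytes: 200GB} plus the constraint string.
  val example: Request = Request(
    Resources(cpuMilliCores = 3000, memoryBytes = 200L << 30),
    And(Eq("dataCenter", "dc1"), In("type", Set("1", "2")), Gt("kernelVersion", 4.10)))
}
```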
15. Resource Allocation
• Prefetched Host Cache
  – Bypass the host filtering and scoring steps
  – Speed up the allocation process
• Host Filtering
  – Hard & soft constraints
  – Resource constraints
  – Label constraints
  – Job affinity
• Host Scoring and Ordering
  – Packing efficiency
  – Host healthiness
  – Data locality
• Commit Allocation
  – Bookkeeping of host resources
  – Update cluster & queue parameters
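Putting the pipeline together: filter hosts by hard constraints, score and order the survivors, then commit the winner to bookkeeping. A simplified sketch with a placeholder scoring term (packing only; real scoring would also weigh host healthiness and data locality):

```scala
object AllocationPipeline {
  case class Host(name: String, freeCpu: Long, freeMem: Long, labels: Map[String, String], healthy: Boolean)
  case class Ask(cpu: Long, mem: Long, requiredLabels: Map[String, String])

  // Host filtering: hard constraints -- health, label constraints, and enough free resources.
  def filter(hosts: Seq[Host], ask: Ask): Seq[Host] =
    hosts.filter { h =>
      h.healthy && h.freeCpu >= ask.cpu && h.freeMem >= ask.mem &&
        ask.requiredLabels.forall { case (k, v) => h.labels.get(k).contains(v) }
    }

  // Host scoring: prefer tight packing (less leftover capacity scores lower, i.e. better).
  def score(h: Host, ask: Ask): Double =
    (h.freeCpu - ask.cpu).toDouble / h.freeCpu + (h.freeMem - ask.mem).toDouble / h.freeMem

  // Commit allocation: pick the best-scoring host and debit its free resources (bookkeeping).
  // Returns (chosen host, host after bookkeeping update), if any host survived filtering.
  def allocate(hosts: Seq[Host], ask: Ask): Option[(Host, Host)] =
    filter(hosts, ask).sortBy(score(_, ask)).headOption.map { best =>
      (best, best.copy(freeCpu = best.freeCpu - ask.cpu, freeMem = best.freeMem - ask.mem))
    }
}
```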
16. Constraint-based Scheduling
• Machine type
• Datacenter
• Region
• CPU architecture
• Host prefix
• …
[Diagram: jobs in a queue carry constraints such as type=1 or type=2; Hosts 1–3 are labeled {"type":"1"} and Hosts 4–5 {"type":"2"}, forming a merged host pool covering types 1 & 2.]
17. Preemption
• Guarantee resource availability SLOs within and across queues
• Identify starving jobs and over-allocated jobs
• Minimize preemption cost with a two-phase protocol
  – Only candidates appearing in both phases are preempted
  – The Resource Manager notifies the client of the preemption intent so that necessary action can be taken, e.g. checkpointing
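A schematic sketch of the two-phase victim selection: phase one marks candidates needed to satisfy starving demand, phase two marks candidates from over-allocated queues, and only the intersection is preempted. All names are illustrative and the logic is simplified to a single CPU dimension:

```scala
object TwoPhasePreemption {
  case class Container(id: String, queue: String, cpuMilliCores: Long)

  // Phase 1: from the starving jobs' view, greedily mark the cheapest containers
  // whose release would cover the unmet demand.
  def candidatesForStarvingDemand(running: Seq[Container], demandCpu: Long): Set[String] =
    running.sortBy(_.cpuMilliCores).foldLeft((Set.empty[String], 0L)) {
      case ((picked, freed), c) =>
        if (freed >= demandCpu) (picked, freed) else (picked + c.id, freed + c.cpuMilliCores)
    }._1

  // Phase 2: from the queue-fairness view, mark containers owned by over-allocated queues.
  def candidatesFromOverallocatedQueues(running: Seq[Container], overallocated: Set[String]): Set[String] =
    running.filter(c => overallocated.contains(c.queue)).map(_.id).toSet

  // Only containers appearing in both phases are preempted; the Resource Manager first sends
  // the owning client a preemption notice so it can react, e.g. by checkpointing.
  def victims(running: Seq[Container], demandCpu: Long, overallocated: Set[String]): Set[String] =
    candidatesForStarvingDemand(running, demandCpu) intersect
      candidatesFromOverallocatedQueues(running, overallocated)
}
```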
18. Cross Datacenter Scheduling
• The growing demand for computation and storage for Hive tables spans data centers
• Stranded capacity with imbalanced load
• Poor data locality and wasted network bandwidth
• Slow reaction when recovering from crises and disasters
19–21. Cross Datacenter Scheduling
• Dispatcher Proxy
  – Monitors resource consumption across data centers
  – Decides which Resource Manager schedules each job
  – Provides location hints to the Resource Manager for enforcement
• Planner
  – Decides where the data will be placed based on utilization and available resources
[Diagram, built up over three slides: a Dispatcher in front of Resource Manager 1 and Resource Manager 2 spanning Datacenters 1–3; a job dispatched with the constraint datacenter=1; table data placed in the target data centers.]
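A simplified sketch of the Dispatcher Proxy's decision: pick the Resource Manager of the least-utilized data center and attach a location hint that the RM enforces as a constraint. The utilization metric and names are assumptions:

```scala
object DispatcherProxy {
  case class Datacenter(name: String, usedCpu: Long, totalCpu: Long, rmEndpoint: String) {
    def utilization: Double = usedCpu.toDouble / totalCpu
  }
  case class Placement(rmEndpoint: String, locationHint: String)

  // Send the job to the least-utilized data center and pin it there with a location hint
  // (a constraint such as "datacenter=1") that the chosen Resource Manager enforces.
  def dispatch(dcs: Seq[Datacenter]): Placement = {
    val target = dcs.minBy(_.utilization)
    Placement(target.rmEndpoint, s"datacenter=${target.name}")
  }
}
```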
22. Future Work
• Mixed workloads managed by one resource manager
• Run batch workloads with off-peak resources from online services
• Automatic resource tuning for high utilization
• We're hiring! Contact: rjian@fb.com