- 快召唤伙伴们来围观吧
- 微博 QQ QQ空间 贴吧
- 文档嵌入链接
- 复制
- 微信扫一扫分享
- 已成功复制到剪贴板
使用数据流引擎进行大型矩阵操作
展开查看详情
1 .Matei Zaharia Large-Scale Matrix Operations Using a Data Flow Engine
2 .Outline Data flow vs. traditional network programming Limitations of MapReduce Spark computing engine Matrix operations on Spark
3 .Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters Wide use in both enterprises and web industry How do we program these things?
4 .Traditional Network Programming Message-passing between nodes (e.g. MPI) Very difficult to do at scale: How to split problem across nodes? Must consider network & data locality How to deal with failures? (inevitable at scale) Even worse: stragglers (node not failed, but slow) Rarely used in commodity datacenters
5 .Data Flow Models Restrict the programming interface so that the system can do more automatically Express jobs as graphs of high-level operators System picks how to split each operator into tasks and where to run each task Run parts twice fault recovery Biggest example: MapReduce Map Map Map Reduce Reduce
6 .MapReduce for Matrix Operations Matrix-vector multiply Power iteration (e.g. PageRank) Gradient descent methods Stochastic SVD Tall skinny QR Many others!
7 .Why Use a Data Flow Engine? Ease of programming High-level functions instead of message passing Wide deployment More common than MPI, especially “near” data Scalability to very largest clusters Even HPC world is now concerned about resilience
8 .Why Use a Data Flow Engine? Ease of programming High-level functions instead of message passing Wide deployment More common than MPI, especially “near” data Scalability to very largest clusters Even HPC world is now concerned about resilience
9 .Limitations of MapReduce MapReduce is great at one -pass computation, but inefficient for multi-pass algorithms No efficient primitives for data sharing State between steps goes to distributed file system Slow due to replication & disk storage No control of data partitioning across steps
10 .iter . 1 iter . 2 . . . Input file system read file system write file system read file system write Input query 1 query 2 query 3 result 1 result 2 result 3 . . . file system read Commonly spend 90% of time doing I/O Example: Iterative Apps
11 .Example: PageRank Repeatedly multiply sparse matrix and vector Requires repeatedly hashing together page adjacency lists and rank vector Neighbors (id, edges) Ranks (id, rank) … Same file grouped over and over iteration 1 iteration 2 iteration 3
12 .Spark Programming Model Extends MapReduce with primitives for efficient data sharing “Resilient distributed datasets” Open source in Apache Incubator Growing community with 100+ contributors APIs in Java, Scala & Python
13 .Resilient Distributed Datasets (RDDs) Collections of objects stored across a cluster User-controlled partitioning & storage (memory, disk, …) Automatically rebuilt on failure urls = spark.textFile ( “ hdfs ://...” ) records = urls. map ( lambda s: (s, 1) ) counts = records. reduceByKey ( lambda a, b: a + b ) bigCounts = counts. filter ( lambda ( url , cnt ): cnt > 10 ) Input file map reduce filter Known to be hash-partitioned Also known bigCounts. cache ( ) bigCounts. filter ( lambda ( k,v ): “news” in k ). count ( ) bigCounts. join ( otherPartitionedRDD )
14 .Performance Time per Iteration (s)
15 .Performance Time per Iteration (s)
16 . Neighbors (id, edges) Ranks (id, rank) PageRank Using cache(), keep neighbors in RAM Using partitioning, avoid repeated hashing join join join … partitionBy
17 . PageRank Using cache(), keep neighbors in RAM Using partitioning, avoid repeated hashing Neighbors (id, edges) Ranks (id, rank) join join join … same node partitionBy
18 . PageRank Using cache(), keep neighbors in RAM Using partitioning, avoid repeated hashing Neighbors (id, edges) Ranks (id, rank) join partitionBy join join …
19 .PageRank Code # RDD of ( id, neighbors ) pairs links = spark.textFile (...). map ( parsePage ) . partitionBy (128). cache () ranks = links. mapValues ( lambda v: 1.0 ) # RDD of ( id, rank) for i in range(ITERATIONS): ranks = links. join (ranks). flatMap ( lambda (id, (links, rank) ): [(d, rank/ links.size ) for d in links] ). reduceByKey ( lambda a, b: a + b )
20 .PageRank Results
21 .Alternating Least Squares Start with random A 1 , B 1 Solve for A 2 to minimize ||R – A 2 B 1 T || Solve for B 2 to minimize ||R – A 2 B 2 T | | Repeat until convergence R A = B T
22 .ALS on Spark Cache 2 copies of R in memory, one partitioned by rows and one by columns Keep A & B partitioned in corresponding way Operate on blocks to lower communication R A = B T Joint work with Joey Gonzales, Virginia Smith
23 .ALS Results
24 .Benefit for Users Same engine performs data extraction, model training and interactive queries … DFS read DFS write parse DFS read DFS write train DFS read DFS write query DFS DFS read parse train query Separate engines Spark
25 .Other Projects on Spark MLlib : built- in Spark library for ML Includes ALS, K-means||, various algorithms on SGD Frankin , Gonzales et al. [MLOSS ‘13] MLI : Matlab -like language for writing apps Basic ALS in 35 lines of code Evan Sparks, Ameet Talwalkar et al . [ICDM ‘13]
26 .100+ developers, 25+ companies contributing; most active development community after Hadoop Spark Community
27 .Conclusion Data flow engines are becoming an important platform for matrix algorithms Spark offers a simple programming model that greatly speeds these up More info: spark.incubator.apache.org