- 快召唤伙伴们来围观吧
- 微博 QQ QQ空间 贴吧
- 文档嵌入链接
- 复制
- 微信扫一扫分享
- 已成功复制到剪贴板
Spark ML模型交叉验证的并行性
展开查看详情
1 .Model Parallelism in Spark ML Cross-validation Nick Pentreath Principal Engineer Bryan Cutler Software Engineer DBG / June 5, 2018 / © 2018 IBM Corporation
2 .About Nick @MLnick on Twitter & Github Principal Engineer, IBM CODAIT - Center for Open-Source Data & AI Technologies Machine Learning & AI Apache Spark committer & PMC Author of Machine Learning with Spark Various conferences & meetups DBG / June 5, 2018 / © 2018 IBM Corporation
3 .About Bryan Software Engineer, IBM CODAIT Apache Spark committer Apache Arrow committer Python, Machine Learning OSS @BryanCutler on Github DBG / June 5, 2018 / © 2018 IBM Corporation
4 .Center for Open Source Data and AI Technologies CODAIT codait.org CODAIT aims to make AI solutions dramatically easier to create, deploy, Improving Enterprise AI Lifecycle in Open Source and manage in the enterprise Relaunch of the Spark Technology Center (STC) to reflect expanded mission DBG / June 5, 2018 / © 2018 IBM Corporation
5 .Agenda Model Tuning in Spark Scaling Model Tuning Performance Results Best Practices Future Directions in Optimizing Pipelines DBG / June 5, 2018 / © 2018 IBM Corporation
6 .Model Tuning in Spark DBG / June 5, 2018 / © 2018 IBM Corporation
7 .Model Tuning in Spark Model selection: workflow within a workflow Data Feature Model Ingest Final Model Processing Engineering Selection Candidate models Adjust Train Evaluate DBG / June 5, 2018 / © 2018 IBM Corporation
8 .Model Tuning in Spark Pipeline cross-validation Spark ML Pipeline Tokenizer CountVectorizer LogisticRegression # features: # features: regParam: regParam: 10 100 0.001 0.1 Parameters DBG / June 5, 2018 / © 2018 IBM Corporation
9 .Model Tuning in Spark Pipeline cross-validation CountVectorizer LogisticRegression Tokenizer # features: 10 regParam: 0.001 # features: 10 CountVectorizer LogisticRegression # features: Tokenizer 100 # features: 10 regParam: 0.1 regParam: CountVectorizer LogisticRegression 0.001 Tokenizer # features: 100 regParam: 0.001 regParam: 0.1 CountVectorizer LogisticRegression Tokenizer # features: 100 regParam: 0.1 DBG / June 5, 2018 / © 2018 IBM Corporation
10 .Model Tuning in Spark Pipeline cross-validation Tokenizer CountVectorizer LogisticRegression # features: # features: 10 100 regParam: regParam: 0.001 0.1 DBG / June 5, 2018 / © 2018 IBM Corporation
11 .Model Tuning in Spark Pipeline cross-validation DBG / June 5, 2018 / © 2018 IBM Corporation
12 .Model Tuning in Spark Pipeline cross-validation DBG / June 5, 2018 / © 2018 IBM Corporation
13 .Model Tuning in Spark Pipeline cross-validation DBG / June 5, 2018 / © 2018 IBM Corporation
14 .Model Tuning in Spark Based on XKCD comic: https://xkcd.com/303/ & https://github.com/mislavcimpersak/xkcd-excuse-generator Cross-validation is expensive! • 5 x 5 x 5 hyperparameters = 125 pipelines • ... across 4 machine learning models = 500 • If training & evaluation does not fully utilize available cluster resources then that waste is compounded for each model DBG / June 5, 2018 / © 2018 IBM Corporation
15 . Scaling Model Tuning DBG / June 5, 2018 / © 2018 IBM Corporation
16 .Scaling Model Tuning Parallel model evaluation CountVectorizer LogisticRegression Tokenizer # features: 10 regParam: 0.001 # features: 10 CountVectorizer LogisticRegression # features: Tokenizer 100 # features: 10 regParam: 0.1 regParam: CountVectorizer LogisticRegression 0.001 Tokenizer # features: 100 regParam: 0.001 regParam: 0.1 CountVectorizer LogisticRegression Tokenizer # features: 100 regParam: 0.1 DBG / June 5, 2018 / © 2018 IBM Corporation
17 .Scaling Model Tuning Parallel model evaluation CountVectorizer LogisticRegression Tokenizer # features: 10 regParam: 0.001 # features: 10 CountVectorizer LogisticRegression # features: Tokenizer 100 # features: 10 regParam: 0.1 regParam: CountVectorizer LogisticRegression 0.001 Tokenizer # features: 100 regParam: 0.001 regParam: 0.1 CountVectorizer LogisticRegression Tokenizer # features: 100 regParam: 0.1 DBG / June 5, 2018 / © 2018 IBM Corporation
18 .Scaling Model Tuning Parallel model evaluation CountVectorizer LogisticRegression Tokenizer # features: 10 regParam: 0.001 # features: 10 CountVectorizer LogisticRegression # features: Tokenizer 100 # features: 10 regParam: 0.1 regParam: CountVectorizer LogisticRegression 0.001 Tokenizer # features: 100 regParam: 0.001 regParam: 0.1 CountVectorizer LogisticRegression Tokenizer # features: 100 regParam: 0.1 DBG / June 5, 2018 / © 2018 IBM Corporation
19 .Scaling Model Tuning Parallel model evaluation CountVectorizer LogisticRegression Tokenizer # features: 10 regParam: 0.001 # features: 10 CountVectorizer LogisticRegression # features: Tokenizer 100 # features: 10 regParam: 0.1 regParam: CountVectorizer LogisticRegression 0.001 Tokenizer # features: 100 regParam: 0.001 regParam: 0.1 CountVectorizer LogisticRegression Tokenizer # features: 100 regParam: 0.1 DBG / June 5, 2018 / © 2018 IBM Corporation
20 .Scaling Model Tuning Parallel model evaluation • Added in SPARK-19357 and SPARK-21911 (PySpark) • Parallelism parameter governs the maximum # models to be trained at once DBG / June 5, 2018 / © 2018 IBM Corporation
21 .Scaling Model Tuning Parallel model evaluation Tokenizer CountVectorizer LogisticRegression # features: # features: 10 100 regParam: regParam: 0.001 0.1 DBG / June 5, 2018 / © 2018 IBM Corporation
22 .Scaling Model Tuning Parallel model evaluation DBG / June 5, 2018 / © 2018 IBM Corporation
23 .Scaling Model Tuning Parallel model evaluation DBG / June 5, 2018 / © 2018 IBM Corporation
24 .Scaling Model Tuning Parallel model evaluation DBG / June 5, 2018 / © 2018 IBM Corporation
25 .Scaling Model Tuning Implementation considerations • Parallelism parameter sets the size of threadpool under the hood • Dedicated ExecutionContext created to avoid deadlocks with using the default threadpool • Used Futures instead of parallel collections – more flexible • Model-specific parallel fitting implementations not supported • SPARK-22126 DBG / June 5, 2018 / © 2018 IBM Corporation
26 .Scaling Model Tuning Performance tests • Data size: 100,000 -> 5,000,000 • Compared parallel CV to serial CV with varying number of samples • Number features: 10 • Number partitions: 10 • Simple LogisticRegression with regParam and fitIntercept; parameter grid size 12 • Number CV folds: 5 • Parallelism: 3 • Measure elapsed time for cross-validation • Standalone cluster with 30 cores DBG / June 5, 2018 / © 2018 IBM Corporation
27 .Scaling Model Tuning Results • ±2.4x speedup • Stays roughly constant as # samples increases DBG / June 5, 2018 / © 2018 IBM Corporation
28 .Scaling Model Tuning Best practices • Simple integer parameter is the only thing you can set (for now) • Too low => under-utilize resources • Too high => could lead to memory issues or overloading cluster • Rough rule: # cores / # partitions • But depends on data and model sizes • Mid-sized cluster probably <= 10 DBG / June 5, 2018 / © 2018 IBM Corporation
29 .Optimizing Tuning for Pipeline Models DBG / June 5, 2018 / © 2018 IBM Corporation