Advanced Hyperparameter Optimization for Deep Learning with MLflow
1. WiFi SSID: SparkAISummit | Password: UnifiedAnalytics
2. Advanced HPO for Deep Learning
Maneesh Bhide, Databricks
3. Review: HPO Approaches
Grid search:
• PRO: can be run in one time-step
• CON: naive, computationally expensive, suffers from the curse of dimensionality, may alias over global optima
Random search:
• PRO: suffers less from the curse of dimensionality, can be run in one time-step
• CON: naive, no certainty about results, still computationally expensive
Population based:
• PRO: implicit predictions, can be run in several time-steps, good at resolving many optima
• CON: computationally expensive, may converge to local optima
Bayesian:
• PRO: explicit predictions, computationally efficient
• CON: requires sequential observations
4. Review: Best Practices
• Tune the entire pipeline, not individual models
• How you phrase parameters matters!
  – Are categoricals really categorical?
    • Instead of [2, 4, 8, 16, 32], use an integer parameter {1, 5, 1} and take 2^(param)
  – Use transformations to your advantage
    • For learning_rate, instead of {0, 1}, use {-10, 0} and take 10^(param)
• Don't restrict yourself to traditional hyperparameters
  – SGD flavor
  – Architecture
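A minimal sketch of the transformation idea in plain Python (no particular HPO library assumed): the optimizer searches a small integer or a bounded exponent, and the training code maps that value back to the one it actually uses.

    import random

    # The optimizer samples over the "rephrased" ranges...
    exp_lr    = random.uniform(-10, 0)   # instead of learning_rate drawn from (0, 1)
    exp_width = random.randint(1, 5)     # instead of a categorical [2, 4, 8, 16, 32]

    # ...and the training code applies the inverse transformation.
    learning_rate = 10 ** exp_lr         # log-uniform over (1e-10, 1)
    layer_width   = 2 ** exp_width       # one of 2, 4, 8, 16, 32

    print(learning_rate, layer_width)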
5. HPO for Neural Networks
• Can benefit from the compute efficiency of Bayesian optimization, since the parameter space can explode
  – Challenge: sequential training and long training times
• Optimize more than just hyperparameters
  – Challenge: parameters depending on other parameters
• Production models often have multiple criteria
  – Challenge: trading off between objectives
6. Agenda
• Challenge of sequential training and long training time
  – Early Termination
• Challenge of parameters depending on other parameters
  – Awkward/Conditional Spaces
• Challenge of trading off between objectives
  – Multimetric Optimization
7. How Early Termination Works
From the HyperBand paper:
1. Select an initial set of candidate configurations
2. Train the configurations for Xn epochs
3. Evaluate performance (preferably on the objective metric)
4. Use SuccessiveHalving (eliminate half); run the remaining configurations for an additional Xn epochs
5. Set Xn+1 = 2 * Xn
6. Go to step 2
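A library-free sketch of the loop above; train_for_epochs and evaluate are hypothetical stand-ins for your own training and validation code, and higher scores are assumed to be better.

    def successive_halving(configs, train_for_epochs, evaluate,
                           initial_epochs=1, rounds=7):
        """Sketch of steps 1-6: train, evaluate, keep the best half, double the budget."""
        epochs = initial_epochs
        survivors = list(configs)
        for _ in range(rounds):
            # Steps 2-3: give every surviving configuration `epochs` more epochs, then score it.
            scored = []
            for cfg in survivors:
                train_for_epochs(cfg, epochs)
                scored.append((evaluate(cfg), cfg))
            # Step 4: SuccessiveHalving -- keep the better half.
            scored.sort(key=lambda pair: pair[0], reverse=True)
            survivors = [cfg for _, cfg in scored[: max(1, len(scored) // 2)]]
            # Step 5: double the per-round budget.
            epochs *= 2
            if len(survivors) == 1:
                break
        return survivors[0]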
8. (Figure) Credit: https://www.automl.org/blog_bohb/
9. Assumptions
• Well-behaved learning curves
• Model performance: you don't need the best model, you need a good model faster
10. (Figure) Credit: https://www.automl.org/blog_bohb/
11. Scenario Walkthrough
• ResNet-50 on ImageNet
• 9 hyperparameters for HPO
• 128 configurations
• 1 p2.xlarge ($0.90/hour)
• 12 hours training time per configuration
12. Standard Training
12 hours * 128 configurations
Total compute time: 1,536 hours
Total cost: $1,382.40
13. With HyperBand

% train       0.78%   1.56%   3.12%   6.25%   12.5%   25%    50%    100%
hours         0.09    0.19    0.37    0.75    1.5     3      6      12
configs       128     64      32      16      8       4      2      1
ET            64      32      16      8       4       2      1      -
total hours   5.76    6.08    5.92    6       6       6      6      12

(ET = configurations early-terminated after that stage; total hours = compute consumed by the configurations that stop there, i.e. their count times the cumulative hours, plus the full 12 hours for the single surviving configuration.)

Total compute time: 53.76 hours
Total cost: $48.38
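A quick check of the table's arithmetic in plain Python: each configuration that stops at a stage has accumulated that stage's cumulative hours, and the lone survivor pays the full 12 hours.

    # Cumulative training hours at each HyperBand stage (0.78% ... 100% of 12 hours)
    hours   = [0.09, 0.19, 0.37, 0.75, 1.5, 3, 6, 12]
    # Configurations that stop at each stage (64, 32, ... terminated, plus the 1 survivor)
    stopped = [64, 32, 16, 8, 4, 2, 1, 1]

    total_hours = sum(h * n for h, n in zip(hours, stopped))
    print(round(total_hours, 2))          # 53.76
    print(round(total_hours * 0.90, 2))   # 48.38 dollars at $0.90/hour on a p2.xlarge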
14. Scenario Summary
Without early termination: 1,536 hours
With early termination: 53.76 hours
96.5% reduction in compute (and cost!)
15. Bayesian + HyperBand
1. Articulate checkpoints
2. The optimizer selects an initial sample (bootstrapping)
3. Train for "checkpoint N" epochs
4. Evaluate performance (preferably on the objective metric)
5. Use a Bayesian method to select new candidates
6. Increment N
7. Go to step 3
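As a hedged, concrete sketch of this loop, the HpBandSter library (linked on the Code slide) lets you implement a worker whose compute(config, budget) trains up to the given budget (the checkpoint) and returns a loss, while the BOHB optimizer handles the Bayesian candidate selection and budget scheduling. The usage below follows my reading of the linked examples and the train_and_validate helper is hypothetical; treat it as an outline rather than authoritative API documentation.

    import ConfigSpace as CS
    import hpbandster.core.nameserver as hpns
    from hpbandster.core.worker import Worker
    from hpbandster.optimizers import BOHB

    class MyWorker(Worker):
        def compute(self, config, budget, **kwargs):
            # Train the model described by `config` for `budget` epochs (the checkpoint),
            # then report validation loss; train_and_validate is a hypothetical helper.
            val_loss = train_and_validate(config, epochs=int(budget))
            return {'loss': val_loss, 'info': {}}

    # The checkpoints are implied by min_budget/max_budget and BOHB's halving rate.
    config_space = CS.ConfigurationSpace()
    config_space.add_hyperparameter(
        CS.UniformFloatHyperparameter('lr', lower=1e-6, upper=1e-1, log=True))

    NS = hpns.NameServer(run_id='bohb_demo', host='127.0.0.1', port=None)
    NS.start()
    worker = MyWorker(nameserver='127.0.0.1', run_id='bohb_demo')
    worker.run(background=True)

    bohb = BOHB(configspace=config_space, run_id='bohb_demo',
                nameserver='127.0.0.1', min_budget=1, max_budget=81)
    result = bohb.run(n_iterations=16)   # steps 2-7 happen inside the optimizer
    bohb.shutdown(shutdown_workers=True)
    NS.shutdown()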
16. (Figure) Credit: https://www.automl.org/blog_bohb/
17. Assumptions: None
• Black-box optimization
• Allows the user to account for potential stagnation when selecting checkpoints
• Regret is intrinsically accounted for
18. Random vs. Bayesian
1. Number of initial candidates
   – Random: scales exponentially with the number of parameters
   – Bayesian: scales linearly with the number of parameters
2. Candidate selection
   – Random: naive, static
   – Bayesian: adaptive
3. Regret implementation
   – Random: the user must explicitly define it
   – Bayesian: surrogate + acquisition function
19. Which is Better?
20. Does this Actually Work?
21. Summary
• Attempts to optimize resource allocation
• Dramatically reduces compute and wall-clock time to convergence
• Better implementations include a "regret" mechanism to recover configurations
• Bayesian outperforms Random
  – But in principle, early termination is compatible with any underlying hyperparameter optimization technique
22. What about Keras/TF EarlyStopping?
NOT THE SAME THING
It evaluates a single model against a pre-determined rate of loss improvement, in order to:
1. Terminate stagnating configurations
2. Prevent overtraining
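For contrast, a minimal Keras sketch: this callback watches one model's validation loss during a single training run; it does not compare configurations against each other the way HyperBand-style early termination does. The model, x_train, and y_train names are placeholders for your own model and data.

    import tensorflow as tf

    # Per-model early stopping: halt this one training run when val_loss stops
    # improving by at least `min_delta` for `patience` consecutive epochs.
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor='val_loss',
        min_delta=1e-4,            # pre-determined rate of improvement
        patience=5,
        restore_best_weights=True)

    # `model`, `x_train`, `y_train` are placeholders for your own model and data.
    model.fit(x_train, y_train,
              validation_split=0.2,
              epochs=100,
              callbacks=[early_stop])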
23. Libraries
Open source, HyperBand:
• HpBandSter (with random search)
Open source, conceptually similar:
• HpBandSter (with HyperOpt search)
• Fabolas* (RoBO)
Commercial, conceptually similar:
• SigOpt
24. Code
• HpBandSter: https://automl.github.io/HpBandSter/build/html/auto_examples/index.html
• Fabolas: https://github.com/automl/RoBO/blob/master/examples/example_fabolas.py
• SigOpt: https://app.sigopt.com/docs/overview/multitask
25. Awkward/Conditional Spaces
The range, or existence, of one hyperparameter is dependent on the value of another hyperparameter.
26. Examples
• Optimize gradient descent algorithm selection (see the sketch below)
• Neural network topology refinement
• Neural Architecture Search
• Ensemble models as featurizers
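A hedged sketch of the first example using HyperOpt (listed on the Libraries slides): hp.choice makes momentum exist only inside the SGD branch and beta_1 only inside the Adam branch, so each nested hyperparameter is conditional on the optimizer choice. The objective below is a placeholder for real training code.

    from hyperopt import fmin, hp, tpe

    # Conditional space: `momentum` only exists in the 'sgd' branch,
    # `beta_1` only in the 'adam' branch.
    space = hp.choice('optimizer', [
        {'name': 'sgd',
         'momentum': hp.uniform('momentum', 0.0, 0.99),
         'lr': hp.loguniform('sgd_lr', -10, 0)},
        {'name': 'adam',
         'beta_1': hp.uniform('beta_1', 0.8, 0.999),
         'lr': hp.loguniform('adam_lr', -10, 0)},
    ])

    def objective(params):
        # Placeholder: a real objective would build and train the network with the
        # chosen optimizer and its conditional parameters, then return validation loss.
        return params['lr']

    best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
    print(best)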
27. (Figure) Credit: https://devblogs.nvidia.com/optimizing-end-to-end-memory-networks-using-sigopt-gpus/
28. Why Does This Matter?
• Bayesian/adaptive algorithms learn from prior observations
• For every hyperparameter, the optimizer requires some number of samples to "learn" its dependencies
29. Libraries
Open source:
• HyperOpt
• HpBandSter
Commercial:
• SigOpt