Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Instead of better understanding and optimizing their machine learning models, data scientists spend the majority of their time training and iterating through different models, even in cases where the data is reliable and clean. Important aspects of creating an ML model include (but are not limited to) data preparation, feature engineering, identifying the correct models, training (and continuing to train), and optimizing the models. This process can be (and often is) laborious and time-consuming.
In this session, we will explore this process and then show how the AutoML Toolkit (from Databricks Labs) can significantly simplify and optimize machine learning. We will demonstrate all of this using financial loan risk data, with code snippets and notebooks that will be free to download.
2 .Augmenting Machine Learning with Databricks Labs AutoML Toolkit Denny Lee, Databricks #UnifiedDataAnalytics #SparkAISummit
3 .Agenda
• Discuss traditional ML pipeline problem and all of its stages
• How AutoML Toolkit solves these problems
• Hyperparameter Optimization
• Choosing Models
• Scaling AutoML Toolkit Best Practices
4 .About Speaker
Denny Lee, Developer Advocate, Databricks
• Worked with Apache Spark™ since 0.5
• Former Senior Director Data Science Engineering at Concur
• On Project Isotope incubation team that built what is now known as Azure HDInsight
• Former SQLCAT DW BI Lead at Microsoft
5 .Let’s start at the end
6 .Let’s start at the end
• AutoML’s FeatureImportances automates the discovery of which features are important.
• AutoML’s AutomationRunner automates the building, training, execution, and tuning of a Machine Learning pipeline to create an optimal ML model.
• Improved AUC from 0.6732 to 0.995!
• Business value: $23.22M to $267.24M saved!
• Less code, faster!
7 .ML Pipeline Stages
8 .ML Pipeline Stages
9 .Identify Important Features: Traditional ML Pipelines
10 .Exploratory Analysis to Identify Features
11 .Identify Important Features: AutoML Toolkit
12 .ML Pipeline with AutoML Toolkit
13 .AutoML | FeatureImportances
    // Calculate Feature Importance (fi)
    val fiConfig = ConfigurationGenerator.generateConfigFromMap("XGBoost", "classifier", genericMapOverrides)

    // Since we're using XGBoost, set parallelism <= 2x number of nodes
    fiConfig.tunerConfig.tunerParallelism = nodeCount * 2
    val fiMainConfig = ConfigurationGenerator.generateFeatureImportanceConfig(fiConfig)

    // Generate Feature Importance
    val importances = new FeatureImportances(sourceData, fiMainConfig, "count", 20.0)
      .generateFeatureImportances()
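To make the idea of a count-based cutoff concrete, here is a minimal plain-Scala sketch. It assumes that the `"count", 20.0` arguments above mean "keep the 20 highest-ranked features"; that reading is an assumption, and `FeatureCutoff` and `topByCount` are hypothetical names that are not part of the AutoML Toolkit.

```scala
// Hypothetical helper for illustration only; not part of the AutoML Toolkit API.
object FeatureCutoff {

  /** Keep the names of the `n` features with the highest importance scores,
    * highest first. Ties are broken alphabetically so the result is deterministic.
    */
  def topByCount(importances: Map[String, Double], n: Int): Seq[String] =
    importances.toSeq
      .sortWith((l, r) => l._2 > r._2 || (l._2 == r._2 && l._1 < r._1))
      .take(n)
      .map(_._1)
}
```

Calling `FeatureCutoff.topByCount(scores, 20)` on a map of feature name to importance score would return at most 20 names, which could then be used to select columns from the source data.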
14 .AutoML | FeatureImportances
15 .ML Pipeline Stages
16 .Building and Tuning Models: Traditional Model Building and Tuning
17 .Hand-made Model
• Traditionally, when we build an ML pipeline, we need to perform a number of tasks, including:
• Defining our category (text-based) and numeric columns
• Based on previous analysis, you can determine which features to use (i.e., which columns to include in your ML model)
• For numeric columns, ensure they are double or float data types
• For category columns, convert them using a stringIndexer and one-hot encoding to create a numeric representation of the category data
• Build and train our ML pipeline to create our ML model (in this case, an XGBoost model)
• For example, put together the imputer, stringIndexer, and one-hot encoding of category data
• Create a vector (e.g. vectorAssembler) to put together these features
• Apply a standard scaler to the values to minimize the impact of outliers
• Execute the model against our dataset
• Review the metrics (e.g., AUC)
• Tune the model using a Cross Validator
• The better you understand the model, the more likely you will provide better hyperparameters for cross validation
• i.e., you need to choose a solid set of parameters (e.g. paramGrid)
• Review the metrics again (e.g. AUC)
• Review the confusion matrix (in the case of binary classification)
• Review business value
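The feature-preparation steps in this list (string indexing, one-hot encoding, standard scaling) can be sketched in plain Scala to show what each transformation does to the data. This is a simplified illustration under stated assumptions, not the Spark ML implementation: `HandMadeFeatures` and its methods are hypothetical names, the one-hot vector keeps every category, and the scaler uses the population standard deviation.

```scala
// Hypothetical helpers for illustration; a real pipeline would use Spark ML stages.
object HandMadeFeatures {

  /** Assign each distinct category an index, most frequent first
    * (ties broken alphabetically).
    */
  def stringIndex(values: Seq[String]): Map[String, Int] = {
    val counts = values.groupBy(identity).map { case (k, v) => (k, v.size) }.toSeq
    counts
      .sortWith((l, r) => l._2 > r._2 || (l._2 == r._2 && l._1 < r._1))
      .map(_._1)
      .zipWithIndex
      .toMap
  }

  /** One-hot encode a category index as a dense vector of length `size`. */
  def oneHot(index: Int, size: Int): Vector[Double] =
    Vector.tabulate(size)(i => if (i == index) 1.0 else 0.0)

  /** Rescale a numeric column to zero mean and unit (population) standard deviation.
    * A constant column is mapped to all zeros.
    */
  def standardScale(xs: Seq[Double]): Seq[Double] = {
    val mean = xs.sum / xs.size
    val std = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / xs.size)
    if (std == 0.0) xs.map(_ => 0.0) else xs.map(x => (x - mean) / std)
  }
}
```

For a category column, `stringIndex` is applied once to build the lookup, and each row's index is then passed to `oneHot`; the encoded vectors and scaled numeric columns are what a vector assembler would concatenate into the final feature vector.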
18 .Hand-made Model: Can we make this easier?
19 .Building and Tuning Models: AutoML Model Building and Tuning
20 .ML Pipeline with AutoML Toolkit
21 .AutoML | AutomationRunner
    val modelingType = "XGBoost"
    val conf = ConfigurationGenerator.generateConfigFromMap(modelingType,…)

    // Adjust model tuner configuration
    conf.tunerConfig.tunerParallelism = nodeCount

    // Generate configuration
    val XGBConfig = ConfigurationGenerator.generateMainConfig(conf)

    // Select on the important features
    val runner = new AutomationRunner(sourceData).setMainConfig(XGBConfig)
      .runWithConfusionReport()
22 .Model, Metrics, Configs Saved: AUC from 0.6732 to 0.995!
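For readers less familiar with the metric, here is a minimal plain-Scala sketch of how an AUC value can be computed, assuming "AUC" here means the area under the ROC curve. It uses the pairwise definition (the chance that a randomly chosen positive example is scored above a randomly chosen negative one), which is easy to read but quadratic in the number of rows. `Auc.rocAuc` is a hypothetical name, not an AutoML Toolkit or Spark function.

```scala
// Hypothetical helper for illustration; fine for small samples, too slow for large ones.
object Auc {

  /** Area under the ROC curve for (score, label) pairs with labels 0 or 1.
    * Each positive/negative pair counts 1 if the positive is scored higher,
    * 0.5 on a tie, and 0 otherwise; the result is the average over all pairs.
    */
  def rocAuc(scored: Seq[(Double, Int)]): Double = {
    val pos = scored.collect { case (s, 1) => s }
    val neg = scored.collect { case (s, 0) => s }
    require(pos.nonEmpty && neg.nonEmpty, "need at least one positive and one negative label")
    val wins = (for (p <- pos; n <- neg)
      yield if (p > n) 1.0 else if (p == n) 0.5 else 0.0).sum
    wins / (pos.size.toLong * neg.size)
  }
}
```

Under this definition, 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.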
23 .Clearing up the Confusion
[Figure: confusion matrices for the Hand-Made Model and the AutoML Model; horizontal axis: Predicted Label]
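The four counts behind a binary confusion matrix can be tallied with a short plain-Scala sketch. `Confusion.Matrix` and `Confusion.matrix` are hypothetical names for illustration, and the input is assumed to be (actual, predicted) label pairs with values 0 or 1.

```scala
// Hypothetical helper for illustration; labels are assumed to be 0 or 1.
object Confusion {

  /** Counts of true positives, false positives, true negatives, false negatives. */
  final case class Matrix(tp: Int, fp: Int, tn: Int, fn: Int)

  /** Tally a confusion matrix from (actual, predicted) label pairs. */
  def matrix(pairs: Seq[(Int, Int)]): Matrix =
    pairs.foldLeft(Matrix(0, 0, 0, 0)) {
      case (m, (1, 1)) => m.copy(tp = m.tp + 1) // predicted positive, actually positive
      case (m, (0, 1)) => m.copy(fp = m.fp + 1) // predicted positive, actually negative
      case (m, (0, 0)) => m.copy(tn = m.tn + 1) // predicted negative, actually negative
      case (m, (1, 0)) => m.copy(fn = m.fn + 1) // predicted negative, actually positive
      case (_, other) =>
        throw new IllegalArgumentException(s"labels must be 0 or 1, got $other")
    }
}
```

Comparing the two models then comes down to comparing these four counts, in particular how many false positives and false negatives each model produces.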
24 .Business Value: From $23.22M to $267.24M saved!
25 .Let’s end at the end
• AutoML’s FeatureImportances automates the discovery of which features are important.
• AutoML’s AutomationRunner automates the building, training, execution, and tuning of a Machine Learning pipeline to create an optimal ML model.
• Improved AUC from 0.6732 to 0.995!
• Business value: $23.22M to $267.24M saved!
• Less code, faster!