Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, MLflow, and Jupyter
1. Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with ML, MLflow, and Jupyter
Josh Johnston, Director of AI Science
josh.johnston@kount.com
2. Overview
• Model lifecycle
• Our fraud-detecting model
• Initial method with database and scikit-learn
• Improved method with HDFS and Spark
• Robust model governance
3. Manage the model lifecycle
Modeling
• Configuration management
• Performance (speed)
• Accuracy
• Validation
Governance Questions
• Which model are you using?
• How did you train it?
• How well does it work?
After each answer: Why?
Science is repeatable.
Microsoft. (2017, October 19). What is the Team Data Science Process? Retrieved March 26, 2019, from https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview
4. Our fraud-detecting model
5. Kount protects digital innovations from…
• Fraudulent account creation
• Account takeover
• Transaction/payment fraud
• Authentication friction
6. Evaluate transactions for fraud
• Substantial throughput: 30-100 transactions per second
• Low latency: 250 ms end-to-end system latency, ~15 ms for machine learning features and model
7. Evaluate transactions for fraud (diagram)
9. Boost Technology™ Customer View
Fraud manager feedback:
• Reduced manual reviews by 20%
• Reduced manual reviews by 200 hours/month
• Reduced chargeback rate by 17%
• Approve an extra ~3K transactions and $1.2M USD per month
• Don't hear complaints from the fraud team about the review queue anymore
• Sleep better at night
10. Boost Technology™ Technical View
Feature Engineering
• 200 GB of precomputed data
Model
• Random forest
• 250 trees
• ~100k nodes per tree
• ~1 GB serialized representation
Model Training
• ~150 features
• ~60M observations
11. Initial training with database and scikit-learn
12. First approach gets to production (training pipeline diagram)
• Sources: analytics database, observation lookup service, flat-file logging on network storage
• Steps: fetch observations, lookup compute, fetch lookups, train model (scikit-learn), output a pickled model
• Time axis marks: 1 hr, 8 hrs, 12 hrs, 16 hrs, 24 hrs
• Total: 2.5 days, 400 GB RAM, 1 TB into swap
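As a rough sketch of what that single-machine step looks like (paths, column names, and all hyperparameters other than the 250 trees are illustrative assumptions, not Kount's actual code):

```python
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Everything lives in the memory of one machine: observations fetched from the
# analytics database and their precomputed lookups are joined into one frame.
observations = pd.read_csv("observations.csv")   # illustrative path
lookups = pd.read_csv("lookups.csv")             # illustrative path
training = observations.merge(lookups, on="transaction_id")

features = training.drop(columns=["transaction_id", "is_fraud"])
labels = training["is_fraud"]

# 250 trees as on the technical-view slide; other settings are guesses.
model = RandomForestClassifier(n_estimators=250, n_jobs=-1)
model.fit(features, labels)

# Pickling captures the Python and scikit-learn environment along with the
# trees, which is what ties execution to the training environment.
with open("fraud_model.pkl", "wb") as f:
    pickle.dump(model, f)
```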
13. What works
• Trains a high-value model
14. What doesn't work
• Time-intensive
• Errors force restarts since everything is held in memory (and swap)
• Burdens the production analytics database
• Pickled model ties the execution environment to the training environment
• Traceability provided by log files and manual documentation
• Ad hoc experiments with little configuration control
Governance Questions
• Which model are you using?
• How did you train it?
• How well does it work?
After each answer: Why?
15. Improved training with HDFS and Spark
16. Cluster for distributed computing
• Dell hardware: 6 nodes, 484 vCores, 1.35 TB RAM
• Cloudera Manager, Spark 2.4, mostly Python
HDFS
• Attached to 3 nodes
• 171 TB usable space
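A minimal sketch of starting a PySpark session against such a cluster; the executor counts and sizes below are illustrative assumptions (the real allocation is governed by Cloudera Manager / YARN), not the actual configuration:

```python
from pyspark.sql import SparkSession

# Illustrative resource settings for a 6-node, 484-vCore, 1.35 TB cluster.
spark = (
    SparkSession.builder
    .appName("fraud-forest-training")
    .config("spark.executor.instances", "40")
    .config("spark.executor.cores", "5")
    .config("spark.executor.memory", "24g")
    .getOrCreate()
)
```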
17. Improved approach through cluster (training pipeline diagram)
• Infrastructure: analytics database, Spark cluster, HDFS, observation lookup, Luigi orchestration, logging, MLflow
• Steps: sqoop data, compute lookups, perform lookups, train model (Spark ML), output a zipped MLeap model
• Time axis marks: 45 min, 2 hrs, 8 hrs
• Total: < 1/2 day
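A minimal sketch of the Spark ML training step under these assumptions: it reuses the `spark` session from the previous sketch, the HDFS paths and column names are placeholders, and only the 250-tree setting comes from the slides.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler

# Observations and lookups land in HDFS (e.g. via sqoop) as columnar files.
observations = spark.read.parquet("hdfs:///fraud/observations")  # placeholder path
lookups = spark.read.parquet("hdfs:///fraud/lookups")            # placeholder path
training = observations.join(lookups, on="transaction_id")

feature_cols = [c for c in training.columns
                if c not in ("transaction_id", "is_fraud")]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

# 250 trees as on the technical-view slide; other settings are illustrative.
rf = RandomForestClassifier(labelCol="is_fraud", featuresCol="features",
                            numTrees=250, seed=42)

pipeline = Pipeline(stages=[assembler, rf])
model = pipeline.fit(training)
```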
18. Remote development with Jupyter
• Most criticisms of notebooks are things you COULD do, not what you MUST do
• Good development practices are independent of tools
• Maturity axis from research to production: Jupyter Notebook → Python packages → PySpark application, alongside version control (git) and automation
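One way to read that progression, as a sketch with hypothetical module and function names: feature logic prototyped in a notebook graduates into a versioned Python package that both the notebook and the production PySpark application import.

```python
# fraud_features/build.py -- hypothetical package extracted from a notebook so
# the same code is importable, testable, and version-controlled.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def add_amount_features(transactions: DataFrame) -> DataFrame:
    """Feature logic first prototyped interactively, now a reusable function."""
    return transactions.withColumn("amount_log", F.log1p(F.col("amount")))

# Used identically in research (Jupyter) and production (spark-submit job):
#   from fraud_features.build import add_amount_features
#   df = add_amount_features(raw_df)
```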
19. What works
• Faster
• Failures restart in the middle
• Reduces burden on the production analytics database
• Redesign experiments without penalty
• MLeap decouples the evaluation environment from the training environment
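For the MLeap point, a minimal sketch of exporting a fitted Spark pipeline as an MLeap bundle (using MLeap's PySpark support; `model` and `training` are assumed from the Spark ML sketch above, and the bundle path is illustrative). The zipped bundle can then be scored outside the Spark runtime.

```python
# Importing MLeap's PySpark support patches serializeToBundle onto fitted models.
import mleap.pyspark  # noqa: F401
from mleap.pyspark.spark_support import SimpleSparkSerializer  # noqa: F401

# Serialize the fitted pipeline plus a transformed sample of the training data.
model.serializeToBundle("jar:file:/tmp/fraud_forest.zip",
                        model.transform(training))
```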
20. What still doesn't work
• Non-deterministic Spark ML behavior and errors
• Spark pipelines rely on configurations that change based on input data
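One concrete illustration of the second point (our own example, not from the talk): a StringIndexer's label-to-index mapping is learned from whatever data it is fit on, so the "same" pipeline definition can encode categories differently from run to run. Assumes the `spark` session from earlier.

```python
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="card_brand", outputCol="card_brand_idx",
                        handleInvalid="keep")

df_a = spark.createDataFrame([("visa",), ("mc",), ("visa",)], ["card_brand"])
df_b = spark.createDataFrame([("mc",), ("mc",), ("visa",)], ["card_brand"])

# Labels are ordered by frequency in the fitted data, so the index assigned
# to "visa" differs between these two fits.
print(indexer.fit(df_a).labels)   # ['visa', 'mc']
print(indexer.fit(df_b).labels)   # ['mc', 'visa']
```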
21. Tools and Processes for Model Governance
22. Tools and processes for governance
Governance Questions
• Which model are you using?
• How did you train it?
• How well does it work?
After each answer: Why?
Solution components
• Data traceability
• Experiment, configuration, and accuracy traceability
29.
• Data pipelines with error handling
• Repeatable and documented data transformations
• Document parameters
• Trace to code and data used
• Record accuracy of selected and not-selected models
• Store the final model and configurations as an artifact
Governance Questions
• Which model are you using?
• How did you train it?
• How well does it work?
After each answer: Why?
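A minimal sketch of how MLflow tracking can answer those governance questions; the run name, parameter values, metric, and artifact path are all illustrative placeholders.

```python
import mlflow

# Parameters, data/code references, accuracy for every candidate model, and
# the final artifact land in one tracked run instead of ad hoc log files.
with mlflow.start_run(run_name="fraud-forest-candidate"):
    mlflow.log_param("num_trees", 250)
    mlflow.log_param("training_data", "hdfs:///fraud/observations")  # trace to data
    mlflow.log_param("git_commit", "abc1234")                        # trace to code
    mlflow.log_metric("validation_auc", 0.97)                        # placeholder accuracy
    mlflow.log_artifact("/tmp/fraud_forest.zip")                     # zipped MLeap bundle
```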