Validating Big Data Jobs: Stopping Failures Before Production on Apache Spark, with Holden Karau
1. Validating Big Data & ML Pipelines (Apache Spark): Now mostly "works"* (photo: Melinda Seckington)
2. Holden: ● My name is Holden Karau ● Preferred pronouns are she/her ● Developer Advocate at Google ● Apache Spark PMC, contributor to many others (including Airflow) ● Previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● Co-author of Learning Spark & High Performance Spark ● Twitter: @holdenkarau ● SlideShare: http://www.slideshare.net/hkarau ● Code review livestreams: https://www.twitch.tv/holdenkarau / https://www.youtube.com/user/holdenkarau ● Spark talk videos: http://bit.ly/holdenSparkVideos ● Talk feedback (if you are so inclined): http://bit.ly/holdenTalkFeedback
4. What is going to be covered (photo: Andrew): ● Why my employer cares about this stuff ● My assumptions about y'all ● A super brief look at property testing ● What validation is & why you should do it for your data pipelines ● How to make simple validation rules & our current limitations ● ML Validation - Guessing if our black box is "correct" ● Cute & scary pictures ○ I promise at least one cat
5. Some of the reasons my employer cares* (photo: Khairil Zhafri) ● We have a hosted Spark/Hadoop solution (called Dataproc) ● We also have hosted pipeline management tools (based on Airflow, called Cloud Composer) ● Being good open source community members *Probably; it's not like I go to all of the meetings I'm invited to.
6. Who I think you wonderful humans are ● Nice* people ● Like silly pictures ● Possibly familiar with one of Scala, Java, or Python ● Possibly familiar with Spark ● Want to make better software ○ (or models, or w/e) ● Or just want to make software good enough to not have to keep your resume up to date
7. So why should you test? ● Makes you a better person ● Avoid making your users angry ● Save $s ○ Having an ML job fail in hour 26 and having to restart everything can be expensive... ● Waiting for our jobs to fail is a pretty long dev cycle ● Honestly you're probably not watching this unless you agree
8 .So why should you validate? ● tl;dr - Your tests probably aren’t perfect ● You want to know when you're aboard the failboat ● Our code will most likely fail at some point ○ Sometimes data sources fail in new & exciting ways (see “Call me Maybe”) ○ That jerk on that other floor changed the meaning of a field :( ○ Our tests won’t catch all of the corner cases that the real world finds ● We should try and minimize the impact ○ Avoid making potentially embarrassing recommendations ○ Save having to be woken up at 3am to do a roll-back ○ Specifying a few simple invariants isn’t all that hard ○ Repeating Holden’s mistakes is still not fun
9 .So why should you test & validate: Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
10 .So why should you test & validate - cont Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
11. Why don't we test? ● It's hard ○ Faking data, setting up integration tests ● Our tests can get too slow ○ Packaging and building Scala is already sad ● It takes a lot of time ○ and people always want everything done yesterday ○ or I just want to go home and see my partner ○ Etc. ● Distributed systems are particularly hard
12 .Why don’t we test? (continued)
13. Why don't we validate? ● We already tested our code ○ Riiiight? ● What could go wrong? Also, validation is extra hard in distributed systems: ● Distributed metrics are hard ● Not much built in (and not very consistent) ● Not always deterministic ● Complicated production systems
14. What happens when we don't (photo: itsbruce) ● Personal stories go here ○ I have no comment about where these stories are from This talk is being recorded so we'll leave it at: ● Negatively impacted the brand in difficult-to-quantify ways with words with multiple meanings ● Breaking a feature that cost a few million dollars ● Almost recommended illegal content (caught by a lucky manual check) ● Every search result was a coffee shop
15 .Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
16. Where do folks get the data for pipeline tests? (photo: Lori Rielly) ● Most people generate data by hand ● If you have production data you can sample, you are lucky! ○ If possible you can try and save it in the same format ● If our data is a bunch of Vectors or Doubles, Spark's got tools :) (see the sketch below) ● Coming up with good test data can take a long time ● Important to test different distributions, input files, empty partitions, etc.
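For the "Vectors or Doubles" case, one of those tools is Spark's built-in RandomRDDs generators. A minimal sketch is below; the sizes, column count, and seed are made-up placeholders, and sc is an existing SparkContext, as in the later examples:
  import org.apache.spark.mllib.random.RandomRDDs

  // 10,000 standard-normal Doubles spread over 4 partitions (sizes are illustrative only).
  val doubles = RandomRDDs.normalRDD(sc, 10000L, numPartitions = 4, seed = 42L)
  // 10,000 dense Vectors with 3 standard-normal features each.
  val vectors = RandomRDDs.normalVectorRDD(sc, numRows = 10000L, numCols = 3, numPartitions = 4, seed = 42L)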
17. Property generating libs: QuickCheck / ScalaCheck (photo: tara hunt) ● QuickCheck (Haskell) generates test data under a set of constraints ● The Scala version is ScalaCheck - supported by the two unit testing libraries for Spark ● Sscheck (ScalaCheck for Spark) ○ Awesome people*, supports generating DStreams too! ● spark-testing-base ○ Also awesome people*, generates more pathological (e.g. empty partitions etc.) RDDs *I assume
18. With spark-testing-base & a million entries:
  test("map should not change number of elements") {
    // Generate RDDs with anywhere from zero to a million elements.
    implicit val generatorDrivenConfig = PropertyCheckConfig(minSize = 0, maxSize = 1000000)
    val property = forAll(RDDGenerator.genRDD[String](sc)) { rdd =>
      importantBusinessLogicFunction(rdd).count() == rdd.count()
    }
    check(property)
  }
19. But that can get a bit slow for all of our tests ● Not all of your tests should need a cluster (or even a cluster simulator) ● If you are ok with not using lambdas everywhere, you can factor out that logic and test it with traditional tools (see the sketch below) ● Or if you want to keep those lambdas - or verify the transformation logic without the overhead of running a local distributed system - you can try a library like kontextfrei ○ Don't rely on this alone (but it can work well with something like ScalaCheck)
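A minimal sketch of that "factor out the logic" idea, with made-up function names, assuming the per-record logic is a pure function:
  // Pure per-record logic: no RDDs or SparkContext, so it can be unit tested
  // (or property tested with ScalaCheck) without starting Spark at all.
  def normalizeRecord(raw: String): String = raw.trim.toLowerCase

  // The pipeline is just a thin wrapper that wires the pure function in;
  // only this part needs local Spark (or spark-testing-base) to exercise.
  def normalizeAll(records: org.apache.spark.rdd.RDD[String]): org.apache.spark.rdd.RDD[String] =
    records.map(normalizeRecord)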
20. Let's focus on validation some more: *Can be used during integration tests to further validate integration results
21. So how do we validate our jobs? (photo: Paul Schadler) ● The idea is, at some point, you made software which worked. ● Maybe you manually tested and sampled your results ● Hopefully you did a lot of other checks too ● But we can't do that every time; our pipelines are no longer write-once, run-once - they are often write-once, run-forever, and debug-forever.
22. How many people have something like this? (photo: Lilithis)
  val data = ...
  val parsed = data.flatMap(x =>
    try {
      Some(parse(x))
    } catch {
      case _: Exception => None // Whatever, it's JSON
    }
  )
23. But if we're going to validate...
  val data = ...
  data.cache()
  val validData = data.filter(isValid)
  val badData = data.filter(x => !isValid(x))
  if (validData.count() < badData.count()) {
    // Ruh Roh! Special business error handling goes here
  }
  ...
  Pager photo by Vitachao CC-SA 3
24. Well that's less fun :( (photo: Sn.Ho) ● Our optimizer can't just magically chain everything together anymore ● My flatMap.map.map is fnur :( ● Now I'm blocking on a thing in the driver
25. Counters* to the rescue**! (photo: Miguel Olaya) ● Spark has built-in counters ○ Per-stage bytes r/w, shuffle r/w, record r/w, execution time, etc. ○ Shown in the UI; you can also register a listener from the spark-validator project ● We can add counters for things we care about (see the sketch below) ○ invalid records, users with no recommendations, etc. ○ Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting option ● We can _pretend_ we still have nice functional code *Counters/Accumulators are your friends, but the kind of friends who steal your lunch money **In a similar way to how regular expressions can solve problems…
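A minimal sketch of registering those "things we care about" counters (Spark 2.x accumulator API; the names are placeholders, reused in the parsing example a couple of slides below):
  // Named accumulators show up in the Spark UI next to the built-in metrics.
  val happyCounter = sc.longAccumulator("parsed-records")
  val sadCounter = sc.longAccumulator("unparseable-records")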
26. The first counters are free…
27. Just a little bit of code for the next ones… (photo: Phoebe Baker)
  val parsed = data.flatMap(x =>
    try {
      val result = Some(parse(x))
      happyCounter.add(1)
      result
    } catch {
      case _: Exception =>
        sadCounter.add(1)
        None // Whatever, it's JSON
    }
  )
  // Special business data logic (aka wordcount)
  // Much much later* business error logic goes here
  Pager photo by Vitachao CC-SA 3
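A hedged sketch of what that "much much later" error logic might look like; the 1% threshold is invented, and accumulator values are only meaningful after an action has actually run:
  // Force the lazy pipeline so the accumulators are actually populated.
  parsed.count()

  val bad = sadCounter.sum
  val good = happyCounter.sum
  // Made-up rule: fail the run if more than 1% of the records were unparseable.
  if (bad > 0.01 * (good + bad)) {
    throw new Exception(s"Too many unparseable records: $bad out of ${good + bad}")
  }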
28. Ok but what about those *s (photo: Miguel Olaya) ● Turns out accumulators aren't really great for tracking data properties ● Turns out sometimes for validation we really care about data properties ● But we can kind of fake it and hope
29. General rules for making validation rules (photo: Paul Schadler) ● According to a sad survey, most people check execution time & record count ● spark-validator is still in early stages, but an interesting proof of concept ● Sometimes your rules will misfire and you'll need to manually approve a job ● Remember those property tests? They could be validation rules too ● Historical data (see the sketch below) ● Domain specific solutions
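For example, a hand-rolled "historical data" rule might look roughly like the sketch below; the 10% tolerance and the hard-coded previous count are assumptions for illustration, not what spark-validator actually does (in practice the previous value would be loaded from wherever you persist run metrics):
  val current = validData.count()
  val previous = 100000L // placeholder: load this from your stored run history

  // Made-up tolerance: the new record count should be within 10% of the last good run.
  if (current < previous * 0.9 || current > previous * 1.1) {
    throw new Exception(s"Record count $current is outside 10% of the previous run ($previous)")
  }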