- 快召唤伙伴们来围观吧
- 微博 QQ QQ空间 贴吧
- 文档嵌入链接
- 复制
- 微信扫一扫分享
- 已成功复制到剪贴板
工作日交易数据的闪电快速分析
展开查看详情
1 .Lightning-Fast Analytics for Workday Transactional Data Pavel Hardak, Dir Product (Workday) Ned Borisov (Ph.D), Sr Eng Mgr (Workday) #ExpSAIS18
2 .Agenda • Workday (Pavel H) – Introduction to Workday – Business challenges – Platform for Transactional Apps • Prism Analytics (Ned B) – High Level Architecture – Functional Modules – Problems encountered • Wrap-up (Pavel H) #ExpSAIS18 2
3 .Workday • Pure SaaS company (founded in 2005) • Enterprise cloud apps – HCM and Finances – Named as “Leader” in Gartner Magic Quadrants • 2200+ customers, 175+ of Fortune 500 – Revenue: $2.1B, 36% YoY • 8600+ employees worldwide – #7 in FORTUNE "100 Best Companies to Work For” – Pleasanton (HQ), San Mateo, San Francisco – Boulder (CO), Dublin (Ireland), Victoria (BC), … #ExpSAIS18 3
4 .Workday Confidential #ExpSAIS18
5 .Continuous Innovation in Cloud #ExpSAIS18 5
6 .#ExpSAIS18 6
7 .Enterprise SaaS Challenges • Concurrency – From small to huge companies - every ‘worker’ is Workday user • Reliability – All users add and change data, generating many transactions • Security – Customers trust us with very confidential and private information • Scalability – Import several years from the previous system(s) and keep growing • Speed – Everybody wants fast response time J #ExpSAIS18 7
8 .One Source for Data | One Security Model | One Experience | One Community One Platform Business Process Object Reporting and Framework Data Model Analytics Security Machine Integration Learning Cloud #ExpSAIS18
9 . One Source for Data | One Security Model | One Experience | One Community One Platform Object Data Model Object Data Model Durable Extensible Metadata #ExpSAIS18
10 . One Source for Data | One Security Model | One Experience | One Community One Platform Reporting and Analytics Reporting and Analytics Dashboards Distribution Collaboration
11 .But we want more… • Import 3rd party data from external sources – Unknown schema, need validations and cleansing • Blend external data with Workday data – Self Service Data Preparation – Publish custom report sources – Leverage the same security paradigms • Data Discovery and Reporting – Visualize, slice and dice by any dimension – Perform faster than ever before #ExpSAIS18 11
12 .#ExpSAIS18 12
13 .Just add some … • Water (?) • Coffee (?) • Energy drink (?) • Apache Spark (!) #ExpSAIS18 13
14 .Why Apache Spark • Wanted to standardize on ONE data processing technology which keeps evolving • Needed extensibility to handle diverse use cases • Scalability for on-disk views and in-memory processing • SQL processing is a HUGE plus #ExpSAIS18
15 .High Level Prism Architecture Report Queries Web UI Requests Prism Server Data Prep: Interactive Transforms Samples Workday Data HDFS External Data #ExpSAIS18
16 .Data Preparation • A dataset may import other datasets to transform them (think SQL View) • Transforms include: Filter, Join, Union, Group By, etc. • Example data are shown to help verify the transformation #ExpSAIS18
17 .High Level Prism Architecture Report Queries Web UI Requests Prism Server Data Prep: Interactive Transforms Lens Build: Batch Transforms Samples Data Workday Data HDFS External Data #ExpSAIS18
18 . Lens Build • Materializing all transforms • Columnar format with further split into small Spark blocks Lens Jobs #ExpSAIS18
19 .High Level Prism Architecture Report Queries Web UI Requests Prism Server Query Engine: Interactive BI Queries Data Prep: Lens Interactive Transforms Lens Build: Batch Transforms Samples Data Workday Data HDFS External Data #ExpSAIS18
20 .Query Engine • Analyst-driven Analysis • Drag & drop chart creation • Analyst defined computed fields • Quick measurement aggregates • Execution • Query Engine executes the queries • Interactive response is required #ExpSAIS18
21 .High Level Prism Architecture Report Queries Web UI Requests Prism Server Query Engine: Interactive BI Queries Data Prep: Lens Interactive Transforms Lens Build: Batch Transforms Samples Data Workday Data HDFS External Data #ExpSAIS18
22 . Spark in Prism Architecture Prism Analytics launches and maintains lifecycle of three types of Spark Applications • Data Prep: a single (smaller) always-on Spark Application – executes dataset transformations over small samples of data • Lens Build: on-demand batch Application – one per Lens Build process – executes dataset transformations over full datasets • Query Engine: a single (larger) always-on Application – executes reporting queries over Lens data – caches columns of Lenses in memory #ExpSAIS18
23 .Query Engine & Spark Prism Server Spark Spark Spark Prism Executor Spark Executor Spark Executor Spark Spark Spark Executor Query Engine Spark Executor Spark Server Driver Executor Spark Executor Spark Executor Spark Executor Spark Data Prep Executor Executor ... Executor #ExpSAIS18
24 .Notable Observations • Memory Allocation Strategy • Row Level Security #ExpSAIS18
25 .Memory Allocation Strategy • Executors 30% 60% 10% Column Data Execution Buffer Cache Executor JVM à 20% faster queries • Driver 20% 60% 20% Accumulators Streaming Buffer Driver JVM #ExpSAIS18
26 .Row-Level Security • Implemented as a dimension predicate. For example: SELECT employee, SUM(quantity) FROM Employee_Stock_Grants WHERE supervisory_org IN (org1, org33, org_508) GROUP BY employee; • In-List for supervisory_org could be very large • More than one In-List • Complex list values (e.g. nested conjunctions) #ExpSAIS18
27 .Scenario Details • Customer Use Case – Predicates with 10+ In-Lists – Values between 6K and 12K – Additional mix of conjunctions and disjunctions • The Same Query With Security = 100X Without Security #ExpSAIS18
28 .Analysis • Finding 1 – Parsing, planning and optimizing was taking ~27 seconds – We did it 4 times • Finding 2 – Major cause is the number of times the Catalyst expressions (In and InSet) and their arguments were being traversed and copied during plan analysis and optimization. – Minor cause is the amount of time spent in serializing Scala’s TrieSet when shipping the plan to executors #ExpSAIS18
29 .Solution • Custom InSet-Like expressions (case classes) – Hide the large literals sets through a curried-argument – Resulted in queries going from 27 sec to 4 sec. • Further Optimizations – Our InSet-Like expression did not materialize the target in-sets until after the plan was de-serialized on the executors – Resulted in improvement from 4 sec to 2 sec. #ExpSAIS18