- 快召唤伙伴们来围观吧
- 微博 QQ QQ空间 贴吧
- 文档嵌入链接
- <iframe src="https://www.slidestalk.com/Spark/Self_Service_Apache_Spark_Structured_Streaming_Applications_and?embed" frame border="0" width="640" height="360" scrolling="no" allowfullscreen="true">复制
- 微信扫一扫分享
Self-Service Apache Spark Structured Streaming Applications and Analytics
展开查看详情
1 .WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
2 .Self-Service Apache Spark Structured Streaming Applications & Analytics Jayant Shekhar #UnifiedAnalytics #SparkAISummit
3 .Use Cases • IoT analytics • Real-time fraud detection • Anomaly detection • Analyzing streaming data from devices, turbines etc. #UnifiedAnalytics #SparkAISummit 3
4 .Why Self-Serve Build Quickly Easy Debugging Ability to handle error cases Build in minutes instead of hours and View what is happening inside days the jobs Visualizations Easy Operationalization Deploy immediately View the streaming data Track status of jobs, jobs history visually Failure Notifications etc. #UnifiedAnalytics #SparkAISummit 4
5 . Workflow – Find delayed Flights Display Delayed Save to Flights Hbase/MongoDB Read from Find delayed Parse Fields Kafka Flights (SQL) Aggregate by Display Delays Airports by Airport #UnifiedAnalytics #SparkAISummit 5
6 .Workflows stored as JSON #UnifiedAnalytics #SparkAISummit 6
7 .Spark Submit spark2-submit --class fire.execute.WorkflowExecuteFromFile --master yarn --deploy-mode client --proxy-user sparkflows /home/sparkflows/fire-3.1.0/fire-core-lib/fire-spark_2_1-core-3.1.0-jar-with-dependencies.jar --postback-url http://demo50:8080/messageFromSparkJob --sql-context HiveContext --job-id 1aa66851-0e69-4fc8-b525-539839f72046 --workflow-file /tmp/fire/workflows/workflow-4994719000673479151.json #UnifiedAnalytics #SparkAISummit 7
8 .Processor Types • Connectors • Visualizations • Transforms • ML Scoring • Aggregations • More • Languages (SQL, Scala, – Sessionization Python) – Dedup #UnifiedAnalytics #SparkAISummit 8
9 . Processors Details Streaming Feature Generation Machine Learning Data Profiling Geo NLP/OCR • Kafka • Correlation • IP2Geo • Tokenization • Classification • Named Entity • Kinesis • Data Summary • Spatial Joins • Stop Words • Regression Recognition • Files • Histograms • Map lat/lon to Remover • Clustering • Sentiment Analysis • Flume Zipcode • Imputer • Collaborative • Document • Sockets • Locality Sensitive Filtering Categorizer Hashing • Save / Load Model • OCR • One Hot Encoder • Cross-Validation • Predict File File Formats Formats ETL Languages Visualizations Data Sources RDBMS •Scalability CSV / TSV is an • Join, Union • Scala • Graphs • HDFS / S3 • MySQL •attribute Avro that describes • Filter • SQL • Maps • HIVE • Oracle •theParquet ability of a • Data Validation • Jython • Heatmaps • HBase • Postgres •process, JSONnetwork, • Math/String/Date • Java • Barchart • Cassandra • Teradata •software XML or Functions • Python • Piechart • Elastic Search • Etc. •organization PDF to grow • Data Cleanup • Kafka •andBinary manage • Salesforce increased demand. A • Marketo system, business or software that is described as scalable has an advantage because it is more adaptable to the #UnifiedAnalytics #SparkAISummit 9
10 .Components HDFS Kafka HBase Running Apache Spark Instances Mongo S3 Kinesis DB Browser Web Server ADLS Cassa ndra Apache Spark Clusters ES #UnifiedAnalytics #SparkAISummit 10
11 .Connectors • Cannot easily start/stop streaming jobs when designing. • Build connectors for reading from the stores and converting to DataFrames #UnifiedAnalytics #SparkAISummit 11
12 .Analytics • Slice & Dice data • Aggregations • Streaming Visualizations #UnifiedAnalytics #SparkAISummit 12
13 .Analytics • SQL • Scala #UnifiedAnalytics #SparkAISummit 13
14 .NLP • Integrated Apache OpenNLP & StanfordNLP • Processors ensure serialization of objects etc. and making things parallizable. • So users can easily start applying NLP to millions and millions of records. #UnifiedAnalytics #SparkAISummit 14
15 .Streaming Charts • Results produced by the Spark Streaming jobs are streamed back to the browser. • Displayed on streaming charts. #UnifiedAnalytics #SparkAISummit 15
16 .Streaming Connectors • Specialized code to run in the Workflow Designer reading from Streaming Sources. • We cannot run a full streaming job for interactive execution. #UnifiedAnalytics #SparkAISummit 16
17 .ML Scoring • ML Pipelines include featurization significantly simplifying things. • Ability of Processors to read models and pass them to the next Processors • VectorAssembler, VectorIndexer, StringIndexer, OneHotEncoder, Bucketizer #UnifiedAnalytics #SparkAISummit 17
18 .Storing Results • Ability to store in Hbase, ElasticSearch, HDFS etc. • Do not allow running at design mode so as not to mess up the stores. #UnifiedAnalytics #SparkAISummit 18
19 .Large Scale Deployment & Monitoring #UnifiedAnalytics #SparkAISummit 19
20 .Deployment - Executions #UnifiedAnalytics #SparkAISummit 20
21 .Track Job Status • Status : STARTING / RUNNING / COMPLETED / FAILED / KILLED • Jobs post back their status to the server • Poll the jobs in various ways – logs, YARN etc. #UnifiedAnalytics #SparkAISummit 21
22 .Scheduling & Triggering • Schedule by Time • Poll Kafka topic for events – Workflow ID – Workflow Parameters – Spark Submit Configurations #UnifiedAnalytics #SparkAISummit 22
23 .Notification & Alerts • When jobs complete / fail send email alerts etc. #UnifiedAnalytics #SparkAISummit 23
24 .Some more interesting things… • When no events received for defined time period, stop the Streaming Job. #UnifiedAnalytics #SparkAISummit 24
25 .Execution Results Execution Results stored in an RDBMS and tracked #UnifiedAnalytics #SparkAISummit 25
26 .View and compare various runs of a Workflow #UnifiedAnalytics #SparkAISummit 26
27 .Performance #UnifiedAnalytics #SparkAISummit 27
28 .Performance • Continue to update each process for best performance. Write once run many times… • Allow user control – Ability to control the Persistence Level of the DataFrames at any step • Focused on steps which took longer than expected, analyzed the code and updated it. • Ran load tests to compare various runs. #UnifiedAnalytics #SparkAISummit 28
29 .Learnings… • Many more users able to get value from data. • Reduced time from idea to deployment. • Performance become easier. • Deployment becomes one click. • Easier to write complex modules like Dedup, CDC, ML etc. and use them at many places. #UnifiedAnalytics #SparkAISummit 29