Large-Scale Telematics Analytics with Apache Spark
1 .Large-Scaled Telematics Analytics in Apache Spark
Wayne Zhang, Uber
Neil Parker, Uber
#DS3SAIS
2 .Agenda
• Telematics introduction
• Eng pipeline
3 .Telematics
- Wide availability
- Cheap
- Short upgrade cycle
- Lower quality
- Measure phone motion
Source: Smartphone-based Vehicle Telematics - A Ten-Year Anniversary
4 .Core Pipeline
Sensor Data Collection → Preprocessing & Transformation → Vehicle Movement Inference → Driving Behavior Inference
5 .Phone Sensor Data
● GPS
  ○ Absolute location, velocity and time
  ○ Low frequency (<= 1 point per second)
● IMU
  ○ Relative motion of phone
    ■ Accelerometer: 3D linear acceleration
    ■ Gyroscope: 3D angular velocity
  ○ High frequency (>20 points per second)
6 .GPS Map-Matching
7 .Phone Re-Orientation
8 .Long Stop Detection (figure labels: Long Stop, Long Stop)
9 .Vehicle Movement
10 .Phone Mounting
11 .Engineering
• Pipeline
  – Past
  – Present
• Problems
How big is our data?
• xPbs per year
• xTbs sensor data per day
• millions of trips per day
12 .Data Pipeline - Past (Streaming)
Input: Schema Topic (Kafka) → Transform: Samza Job → Output: Schema Topic (Kafka)
• Realtime
13 .Data Pipeline - Present (Batch)
Input: Hive Table (HDFS) → Transform: Spark Job (select logic in SparkSQL) → Output: Hive Table (HDFS)
• Flexible
14 .Data Pipeline - Actuality (λ)
Streaming: Input: Schema Topic (Kafka) → Transform: Samza Job → Output: Schema Topic (Kafka)
Business Logic (JVM)
Batch: Input: Hive Table (HDFS) → Transform: Spark Job → Output: Hive Table (HDFS)
15 .Data Pipeline - Present
Input: Hive Table (HDFS) → Transform: Spark Job (select logic in SparkSQL) → Output: Hive Table (HDFS)
16 .Data Pipeline
Inputs: Hive Table (HDFS), Hive Table (HDFS) → Transform: Spark Job (join logic in SparkSQL) → Output: Hive Table (HDFS)
17 .Data Pipeline
Inputs: Hive Table (HDFS), Hive Table (HDFS) → Transform: Spark Job → Outputs: Hive Table (HDFS), Hive Table (HDFS)
18 .Data Pipeline
Scheduler (every 24 hrs)
Inputs: Hive Table (HDFS), Hive Table (HDFS) → Transform: Spark Job → Outputs: Hive Table (HDFS), Hive Table (HDFS)
19 .Data Pipeline
Input: Hive Table (HDFS) → Transform: Spark Job → Output: Hive Table (HDFS)
20 .Data Pipeline
Input: Hive Table (HDFS) → Transform: Spark Job → Output / Input: Hive Table (HDFS) → Transform: Spark Job → Output: Hive Table (HDFS)
21 .Data Pipeline - Actuality
22 .Eng Problems
• Data Sources
• OOM Errors
• Too many Namenodes
23 .Eng Problems - Data Sources
Join logic in SparkSQL: Doesn't work
Input: Hive Table (HDFS) → Transform: Spark Job → Output: Hive Table (HDFS)
Input: Thrift Binaries (S3)
24 .Eng Problems - Data Sources
Input: Thrift Binaries (S3) → Transform: Spark Job → Output: Hive Table (HDFS)
github.com/airbnb/airbnb-spark-thrift
25 .Aside: Encode Decode Invariant
Java Thrift Class Instance ↔ Spark SQL Row
26 .Aside: Encode Decode Invariant
Java Thrift Class Instance ↔ Spark SQL Row
Generate random data to test (ScalaCheck Library)
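The encode/decode invariant on slides 25 and 26 can be read as: decoding an encoded record must return the original record, for any input. The talk checks this between Java Thrift class instances and Spark SQL Rows using ScalaCheck. The sketch below is only a standard-library Python stand-in for the same idea, not the authors' test. `TripPoint`, `encode_to_row`, `decode_from_row`, and `random_trip_point` are hypothetical names and are not part of airbnb-spark-thrift.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TripPoint:
    """Hypothetical record standing in for a Thrift-generated class."""
    trip_id: str
    timestamp_ms: int
    speed_mps: Optional[float]  # optional field, may be unset


# Hypothetical stand-in for a Spark SQL Row: a plain tuple of column values.
Row = Tuple[str, int, Optional[float]]


def encode_to_row(point: TripPoint) -> Row:
    """Flatten the record into column order."""
    return (point.trip_id, point.timestamp_ms, point.speed_mps)


def decode_from_row(row: Row) -> TripPoint:
    """Rebuild the record from column order."""
    trip_id, timestamp_ms, speed_mps = row
    return TripPoint(trip_id, timestamp_ms, speed_mps)


def random_trip_point(rng: random.Random) -> TripPoint:
    """Generate one random record, including empty strings and unset fields."""
    length = rng.randint(0, 12)
    trip_id = "".join(rng.choice("abcdef0123456789") for _ in range(length))
    timestamp_ms = rng.randint(0, 2**53)
    speed_mps = None if rng.random() < 0.2 else rng.uniform(0.0, 60.0)
    return TripPoint(trip_id, timestamp_ms, speed_mps)


def check_round_trip(trials: int = 1000, seed: int = 0) -> int:
    """Assert decode(encode(x)) == x on random inputs; return trials run."""
    rng = random.Random(seed)
    for _ in range(trials):
        original = random_trip_point(rng)
        assert decode_from_row(encode_to_row(original)) == original
    return trials
```

Calling `check_round_trip()` raises `AssertionError` on the first record that does not survive the round trip. Random generation is useful here because edge cases such as empty strings and unset optional fields turn up without being listed by hand.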
27 .Eng Problems - Data Sources
Join logic in SparkSQL: Works
Input: Hive Table (HDFS) → Transform: Spark Job → Output: Hive Table (HDFS)
Input: Thrift Binaries (S3) → Transform: Spark Job → Output: Hive Table (HDFS)
28 .Eng Problems - OOM Errors
• If hitting OOM-related issues, increasing the number of partitions usually helps
  – `spark.sql.shuffle.partitions=X`
• Where X > 200 (the default)
• Might also play with executor-cores and memory settings
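One way to pick X is to size partitions against the amount of data being shuffled. The sketch below assumes a target of roughly 128 MiB per shuffle partition, which is a common rule of thumb and not a figure from the talk. `suggest_shuffle_partitions` and `shuffle_conf_flag` are hypothetical helpers. Only the `spark.sql.shuffle.partitions` setting and its default of 200 come from the slide.

```python
DEFAULT_SHUFFLE_PARTITIONS = 200  # default stated on the slide

# Assumption: aim for about 128 MiB per shuffle partition (rule of thumb).
ASSUMED_TARGET_PARTITION_BYTES = 128 * 1024**2


def suggest_shuffle_partitions(
    shuffle_bytes: int,
    target_partition_bytes: int = ASSUMED_TARGET_PARTITION_BYTES,
) -> int:
    """Return a partition count that keeps partitions near the target size.

    Never goes below the default of 200, matching the slide's advice to
    raise the value rather than lower it.
    """
    if shuffle_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be non-negative and target must be positive")
    # Integer ceiling division avoids float rounding on large byte counts.
    needed = (shuffle_bytes + target_partition_bytes - 1) // target_partition_bytes
    return max(DEFAULT_SHUFFLE_PARTITIONS, needed)


def shuffle_conf_flag(partitions: int) -> str:
    """Format the setting as a spark-submit --conf argument."""
    return f"--conf spark.sql.shuffle.partitions={partitions}"
```

For example, `shuffle_conf_flag(suggest_shuffle_partitions(1024**4))` gives the flag for a 1 TiB shuffle under this assumption. The right target size depends on executor memory and cores, which is why the slide suggests tuning those settings as well.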
29 .Eng Problems - # Namenode
• If we have more partitions, we create more files, and every file adds metadata that the HDFS NameNode has to track
• Solution: Merge files after the job runs
  – github.com/apache/parquet-mr/tree/master/parquet-tools
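Before merging, the small output files have to be grouped into batches. The sketch below shows one possible way to do that with a greedy pass over file sizes. It is an illustration only, not the authors' job and not a parquet-tools feature: `plan_merges` is a hypothetical helper, and the merge itself is left to a tool such as the one linked on the slide.

```python
from typing import Dict, List


def plan_merges(file_sizes: Dict[str, int], target_bytes: int) -> List[List[str]]:
    """Group files into batches whose total size stays near target_bytes.

    file_sizes maps a file path to its size in bytes. Files are visited in
    sorted path order. A batch is closed as soon as adding the next file
    would push it past the target. A single file larger than the target
    gets a batch of its own.
    """
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")

    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0

    for path in sorted(file_sizes):
        size = file_sizes[path]
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size

    if current:
        batches.append(current)
    return batches
```

Each returned batch lists the files that would be combined into one output file, so the number of batches is the number of files left after merging.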