Large Scale Video Data Analysis at Tubi
沈达, Data Engineer at Tubi / Apache Spark community contributor
2 .Large Scale Video Data Analysis at Tubi 2021/10/16 © Tubi, proprietary and confidential
3 .About Me
沈达, Data Engineer @ Tubi
● GNU TeXmacs Developer
● Translator of 《Scala实用指南》
● Apache Spark Contributor
4 .Tubi: Stream Freely
Democratize content and make premium content accessible to everyone.
5 .Video Data Analysis: Why?
At Tubi:
● Ads Logo Detection
● Ads Point Finding
● Cue Point Finding
Other Applications:
● Self-Driving Cars
● Security Cameras
Imagine how a superhero could find villains, given the video data from every security camera in Beijing.
6 .Challenges of Video Data
        | Structured Data                           | Unstructured Data
Schema  | Schema-on-write                           | Schema-on-read
Format  | Predefined, using alphanumeric characters | Binaries that can only be displayed as hex numbers
Storage | Well-formatted Delta/Parquet/ORC          | Binary column in NoSQL, or scattered in S3
Query   | SQL                                       | Special tools needed: applying info extractors and ML models
7 .Rikai (理解, "understanding")
A Parquet-based ML data format built for working with unstructured data at scale
● Run ML models via SQL
  ○ PyTorch
  ○ Scikit-Learn
  ○ … (more to come)
● Native support for images and videos via PySpark
● Customized data format for images and videos (TODO)
https://github.com/eto-ai/rikai
Open-source driven from the very beginning
8 .Rikai vs Redshift-ML vs BigQuery-ML
1. The biggest difference is that Rikai introduces "strong typing" for annotations and labels, which lets you use SQL to understand your dataset.
2. Rikai also ships many UDFs for processing images and videos (e.g. computing IoU, generating spectrograms).
3. Looking narrowly at the feature itself: because Rikai is Spark-based, it is easy to extend and issues are easy to diagnose.
9 .Rikai-enhanced Spark SQL
10 .Rikai on Images: the magic ML_PREDICT
Labels in the example output: 0 = person, 65 = remote
11 .Rikai on Videos
● Split a 10-minute video into 14400 images (14400 = 10 * 60 * 24, i.e. 24 frames per second)
● Apply the YOLOv5 model to the images
● Find the predictions with label 0 (person)
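The three steps above can be sketched in plain Python. This is only an illustration of the arithmetic and the label filter, not Rikai's API: the 24 frames-per-second rate is inferred from the slide's 10 * 60 * 24, and the prediction record layout (`frame`, `label`, `score`) and the function names are hypothetical.

```python
from typing import Dict, List, Union

Prediction = Dict[str, Union[int, float]]

PERSON_LABEL = 0  # label id for "person" in the slide's example


def frame_count(duration_minutes: int, fps: int) -> int:
    """Number of frames produced by splitting a video at a fixed frame rate."""
    return duration_minutes * 60 * fps


def frames_with_label(predictions: List[Prediction], label: int) -> List[int]:
    """Sorted, de-duplicated frame indices that have a prediction with `label`."""
    return sorted({int(p["frame"]) for p in predictions if p["label"] == label})


# Hypothetical detector output for three frames (not real model results).
sample_predictions: List[Prediction] = [
    {"frame": 0, "label": 0, "score": 0.91},
    {"frame": 1, "label": 65, "score": 0.55},
    {"frame": 2, "label": 0, "score": 0.78},
    {"frame": 2, "label": 0, "score": 0.64},
]

total_frames = frame_count(10, 24)
person_frames = frames_with_label(sample_predictions, PERSON_LABEL)
```

In a real pipeline the filter would be a WHERE clause over the model's output column rather than a Python loop.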
12 .Rikai Types: Image, Box2d
13 .Rikai Types: Video
14 .Run ML Models
15 .Case Study: Linear Regression
Steps: Train, Predict
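As a reminder of what the two steps look like without any framework, here is a minimal sketch of Train and Predict for one-variable linear regression using the closed-form least-squares solution. The function names and the toy data are assumptions for illustration; the slide's own code is not part of this transcript.

```python
from typing import List, Sequence, Tuple

Model = Tuple[float, float]  # (slope, intercept)


def train(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model: Model, xs: Sequence[float]) -> List[float]:
    """Apply a fitted model to new inputs."""
    slope, intercept = model
    return [slope * x + intercept for x in xs]


# Toy data lying exactly on y = 2x + 1.
model = train([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
predictions = predict(model, [4.0, 5.0])
```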
16 .Case Study: Linear Regression with MLflow
Flow: Train, then log_model; load_model, then Predict
17 .MLflow: Collaborate on versioned models
Diagram label: S3/HDFS
18 .Case Study: Linear Regression with Rikai
Flow: Train, LOG MODEL, CREATE MODEL, ML_PREDICT
ML_PREDICT: Generated UDF
19 .Summary
● Train: ML Engineers, who define and create the model
● Create Model: ML/Data Engineers, who manage the model (model management)
  ○ Versioning (full lifecycle)
  ○ Options tuning
● ML_PREDICT: ML/Data Engineers, Data Analysts/Scientists, ... (inputs in, predictions out)
Everyone can become a superhero!!!
20 .ML_PREDICT Revealed
21 .ML_PREDICT: Generated UDF
Models can be cached to reduce IO latency
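The caching idea can be illustrated with a small sketch: the model is loaded once per URI and reused by later calls, so repeated predictions do not pay the IO cost again. This is not Rikai's implementation; the loader, the `models:/demo/1` URI, and the stand-in model are all hypothetical.

```python
from functools import lru_cache
from typing import Callable, List, Tuple

ModelFn = Callable[[float], float]


def make_cached_loader() -> Tuple[Callable[[str], ModelFn], List[str]]:
    """Return a cached model loader plus a log of the loads that really ran."""
    load_log: List[str] = []

    @lru_cache(maxsize=8)
    def load_model(uri: str) -> ModelFn:
        # Stand-in for an expensive read from remote storage.
        load_log.append(uri)
        return lambda x: 2.0 * x + 1.0

    return load_model, load_log


load_model, load_log = make_cached_loader()

results: List[float] = []
for i in range(3):
    # Every call asks for the model, but only the first call loads it.
    model_fn = load_model("models:/demo/1")
    results.append(model_fn(float(i)))
```

After the loop, `load_log` holds a single entry even though the model was requested three times.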
22 .ML_PREDICT: Vectorized UDF
Figure: 100 input rows (value pairs 0 1, 1 2, …, 99 100) shown as 4 groups of 25 rows (0 to 24, 25 to 49, 50 to 74, 75 to 99), with the UDF invoked once per group.
Slide caption: assuming batch_size is 4, the UDF is invoked 4 times in total.
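A vectorized UDF receives a whole batch of rows per call instead of one row at a time. The sketch below mimics the figure in plain Python, assuming 100 rows split into groups of 25 as drawn; the `CountingUdf` class and the add-one function are hypothetical stand-ins, not Spark or Rikai code.

```python
from typing import Iterator, List


def batches(rows: List[int], rows_per_batch: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `rows_per_batch` rows."""
    for start in range(0, len(rows), rows_per_batch):
        yield rows[start:start + rows_per_batch]


class CountingUdf:
    """A batch function that records how many times it was invoked."""

    def __init__(self) -> None:
        self.invocations = 0

    def __call__(self, batch: List[int]) -> List[int]:
        self.invocations += 1
        # One call processes the whole batch.
        return [x + 1 for x in batch]


udf = CountingUdf()
rows = list(range(100))
outputs: List[int] = []
for batch in batches(rows, 25):
    outputs.extend(udf(batch))
```

With a row-at-a-time UDF the same input would need 100 invocations; here `udf.invocations` ends at 4.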
23 .Call for Contributors!!! The NEXT AWESOME Big Data/ML Open Source Project
24 .Thanks! Work with me (da@tubi.tv) !!! https://github.com/da-tubi/rikai-example