Fluid-Alluxio Day in China
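Che Yang (车漾), Senior Technical Expert at Alibaba Cloud Intelligence. He develops Kubernetes and container-related products, with a particular focus on building machine learning platforms on cloud-native technology, and is the main author and maintainer of GPU share scheduling.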
1. Fluid: When Alluxio meets Kubernetes
2. Agenda
[Four agenda items; the text is not recoverable. Items 2 to 4 each begin with "Fluid".]
3. […]
Technology stack shown on the slide: Docker, Kubernetes; Hadoop, Spark, Alluxio; TensorFlow, PyTorch, Caffe.
• Gartner predicts that by 2023, 70% of AI workloads will run as application containers or be built on a serverless programming model.*
• Starting with version 3.0.1, Spark supports the Kubernetes scheduler, embracing the cloud-native environment.*
* Gartner report: https://www.gartner.com/en/conferences/emea/data-analytics-switzerland/featured-topics/topic-ai-machine-learning
* Spark 3.0.1 runs on k8s: https://spark.apache.org/docs/latest/running-on-kubernetes.html
4. […]
[Figure: ResNet50 model training speed (images/second).]
• ESSD cloud disk (Cloud Storage, PL2): 3189.6
• Synthetic data (local memory): 9993.6
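To make the gap between the two bars concrete, here is a minimal sketch that computes the slowdown and the fraction of the in-memory throughput reached when training reads from the cloud disk. The two constants are the values from the slide; the function names are illustrative and not part of any tool mentioned in the talk.

```python
# Throughput figures (images/second) as reported on the slide.
ESSD_IMAGES_PER_SEC = 3189.6       # ResNet50 reading from ESSD cloud disk (PL2)
SYNTHETIC_IMAGES_PER_SEC = 9993.6  # ResNet50 on synthetic data in local memory


def slowdown(baseline: float, measured: float) -> float:
    """Return how many times slower `measured` is than `baseline`."""
    return baseline / measured


def utilization(baseline: float, measured: float) -> float:
    """Return the fraction of the baseline throughput actually achieved."""
    return measured / baseline


if __name__ == "__main__":
    ratio = slowdown(SYNTHETIC_IMAGES_PER_SEC, ESSD_IMAGES_PER_SEC)
    share = utilization(SYNTHETIC_IMAGES_PER_SEC, ESSD_IMAGES_PER_SEC)
    print(f"slowdown: {ratio:.2f}x")
    print(f"share of in-memory throughput: {share:.1%}")
```

Read this way, training from the cloud disk reaches roughly a third of the throughput measured with synthetic in-memory data, which is the I/O gap the rest of the talk addresses.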
5. […]
[Most of this slide's text is not recoverable. It contrasts two sets of properties:]
• Compute-storage separation vs. data locality
• Stateless services vs. stateful computation
CNCF […]
6. […]
[Three numbered points (01 to 03); the text is not recoverable.]
7. Kubernetes […]
01 Kubernetes […]
• […] Pod
• […] PVC
• […] Service
02 […]
• Rook […] Ceph […]
• ChubaoFS […]
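8. […]
An analogy between retail and data-intensive computing:
• Goods <--> data (function: to be consumed)
• Supermarket <--> storage (function: storing and supplying)
• Customer <--> application (function: consuming)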
9. […]
[Most of this slide's text is not recoverable.]
• Lack of efficient data delivery
10. Fluid […]
Systems shown: Hadoop, Alluxio, Fluid
Roles shown: Data Fetch, Data Access, Data Delivery
11. Fluid […]
[Three numbered points; the text is not recoverable.]
12. Fluid […]
[Three numbered points (01 to 03), the last two referring to Kubernetes; the text is not recoverable.]
13. Fluid […]
[Figure; no text recoverable.]
14. Fluid […]: Concept
01 Dataset: […]
02 Runtime: […]
03 AlluxioRuntime: […] Alluxio […] Dataset […]
15. Fluid […]: Key Features
01
• Portable and Scalable: increase the cache capacity on demand.
• Observation: know the cache capacity easily.
• Co-locality: bring the data close to compute, and bring the compute close to data.
02 Miniature Data Lake: data from different storage systems can be consumed together.
03 Access control at the Dataset level for data scientists.
16. Fluid […]: How to Use Fluid

1. Create Dataset

    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: imagenet
    spec:
      mounts:
        - mountPoint: oss://imagenet/train
          name: train
          options:
            fs.oss.accessKeyId: xxx
            fs.oss.accessKeySecret: yyy
            fs.oss.endpoint: oss-cn-huhehaote.aliyuncs.com
        - mountPoint: pvc://ceph-pvc
          name: validation

2. Provision […] PV/PVC

3. Create Pod

    apiVersion: v1
    kind: Pod
    metadata:
      name: resnet50
    spec:
      containers:
        - name: train
          image: resnet50
          volumeMounts:
            - mountPath: /data
              name: imagenet
      volumes:
        - name: imagenet
          persistentVolumeClaim:
            claimName: imagenet

[Diagram: the Pod mounts the ImageNet dataset at /data; the dataset combines train (OSS) and validation (PVC).]
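The slide's manifests show the Pod claiming a PVC with the same name as the Dataset (`imagenet`). As a minimal sketch of that wiring, the following code transcribes both manifests as plain Python dicts (credentials omitted) and checks two things: that the Pod claims the Dataset's PVC, and that every volume mount refers to a declared volume. The helper names `claimed_pvcs`, `dangling_mounts`, and `pod_uses_dataset` are hypothetical and exist only for this illustration; they are not part of Fluid or Kubernetes.

```python
from typing import Any, Dict, List

# Manifests transcribed from the slide as plain dicts (credentials omitted).
dataset: Dict[str, Any] = {
    "apiVersion": "data.fluid.io/v1alpha1",
    "kind": "Dataset",
    "metadata": {"name": "imagenet"},
    "spec": {
        "mounts": [
            {"mountPoint": "oss://imagenet/train", "name": "train"},
            {"mountPoint": "pvc://ceph-pvc", "name": "validation"},
        ]
    },
}

pod: Dict[str, Any] = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "resnet50"},
    "spec": {
        "containers": [
            {
                "name": "train",
                "image": "resnet50",
                "volumeMounts": [{"mountPath": "/data", "name": "imagenet"}],
            }
        ],
        "volumes": [
            {"name": "imagenet", "persistentVolumeClaim": {"claimName": "imagenet"}}
        ],
    },
}


def claimed_pvcs(pod_manifest: Dict[str, Any]) -> List[str]:
    """Collect every PVC claim name referenced by the Pod's volumes."""
    claims = []
    for volume in pod_manifest.get("spec", {}).get("volumes", []):
        pvc = volume.get("persistentVolumeClaim")
        if pvc:
            claims.append(pvc["claimName"])
    return claims


def dangling_mounts(pod_manifest: Dict[str, Any]) -> List[str]:
    """Return volumeMount names that have no matching entry in spec.volumes."""
    spec = pod_manifest.get("spec", {})
    declared = {volume["name"] for volume in spec.get("volumes", [])}
    return [
        mount["name"]
        for container in spec.get("containers", [])
        for mount in container.get("volumeMounts", [])
        if mount["name"] not in declared
    ]


def pod_uses_dataset(pod_manifest: Dict[str, Any], dataset_manifest: Dict[str, Any]) -> bool:
    """True if the Pod claims a PVC named after the Dataset, as on the slide."""
    return dataset_manifest["metadata"]["name"] in claimed_pvcs(pod_manifest)


if __name__ == "__main__":
    print("pod uses dataset:", pod_uses_dataset(pod, dataset))
    print("dangling mounts:", dangling_mounts(pod))
```

A check like this is cheap to run before submitting a job: a mount name that matches no declared volume is rejected by Kubernetes at admission time, so catching it locally saves a round trip.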
17. Fluid […]: How to check the dataset
• Understand the current cache capabilities
• Determine whether to scale out

[The slide shows status output for the Dataset and for the Runtime, with callouts "Current Cache Capabilities", "Current Number of workers", and "Data requires to cache". The fields below are grouped as far as the extraction allows.]

Cache States (first snapshot):
    Cache Capacity: 600GiB
    Cached: 76.08GiB
    Cached Percentage: 90.3%

Cache States (second snapshot):
    Cache Capacity: 200.00GiB
    Cached: 0B
    Cached Percentage: 0%

Dataset status fields:
    Conditions:
      Message: The ddc runtime is ready.
      Reason: DatasetReady
      Status: True
      Type: Ready
    Phase: Bound
    Runtimes:
      Category: Accelerate
      Name: imagenet
      Namespace: default
      Type: alluxio
    Ufs Total: 84.29GiB
    Events: <none>

Runtime status fields:
    Conditions: # more conditions
      Status: True
      Type: Ready
    Current Fuse Number Scheduled: 4
    Current Master Number Scheduled: 1
    Current Worker Number Scheduled: 4
    Desired Fuse Number Scheduled: 4
    Desired Master Number Scheduled: 1
    Desired Worker Number Scheduled: 4
    Fuse Number Available: 4
    Fuse Number Ready: 4
    Fuse Phase: Ready
    Master Number Ready: 1
    Master Phase: Ready
    Value File: imagenet-alluxio-values
    Worker Number Available: 4
    Worker Number Ready: 4
    Worker Phase: Ready
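The status fields support a simple sanity check. The sketch below parses the size strings shown on the slide, recomputes the cached percentage from `Cached` and `Ufs Total`, and applies one possible scale-out rule: scale out when the cache capacity is smaller than the data to be cached. That rule is an assumption made for illustration; the slide only says the status helps decide whether to scale out. All function names are hypothetical.

```python
import re

# Binary unit multipliers for the size strings printed in the status output.
_UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}


def parse_size(text: str) -> float:
    """Convert a string such as '76.08GiB' or '0B' into a number of bytes."""
    match = re.fullmatch(r"\s*([0-9]+(?:\.[0-9]+)?)\s*([KMGT]iB|B)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNITS[unit]


def cached_percentage(cached: str, ufs_total: str) -> float:
    """Percentage of the underlying data set that is currently cached."""
    total = parse_size(ufs_total)
    if total == 0:
        return 0.0
    return 100.0 * parse_size(cached) / total


def needs_scale_out(cache_capacity: str, ufs_total: str) -> bool:
    """Assumed rule: scale out when the cache cannot hold the whole data set."""
    return parse_size(cache_capacity) < parse_size(ufs_total)


if __name__ == "__main__":
    # Values taken from the slide.
    print(f"cached: {cached_percentage('76.08GiB', '84.29GiB'):.1f}%")
    print("scale out:", needs_scale_out("200.00GiB", "84.29GiB"))
```

With the slide's numbers, 76.08GiB cached out of 84.29GiB total reproduces the reported 90.3%, and a 200GiB cache is already large enough for the data set under the assumed rule.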
18. Fluid […]: Schedule job with the dataset locality

Components shown: client, Kubernetes Scheduler, Fluid Scheduler, Fluid Runtime Service, and Alluxio on nodes N1 (10G cached), N2 (5G cached), and N3.

1. Create Pod
2. Find the cacheable node
3. Query the cache capabilities of the nodes
4. Start the Pod on N1

    apiVersion: v1
    kind: Pod
    metadata:
      name: resnet50
    spec:
      containers:
        - name: train
          image: resnet50
          volumeMounts:
            - mountPath: /data
              name: imagenet
      volumes:
        - name: imagenet
          persistentVolumeClaim:
            claimName: imagenet
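The four steps can be illustrated with a deliberately simplified selection rule: among the candidate nodes, prefer the one holding the most cached data. This is a sketch of the idea only and makes no claim about how the Fluid scheduler is implemented; `pick_node` and the per-node figures in GiB are assumptions based on the diagram (N1: 10G, N2: 5G, N3: none).

```python
from typing import Dict, Optional


def pick_node(cached_gib_by_node: Dict[str, float]) -> Optional[str]:
    """Return the node with the most cached data, or None if there are no nodes.

    Ties are resolved in favor of the node listed first.
    """
    if not cached_gib_by_node:
        return None
    return max(cached_gib_by_node, key=lambda node: cached_gib_by_node[node])


if __name__ == "__main__":
    # Cache amounts per node as drawn on the slide.
    nodes = {"N1": 10.0, "N2": 5.0, "N3": 0.0}
    print("schedule pod on:", pick_node(nodes))
```

With the diagram's values the rule selects N1, matching step 4. A real scheduler would also weigh CPU, memory, and GPU availability, which this sketch ignores.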
19. Fluid […]: Demo
• Demo 1: Accelerate Remote File Accessing with Fluid
• Demo 2: Machine Learning with Fluid
• Demo 3: Accelerate PVC with Fluid
20. Fluid […]
[Figure 1: Fluid vs OSSFS (20Gb/s). Throughput in images/second at 8, 32, 64, and 128 GPUs; series: ossfs (cache on) and Fluid. Data labels recovered: 8630.56, 9248.64, 16556.2, 21422.3, 27529.7, 32215, 40817.2, 58200.1.]
[Figure 2: Fluid vs OSSFS. Training time in minutes at 8, 32, 64, and 128 GPUs; series: ossfs (cache) and Fluid. Data labels recovered: 220, 214, 110, 91, 78, 55, 49, 31.]
Workload: ResNet50 network with the ImageNet dataset. End-to-end performance improves by about 100% (roughly 2x).
21. […] Fluid
• GitHub Repo: https://github.com/fluid-cloudnative/fluid
• Project Homepage: http://pasa-bigdata.nju.edu.cn/fluid/index.html
• DingTalk Group: […]
22. Thank You!