后摩尔定律时代大数据处理的挑战和机遇

下载 7

快召唤伙伴们来围观吧
微博 QQ QQ空间 贴吧
文档嵌入链接
<iframe src="https://www.slidestalk.com/u3507/Challenges_and_Opportunities_of_BigData_Processing_in_PostMo?embed" frame border="0" width="640" height="360" scrolling="no" allowfullscreen="true">复制
微信扫一扫分享
已成功复制到剪贴板

岁月静好

发布于

6年前

3729

人观看

#信息技术

后摩尔定律时代大数据处理的挑战和机遇

展开查看详情

1 . ）后摩尔定律时代大数据处理的挑战和机遇 TC BD （会张晓东大术美国俄亥俄州立大学技据（The Ohio State University）数大国中 18 20

2 . 计算机系统和应用进程的三个进化阶段） • 计算机是为 “计算 (computing)” 而研制的(1930s -1990s) TC – 1936-1937， ABC in Iowa State, Zues in Berlin (电子管） – 1945, ENIAC in University of Pennsylvania （电子管） BD – 1945，冯诺曼模型，一个计算，存储，加上二者数据的传输体系结构 – 1950, PN结型晶体管（贝尔实验室）（ – 1958-1962， CMOS 集成电路 (美国仙童, RCA公司） – 1965 年摩尔定律， 1971年第一个CPU芯片（Intel 4004）会 – 操作系统，存贮系统，编译软件，高性能计算。。。大 – 物质和物理世界被转变为数字世界，快速计算和深度分析 – 人类社会有了前所未有的科技突破：气象，新型材料，。。。。术技 • 计算机是为 “网络 (connectivity)” 而研制的 (1980s – 2010s) – 互联网和无线上网是一个全新数据世界的基础: 据 • 1981-2017: Bandwidth: from 50K bps to 100P bps (2 M times) 数 • 1981-2017: # of devices/users: from 0.1 to 10 (100 times) 大 • 网络电话，微博，QQ, 微信，网上购物，网上查询，。。。国 • 计算机是为 “数据中心 (data)” 而研制的 (从21世纪开始) 中 – 今天大数据的爆炸并不是已有的物理和物质的数字世界的一个延续 2 18 – 这个新的数据世界精确地记录和追踪人类自身的行为 – 有史以来90%的数据是过去两年产生的 20

3 . Moore’s Law is reaching to the end ） TC BD （会大术技据数大 We are close to the final usable size limit: a transistor gate length: 5 nm (2020) 国中 Minimum cost per transistor has been up since 32 nm (2010) 18 20 Road map: 22 nm in 2012, 14 nm in 2014, 10 nm in 2017, 7 nm 2018 3

4 . A powerful and general purpose ecosystem ） TC • A simple and one-size-fit-all computing abstraction BD – Any computing task is expressed by programs and executed by ISA – A task is divided by a sequence of standard execution time unit （ – Under multiprogramming model, OS assigns time units to each task 会 – Data blocks are moved around in the memory hierarchy 大 – Programming is very flexible 术技 • Moore’s law has hidden dark side of the ecosystem 据 – Efficiency continues to become low in both power and execution 数 – Not inclusive to other modes, e.g. SIMD, data flow, and domain specific 大国中 • As Moore’s law is ending 18 – General-purpose computing needs a lot of external help (within CMOS) 20 – Efforts have been made beyond CMOS 4

5 . Limit 1: Kernel-Centric Management ） • OS Kernel controls TC everything, CPU is BD frequently interrupted （ • In-memory computing 会 and NVM technologies 大 can overload CPU 术 • We are in an era of 技 CPU-IO performance 据 inversion, where CPU is 数 a bottleneck 大国丞相李斯向秦始皇进言：“ 中今皇帝并有天下，别黑 18 白而定于一尊”. 20

6 .Limit 1: Kernel-Centric Management is not sustained ） • OS Kernel controls TC everything, CPU is BD frequently interrupted （会 • In-memory computing 大 and NVM technologies 术 can overload CPU 技据 • We are in an era of 数 CPU-IO performance 大 inversion, where CPU is 国 a bottleneck 中 18 20

7 .20 18 中国大数据技术大会（ BD TC ） 7

8 .Relative IO Performance Growth is Faster than CPU in last 10 years due to SSD and ending Moore’s Law ） TC BD （会大术技据数大国中 18 20 8 Data collected by IBM Research, 2017

9 .Limit 2: General-purpose Computing is Flexible but not Efficient ） (公平和效率发生了冲突） TC • CPU and Multicores BD – Mainstream environment and very flexible to all applications （ – Performance gain stops as the end of Moore’s law 会 – Efficiency is increasingly low 大 • GPUs 术 – Efficient for applications with high SIMD parallelism 技 – Special programming and low flexibility • FPGAs 据数 – Efficient for applications defined by the hardware 大 – Very special programming and very low flexibility 国 • ASICs 中 – Very efficient for applications by customized hardware 18 – Expert programing with no flexibility 20 9

10 . A case for the highly efficient specialization ） TC BD （会大术技据数大国 – Crypto currency mining is high computing intensive task in bitcoin system 中 – Since 2008, the mining has been evolved on CPU, GPU, FPGA, and ASIC 18 10 – The efficiency factors are CPU=1, multicore = 10, GPU=100, FPGA=1,000 20 – ASIC is the most efficient in throughput and power: 10,000 to 1,000,000

11 .Limit 3: non-inclusive to specialized devices ） TC BD （会大术技 GPU 据 FPGA ASIC CPU (general) 数大国 Due to three major differences: 中 • Programming difference, e.g. CUDA for GPU, Verilog for FPGA, … 18 • Execution model difference, e.g. SMID for GPU, flexible for FPGA • Management difference, e.g. no OS support for GPU, FPGA, … 20 11

12 . Post Moore’ Law Ecosystem ） TC l Acceleration from specialized devices is highly necessary BD – Raising speedup/scalability best utilizing external devices （会 • In-memory and NVM storage provide fast data accesses 大 – Data will be served most time from DRAM or fast NVM 术技据 • CPU bottleneck must be reduced 数 – Kernel-centric management has to be modified 大国 • The system must be flexible, specialized and inclusive 中 18 – An comprehensive abstraction and automatic tools are needed 20

13 .Case 1: Maximizing Throughput of Key Value Stores ） TC ๏ A simple but effective method in data processing where a data BD record (or a value) is stored and retrieved with its associated key • variable types and lengths of record (value) （ • simple or no schema 会 • easy software development for many applications 大术 Key-value stores have been widely used in production systems: 技 ๏ 据数大国中 18 20

14 .Key-Value Store: A Data Model Supporting Scale-out ） TC BD Command Operation GET(key) Read value （会 SET(key, value) Write a KV item 大 DEL(key) Delete a KV item 术技 Hash (Key)  Server ID 据数大国中 18 § It is horizontally scalable, Data Servers 20 § It can be potentially of (very) high performance.

15 .Workflow of an in-memory Key-Value Store ） TC BD （ Network Processing 会大术 Memory Management 技据数 Index Operations 大国中 Access Value 18 20

16 . Workflow of a Typical Key-Value Store ） TC BD （ TCP/IP Processing 会 Request Parsing 大 DELETE SET GET Network Processing 术 Extract Keys Extract Key&Value Extract Keys 技 Memory Memory Full Not Full Evict 据 Allocate Memory Management 数大 Delete from Insert into Search in Index Index Index Index Operations 国中 Read & Send Value Access Value 18 20

17 .Where does time go in KV-Store MICA [NSDI’14] ） TC Network processing & BD memory management （会大 Index Operations 术技据数大国 Access Value 中 18 20 Index operation becomes one of the major bottlenecks 1 7

18 .Where does time go in KV-Store MICA [NSDI’14]? ） TC Network BD processing & Memory （ Management 会大 Index Operations 术技据数大 Access Value 国中 18 20

19 . Data Workflow of Key-Value Stores ） TC BD Query Network Processing & （ Memory Management 会 Hash Table 大 Random 术 Memory Accesses 技据数大国中 Index Operation Access Value 18 20

20 .Random Memory Accesses of Indexing are Expensive ） TC BD Sequential memory access （ Random memory access 会大术 3 16 技 2 7 据数 CPU: Intel Xeon E5-2650v2 大 Memory: 1600 MHz DDR3 国中 18 20

21 . Reason: long latency comes from row buffer misses ） TC Processor BD Bus bandwidth time Row Buffer （会大术 Row Access DRAM Core 技据数大国中 18 • Row buffer hits: repeated accesses in the same page 20 • Row buffer misses: random accesses in different pages

22 . Inabilities for CPUs to accelerate random accesses ） TC BD ๏ Do something for Index Operations (random memory accesses) （会 1 Caching 大 • Working set is large (100 GB), CPU cache is small (10 MB) 术技 2 Prefetching • 据 Hard to predict next memory address 数 Multithreading 大 3 Limited number of hardware supported threads 国 • Limited by # of Miss Status Holding Registers (MSHRs) for large volumes 中 • 18 20

23 . CPU vs. GPU ） TC BD Control Cache （会 Massive ALUs 大术技据数大国 Intel Xeon E5-2650v2: Nvidia GTX 780: 7 billion Transistors 中 2.3 billion Transistors 8 Cores 2,304 Cores 18 59.7 GB/s memory bandwidth 288.4 GB/s memory bandwidth 20

24 . Mega-KV System Framework ） TC BD （ Network processing, 会 Index Operations Read & Send Value Memory management 大术 Pre- GPU Post- Request TX 技 Processing Processing Processing 据数大国中 18 20

25 . Mega-KV System Framework ） TC BD （ Network processing, 会 Memory management 大术 Pre- 技 Processing 据数大国中 18 20

26 . Mega-KV System Framework ） TC BD （ Network processing, 会 Memory management 大术 Pre- 技 Processing 据数大 Parallel Processing 国 in GPU 中 18 20

27 . Mega-KV System Framework ） TC BD （ Network processing, 会 Memory management 大术 Pre- 技 Processing 据数大 Post- Parallel Processing 国 Processing in GPU 中 18 Read & Send Value 20

28 .Challenges of Offloading Index Operations to GPUs ） TC BD 1. GPU’s memory capacity is small: ~10 GB （ – Keys may take hundreds of GBs 会 2. Low PCIe bandwidth 大 – PCIe is generally the bottleneck of GPU if large bulk of data needs to be transferred 术技 3. Handling variable-length input is inefficient for GPUs 据 – Search along the input buffer or transfer another offset buffer 数 – Variable-length string comparison 大国 4. Linked list in GPUs is inefficient 中 – The locks to insert/delete linked items would hinder parallelism 18 – more random memory accesses 20

29 . Our Solutions ） TC BD Input data (Keys) Index （ C C C C 会大术 Compress 技据数 C C C C key 大国中 GPU optimized cuckoo hash table that Compressed fixed-length signatures 18 stores key signatures and value locations Address challenges 1, 3, and 4 Address challenges 2, 3 20 (GPU memory capacity and variable length (PCIe bandwidth and variable length data) data) 2 9

1点赞

0收藏

7下载