Graphics Hardware
1 .Graphics Hardware CMSC 435/634
2 .A Graphics Pipeline
[Diagram: Vertex, Triangle, and Fragment stages; operations: Transform, Shade, Clip, Project, Rasterize, Interpolate, Texture, Z-buffer]
3 .Computation and Bandwidth
[Diagram: Vertex, Triangle, Fragment, and Texture, with labeled rates: 67 GFLOPS, 1.1 TFLOPS, 75 GB/s, 13 GB/s, 335 GB/s (Texture), 45 GB/s]
Based on:
• 100 Mtri/sec (1.6M/frame @ 60 Hz)
• 256 bytes vertex data
• 128 bytes interpolated
• 68 bytes fragment output
• 5x depth complexity
• 16 4-byte textures
• 223 ops/vert
• 1664 ops/frag
• No caching
• No compression
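Several of the rates on this slide can be reproduced from the listed assumptions. The sketch below is one reading of the numbers, not the slide author's own derivation. It assumes three unshared vertices per triangle and that interpolated data is counted once per triangle; neither is stated on the slide, but both match the quoted figures. The fragment rate is inferred from the 1.1 TFLOPS figure rather than from a screen resolution. All names are illustrative.

```python
# Inputs copied from the slide's "Based on" list.
TRI_PER_SEC = 100e6
VERTEX_BYTES = 256
INTERP_BYTES = 128
FRAGMENT_OUT_BYTES = 68
OPS_PER_VERTEX = 223
OPS_PER_FRAGMENT = 1664

# Assumption (not on the slide): no vertex sharing, so 3 vertices per triangle.
VERTS_PER_TRI = 3


def vertex_side():
    """Vertex compute and bandwidth implied by the triangle rate."""
    verts_per_sec = TRI_PER_SEC * VERTS_PER_TRI
    return {
        "vertex_flops": verts_per_sec * OPS_PER_VERTEX,         # ~67 GFLOPS
        "vertex_bytes_per_sec": verts_per_sec * VERTEX_BYTES,   # ~75 GB/s
        # Assumption: interpolated data is counted once per triangle.
        "interp_bytes_per_sec": TRI_PER_SEC * INTERP_BYTES,     # ~13 GB/s
    }


def fragment_side(fragment_flops=1.1e12):
    """Fragment rate inferred from the slide's fragment FLOPS figure."""
    frags_per_sec = fragment_flops / OPS_PER_FRAGMENT
    return {
        "frags_per_sec": frags_per_sec,
        "fragment_out_bytes_per_sec": frags_per_sec * FRAGMENT_OUT_BYTES,  # ~45 GB/s
    }
```

The texture figure is left out because it depends on a filtering assumption the slide does not give.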
4 .Data Parallel
[Diagram: Distribute, then four parallel Tasks, then Merge]
5 .Sort First
• Distribute objects by screen tile
• Each Vertex, Triangle, Fragment pipeline handles some objects and some pixels
[Diagram: Objects to Screen]
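A minimal sketch of the sort-first distribution step, assuming each object already has a known screen-space bounding box (in practice this needs some transformation work up front). The function names and the bounding-box representation are hypothetical. An object that overlaps several tiles is sent to every pipeline that owns one of them.

```python
def tiles_for_bbox(bbox, tile_w, tile_h, tiles_x, tiles_y):
    """Return the (tx, ty) tiles overlapped by a screen-space bounding box."""
    x0, y0, x1, y1 = bbox
    tx0 = max(0, int(x0 // tile_w))
    ty0 = max(0, int(y0 // tile_h))
    tx1 = min(tiles_x - 1, int(x1 // tile_w))
    ty1 = min(tiles_y - 1, int(y1 // tile_h))
    # Empty ranges (object entirely off screen) yield no tiles.
    return [(tx, ty) for ty in range(ty0, ty1 + 1) for tx in range(tx0, tx1 + 1)]


def sort_first(objects, screen_w, screen_h, tiles_x, tiles_y):
    """Bin objects by screen tile; each bin feeds one full pipeline."""
    tile_w = screen_w / tiles_x
    tile_h = screen_h / tiles_y
    bins = {(tx, ty): [] for ty in range(tiles_y) for tx in range(tiles_x)}
    for name, bbox in objects.items():
        for tile in tiles_for_bbox(bbox, tile_w, tile_h, tiles_x, tiles_y):
            bins[tile].append(name)
    return bins
```

For example, on an 800x600 screen split into 2x2 tiles, an object with bounding box (300, 100, 700, 400) lands in all four bins, so all four pipelines process it.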
6 .Sort Middle
• Distribute objects or vertices
• Vertex units each handle some objects
• Merge & redistribute by screen location
• Triangle and Fragment units each handle some pixels
[Diagram: Objects to Screen]
7 .Screen Subdivision
• Tiled
• Interleaved
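The two subdivision schemes can be sketched as pixel-to-processor mappings. This assumes "tiled" means one large contiguous block per processor and "interleaved" means small tiles dealt out in a repeating pattern; the function names and the `tile` parameter are illustrative, not from the slides.

```python
def tiled_owner(x, y, screen_w, screen_h, procs_x, procs_y):
    """Tiled: each processor owns one large contiguous block of the screen."""
    block_w = -(-screen_w // procs_x)  # ceiling division
    block_h = -(-screen_h // procs_y)
    return (y // block_h) * procs_x + (x // block_w)


def interleaved_owner(x, y, procs_x, procs_y, tile=1):
    """Interleaved: small tiles are assigned to processors in a repeating pattern."""
    return ((y // tile) % procs_y) * procs_x + ((x // tile) % procs_x)


def pixels_per_owner(owner_fn, rect, n_procs):
    """Count how many pixels of a half-open rect (x0, y0, x1, y1) each processor gets."""
    x0, y0, x1, y1 = rect
    counts = [0] * n_procs
    for y in range(y0, y1):
        for x in range(x0, x1):
            counts[owner_fn(x, y)] += 1
    return counts
```

On an 8x8 screen with 2x2 processors, a 4x4 primitive in the top-left corner falls entirely on one processor under the tiled mapping, while the interleaved mapping spreads it evenly over all four. That load-balancing difference is the usual reason to interleave.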
8 .Sort Last
• Distribute by object
• Each Vertex, Triangle, Fragment pipeline renders some objects to the full screen
• Z-merge
[Diagram: Objects to Screen]
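The Z-merge step can be sketched as a per-pixel depth comparison across the full-screen images produced by each pipeline. This is a minimal illustration, assuming each partial result is a (color, depth) pair of equally sized 2D lists, that smaller depth means closer, and that untouched pixels carry infinite depth. The data layout is hypothetical.

```python
import math


def z_merge(partials):
    """Combine full-screen (color, depth) images, keeping the nearest sample per pixel."""
    height = len(partials[0][0])
    width = len(partials[0][0][0])
    color = [[None] * width for _ in range(height)]
    depth = [[math.inf] * width for _ in range(height)]
    for part_color, part_depth in partials:
        for y in range(height):
            for x in range(width):
                # Assumption: smaller depth is closer to the viewer.
                if part_depth[y][x] < depth[y][x]:
                    depth[y][x] = part_depth[y][x]
                    color[y][x] = part_color[y][x]
    return color, depth
```

Every pipeline ships a whole frame of color and depth to the merge, so the merge bandwidth grows with the number of pipelines regardless of how many pixels each one actually touched.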
9 .Graphics Processing Unit (GPU)
• Sort Middle(ish)
• Fixed-function HW for clip/cull, raster, texturing, Z-test
• Programmable stages
• Commands in, pixels out
10 .GPU Computation
[Diagram: Vertex, Triangle, and Pixel stages; labels: Pipeline, Parallel, More Pipeline, More Parallel, …]
11 .Architecture: Latency
CPU: Make one thread go very fast
• Avoid the stalls
• Branch prediction
• Out-of-order execution
• Memory prefetch
• Big caches
GPU: Make 1000 threads go very fast
• Hide the stalls
• HW thread scheduler
• Swap threads to hide stalls
12 .Architecture (MIMD vs SIMD)
[Diagram: MIMD (CPU-like) with several CTRL units, each driving its own ALUs; SIMD (GPU-like) with one CTRL unit driving many ALUs; compared on Flexibility, Horsepower, Ease of Use]
13 .SIMD Branching
if( x )
    // mask threads
    // issue instructions
else
    // invert mask
    // issue instructions
// unmask
• Threads agree: issue if
• Threads disagree: issue if AND else
• Threads agree: issue else
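The masking scheme above can be simulated on the CPU. This is a sketch of the idea, not of any particular GPU's instruction set: each list element stands for one SIMD lane, and the `issued` list records which sides of the branch the group had to execute. All names are illustrative.

```python
def simd_if_else(cond, then_fn, else_fn, values):
    """Run an if/else over all lanes using an execution mask."""
    mask = list(cond)                      # mask threads
    out = list(values)
    issued = []
    if any(mask):                          # at least one lane takes the if side
        issued.append("if")
        out = [then_fn(v) if m else o for v, m, o in zip(values, mask, out)]
    inverted = [not m for m in mask]       # invert mask
    if any(inverted):                      # at least one lane takes the else side
        issued.append("else")
        out = [else_fn(v) if m else o for v, m, o in zip(values, inverted, out)]
    return out, issued                     # unmask: all lanes active again
```

When every lane agrees, only one side is issued; when lanes disagree, both sides are issued and each lane ignores the side it did not take, so a divergent branch costs the sum of both paths.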
14 .SIMD Looping
while(x)
    // update mask
    // do stuff
They all run 'til the last one's done…
[Diagram: iterations marked Useful vs. Useless]
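The cost of a divergent loop can be counted with a small simulation. In this sketch each lane has its own trip count, the group keeps iterating until the slowest lane finishes, and masked-off lanes contribute useless slots. The function name and return layout are illustrative.

```python
def simd_while(trip_counts):
    """Return (passes, useful_slots, useless_slots) for one SIMD group."""
    remaining = list(trip_counts)
    passes = 0
    useful = 0
    while any(r > 0 for r in remaining):       # run until the last lane is done
        mask = [r > 0 for r in remaining]      # update mask
        useful += sum(mask)                    # only active lanes do real work
        remaining = [r - 1 if m else r for r, m in zip(remaining, mask)]  # do stuff
        passes += 1
    total_slots = passes * len(remaining)
    return passes, useful, total_slots - useful
```

With trip counts of 1, 2, 8, and 3, the group runs 8 passes and 18 of its 32 lane slots are wasted; with four lanes that all loop 4 times, nothing is wasted.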
15 .GPU graphics processing model
[Diagram: CPU → Vertex → Geometry → Rasterize → Fragment → Z-Buffer → Displayed Pixels, with Texture/Buffer memory]
16 .NVIDIA GeForce 6 [Kilgariff and Fernando, GPU Gems 2]
[Diagram: Vertex → Rasterize → Fragment → Z-Buffer → Displayed Pixels]
17 .NVIDIA GeForce 6 [Kilgariff and Fernando, GPU Gems 2]
[Diagram: Vertex → Rasterize → Fragment → Z-Buffer → Displayed Pixels]
18 .GPU graphics processing model
[Diagram: CPU → Vertex → Geometry → Rasterize → Fragment → Z-Buffer → Displayed Pixels, with Texture/Buffer memory]
19 .AMD/ATI R600 Dispatch
20 .SIMD Units
• 2x2 Quads (4 per SIMD)
• 20 ALU/Quad (5 per thread)
• "Wavefront" of 64 threads, executed over 8 clocks
• 2 waves interleaved
• Interleaving + multi-cycling hides ALU latency
• Wavefront switching hides memory latency
• GPR usage determines wavefront count
• General Purpose Registers: 4x32-bit (THOUSANDS of them)
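The per-SIMD numbers on this slide fit together arithmetically. The sketch below shows one reading that is consistent with them; it is an interpretation, not something the slide spells out. It assumes a wavefront is multi-cycled across the SIMD's thread lanes and that the "8 clocks" figure covers the two interleaved wavefronts together.

```python
# Values taken from the slide.
QUADS_PER_SIMD = 4
THREADS_PER_QUAD = 4      # a 2x2 quad
ALUS_PER_QUAD = 20
WAVEFRONT_THREADS = 64
INTERLEAVED_WAVES = 2


def r600_simd_numbers():
    """Derive lane, ALU, and clock counts for one SIMD unit."""
    lanes = QUADS_PER_SIMD * THREADS_PER_QUAD             # threads issued per clock
    alus = QUADS_PER_SIMD * ALUS_PER_QUAD                 # ALUs in the SIMD
    alus_per_thread = ALUS_PER_QUAD // THREADS_PER_QUAD   # matches "5 per thread"
    # Assumption: a wavefront is multi-cycled over the available lanes.
    clocks_per_wave = WAVEFRONT_THREADS // lanes
    # Assumption: "8 clocks" is the period of two interleaved wavefronts.
    clocks_per_pair = clocks_per_wave * INTERLEAVED_WAVES
    return {
        "lanes": lanes,
        "alus": alus,
        "alus_per_thread": alus_per_thread,
        "clocks_per_wave": clocks_per_wave,
        "clocks_per_pair": clocks_per_pair,
    }
```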
21 .AMD/ATI R600 [Tom's Hardware]
22 .NVIDIA Maxwell [NVIDIA, NVIDIA GeForce GTX 980 Whitepaper, 2014]
23 .Maxwell SIMD Processing Block
• 32 Cores
• 8 Special Function
• NVIDIA terminology: Warp = interleaved threads
• Want at least 4-8
• Thread Block = Warps*Cores
• Flexible registers: trade registers for warps
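The registers-for-warps trade can be sketched as a simple division. The limits used here (65,536 32-bit registers and at most 64 resident warps per multiprocessor, 32 threads per warp) are assumptions taken from published figures for this GPU generation, not from the slide, and the sketch ignores register allocation granularity. The function name is illustrative.

```python
def resident_warps(regs_per_thread, regfile=65536, warp_size=32, max_warps=64):
    """Estimate how many warps fit, given per-thread register usage.

    regfile, warp_size, and max_warps are assumed hardware limits,
    not values stated on the slide.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    return min(max_warps, regfile // (regs_per_thread * warp_size))
```

Under these assumptions a kernel using 32 registers per thread can keep 64 warps resident, 64 registers halves that to 32, and 128 registers leaves 16, which reduces the pool of warps available to swap in when one stalls.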
24 .Maxwell Streaming Multiprocessor (SMM)
• 4 SIMD blocks
• Share L1 caches
• Share memory
• Share tessellation HW
25 .Maxwell Graphics Processing Cluster (GPC)
• 4 SMM
• Share raster
26 .Full NVIDIA Maxwell
• 4 GPC
• Share L2
• Share dispatch
27 .NVIDIA Maxwell Stats
• 16 SM * 4 SIMD blocks * 32 cores (2048 total)
• 4.6 TFLOPS, 144 Gtex/s, 224 GB/s
• 2 MB L2 cache
• Compress between memory and L2: saves 25%
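The core count and peak FLOPS on this slide follow from the hierarchy on the preceding slides. In the sketch below, the 1.126 GHz clock and the two floating-point operations per core per clock (one fused multiply-add) are assumptions not given on the slide; they are chosen because they reproduce the quoted 4.6 TFLOPS.

```python
def maxwell_totals(sms=16, blocks_per_sm=4, cores_per_block=32,
                   clock_hz=1.126e9, flops_per_core_per_clock=2):
    """Return (total cores, peak FLOPS).

    clock_hz and flops_per_core_per_clock are assumed values,
    not taken from the slide.
    """
    cores = sms * blocks_per_sm * cores_per_block
    peak_flops = cores * flops_per_core_per_clock * clock_hz
    return cores, peak_flops
```

With the defaults this gives 2048 cores and roughly 4.6e12 FLOPS.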
28 .GPU Performance Tips
29 .Graphics System Architecture
[Diagram: Your Code → API → Driver → GPU(s) → Display]
• Your code produces commands; the GPU consumes them
• Current frame: buffering commands
• Previous frame(s): submitted, pending execution