13-Parallel Performance Methods and Technologies
Parallel Performance Methods and Technologies
Parallel Computing CIS 410/510
Department of Computer and Information Science
Lecture 13 – Parallel Performance Methods
Introduction to Parallel Computing, University of Oregon, IPCC
Parallel Performance and Complexity
- To use a scalable parallel computer well, you must write high-performance parallel programs
- To get high-performance parallel programs, you must understand and optimize performance for the combination of programming model, algorithm, language, platform, ...
- Unfortunately, parallel performance measurement, analysis, and optimization is not an easy process
- Parallel performance is complex
Parallel Performance Evaluation
- Study of performance in parallel systems
  - Models and behaviors
  - Evaluative techniques
- Evaluation methodologies
  - Analytical modeling and statistical modeling
  - Simulation-based modeling
  - Empirical measurement, analysis, and modeling
- Purposes
  - Planning
  - Diagnosis
  - Tuning
Parallel Performance Engineering and Productivity
- Scalable, optimized applications deliver on the promise of HPC
- Optimization happens through a performance engineering process
  - Understand performance complexity and inefficiencies
  - Tune the application to run optimally on high-end machines
- How do we make the process more effective and productive?
- What performance technology should be used?
  - Performance technology is part of a larger environment
  - Programmability, reusability, portability, robustness
  - Application development and optimization productivity
- The process, the performance technology, and its use will change as parallel systems evolve
- The goal is to deliver effective performance with high productivity value, now and in the future
Motivation
- Parallel / distributed systems are complex
  - Four layers:
    - application: algorithm, data structures
    - parallel programming interface / middleware: compiler, parallel libraries, communication, synchronization
    - operating system: process and memory management, I/O
    - hardware: CPU, memory, network
- Mapping/interaction between the different layers
Performance Factors
- Factors which determine a program's performance are complex, interrelated, and sometimes hidden
- Application-related factors
  - Algorithms, dataset sizes, task granularity, memory usage patterns, load balancing, I/O and communication patterns
- Hardware-related factors
  - Processor architecture, memory hierarchy, I/O, network
- Software-related factors
  - Operating system, compiler/preprocessor, communication protocols, libraries
Utilization of Computational Resources
- Resources can be under-utilized or used inefficiently
  - Identifying these circumstances can give clues to where performance problems exist
- Resources may be "virtual"
  - Not actually a physical resource (e.g., thread, process)
- Performance analysis tools are essential to optimizing an application's performance
  - They can assist you in understanding what your program is "really doing"
  - They may provide suggestions for how program performance should be improved
Performance Analysis and Tuning: The Basics
- The most important goal of performance tuning is to reduce a program's wall-clock execution time
  - An iterative process to optimize efficiency
  - Efficiency is defined as a relationship of execution time (less time for the same work means higher efficiency)
- So, where does the time go?
- Find your program's hot spots and eliminate the bottlenecks in them (see the timing sketch below)
  - Hot spot: an area of code within the program that uses a disproportionately high amount of processor time
  - Bottleneck: an area of code within the program that uses processor resources inefficiently and therefore causes unnecessary delays
- Understand what, where, and how time is being spent
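To make "where does the time go" concrete, here is a minimal wall-clock timing sketch in C (not from the slides; hot_spot_kernel is a hypothetical stand-in for real application code). Bracketing a suspected hot spot with a monotonic clock is the simplest way to check whether it dominates execution time:

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for a suspected hot spot. */
static void hot_spot_kernel(double *a, int n) {
    for (int i = 0; i < n; i++)
        a[i] = a[i] * 2.0 + 1.0;
}

static double wall_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);   /* continuously running wall clock */
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    enum { N = 1 << 20 };
    static double a[N];

    double t0 = wall_seconds();
    hot_spot_kernel(a, N);
    double t1 = wall_seconds();

    /* If this region accounts for most of the total runtime, it is a
     * hot spot worth tuning. */
    printf("hot_spot_kernel: %.6f s\n", t1 - t0);
    return 0;
}
```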
Sequential Performance
- Sequential performance is all about:
  - How time is distributed
  - What resources are used, where, and when
- "Sequential" factors
  - Computation
    - choosing the right algorithm is important
    - compilers can help
  - Memory systems: cache and memory
    - more difficult to assess and determine effects
    - modeling can help
  - Input / output
Parallel Performance
- Parallel performance is about sequential performance AND parallel interactions
  - Sequential performance is the performance within each thread of execution
  - "Parallel" factors lead to overheads
    - concurrency (threading, processes)
    - interprocess communication (message passing)
    - synchronization (both explicit and implicit)
  - Parallel interactions also lead to parallelism inefficiency
    - load imbalance
Sequential Performance Tuning
- Sequential performance tuning is a time-driven process
- Find the thing that takes the most time and make it take less time (i.e., make it more efficient)
- May lead to program restructuring
  - Changes in data storage and structure
  - Rearrangement of tasks and operations
- May look for opportunities for better resource utilization
  - Cache management is a big one
  - Locality, locality, locality! (see the loop-order sketch below)
  - Virtual memory management may also pay off
- May look for opportunities for better processor usage
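To make the locality point concrete, here is a minimal sketch (not from the slides) of the classic loop-order fix: the same summation traverses a row-major matrix either along rows, which is cache-friendly in C, or down columns, which touches a new cache line on nearly every access for large matrices.

```c
#include <stdio.h>
#include <stddef.h>

/* Cache-unfriendly: strides down the columns of a row-major array. */
static double sum_column_major(const double *a, size_t n) {
    double s = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            s += a[i * n + j];
    return s;
}

/* Cache-friendly: walks memory contiguously, so each cache line
 * fetched is fully used before being evicted. */
static double sum_row_major(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += a[i * n + j];
    return s;
}

int main(void) {
    enum { N = 1024 };
    static double a[N * N];
    /* Same result, very different cache behavior for large N. */
    printf("%f %f\n", sum_row_major(a, N), sum_column_major(a, N));
    return 0;
}
```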
Parallel Performance Tuning
- In contrast to sequential performance tuning, parallel performance tuning might be described as conflict-driven or interaction-driven
- Find the points of parallel interaction and determine the overheads associated with them
- Overheads can be the cost of performing the interactions
  - Transfer of data
  - Extra operations to implement coordination
- Overheads also include time spent waiting (see the imbalance sketch below)
  - Lack of work
  - Waiting for a dependency to be satisfied
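One hedged way to see "time spent waiting" is to time each thread's share of a deliberately imbalanced parallel loop; threads that finish early wait at the barrier that ends the region. This sketch assumes an OpenMP compiler (e.g., gcc -fopenmp); the triangular work distribution is a contrived example, not a real application.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    enum { N = 20000 };
    double sink = 0.0;

    #pragma omp parallel reduction(+:sink)
    {
        double t0 = omp_get_wtime();
        /* Contrived triangular workload: iteration i does i units of work,
         * so a static schedule gives later threads far more to do. */
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < N; i++)
            for (int j = 0; j < i; j++)
                sink += 1e-9 * j;
        double busy = omp_get_wtime() - t0;
        /* Threads with a small 'busy' time spend the difference waiting
         * at the barrier ending the parallel region: pure overhead. */
        printf("thread %d: %.4f s busy\n", omp_get_thread_num(), busy);
    }
    printf("checksum %.3f\n", sink);   /* keep the work live */
    return 0;
}
```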
Parallel Performance Engineering Process
[Flattened process diagram: the slide shows a cycle through the stages Preparation, Program Implementation, Measurement, Performance Analysis, Ranking, Refinement/Tuning, and Production.]
Performance Observability (Guiding Thesis)
- Performance evaluation problems define the requirements for performance analysis methods
- Performance observability is the ability to "accurately" capture, analyze, and present (collectively, observe) information about computer system/software performance
- Tools for performance observability must balance the need for performance data against the cost of obtaining it (environment complexity, performance intrusion)
  - Too little performance data makes analysis difficult
  - Too much data perturbs the measured system
- It is important to understand performance observability complexity and develop technology to address it
Parallel Performance Engineering Process
- Traditionally an empirically-based approach: observation → experimentation → diagnosis → tuning
- Performance technology is developed for each level
[Flattened diagram, reconstructed: a layered stack in which each stage produces a performance product and is supported by performance technology:
  - Performance Observation
  - Performance Experimentation (performance characterization): instrumentation, measurement, analysis, visualization
  - Performance Diagnosis (performance properties): experiment management, performance storage
  - Performance Tuning (performance hypotheses): data mining, models, expert systems]
Performance Analysis and Optimization Cycle
- Instrumentation: insertion of extra code (probes, hooks) into the application
- Measurement: collection of data relevant to performance analysis
- Analysis: calculation of metrics, identification of performance problems
- Presentation: transformation of the results into a representation that can be easily understood by a human user
- Optimization: elimination of performance problems
Performance Metrics and Measurement
- Observability depends on measurement
- What is able to be observed and measured?
- A metric represents a type of measured data
  - Count: how often something occurred
    - calls to a routine, cache misses, messages sent, ...
  - Duration: how long something took
    - execution time of a routine, message communication time, ...
  - Size: how big something is
    - message size, memory allocated, ...
- A measurement records performance data
- Certain quantities cannot be measured directly
  - Derived metric: calculated from other metrics (see the sketch below)
    - rates of something (e.g., flops per second) are one example
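As an illustrative sketch (the operation count here is computed by hand, not measured by a hardware counter), a derived metric such as a FLOP rate combines a count metric with a duration metric:

```c
#include <stdio.h>
#include <time.h>

int main(void) {
    enum { N = 1 << 22 };
    static double x[N], y[N];

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        y[i] = 2.5 * x[i] + y[i];   /* 2 floating-point operations per iteration */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double duration = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double flops = 2.0 * N;         /* count metric, known analytically here */

    /* Derived metric: rate = count / duration. */
    printf("%.1f MFLOP/s (check %f)\n", flops / duration / 1e6, y[N - 1]);
    return 0;
}
```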
Performance Benchmarking
- Benchmarking typically involves the measurement of metrics for a particular type of evaluation
  - Standardize on an experimentation methodology
  - Standardize on a collection of benchmark programs
  - Standardize on a set of metrics
- Examples: High-Performance Linpack (HPL), used for the Top 500 list; the NAS Parallel Benchmarks; SPEC
- Benchmarks typically look at MIPS and FLOPS
How Is Time Measured?
- How do we determine where the time goes?
  "A person with one clock knows what time it is; a person with two clocks is never sure." Confucius (attributed)
- Clocks are not all the same
  - They have different resolutions and overheads for access
- Time is an abstraction based on a clock
  - Only as good (accurate) as the clock we use
  - Only as good as what we use it for
Execution Time
- There are different types of time
- Wall-clock time (contrasted with CPU time in the sketch below)
  - Based on a real-time clock that runs continuously
  - Includes time spent in all activities
- Virtual process time (aka CPU time)
  - Time when the process is executing (CPU is active)
    - user time and system time (which can mean different things)
  - Does not include time when the process is inherently waiting
- Parallel execution time
  - Runs whenever any parallel part is executing
  - Requires a global time basis to be defined
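A minimal POSIX sketch of the wall-clock vs. CPU-time distinction: during a sleep, the wall clock keeps running while the process's CPU-time clock (CLOCK_PROCESS_CPUTIME_ID) essentially stops, because the process is inherently waiting.

```c
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double secs(clockid_t id) {
    struct timespec ts;
    clock_gettime(id, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    double w0 = secs(CLOCK_MONOTONIC);           /* wall clock: always running */
    double c0 = secs(CLOCK_PROCESS_CPUTIME_ID);  /* CPU time: advances only while we execute */

    sleep(1);   /* inherently waiting: wall time advances, CPU time barely does */

    printf("wall: %.3f s, cpu: %.3f s\n",
           secs(CLOCK_MONOTONIC) - w0,
           secs(CLOCK_PROCESS_CPUTIME_ID) - c0);
    return 0;
}
```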
Observation Types
- There are two types of performance observation, which lead to different measurement methods
  - Direct performance observation
  - Indirect performance observation
- Direct performance observation is based on a scientific theory of measurement that considers the cost (overhead) with respect to accuracy
- Indirect performance observation is based on a sampling theory of measurement that assumes some degree of statistical stationarity
Direct Performance Observation
- Execution actions are exposed as events
  - In general, actions reflect some execution state
    - presence at a code location or a change in data
    - occurrence in a parallelism context (thread of execution)
  - Events encode actions for observation
- Observation is direct
  - Direct instrumentation of the program code (probes), sketched below
  - Instrumentation invokes performance measurement
  - Event measurement = performance data + context
- Performance experiment
  - Actual events + performance measurements
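The sketch below shows hand-inserted probes in the spirit of direct instrumentation. The probe API (event_begin/event_end) is invented here for illustration and is not any real tool's interface; each measurement pairs performance data (elapsed time) with context (the event name).

```c
#include <stdio.h>
#include <time.h>

/* Invented probe API: records one interval event (a single region,
 * for simplicity). */
static struct { const char *name; double start, total; } probe;

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void event_begin(const char *name) { probe.name = name; probe.start = now(); }
static void event_end(void)               { probe.total += now() - probe.start; }

static void solver_step(double *a, int n) {
    event_begin("solver_step");   /* probe inserted at routine entry */
    for (int i = 1; i < n; i++)
        a[i] = 0.5 * (a[i] + a[i - 1]);
    event_end();                  /* probe inserted at routine exit */
}

int main(void) {
    static double a[1 << 16];
    for (int step = 0; step < 100; step++)
        solver_step(a, 1 << 16);
    /* Event measurement = performance data (time) + context (event name). */
    printf("%s: %.6f s total\n", probe.name, probe.total);
    return 0;
}
```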
Indirect Performance Observation
- Program code instrumentation is not used
- Performance is observed indirectly
  - Execution is interrupted
    - can be triggered by different events
  - Execution state is queried (sampled)
    - different performance data can be measured
  - This is event-based sampling (EBS), sketched below
- Performance attribution is inferred
  - Determined by execution context (state)
  - Observation resolution is determined by the interrupt period
  - Performance data is associated with the context for the period
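A stripped-down sampler in this style can be sketched with POSIX interval timers: a SIGPROF signal periodically interrupts execution, and the handler attributes the preceding period to the current execution context. A real profiler would read the interrupted program counter; to stay portable, this sketch attributes samples to a phase flag instead. The phase split and the burn function are hypothetical.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>

/* Execution context queried at each sample; a real profiler would read
 * the interrupted PC from the signal's ucontext instead. */
static volatile sig_atomic_t current_phase;   /* 0 or 1 */
static volatile long samples[2];

static void on_sample(int sig) {
    (void)sig;
    samples[current_phase]++;   /* bill this period to the current context */
}

static double burn(long iters) {
    double s = 0.0;
    for (long i = 0; i < iters; i++) s += 1e-9 * i;
    return s;
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_handler = on_sample;
    sigaction(SIGPROF, &sa, NULL);

    struct itimerval it = { {0, 10000}, {0, 10000} };  /* 10 ms sampling period */
    setitimer(ITIMER_PROF, &it, NULL);

    current_phase = 0; double a = burn(300000000L);    /* ~3x the work of phase 1 */
    current_phase = 1; double b = burn(100000000L);

    struct itimerval off = { {0, 0}, {0, 0} };
    setitimer(ITIMER_PROF, &off, NULL);                /* stop sampling */

    /* Sample counts approximate the time split between the two phases. */
    printf("phase 0: %ld samples, phase 1: %ld samples (check %f)\n",
           samples[0], samples[1], a + b);
    return 0;
}
```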
Direct Observation: Events
- Event types
  - Interval events (begin/end events)
    - measure performance between begin and end
    - metrics monotonically increase
  - Atomic events (see the sketch below)
    - used to capture performance data state
- Code events
  - Routines, classes, templates
  - Statement-level blocks, loops
- User-defined events
  - Specified by the user
- Abstract mapping events
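To contrast atomic events with the interval events sketched earlier, here is a hedged illustration: an atomic event records a value at a single point in execution, with no matching end. The atomic_event recorder and the message sizes are invented for this sketch.

```c
#include <stdio.h>

/* Invented atomic-event recorder: logs a named value at one point in
 * execution (no begin/end pair), capturing performance data state. */
static void atomic_event(const char *name, long value) {
    printf("ATOMIC %-22s %ld\n", name, value);
}

int main(void) {
    long bytes_sent = 0;
    for (int msg = 0; msg < 4; msg++) {
        long size = 1024L << msg;               /* hypothetical message sizes */
        bytes_sent += size;
        atomic_event("message_size", size);     /* state at this send */
    }
    atomic_event("total_bytes_sent", bytes_sent);
    return 0;
}
```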
Direct Observation: Instrumentation
- Events are defined by instrumentation access
- Instrumentation levels
  - Source code
  - Library code
  - Object code
  - Executable code
  - Runtime system
  - Operating system
- Different levels provide different information / semantics
- Different tools are needed for each level
- Instrumentation at multiple levels is often required
Direct Observation: Techniques
- Static instrumentation
  - Program is instrumented prior to execution
- Dynamic instrumentation
  - Program is instrumented at runtime
- Manual and automatic mechanisms exist
- A tool is required for automatic support
  - Source time: preprocessor, translator, compiler
  - Link time: wrapper library, preload (see the --wrap sketch below)
  - Execution time: binary rewrite, dynamic instrumentation
- Each approach has advantages and disadvantages
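As one concrete instance of link-time wrapping, the GNU linker's --wrap option redirects calls to a symbol through a wrapper without touching the source of the call sites. A hedged sketch wrapping malloc (built with something like gcc prog.c -Wl,--wrap=malloc); the counting logic is illustrative, not a real tool.

```c
#include <stdio.h>
#include <stdlib.h>

/* Provided by the linker when built with -Wl,--wrap=malloc:
 * __real_malloc resolves to the original allocator. */
void *__real_malloc(size_t size);

static size_t alloc_calls, alloc_bytes;

/* Every malloc call in the program is redirected here at link time. */
void *__wrap_malloc(size_t size) {
    alloc_calls++;
    alloc_bytes += size;
    return __real_malloc(size);
}

int main(void) {
    for (int i = 0; i < 10; i++)
        free(malloc(1000));      /* instrumented transparently */
    fprintf(stderr, "malloc: %zu calls, %zu bytes\n", alloc_calls, alloc_bytes);
    return 0;
}
```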
Indirect Observation: Events/Triggers
- Events are actions external to the program code
  - Timer countdown, hardware counter overflow, ...
  - A consequence of program execution
  - Event frequency is determined by: type, setup, number enabled (exposed)
- Triggers are used to invoke the measurement tool
  - Traps when events occur (interrupt)
  - Associated with events
  - May add differentiation to events
Indirect Observation: Context
- When events trigger, the execution context is determined at the time of the trap (interrupt)
  - Access to the PC from the interrupt frame (see the sketch below)
  - Access to information about the process/thread
  - Possible access to the call stack
    - requires a call-stack unwinder
- The assumption is that the context was the same during the preceding period
  - i.e., between successive triggers
  - A statistical approximation that is valid for long-running programs, assuming repeated behavior
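A hedged sketch of reading the PC from the interrupt frame: with SA_SIGINFO, the handler receives a ucontext from which the interrupted program counter can be extracted. The REG_RIP register name is Linux/x86-64 specific (other platforms name it differently), and fprintf in a signal handler is not async-signal-safe; both shortcuts are acceptable for a sketch, not for production.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>
#include <ucontext.h>

/* Handler reads the interrupted program counter from the trap frame. */
static void on_trap(int sig, siginfo_t *si, void *uc) {
    (void)sig; (void)si;
    ucontext_t *ctx = (ucontext_t *)uc;
    void *pc = (void *)ctx->uc_mcontext.gregs[REG_RIP];   /* Linux/x86-64 only */
    /* A real profiler would map 'pc' to a function/line and bill the
     * whole preceding period to that context. */
    fprintf(stderr, "sample at pc=%p\n", pc);
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = on_trap;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigaction(SIGPROF, &sa, NULL);

    struct itimerval it = { {0, 100000}, {0, 100000} };   /* 100 ms period */
    setitimer(ITIMER_PROF, &it, NULL);

    volatile double s = 0.0;
    for (long i = 0; i < 400000000L; i++) s += 1e-9;      /* work to be sampled */
    printf("%f\n", s);
    return 0;
}
```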