Using VTune on NERSC Systems
To use VTune on a compute node on NERSC systems, you must request that the VTune Linux kernel module be loaded when you allocate the job.
For example, the following command requests a 30-minute interactive job on a single KNL node of Cori Phase II:
srun -N 1 -C knl -t 00:30:00 -p debug -d singleton --vtune --pty /bin/bash -l
The --vtune flag is a special option that the NERSC srun wrapper will accept.
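If you request such jobs often with different time limits, the command can be wrapped in a small shell function. This is only a sketch: `vtune_job_cmd` is a hypothetical helper name, it reuses exactly the flags from the example above, and it prints the command rather than running it so you can inspect it first.

```shell
# Hypothetical helper: print the srun command for an interactive VTune job.
# $1 = time limit in minutes (defaults to 30).
vtune_job_cmd() {
    minutes="${1:-30}"
    # Slurm accepts HH:MM:SS, so split the minutes into hours and minutes.
    printf 'srun -N 1 -C knl -t %02d:%02d:00 -p debug -d singleton --vtune --pty /bin/bash -l\n' \
        "$((minutes / 60))" "$((minutes % 60))"
}

vtune_job_cmd 30
```

Called with `30`, the function prints the same command shown above.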
When running on the compute nodes, you'll need to use the command-line VTune client to gather your data. Then, you can use either the command-line or GUI client to post-process and visualize the results from one of the login nodes or your personal machine.
First, make sure that you have VTune loaded into your software environment. Usually, this should be sufficient:
module load VTune
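A quick sanity check after loading the module is to confirm that the VTune command-line client, `amplxe-cl`, is actually on your PATH. A minimal sketch, where `require_tool` is a hypothetical helper name and not part of VTune or the module system:

```shell
# Hypothetical helper: succeed only if the named command is on PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1 (is the VTune module loaded?)" >&2
        return 1
    fi
}

require_tool amplxe-cl || echo "load the VTune module before continuing"
```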
The command-line VTune client is called 'amplxe-cl' on Linux. It has builtin help:
amplxe-cl -help
This gives you an overview of all the 'actions' the client supports. Actions are individual analysis tasks that the client can perform.
You can get help on a particular action via:
amplxe-cl -help <action>
For the workflow described in this document, we will use three 'actions':
amplxe-cl -collect # Run the application and collect a trace from it.
# If desired, transfer collected trace to a local machine.
amplxe-cl -finalize # Post-process a collected trace.
amplxe-cl -report # Display a summary of a completed trace.
The GUI VTune client will automatically finalize a trace when it opens it, so if you are using the GUI client, the steps are:
amplxe-cl -collect # Run the application and collect a trace from it.
# If desired, transfer collected trace to a local machine.
amplxe-gui # Post-process and then display a collected trace.
# Conceptually finalize + report.
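The optional transfer step is ordinary file copying: a collected result is a directory on disk, so it can be bundled with `tar` and moved with whatever tool you normally use. The sketch below assumes a result directory named `r000hs`, which is only a placeholder; a stand-in directory is created in a temporary location so the snippet runs anywhere.

```shell
# Stand-in for a collected VTune result directory (the name is a placeholder).
workdir="$(mktemp -d)"
mkdir -p "$workdir/r000hs"
echo "placeholder" > "$workdir/r000hs/data.txt"

# Bundle the whole result directory into one compressed archive for transfer.
tar -C "$workdir" -czf "$workdir/r000hs.tar.gz" r000hs

# List the archive contents to confirm everything was captured.
tar -tzf "$workdir/r000hs.tar.gz"

# On the destination machine: unpack, then open the result in the GUI.
#   tar -xzf r000hs.tar.gz
#   amplxe-gui r000hs
```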
VTune analyzes the performance of an application by executing it on real hardware and regularly pausing the execution to capture snapshots of the program state. The end result is a trace (a collection of snapshots) which can be used to estimate and analyze the behavior of the profiled run of the application.
The data gathered by these snapshots include stack traces (which can be used to attribute other types of data to a particular function or region of a function), hardware performance counter state and application/OS data (such as whether mutexes are locked or unlocked).
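To see how a collection of snapshots turns into a per-function breakdown, consider the toy aggregation below. It is a conceptual illustration only, not how VTune is implemented: the function names and sample counts are invented, and each line stands for one snapshot recording which function was executing.

```shell
# Each line stands for one snapshot: the function that was executing
# when the sample was taken (made-up names and counts).
samples='solve
solve
io_write
solve
main'

# Count samples per function and report each function's share of the total.
hotspot_table() {
    awk '{ n[$1]++; total++ }
         END { for (f in n) printf "%s %d %.0f%%\n", f, n[f], 100 * n[f] / total }' |
        LC_ALL=C sort -k2,2nr -k1,1
}

printf '%s\n' "$samples" | hotspot_table
# solve 3 60%
# io_write 1 20%
# main 1 20%
```

Three of the five samples landed in `solve`, so it is estimated to account for about 60% of the run. More samples give a better estimate, but also more disruption.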
Due to hardware limitations and the disruption to regular application behavior caused by gathering this data, it is often not feasible to gather all the metrics that you want at the sampling rate you want in a single trace. Additionally, maintaining performance metric data and collecting snapshots always affects the behavior of the application, and that effect changes depending on the types of data being gathered.
Thus, you will need to decide what type of data to gather and how it should be gathered when you are collecting a trace. There are two ways to specify the type of performance data and analysis you want to use with VTune:
- Select one of the high-level builtin analysis types provided by VTune (amplxe-cl -collect).
- Define a custom trace collection by explicitly selecting/defining the metrics you want to collect (amplxe-cl -collect-with).
This document focuses on the use of the builtin analysis types provided by amplxe-cl -collect. You can get a list of the builtin types that are available via:
amplxe-cl -help collect
The first argument to 'amplxe-cl -collect' should be the analysis type you want to use:
amplxe-cl -collect <analysis-type> <vtune args> <application> <application args>
By default, 'amplxe-cl' automatically finalizes results at the end of collection. In the workflow described here, post-processing happens later on a login node or your personal machine, so it's important to specify that finalization should be deferred:
amplxe-cl -collect <analysis-type> -finalization-mode=deferred <vtune args> <application> <application args>
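Putting the pieces together, the collect, finalize, and report steps for one run might look like the sketch below. It assumes a placeholder application `./my_app`, a result directory name chosen purely for illustration, and the `-r` (result directory) option and `summary` report type of `amplxe-cl`; check `amplxe-cl -help collect` and `amplxe-cl -help report` on your system before relying on them. The `run` wrapper is a hypothetical helper that only prints each command unless `DRY_RUN=0` is set.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

result_dir="hotspots_run1"   # placeholder result directory name
app="./my_app"               # placeholder application

# 1. On the compute node: collect, deferring finalization.
run amplxe-cl -collect hotspots -finalization-mode=deferred -r "$result_dir" -- "$app"

# 2. Later, on a login node or your own machine: finalize the result.
run amplxe-cl -finalize -r "$result_dir"

# 3. Print a summary report of the finalized result.
run amplxe-cl -report summary -r "$result_dir"
```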
'hotspots' and 'advanced-hotspots' are the simplest of the builtin analyses and disrupt your application the least. Both analyses provide a per-module/per-function/per-instruction breakdown of where your application spends its time.
- 'hotspots' collects snapshots of the application. It uses the system clock for timing and does not use any hardware performance counters; therefore, it will work even if the VTune Linux kernel module is not loaded on your system.
- 'advanced-hotspots' provides the same type of breakdown as 'hotspots', but additionally uses hardware performance counters to count cycles and instructions retired. These metrics are used to estimate the cycles-per-instruction (CPI) per module/function/instruction.
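CPI is simply the ratio of the two counters. As a worked example with made-up counter values (not measurements from any real run), a function that consumed 3,000,000 cycles while retiring 1,500,000 instructions has a CPI of 2.00. The `cpi` function below is a hypothetical helper for the arithmetic, not a VTune command.

```shell
# Compute cycles-per-instruction from two counter values.
# $1 = cycles, $2 = instructions retired (illustrative numbers only).
cpi() {
    awk -v cycles="$1" -v instr="$2" 'BEGIN { printf "%.2f\n", cycles / instr }'
}

cpi 3000000 1500000   # prints 2.00
```

A lower CPI means more instructions complete per cycle.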
'hotspots' and 'advanced-hotspots' will show you where your application is spending time, but not why. One of the most useful passes for exploring the why of performance is the 'general-exploration' pass. This analysis presents a large number of performance metrics derived from hardware counters, such as cache hit/miss rates, CPI and execution port usage. The 'general-exploration' pass can help you determine the bottleneck limiting a particular hotspot.
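Because each analysis type needs its own collection run, one possible pattern is to collect 'hotspots' first and 'general-exploration' afterwards, each into its own result directory. The sketch below rests on assumptions: `./my_app` is a placeholder application, the `vtune_<analysis>` directory naming is arbitrary, `-r` is used to name the result directory, and `collect_cmds` is a hypothetical helper that only prints the commands rather than executing them.

```shell
app="./my_app"   # placeholder application

# Print one collect command per analysis type, each with its own result directory.
collect_cmds() {
    for analysis in "$@"; do
        echo "amplxe-cl -collect $analysis -finalization-mode=deferred -r vtune_$analysis -- $app"
    done
}

collect_cmds hotspots general-exploration
```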