This chapter demonstrates a typical performance analysis workflow using Score-P. It consist of the following steps:

Program instrumentation (Section 'Program Instrumentation' )
Summary measurement collection (Section 'Summary Measurement Collection' )
Summary report examination (Section 'Summary report examination' )
Summary experiment scoring (Section 'Summary experiment scoring' )
Advanced summary measurement collection (Section 'Advanced summary measurement collection' )
Advanced summary report examination (Section 'Advanced summary report examination' )
Event trace collection and examination (Section 'Event trace collection and examination' )

The workflow is demonstrated using NPB BT-MZ benchmark as an example. BT-MZ solves a discretized version of unsteady, compressible Navier-Stokes equations in three spatial dimensions. It performs 200 time-steps on a regular 3-dimensional grid using ADI and verifies solution error within acceptable limit. It uses intra-zone computation with OpenMP and inter-zone communication with MPI. The benchmark can be build with a predefined data class (S,W,A,B,C,D,E,F) and any number of MPI processes and OpenMP threads.

NPB BT-MZ distribution already prepared for this example could be obtained from here.

Program Instrumentation

In order to collect performance measurements, BT-MZ has to be instrumented. There are various types of instrumentation supported by Score-P which cover a broad spectrum of performance analysis use cases (see Chapter 'Application Instrumentation' for more details).

In this example we start with automatic compiler instrumentation by prepending compiler/linker specification variable MPIF77 found in config/make.def with scorep:

#            SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS
#-----------------------------------------------------------------
# Items in this file may need to be changed for each platform.
#-----------------------------------------------------------------
...
#-----------------------------------------------------------------
# The Fortran compiler used for MPI programs
#-----------------------------------------------------------------
#MPIF77 = mpif77
# Alternative variants to perform instrumentation
...
MPIF77 = scorep mpif77
# This links MPI Fortran programs; usually the same as ${MPIF77}
FLINK   = $(MPIF77)
...

After the makefile is modified and saved, it is recommended to return to the root folder of the application and clean-up previously build files:

% make clean

Now the application is ready to be instrumented by simply issuing the standard build command:

% make bt-mz CLASS=W NPROCS=4

After the command is issued, the make command should produce the output similar to the one below:

cd BT-MZ; make CLASS=W NPROCS=4 VERSION=
make: Entering directory 'BT-MZ'
cd ../sys; cc  -o setparams setparams.c -lm
../sys/setparams bt-mz 4 W
scorep mpif77 -c  -O3 -fopenmp bt.f
[...]
cd ../common;  scorep --user mpif77 -c  -O3 -fopenmp timers.f
scorep mpif77 -O3 -fopenmp -o ../bin.scorep/bt-mz_W.4 \
bt.o initialize.o exact_solution.o exact_rhs.o set_constants.o \
adi.o rhs.o zone_setup.o x_solve.o y_solve.o exch_qbc.o \
solve_subs.o z_solve.o add.o error.o verify.o mpi_setup.o \
../common/print_results.o ../common/timers.o
Built executable ../bin.scorep/bt-mz_W.4
make: Leaving directory 'BT-MZ'

When make finishes, the built and instrumented application could be found under bin.scorep/bt-mz_W.4.

Summary Measurement Collection

Now instrumented BT-MZ is ready to be executed and to be profiled by Score-P at the same time. However before doing so, one has an opportunity to configure Score-P measurement by setting Score-P environment variables. For the complete list of variables please refer to Appendix 'Score-P Measurement Configuration' .

The typical Score-P performance analysis workflow implies collecting summary performance information first and then in detail performance exploration using execution traces. Therefore Score-P has to be configured to perform profiling and tracing has to be disabled. This is done by setting corresponding environment variables:

% export SCOREP_ENABLE_PROFILING=1

% export SCOREP_ENABLE_TRACING=0

Performance data collected by Score-P will be stored in an experiment directory. In order to efficiently manage multiple experiments, one can specify a meaningful name for the experiment directory by setting

% export SCOREP_EXPERIMENT_DIRECTORY=scorep_bt-mz_W_4x4_sum

After Score-P is prepared for summary collection, the instrumented application can be started as usual:

% cd bin.scorep
% export OM_NUM_THREADS=4
% mpiexec -np 4 ./bt-mz_W.4

The BT-MZ output should look similar to the listing below:

NAS Parallel Benchmarks (NPB3.3-MZ-MPI) BT-MZ MPI+OpenMP Benchmark
Number of zones:   4 x   4
Iterations: 200    dt:   0.000800
Number of active processes:     4
Use the default load factors with threads
Total number of threads:     16  (  4.0 threads/process)
Calculated speedup =     15.78
Time step    1
[... More application output ...]

After application execution is finished, the summary performance data collected by Score-P is stored in the experiment directory bin.scorep/scorep_bt-mz_W_4x4_sum. The directory contains the following files:

scorep.cfg - a record of the measurement configuration used in the run
profile.cubex - the analysis report that was collated after measurement

Summary report examination

After BT-MZ finishes execution, the summary performance data measured by Score-P can be investigated with CUBE or ParaProf interactive report exploration tools.

CUBE:

% cube scorep_bt-mz_W_4x4_sum/profile.cubex

ParaProf:

% paraprof scorep_bt-mz_W_4x4_sum/profile.cubex

Both tools will reveal the call-path profile of BT-MZ annotated with metrics: Time, Visits count, MPI message statistics (bytes sent/received). For more information on using the tool please refer to the corresponding documentation (CUBE, ParaProf).

Summary experiment scoring

Though we were able to collect the profile data, one can mention that the execution took longer in comparison to un-instrumented run, even when the time spent for measurement start-up/finalization is disregarded. Longer execution times of the instrumented application is a sign of high instrumentation/measurement overhead. Furthermore, this overhead might result in large trace files and buffer flushes in the later tracing experiment if Score-P is not properly configured.

In order to investigate sources of the overhead and to tune measurement configuration for consequent trace collection with Score-P, scorep-score tool (see Section 'Usage of scorep-score' for more details about scorep-score) can be used:

% scorep-score scorep_bt-mz_W_4x4_sum/profile.cubex
Estimated aggregate size of event trace (total_tbc):
                     990247448 bytes
Estimated requirements for largest trace buffer (max_tbc):
                     256229936 bytes
(hint: When tracing set SCOREP_TOTAL_MEMORY > max_tbc to avoid
intermediate flushes
or reduce requirements using file listing names of USR regions
to be filtered.)
flt type         max_tbc         time      % region
     ALL       256229936      5549.78  100.0 ALL
     USR       253654608      1758.27   31.7 USR
     OMP         5853120      3508.57   63.2 OMP
     COM          343344       183.09    3.3 COM
     MPI           93776        99.86    1.8 MPI

The textual output of the tool generates an estimation of the size of an OTF2 trace produced, should Score-P be run using the current configuration. Here the trace size estimation could be also seen as a measure of overhead, since both are proportional to the number of recorded events. Additionally, the tool shows the distribution of the required trace size over call-path classes. From the report above one can see that the estimated trace size needed is equal to 1 GB in total or 256 MB per MPI rank, which is significant. From the breakdown per call-path class one can see that most of the overhead is due to user-level computations. In order to further localize the source of the overhead, scorep-score can print the breakdown of the buffer size on per-region basis:

% scorep-score -r scorep_bt-mz_W_4x4_sum/profile.cubex
  [...]
flt  type    max_tbc       time      % region
     ALL   256229936    5549.78  100.0 ALL
     USR   253654608    1758.27   31.7 USR
     OMP     5853120    3508.57   63.2 OMP
     COM      343344     183.09    3.3 COM
     MPI       93776      99.86    1.8 MPI
     USR    79176312     559.15   31.8 binvcrhs_
     USR    79176312     532.73   30.3 matvec_sub_
     USR    79176312     532.18   30.3 matmul_sub_
     USR     7361424      50.51    2.9 binvrhs_
     USR     7361424      56.35    3.2 lhsinit_
     USR     3206688      27.32    1.6 exact_solution_
     OMP     1550400    1752.20   99.7 !$omp implicit barrier
     OMP      257280       0.44    0.0 !$omp parallel @exch_qbc.f
     OMP      257280       0.61    0.0 !$omp parallel @exch_qbc.f
     OMP      257280       0.48    0.0 !$omp parallel @exch_qbc.f

The regions marked as USR type contribute to around 32% of the total time, however, much of that is very likely to be measurement overhead due to frequently-executed small routines. Therefore, it is highly recommended to remove these routines from measurements. This can be achieved by placing them into a filter file (please refer to Section 'Defining and testing a filter' for more details about filter file specification) as regions to be excluded from measurements. There is already a filter file prepared for BT-MZ which can be used:

% cat ../config/scorep.filt
SCOREP_REGION_NAMES_BEGIN EXCLUDE
binvcrhs*
matmul_sub*
matvec_sub*
exact_solution*
binvrhs*
lhs*init*
timer_*

One can use scorep-score once again to verify the effect of the filter file :

% scorep-score -f ../config/scorep.filt scorep_bt-mz_W_4x4_sum
Estimated aggregate size of event trace (total_tbc):
                     20210360 bytes
Estimated requirements for largest trace buffer (max_tbc):
                     6290888 bytes
(hint: When tracing set SCOREP_TOTAL_MEMORY > max_tbc to avoid
intermediate flushes
or reduce requirements using file listing names of USR regions
to be filtered.)

Now one can see that the trace size is reduced to just 20MB in total or 6MB per MPI rank. The regions filtered out will be marked with "+" in the left-most column of the per-region report.

Advanced summary measurement collection

After the filtering file is prepared to exclude the sources of the overhead, it is recommended to repeat summary collection, in order to obtain more precise measurements.

In order to specify the filter file to be used during measurements, the corresponding environment variable has to be set:

% export SCOREP_FILTERING_FILE=../config/scorep.filt

It is also recommended to adjust the experiment directory name for the new run:

% export SCOREP_EXPERIMENT_DIRECTORY=\

scorep_bt-mz_W_4x4_sum_with_filter

Score-P also has a possibility to record hardware counters (see Section 'PAPI Hardware Performance Counters' ) and operating system resource usage (see Section 'Resource Usage Counters' ) in addition to default time and number of visits metrics. Hardware counters could be configured by setting Score-P environment variable SCOREP_METRIC_PAPI to the comma-separated list of PAPI events (other separator could be specified by setting SCOREP_METRIC_PAPI_SEP):

% export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_FP_INS

Note: The specified combination of the hardware events has to be valid, otherwise Score-P will abort execution. Please run papi_avail and papi_native_avail in order to get the list of the available PAPI generic and native events.

Operating system resource usage metrics are configured by setting the following variable:

% export SCOREP_METRIC_RUSAGE=ru_maxrss,ru_stime

Additionally Score-P can be configured to record only a subset of the mpi functions. This is achieved by setting SCOREP_MPI_ENABLE_GROUPS variable with a comma-separated list of sub-groups of MPI functions to be recorded (see Appendix 'MPI wrapper affiliation' ):

% export SCOREP_MPI_ENABLE_GROUPS=cg,coll,p2p,xnonblock

In case performance of the CUDA code is of interest, Score-P can be configured to measure CUPTI metrics as follows (see Section 'CUDA Performance Measurement' ):

% export SCOREP_CUDA_ENABLE=gpu,kernel,idle

In case performance of the OpenCL code is of interest, Score-P can be configured to measure OpenCL events as follows (see Section 'OpenCL Performance Measurement' ):

% export SCOREP_OPENCL_ENABLE=api,kernel,memcpy

When the granularity offered by the automatic compiler instrumentation is not sufficient, one can use Score-P manual user instrumentation API (see Section 'Manual Region Instrumentation' ) for more fine-grained annotation of particular code segments. For example initialization code, solver or any other code segment of interest can be annotated.

In order to enable user instrumentation, an --user argument has to be passed to Score-P command prepending compiler and linker specification:

% MPIF77 = scorep --user mpif77

Below, the loop found on line ... in file ... is annotated as a user region:

#include "scorep/SCOREP_User.inc"
subroutine foo(...)
  ! Declarations
  SCOREP_USER_REGION_DEFINE( solve )
  ! Some code...
  SCOREP_USER_REGION_BEGIN( solve, "<solver>", \
                            SCOREP_USER_REGION_TYPE_LOOP )
  do i=1,100
    [...]
  end do
  SCOREP_USER_REGION_END( solve )
  ! Some more code...
end subroutine

This will appear as an additional region in the report.

BT-MZ has to be recompiled and relinked in order to complete instrumentation.

% make clean

% make bt-mz CLASS=W NPROCS=4

After applying advanced configurations described above, summary collection with Score-P can be started as usual:

% mpiexec -np 4 ./bt-mz_W.4

Advanced summary report examination

After execution is finished, one can use scorep-score tool to verify the effect of filtering:

% scorep-score scorep_bt-mz_W_4x4_sum_with_filter/profile.cubex
Estimated aggregate size of event trace (total_tbc):
                     20210360 bytes
Estimated requirements for largest trace buffer (max_tbc):
                     6290888 bytes
(hint: When tracing set SCOREP_TOTAL_MEMORY > max_tbc to avoid
intermediate flushes
or reduce requirements using file listing names of USR regions
to be filtered.)
flt type         max_tbc         time      % region
     ALL         6290888       241.77  100.0 ALL
     OMP         5853120       168.94   69.9 OMP
     COM          343344        35.57   14.7 COM
     MPI           93776        37.25   15.4 MPI
     USR             672         0.01    0.0 USR

The report above shows significant reduction in runtime (due to elimination of the overhead) not only in USR regions but also in MPI/OMP regions as well.

Now, the extended summary report can be interactively explored using CUBE:

% cube scorep_bt-mz_W_4x4_sum_with_filter/profile.cubex

or ParaProf:

% paraprof scorep_bt-mz_W_4x4_sum_with_filter/profile.cubex

Event trace collection and examination

After exploring extended summary report, additional insight into performance of BT-MZ can be gained by performing trace collection. In order to do so, Score-P has to be configured to perform tracing by setting corresponding variable to true and disabling profile generation:

% export SCOREP_ENABLE_TRACING=true

% export SCOREP_ENABLE_PROFILING=false

Additionally it is recommended to set a meaningful experiment directory name:

% export SCOREP_EXPERIMENT_DIRECTORY=scorep_bt-mz_W_4x4_trace

After BT-MZ execution is finished, a separate trace file per thread is written into the new experiment directory. In order to explore it, Vampir tool can be used:

% vampir scorep_bt-mz_W_4x4_trace/traces.otf2

Please consider that traces can become extremely large and unwieldy, because the size of the trace is proportional to number of processes/threads (width), duration (length) and detail (depth) of measurement. When the trace is too large to hold in the memory allocated by Score-P, flushes can happen. Unfortunately the resulting traces are of little value, since uncoordinated flushes result in cascades of distortion.

Traces should be written to a parallel file system, e.g., to /work or /scratch which are typically provided for this purpose.