
Work in progress

Introduction

This guide explains how to use the Extrae and Paraver performance tools. These tools have been developed at the Barcelona Supercomputing Centre (BSC).

  • These tools allow users to study how efficiently the computational resources are used.
  • They support HPC studies such as performance analysis.
  • They help identify bottlenecks and optimise parallel applications.

Extrae is the package that generates Paraver trace files for post-mortem analysis. It is installed on cca/ccb.

Paraver is the trace visualisation and analysis browser. It is installed on the desktop workstations.

Supported programming models:

  • MPI
  • OpenMP (Intel, GNU or IBM runtimes)
  • CUDA
  • OpenCL
  • pthreads
  • OmpSs
  • Java
  • Python

By default, Extrae uses the LD_PRELOAD mechanism. Through the interposition and/or sampling mechanisms, it gathers the following performance data:

  • Timestamp: When analysing the behaviour of an application, it is important to have a fine-grained timestamping mechanism (up to nanoseconds). Extrae provides a set of clock functions implemented specifically for different target machines in order to provide the most accurate timing possible. On systems where daemons inhibit the usage of these timers, or where no machine-specific timer implementation exists, Extrae falls back to advanced POSIX clocks, which still provide nanosecond-resolution timestamps at low cost.
  • Performance and other counter metrics: Extrae uses the PAPI and PMAPI interfaces to collect information about microprocessor performance. With the advent of components in the PAPI software, Extrae can not only collect information on how the microprocessor is behaving, but also study other components of the system (disk, network, operating system, among others) and extend the study beyond the microprocessor (power consumption and thermal information). Extrae mainly collects these counter metrics at the parallel programming calls and at samples. It can also capture this information at the entry and exit points of instrumented user routines.
  • References to the source code: Analysing the performance of an application requires relating the observed performance to the code responsible for it. This way the analyst can locate the performance bottlenecks and suggest improvements to the application code. Extrae provides information about the source code that was being executed (function name, file name and line number) at specific points such as programming model calls or sampling points.

Load the extrae module
module load extrae

Choose the correct xml file

By default, Extrae will use the pure MPI configuration:

/usr/local/apps/extrae/xml/MPI/extrae.xml

The module will prepare the environment to extract the traces. Extrae uses an XML file for its configuration. We have prepared 2 default XML files in this location:

/usr/local/apps/extrae/xml/

There is one for pure MPI applications and another for hybrid MPI+OpenMP applications.

To activate OpenMP tracing, set the following variable in your job environment:
# remember that only Intel, GNU or IBM runtimes are supported
export OMP_TRACE=1
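
If you go down the hybrid route, you also need to point Extrae at the MPI+OpenMP configuration file rather than the default pure-MPI one. A minimal sketch, assuming the hybrid file lives alongside the MPI one under /usr/local/apps/extrae/xml/ (the <hybrid_dir> placeholder is hypothetical; list the directory to find its exact name):

module load extrae
# list the prepared configurations; the hybrid directory name is an assumption
ls /usr/local/apps/extrae/xml/
export EXTRAE_CONFIG_FILE=/usr/local/apps/extrae/xml/<hybrid_dir>/extrae.xml
export OMP_TRACE=1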

There are some xml examples in:

$EXTRAE_DIR/share/example/

In this directory you will find different folders, one per programming model, and inside these folders several extrae_explained.xml files explaining each section of the configuration. If you want to use your own XML file, set the environment variable:

export EXTRAE_CONFIG_FILE=<path_to_xml_file>
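
For instance, a minimal sketch of how you might start from one of the shipped examples and adapt it (the exact file name under $EXTRAE_DIR/share/example/ is an assumption; check the directory listing):

# copy an example configuration somewhere writable and edit it
mkdir -p $SCRATCH/extrae
cp $EXTRAE_DIR/share/example/MPI/extrae.xml $SCRATCH/extrae/my_extrae.xml   # source path assumed
# ... edit $SCRATCH/extrae/my_extrae.xml to suit your run ...
export EXTRAE_CONFIG_FILE=$SCRATCH/extrae/my_extrae.xml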

Set the wrapper script in the aprun line

We have prepared two wrapper scripts to enable tracing of your parallel program, depending on the language of the source code:

C: trace-c.sh
Fortran: trace-fortran.sh

To enable tracing, place the wrapper script between "aprun <args>" and the executable. For example, to trace a Fortran parallel program:

module load extrae
#aprun -n $submit_total_tasks -N $submit_tasks_per_node -d $omp_num_threads $BIN $args
# will become
aprun -n $submit_total_tasks -N $submit_tasks_per_node -d $omp_num_threads trace-fortran.sh $BIN $args
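
For reference, wrappers of this kind typically do little more than preload the appropriate Extrae library before executing the real binary. The sketch below only illustrates the mechanism and is not the actual content of trace-fortran.sh on cca/ccb; the library name libmpitracef.so is Extrae's usual choice for Fortran MPI codes (libmpitrace.so for C), and the paths are assumptions:

#!/bin/bash
# Illustrative sketch of an LD_PRELOAD-based wrapper (not the site script).
# Fall back to the default pure-MPI configuration if none was chosen.
export EXTRAE_CONFIG_FILE=${EXTRAE_CONFIG_FILE:-/usr/local/apps/extrae/xml/MPI/extrae.xml}
# Preload the Extrae tracing library for Fortran MPI applications (name assumed).
export LD_PRELOAD=$EXTRAE_DIR/lib/libmpitracef.so
# Run the real executable with its original arguments.
exec "$@"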

Extrae output files

The Extrae library will generate several .mpit files in the current directory from which aprun was run.

Note that these files can be large! It is therefore recommended to place them on a Lustre filesystem (i.e. $SCRATCH).
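
For example, you might create and move into a directory on $SCRATCH before the aprun line so that the .mpit files are written to Lustre (the directory name below is just an illustration):

mkdir -p $SCRATCH/extrae_traces
cd $SCRATCH/extrae_traces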

If you have <merge enabled="yes"> in the XML file (it is enabled if you are using the default configuration), the library will merge these files at the end of the execution. Once the files have been merged, 3 different files will appear in your directory:

<program_name>.prv
<program_name>.row
<program_name>.pcf

These are the merged files that will later be read with Paraver.
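
If merging was disabled in the XML (or did not complete), the intermediate files can be merged manually with the mpi2prv tool that ships with Extrae. A hedged sketch; the name of the .mpits index file follows the trace prefix configured in the XML (TRACE.mpits is the Extrae default, so the actual name may differ here):

module load extrae
# merge the intermediate files listed in the .mpits index into a Paraver trace
mpi2prv -f TRACE.mpits -o <program_name>.prv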

Output samples

If everything went well, you should be able to see the Extrae output in your job's stdout. This output tells the user which configuration was used to extract the performance information.

Welcome to Extrae 3.4.3
Extrae: Parsing the configuration file (/usr/local/apps/extrae/xml/MPI/extrae.xml) begins
Extrae: Generating intermediate files for Paraver traces.
Extrae: MPI routines will collect HW counters information.
Extrae: Tracing 4 level(s) of MPI callers: [ 1 2 3 4 ]
Extrae: Warning! change-at-time time units not specified. Using seconds
Extrae: PAPI domain set to ALL for HWC set 1
Extrae: HWC set 1 contains following counters < PAPI_TOT_INS (0x80000032) PAPI_TOT_CYC (0x8000003b) PAPI_L1_DCM (0x80000000) > - never changes
Extrae: Resource usage is disabled at flush buffer.
Extrae: Memory usage is disabled at flush buffer.
Extrae: Tracing buffer can hold 500000 events
Extrae: Circular buffer disabled.
Extrae: Dynamic memory instrumentation is disabled.
Extrae: Basic I/O memory instrumentation is disabled.
Extrae: System calls instrumentation is disabled.
Extrae: Parsing the configuration file (/usr/local/apps/extrae/xml/MPI/extrae.xml) has ended
Extrae: Intermediate traces will be stored in <tmpdir>
Extrae: Temporal directory (tmpdir) is shared among processes.
Extrae: Final directory (/scratch/...) is shared among processes.
Extrae: Tracing mode is set to: Detail.
Extrae: Successfully initiated with 300 tasks and 1 threads
...

Then at the end of the execution:

# 1 per MPI task!
Extrae: Intermediate raw trace file created : ....mpit

Extrae: Intermediate raw sym file created : ...sym

Extrae: Deallocating memory.
Extrae: Application has ended. Tracing has been terminated.
Extrae: Proceeding with the merge of the intermediate tracefiles.
Extrae: Waiting for all tasks to reach the checkpoint.
# if the merger was on:
merger: Extrae 3.4.3
mpi2prv: Tree order is set to 16
mpi2prv: Assigned nodes < list of nodes >
mpi2prv: Assigned size per processor < 94 Mbytes, 46 Mbytes, 45 Mbytes, 41 Mbytes, ... 44 Mbytes, 38 Mbytes >
...
mpi2prv: Elapsed time sharing communications: 0 hours 0 minutes 0 seconds
mpi2prv: Sharing thread accounting information for ptask 0 done
mpi2prv: Merge tree depth for 300 tasks is 3 levels using a fan-out of 16 leaves
mpi2prv: Executing merge tree step 1 of 3.
mpi2prv: Elapsed time on tree step 1: 0 hours 0 minutes 19 seconds
mpi2prv: Executing merge tree step 2 of 3.
mpi2prv: Elapsed time on tree step 2: 0 hours 3 minutes 57 seconds
mpi2prv: Executing merge tree step 3 of 3.
mpi2prv: Generating tracefile (intermediate buffers of 26214 events)
         This process can take a while. Please, be patient.
mpi2prv: Progress ... 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% done
mpi2prv: Elapsed time on tree step 3: 0 hours 5 minutes 57 seconds
mpi2prv: Resulting tracefile occupies 13608467610 bytes
mpi2prv: Removing temporal files... done
mpi2prv: Elapsed time removing temporal files: 0 hours 0 minutes 4 seconds
mpi2prv: Congratulations! ifsMASTER.dp.x.prv has been generated.
 


How to use paraver

We suggest you do the Paraver Tutorials to get familiar with the tool.

Paraver is installed on the workstations. You can load the paraver module and start it with 'wxparaver':

module load paraver
wxparaver

The first thing you have to do is load the .prv trace using the menu File → Load trace...

Once the trace is loaded, you can open the .cfg configurations using the menu File → Load Configuration... The default configuration files are in /usr/local/apps/paraver/4.6.3/cfgs/

This is an example of an MPI call view:

/usr/local/apps/paraver/4.6.3/cfgs/mpi/views/MPI_call.cfg
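
As a shortcut, wxparaver can usually be given the trace (and optionally a configuration file) directly on the command line. A hedged example using the trace generated above, assuming this build accepts positional arguments:

module load paraver
# open the trace and apply the MPI call view configuration in one go
wxparaver ifsMASTER.dp.x.prv /usr/local/apps/paraver/4.6.3/cfgs/mpi/views/MPI_call.cfg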