Measuring and Improving Application Performance with PerfSuite

by Rick Kufrin

At some point, all developers of software applications, whether targeted to Linux or not, are likely to spend at least a small amount of time focusing on the performance of their applications. The reason is simple: many potential benefits can be gained from tuning software for improved performance. For example, in the scientific and engineering arenas, performance gains can make the difference between running smaller scale simulations rather than larger and potentially more accurate models that would improve the scientific quality of the results. Applications that are more user-oriented also stand to benefit from improvements that result in faster responsiveness to the user and an improved overall user experience.

Although microprocessor improvements over the past decade or so have made clock speeds well in excess of the gigahertz range commonplace, most developers are aware that a tenfold increase in processor frequency does not guarantee a tenfold reduction in the runtime of your application. Additionally, for those developing software for distribution to others, attention to performance and responsiveness can pay big dividends when you consider that your end user may be running your application on a mid-1990s era 100MHz Pentium processor.

This article is an introduction to a set of open-source software tools called PerfSuite that can help you to understand and possibly improve the performance of your application under Linux. PerfSuite consists of several related tools and libraries targeted at several different activities useful in performance-oriented analysis.

The development of PerfSuite was motivated by my own experiences in working with not only applications that I had developed, but a number of large supercomputer-class applications in both academic and corporate settings. After having worked with several research groups, I realized that developers often take advantage of only a limited set of tools that may be available to them. They typically rely on traditional time-based statistical profiling techniques such as gprof.

Of course, gprof-style profiles are invaluable and should be the mainstay of any developer's performance toolbox. However, the microprocessors of today, such as those on which you probably are using Linux, offer advanced features that can provide alternative insights into characteristics that directly affect the performance of your software. In particular, nearly all microprocessors in common use today incorporate hardware-based performance measurement support in their designs. This support can provide an alternative viewpoint of your software's performance. While time-based profiles tell you where your software spends its time, hardware performance measurements can help you understand what the processor is doing and how effectively the processor is being utilized. Hardware measurements also pinpoint particular reasons why the CPU is stalling rather than accomplishing useful work.

Hardware Performance Counter Basics

The first time I encountered the term hardware performance counters, it was in the context of having access to multimillion-dollar supercomputers where every CPU cycle is critical and research teams spend substantial amounts of time tweaking their codes in order to extract maximum performance from the system. Often, software is tailored explicitly for each type of computer on which it is to be run. Research teams sometimes pore over the numbers generated by these performance counters to measure the exact performance of their applications and to ferret out places where they might gain additional speedup. Needless to say, this all sounded exotic to me. But the purpose and function of the counters turned out to be simple: they are extra logic added to the CPU that track low-level operations or events within the processor accurately and with minimal overhead.

For example, even if you're not an expert in computer architecture, you probably already know that nearly all processors in common use are cache-based machines. Caches, which offer much higher-speed access to data and instructions than what is possible with main memory, are based on the principles of temporal and spatial locality. Put another way, cache designs hope to take advantage of many applications' tendency to reuse blocks of data not long after first use (temporal locality) and to also access data items near those already used (spatial locality). If your application follows these patterns, you have a much greater chance of achieving high performance on a cache-based processor. If not, your performance may be disappointing. If you're interested in improving a poorly performing application, your next task is to try to determine why the processor is stalling instead of completing useful work. This is where performance counters may help.

It takes a little research to learn which performance counters are available to you on a particular processor. Each CPU has a different set of available performance counters, usually with different names. In fact, different models in the same processor family can differ substantially in the specific performance counters available. In general, the counters measure similar types of things. For example, they can record the absolute number of cache misses, the number of instructions issued, the number of floating point instructions executed and the number of vector, such as SSE or MMX, instructions. The best reference for available counters on your processor are the vendor's technical reference on the processor, often available on the Web.

Another complication is kernel-level support is needed to access the performance counters. Although the Itanium (IA-64) kernel provides this support through the perfmon driver in the official kernel (authored by Stephane Eranian of HP Research), the standard x86 Linux tree currently does not.

Fortunately, efforts are underway to address these issues. The first is the development of a performance monitoring driver for the x86 kernel called perfctr. This is a very stable kernel patch developed by Mikael Pettersson of Uppsala University in Sweden. The perfctr kernel patch is becoming more widely adopted by the community and continually is improved and maintained. The second is an effort from the Innovative Computing Laboratory at the University of Tennessee-Knoxville called PAPI (Performance Application Programming Interface). PAPI defines a standard set of cross-platform performance monitoring events and a standard API that allows measurement using hardware counters in a portable way. The PAPI Project provides implementations for the library on several current processors and operating systems, including Intel/AMD x86 processors, Itanium systems and, most recently, AMD's x86-64 CPUs. On Linux, PAPI uses the perfmon and perfctr drivers as appropriate. Refer to the on-line Resources for references where you can learn much more about perfctr, perfmon and PAPI.

PerfSuite, discussed in the remainder of this article, builds upon PAPI, perfmon and perfctr to provide developers with an even higher-level user interface as well as additional functionality. A main focus of PerfSuite is ease of use. Based on my experiences in working with developers interested in performance analysis, it became clear that an ideal solution would require little or no extra work from users who simply want to know how well an application is performing on a computer. They want to know this without having to learn many details about how to configure or access the performance data at a low level.

Using Performance Counters to Measure Application Characteristics

Let's say that you do have an application that isn't cache-friendly—what might happen? In the worst case scenario, rather than loading a line of data into the cache and operating on the data contained in that line repeatedly, it may use only one piece of data and then be done with it. The next piece of data you need may require another cache line to be loaded and so forth. Each of these cache loads are relatively expensive and can result in reduced performance, because the processor is waiting primarily for the data it needs to become available. Each time the next piece of data is required, the processor attempts to load it from data already resident in the cache. If it's not found, a cache miss occurs and a corresponding hardware event is signaled. The higher the ratio of cache misses to hits, the more likely it is that the overall performance of the software degrades.

Listing 1 shows a basic but concrete example of how this might occur. The listing shows a loop that initializes each element of a matrix using the sum of the corresponding element of another matrix and a vector. Because the C language stores data in row-major order, the loop as written does not access neighboring data elements in the two matrices. Fortunately, this problem has a simple solution: interchange the nested loops so the matrices are processed on a row-by-row basis. This pattern of array access also is referred to as stride-one access. Many optimizing compilers perform this type of loop-interchange optimization automatically, depending on the optimization level you select.

Listing 1. Loop from a Program with Cache-Unfriendly Behavior


for (j = 0; j < COLS; j++)
    for (i = 0; i < ROWS; i++)
        a[i][j] = b[i][j] + c[i]

Test cases containing these two versions of the loop were compiled with a recent release of Intel's ICC compiler, run on a Pentium III computer and timed. The result of this simple change sped up the loop by a factor of ten. Not unexpectedly, the overall level 2 cache miss count decreased considerably for the optimized version of the loop (212,665,026 versus 25,287,572—see the next section for more information).

Often, it's useful to combine the raw hardware performance counts into a derived metric that can provide a normalized view of performance. For example, one of the most widely used metrics for performance measurement describes the average number of cycles required to complete an instruction (CPI). By counting the total number of cycles and instructions retired (both of which are available as hardware events), we easily can obtain this metric. Similarly, we might be interested in knowing, on average, how often a piece of data was reused once it was resident in the cache. By counting the appropriate cache-related events and combining them into a single metric, we can obtain an approximation of this information as well.

PerfSuite's hardware performance counter tools and libraries provide easy access to both the raw measurement data as well as a large number of derived metrics that you can use to learn about and hopefully improve the performance of your application. In its most basic use, PerfSuite requires nothing more than a slight modification to the command you execute to run your program. If your executable is in the file myprog, then instead of running myprog directly, you instead would enter psrun myprog. If all goes well, the output of psrun is an XML document that contains a standard set of hardware events along with additional information about the CPU. You can translate this XML document into a comprehensive performance report with the command psprocess, supplying it with the name of the XML file.

PerfSuite Basics

The current release of PerfSuite includes the following four tools for accessing and working with performance data:

  • psrun: a utility for hardware performance event counting and profiling of single-threaded, POSIX threads-based and MPI applications.

  • psprocess: a utility that assists with a number of common tasks related to pre- and post-processing of performance measurements.

  • psinv: a utility that provides access to information about the characteristics of a machine (for example, processor type, cache information and available performance counters).

  • psconfig: a graphical tool for easy creation and management of PerfSuite configuration files.

This section demonstrates the two commands psrun and psprocess. Visit the PerfSuite Web site for more information about and examples of the use of psinv and psconfig.

The easiest way to learn to use the basic PerfSuite tools is try them out on your own programs. Here is a sequence of commands you might enter to run the simple cache example discussed earlier with performance measurement enabled. Also shown are the current contents of the directory after each run with psrun to show that XML documents are created:

1% ls
badcache
goodcache

2% psrun badcache

3% ls
badcache
goodcache
psrun.22865.xml

4% psrun goodcache

5% ls
badcache
goodcache
psrun.22865.xml
psrun.22932.xml

6% psprocess psrun.22865.xml
7% psprocess psrun.22932.xml

Listings 2 and 3 show the output of the psprocess command for the unoptimized and optimized versions of the test program; these listings have been edited slightly to fit in the available space. As you can see, a substantial amount of information is gathered during the course of the measurement and the report includes not only the raw event counts measured using PAPI, but also a series of metrics that can be derived from the counts.

Listing 2. psprocess Output from the Cache-Unfriendly Version of the Loop

PerfSuite Hardware Performance Summary Report

Version                  : 1.0
Created                  : Thu Feb 19 22:43:01 2004
Generator                : psprocess 0.2
XML Source               : psrun.22865.xml

Processor and System Information
====================================================
Node CPUs                : 2
Vendor                   : Intel
Family                   : Pentium Pro (P6)
CPU Revision             : 6
Clock (MHz)              : 997.173
Memory (MB)              : 1510.82
Pagesize (KB)            : 4

Cache Information
====================================================
Cache levels             : 2
--------------------------------
Level 1
Type                     : instruction
Size (KB)                : 16
Linesize (B)             : 32
Assoc                    : 4
Type                     : data
Size (KB)                : 16
Linesize (B)             : 32
Assoc                    : 4
--------------------------------
Level 2
Type                     : unified
Size (KB)                : 256
Linesize (B)             : 32
Assoc                    : 8

Index Description                     Counter Value
===================================================
 1 Conditional branch instructions........ 52663367
 2 Branch instructions.................... 52650952
 3 Conditional branch ins mispredicted...... 112009
 4 Conditional branch instructions taken.. 52610596
 5 Branch target address cache misses........ 31020
 6 Requests for excl acc to clean cache line.. 1165
 7 Requests for cache line invalidation.......... 0
 8 Requests for cache line intervention...... 32801
 9 Requests for excl acc to shared cache ln.. 26537
10 Floating point multiply instructions.......... 0
11 Floating point divide instructions............ 0
12 Floating point instructions........... 208155552
13 Hardware interrupts....................... 22134
14 Total cycles........................ 21407855039
15 Instructions issued.................. 2010041200
16 Instructions completed................ 624104056
17 Vector/SIMD instructions...................... 0
18 Level 1 data cache accesses........... 678945043
19 Level 1 data cache misses............. 244760094
20 Level 1 instruction cache accesses.. 21332388384
21 Level 1 instruction cache misses.......... 22546
22 Level 1 instruction cache reads..... 21309322857
23 Level 1 load misses................... 244318153
24 Level 1 store misses....................... 9852
25 Level 1 cache misses.................. 243826788
26 Level 2 data cache reads.............. 243745402
27 Level 2 data cache writes................. 10317
28 Level 2 instruction cache accesses........ 24335
29 Level 2 instruction cache reads........... 21362
30 Level 2 cache misses.................. 212665026
31 Cycles stalled on any resource...... 21057880641
32 Instruction TLB misses....................... 64

Statistics
===================================================
Counting domain............................... user
Multiplexed.................................... yes
Graduated floating point ins. per cycle...... 0.010
Vector ins. per cycle........................ 0.000
Floating point ins per graduated ins ........ 0.334
Vector ins per graduated ins ................ 0.000
Floating point ins per L1 data cache access.. 0.307
Graduated ins per cycle...................... 0.029
Issued ins per cycle......................... 0.094
Graduated ins per issued ins................. 0.310
Issued ins per L1 ins cache miss......... 89152.896
Graduated ins per L1 ins cache miss...... 27681.365
Level 1 ins cache miss ratio................. 0.000
Level 1 data cache access per graduated ins.. 1.088
% floating point ins of all graduated ins... 33.353
% cycles stalled on any resource............ 98.365
Level 1 ins cache misses per issued ins...... 0.000
Level 1 cache read miss ratio (instruction).. 0.000
Level 1 cache miss ratio (data).............. 0.361
Level 1 cache miss ratio (instruction)....... 0.000
Bandwidth used to level 1 cache (MB/s)..... 363.437
Bandwidth used to level 2 cache (MB/s)..... 316.988
MFLIPS (cycles).............................. 9.696
MFLIPS (wall clock).......................... 9.530
MVOPS (cycles)............................... 0.000
MVOPS (wall clock)........................... 0.000
MIPS (cycles)............................... 29.071
MIPS (wall clock)........................... 28.572
CPU time (seconds).......................... 21.469
Wall clock time (seconds)................... 21.843
% CPU utilization........................... 98.285

Listing 3. Part of the psprocess output from the optimized version of the loop. The Processor and System Information and Cache Information sections are the same.


Index Description                     Counter Value
===================================================
 1 Conditional branch instructions........ 49627213
 2 Branch instructions.................... 49971420
 3 Conditional branch ins mispredicted....... 97630
 4 Conditional branch ins taken........... 49089592
 5 Branch target address cache misses......... 3816
 6 Requests for excl access to clean cache ln.  820
 7 Requests for cache line invalidation.......... 0
 8 Requests for cache line intervention....... 2796
 9 Requests for excl access to shared cache ln. 494
10 Floating point multiply instructions.......... 0
11 Floating point divide instructions............ 0
12 Floating point instructions........... 189564951
13 Hardware interrupts........................ 2577
14 Total cycles......................... 2471179766
15 Instructions issued................... 513936102
16 Instructions completed................ 509580537
17 Vector/SIMD instructions...................... 0
18 Level 1 data cache accesses........... 372965600
19 Level 1 data cache misses.............. 23010188
20 Level 1 instruction cache accesses... 2769671237
21 Level 1 instruction cache misses........... 2369
22 Level 1 instruction cache reads...... 2746595553
23 Level 1 load misses.................... 25980065
24 Level 1 store misses........................ 995
25 Level 1 cache misses................... 25772544
26 Level 2 data cache reads.............. .25617201
27 Level 2 data cache writes................... 935
28 Level 2 instruction cache accesses......... 2405
29 Level 2 instruction cache reads............ 2652
30 Level 2 cache misses................... 25287572
31 Cycles stalled on any resource....... 2199590592
32 Instruction TLB misses........................ 0


Statistics
==================================================
Counting domain.............................. user
Multiplexed................................... yes
Graduated floating point ins per cycle...... 0.077
Vector ins per cycle.........................0.000
Floating point ins per graduated ins........ 0.372
Vector ins per graduated ins................ 0.000
Floating point ins per L1 data cache access. 0.508
Graduated ins per cycle......................0.206
Issued ins per cycle.........................0.208
Graduated ins per issued ins................ 0.992
Issued ins per L1 ins cache miss....... 216942.213
Graduated ins per L1 ins cache miss.... 215103.646
Level 1 ins cache miss ratio................ 0.000
Level 1 data cache access per graduated ins. 0.732
% floating point ins of all graduated ins.. 37.200
% cycles stalled on any resource........... 89.010
Level 1 ins cache misses per issued ins..... 0.000
Level 1 cache read miss ratio (instruction). 0.000
Level 1 cache miss ratio (data)............. 0.062
Level 1 cache miss ratio (instruction)...... 0.000
Bandwidth used to level 1 cache (MB/s).... 332.792
Bandwidth used to level 2 cache (MB/s).... 326.530
MFLIPS (cycles)............................ 76.493
MFLIPS (wall clock)........................ 66.787
MVOPS (cycles).............................. 0.000
MVOPS (wall clock).......................... 0.000
MIPS (cycles)............................. 205.626
MIPS (wall clock)......................... 179.533
CPU time (seconds).......................... 2.478
Wall clock time (seconds)................... 2.838
% CPU utilization.......................... 87.310
Customizing Your Performance Analysis

psrun determines the performance events to be measured by consulting a configuration file you can supply, which is an XML document that describes the measurements to be taken. If you don't supply a configuration file, a default is used (the output shown in Listings 2 and 3 used the default). As an XML document, the configuration file is straightforward to modify and read. For example, if you wanted to obtain the raw events required to calculate the CPI metric discussed earlier, you'd need to ask psrun to measure the total number of graduated instructions and the total number of cycles. These events are predefined in PAPI and are called PAPI_TOT_INS and PAPI_TOT_CYC, respectively. Listing 4 shows a PerfSuite XML configuration file that could be used to measure these events. To use this configuration file with psrun, all you need to do is supply the option -c, along with the name of your custom configuration and run as usual.

Listing 4. An Example PerfSuite XML Configuration Document


<?xml version="1.0" encoding="UTF-8" ?>
<ps_hwpc_eventlist class="PAPI">
  <!-- ==================================================
       Configuration file to measure graduated instructions
       and total cycles.
       ================================================== -->
  <ps_hwpc_event type="preset" name="PAPI_TOT_INS" />
  <ps_hwpc_event type="preset" name="PAPI_TOT_CYC" />
</ps_hwpc_eventlist>

The measurements described so far have been in aggregate counting mode, where the total count of one or more performance events are measured and reported over the total runtime of your application. PerfSuite provides an additional way of looking at your application's performance. Let's say you are interested in finding out where in your application all the level 2 cache misses occur so that you can focus your optimization work there. In other words, you'd like a profile similar to gprof's time-based profile, but instead have it be based on level 2 cache misses. This can be done rather easily with psrun by specifying a configuration file tailored for profiling rather than aggregate counting. The PerfSuite distribution includes a number of similar alternative configuration files that you can tailor as needed. Here's an example of how you would ask for a profiling experiment rather than the default total count of events:

8% psrun  -c \
/usr/local/share/perfsuite/xml/pshwpc/profile.xml \
solver

9% psprocess -e solver psrun.24135.xml

In profiling mode, the psprocess tool also needs the name of your executable (solver) to do its work. This is required in order to extract the symbol information in the executable so the program address can be mapped to source code lines.

Listing 5 shows an example of a profiling run of psrun obtained in this way. Not only is the application (solver) analyzed, but it also lists shared libraries used with the application that consumed CPU time. The combination of overall performance counting and profiling can be a powerful tool for learning about bottlenecks that may exist in your software and can help you to isolate quickly those areas of your application most in need of attention.

Listing 5. A source code profile generated by psprocess based on level 2 cache misses (output has been truncated to fit in available space).

PerfSuite Hardware Performance Summary Report


Profile Information
===================================================
Class          : PAPI
Event          : PAPI_L2_TCM (Level 2 cache misses)
Period         : 10000
Samples        : 16132
Domain         : user
Run Time       : 319.72 (seconds)
Min Self %     : (all)

Module Summary
---------------------------------------------------
Samples   Self %  Total %  Module

   16131   99.99%   99.99%  /home/nobody/solver/sol
       1    0.01%  100.00%  /lib/libc-2.2.4.so

File Summary
---------------------------------------------------
Samples   Self %  Total %  File
    5093   31.57%   31.57%  matxvec2d_blk3.f
    5015   31.09%   62.66%  cg3_blk.f
    4162   25.80%   88.46%  pc_jac2d_blk3.f
    1407    8.72%   97.18%  dot_prod2d_blk3.f
     429    2.66%   99.84%  add_exchange2d_blk3.f
      20    0.12%   99.96%  glibc-2.2.4/csu/init.c
       4    0.02%   99.99%  main3.f
       1    0.01%   99.99%  linuxthreads/weaks.c
       1    0.01%  100.00%  cs_jac2d_blk3.f

Function Summary
---------------------------------------------------
Samples   Self %  Total %  Function

    5093   31.57%   31.57%  matxvec2d_blk3
    5015   31.09%   62.66%  cg3_blk
    4162   25.80%   88.46%  pc_jac2d_blk3
    1407    8.72%   97.18%  dot_prod2d_blk3
     429    2.66%   99.84%  add_exchange2d_blk3
      20    0.12%   99.96%  ?
       4    0.02%   99.99%  main3
       1    0.01%   99.99%  __pthread_return_0
       1    0.01%  100.00%  cs_jac2d_blk3

File:Line Summary
---------------------------------------------------
Samples   Self %  Total %  File:Line

    5089   31.55%   31.55%  matxvec2d_blk3.f:19
    4125   25.57%   57.12%  pc_jac2d_blk3.f:20
    2763   17.13%   74.24%  cg3_blk.f:206
    1346    8.34%   82.59%  cg3_blk.f:346
     576    3.57%   86.16%  dot_prod2d_blk3.f:24
     524    3.25%   89.41%  cg3_blk.f:278
     489    3.03%   92.44%  dot_prod2d_blk3.f:23
     332    2.06%   94.50%  dot_prod2d_blk3.f:25
     197    1.22%   95.72%  cg3_blk.f:279
     176    1.09%   96.81%  add_exchange2d_blk3.f:29
      99    0.61%   97.42%  add_exchange2d_blk3.f:50
      71    0.44%   97.86%  add_exchange2d_blk3.f:30
      71    0.44%   98.30%  add_exchange2d_blk3.f:51
      55    0.34%   98.64%  cg3_blk.f:55
      38    0.24%   98.88%  cg3_blk.f:207
      34    0.21%   99.09%  cg3_blk.f:218
      31    0.19%   99.28%  pc_jac2d_blk3.f:27
      24    0.15%   99.43%  cg3_blk.f:139
      20    0.12%   99.55%  init.c:0
       8    0.05%   99.60%  dot_prod2d_blk3.f:22
       5    0.03%   99.63%  add_exchange2d_blk3.f:44
       4    0.02%   99.66%  matxvec2d_blk3.f:17
       4    0.02%   99.68%  cg3_blk.f:140
       3    0.02%   99.70%  cg3_blk.f:347
       3    0.02%   99.72%  cg3_blk.f:268
       3    0.02%   99.74%  cg3_blk.f:280
       3    0.02%   99.76%  pc_jac2d_blk3.f:18
       3    0.02%   99.78%  cg3_blk:/home/nobody/solver/cg3_blk.f:174
Summary

This article has touched only the surface of the techniques available to you when using hardware performance counters to measure and improve the performance of your applications. Hopefully, you now have an idea of what hardware performance counters are and how they can help you gain insight into performance bottlenecks. If you would like to get started using PerfSuite or other tools and supporting software mentioned in this article, visit the on-line Resources.

Many different ways exist in which applications can be tuned for higher performance. In fact, the most effective way is not loop-level improvements or tweaking but fundamental changes to the algorithms used in your application that are more computationally efficient. Ideally, your software will use efficient algorithms further tuned to make effective use of your CPU. PerfSuite and other similar tools can go a long way toward making this process easier for you.

Acknowledgements

I would like to thank Professor Danesh Tafti of the Mechanical Engineering Department at Virginia Tech for providing the program used for the psrun profiling example in Listing 5. This is a computational kernel extracted from a computational fluid dynamics application named GenIDLEST that Tafti and his research team use, maintain and develop. I also would like to express my thanks to all the PAPI team members of the Innovative Computing Laboratory at the University of Tennessee-Knoxville for their support and encouragement during the development of PerfSuite.

Resources for this article: /article/8264.

Rick Kufrin currently is a senior member of the technical staff at the University of Illinois' National Center for Supercomputing Applications. He is the originator and technical lead for the PerfSuite software project described in this article and is available for consultation on the use of PerfSuite and other technologies for software improvement.

Load Disqus comments