Measuring and Improving Application Performance with PerfSuite
At some point, all developers of software applications, whether targeted to Linux or not, are likely to spend at least a small amount of time focusing on the performance of their applications. The reason is simple: many potential benefits can be gained from tuning software for improved performance. For example, in the scientific and engineering arenas, performance gains can make the difference between running smaller-scale simulations and running larger, potentially more accurate models that improve the scientific quality of the results. Applications that are more user-oriented also stand to benefit from improvements that result in faster responsiveness to the user and an improved overall user experience.
Although microprocessor improvements over the past decade or so have made clock speeds well in excess of the gigahertz range commonplace, most developers are aware that a tenfold increase in processor frequency does not guarantee a tenfold reduction in the runtime of your application. Additionally, for those developing software for distribution to others, attention to performance and responsiveness can pay big dividends when you consider that your end user may be running your application on a mid-1990s era 100MHz Pentium processor.
This article is an introduction to a set of open-source software tools called PerfSuite that can help you to understand and possibly improve the performance of your application under Linux. PerfSuite consists of a number of related tools and libraries targeted at different activities useful in performance analysis.
The development of PerfSuite was motivated by my own experiences in working with not only applications that I had developed, but a number of large supercomputer-class applications in both academic and corporate settings. After having worked with several research groups, I realized that developers often take advantage of only a limited set of tools that may be available to them. They typically rely on traditional time-based statistical profiling techniques such as gprof.
Of course, gprof-style profiles are invaluable and should be the mainstay of any developer's performance toolbox. However, the microprocessors of today, such as those on which you probably are using Linux, offer advanced features that can provide alternative insights into characteristics that directly affect the performance of your software. In particular, nearly all microprocessors in common use today incorporate hardware-based performance measurement support in their designs. This support can provide an alternative viewpoint of your software's performance. While time-based profiles tell you where your software spends its time, hardware performance measurements can help you understand what the processor is doing and how effectively the processor is being utilized. Hardware measurements also pinpoint particular reasons why the CPU is stalling rather than accomplishing useful work.
The first time I encountered the term hardware performance counters, it was in the context of having access to multimillion-dollar supercomputers where every CPU cycle is critical and research teams spend substantial amounts of time tweaking their codes in order to extract maximum performance from the system. Often, software is tailored explicitly for each type of computer on which it is to be run. Research teams sometimes pore over the numbers generated by these performance counters to measure the exact performance of their applications and to ferret out places where they might gain additional speedup. Needless to say, this all sounded exotic to me. But the purpose and function of the counters turned out to be simple: they are extra logic added to the CPU that track low-level operations or events within the processor accurately and with minimal overhead.
For example, even if you're not an expert in computer architecture, you probably already know that nearly all processors in common use are cache-based machines. Caches, which offer much higher-speed access to data and instructions than what is possible with main memory, are based on the principles of temporal and spatial locality. Put another way, cache designs hope to take advantage of many applications' tendency to reuse blocks of data not long after first use (temporal locality) and to also access data items near those already used (spatial locality). If your application follows these patterns, you have a much greater chance of achieving high performance on a cache-based processor. If not, your performance may be disappointing. If you're interested in improving a poorly performing application, your next task is to try to determine why the processor is stalling instead of completing useful work. This is where performance counters may help.
It takes a little research to learn which performance counters are available to you on a particular processor. Each CPU has a different set of available performance counters, usually with different names. In fact, different models in the same processor family can differ substantially in the specific performance counters available. In general, though, the counters measure similar types of things. For example, they can record the absolute number of cache misses, the number of instructions issued, the number of floating point instructions executed and the number of vector, such as SSE or MMX, instructions. The best reference for the counters available on your processor is the vendor's technical documentation for that processor, often available on the Web.
Another complication is that kernel-level support is needed to access the performance counters. Although the Itanium (IA-64) kernel provides this support through the perfmon driver in the official kernel (authored by Stephane Eranian of HP Research), the standard x86 Linux tree currently does not.
Fortunately, efforts are underway to address these issues. The first is the development of a performance monitoring driver for the x86 kernel called perfctr, a very stable kernel patch developed by Mikael Pettersson of Uppsala University in Sweden. The perfctr kernel patch is becoming more widely adopted by the community and is improved and maintained continually. The second is an effort from the Innovative Computing Laboratory at the University of Tennessee-Knoxville called PAPI (Performance Application Programming Interface). PAPI defines a standard set of cross-platform performance monitoring events and a standard API that allows measurement using hardware counters in a portable way. The PAPI Project provides implementations of the library for several current processors and operating systems, including Intel/AMD x86 processors, Itanium systems and, most recently, AMD's x86-64 CPUs. On Linux, PAPI uses the perfmon and perfctr drivers as appropriate. Refer to the on-line Resources for references where you can learn much more about perfctr, perfmon and PAPI.
PerfSuite, discussed in the remainder of this article, builds upon PAPI, perfmon and perfctr to provide developers with an even higher-level user interface as well as additional functionality. A main focus of PerfSuite is ease of use. Based on my experiences in working with developers interested in performance analysis, it became clear that an ideal solution would require little or no extra work from users who simply want to know how well an application is performing on a computer. They want to know this without having to learn many details about how to configure or access the performance data at a low level.
Let's say that you do have an application that isn't cache-friendly—what might happen? In the worst-case scenario, rather than loading a line of data into the cache and operating on the data contained in that line repeatedly, the application may use only one piece of data and then be done with it. The next piece of data it needs may require another cache line to be loaded and so forth. Each of these cache loads is relatively expensive and can result in reduced performance, because the processor spends its time waiting for the data it needs to become available. Each time the next piece of data is required, the processor attempts to find it among the data already resident in the cache. If it's not found, a cache miss occurs and a corresponding hardware event is signaled. The higher the ratio of cache misses to hits, the more likely it is that the overall performance of the software degrades.
Listing 1 shows a basic but concrete example of how this might occur. The listing shows a loop that initializes each element of a matrix using the sum of the corresponding element of another matrix and a vector. Because the C language stores two-dimensional arrays in row-major order, the loop as written does not access neighboring data elements in the two matrices. Fortunately, this problem has a simple solution: interchange the nested loops so the matrices are processed on a row-by-row basis, as shown in the rewritten loop following Listing 1. This pattern of array access also is referred to as stride-one access. Many optimizing compilers perform this type of loop-interchange optimization automatically, depending on the optimization level you select.
Listing 1. Loop from a Program with Cache-Unfriendly Behavior
for (j = 0; j < COLS; j++)
    for (i = 0; i < ROWS; i++)
        a[i][j] = b[i][j] + c[i];
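For comparison, here is the interchanged, cache-friendly version of the same loop. With i in the outer loop, consecutive iterations of the inner loop touch adjacent elements of a and b (stride-one access), so the data in each cache line is used before the line is evicted:

for (i = 0; i < ROWS; i++)        /* rows in the outer loop */
    for (j = 0; j < COLS; j++)    /* stride-one access to a and b */
        a[i][j] = b[i][j] + c[i];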
Test cases containing these two versions of the loop were compiled with a recent release of Intel's ICC compiler, run on a Pentium III computer and timed. The result of this simple change sped up the loop by a factor of ten. Not unexpectedly, the overall level 2 cache miss count decreased considerably for the optimized version of the loop (212,665,026 versus 25,287,572—see the next section for more information).
Often, it's useful to combine the raw hardware performance counts into a derived metric that can provide a normalized view of performance. For example, one of the most widely used metrics for performance measurement describes the average number of cycles required to complete an instruction (CPI). By counting the total number of cycles and instructions retired (both of which are available as hardware events), we easily can obtain this metric. Similarly, we might be interested in knowing, on average, how often a piece of data was reused once it was resident in the cache. By counting the appropriate cache-related events and combining them into a single metric, we can obtain an approximation of this information as well.
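To make this concrete, consider the raw counts reported later in this article in Listings 2 and 3 for the two versions of the cache example:

CPI (original loop)     = 21,407,855,039 cycles / 624,104,056 instructions ≈ 34.3
CPI (interchanged loop) =  2,471,179,766 cycles / 509,580,537 instructions ≈  4.8

Similarly, dividing the level 1 data cache misses by the level 1 data cache accesses (available as the PAPI preset events PAPI_L1_DCM and PAPI_L1_DCA) yields the level 1 data cache miss ratios of 0.361 and 0.062 that appear in those listings.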
PerfSuite's hardware performance counter tools and libraries provide easy access to both the raw measurement data as well as a large number of derived metrics that you can use to learn about and hopefully improve the performance of your application. In its most basic use, PerfSuite requires nothing more than a slight modification to the command you execute to run your program. If your executable is in the file myprog, then instead of running myprog directly, you instead would enter psrun myprog. If all goes well, the output of psrun is an XML document that contains a standard set of hardware events along with additional information about the CPU. You can translate this XML document into a comprehensive performance report with the command psprocess, supplying it with the name of the XML file.
The current release of PerfSuite includes the following four tools for accessing and working with performance data:
psrun: a utility for hardware performance event counting and profiling of single-threaded, POSIX threads-based and MPI applications.
psprocess: a utility that assists with a number of common tasks related to pre- and post-processing of performance measurements.
psinv: a utility that provides access to information about the characteristics of a machine (for example, processor type, cache information and available performance counters).
psconfig: a graphical tool for easy creation and management of PerfSuite configuration files.
This section demonstrates the two commands psrun and psprocess. Visit the PerfSuite Web site for more information about and examples of the use of psinv and psconfig.
The easiest way to learn to use the basic PerfSuite tools is to try them out on your own programs. Here is a sequence of commands you might enter to run the simple cache example discussed earlier with performance measurement enabled. Also shown are the contents of the directory after each run with psrun, to show that the XML documents are created:
1% ls
badcache  goodcache
2% psrun badcache
3% ls
badcache  goodcache  psrun.22865.xml
4% psrun goodcache
5% ls
badcache  goodcache  psrun.22865.xml  psrun.22932.xml
6% psprocess psrun.22865.xml
7% psprocess psrun.22932.xml
Listings 2 and 3 show the output of the psprocess command for the unoptimized and optimized versions of the test program; these listings have been edited slightly to fit in the available space. As you can see, a substantial amount of information is gathered during the course of the measurement and the report includes not only the raw event counts measured using PAPI, but also a series of metrics that can be derived from the counts.
Listing 2. psprocess Output from the Cache-Unfriendly Version of the Loop
PerfSuite Hardware Performance Summary Report

Version       : 1.0
Created       : Thu Feb 19 22:43:01 2004
Generator     : psprocess 0.2
XML Source    : psrun.22865.xml

Processor and System Information
====================================================
Node CPUs     : 2
Vendor        : Intel
Family        : Pentium Pro (P6)
CPU Revision  : 6
Clock (MHz)   : 997.173
Memory (MB)   : 1510.82
Pagesize (KB) : 4

Cache Information
====================================================
Cache levels  : 2
--------------------------------
Level 1
  Type         : instruction
  Size (KB)    : 16
  Linesize (B) : 32
  Assoc        : 4

  Type         : data
  Size (KB)    : 16
  Linesize (B) : 32
  Assoc        : 4
--------------------------------
Level 2
  Type         : unified
  Size (KB)    : 256
  Linesize (B) : 32
  Assoc        : 8

Index Description                          Counter Value
===================================================
  1 Conditional branch instructions........ 52663367
  2 Branch instructions.................... 52650952
  3 Conditional branch ins mispredicted...... 112009
  4 Conditional branch instructions taken.. 52610596
  5 Branch target address cache misses........ 31020
  6 Requests for excl acc to clean cache line.. 1165
  7 Requests for cache line invalidation.......... 0
  8 Requests for cache line intervention...... 32801
  9 Requests for excl acc to shared cache ln.. 26537
 10 Floating point multiply instructions.......... 0
 11 Floating point divide instructions............ 0
 12 Floating point instructions........... 208155552
 13 Hardware interrupts....................... 22134
 14 Total cycles........................ 21407855039
 15 Instructions issued.................. 2010041200
 16 Instructions completed................ 624104056
 17 Vector/SIMD instructions...................... 0
 18 Level 1 data cache accesses........... 678945043
 19 Level 1 data cache misses............. 244760094
 20 Level 1 instruction cache accesses.. 21332388384
 21 Level 1 instruction cache misses.......... 22546
 22 Level 1 instruction cache reads..... 21309322857
 23 Level 1 load misses................... 244318153
 24 Level 1 store misses....................... 9852
 25 Level 1 cache misses.................. 243826788
 26 Level 2 data cache reads.............. 243745402
 27 Level 2 data cache writes................. 10317
 28 Level 2 instruction cache accesses........ 24335
 29 Level 2 instruction cache reads........... 21362
 30 Level 2 cache misses.................. 212665026
 31 Cycles stalled on any resource...... 21057880641
 32 Instruction TLB misses....................... 64

Statistics
===================================================
Counting domain............................... user
Multiplexed.................................... yes
Graduated floating point ins. per cycle...... 0.010
Vector ins. per cycle........................ 0.000
Floating point ins per graduated ins......... 0.334
Vector ins per graduated ins................. 0.000
Floating point ins per L1 data cache access.. 0.307
Graduated ins per cycle...................... 0.029
Issued ins per cycle......................... 0.094
Graduated ins per issued ins................. 0.310
Issued ins per L1 ins cache miss......... 89152.896
Graduated ins per L1 ins cache miss...... 27681.365
Level 1 ins cache miss ratio................. 0.000
Level 1 data cache access per graduated ins.. 1.088
% floating point ins of all graduated ins... 33.353
% cycles stalled on any resource............ 98.365
Level 1 ins cache misses per issued ins...... 0.000
Level 1 cache read miss ratio (instruction).. 0.000
Level 1 cache miss ratio (data).............. 0.361
Level 1 cache miss ratio (instruction)....... 0.000
Bandwidth used to level 1 cache (MB/s)..... 363.437
Bandwidth used to level 2 cache (MB/s)..... 316.988
MFLIPS (cycles).............................. 9.696
MFLIPS (wall clock).......................... 9.530
MVOPS (cycles)............................... 0.000
MVOPS (wall clock)........................... 0.000
MIPS (cycles)............................... 29.071
MIPS (wall clock)........................... 28.572
CPU time (seconds).......................... 21.469
Wall clock time (seconds)................... 21.843
% CPU utilization........................... 98.285
Listing 3. Part of the psprocess output from the optimized version of the loop. The Processor and System Information and Cache Information sections are the same.
Index Description                          Counter Value
===================================================
  1 Conditional branch instructions........ 49627213
  2 Branch instructions.................... 49971420
  3 Conditional branch ins mispredicted....... 97630
  4 Conditional branch ins taken........... 49089592
  5 Branch target address cache misses......... 3816
  6 Requests for excl access to clean cache ln. 820
  7 Requests for cache line invalidation.......... 0
  8 Requests for cache line intervention....... 2796
  9 Requests for excl access to shared cache ln. 494
 10 Floating point multiply instructions.......... 0
 11 Floating point divide instructions............ 0
 12 Floating point instructions........... 189564951
 13 Hardware interrupts........................ 2577
 14 Total cycles......................... 2471179766
 15 Instructions issued................... 513936102
 16 Instructions completed................ 509580537
 17 Vector/SIMD instructions...................... 0
 18 Level 1 data cache accesses........... 372965600
 19 Level 1 data cache misses.............. 23010188
 20 Level 1 instruction cache accesses... 2769671237
 21 Level 1 instruction cache misses........... 2369
 22 Level 1 instruction cache reads...... 2746595553
 23 Level 1 load misses.................... 25980065
 24 Level 1 store misses........................ 995
 25 Level 1 cache misses................... 25772544
 26 Level 2 data cache reads............... 25617201
 27 Level 2 data cache writes................... 935
 28 Level 2 instruction cache accesses......... 2405
 29 Level 2 instruction cache reads............ 2652
 30 Level 2 cache misses................... 25287572
 31 Cycles stalled on any resource....... 2199590592
 32 Instruction TLB misses........................ 0

Statistics
==================================================
Counting domain.............................. user
Multiplexed................................... yes
Graduated floating point ins per cycle...... 0.077
Vector ins per cycle......................... 0.000
Floating point ins per graduated ins........ 0.372
Vector ins per graduated ins................ 0.000
Floating point ins per L1 data cache access. 0.508
Graduated ins per cycle..................... 0.206
Issued ins per cycle........................ 0.208
Graduated ins per issued ins................ 0.992
Issued ins per L1 ins cache miss....... 216942.213
Graduated ins per L1 ins cache miss.... 215103.646
Level 1 ins cache miss ratio................ 0.000
Level 1 data cache access per graduated ins. 0.732
% floating point ins of all graduated ins.. 37.200
% cycles stalled on any resource........... 89.010
Level 1 ins cache misses per issued ins..... 0.000
Level 1 cache read miss ratio (instruction). 0.000
Level 1 cache miss ratio (data)............. 0.062
Level 1 cache miss ratio (instruction)...... 0.000
Bandwidth used to level 1 cache (MB/s).... 332.792
Bandwidth used to level 2 cache (MB/s).... 326.530
MFLIPS (cycles)............................ 76.493
MFLIPS (wall clock)........................ 66.787
MVOPS (cycles).............................. 0.000
MVOPS (wall clock).......................... 0.000
MIPS (cycles)............................. 205.626
MIPS (wall clock)......................... 179.533
CPU time (seconds).......................... 2.478
Wall clock time (seconds)................... 2.838
% CPU utilization.......................... 87.310
psrun determines the performance events to be measured by consulting a configuration file you can supply, which is an XML document that describes the measurements to be taken. If you don't supply a configuration file, a default is used (the output shown in Listings 2 and 3 used the default). As an XML document, the configuration file is straightforward to read and modify. For example, if you wanted to obtain the raw events required to calculate the CPI metric discussed earlier, you'd need to ask psrun to measure the total number of graduated instructions and the total number of cycles. These events are predefined in PAPI and are called PAPI_TOT_INS and PAPI_TOT_CYC, respectively. Listing 4 shows a PerfSuite XML configuration file that could be used to measure these events. To use this configuration file with psrun, all you need to do is supply the option -c, along with the name of your custom configuration file, and run as usual.
Listing 4. An Example PerfSuite XML Configuration Document
<?xml version="1.0" encoding="UTF-8" ?>
<ps_hwpc_eventlist class="PAPI">
<!--
  ==================================================
  Configuration file to measure graduated
  instructions and total cycles.
  ==================================================
-->
  <ps_hwpc_event type="preset" name="PAPI_TOT_INS" />
  <ps_hwpc_event type="preset" name="PAPI_TOT_CYC" />
</ps_hwpc_eventlist>
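For example, if you saved the document in Listing 4 as cpi.xml (the filename is arbitrary; myprog again stands in for your own executable), you could gather the two events with a command such as:

% psrun -c cpi.xml myprog

As before, the results are written to an XML document named psrun.NNNNN.xml, which you then can hand to psprocess as usual.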
The measurements described so far have been made in aggregate counting mode, where the total count of one or more performance events is measured and reported over the total runtime of your application. PerfSuite provides an additional way of looking at your application's performance. Let's say you are interested in finding out where in your application all of the level 2 cache misses occur, so that you can focus your optimization work there. In other words, you'd like a profile similar to gprof's time-based profile, but based on level 2 cache misses instead. This can be done rather easily with psrun by specifying a configuration file tailored for profiling rather than aggregate counting. The PerfSuite distribution includes a number of alternative configuration files that you can tailor as needed. Here's an example of how you would ask for a profiling experiment rather than the default total count of events:
8% psrun -c \
   /usr/local/share/perfsuite/xml/pshwpc/profile.xml \
   solver
9% psprocess -e solver psrun.24135.xml
In profiling mode, the psprocess tool also needs the name of your executable (solver) in order to do its work. The executable is required so that the symbol information it contains can be extracted and program addresses can be mapped back to source code lines. (This generally means the executable should be built to include debugging information; with most compilers, the -g option accomplishes this.)
Listing 5 shows an example of a profiling run of psrun obtained in this way. Not only is the application (solver) itself analyzed, but the report also lists any shared libraries used by the application that consumed CPU time. In this particular run, a sample was taken after every 10,000 level 2 cache miss events (the Period shown in the report), recording where in the program the processor was executing at the time. The combination of overall performance counting and profiling can be a powerful tool for learning about bottlenecks that may exist in your software and can help you quickly isolate those areas of your application most in need of attention.
Listing 5. A source code profile generated by psprocess based on level 2 cache misses (output has been truncated to fit in available space).
PerfSuite Hardware Performance Summary Report

Profile Information
===================================================
Class      : PAPI
Event      : PAPI_L2_TCM (Level 2 cache misses)
Period     : 10000
Samples    : 16132
Domain     : user
Run Time   : 319.72 (seconds)
Min Self % : (all)

Module Summary
---------------------------------------------------
Samples   Self %  Total %  Module

  16131   99.99%   99.99%  /home/nobody/solver/sol
      1    0.01%  100.00%  /lib/libc-2.2.4.so

File Summary
---------------------------------------------------
Samples   Self %  Total %  File

   5093   31.57%   31.57%  matxvec2d_blk3.f
   5015   31.09%   62.66%  cg3_blk.f
   4162   25.80%   88.46%  pc_jac2d_blk3.f
   1407    8.72%   97.18%  dot_prod2d_blk3.f
    429    2.66%   99.84%  add_exchange2d_blk3.f
     20    0.12%   99.96%  glibc-2.2.4/csu/init.c
      4    0.02%   99.99%  main3.f
      1    0.01%   99.99%  linuxthreads/weaks.c
      1    0.01%  100.00%  cs_jac2d_blk3.f

Function Summary
---------------------------------------------------
Samples   Self %  Total %  Function

   5093   31.57%   31.57%  matxvec2d_blk3
   5015   31.09%   62.66%  cg3_blk
   4162   25.80%   88.46%  pc_jac2d_blk3
   1407    8.72%   97.18%  dot_prod2d_blk3
    429    2.66%   99.84%  add_exchange2d_blk3
     20    0.12%   99.96%  ?
      4    0.02%   99.99%  main3
      1    0.01%   99.99%  __pthread_return_0
      1    0.01%  100.00%  cs_jac2d_blk3

File:Line Summary
---------------------------------------------------
Samples   Self %  Total %  File:Line

   5089   31.55%   31.55%  matxvec2d_blk3.f:19
   4125   25.57%   57.12%  pc_jac2d_blk3.f:20
   2763   17.13%   74.24%  cg3_blk.f:206
   1346    8.34%   82.59%  cg3_blk.f:346
    576    3.57%   86.16%  dot_prod2d_blk3.f:24
    524    3.25%   89.41%  cg3_blk.f:278
    489    3.03%   92.44%  dot_prod2d_blk3.f:23
    332    2.06%   94.50%  dot_prod2d_blk3.f:25
    197    1.22%   95.72%  cg3_blk.f:279
    176    1.09%   96.81%  add_exchange2d_blk3.f:29
     99    0.61%   97.42%  add_exchange2d_blk3.f:50
     71    0.44%   97.86%  add_exchange2d_blk3.f:30
     71    0.44%   98.30%  add_exchange2d_blk3.f:51
     55    0.34%   98.64%  cg3_blk.f:55
     38    0.24%   98.88%  cg3_blk.f:207
     34    0.21%   99.09%  cg3_blk.f:218
     31    0.19%   99.28%  pc_jac2d_blk3.f:27
     24    0.15%   99.43%  cg3_blk.f:139
     20    0.12%   99.55%  init.c:0
      8    0.05%   99.60%  dot_prod2d_blk3.f:22
      5    0.03%   99.63%  add_exchange2d_blk3.f:44
      4    0.02%   99.66%  matxvec2d_blk3.f:17
      4    0.02%   99.68%  cg3_blk.f:140
      3    0.02%   99.70%  cg3_blk.f:347
      3    0.02%   99.72%  cg3_blk.f:268
      3    0.02%   99.74%  cg3_blk.f:280
      3    0.02%   99.76%  pc_jac2d_blk3.f:18
      3    0.02%   99.78%  cg3_blk.f:174
This article has only scratched the surface of the techniques available to you when using hardware performance counters to measure and improve the performance of your applications. Hopefully, you now have an idea of what hardware performance counters are and how they can help you gain insight into performance bottlenecks. If you would like to get started using PerfSuite or the other tools and supporting software mentioned in this article, visit the on-line Resources.
Many different ways exist in which applications can be tuned for higher performance. In fact, the most effective of them is not loop-level improvement or tweaking, but a fundamental change to algorithms that are more computationally efficient. Ideally, your software will use efficient algorithms that have been further tuned to make effective use of your CPU. PerfSuite and other similar tools can go a long way toward making this process easier for you.
I would like to thank Professor Danesh Tafti of the Mechanical Engineering Department at Virginia Tech for providing the program used for the psrun profiling example in Listing 5. This is a computational kernel extracted from a computational fluid dynamics application named GenIDLEST that Tafti and his research team use, maintain and develop. I also would like to express my thanks to all the PAPI team members of the Innovative Computing Laboratory at the University of Tennessee-Knoxville for their support and encouragement during the development of PerfSuite.
Resources for this article: /article/8264.
Rick Kufrin currently is a senior member of the technical staff at the University of Illinois' National Center for Supercomputing Applications. He is the originator and technical lead for the PerfSuite software project described in this article and is available for consultation on the use of PerfSuite and other technologies for software improvement.