gprof, bprof and Time Profilers
A friend of mine was explaining to me why he thought his program wasn't running fast enough. “Have you profiled it?” I asked. “No, but I'm pretty sure I know where the bottleneck is,” he replied—famous last words. “Well, let's try the profiler,” I said. Profiling quickly revealed that 98% of the CPU was being spent in one subroutine of my friend's program and that 86% of the CPU was being spent in one line, and he was wrong about the location of the bottleneck.
Profilers are invaluable tools that let you know where a program is spending most of its time. This information is extremely valuable because it tells you where your time can best be spent in making your program more efficient and where you are wasting your time. Typical programs are not quite as lopsided as in the above story. The “80-20” rule says that a program will spend 80% of its time in about 20% of the code.
If the total running time of a program is n, then we can break up the running time into pieces:
total-time = a1*n + a2*n + a3*n + ...
where the “a”s represent fractions of the total time that your program spends in a particular segment. The sum of the “a”s must add to one. The 80-20 rule says that one of the “a”s will be quite large. For example, suppose a1 is 8/10 and a2 is 2/10. These numbers correspond to a program which spends 80% of its time doing a1 and 20% for everything else.
total-time = .8*n + .2*nNow suppose we optimize a2 so that it runs twice as fast as before—a significant speedup. The time 0.2*n is now 0.1*n. The total running time is now 0.8n+0.1n = 0.9n, meaning the whole program executes in 90% of the time that it originally did. Suppose we instead concentrate on the other piece. If we halve the running time of the first piece, it becomes 0.4n. 0.4n+0.2n=0.6n, or 60% of the original running time. As you can see, it is worth our while to concentrate on the particular portion of the program that dominates the runtime.
In my friend's case, we were able to optimize that single line a bit. The real optimization came two days later when he told me that he had removed the whole subroutine in question, simply by changing how he thought about his data.
The easiest profiler to use under Linux is the gprof profiler. gprof is a standard part of the GNU development tools. If you have gcc installed, you probably have gprof too. To use gprof, simply recompile your program with gcc using the -pg switch. This option causes gcc to insert a bit of extra code into the beginning of each subroutine in your program. The -pg switch must also be used when you link your program, since another snippet of code must be present to tie the pieces together.
After recompiling, run your program. It will execute slightly slower because of the work needed to profile the code, but it shouldn't be too slow. After the program finishes, there will be a file named gmon.out in the current directory. This file contains the profiling information collected during the program's run. gprof is used to print this in a human readable form.
gprof outputs information in two ways: a flat profile and a call graph. The flat profile tells you how much time the program spent in all of the subroutines, and the call graph tells you which subroutines called which subroutines.
The first part of a flat profile is shown in Listing 1. The “self seconds” column shows how much time was taken up by each subroutine. The number of times each subroutine was invoked is shown in the “calls” column. The “self ms/call” columns gives the average time in milliseconds spent in a given subroutine, while the “total ms/call” includes time spent in subroutines called by that subroutine as well. For reasons explained in the gprof documentation, this last column is actually only a guess and should not be relied upon.
In this example, an important thing to note is that the mcount subroutine is the actual profiling subroutine call inserted by gcc when it compiles code with the -pg switch. The fact that the program spends nearly 20% of its time here indicates that a lot of calling and returning is happening, and that one way to speed up the code would be to eliminate some of the subroutine calls. Which ones? The subroutines rnd and uni are likely candidates, since they are called 142 million times in 900 seconds.
The other profiler in common use on Linux is bprof. The major difference between bprof and gprof is that bprof gives timings on a source line basis while gprof has only subroutine-level resolution, and also includes information like invocation counts. To use bprof, link an object file, bprof.o, into your program. After you've run your program, a file named bmon.out contains the timing information. Run bprof on this data file, and it makes copies of your source files with timing numbers prepended to each line.
The profilers under Linux work by examining the program counter at regular intervals to see where in the code the program is actually working. Although easy to implement, there is a randomness to this process that results in a certain amount of noise in the measurements. Over “long” intervals, the amount of good data overwhelms the noise. This is usually good enough in practice, when we are only interested in finding the bottlenecks.
Both gprof and bprof use timers that only run while your program is actually running. One of the ways that the kernel can use a lot of time on your program's behalf is by reading or writing huge amounts of data. These times do not accumulate into the profile. A more subtle way in which the kernel can eat a good deal of time is if your program uses so much memory that the kernel must swap part of your program in and out of memory. This situation is called page thrashing because you can usually hear your disk thrashing around.
A simple way of checking the system time is to run your program with this command:
time <
After your program finishes, three times are printed: user time (the time the CPU spent running your program), system time (the time the CPU spent in the kernel serving your program) and elapsed time (real time, sometimes referred to as “wall clock” time). By comparing these times you can get a rough view of how much work the kernel is having to do for you.
My favorite profiler is a program called pixie, which is unfortunately not available for Linux. Pixie works by actually reading the executable, inserting counting code into “basic blocks” of code that can only be entered at the start and exited at the end. Support exists in gcc today for counting execution of these basic blocks (the -a option), but getting actual times for each block is not yet supported.
So now you know the location of the bottleneck in your program. There are a couple of simple techniques for making things go faster. The easiest is of course to use the -O flag of gcc to optimize the code. Be warned that optimizers are notorious for generating bad code.
It is often possible to decrease running time in exchange for an increase in space. Consider the following (FORTRAN) code fragment in Listing 2.
Profiling revealed that the program was spending over a quarter of its time in this loop, not just because it is slow, but because it was also being called frequently. Since it is called frequently, the variables cx, cy and cz are recalculated for each loop iteration. If we precalculate these values into the tcentr() array, four array references, three floating-point additions and a multiplication are replaced by a single array reference. A lot of code in a critical loop is thereby eliminated.
The cherry on this is moving the multiplication by 0.25 (multiplication by 0.25 is faster than dividing by 4.0) out of the loop altogether; thus, instead of multiplying each element of the sum by 0.25, we multiply the whole sum by 0.25. Since the loop happened to execute about 100,000 times or so, we've eliminated 99,999 floating-point multiplies. The new code is shown in Listing 3.
At this point, the program was profiled again, and this subroutine had dropped to taking 15% of the total time with 10% being taken up by the square root calculation. Since a different subroutine was now dominating the run time, the focus of the optimization effort moved away from this subroutine.
There is one more semi-easy thing that can be done for this code. It happens that the square roots are known to be needed for a very limited range of values. So, we can replace the square root function with a function that outputs the same value by looking up precalculated values in a table and returning an interpolated value for values that occur between table entries. Again, we have traded space for time.
Another common way of speeding up a program is to replace array references with pointers. In the example at the start of this article, the line that accounted for 98% of my friend's program looked like:
int i, array[1000000]; ... i = 0; while(array[i] == 0) i++;
This line searches for the next non-zero element of an array. The operation of referencing the array consists of multiplying i times a scaling factor (which is implemented as binary left shifts), adding this value to the start of array[] and fetching from that location. Replacing this code with:
int *p, i, array[1000000]; ... p = array; while(*p == 0) p++; i = p - array;eliminates the scaling and addition and sped things up by about 10%.
Another change we tried in this case was unrolling the loop. The code is replaced by:
int *p, i, array[1000000]; ... p = array; for(;;) { if (*p++ != 0) break; if (*p++ != 0) break; if (*p++ != 0) break; if (*p++ != 0) break; }
The idea here is that on certain types of machines, taking a branch is expensive while rejecting a branch is cheap. In a very tight loop, the overhead of the loop can end up being a significant part of the total time. The code in the second example has been rewritten so that most branches are not taken and that more work is done in the body of the loop for each iteration of the loop.
As it turned out, this “optimization” didn't speed anything up for the machine we were using. A good compiler compiling with -O will unroll short loops for you. It is important to profile before and after to see if what you've done has helped or hurt.
The other main option for optimizing code consists of simply looking for a better algorithm. For example, suppose we want to search an array for a particular entry. If the array is very small, we can simply check each element in turn. When the number of elements becomes large, hash tables are a quick and easy way to prune the number of elements that must be searched. For data that cannot be hashed, tree seaches provide another alternative. Hashing and trees are beyond the scope of this article, but should be a part of any programmer's bag of tricks. Any good book on data structures can show you how they work.
Machine specific optimizations are generally best left to the compiler. Compilers are becoming quite good with the simplest sort of optimizations, and gcc is one of the best. Once the profiler has located the slow portion of the program, the best way to optimize it is to simply imagine having to calculate everything by hand. Hopefully, you will notice improvements that a compiler will miss.
After all, the perfect compiler will never be written. There is an old joke that once a computer is built that can write code as well as a person, that computer will expect to be paid for its work. Hey, it's job security.
Andy Vaught is currently a Ph.D. candidate in computational physics at Arizona State University and has been running Linux since 1.1. He enjoys flying with the Civil Air Patrol as well as skiing. He can be reached at andy@maxwell.la.asu.edu.