Beyond Stack Sampling: C++ Profilers

What happened when you tried random pausing? I use it all the time on a monster app. You said it did not give enough information, and you've suggested you need high resolution. Sometimes people need a little help in understanding how to use it.

What I do, under VS, is configure the stack display so it doesn't show me the function arguments, because that makes the stack display totally unreadable, IMO.

Then I take about 10 samples by hitting "pause" during the time it's making me wait. I use ^A, ^C, and ^V to copy them into notepad, for reference. Then I study each one, to try to figure out what it was in the process of trying to accomplish at that time.

If it was trying to accomplish something on 2 or more samples, and that thing is not strictly necessary, then I've found a live problem, and I know roughly how much fixing it will save.

There are things you don't really need to know, like precise percents are not important, and what goes on inside 3rd-party code is not important, because you can't do anything about those. What you can do something about is the rich set of call-points in code you can modify displayed on each stack sample. That's your happy hunting ground.

Examples of the kinds of things I find:

  • During startup, it can be about 30 layers deep, in the process of trying to extract internationalized character strings from DLL resources. If the actual strings are examined, it can easily turn out that the strings don't really need to be internationalized, like they are strings the user never actually sees.

  • During normal usage, some code innocently sets a Modified property in some object. That object comes from a super-class that captures the change and triggers notifications that ripple throughout the entire data structure, manipulating the UI, creating and desroying obects in ways hard to foresee. This can happen a lot - the unexpected consequences of notifications.

  • Filling in a worksheet row-by-row, cell-by-cell. It turns out if you build the row all at once, from an array of values, it's a lot faster.

P.S. If you're multi-threaded, when you pause it, all threads pause. Take a look at the call stack of each thread. Chances are, only one of them is the real culprit, and the others are idling.


First:

Time sampling profilers are more robust than CPU sampling profilers. I'm not extremely familiar with Windows development tools so I can't say which ones are which. Most profilers are CPU sampling.

A CPU sampling profiler grabs a stack trace every N instructions.
This technique will reveal portions of your code that are CPU bound. Which is awesome if that is the bottle neck in your application. Not so great if your application threads spend most of their time fighting over a mutex.

A time sampling profiler grabs a stack trace every N microseconds.
This technique will zero in on "slow" code. Whether the cause is CPU bound, blocking IO bound, mutex bound, or cache thrashing sections of code. In short what ever piece of code is slowing your application will standout.

So use a time sampling profiler if at all possible especially when profiling threaded code.

Second:

Sampling profilers generate gobs of data. The data is extremely useful, but there is often too much to be easily useful. A profile data visualizer helps tremendously here. The best tool I've found for profile data visualization is gprof2dot. Don't let the name fool you, it handles all kinds of sampling profiler output (AQtime, Sleepy, XPerf, etc). Once the visualization has pointed out the offending function(s), jump back to the raw profile data to get better hints on what the real cause is.

The gprof2dot tool generates a dot graph description that you then feed into a graphviz tool. The output is basically a callgraph with functions color coded by their impact on the application. alt text

A few hints to get gprof2dot to generate nice output.

  • I use a --skew of 0.001 on my graphs so I can easily see the hot code paths. Otherwise the int main() dominates the graph.
  • If you're doing anything crazy with C++ templates you'll probably want to add --strip. This is especially true with Boost.
  • I use OProfile to generate my sampling data. To get good output I need configure it to load the debug symbols from my 3rd party and system libraries. Be sure to do the same, otherwise you'll see that CRT is taking 20% of your application's time when what's really going on is malloc is trashing the heap and eating up 15%.