How does a system wide profiler (e.g. perf) correlate counters with instructions?

So I think there's some kernel module that launches software interrupts at a certain sampling rate.

Perf is not a module; it is part of the Linux kernel, implemented in kernel/events/core.c plus architecture- and CPU-model-specific code, for example arch/x86/kernel/cpu/perf_event*.c. (Oprofile, by contrast, was a module with a similar approach.)

Perf generally works by asking the CPU's PMU (Performance Monitoring Unit) to generate an interrupt after N events of some hardware performance counter (Yokohama, slide 5: "Interrupt when threshold reached: allows sampling"). In practice it may be implemented as follows (see the sketch after the list):

  • select some PMU counter
  • initialize it to -N, where N is the sampling period (we want an interrupt after N events; for example, after 2 million cycles with perf record -c 2000000 -e cycles, or some N computed and tuned by perf itself when no explicit period is set or -F is given)
  • program this counter for the wanted event and ask the PMU to generate an interrupt on overflow (ARCH_PERFMON_EVENTSEL_INT); the interrupt will fire after N increments of the counter
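Put together, a minimal sketch of those three steps for one x86 general-purpose counter might look like the code below. This is illustrative kernel-side code, not the actual perf implementation (that lives in arch/x86/kernel/cpu/perf_event*.c and handles counter allocation, counter widths, NMI setup, etc.); the MSR numbers and bit positions are the architectural ones from the Intel SDM, and program_cycle_sampling is a hypothetical helper:

    /* Sketch only, assuming kernel context -- NOT the real perf code. */
    #include <linux/types.h>
    #include <asm/msr.h>                 /* wrmsrl() */

    #define MSR_IA32_PMC0        0x0c1   /* general-purpose counter 0 */
    #define MSR_IA32_PERFEVTSEL0 0x186   /* its event-select register */

    #define EVTSEL_EVENT_CYCLES  0x3cULL        /* unhalted core cycles */
    #define EVTSEL_USR           (1ULL << 16)   /* count in user mode   */
    #define EVTSEL_OS            (1ULL << 17)   /* count in kernel mode */
    #define EVTSEL_INT           (1ULL << 20)   /* ARCH_PERFMON_EVENTSEL_INT */
    #define EVTSEL_ENABLE        (1ULL << 22)   /* start counting       */

    static void program_cycle_sampling(u64 n)   /* n = sampling period */
    {
        /* step 2: preload with -N so the counter overflows (and the PMU
         * raises the interrupt) after n increments; real code masks this
         * to the counter width (x86_pmu.cntval_mask) */
        wrmsrl(MSR_IA32_PMC0, (u64)-(s64)n);

        /* steps 1+3: select the cycles event on counter 0 and enable
         * interrupt-on-overflow */
        wrmsrl(MSR_IA32_PERFEVTSEL0, EVTSEL_EVENT_CYCLES | EVTSEL_USR |
               EVTSEL_OS | EVTSEL_INT | EVTSEL_ENABLE);
    }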

All modern Intel chips support this; see, for example, the Nehalem Performance Monitoring Unit Programming Guide: https://software.intel.com/sites/default/files/76/87/30320

EBS - Event Based Sampling: a technique in which counters are pre-loaded with a large negative count and configured to interrupt the processor on overflow. When the counter overflows, the interrupt service routine captures profiling data.

So, when you use the hardware PMU, there is no extra work at each timer interrupt to read the PMU counters. There is some work to save/restore PMU state at task switch, but this (*_sched_in/*_sched_out in kernel/events/core.c) neither changes the PMU counter value for the current thread nor exports it to user-space.

There is a handler, arch/x86/kernel/cpu/perf_event.c: x86_pmu_handle_irq, which finds the overflowed counter and calls perf_sample_data_init(&data, 0, event->hw.last_period); to record the current time, the IP of the last executed instruction (it can be inexact because of the out-of-order nature of most Intel microarchitectures; there is a limited workaround for some events - PEBS, perf record -e cycles:pp), stacktrace data (if -g was used in record), etc. Then the handler resets the counter value to -N (x86_perf_event_set_period, wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask); - note the minus before left).
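A heavily simplified sketch of that handler's core loop is below. It is loosely modeled on x86_pmu_handle_irq; counter_overflowed is a hypothetical helper standing in for the real overflow check, and the real code also deals with NMI sharing, PEBS, fixed counters and interrupt throttling:

    /* Sketch of the PMI handler core, loosely following
     * x86_pmu_handle_irq() in arch/x86/kernel/cpu/perf_event.c */
    static int sketch_pmu_handle_irq(struct pt_regs *regs)
    {
        struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
        struct perf_sample_data data;
        int idx, handled = 0;

        for (idx = 0; idx < x86_pmu.num_counters; idx++) {
            struct perf_event *event = cpuc->events[idx];

            if (!test_bit(idx, cpuc->active_mask))
                continue;
            if (!counter_overflowed(idx))       /* hypothetical helper */
                continue;
            handled++;

            /* record the period; time, IP and stacktrace come from the
             * interrupted context ('regs') when the sample is written */
            perf_sample_data_init(&data, 0, event->hw.last_period);

            /* reload the counter with -N for the next sample */
            if (!x86_perf_event_set_period(event))
                continue;

            /* emit the sample into the ring buffer read by the userspace
             * perf tool; stop the event if it gets throttled */
            if (perf_event_overflow(event, &data, regs))
                x86_pmu_stop(event, 0);
        }
        return handled;
    }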

The lower the sampling rate, the lower the profiler overhead.

Perf allows you to set a target sampling rate with the -F option; -F 1000 means around 1000 irq/s. High rates are not recommended because of the high overhead. Ten years ago Intel VTune recommended not more than 1000 irq/s (http://www.cs.utah.edu/~mhall/cs4961f09/VTune-1.pdf: "Try to get about a 1000 samples per second per logical CPU."). Perf usually doesn't allow too high a rate for non-root users (the rate is autotuned down when the kernel logs "perf interrupt took too long" - check your dmesg; also check sysctl -a | grep perf, for example kernel.perf_cpu_time_max_percent=25, which means perf will try to use not more than 25% of CPU time for its interrupt handling).

Can you interrogate, for example, the task scheduler to find out what was running when you interrupted it?

No. But you can enable a tracepoint at sched_switch or another sched event (list all available with perf list 'sched:*') and use it as the profiling event for perf. You can even ask perf to record a stacktrace at this tracepoint:

 perf record -a -g -e "sched:sched_switch" sleep 10

Won't that affect the execution of the scheduler?

An enabled tracepoint will add some perf-event sampling work to the function containing the tracepoint.

Is the list of task_struct objects available?

Only via ftrace...

Information about context switches

This is a software perf event - just a call to perf_sw_event with the PERF_COUNT_SW_CONTEXT_SWITCHES event, made (indirectly) from sched/core.c. An example of a direct call is the migration software event, in kernel/sched/core.c set_task_cpu(): p->se.nr_migrations++; perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
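If you want to observe such a software event from userspace without the perf tool, a small sketch with perf_event_open(2) can count context switches for the calling thread. The syscall, event constants and ioctls are real; the surrounding program is just an illustration:

    /* Count PERF_COUNT_SW_CONTEXT_SWITCHES for this thread.
     * Build: cc sw_switches.c -o sw_switches (Linux only). */
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sched.h>

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_SOFTWARE;               /* no PMU needed */
        attr.config = PERF_COUNT_SW_CONTEXT_SWITCHES;
        attr.disabled = 1;

        /* this process, any CPU */
        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        for (int i = 0; i < 1000; i++)
            sched_yield();        /* try to provoke context switches */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count = 0;
        read(fd, &count, sizeof(count));
        printf("context switches: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }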

PS: there are good slides on perf, ftrace and other profiling and tracing subsystems in Linux by Brendan Gregg: http://www.brendangregg.com/linuxperf.html


This pretty much answers all three of your questions.

Profiling consists of two types: counting and sampling. Counting measures the overall number of events during the entire execution without offering any insight into which instructions or functions generated them. Sampling, on the other hand, correlates events to code through captured samples of the Instruction Pointer. When sampling, the kernel instructs the processor to issue an interrupt when a chosen event counter exceeds a threshold. This interrupt is caught by the kernel, and the sampled data, including the Instruction Pointer value, are stored in a ring buffer. The buffer is polled periodically by the userspace perf tool and its contents written to disk. In post-processing, the Instruction Pointer is matched to addresses in binary files, which can be translated into function names and so on.
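As an illustration of that pipeline, here is a hedged sketch that opens a sampling event with perf_event_open(2), mmaps the kernel's ring buffer, and walks the PERF_RECORD_SAMPLE records to print raw Instruction Pointer values. The constants and struct layouts are from the perf_event_open(2) man page; a record wrapping past the buffer end is not handled, for brevity:

    /* Sample 'cycles' and read IPs from the mmap'd ring buffer. */
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    #define DATA_PAGES 8   /* data area must be 2^n pages */

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.sample_period = 1000000;       /* PMI every 1M cycles */
        attr.sample_type = PERF_SAMPLE_IP;  /* each sample: one u64 IP */
        attr.disabled = 1;
        attr.exclude_kernel = 1;            /* friendlier for non-root */

        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        long psz = sysconf(_SC_PAGESIZE);
        /* 1 metadata page + DATA_PAGES pages of sample records */
        void *ring = mmap(NULL, (1 + DATA_PAGES) * psz,
                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (ring == MAP_FAILED) { perror("mmap"); return 1; }
        struct perf_event_mmap_page *meta = ring;
        uint8_t *data = (uint8_t *)ring + psz;

        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        for (volatile long i = 0; i < 100000000; i++)
            ;                               /* burn cycles -> samples */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        /* walk the records between data_tail and data_head */
        uint64_t head = meta->data_head;
        __sync_synchronize();               /* barrier after data_head */
        for (uint64_t tail = meta->data_tail; tail < head; ) {
            struct perf_event_header *h =
                (void *)(data + (tail % (DATA_PAGES * psz)));
            if (h->type == PERF_RECORD_SAMPLE)
                printf("ip: %#llx\n",
                       (unsigned long long)*(uint64_t *)(h + 1));
            tail += h->size;
        }
        close(fd);
        return 0;
    }

In post-processing, a real profiler resolves each printed address against the loaded binaries' symbol tables, which is how samples become function names.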

Refer to http://openlab.web.cern.ch/sites/openlab.web.cern.ch/files/technical_documents/TheOverheadOfProfilingUsingPMUhardwareCounters.pdf