How can I programmatically find the CPU frequency with C

For the sake of completeness, there is already a simple, fast, accurate, user-mode solution, with one huge drawback: it works only on Intel Skylake, Kaby Lake, and newer processors. The exact requirement is support for CPUID leaf 16h. According to the Intel Software Developer's Manual 325462 release 59, page 770:

CPUID.16h.EAX = Processor Base Frequency (in MHz);
CPUID.16h.EBX = Maximum Frequency (in MHz);
CPUID.16h.ECX = Bus (Reference) Frequency (in MHz).

Visual Studio 2015 sample code:

#include <stdio.h>
#include <intrin.h>

int main(void) {
    int cpuInfo[4] = { 0, 0, 0, 0 };
    __cpuid(cpuInfo, 0);
    if (cpuInfo[0] >= 0x16) {
        __cpuid(cpuInfo, 0x16);

        //Example 1
        //Intel Core i7-6700K Skylake-H/S Family 6 model 94 (506E3)
        //cpuInfo[0] = 0x00000FA0; //= 4000 MHz
        //cpuInfo[1] = 0x00001068; //= 4200 MHz
        //cpuInfo[2] = 0x00000064; //=  100 MHz

        //Example 2
        //Intel Core m3-6Y30 Skylake-U/Y Family 6 model 78 (406E3)
        //cpuInfo[0] = 0x000005DC; //= 1500 MHz
        //cpuInfo[1] = 0x00000898; //= 2200 MHz
        //cpuInfo[2] = 0x00000064; //=  100 MHz

        //Example 3
        //Intel Core i5-7200 Kabylake-U/Y Family 6 model 142 (806E9)
        //cpuInfo[0] = 0x00000A8C; //= 2700 MHz
        //cpuInfo[1] = 0x00000C1C; //= 3100 MHz
        //cpuInfo[2] = 0x00000064; //=  100 MHz

        printf("EAX: 0x%08x EBX: 0x%08x ECX: 0x%08x\r\n", cpuInfo[0], cpuInfo[1], cpuInfo[2]);
        printf("Processor Base Frequency:  %04d MHz\r\n", cpuInfo[0]);
        printf("Maximum Frequency:         %04d MHz\r\n", cpuInfo[1]);
        printf("Bus (Reference) Frequency: %04d MHz\r\n", cpuInfo[2]);
    } else {
        printf("CPUID level 16h unsupported\r\n");
    }
    return 0;
}
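For GCC or Clang, a roughly equivalent query could look like the sketch below. It assumes the compiler's <cpuid.h> header and its __get_cpuid helper, which returns 0 when the requested leaf is above the maximum supported one. The struct and function names (cpu_freq_mhz, leaf16_field_mhz, query_leaf16) are made up for illustration. Only bits 15:0 of each register carry the frequency, so the sketch masks the rest.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header (assumed available) */
#endif

/* Hypothetical container for the three values of CPUID leaf 16h. */
struct cpu_freq_mhz {
    unsigned base;  /* EAX: processor base frequency */
    unsigned max;   /* EBX: maximum frequency */
    unsigned bus;   /* ECX: bus (reference) frequency */
};

/* Leaf 16h reports each frequency in bits 15:0; the upper bits are reserved. */
unsigned leaf16_field_mhz(unsigned reg) {
    return reg & 0xFFFFu;
}

/* Returns 1 and fills *out if leaf 16h is available, 0 otherwise. */
int query_leaf16(struct cpu_freq_mhz *out) {
    assert(out != NULL);
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    /* __get_cpuid checks the maximum supported leaf before executing CPUID. */
    if (!__get_cpuid(0x16, &eax, &ebx, &ecx, &edx)) {
        return 0;
    }
    out->base = leaf16_field_mhz(eax);
    out->max  = leaf16_field_mhz(ebx);
    out->bus  = leaf16_field_mhz(ecx);
    return 1;
#else
    return 0;  /* not an x86 build: CPUID does not exist */
#endif
}
```

Calling query_leaf16(&f) and printing f.base, f.max and f.bus should give the same three values as the Visual Studio sample on a CPU that supports leaf 16h. A field that reads 0 is best treated as not reported.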

It is possible to find a general solution that gets the operating frequency correctly for one thread or many threads. It does not need admin/root privileges or access to model-specific registers. I have tested this on Linux and Windows on Intel processors including Nehalem, Ivy Bridge, and Haswell, with one socket up to four sockets (40 threads). The results all deviate by less than 0.5% from the correct answers. Before I show you how to do this, let me show the results (from GCC 4.9 and MSVC2013):

Linux:    E5-1620 (Ivy Bridge) @ 3.60GHz    
1 thread: 3.789, 4 threads: 3.689 GHz: (3.8-3.789)/3.8 = 0.3%, (3.7-3.689)/3.7 = 0.3%

Windows:  E5-1620 (Ivy Bridge) @ 3.60GHz
1 thread: 3.792, 4 threads: 3.692 GHz: (3.8-3.792)/3.8 = 0.2%, (3.7-3.692)/3.7 = 0.2%

Linux:  4xE7-4850 (Nehalem) @ 2.00GHz
1 thread: 2.390, 40 threads: 2.125 GHz: (2.4-2.390)/2.4 = 0.4%, (2.133-2.125)/2.133 = 0.4%

Linux:    i5-4250U (Haswell) CPU @ 1.30GHz
1 thread: within 0.5% of 2.6 GHz, 2 threads: within 0.5% of 2.3 GHz

Windows: 2xE5-2667 v2 (Ivy Bridge) @ 3.3 GHz
1 thread: 4.000 GHz, 16 threads: 3.601 GHz: (4.0-4.0)/4.0 = 0.0%, (3.6-3.601)/3.6 = 0.0%

I got the idea from this link: http://randomascii.wordpress.com/2013/08/06/defective-heat-sinks-causing-garbage-gaming/

To do this, you first do what you would have done 20 years ago: write some code with a loop whose latency you know, and time it. Here is what I used:

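#include <xmmintrin.h> // SSE intrinsics

// Chain of dependent addps: each add must wait for the previous result.
static inline int SpinALot(int spinCount)
{
    __m128 x = _mm_setzero_ps();
    for(int i=0; i<spinCount; i++) {
        x = _mm_add_ps(x,_mm_set1_ps(1.0f));
    }
    return _mm_cvt_ss2si(x); // use the result so the loop is not optimized away
}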

This loop has a loop-carried dependency, so the CPU cannot reorder it to hide the latency. On the processors tested above, addps has a latency of 3 cycles, so each iteration always takes 3 clock cycles. The OS won't migrate the thread to another core because we will bind the threads.

Then you run this function on each physical core. I did this with OpenMP. The threads must be bound for this to work. On Linux with GCC you can use export OMP_PROC_BIND=true to bind the threads and, assuming you have ncores physical cores, also export OMP_NUM_THREADS=ncores. If you want to bind programmatically and find the number of physical cores for Intel processors, see programatically-detect-number-of-physical-processors-cores-or-if-hyper-threading and thread-affinity-with-windows-msvc-and-openmp.

void sample_frequency(const int nsamples, const int n, float *max, int nthreads) {
    *max = 0;
    volatile int x = 0;
    double min_time = DBL_MAX;
    #pragma omp parallel reduction(+:x) num_threads(nthreads)
    {
        double dtime, min_time_private = DBL_MAX;
        for(int i=0; i<nsamples; i++) {
             #pragma omp barrier
             dtime = omp_get_wtime();
             x += SpinALot(n);
             dtime = omp_get_wtime() - dtime;
             if(dtime<min_time_private) min_time_private = dtime;
        }
        #pragma omp critical
        {
            if(min_time_private<min_time) min_time = min_time_private;
        }
    }
    *max = 3.0f*n/min_time*1E-9f;
}
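The last line of sample_frequency is just cycles divided by time. As a small worked sketch (the function names ghz_from_timing and percent_deviation are hypothetical, not part of the code above), the conversion and the deviation figures quoted in the results can be written as:

```c
#include <assert.h>

/* Converts a timing into a frequency in GHz.
   cycles_per_iter: known latency of one loop iteration, in clock cycles
   iterations:      number of loop iterations that were timed
   seconds:         shortest observed wall-clock time for the loop */
double ghz_from_timing(double cycles_per_iter, double iterations, double seconds) {
    assert(seconds > 0.0);
    return cycles_per_iter * iterations / seconds * 1e-9;
}

/* Relative deviation of a measurement from the expected frequency, in percent. */
double percent_deviation(double expected_ghz, double measured_ghz) {
    assert(expected_ghz > 0.0);
    return (expected_ghz - measured_ghz) / expected_ghz * 100.0;
}
```

For example, 1000000 iterations at 3 cycles each in 1 ms gives 3.0 GHz, and a measured 3.789 GHz against an expected 3.8 GHz deviates by about 0.29%.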

Finally, run the sampler in a loop and print the results:

int main(void) {
    // getNumCores() is not shown here; it must return the number of physical cores.
    int ncores = getNumCores();
    printf("num_threads %d, num_cores %d\n", omp_get_max_threads(), ncores);
    while(1) {
        float max1, max2;
        // Arguments match sample_frequency(nsamples, n, max, nthreads) above.
        sample_frequency(1000, 1000000, &max2, ncores);
        sample_frequency(1000, 1000000, &max1, 1);
        printf("1 thread: %.3f, %d threads: %.3f GHz\n", max1, ncores, max2);
    }
}

I have not tested this on AMD processors. I think AMD processors with modules (e.g. Bulldozer) will have to bind to each module, not each AMD "core". This could be done with export GOMP_CPU_AFFINITY with GCC. You can find a full working example at https://bitbucket.org/zboson/frequency which works on Windows and Linux on Intel processors, correctly finds the number of physical cores for Intel processors (at least since Nehalem), and binds the threads to each physical core (without using OMP_PROC_BIND, which MSVC does not have).

This method has to be modified a bit for modern processors due to different frequency scaling for SSE, AVX, and AVX512.

Here is the new table (frequencies in GHz) that I get after modifying my method (see the code after the table) with four Xeon 6142 processors (16 cores per processor).

        sums  1-thread  64-threads
SSE        1       3.7         3.3
SSE        8       3.7         3.3
AVX        1       3.7         3.3
AVX        2       3.7         3.3
AVX        4       3.6         2.9
AVX        8       3.6         2.9
AVX512     1       3.6         2.9
AVX512     2       3.6         2.9
AVX512     4       3.5         2.2
AVX512     8       3.5         2.2

These numbers agree with the frequencies in this table: https://en.wikichip.org/wiki/intel/xeon_gold/6142#Frequencies

The interesting thing is that I now need to do at least 4 parallel sums to reach the lower frequencies. The latency of addps on Skylake is 4 clock cycles. These can go to two ports (with AVX512, ports 0 and 1 fuse to count as one AVX512 port, and the other AVX512 operations go to port 5).
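As a rough way to reason about how many parallel sums matter, here is a simplified model (an assumption, not a measurement): an add with latency L that can issue to P ports needs L*P independent chains to keep the ports busy, and one loop iteration with k chains takes max(L, ceil(k/P)) cycles. The function names are hypothetical and loop overhead is ignored.

```c
#include <assert.h>

/* Number of independent dependency chains needed to keep `ports`
   execution ports busy when each add has a latency of `latency` cycles. */
int chains_to_saturate(int latency, int ports) {
    assert(latency > 0 && ports > 0);
    return latency * ports;
}

/* Idealized cycles for one loop iteration that performs `chains`
   independent adds: limited by the latency of a single chain and by
   the issue throughput (chains / ports, rounded up). */
int cycles_per_iteration(int chains, int latency, int ports) {
    assert(chains > 0 && latency > 0 && ports > 0);
    int throughput_bound = (chains + ports - 1) / ports;
    return throughput_bound > latency ? throughput_bound : latency;
}
```

With latency 4 and two ports, eight chains saturate the ports, and one iteration of a loop with up to eight independent adds still takes about 4 cycles under this model.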

Here is how I did eight parallel sums.

static inline int SpinALot(int spinCount) {
  __m512 x1 = _mm512_set1_ps(1.0);
  __m512 x2 = _mm512_set1_ps(2.0);
  __m512 x3 = _mm512_set1_ps(3.0);
  __m512 x4 = _mm512_set1_ps(4.0);
  __m512 x5 = _mm512_set1_ps(5.0);
  __m512 x6 = _mm512_set1_ps(6.0);
  __m512 x7 = _mm512_set1_ps(7.0);
  __m512 x8 = _mm512_set1_ps(8.0);
  __m512 one = _mm512_set1_ps(1.0);
  for(int i=0; i<spinCount; i++) {
    x1 = _mm512_add_ps(x1,one);
    x2 = _mm512_add_ps(x2,one);
    x3 = _mm512_add_ps(x3,one);
    x4 = _mm512_add_ps(x4,one);
    x5 = _mm512_add_ps(x5,one);
    x6 = _mm512_add_ps(x6,one);
    x7 = _mm512_add_ps(x7,one);
    x8 = _mm512_add_ps(x8,one);
  }
  __m512 t1 = _mm512_add_ps(x1,x2);
  __m512 t2 = _mm512_add_ps(x3,x4);
  __m512 t3 = _mm512_add_ps(x5,x6);
  __m512 t4 = _mm512_add_ps(x7,x8);
  __m512 t6 = _mm512_add_ps(t1,t2);
  __m512 t7 = _mm512_add_ps(t3,t4);
  __m512  x = _mm512_add_ps(t6,t7);
  return _mm_cvt_ss2si(_mm512_castps512_ps128(x));
}

The CPU frequency is hardware-dependent, so there is no general method you can apply to get it; it also depends on the OS you are using.

For example, on Linux you can read the file /proc/cpuinfo, parse the dmesg boot log to get this value, or look at how the Linux kernel handles this here and adapt the code to your needs:

https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/proc.c
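A minimal sketch of the /proc/cpuinfo approach follows. It assumes the x86 Linux layout, where each logical CPU has a line of the form "cpu MHz : <value>"; other architectures may not have that field. The helper names parse_cpu_mhz_line and read_first_cpu_mhz are illustrative only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parses one line of /proc/cpuinfo. Returns 1 and stores the value in
   *mhz if the line is a "cpu MHz" entry, otherwise returns 0. */
int parse_cpu_mhz_line(const char *line, double *mhz) {
    assert(line != NULL && mhz != NULL);
    if (strncmp(line, "cpu MHz", 7) != 0) {
        return 0;
    }
    const char *colon = strchr(line, ':');
    if (colon == NULL) {
        return 0;
    }
    return sscanf(colon + 1, "%lf", mhz) == 1;
}

/* Returns the first "cpu MHz" value found in /proc/cpuinfo,
   or -1.0 if the file or the field is not available. */
double read_first_cpu_mhz(void) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (f == NULL) {
        return -1.0;
    }
    char line[256];
    double mhz = -1.0;
    while (fgets(line, sizeof line, f) != NULL) {
        if (parse_cpu_mhz_line(line, &mhz)) {
            break;
        }
    }
    fclose(f);
    return mhz;
}
```

The value is whatever the kernel reports for that logical CPU at the moment of the read, so it can change between calls when frequency scaling is active.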

How you find the CPU frequency is both architecture AND OS dependent, and there is no abstract solution.

If this were 20+ years ago and you were using an OS with no context switching on a CPU that executed instructions in the order given, you could write some C code in a loop and time it, then, based on the assembly it was compiled into, compute the number of instructions executed at run time. This already assumes that each instruction takes 1 clock cycle, which has been a rather poor assumption ever since pipelined processors.

But any modern OS will switch between multiple processes. Even then you can attempt to time a bunch of identical for loop runs (ignoring time needed for page faults and multiple other reasons why your processor might stall) and get a median value.
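Taking the median of repeated timings can be sketched as follows. The helper name median_seconds is hypothetical, and the samples are assumed to be per-run durations in seconds collected with whatever timer you use. The median is less sensitive than the mean to the occasional run that was interrupted by a context switch or a page fault.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator for doubles in ascending order. */
static int compare_doubles(const void *a, const void *b) {
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts the samples in place and returns their median.
   For an even count, the two middle values are averaged. */
double median_seconds(double *samples, size_t count) {
    assert(samples != NULL && count > 0);
    qsort(samples, count, sizeof samples[0], compare_doubles);
    if (count % 2 == 1) {
        return samples[count / 2];
    }
    return 0.5 * (samples[count / 2 - 1] + samples[count / 2]);
}
```

Note that the function reorders the array it is given, so copy the samples first if their original order matters.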

And even if the previous solution works, you have multi-issue processors. With any modern processor, it's fair game to re-order your instructions, issue a bunch of them in the same clock cycle, or even split them across cores.
