Fastest language for FOR loops

This for loop looks no more complex than this when it hits the CPU:

for(int i = 0; i != 1024; i++) translates to

mov r0, 0           ;;start the counter    
top:

;;some processing

add r0, r0, 1       ;;increment the counter by 1
jne top: r0, 1024   ;;jump to the loop top if we havn't hit the top of the for loop (1024 elements)

;;continue on

As you can tell, this is sufficiently simple you can't really optimize it very well[1]... Refocus towards the algorithm level.

The first cut at the problem is to look at cache locality. Look up the classic example of matrix multiplication and swapping the i and j indexes.

edit: As a second cut, I would suggest evaluating the algorithm for data-dependencies between iterations and data dependency between localities in your 'matrix' of data. It may be a good candidate for parallelization.

[1] There are some micro-optimizations possible, but those will not produce the speedsups you're looking for.


~300k * ~150 * ~30 * ~12 = ~16G iterations, right? This number of primitive operations should complete in a matter of minutes (if not seconds) in any compiled language on any decent CPU. Fortran, C/C++ should do it almost equally well. Even managed languages, such as Java and C#, would only fall behind by a small margin (if at all).

If you have a problem of ~16G iterations running 55 hours, this means that they are very far from being primitive (80k per second? this is ridiculous), so maybe we should know more. (as was already suggested, is disk access limiting performance? is it network access?)


As @Rotsor said, 16G operations / 55 hours is about 80,000 operations per second, or one operation every 12.5 microseconds. That's a lot of time per operation.

That means your loops are not the cause of poor performance, it's what's in the innermost loop that's taking the time. And Octave is an interpreted language. That alone means an order of magnitude slowdown.

If you want speed, you first need to be in a compiled language. Then you need to do performance tuning (aka profiling) or, just single step it in a debugger at the instruction level. That will tell you what it is actually doing in its heart of hearts. Once you've got it to where it's not wasting cycles, fancier hardware, cores, CUDA, etc. will give you a parallelism speedup. But it's silly to do that if your code is taking unnecessarily many cycles. (Here's an example of performance tuning - a 43x speedup just by trimming the fat.)

I can't believe the number of responders talking about matlab, APL, and other vectorized languages. Those are interpreters. They give you concise source code, which is not at all the same thing as fast execution. When it comes down to the bare metal, they are stuck with the same hardware as every other language.


Added: to show you what I mean, I just ran this C++ code, which does 16G operations, on this moldy old laptop, and it took 94 seconds, or about 6ns per iteration. (I can't believe you baby-sat that thing for 2 whole days.)

void doit(){
  double sum = 0;
  for (int i = 0; i < 1000; i++){
    for (int j = 0; j < 16000000; j++){
      sum += j * 3.1415926;
    }
  }
}