How does OpenMP handle nested loops?

The following example should make this clearer. Let's do this with two threads.

#pragma omp parallel for num_threads(2)
for (int i = 0; i < 3; i++) {
    for (int j = 0; j < 3; j++) {
        printf("i = %d, j= %d, threadId = %d \n", i, j, omp_get_thread_num());
    }
}

Then the result will be:

i = 0, j= 0, threadId = 0 
i = 0, j= 1, threadId = 0 
i = 0, j= 2, threadId = 0 
i = 1, j= 0, threadId = 0 
i = 1, j= 1, threadId = 0 
i = 1, j= 2, threadId = 0 
i = 2, j= 0, threadId = 1 
i = 2, j= 1, threadId = 1 
i = 2, j= 2, threadId = 1

That means that when you add #pragma omp parallel for to the outermost for loop, the index of that loop is divided among the threads. As you can see, whenever i has the same value, the thread id is also the same.
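For reference, here is a minimal, self-contained version of that example (the exact thread assignment and output order depend on the OpenMP runtime, so your output may differ from the listing above):

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* The outer loop index i is split between the two threads; each thread
       then runs the full inner j loop for the i values it was given. */
    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            printf("i = %d, j= %d, threadId = %d \n", i, j, omp_get_thread_num());
        }
    }
    return 0;
}

Compile it with something like gcc -fopenmp example.c.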

Instead of that, we can parallelize over the combinations produced by the nested loops. In this example we have the following combinations of i and j.

i = 0, j= 0
i = 0, j= 1
i = 0, j= 2
i = 1, j= 0
i = 1, j= 1
i = 1, j= 2
i = 2, j= 0
i = 2, j= 1
i = 2, j= 2

In order to parallelize the code combination-wise, we can add the collapse clause as follows.

#pragma omp parallel for num_threads(2) collapse(2)
for (int i = 0; i < 3; i++) {
    for (int j = 0; j < 3; j++) {
        printf("i = %d, j= %d, threadId = %d \n", i, j, omp_get_thread_num());
    }
}

Then the result will be as follows (the ordering can vary from run to run):

i = 0, j= 0, threadId = 0 
i = 0, j= 1, threadId = 0 
i = 1, j= 2, threadId = 1 
i = 2, j= 0, threadId = 1 
i = 2, j= 1, threadId = 1 
i = 2, j= 2, threadId = 1 
i = 0, j= 2, threadId = 0 
i = 1, j= 0, threadId = 0 
i = 1, j= 1, threadId = 0 

You can see that, unlike before, the same index i can now appear with different thread ids (for example, i=1, j=2 runs on thread 1 while i=1, j=0 runs on thread 0). That means that in this scenario the combinations of i and j are divided among the threads.
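If you want to see that split more concretely, one way (a small sketch, again assuming two threads; the exact 5/4 split depends on the runtime's default schedule) is to count how many (i, j) combinations each thread executes:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int count[2] = {0, 0};   /* one counter per thread; assumes num_threads(2) */

    #pragma omp parallel for num_threads(2) collapse(2)
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            /* each thread only increments its own slot, so there is no race here */
            count[omp_get_thread_num()]++;
        }
    }

    /* the 3x3 loop nest collapses into 9 iterations; with two threads a
       typical split is 5 combinations for one thread and 4 for the other */
    printf("thread 0: %d, thread 1: %d\n", count[0], count[1]);
    return 0;
}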


The lines you have written will parallelize only the outer loop. To parallelize both loops you need to add a collapse clause:

#pragma omp parallel for collapse(2)
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        // do task(i, j)
    }
}

You may want to check the OpenMP 3.1 specification (sec. 2.5.1) for more details.
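As a usage sketch of the skeleton above, here it is with a made-up task that simply fills an N x M array (the sizes and the loop body are placeholders, not part of the original question):

#include <stdio.h>

#define N 4
#define M 5

int main(void) {
    static double a[N][M];

    /* collapse(2) merges the two perfectly nested loops into a single
       iteration space of N * M independent (i, j) pairs */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            a[i][j] = i * M + j;   /* stands in for task(i, j) */
        }
    }

    printf("a[%d][%d] = %.0f\n", N - 1, M - 1, a[N - 1][M - 1]);
    return 0;
}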