Is it possible to make a thread join a 'parallel for' region after it finishes its job?

What about something like this?

#pragma omp parallel
{
    // note the nowait here so that other threads jump directly to the for loop
    #pragma omp single nowait
    { /* job(2) */ }

    #pragma omp for schedule(dynamic, 32)
    for (int i = 0 ; i < 10000000; ++i) {
        /* job(1) */
    }
}

I did not test this, but the single block will be executed by only one thread, while all the others jump directly to the for loop thanks to the nowait. I also think it is easier to read than a version with sections.
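A minimal, testable sketch of the single nowait plus for approach, assuming a C99 compiler and ideally -fopenmp (without it the pragmas are ignored and everything runs serially on thread 0). The names side_job and run_single_then_for are hypothetical, and the job(1) body is only a counter. Instead of printing every iteration, it records which thread ran job(2) and how many loop iterations that thread picked up after coming back.

```c
#include <assert.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch still builds without -fopenmp. */
static int omp_get_thread_num(void) { return 0; }
#endif

/* Hypothetical stand-in for job(2): a piece of work only one thread runs. */
static double side_job(void) {
    double acc = 0.0;
    for (int k = 1; k <= 2000000; ++k)
        acc += 1.0 / k;
    return acc;
}

/* job(2) runs in a single nowait block and job(1) is a dynamically
 * scheduled loop. Returns the number of loop iterations executed, and
 * reports which thread ran job(2) and how many iterations it took later. */
static long run_single_then_for(int n, int *single_tid, long *single_iters) {
    long total = 0;
    long late = 0;
    int tid_single = -1;
    double side_result = 0.0;

    #pragma omp parallel
    {
        long mine = 0; /* iterations executed by this thread (private) */

        /* nowait: the other threads skip straight to the loop below. */
        #pragma omp single nowait
        {
            side_result = side_job();
            tid_single = omp_get_thread_num();
        }

        /* The thread that ran job(2) arrives late and simply grabs
         * whichever chunks of 32 iterations are still left. */
        #pragma omp for schedule(dynamic, 32)
        for (int i = 0; i < n; ++i)
            mine++; /* placeholder for job(1) */
        /* The loop's implicit barrier makes tid_single visible here. */

        #pragma omp atomic
        total += mine;
        if (omp_get_thread_num() == tid_single)
            late = mine; /* only one thread matches, so no race */
    }

    assert(side_result > 0.0); /* job(2) really ran */
    assert(total == n);        /* every iteration ran exactly once */
    *single_tid = tid_single;
    *single_iters = late;
    return total;
}
```

For example, call run_single_then_for(100000, &tid, &late) from main and print tid and late. With one thread (or without OpenMP) late equals n; with several threads it is usually smaller than an even share, because that thread enters the loop last.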

Another (potentially better) way to express this is to use OpenMP tasks:

#pragma omp parallel master
{
    #pragma omp task // job(2)
    { // 'printf' is not a real job. It is just used for simplicity.
        printf("i'm single: %d\n", omp_get_thread_num());
    }
    #pragma omp taskloop // job(1)
    for (int i = 0 ; i < 10000000; ++i) {
        // 'printf' is not a real job. It is just used for simplicity.
        printf("%d\n", omp_get_thread_num());
    }
}

If your compiler does not understand OpenMP 5.0, you have to split the combined parallel master construct into separate parallel and master directives:

#pragma omp parallel
#pragma omp master
{
    #pragma omp task // job(2)
    { // 'printf' is not a real job. It is just used for simplicity.
        printf("i'm single: %d\n", omp_get_thread_num());
    }
    #pragma omp taskloop // job(1)
    for (int i = 0 ; i < 10000000; ++i) {
        // 'printf' is not a real job. It is just used for simplicity.
        printf("%d\n", omp_get_thread_num());
    }
}
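The task version can be sketched in a testable way too. This assumes a compiler with OpenMP 4.5 support for taskloop and uses the split parallel/master form; side_job and run_task_then_taskloop are hypothetical names, and an atomic counter stands in for job(1).

```c
#include <assert.h>

/* Hypothetical stand-in for job(2). */
static double side_job(void) {
    double acc = 0.0;
    for (int k = 1; k <= 2000000; ++k)
        acc += 1.0 / k;
    return acc;
}

/* job(2) becomes one explicit task and job(1) a taskloop; both are created
 * by the master thread and executed by whichever team threads are free. */
static long run_task_then_taskloop(int n) {
    long done = 0;            /* loop iterations executed */
    double side_result = 0.0; /* result of job(2) */

    #pragma omp parallel
    #pragma omp master
    {
        #pragma omp task shared(side_result)
        side_result = side_job(); /* job(2) */

        /* Iterations are packed into tasks; the thread that finishes
         * job(2) picks up remaining loop tasks like any other thread. */
        #pragma omp taskloop grainsize(32) shared(done)
        for (int i = 0; i < n; ++i) {
            #pragma omp atomic
            done++; /* placeholder for job(1) */
        }
    } /* end of parallel: implicit barrier, so the job(2) task is done too */

    assert(side_result > 0.0);
    assert(done == n);
    return done;
}
```

Here grainsize(32) plays roughly the role of the chunk size in schedule(dynamic, 32): it controls how many iterations go into each task, and idle threads take tasks as they become free.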

The problem comes from synchronization. At the end of the sections construct there is an implicit barrier: OpenMP waits for all threads to finish and cannot release the thread that ran job 2 until its completion has been checked.

The solution is to suppress that synchronization with a nowait clause.
I did not manage to suppress the synchronization with sections and nested parallelism. I rarely use nested parallel regions, but I think that, although sections accept nowait, there is a problem when spawning a new nested parallel region inside a section: there is a mandatory barrier at the end of a parallel region that cannot be suppressed, and it probably prevents new threads from joining the pool.
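To make the role of that synchronization concrete, here is a small sketch (hypothetical function names, C with -fopenmp assumed) that counts how many loop iterations start while job(2) is still running. It uses single rather than sections, but the same rule applies: single, sections and for all end with an implicit barrier unless nowait is given. With the barrier the count is always 0; with nowait the other threads are free to start the loop early, so the count can be positive.

```c
#include <assert.h>

/* Hypothetical stand-in for job(2). */
static void side_job(void) {
    volatile double acc = 0.0;
    for (int k = 1; k <= 2000000; ++k)
        acc += 1.0 / k;
}

/* Counts loop iterations that start while job(2) is still running.
 * use_nowait == 0: plain single; its implicit barrier holds every thread
 *                  until job(2) is done, so the count is always 0.
 * use_nowait != 0: single nowait; idle threads may start the loop early. */
static long iterations_before_job2_done(int n, int use_nowait) {
    int job2_done = 0;
    long early = 0;

    #pragma omp parallel
    {
        /* use_nowait is the same for every thread, so the whole team
         * encounters the same worksharing construct. */
        if (use_nowait) {
            #pragma omp single nowait
            {
                side_job();
                #pragma omp atomic write
                job2_done = 1;
            }
        } else {
            #pragma omp single
            {
                side_job();
                #pragma omp atomic write
                job2_done = 1;
            }
        }

        #pragma omp for schedule(dynamic, 32) reduction(+ : early)
        for (int i = 0; i < n; ++i) {
            int finished;
            #pragma omp atomic read
            finished = job2_done;
            if (!finished)
                early++;
        }
    }

    assert(use_nowait || early == 0); /* the barrier kept everyone waiting */
    return early;
}
```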

What I did instead is to use a single construct without synchronization. This way, OpenMP starts the single block and does not wait for its completion before starting the parallel for. When the thread finishes its single work, it joins the thread pool to finish processing the for loop.

#include <omp.h>
#include <stdio.h>

int main() {
  int singlethreadid = -1;
  // omp_set_nested(1);
#pragma omp parallel
  {
#pragma omp single nowait  // job(2)
    { // 'printf' is not a real job. It is just used for simplicity.
      printf("i'm single: %d\n", omp_get_thread_num());
#pragma omp atomic write
      singlethreadid = omp_get_thread_num();  // remember who ran job(2)
    }
#pragma omp for schedule(dynamic, 32)
    for (int i = 0 ; i < 100000; ++i) {
      int id;
#pragma omp atomic read
      id = singlethreadid;
      // 'printf' is not a real job. It is just used for simplicity.
      printf("%d\n", omp_get_thread_num());
      if (omp_get_thread_num() == id)
        printf("Hello, I'm back\n");  // the single thread has joined the loop
    }
  }
  return 0;
}