Difference between kernels construct and parallel construct
The kernels directive is the more general case, and probably the one you would think of first if you have written GPU (e.g. CUDA) kernels before. kernels simply directs the compiler to work on a piece of code and produce an arbitrary number of "kernels", of arbitrary "dimensions", to be executed in sequence, in order to parallelize and offload that section of code to the accelerator. The parallel construct allows finer-grained control over how the compiler structures work on the accelerator, for example by specifying particular dimensions of parallelization. For instance, the number of workers and gangs would normally be constant for a parallel directive (since only one underlying "kernel" is usually implied), but perhaps not for a kernels directive (since it may translate to multiple underlying "kernels").
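As a rough illustration of the difference, here is a minimal sketch in C. The function names and the launch sizes (64 gangs, vector length 128) are made up for the example and not tuned for any device. Whether the kernels region really becomes two device kernels is up to the compiler; the comments describe the usual expectation, not a guarantee.

```c
#include <assert.h>

/* kernels: the compiler analyses the whole region. It may emit one device
   kernel per loop, run them in sequence, and pick a launch shape for each. */
void scale_then_shift_kernels(int n, float a, float b,
                              const float *restrict x, float *restrict y)
{
    assert(n >= 0);
    #pragma acc kernels copyin(x[0:n]) copyout(y[0:n])
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i];        /* candidate for a first kernel */
        for (int i = 0; i < n; ++i)
            y[i] = y[i] + b;        /* candidate for a second kernel */
    }
}

/* parallel: one compute region with one launch configuration, stated by the
   programmer. The two steps are fused into a single loop here.
   The sizes 64 and 128 are arbitrary illustrative values. */
void scale_then_shift_parallel(int n, float a, float b,
                               const float *restrict x, float *restrict y)
{
    assert(n >= 0);
    #pragma acc parallel loop num_gangs(64) vector_length(128) \
        copyin(x[0:n]) copyout(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + b;
}
```

Both functions compute the same result. A compiler without OpenACC support ignores the pragmas and runs the loops serially on the host, which is handy for checking correctness before offloading.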
A good treatment of this specific question is contained in this PGI article.
Quoting from the article summary: "The OpenACC kernels and parallel constructs each try to solve the same problem, identifying loop parallelism and mapping it to the machine parallelism. The kernels construct is more implicit, giving the compiler more freedom to find and map parallelism according to the requirements of the target accelerator. The parallel construct is more explicit, and requires more analysis by the programmer to determine when it is legal and appropriate."
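To make the "legal and appropriate" point concrete, here is a small sketch with hypothetical function names. The first loop has independent iterations, so asserting parallelism with parallel loop is safe. The second has a loop-carried dependence, so it is left under kernels, where the compiler is expected to do its own dependence analysis and fall back to sequential execution. Putting parallel loop on it would tell the compiler the iterations are independent when they are not.

```c
#include <assert.h>

/* Independent iterations: the programmer can safely assert parallelism. */
void square_all(int n, const float *restrict in, float *restrict out)
{
    assert(n >= 0);
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * in[i];
}

/* Inclusive prefix sum: iteration i reads out[i - 1], which is a
   loop-carried dependence. Under kernels the compiler should detect this
   and not parallelize the loop. Marking it "parallel loop" instead would
   declare the iterations independent and could give wrong results. */
void prefix_sum(int n, const float *restrict in, float *restrict out)
{
    assert(n >= 0);
    if (n == 0)
        return;
    #pragma acc kernels copyin(in[0:n]) copyout(out[0:n])
    {
        out[0] = in[0];
        for (int i = 1; i < n; ++i)
            out[i] = out[i - 1] + in[i];
    }
}
```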
OpenACC directives and GPU kernels are just two ways of representing the same thing -- a section of code that can run in parallel.
OpenACC may be best when retrofitting an existing app to take advantage of a GPU and/or when it is desirable to let the compiler handle more details related to issues such as memory management. This can make it faster to write an app, with a potential cost in performance.
Hand-written kernels may be best when writing a GPU app from scratch and/or when more fine-grained control is desired. This can make the app take longer to write, but may increase performance.
I think that people new to GPUs may be tempted to go with OpenACC because it looks more familiar. But I think it's actually better to go the other way: start by writing kernels, and then potentially move to OpenACC to save time in some projects. The reason is that OpenACC is a leaky abstraction. While OpenACC may make it look as if the GPU details are abstracted out, they are still there. So using OpenACC to write GPU code without understanding what is happening in the background is likely to be frustrating, with odd error messages at compile time, and is likely to result in an app with low performance.
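One place where the abstraction tends to leak is data movement. The sketch below uses hypothetical function names; the actual transfer behavior depends on the compiler and device. The first version puts data clauses on the compute construct inside the time loop, so the arrays would normally be transferred on every step. The second wraps the loop in a data region so they are moved once.

```c
#include <assert.h>

/* Data clauses on the compute construct: x is copied in, and y is copied
   in and out, on every pass through the outer loop. */
void repeated_axpy_naive(int steps, int n, float a,
                         const float *restrict x, float *restrict y)
{
    assert(steps >= 0 && n >= 0);
    for (int s = 0; s < steps; ++s) {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }
}

/* Enclosing data region: x and y stay resident on the device for all
   steps; "present" states that they are already there. */
void repeated_axpy_data_region(int steps, int n, float a,
                               const float *restrict x, float *restrict y)
{
    assert(steps >= 0 && n >= 0);
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        for (int s = 0; s < steps; ++s) {
            #pragma acc parallel loop present(x[0:n], y[0:n])
            for (int i = 0; i < n; ++i)
                y[i] += a * x[i];
        }
    }
}
```

The two functions give identical results. The difference only shows up in a profiler, which is the sort of detail that is hard to spot without knowing how the GPU works underneath.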