Multiple CUDA contexts for one device - any sense?
Question 1: In this scenario am I good with using a single context and an explicit stream created on this context for each thread that performs the mentioned pipeline?
You should be fine with a single context.
Question 2: Can someone enlighten me on what is the sole purpose of the CUDA device context? I assume it makes sense in a multiple GPU scenario, but are there any cases where I would want to create multiple contexts for one GPU?
The CUDA device context is discussed in the programming guide. It represents all of the state (memory map, allocations, kernel definitions, and other state-related information) associated with a particular process (i.e. associated with that particular process' use of a GPU). Separate processes will normally have separate contexts (as will separate devices), as these processes have independent GPU usage and independent memory maps.
If you have multi-process usage of a GPU, you will normally create multiple contexts on that GPU. As you've discovered, it's possible to create multiple contexts from a single process, but not usually necessary.
And yes, when you have multiple contexts, kernels launched in those contexts will require context switching to go from one kernel in one context to another kernel in another context. Those kernels cannot run concurrently.
CUDA runtime API usage manages contexts for you. You normally don't explicitly interact with a CUDA context when using the runtime API. However, in driver API usage, the context is explicitly created and managed.
Obviously a few years have passed, but NVENC/NVDEC now appear to have CUstream support as of version 9.1 (circa September 2019) of the video codec SDK: https://developer.nvidia.com/nvidia-video-codec-sdk/download
NEW to 9.1- Encode: CUStream support in NVENC for enhanced parallelism between CUDA pre-processing and NVENC encoding
I'm super new to CUDA, but my basic understanding is that CUcontexts allow multiple processes to use the GPU (by doing context swaps that interrupt each other's work), while CUstreams allow for a coordinated sharing of the GPU's resources from within a single process.