1D Min-convolution in CUDA
An alternative which might be useful for large `a` and `b` would be to use a block per output entry in `c`. Using a block allows for memory coalescing, which will be important in what is a memory-bandwidth-limited operation, and a fairly efficient shared-memory reduction can be used to combine per-thread partial results into a final per-block result. Probably the best strategy is to launch as many blocks per multiprocessor as will run concurrently and have each block emit multiple output points; this eliminates some of the scheduling overhead associated with launching and retiring many blocks with relatively low total instruction counts.
An example of how this might be done:
#include <math.h>

template<int bsz>
__global__ __launch_bounds__(512)
void minconv(const float *a, int sizea, const float *b, int sizeb, float *c)
{
    __shared__ volatile float buff[bsz];

    // One block per output point; each block strides through the output
    // so a modest grid can cover all of c.
    for(int i = blockIdx.x; i < (sizea + sizeb); i += gridDim.x) {

        // Each thread computes a partial minimum over a strided slice of a.
        float cval = INFINITY;
        for(int j = threadIdx.x; j < sizea; j += blockDim.x) {
            int t = i - j;
            if ((t >= 0) && (t < sizeb))
                cval = min(cval, a[j] + b[t]);
        }
        buff[threadIdx.x] = cval;
        __syncthreads();

        // Shared-memory tree reduction of the per-thread partial minima.
        if (bsz > 256) {
            if (threadIdx.x < 256)
                buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x + 256]);
            __syncthreads();
        }
        if (bsz > 128) {
            if (threadIdx.x < 128)
                buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x + 128]);
            __syncthreads();
        }
        if (bsz > 64) {
            if (threadIdx.x < 64)
                buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x + 64]);
            __syncthreads();
        }
        // Final warp-synchronous stage (buff is declared volatile for this).
        if (threadIdx.x < 32) {
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x + 32]);
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x + 16]);
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x + 8]);
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x + 4]);
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x + 2]);
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x + 1]);
            if (threadIdx.x == 0) c[i] = buff[0];
        }
        // Make sure the reduction buffer is free before the next iteration reuses it.
        __syncthreads();
    }
}
// Instances for all valid block sizes.
template __global__ void minconv<64>(const float *, int, const float *, int, float *);
template __global__ void minconv<128>(const float *, int, const float *, int, float *);
template __global__ void minconv<256>(const float *, int, const float *, int, float *);
template __global__ void minconv<512>(const float *, int, const float *, int, float *);
[disclaimer: not tested or benchmarked, use at own risk]
This is single-precision floating point, but the same idea should work for double precision. For integers you would need to replace the C99 `INFINITY` macro with something like `INT_MAX` or `LONG_MAX`, but the principle remains the same otherwise.
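For completeness, a host-side launch of the templated kernel could look like the following sketch (untested; `run_minconv` is a made-up name, and the grid size of 112 blocks is only an assumed value for a Fermi-class part, to be tuned for the target device):

// Hypothetical host-side driver for the templated kernel above.
// da, db and dc are device pointers; dc must hold sizea + sizeb entries.
// The block and grid sizes below are illustrative, not tuned values.
void run_minconv(const float *da, int sizea, const float *db, int sizeb, float *dc)
{
    const int threads = 256;   // must be one of the instantiated block sizes
    const int blocks  = 112;   // e.g. a few resident blocks per multiprocessor
    minconv<threads><<<blocks, threads>>>(da, sizea, db, sizeb, dc);
}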
A faster version:
__global__ void convAgB(double *a, double *b, double *c, int sa, int sb)
{
    int i = (threadIdx.x + blockIdx.x * blockDim.x);   // one thread per output entry
    int idT = threadIdx.x;
    int out, j;

    __shared__ double c_local[512];
    c_local[idT] = c[i];                // c is pre-initialised on the host with a large value

    out = (i > sa) ? sa : i + 1;        // upper bound of j for this output
    j   = (i > sb) ? i - sb + 1 : 1;    // first j for which i - j is a valid index into b

    for(; j < out; j++)
    {
        if(c_local[idT] > a[j] + b[i-j])
            c_local[idT] = a[j] + b[i-j];
    }

    c[i] = c_local[idT];
}
**Benchmark:**

| Size A | Size B | Size C | Time (s) |
|--------|--------|--------|----------|
| 1000   | 1000   | 2000   | 0.0008   |
| 10k    | 10k    | 20k    | 0.0051   |
| 100k   | 100k   | 200k   | 0.3436   |
| 1M     | 1M     | 1M     | 43.327   |
Old version: for sizes between 1000 and 100000, I tested with this naive version:
__global__ void convAgB(double *a, double *b, double *c, int sa, int sb)
{
    int size = sa + sb;
    int idT = (threadIdx.x + blockIdx.x * blockDim.x);
    int out, j;

    // Grid-stride loop so fewer threads than outputs still cover all of c.
    for(int i = idT; i < size; i += blockDim.x * gridDim.x)
    {
        if(i > sa) out = sa;
        else       out = i + 1;

        if(i > sb) j = i - sb + 1;
        else       j = 1;

        for(; j < out; j++)
        {
            if(c[i] > a[j] + b[i-j])
                c[i] = a[j] + b[i-j];
        }
    }
}
I populated the arrays `a` and `b` with some random double numbers and `c` with 999999 (just for testing). I validated the `c` array (on the CPU) using your function (without any modifications). I also removed the conditionals from inside the inner loop, so they are only tested once.
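The CPU function from the original question is not reproduced here; a reference that mirrors the same loop bounds as `convAgB`, used purely for validation, might look like this sketch:

// Hypothetical CPU reference used only to validate the GPU result.
// It mirrors the loop bounds of convAgB; c must hold sa + sb entries
// and be pre-filled with a large value (e.g. 999999).
void minconv_cpu(const double *a, const double *b, double *c, int sa, int sb)
{
    for (int i = 0; i < sa + sb; i++) {
        int out = (i > sa) ? sa : i + 1;
        int j   = (i > sb) ? i - sb + 1 : 1;
        for (; j < out; j++)
            if (c[i] > a[j] + b[i - j])
                c[i] = a[j] + b[i - j];
    }
}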
I am not 100% sure, but I think the following modification makes sense. Since you had `i - j >= 0`, which is the same as `i >= j`, as soon as `j > i` the loop will never enter this block 'X' again (since `j` only increases):
if(c[i] > a[j] + b[i-j])
    c[i] = a[j] + b[i-j];
So I moved that loop condition into the variable `out`: if `i > sa`, the loop finishes when `j == sa`; if `i < sa`, the loop finishes earlier, at `i + 1`, because of the condition `i >= j`. For example, with `sa = sb = 10` and `i = 4`, only `j = 1 .. 4` can contribute, so `out = i + 1 = 5`.
The other condition, `i - j < size(b)`, means that block 'X' only executes once `j > i - size(b)`. Since `j` always starts at 1, when `i > size(b)` the first useful value of `j` is `i - size(b) + 1`, so we can start `j` directly at that value:
if(i > sb) j = i - sb + 1;
else j = 1;
See if you can test this version with real arrays of data, and give me feedback. Also, any improvements are welcome.
EDIT: A further optimization can be implemented, but this one does not make much of a difference.
if(c[i] > a[j] + b[i-j])
    c[i] = a[j] + b[i-j];
We can eliminate the `if` like this:
double add;
...
for(; j < out; j++)
{
    add = a[j] + b[i-j];
    c[i] = (c[i] < add) * c[i] + (add <= c[i]) * add;
}
Having:
if(a > b) c = b;
else c = a;
is the same as having `c = (a < b) * a + (b <= a) * b`: if `a > b` then `c = 0 * a + 1 * b`, i.e. `c = b`; if `a <= b` then `c = 1 * a + 0 * b`, i.e. `c = a`.
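As an aside (my suggestion, not something benchmarked here), the CUDA math function `fmin` does the same job and typically compiles to a branch-free min instruction, so the multiply trick may not be needed:

// Equivalent inner loop using fmin() instead of the multiply trick.
for(; j < out; j++)
    c[i] = fmin(c[i], a[j] + b[i-j]);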
**Benchmark:**

| Size A | Size B | Size C | Time (s) |
|--------|--------|--------|----------|
| 1000   | 1000   | 2000   | 0.0013   |
| 10k    | 10k    | 20k    | 0.0051   |
| 100k   | 100k   | 200k   | 0.4436   |
| 1M     | 1M     | 1M     | 47.327   |
I am measuring the time of copying from CPU to GPU, running the kernel, and copying back from GPU to CPU.
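A sketch of the kind of timing harness that would produce numbers like these, using CUDA events (this is my own illustrative code, not the harness actually used; names such as `time_convAgB`, `h_a` and `d_a` are made up, and the launch configuration is an assumption, since `convAgB` has no bounds check on `i`):

#include <cuda_runtime.h>

// Illustrative timing harness: the measured interval covers the H2D copies,
// the kernel and the D2H copy, matching the description above. sa + sb is
// assumed to be an exact multiple of the block size here.
float time_convAgB(const double *h_a, const double *h_b, double *h_c, int sa, int sb)
{
    double *d_a, *d_b, *d_c;
    int sc = sa + sb;
    cudaMalloc(&d_a, sa * sizeof(double));
    cudaMalloc(&d_b, sb * sizeof(double));
    cudaMalloc(&d_c, sc * sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_a, h_a, sa * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, sb * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_c, h_c, sc * sizeof(double), cudaMemcpyHostToDevice);   // h_c pre-filled with 999999

    convAgB<<<sc / 512, 512>>>(d_a, d_b, d_c, sa, sb);

    cudaMemcpy(h_c, d_c, sc * sizeof(double), cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return ms;
}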
GPU specifications:

- Device: Tesla C2050
- CUDA capability (major/minor): 2.0
- Global memory: 2687 MB
- CUDA cores: 448
- Warp size: 32
I have used your algorithm. I think this will help you.
const int Length = 1000;

__global__ void OneD(float *Ad, float *Bd, float *Cd){
    int i = blockIdx.x;    // block index selects the output entry
    int j = threadIdx.x;   // thread index selects the candidate Ad[j]
    Cd[i] = 99999.99;
    for(int k = 0; k < Length/500; k++){
        // The while acts as a conditional update: it exits as soon as
        // Cd is no longer larger than the candidate sum.
        while(((i-j) >= 0) && (i-j < Length) && Cd[i+k*Length] > Ad[j+k*Length] + Bd[i-j]){
            Cd[i+k*Length] = Ad[j+k*Length] + Bd[i-j];
        }
    }
}
I have taken 500 threads per block and 500 blocks per grid. As the number of threads per block on my device is restricted to 512, I used 500 threads. I have taken the size of all the arrays as `Length` (= 1000).
Working:
`i` stores the block index and `j` stores the thread index. The `for` loop is used because the number of threads is less than the size of the arrays. The `while` loop is used for iterating over `Cd[n]`. I have not used shared memory because I have taken lots of blocks and threads, so the amount of shared memory required for each block would be low.
PS: If your device supports more threads and blocks, replace `k < Length/500` with `k < Length/(supported number of threads)`.