osx - OpenCL: how to optimise a reduction kernel (summation of columns), currently the CPU is faster
I have started using OpenCL for the first time, and I'm trying to optimise a reduction kernel. The kernel takes a square grid of floating point numbers (the data represents the luminance values of a greyscale image) of size width-by-height pixels. The kernel sums along every column and returns the total for each column in an output array.
/*
  input  -- "2D" array of floats, width * height elements
  output -- 1D array containing the summation of the column values, width elements
  width  -- number of elements horizontally
  height -- number of elements vertically
  Both width and height must be a multiple of 64.
*/
kernel void sum_columns(global float* input,
                        global float* output,
                        int width,
                        int height)
{
    size_t j = get_global_id(0);
    float sum = 0.0f;
    int i;
    for (i = 0; i < height; i++) {
        sum += input[i + width * j];
    }
    output[j] = sum;
}
OpenCL should be able to perform every column summation concurrently, because I set the global dimensions to the number of columns in the data. I have used Instruments.app on macOS to time how long 1000 iterations of the kernel take when executing on the CPU and on the GPU. This can be done by specifying the device as either CL_DEVICE_TYPE_CPU or CL_DEVICE_TYPE_GPU.
The performance is not good! In fact the CPU is consistently faster than the GPU, which strikes me as odd. Is there something wrong with my kernel? How can the CPU be faster when at best it can execute 8 threads concurrently?
The code for the project is here (an Xcode project): https://github.com/danieljfarrell/xcode-4-opencl-example.
Here are the timing results as I increase the size of the data:
Update
The timings for the CPU and GPU were inferred by looking at the time taken to run the kernel using Instruments.app.
One easy improvement to try is to make the input constant memory instead of global. You will need to set CL_MEM_READ_ONLY when you create the buffer. The profiler I use seems to like the kernel better when I change the parameter to __constant.
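A minimal sketch of the two changes, assuming host-side names like `context`, `host_data`, and `err` that are not in the original project:

```c
/* Host side: create the input buffer read-only (hypothetical variable names). */
cl_mem input_buf = clCreateBuffer(context,
                                  CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  width * height * sizeof(float),
                                  host_data, &err);

/* Kernel side: declare the parameter __constant instead of global. */
kernel void sum_columns(__constant float* input, global float* output,
                        int width, int height);
```

Note that __constant buffers are limited to CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE (often 64 KB), so this only works when the whole image fits within that limit.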
Another option is to transpose the input matrix, so that you're not trying to read columns of memory. I made a kernel that uses an entire work group to sum a row of data and yield a single entry in the output. The __constant parameter helps this kernel out a lot, making it run ALU-bound instead of global-fetch-bound in the 4 trials I ran.
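Transposing can be done on the host before enqueueing the buffer; a minimal sketch in plain C (the function name is illustrative, not from the original project):

```c
#include <stdlib.h>

/* Transpose a row-major width-by-height array so that what were columns
   become contiguous rows in the output. */
void transpose(const float *in, float *out, int width, int height)
{
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col++) {
            /* Element (row, col) of the input becomes (col, row) of the output. */
            out[col * height + row] = in[row * width + col];
        }
    }
}
```

The transpose itself costs one extra pass over the data, but it only needs to be done once per frame, while the coalesced reads pay off on every kernel launch.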
I didn't loop through the height parameter, so you can either add that yourself or create enough work groups to cover the output data (one per element).
kernel void sum_rows(__constant float* input, global float* output,
                     int width, int height)
{
    int gid = get_local_id(0);
    int gsize = get_local_size(0);

    local float sum[64];   // assumes a work group size of 64
    sum[gid] = 0.0f;

    int i;
    int rowstart = width * get_group_id(0);
    for (i = gid; i < width; i += gsize) {
        sum[gid] += input[rowstart + i];
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    if (gid == 0) {
        for (i = 1; i < 64; i++) {
            sum[0] += sum[i];
        }
        output[get_group_id(0)] = sum[0];
    }
}
Beyond that, I suggest looking at host-level optimizations. For a large enough data set, there should be no problem with the GPU outperforming the CPU in a reduction kernel.
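When comparing devices it also helps to validate the GPU output against a plain-C reference; a minimal sketch that mirrors the original kernel's `i + width*j` indexing (which is equivalent to a column sum here because the grid is square):

```c
/* CPU reference for the sum_columns kernel: one output per column. */
void sum_columns_ref(const float *input, float *output, int width, int height)
{
    for (int j = 0; j < width; j++) {
        float sum = 0.0f;
        for (int i = 0; i < height; i++) {
            sum += input[i + width * j];  /* same indexing as the kernel */
        }
        output[j] = sum;
    }
}
```

Comparing `output` from this function against the buffer read back with clEnqueueReadBuffer catches indexing mistakes before you start tuning for speed.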