osx - OpenCL: how to optimise a reduction kernel (summation of columns), currently the CPU is faster


I have started to use OpenCL for the first time, and I'm trying to optimise a reduction kernel. The kernel takes a square grid of floating-point numbers (the data represents the luminance values of a greyscale image) of size width-by-height pixels. The kernel sums along every column and returns the total for each column in an output array.

    /*
     * input  -- "2D" array of floats with width * height elements
     * output -- 1D array containing the sum of each column, with width elements
     * width  -- number of elements horizontally
     * height -- number of elements vertically
     *
     * Both width and height must be multiples of 64.
     */
    kernel void sum_columns(global float* input, global float* output, int width, int height)
    {
        size_t j = get_global_id(0);   // one work-item per column
        float sum = 0.0f;
        int i;
        for (i = 0; i < height; i++) {
            // Work-item j reads the contiguous run of floats starting at input[width*j].
            // That run is "column j" only if columns are stored contiguously and
            // width == height (the square case described above).
            sum += input[i + width*j];
        }
        output[j] = sum;
    }
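When timing a kernel like this, it helps to have a plain C reference on the host to confirm that both devices return the same totals. The following is a minimal sketch, not part of the original project: the name `sum_columns_reference` is hypothetical, and it deliberately mirrors the kernel's indexing, so it assumes the square grid described in the question.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side reference for the sum_columns kernel (hypothetical helper).
 * It uses the same indexing as the kernel, input[i + width*j], so the two
 * outputs can be compared entry by entry. Assumes a square grid. */
void sum_columns_reference(const float *input, float *output,
                           int width, int height)
{
    assert(input != NULL && output != NULL);
    assert(width > 0 && width == height);

    for (int j = 0; j < width; j++) {          /* plays the role of work-item j */
        float sum = 0.0f;
        for (int i = 0; i < height; i++) {
            sum += input[(size_t)i + (size_t)width * (size_t)j];
        }
        output[j] = sum;
    }
}
```

Comparing this against the device output with a small tolerance, rather than exact equality, is usually safer, since a device may add the floats in a different order.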

OpenCL should perform every column summation concurrently, because I set the global dimension to the number of columns in the data. I have used Instruments.app on macOS and timed how long 1000 iterations of the kernel take when executing on the CPU and on the GPU. This can be done by specifying the device type as either CL_DEVICE_TYPE_CPU or CL_DEVICE_TYPE_GPU.

The performance is not good! In fact, the CPU is consistently faster than the GPU, which strikes me as odd. Is there something wrong with the kernel? How can the CPU be faster when, at most, it can execute 8 threads concurrently?

The code for the project (an Xcode project) is here: https://github.com/danieljfarrell/xcode-4-opencl-example

Here are the timing results as I increase the size of the data:

[Figure: execution time as a function of image size.]

Update

The timings for the CPU and GPU are inferred from the time taken to run the kernel as reported by Instruments.app.

[Screenshot: the Instruments.app interface with the kernel's execution time highlighted.]

One easy improvement to try is to make the input constant memory instead of global. You will also need to set CL_MEM_READ_ONLY when you create the buffer. The profiler I use seems to favour the kernel when I change the parameter to __constant.
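One caveat when trying this: constant memory is limited. The OpenCL specification only guarantees 64 KB for CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE on full-profile devices, so a large image may not fit. The sketch below is a rough host-side check under that assumption; the helper names are hypothetical, and on a real system the limit would be queried from the device with clGetDeviceInfo rather than hard-coded.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE the OpenCL specification
 * allows for a full-profile device. Real devices may report more. */
#define MIN_SPEC_CONSTANT_BYTES ((size_t)64 * 1024)

/* Size in bytes of a width-by-height image of floats (hypothetical helper). */
size_t image_bytes(int width, int height)
{
    assert(width > 0 && height > 0);
    return (size_t)width * (size_t)height * sizeof(float);
}

/* Returns 1 if the image fits within max_constant_bytes, otherwise 0.
 * Pass the value reported by the device, or MIN_SPEC_CONSTANT_BYTES to be
 * conservative. */
int fits_in_constant(int width, int height, size_t max_constant_bytes)
{
    return image_bytes(width, height) <= max_constant_bytes;
}
```

With 4-byte floats, a 128-by-128 image is exactly 64 KB, so anything larger relies on the device offering more than the guaranteed minimum.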

Another option is to transpose the input matrix, so that you're not trying to read columns of memory. I made a kernel that uses an entire work group to sum a row of data and yield a single entry in the output. The __constant parameter helps this kernel out a lot, making it run ALU-bound instead of global-fetch-bound in [...] out of 4 trials I ran.

I didn't loop through the height parameter; you can either add that loop or create enough work groups to produce the output data (one work group per output element).

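    kernel void sum_rows(__constant float* input, global float* output, int width, int height)
    {
        int gid = get_local_id(0);      // index of this work-item within its work group
        int gsize = get_local_size(0);  // number of work-items in the work group
        local float sum[64];            // assumes a work group size of 64
        sum[gid] = 0.0f;
        int i;
        int rowstart = width * get_group_id(0);  // each work group handles one row

        // Each work-item accumulates a strided share of the row.
        for (i = gid; i < width; i += gsize) {
            sum[gid] += input[rowstart + i];
        }

        // Wait until every partial sum has been written to local memory.
        barrier(CLK_LOCAL_MEM_FENCE);

        if (gid == 0) {
            // Start at 1: sum[0] already holds work-item 0's partial sum.
            for (i = 1; i < 64; i++) {
                sum[0] += sum[i];
            }
            output[get_group_id(0)] = sum[0];
        }
    }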
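To make the two phases of this kernel easier to follow, here is a serial C emulation of what a single work group does. It is only a sketch for reasoning about the logic: the function name `sum_row_emulated` is hypothetical, the work-items run one after another instead of in parallel, and the group size of 64 is the same assumption the kernel makes.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 64   /* same work group size the kernel assumes */

/* Serial emulation of one work group of sum_rows (hypothetical helper).
 * `row` plays the role of get_group_id(0). */
float sum_row_emulated(const float *input, int width, int row)
{
    assert(input != NULL && width > 0 && row >= 0);

    float partial[GROUP_SIZE];
    size_t row_start = (size_t)width * (size_t)row;

    /* Phase 1: work-item lid accumulates elements lid, lid + 64, lid + 128, ... */
    for (int lid = 0; lid < GROUP_SIZE; lid++) {
        partial[lid] = 0.0f;
        for (int i = lid; i < width; i += GROUP_SIZE) {
            partial[lid] += input[row_start + (size_t)i];
        }
    }

    /* On the device a barrier separates the phases. In this serial version
     * phase 1 has simply finished before phase 2 starts. */

    /* Phase 2: work-item 0 folds in the other partial sums. The loop starts
     * at 1 so that partial[0] is counted exactly once. */
    float total = partial[0];
    for (int lid = 1; lid < GROUP_SIZE; lid++) {
        total += partial[lid];
    }
    return total;
}
```

The second phase is serial on the device as well, with one work-item adding 64 values. A tree-style reduction in local memory is a common alternative, but that is a different technique from the one shown here.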

Beyond that, I suggest looking at host-level optimizations. With a large enough data set, there should be no problem getting the GPU to outperform the CPU in a reduction kernel.
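One host-side step along these lines is the transposition suggested earlier, done once before the data is uploaded. The following is a minimal sketch assuming the image is stored row by row (row-major) on the host; the name `transpose_host` is hypothetical and the loop is the straightforward version rather than a tuned one.

```c
#include <assert.h>
#include <stddef.h>

/* Transposes a row-major image (hypothetical helper).
 * in  -- height rows of width floats
 * out -- width rows of height floats; must not overlap with in */
void transpose_host(const float *in, float *out, int width, int height)
{
    assert(in != NULL && out != NULL && in != out);
    assert(width > 0 && height > 0);

    for (int r = 0; r < height; r++) {
        for (int c = 0; c < width; c++) {
            /* Element (r, c) of the input becomes element (c, r) of the output. */
            out[(size_t)c * (size_t)height + (size_t)r] =
                in[(size_t)r * (size_t)width + (size_t)c];
        }
    }
}
```

After the transpose, each original column is a contiguous row, which is the layout a row-summing kernel expects. For large images a cache-blocked transpose is likely to be faster than this simple double loop.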

