C++ - CUDA reduction, approach for big arrays
I have the following "frankenstein" sum reduction code, taken partly from common CUDA reduction slides and partly from the CUDA samples:
template <unsigned int blockSize>   // blockSize must be a compile-time constant, as in the CUDA samples
__global__ void reduce6(float *g_idata, float *g_odata, unsigned int n)
{
    extern __shared__ float sdata[];

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;
    float mySum = 0;

    while (i < n) {
        sdata[tid] += g_idata[i];
        if (i + blockSize < n)              // guard the second read near the end of the array
            sdata[tid] += g_idata[i+blockSize];
        i += gridSize;
    }
    __syncthreads();

    // do reduction in shared mem (the fixed offsets below assume blockSize == 512)
    if (tid < 256) sdata[tid] += sdata[tid + 256];
    __syncthreads();
    if (tid < 128) sdata[tid] += sdata[tid + 128];
    __syncthreads();
    if (tid < 64)  sdata[tid] += sdata[tid + 64];
    __syncthreads();

#if (__CUDA_ARCH__ >= 300)
    if (tid < 32) {
        // fetch final intermediate sum from 2nd warp
        mySum = sdata[tid] + sdata[tid + 32];
        // reduce final warp using shuffle
        for (int offset = warpSize/2; offset > 0; offset /= 2)
            mySum += __shfl_down(mySum, offset);
    }
    if (tid == 0) sdata[0] = mySum;
#else
    // fully unroll reduction within a single warp
    // (volatile keeps the compiler from caching these warp-synchronous loads in registers)
    if (tid < 32) {
        volatile float *smem = sdata;
        smem[tid] += smem[tid + 32];
        smem[tid] += smem[tid + 16];
        smem[tid] += smem[tid + 8];
        smem[tid] += smem[tid + 4];
        smem[tid] += smem[tid + 2];
        smem[tid] += smem[tid + 1];
    }
#endif

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
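Since blockSize is a template parameter here (as in the CUDA samples), a launch has to name it explicitly. A minimal illustrative launch; d_in, d_out, and the grid size of 64 are placeholders, not a recommendation:

    const int threads = 512;                 // must match the template argument
    int blocks = 64;                         // placeholder; see the occupancy discussion below
    int smemSize = threads * sizeof(float);
    reduce6<512><<<blocks, threads, smemSize>>>(d_in, d_out, n);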
I am using this reduction over an unrolled array of big size (e.g. 512^3 = 134217728 = n) on a Tesla K40 GPU.
I have some questions regarding the blockSize variable and its value. From here on, I will try to explain my understanding (either right or wrong) of how it works:
The bigger I choose blockSize, the faster this code will execute, as it will spend less time in the whole loop; but it will not finish reducing the whole array, it will return a smaller array of size dimBlock.x instead, right? Whereas if I used blockSize=1, this code would return the reduced value in one call, but it would be very slow because it would not be exploiting the power of CUDA at all. Therefore I need to call the reduction kernel several times, each time with a smaller blockSize, reducing the result of the previous call to reduce, until I get to the smallest point.
Something like this (pseudocode):

blocks = number;   // where do I start? And why?
while (not min) {
    dim3 dimBlock(blocks);
    dim3 dimGrid(n / dimBlock.x);
    int smemSize = dimBlock.x * sizeof(float);
    reduce6<<<dimGrid, dimBlock, smemSize>>>(in, out, n);
    in = out;
    n = dimGrid.x;
    dimGrid.x = n / dimBlock.x;   // is this right? should I change dimBlock too?
}
At what value should blocks start? I guess this is GPU-dependent. Which values should it be for a Tesla K40 (just so that I understand how these values are chosen)? Is my logic somehow flawed? How?
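For reference, one way to make that loop concrete is sketched below. It is only a sketch, assuming the template version of reduce6 above with a fixed 512-thread block; the block size stays constant while the grid shrinks each pass, and the code ping-pongs between two device buffers (so the original input is overwritten). Error checking is omitted; reduce_to_scalar and d_scratch are hypothetical names.

#include <utility>   // std::swap

// Hypothetical host-side driver: repeatedly reduces d_in (length n) until
// one value remains. d_in and d_scratch are device buffers of equal size.
float reduce_to_scalar(float *d_in, float *d_scratch, unsigned int n)
{
    const int threads = 512;                            // must match the template argument
    while (n > 1) {
        int blocks = (n + threads*2 - 1) / (threads*2); // each block covers 2*threads inputs
        int smemSize = threads * sizeof(float);
        reduce6<512><<<blocks, threads, smemSize>>>(d_in, d_scratch, n);
        n = blocks;                                     // one partial sum per block remains
        std::swap(d_in, d_scratch);                     // this pass's output feeds the next pass
    }
    float result;
    cudaMemcpy(&result, d_in, sizeof(float), cudaMemcpyDeviceToHost);
    return result;
}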
There is a CUDA tool for getting good grid and block sizes for you: the CUDA Occupancy API.
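For example, cudaOccupancyMaxActiveBlocksPerMultiprocessor (CUDA 6.5+) reports how many blocks of a given size can be resident per SM, which you can use to size a grid that just fills the device. A sketch, carrying over the 512-thread block size from the question:

int device;
cudaDeviceProp prop;
cudaGetDevice(&device);
cudaGetDeviceProperties(&prop, device);

const int threads = 512;
int smemSize = threads * sizeof(float);
int blocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, reduce6<512>,
                                              threads, smemSize);
// Enough blocks to reach max occupancy on every SM; the grid-stride loop
// in reduce6 then covers any n with this fixed grid.
int blocks = blocksPerSM * prop.multiProcessorCount;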
In response to "the bigger I choose blockSize, the faster this code will execute" -- not necessarily. You want sizes that give maximum occupancy (the ratio of the number of active warps to the total number of possible active warps).
See this answer for additional information: How do I choose grid and block dimensions for CUDA kernels?
Lastly, for NVIDIA GPUs supporting Kepler or later, there are shuffle intrinsics that make reductions easier and faster. Here is an article on how to use the shuffle intrinsics: Faster Parallel Reductions on Kepler.
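The core of that article is a warp-level sum in which each lane repeatedly pulls a value from a lane a fixed offset above it, halving the offset each step. Roughly (on CUDA 9 and later the _sync variants of the intrinsics should be used instead):

__inline__ __device__ float warpReduceSum(float val)
{
    // shift-down warp sum: after log2(warpSize) steps, lane 0 holds the total
    for (int offset = warpSize/2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);   // CUDA 9+: __shfl_down_sync(0xffffffff, val, offset)
    return val;
}

__inline__ __device__ float blockReduceSum(float val)
{
    static __shared__ float shared[32];     // one partial per warp (max 32 warps/block)
    int lane = threadIdx.x % warpSize;
    int wid  = threadIdx.x / warpSize;

    val = warpReduceSum(val);               // each warp reduces its own values
    if (lane == 0) shared[wid] = val;       // warp leaders publish their partials
    __syncthreads();

    // first warp reduces the per-warp partials
    val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0;
    if (wid == 0) val = warpReduceSum(val);
    return val;                             // thread 0 holds the block total
}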
Update on choosing the number of threads:
You might not want to use the maximum number of threads if it results in a less efficient use of registers. From this link on occupancy:
For purposes of calculating occupancy, the number of registers used by each thread is one of the key factors. For example, devices with compute capability 1.1 have 8,192 32-bit registers per multiprocessor and can have a maximum of 768 simultaneous threads resident (24 warps x 32 threads per warp). This means that for a multiprocessor on one of these devices to have 100% occupancy, each thread can use at most 10 registers. However, this approach of determining how register count affects occupancy does not take into account the register allocation granularity. For example, on a device of compute capability 1.1, a kernel with 128-thread blocks using 12 registers per thread results in an occupancy of 83% with 5 active 128-thread blocks per multiprocessor, whereas a kernel with 256-thread blocks using the same 12 registers per thread results in an occupancy of 66% because only two 256-thread blocks can reside on a multiprocessor.
So the way I understand it is that an increased number of threads has the potential to limit performance because of the way registers are allocated. However, this is not always the case, and you need to do the calculation (as in the statement above) to determine the optimal number of threads per block.
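To make the quoted figures concrete, here is the arithmetic behind them as a small check. This is a back-of-the-envelope model that only rounds down to whole blocks, ignoring the finer register allocation granularity the quote warns about:

#include <cstdio>

int main()
{
    // compute capability 1.1: 8192 registers/SM, 768 resident threads/SM max
    const int regsPerSM = 8192, maxThreadsPerSM = 768, regsPerThread = 12;
    for (int blockThreads = 128; blockThreads <= 256; blockThreads *= 2) {
        int regsPerBlock    = blockThreads * regsPerThread;    // 1536, then 3072
        int blocksPerSM     = regsPerSM / regsPerBlock;        // 5, then 2
        int residentThreads = blocksPerSM * blockThreads;      // 640, then 512
        int occupancyPct    = 100 * residentThreads / maxThreadsPerSM; // 83, then 66
        std::printf("%d-thread blocks: %d blocks/SM -> %d%% occupancy\n",
                    blockThreads, blocksPerSM, occupancyPct);
    }
    return 0;
}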