C++ - CUDA reduction, approach for big arrays
I have the following "frankenstein" sum reduction code, taken partly from common CUDA reduction slides and partly from the CUDA samples:
template <unsigned int blockSize>   // blockSize must be a compile-time constant, as in the CUDA samples
__global__ void reduce6(float *g_idata, float *g_odata, unsigned int n)
{
    extern __shared__ float sdata[];

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;
    float mySum = 0;

    while (i < n) {
        sdata[tid] += g_idata[i];
        if (i + blockSize < n)              // guard the second read near the end of the array
            sdata[tid] += g_idata[i+blockSize];
        i += gridSize;
    }
    __syncthreads();

    // do reduction in shared mem (the fixed offsets below assume blockSize == 512)
    if (tid < 256) sdata[tid] += sdata[tid + 256];
    __syncthreads();
    if (tid < 128) sdata[tid] += sdata[tid + 128];
    __syncthreads();
    if (tid < 64)  sdata[tid] += sdata[tid + 64];
    __syncthreads();

#if (__CUDA_ARCH__ >= 300)
    if (tid < 32) {
        // fetch final intermediate sum from 2nd warp
        mySum = sdata[tid] + sdata[tid + 32];
        // reduce final warp using shuffle
        for (int offset = warpSize/2; offset > 0; offset /= 2)
            mySum += __shfl_down(mySum, offset);
    }
    if (tid == 0) sdata[0] = mySum;
#else
    // fully unroll reduction within a single warp
    // (volatile keeps the compiler from caching these warp-synchronous loads in registers)
    if (tid < 32) {
        volatile float *smem = sdata;
        smem[tid] += smem[tid + 32];
        smem[tid] += smem[tid + 16];
        smem[tid] += smem[tid + 8];
        smem[tid] += smem[tid + 4];
        smem[tid] += smem[tid + 2];
        smem[tid] += smem[tid + 1];
    }
#endif

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
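Since blockSize is a template parameter here (as in the CUDA samples), a launch has to name it explicitly. A minimal illustrative launch; d_in, d_out, and the grid size of 64 are placeholders, not a recommendation:

    const int threads = 512;                 // must match the template argument
    int blocks = 64;                         // placeholder; see the occupancy discussion below
    int smemSize = threads * sizeof(float);
    reduce6<512><<<blocks, threads, smemSize>>>(d_in, d_out, n);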
I am using this reduction over an unrolled array of big size (e.g. 512^3 = 134217728 = n) on a Tesla K40 GPU.
I have some questions regarding the blockSize variable and its value. From here on, I will try to explain my understanding (either right or wrong) of how it works:
The bigger I choose blockSize, the faster this code will execute, as it will spend less time in the whole loop; but it will not finish reducing the whole array, it will return a smaller array of size dimBlock.x instead, right? Whereas if I used blockSize=1, this code would return the reduced value in one call, but it would be very slow because it would not be exploiting the power of CUDA at all. Therefore I need to call the reduction kernel several times, each time with a smaller blockSize, reducing the result of the previous call to reduce, until I get to the smallest point.
Something like this (pseudocode):

blocks = number;   // where do I start? And why?
while (not min) {
    dim3 dimBlock(blocks);
    dim3 dimGrid(n / dimBlock.x);
    int smemSize = dimBlock.x * sizeof(float);
    reduce6<<<dimGrid, dimBlock, smemSize>>>(in, out, n);
    in = out;
    n = dimGrid.x;
    dimGrid.x = n / dimBlock.x;   // is this right? should I change dimBlock too?
}
At what value should blocks start? I guess this is GPU-dependent. Which values should it be for a Tesla K40 (just so that I understand how these values are chosen)? Is my logic somehow flawed? How?
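For reference, one way to make that loop concrete is sketched below. It is only a sketch, assuming the template version of reduce6 above with a fixed 512-thread block; the block size stays constant while the grid shrinks each pass, and the code ping-pongs between two device buffers (so the original input is overwritten). Error checking is omitted; reduce_to_scalar and d_scratch are hypothetical names.

#include <utility>   // std::swap

// Hypothetical host-side driver: repeatedly reduces d_in (length n) until
// one value remains. d_in and d_scratch are device buffers of equal size.
float reduce_to_scalar(float *d_in, float *d_scratch, unsigned int n)
{
    const int threads = 512;                            // must match the template argument
    while (n > 1) {
        int blocks = (n + threads*2 - 1) / (threads*2); // each block covers 2*threads inputs
        int smemSize = threads * sizeof(float);
        reduce6<512><<<blocks, threads, smemSize>>>(d_in, d_scratch, n);
        n = blocks;                                     // one partial sum per block remains
        std::swap(d_in, d_scratch);                     // this pass's output feeds the next pass
    }
    float result;
    cudaMemcpy(&result, d_in, sizeof(float), cudaMemcpyDeviceToHost);
    return result;
}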
There is a CUDA tool for getting good grid and block sizes for you: the CUDA Occupancy API.
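For example, cudaOccupancyMaxActiveBlocksPerMultiprocessor (CUDA 6.5+) reports how many blocks of a given size can be resident per SM, which you can use to size a grid that just fills the device. A sketch, carrying over the 512-thread block size from the question:

int device;
cudaDeviceProp prop;
cudaGetDevice(&device);
cudaGetDeviceProperties(&prop, device);

const int threads = 512;
int smemSize = threads * sizeof(float);
int blocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, reduce6<512>,
                                              threads, smemSize);
// Enough blocks to reach max occupancy on every SM; the grid-stride loop
// in reduce6 then covers any n with this fixed grid.
int blocks = blocksPerSM * prop.multiProcessorCount;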
In response to "the bigger I choose blockSize, the faster this code will execute" -- not necessarily. You want sizes that give maximum occupancy (the ratio of the number of active warps to the total number of possible active warps).
See this answer for additional information: How do I choose grid and block dimensions for CUDA kernels?
Lastly, for NVIDIA GPUs supporting Kepler or later, there are shuffle intrinsics that make reductions easier and faster. Here is an article on how to use the shuffle intrinsics: Faster Parallel Reductions on Kepler.
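The core of that article is a warp-level sum in which each lane repeatedly pulls a value from a lane a fixed offset above it, halving the offset each step. Roughly (on CUDA 9 and later the _sync variants of the intrinsics should be used instead):

__inline__ __device__ float warpReduceSum(float val)
{
    // shift-down warp sum: after log2(warpSize) steps, lane 0 holds the total
    for (int offset = warpSize/2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);   // CUDA 9+: __shfl_down_sync(0xffffffff, val, offset)
    return val;
}

__inline__ __device__ float blockReduceSum(float val)
{
    static __shared__ float shared[32];     // one partial per warp (max 32 warps/block)
    int lane = threadIdx.x % warpSize;
    int wid  = threadIdx.x / warpSize;

    val = warpReduceSum(val);               // each warp reduces its own values
    if (lane == 0) shared[wid] = val;       // warp leaders publish their partials
    __syncthreads();

    // first warp reduces the per-warp partials
    val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0;
    if (wid == 0) val = warpReduceSum(val);
    return val;                             // thread 0 holds the block total
}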
Update on choosing the number of threads:
You might not want to use the maximum number of threads if it results in a less efficient use of registers. From this link on occupancy:
For purposes of calculating occupancy, the number of registers used by each thread is one of the key factors. For example, devices with compute capability 1.1 have 8,192 32-bit registers per multiprocessor and can have a maximum of 768 simultaneous threads resident (24 warps x 32 threads per warp). This means that for a multiprocessor on one of these devices to have 100% occupancy, each thread can use at most 10 registers. However, this approach of determining how register count affects occupancy does not take into account the register allocation granularity. For example, on a device of compute capability 1.1, a kernel with 128-thread blocks using 12 registers per thread results in an occupancy of 83% with 5 active 128-thread blocks per multiprocessor, whereas a kernel with 256-thread blocks using the same 12 registers per thread results in an occupancy of 66% because only two 256-thread blocks can reside on a multiprocessor.
So the way I understand it is that an increased number of threads has the potential to limit performance because of the way registers are allocated. However, this is not always the case, and you need to do the calculation (as in the statement above) to determine the optimal number of threads per block.
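To make the quoted figures concrete, here is the arithmetic behind them as a small check. This is a back-of-the-envelope model that only rounds down to whole blocks, ignoring the finer register allocation granularity the quote warns about:

#include <cstdio>

int main()
{
    // compute capability 1.1: 8192 registers/SM, 768 resident threads/SM max
    const int regsPerSM = 8192, maxThreadsPerSM = 768, regsPerThread = 12;
    for (int blockThreads = 128; blockThreads <= 256; blockThreads *= 2) {
        int regsPerBlock    = blockThreads * regsPerThread;    // 1536, then 3072
        int blocksPerSM     = regsPerSM / regsPerBlock;        // 5, then 2
        int residentThreads = blocksPerSM * blockThreads;      // 640, then 512
        int occupancyPct    = 100 * residentThreads / maxThreadsPerSM; // 83, then 66
        std::printf("%d-thread blocks: %d blocks/SM -> %d%% occupancy\n",
                    blockThreads, blocksPerSM, occupancyPct);
    }
    return 0;
}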