The CUDA documentation does not say whether uncoalesced writes cost the same as uncoalesced reads, so to get peak performance out of my sort I ran a few tests with different kernels under the CUDA profiler.
I wrote four kernels, each doing a very simple task: read from global memory into shared memory, then write from shared memory into a different area of global memory. Each kernel reads or writes global memory in a different pattern so that the accesses are either coalesced or uncoalesced on my 8800GT. Note that on newer cards the coalescing rules are slightly different (see the documentation for details).
The CUDA documentation specifies that for cudaMalloc "The allocated memory is suitably aligned for any kind of variable", so we can assume that our global memory blocks are correctly aligned for coalesced reading and writing.
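For reference, the buffers were allocated roughly like this (a minimal sketch, not the original host code; the sizes assume the 32768-block, 192-thread launch described below, with both buffers allocated at twice the linear size so the strided kernels B, C and D stay in bounds):
const size_t numBlocks  = 32768;
const size_t numThreads = 192;
// Twice the linear size, because kernels B, C and D index with a *2 stride.
const size_t numFloats  = numBlocks * numThreads * 2;

float* d_idata = 0;
float* d_odata = 0;
cudaMalloc((void**)&d_idata, numFloats * sizeof(float));
cudaMalloc((void**)&d_odata, numFloats * sizeof(float));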
The four kernels are as follows:
(A):
__global__ void readCowriteCo_kernel(float* d_idata, float* d_odata)
{
    __shared__ float sdata[192];
    const unsigned int tid = threadIdx.x;
    const unsigned int bid = blockIdx.x;
    // Coalesced read: consecutive threads access consecutive addresses.
    sdata[tid] = d_idata[bid*192+tid];
    __syncthreads();
    // Coalesced write: consecutive threads access consecutive addresses.
    d_odata[bid*192+tid] = sdata[tid];
    __syncthreads();
}
(B):
__global__ void readUCowriteCo_kernel(float* d_idata, float* d_odata)
{
    __shared__ float sdata[192];
    const unsigned int tid = threadIdx.x;
    const unsigned int bid = blockIdx.x;
    // Uncoalesced read: the *2 stride breaks coalescing on the 8800GT.
    sdata[tid] = d_idata[bid*192*2+tid*2];
    __syncthreads();
    // Coalesced write.
    d_odata[bid*192+tid] = sdata[tid];
    __syncthreads();
}
(C):
__global__ void readCowriteUCo_kernel(float* d_idata, float* d_odata)
{
    __shared__ float sdata[192];
    const unsigned int tid = threadIdx.x;
    const unsigned int bid = blockIdx.x;
    // Coalesced read.
    sdata[tid] = d_idata[bid*192+tid];
    __syncthreads();
    // Uncoalesced write: the *2 stride breaks coalescing on the 8800GT.
    d_odata[bid*192*2+tid*2] = sdata[tid];
    __syncthreads();
}
(D):
__global__ void readUCowriteUCo_kernel(float* d_idata, float* d_odata)
{
    __shared__ float sdata[192];
    const unsigned int tid = threadIdx.x;
    const unsigned int bid = blockIdx.x;
    // Uncoalesced read.
    sdata[tid] = d_idata[bid*192*2+tid*2];
    __syncthreads();
    // Uncoalesced write.
    d_odata[bid*192*2+tid*2] = sdata[tid];
    __syncthreads();
}
These kernels were launched with a grid of 32768 blocks and 192 threads per block; 192 is a multiple of 32, which ensures only full warps are used. The kernels were run in succession 1000 times, and all four kernels have 100% occupancy.
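The launches looked roughly like this (a sketch, assuming the buffers allocated above; the timings themselves came from the CUDA profiler, not from host-side timing):
dim3 grid(32768);
dim3 block(192);

// Run the four kernels in succession 1000 times; the profiler reports
// per-launch figures which are then averaged.
for (int i = 0; i < 1000; ++i)
{
    readCowriteCo_kernel<<<grid, block>>>(d_idata, d_odata);
    readUCowriteCo_kernel<<<grid, block>>>(d_idata, d_odata);
    readCowriteUCo_kernel<<<grid, block>>>(d_idata, d_odata);
    readUCowriteUCo_kernel<<<grid, block>>>(d_idata, d_odata);
}
cudaThreadSynchronize();   // wait for the last launch to complete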
According to the CUDA profiler, the kernels produced on average the following (8800GT):
Kernel | GPU Time | Uncoalesced Read | Coalesced Read | Uncoalesced Write | Coalesced Write
A      |  1121.73 |                0 |          56172 |                 0 |          224688
B      |  4879.17 |           898752 |              0 |                 0 |          224688
C      |  6267.65 |                0 |          56172 |           1797504 |               0
D      |   9735.1 |           898752 |              0 |           1797504 |               0
Based purely on the timings, kernel B is much faster than kernel C. I am still investigating why a coalesced write is counted as four times as many operations as a coalesced read (update: please see here for the explanation); the memory from cudaMalloc is supposed to be correctly aligned. Kernel D shows that uncoalesced writes cost twice as many operations as uncoalesced reads (1797504 vs 898752) and, in total, it performs 9.6 times as many memory operations as kernel A, with GPU time increasing by roughly the same factor.
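As a quick check, the 9.6 figure comes straight from the table above:
Kernel D: 898752 + 1797504 = 2696256 memory operations in total
Kernel A: 56172 + 224688 = 280860 memory operations in total
Ratio: 2696256 / 280860 = 9.6, while GPU time grows by 9735.1 / 1121.73 ≈ 8.7x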
From the above results, it is much more efficient for my sort algorithm's Kernel A to write its output in a coalesced manner when it has finished, and for Kernel B to then read that data back in an uncoalesced fashion, rather than the other way around (these are the sort's own kernels, not the test kernels above). This result should hold in any setup where multiple kernel launches perform operations on the same data.
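As an illustration of that split (a hypothetical sketch, not the actual sort kernels): the first pass writes its intermediate results coalesced, and the second pass absorbs the uncoalesced access on its read side, which the tests above show is the cheaper place to pay for it. Here d_tmp is assumed to be allocated at twice the linear size, as with the buffers for kernels B and D.
__global__ void pass1_writeCo_kernel(const float* d_in, float* d_tmp)
{
    const unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    d_tmp[i] = d_in[i];            // coalesced write, as in kernel A
}

__global__ void pass2_readUCo_kernel(const float* d_tmp, float* d_out)
{
    const unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    d_out[i] = d_tmp[i*2];         // uncoalesced read, coalesced write, as in kernel B
}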