Describe the bug
In chapter 10 chapter10/README.md, I think there's a race condition in the solution for exercise 4.
__global__ void CoarsenedMaxReductionKernel(float* input, float* output) {
__shared__ float input_s[BLOCK_DIM];
unsigned int segment = COARSE_FACTOR*2*blockDim.x*blockIdx.x;
unsigned int i = segment + threadIdx.x;
unsigned int t = threadIdx.x;
float maximum_value = input[i];
for(unsigned int tile = 1; tile < COARSE_FACTOR*2; ++tile) {
maximum_value = fmax(maximum_value, input[i + tile*BLOCK_DIM]);
}
input_s[t] = maximum_value;
for (unsigned int stride = blockDim.x/2; stride >= 1; stride /= 2){
__syncthreads();
if (t < stride) {
input_s[t] = fmax(input_s[t], input_s[t + stride]);
}
}
if (t == 0) {
atomicExch(output, fmax(*output, input_s[0]));
}
}
Note that in the line atomicExch(output, fmax(*output, input_s[0]));. I imagine that it's possible for something like the following sequence of events to happen between example blocks A and B (assuming *output = -inf for convience)
- A: reads *output = -inf
- A: computes candA = fmax(-inf, 3) = 3
- B: reads *output = -inf
- B: computes candB = fmax(-inf, 5) = 5
- B: atomicExch(output, 5) -> *output becomes 5
- A: atomicExch(output, 3) -> *output becomes 3 (wrong)
Proposed solution
You can use CUDA's atomicMax operation. (i.e. atomicMax(output, input_s[0]);).
Describe the bug
In chapter 10
chapter10/README.md, I think there's a race condition in the solution for exercise 4.Note that in the line
atomicExch(output, fmax(*output, input_s[0]));. I imagine that it's possible for something like the following sequence of events to happen between example blocks A and B (assuming*output = -inffor convience)Proposed solution
You can use CUDA's
atomicMaxoperation. (i.e.atomicMax(output, input_s[0]);).