Answered by:
AMP copy bug?

Question
-
I'm getting a bug (possibly in copying results back to the CPU), the code below reproduces on my setup with both a Tesla C2075 and the reference emulator:
Tesla results:
NVIDIA Tesla C2075 2 4 8 mismatch 1 3 16 mismatch 2 4 32 mismatch 1 5 Press any key to continue . . .
Emulator results:
Software Adapter 2 4 8 mismatch 4815112 3 16 mismatch 0 4 32 mismatch 4 5 Press any key to continue . . .
code:
#include <iostream> #include <amp.h> using namespace concurrency; template<unsigned int tile_size> void sanityFunction( const int w, const int h, const int count) { std::cout << tile_size << std::endl; array<int, 1> next_chunk(1); array<int, 2> device_vec(w, h); const int max_chunks = (w * h) / (tile_size * tile_size); parallel_for_each( device_vec.extent.tile<tile_size, tile_size>(), [=, &device_vec, &next_chunk] (tiled_index<tile_size, tile_size> idx) restrict(amp) { tile_static int chunk_id; tile_static int global_y; tile_static int global_x; // Here each tile will process a chuck of data and pick next block to process // This is like load balancing computation, a tile will pick next available chunk to process // "chunk_id" value in a tile will determine which chunk of data is being processed by this tile while (1) { // All threads from previous iteration sync here idx.barrier.wait(); if (idx.local[1] == 0 && idx.local[0] == 0) { // Sync-ing chuck to be processed between tiles // get chunk to process for this tile chunk_id = atomic_fetch_add(&next_chunk[0], 1); global_x = chunk_id % (w / tile_size) * tile_size; global_y = chunk_id / (w / tile_size) * tile_size; } // Sync within a tile. // Now threads have tile specific chunk_id, global_y, and global_x idx.barrier.wait(); if (chunk_id >= max_chunks) break; device_vec(idx.global[1], idx.global[0]) = (int)count; } }); std::vector<int> dists(w*h, 0); copy(device_vec, &dists[0]); if(dists[0]!=count) std::cout << "mismatch " << dists[0] << "\t" << count << std::endl; } bool pick_accelerator() { std::vector<accelerator> accs = accelerator::get_all(); accelerator chosen_one; auto result = std::find_if(accs.begin(), accs.end(), [] (const accelerator& acc) { return !acc.is_emulated; }); if (result != accs.end()) chosen_one = *(result); std::wcout << chosen_one.description << std::endl; bool success = accelerator::set_default(chosen_one.device_path); return success; } void main() { pick_accelerator(); unsigned int count=1; int size = 256; sanityFunction<2>(size, size, count); count++; sanityFunction<4>(size, size, count); count++; sanityFunction<8>(size, size, count); count++; sanityFunction<16>(size, size, count); count++; sanityFunction<32>(size, size, count); }
can anyone else reproduce? thanks.
Note: this code comes from the Mandlebrot AMP sample run repeatedly. doesn't happen without the wait()s.
Tuesday, June 5, 2012 12:57 PM
Answers
-
Hi tspitz,
It seems that you didn't initialize "next_chunk" at:
array<int, 1> next_chunk(1);
Your code depends on that it will be initialized to 0 automatically, which is not guaranteed.
I changed the code to
int init_value = 0;
array<int, 1> next_chunk(1, &init_value);
Then it works fine.
BTW: I notice you wrote:
device_vec(idx.global[1], idx.global[0]) = (int)count;
It should not affect the correctness of this code. However, is this intentional for your application?
Note, for 2D, idx.global[0] is the row id and idx.global[1] is the column id. So idx.global[0] is not indexing on the unit-stride dimension. In this way, you are getting non-coalesced memory access. To get a better access pattern, maybe you want to use:
device_vec(idx.global[0], idx.global[1]) = (int)count;
or just
device_vec[idx.global] = (int)count;
Thanks,
Weirong
- Proposed as answer by DanielMothMicrosoft employee Tuesday, June 5, 2012 6:00 PM
- Marked as answer by tspitz Wednesday, June 6, 2012 12:27 PM
Tuesday, June 5, 2012 4:05 PM
All replies
-
Hi tspitz,
It seems that you didn't initialize "next_chunk" at:
array<int, 1> next_chunk(1);
Your code depends on that it will be initialized to 0 automatically, which is not guaranteed.
I changed the code to
int init_value = 0;
array<int, 1> next_chunk(1, &init_value);
Then it works fine.
BTW: I notice you wrote:
device_vec(idx.global[1], idx.global[0]) = (int)count;
It should not affect the correctness of this code. However, is this intentional for your application?
Note, for 2D, idx.global[0] is the row id and idx.global[1] is the column id. So idx.global[0] is not indexing on the unit-stride dimension. In this way, you are getting non-coalesced memory access. To get a better access pattern, maybe you want to use:
device_vec(idx.global[0], idx.global[1]) = (int)count;
or just
device_vec[idx.global] = (int)count;
Thanks,
Weirong
- Proposed as answer by DanielMothMicrosoft employee Tuesday, June 5, 2012 6:00 PM
- Marked as answer by tspitz Wednesday, June 6, 2012 12:27 PM
Tuesday, June 5, 2012 4:05 PM -
perfect thanks. Note the uninitialised vector:
array<int, 1> count(1);
comes from your Mandlebrot sample - you should fix that sample too.
Wednesday, June 6, 2012 12:27 PM -
Thanks! We will fix the sample.Wednesday, June 6, 2012 8:20 PM