AMP copy bug?

  • Question

  • I'm getting a bug (possibly in copying results back to the CPU). The code below reproduces it on my setup with both a Tesla C2075 and the reference emulator: 

    Tesla results:

    NVIDIA Tesla C2075
    2
    4
    8
    mismatch 1      3
    16
    mismatch 2      4
    32
    mismatch 1      5
    Press any key to continue . . .

    Emulator results:

    Software Adapter
    2
    4
    8
    mismatch 4815112        3
    16
    mismatch 0      4
    32
    mismatch 4      5
    Press any key to continue . . .

    code:

    #include <algorithm>
    #include <iostream>
    #include <vector>
    
    #include <amp.h>
    using namespace concurrency;
    
    
    template<unsigned int tile_size>
    void sanityFunction(
        const int w, 
        const int h, 
        const int count)
    {
        std::cout << tile_size << std::endl;
    
        array<int, 1> next_chunk(1);
        array<int, 2> device_vec(w, h);
    
        const int max_chunks = (w * h) / (tile_size * tile_size);
    
        parallel_for_each(
            device_vec.extent.tile<tile_size, tile_size>(),
            [=, &device_vec, &next_chunk] (tiled_index<tile_size, tile_size> idx) restrict(amp)
        {
            tile_static int chunk_id;
            tile_static int global_y;
            tile_static int global_x;
    
            // Each tile processes a chunk of data and then picks the next chunk to process.
            // This load-balances the computation: a tile grabs the next available chunk.
            // The tile_static "chunk_id" records which chunk this tile is processing.
            while (1)
            {
                // All threads from previous iteration sync here
                idx.barrier.wait();
                if (idx.local[1] == 0 && idx.local[0] == 0)
                {
                    // Thread (0, 0) atomically grabs the next chunk for this tile.
                    chunk_id = atomic_fetch_add(&next_chunk[0], 1);
                    global_x = chunk_id % (w / tile_size) * tile_size;
                    global_y = chunk_id / (w / tile_size) * tile_size;
                }
                // Sync within a tile.
                // Now threads have tile specific chunk_id, global_y, and global_x
                idx.barrier.wait();
    
                if (chunk_id >= max_chunks) break; 
    
                device_vec(idx.global[1], idx.global[0]) = (int)count;
            }
        });
    
        std::vector<int> dists(w*h, 0);
        copy(device_vec, &dists[0]);
        if(dists[0]!=count)
            std::cout << "mismatch " << dists[0] << "\t" << count << std::endl;
    }
    
    bool pick_accelerator()
    {
        std::vector<accelerator> accs = accelerator::get_all();
        accelerator chosen_one;
    
        auto result = 
            std::find_if(accs.begin(), accs.end(), [] (const accelerator& acc)
        {
            return !acc.is_emulated;
        });
    
        if (result != accs.end())
            chosen_one = *(result);
    
        std::wcout << chosen_one.description << std::endl;
    
        bool success = accelerator::set_default(chosen_one.device_path);
        return success;
    }
    
    int main()
    {
        pick_accelerator();
    
        unsigned int count=1;
        int size = 256;
        sanityFunction<2>(size, size, count);
        count++;
        sanityFunction<4>(size, size, count);
        count++;
        sanityFunction<8>(size, size, count);
        count++;
        sanityFunction<16>(size, size, count);
        count++;
        sanityFunction<32>(size, size, count);
    }
    

    Can anyone else reproduce this? Thanks.

    Note: this code is derived from the Mandelbrot AMP sample run repeatedly. The mismatch doesn't occur without the barrier wait() calls.

    Tuesday, June 5, 2012 12:57 PM

Answers

  • Hi tspitz,

    It seems that you didn't initialize "next_chunk" here:

    array<int, 1> next_chunk(1);

    Your code depends on it being zero-initialized automatically, which is not guaranteed.

    I changed the code to:

    int init_value = 0;
    array<int, 1> next_chunk(1, &init_value);

    Then it works fine.

    BTW: I notice you wrote:

     device_vec(idx.global[1], idx.global[0]) = (int)count;

    This should not affect the correctness of the code, but is it intentional for your application?

    Note that for 2D, idx.global[0] is the row index and idx.global[1] is the column index; idx.global[1] is what varies between neighbouring threads. In the line above, the unit-stride (second) subscript is idx.global[0], which does not vary between neighbouring threads, so you are getting non-coalesced memory access. For a better access pattern, you may want:

     device_vec(idx.global[0], idx.global[1]) = (int)count;

    or just

     device_vec[idx.global] = (int)count;

    Thanks,

    Weirong

    Tuesday, June 5, 2012 4:05 PM

All replies

  • Perfect, thanks. Note that the uninitialized array:

    array<int, 1> count(1);

    comes from your Mandelbrot sample - you should fix that sample too.

    Wednesday, June 6, 2012 12:27 PM
  • Thanks! We will fix the sample.
    Wednesday, June 6, 2012 8:20 PM