tile static memory use

  • Question

  • tile_static memory can be used only within a tiled parallel_for_each. In my case, I want to use tile_static memory, but I do not need a tiled parallel_for_each. I tried a workaround: set tile_size = 1 and then use tile_static memory. My code is as follows:

    parallel_for_each(extent<1>(num_lines).tile<1>(),
    [=] (tiled_index<1> tidx) restrict(amp)
    {
        tile_static unsigned int img[1024];

        for(int ii = 0; ii < num_buff_samples; ii++)
        {
            img[ii] = img_r_();  // fetch image pixels from global memory into tile_static memory
        }
    });

    The above code generates fatal error C1001: An internal error has occurred in the compiler.

    Thursday, November 1, 2012 5:20 PM

Answers

  • Hi BingCai,

    Is this fatal error similar to the one you mentioned in the other post? Does the workaround Amit mentioned in that thread (increase lexical scopes via adding extra wrapping blocks) work for you? Let us know if you still encounter any issues with your programs.

    Thanks,
    Lingli

    • Marked as answer by Bingcai Zhang Thursday, November 8, 2012 4:45 PM
    Tuesday, November 6, 2012 8:51 PM
    Moderator
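
The other thread is not reproduced here, so the exact form of Amit's workaround is not known. One plausible reading of "adding extra wrapping blocks" is to enclose parts of a long if/else chain in additional braces, which changes only the lexical scoping and not the behavior. The sketch below is plain C++ rather than restrict(amp) code, and `Penalties`, `pick_penalties`, and the numeric values are hypothetical stand-ins for the P1/P2 selection in the kernel.

```cpp
// Hypothetical stand-in for the P1/P2 pair chosen inside the kernel.
struct Penalties {
    float p1;
    float p2;
};

// Same three-way decision as the kernel's if / else if / else chain,
// but with extra wrapping blocks around the branches. The extra braces
// add lexical scopes only; the result is identical to the flat chain.
Penalties pick_penalties(int d1, int d2, int tso) {
    Penalties p{0.0f, 0.0f};
    {   // extra wrapping block around the whole chain
        if (d1 < tso && d2 < tso) {
            p = {1.0f, 2.0f};          // both differences small
        } else {
            {   // extra wrapping block around the remaining branches
                if (d1 < tso || d2 < tso) {
                    p = {3.0f, 4.0f};  // exactly one difference small
                } else {
                    p = {5.0f, 6.0f};  // both differences large
                }
            }
        }
    }
    return p;
}
```

Whether this particular shape avoids C1001 in a given kernel is not something the thread confirms; it only shows that the transformation is behavior-preserving.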

All replies

  • It's NOT a good idea to have tile_size == 1. It means that each tile contains only one thread. Depending on the tile size, a tile contains one or more hardware scheduling units (warps/wavefronts). In your case, each scheduling unit contains only one active thread, which badly under-utilizes the hardware. Moreover, GPUs also limit the number of tiles and warps that a streaming multiprocessor can schedule at any moment. So with tile_size == 1, the whole GPU will be significantly under-utilized, and you won't get any performance benefit from using the GPU.

    Regarding the fatal compiler error, the C++ AMP team will follow up with you.

    Thanks,

    Weirong

    Thursday, November 1, 2012 5:37 PM
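
The under-utilization described in the reply above can be put into rough numbers. The sketch below assumes that a tile always occupies whole scheduling units and that a unit is 32 lanes wide (a warp) or 64 lanes wide (a wavefront); both widths are assumptions about typical hardware, and `lane_utilization` is a hypothetical helper, not part of C++ AMP.

```cpp
#include <cstddef>

// Fraction of SIMD lanes doing useful work when a tile of `tile_size`
// threads is mapped onto scheduling units that are `unit_width` lanes
// wide. Assumes a tile occupies a whole number of units.
double lane_utilization(std::size_t tile_size, std::size_t unit_width) {
    // Number of scheduling units needed to hold the tile (rounded up).
    const std::size_t units = (tile_size + unit_width - 1) / unit_width;
    return static_cast<double>(tile_size) /
           static_cast<double>(units * unit_width);
}
```

Under these assumptions, `lane_utilization(1, 32)` is 1/32, `lane_utilization(8, 32)` is 1/4, and a tile size that is a multiple of the unit width reaches 1.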
  • Hi Bingcai,

    Thanks for reporting the errors to us.

    I saw your earlier post where you reported an internal compiler error. Is the error mentioned in that earlier post related to the error mentioned in this post?

    If they are not related and the source code is different, please share with us the source code related to this new error. You can mail to pavanmj @ the company you know dot com

    regards,

    Pavan

    Thursday, November 1, 2012 7:38 PM
  • Hi Weirong,

    Thanks for your suggestions. I have the following problem, and I hope you can point me to a better solution than using a 1-thread tile.

    I have an image that is too large to fit into constant memory. Each thread needs to access one line of the image (512 lines by 512 samples), and it needs to access the same pixel in that line 41 times. Since global memory access is 100 times slower than tile_static memory access, accessing global memory 41 times per pixel is the performance bottleneck. I cannot use constant memory because my image is > 16K. Texture memory might help, but the gain is not going to be significant. I am using parallel_for_each such that each thread processes one line. Because the result from processing one pixel in a line is used to process the next pixel in the same line, my parallel_for_each has an extent<1>(512).

    I need a way to fetch and store 512 (unsigned int) pixels per thread in a fast-access memory (not global memory). I could not figure out anything other than tile_static memory. I guess I could group 8 lines into one tile. (Is tile_static memory also limited to 52k?)

    My suggestion to your AMP team is that it would be really useful to have something similar to tile_static memory for a non-tiled parallel_for_each.

    My other question is about the fatal compile error C1001 related to tile_static memory. It happened to me two days ago (I spent half a day struggling with it, see my earlier report). After I had made lots of changes around the tile_static memory code, the fatal compile error went away. I actually ran my code and it works fine. This morning, I had the fatal compile error again in a different class using the 1-thread tile workaround. It seems to me the AMP compiler is unpredictable in the current pre-release version. When will Microsoft release AMP 1.0? Before that, can we get patches to resolve issues like this? We are relying on AMP to release our software next year.

    cheers,

    Bingcai

    Thursday, November 1, 2012 8:21 PM
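
The question above about how many lines fit into one tile comes down to simple arithmetic: one line of 512 unsigned int samples takes 512 × 4 = 2048 bytes. The sketch below leaves the tile_static budget as a parameter, because the 52k figure in the post is a guess; 32 KB per tile is the figure usually quoted for C++ AMP, but that is stated here as an assumption to verify against the documentation. `lines_per_tile` is a hypothetical helper.

```cpp
#include <cstddef>

// Number of whole image lines that fit into a tile_static budget.
// budget_bytes is supplied by the caller; the per-tile limit is an
// assumption to check against the C++ AMP documentation.
std::size_t lines_per_tile(std::size_t samples_per_line,
                           std::size_t bytes_per_sample,
                           std::size_t budget_bytes) {
    const std::size_t bytes_per_line = samples_per_line * bytes_per_sample;
    if (bytes_per_line == 0) {
        return 0;  // nothing to store, avoid dividing by zero
    }
    return budget_bytes / bytes_per_line;
}
```

With 512 samples of 4 bytes each, a 32 KB budget holds 16 lines, so the 8-line grouping suggested in the post would use 16 KB.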
  • Hi Bingcai,

    Assuming that the computation for each pixel relies only on the previous pixel's result, you can do it sequentially without bringing the whole line of pixels into tile_static memory. Fetch the pixel from global memory once and store it in a local variable. Refer to the local variable as many times as needed and keep the result in a second variable. Then fetch the next pixel from global memory and store it in the same local variable as the previous pixel, use the result that you saved to derive the result for this pixel, and when you are done with this pixel store the result again. Repeat until you are done with the entire line. Please let me know if I misinterpreted your scenario.

    If you are using the release version of Visual Studio 2012, then you are using C++ AMP v1.0. If that is what you are using and you still see a bug, then we will investigate it and help you out with either a workaround or a bug fix.

    Thanks,


    Szymon

    Thursday, November 1, 2012 11:50 PM
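
The approach in the reply above can be sketched on the host as follows. This is plain C++ rather than restrict(amp) code, and `update` is a hypothetical placeholder for the real per-pixel computation; it only illustrates the pattern of reading each pixel once into a local variable, reusing it many times, and carrying the previous result forward.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-pixel update: uses the pixel value 41 times and the
// previous pixel's result once. The real computation would go here.
inline unsigned update(unsigned pixel, unsigned prev_result) {
    unsigned acc = prev_result;
    for (int k = 0; k < 41; ++k) {
        acc += pixel;  // every reuse hits the local copy, not global memory
    }
    return acc;
}

// Processes one line sequentially, as a single thread would.
std::vector<unsigned> process_line(const std::vector<unsigned>& line) {
    std::vector<unsigned> out(line.size());
    unsigned prev = 0;  // result carried from the previous pixel
    for (std::size_t i = 0; i < line.size(); ++i) {
        const unsigned pixel = line[i];  // the only read of this pixel
        prev = update(pixel, prev);
        out[i] = prev;                   // store the result once
    }
    return out;
}
```

This pattern only applies when each result depends on the current pixel and the previous result; if a pixel's result needs many neighbouring pixels from the same line, a local variable is not enough.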
  • Hi Simon,

    Thank you very much for your help. To compute the result for each pixel, I need to use many pixels in the same line, in addition to the result from the previous pixel. That is why I need to use tile_static memory. I changed my code to have a tile size = 8. The VS version I am using is:

    Microsoft Visual Studio Professional 2012

    Version 11.0.50727.1 RTMREL

    I am copying the following function for you, hoping that you can reproduce the fatal compile error.

    I have commented out the following lines, which cause the fatal compile error C1001. They are related to tile_static memory. To reproduce the fatal compile error, un-comment the following lines.

    The fatal compile error is at the end. Thank you again for your help. My neck is on the line and we have to release our software next year using AMP :) If you need more info to reproduce the fatal compile error, please let me know and I can send you more code. Yesterday, I sent a different set of code to Pavan through regular email. That function is OK now. The code I sent to Pavan should still reproduce the fatal compile error.

    //       else if(D1 < TSO || D2 < TSO)
    //       {
    //        P1 = P1_MEAD_INTENSITY_DIFF_WEIGHT;
    //        P2 = P2_MEAD_INTENSITY_DIFF_WEIGHT;
    //       }
    //       else
    //       {
    //        P1 = P1_LARGE_INTENSITY_DIFF_WEIGHT;
    //        P2 = P2_LARGE_INTENSITY_DIFF_WEIGHT;
    //       }

    int
    AteNgateSgmAmpHorizontalMatch::applyHorizontalBackwardMatch(
     int num_lines,                 // number of lines in 2D array                     
     int num_samples,               // number of samples in 2d array
     int ll_line,                   // lower left line index
     int ll_sample,                 // lower left sample index
     array_view<const unsigned int, 2> &reliable_match_2dA,
     array_view<unsigned int, 2> &reliable_match_count_2dA,
     array_view<float, 3> &s_3dA,
     array_view<float, 3> &cost_3dA,   
     array_view<const unsigned int, 2> &img_l_2dA,          // 2D array of image (the left image)
     array_view<const unsigned int, 2> &img_r_2dA,          // 2D array of image (the right image)
     array_view<const int, 2> &x_para_i_2dA,      // 2D array of integer initial x parallax
     int num_buff_samples)
    {
     // the difference between forward match and backward match is the
     // left image vs right image.

     const int dim3 = 2 * m_search_dist + 1;
     const int search_dist = m_search_dist;
     const int SGM_MAX_RELIABLE_PIXELS = m_SGM_MAX_RELIABLE_PIXELS;
     
     const float SGM_P1 = m_SGM_P1;
     const float SGM_P2 = m_SGM_P2;
     const int TSO = (int)m_TSO;
     const float P1_MEAD_INTENSITY_DIFF_WEIGHT = m_P1_MEAD_INTENSITY_DIFF_WEIGHT;
     const float P2_MEAD_INTENSITY_DIFF_WEIGHT = m_P2_MEAD_INTENSITY_DIFF_WEIGHT;
     const float P1_LARGE_INTENSITY_DIFF_WEIGHT = m_P1_LARGE_INTENSITY_DIFF_WEIGHT;
     const float P2_LARGE_INTENSITY_DIFF_WEIGHT = m_P2_LARGE_INTENSITY_DIFF_WEIGHT;
     const float P1_UNRELIABLE_WEIGHT = m_P1_UNRELIABLE_WEIGHT;
     const float P2_UNRELIABLE_WEIGHT = m_P2_UNRELIABLE_WEIGHT;

     int num_lines_tile = num_lines / tile_size * tile_size;
     if(num_lines % tile_size != 0)
      num_lines_tile += tile_size;

     parallel_for_each(extent<1>(num_lines_tile).tile<tile_size>(),
       [=] (tiled_index<tile_size> tidx) restrict(amp)
     {
      int line_idx = tidx.global[0];
      int line_idx2 = line_idx + ll_line;
      int ii;

      tile_static unsigned int img[tile_size][max_img_sample_size];
      
      // only when the index is within num_lines, we fetch image pixels.
      if(tidx.global[0] < num_lines)
      {
       for(ii = 0; ii < num_buff_samples; ii++)
        img[tidx.local[0]][ii] = img_l_2dA(line_idx2, ii); 
      }

      tidx.barrier.wait();

      int sample_idx;
      int sample_idx2;
      
      float min_lr, min_lk, min_lk2;
      int is_pre_pixel_reliable;
      int reliable_index;
      int delta_i;
      int D1, D2;
      float P1, P2;
      int a_ind;
      int sample_ind_right;
      float temp_s;
      
      int shift_l = tidx.tile[0]*tile_size;

      if(tidx.global[0] < num_lines)
      {
       // forward or 0 degree case
       for(sample_idx = 0; sample_idx < num_samples; sample_idx++)
       {
        sample_idx2 = sample_idx + ll_sample;
        // for the first pixel, copy its value from m_cost_3d over.
        if(sample_idx == 0)
        {
         min_lk = MAX_COST_VAL;
         for(ii = 0; ii < dim3; ii++)
         {
          temp_s = s_3dA(line_idx, sample_idx, ii);
          temp_s += cost_3dA(line_idx, sample_idx, ii);
          s_3dA(line_idx, sample_idx, ii) = temp_s; 

          // the min_lk is going to be used in the following pixel. we compute the min value of
          // m_s_3d[line_ind][sample_ind]
          if(temp_s < min_lk)
          {
           min_lk = temp_s;
          }
         }
         min_lk2 = min_lk;

         is_pre_pixel_reliable = reliable_match_2dA(line_idx, sample_idx);
         reliable_index = sample_idx;
         if(is_pre_pixel_reliable == 1)
          reliable_match_count_2dA(line_idx, sample_idx) += 1;
        }
        // if it is not the first pixel, we need to use the SGM logic to apply a smooth constraint to it
        else
        {
         min_lk = min_lk2;
         min_lk2 = MAX_COST_VAL;
               
         int temp_i = x_para_i_2dA(line_idx2, sample_idx2);
         delta_i = temp_i - x_para_i_2dA(line_idx2, sample_idx2 -1);
         sample_ind_right = temp_i + sample_idx2;

         D1 = (img_r_2dA(line_idx2, sample_ind_right) - img_r_2dA(line_idx2, sample_ind_right -1));
         D1 *= D1;
               
         if(reliable_match_2dA(line_idx, sample_idx-1) > 0)
         {
          is_pre_pixel_reliable = 1;
          reliable_index = sample_idx - 1;
         }
         else if(is_pre_pixel_reliable == 1 && sample_idx - reliable_index > SGM_MAX_RELIABLE_PIXELS)
         {
          is_pre_pixel_reliable = 0;
         }

         if(is_pre_pixel_reliable == 1)
          reliable_match_count_2dA(line_idx, sample_idx) += 1;

         for(ii = 0; ii < dim3; ii++)
         {
          if(is_pre_pixel_reliable == 1)
          {
           int sample_ind_left = sample_idx2 + ii - search_dist;
           
           D2 = img[line_idx2 - shift_l][sample_ind_left -1] - img[line_idx2 - shift_l][sample_ind_left];
           D2 *= D2;
                     
           if(D1 < TSO && D2 < TSO)
           {
            P1 = SGM_P1;
            P2 = SGM_P2;
           }
           else if(D1 < TSO || D2 < TSO)
           {
            P1 = P1_MEAD_INTENSITY_DIFF_WEIGHT;
            P2 = P2_MEAD_INTENSITY_DIFF_WEIGHT;
           }
           else
           {
            P1 = P1_LARGE_INTENSITY_DIFF_WEIGHT;
            P2 = P2_LARGE_INTENSITY_DIFF_WEIGHT;
           }
          }
          else
          {
           P1 = P1_UNRELIABLE_WEIGHT;
           P2 = P2_UNRELIABLE_WEIGHT;
          }

          // compute min_lr
          min_lr = MAX_COST_VAL;
          a_ind = ii + delta_i;

          if(a_ind >= 0 && a_ind < dim3)
          {
           min_lr = s_3dA(line_idx, sample_idx-1, a_ind);
          }
                  
          if(a_ind-1 >= 0 && a_ind-1 < dim3)
          {
           temp_s = s_3dA(line_idx, sample_idx-1, a_ind-1);
           if(min_lr > temp_s + P1)
           {
            min_lr = temp_s + P1;
           }
          }
                  
          if(a_ind+1 >= 0 && a_ind+1 < dim3) {
           temp_s = s_3dA(line_idx, sample_idx-1, a_ind+1);
           if(min_lr > temp_s + P1)
           {
            min_lr = temp_s + P1;
           }
          }
                  
          if(min_lr > min_lk + P2)
          {
           min_lr = min_lk + P2;
          }

          temp_s = s_3dA(line_idx, sample_idx, ii);
          temp_s += cost_3dA(line_idx, sample_idx, ii) + min_lr - min_lk;
          s_3dA(line_idx, sample_idx, ii) = temp_s; 

          if(temp_s < min_lk2)
          {
           min_lk2 = temp_s;
          }
         }
        }
       }
       
       // backward or 180 degree case
       for(sample_idx = num_samples - 1; sample_idx >= 0; sample_idx--)
       {
        sample_idx2 = sample_idx + ll_sample;
        // for the first pixel, copy its value from m_cc_3d over
        // since the cc as cost, we need to use 1.0 - m_cc_3d as the aggregated value.
        if(sample_idx == num_samples - 1)
        {
         min_lk = MAX_COST_VAL;
         for(ii = 0; ii < dim3; ii++)
         {
          temp_s = s_3dA(line_idx, sample_idx, ii);
          temp_s += cost_3dA(line_idx, sample_idx, ii);
          s_3dA(line_idx, sample_idx, ii) = temp_s; 
          if(temp_s < min_lk)
          {
           min_lk = temp_s;
          }
         }
         min_lk2 = min_lk;
         is_pre_pixel_reliable = reliable_match_2dA(line_idx, sample_idx);
         reliable_index = sample_idx;
         if(is_pre_pixel_reliable == 1)
          reliable_match_count_2dA(line_idx, sample_idx) += 1;
        }
        // if it is not the first pixel, we need to use the SGM logic to apply a smooth constraint to it
        else
        {
         min_lk = min_lk2;
         min_lk2 = MAX_COST_VAL;
               
         int temp_i = x_para_i_2dA(line_idx2, sample_idx2);
         delta_i = temp_i - x_para_i_2dA(line_idx2, sample_idx2 +1);  
               
         sample_ind_right = temp_i + sample_idx2;

         D1 = (img_r_2dA(line_idx2, sample_ind_right) - img_r_2dA(line_idx2, sample_ind_right +1));
         D1 *= D1;
               
         if(reliable_match_2dA(line_idx, sample_idx+1))
         {
          is_pre_pixel_reliable = 1;
          reliable_index = sample_idx + 1;
         }
         else if(is_pre_pixel_reliable == 1 && reliable_index - sample_idx > SGM_MAX_RELIABLE_PIXELS)
         {
          is_pre_pixel_reliable = 0;
         }

         if(is_pre_pixel_reliable == 1)
          reliable_match_count_2dA(line_idx, sample_idx) += 1;
      
         for(ii = 0; ii < dim3; ii++)
         { 
          if(is_pre_pixel_reliable != 1)
          {
           P1 = P1_UNRELIABLE_WEIGHT;
           P2 = P2_UNRELIABLE_WEIGHT;
          }
          else
          {
           a_ind = sample_idx2 + ii - search_dist;
           D2 = img[line_idx2 - shift_l][a_ind] - img[line_idx2 - shift_l][a_ind+1];
           D2 *= D2;

           if(D1 < TSO && D2 < TSO)
           {
            P1 = SGM_P1;
            P2 = SGM_P2;
           }
    //       else if(D1 < TSO || D2 < TSO)
    //       {
    //        P1 = P1_MEAD_INTENSITY_DIFF_WEIGHT;
    //        P2 = P2_MEAD_INTENSITY_DIFF_WEIGHT;
    //       }
    //       else
    //       {
    //        P1 = P1_LARGE_INTENSITY_DIFF_WEIGHT;
    //        P2 = P2_LARGE_INTENSITY_DIFF_WEIGHT;
    //       }
           
          }
          
          a_ind = ii + delta_i;
          min_lr = MAX_COST_VAL;


          if(a_ind >= 0 && a_ind < dim3)
           min_lr = s_3dA(line_idx, sample_idx+1, a_ind);
                  
          if(a_ind - 1 >= 0 && a_ind - 1 < dim3) {
           temp_s = s_3dA(line_idx, sample_idx+1, a_ind -1);
           if(min_lr > temp_s + P1) {
            min_lr = temp_s + P1;
           }
          }
                  
          if(a_ind + 1 < dim3 && a_ind + 1 >= 0) {
           temp_s = s_3dA(line_idx, sample_idx+1, a_ind+ 1);
           if(min_lr > temp_s + P1) {
            min_lr = temp_s + P1;
           }
          }
                  
          if(min_lr > min_lk + P2)
          {
           min_lr = min_lk + P2;
          }

          temp_s = s_3dA(line_idx, sample_idx, ii);
          temp_s += (cost_3dA(line_idx, sample_idx, ii)) + min_lr - min_lk;

          s_3dA(line_idx, sample_idx, ii) = temp_s;

          if(min_lk2 > temp_s)
          {
           min_lk2 = temp_s;
          }
         }

        }
       }   
      }
     });
     
       s_3dA.synchronize();

       return 0;
    }


    1>------ Build started: Project: AteAMP, Configuration: Debug Win32 ------
    1>  AteNgateSgmAmpHorizontalMatch.cpp
    1>c:\program files (x86)\microsoft visual studio 11.0\vc\include\amp.h(6480): fatal error C1001: An internal error has occurred in the compiler.
    1>  (compiler file 'f:\dd\vctools\compiler\utc\src\p2\main.c', line 211)
    1>   To work around this problem, try simplifying or changing the program near the locations listed above.
    1>  Please choose the Technical Support command on the Visual C++
    1>   Help menu, or open the Technical Support help file for more information
    ========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========

    Friday, November 2, 2012 4:12 PM
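
The function above pads the extent up to a multiple of tile_size and then guards the work with `tidx.global[0] < num_lines`, so that threads in the padding do nothing except reach the barrier. The same arithmetic is sketched below as two small host-side helpers; `round_up_to_tile` and `is_active` are hypothetical names, not C++ AMP functions.

```cpp
#include <cstddef>

// Smallest multiple of `tile` that is >= n (tile must be non-zero).
// Equivalent to the num_lines_tile computation in the posted function.
constexpr std::size_t round_up_to_tile(std::size_t n, std::size_t tile) {
    return (n + tile - 1) / tile * tile;
}

// True when a thread's global index refers to a real line rather than
// to the padding added by the round-up.
constexpr bool is_active(std::size_t global_index, std::size_t n) {
    return global_index < n;
}
```

For example, 513 lines with a tile size of 8 give a padded extent of 520, and the threads with global indices 513 through 519 are inactive.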
  • Hi Lingli,

    The workaround from Amit works! Thanks a lot!

    Thursday, November 8, 2012 4:47 PM