C++ AMP: difference between release and debug of the AMP code

  • Question

  • My test project works fine in both Debug and Release mode on Windows 7 64-bit. While debugging, I noticed that the AMP code still runs on the GPU (no WARP or GPU debugging is supported on Win7).

    I have always tested with the same image file as input for my code, but now I wanted to incorporate AMP into the main project. Of course, it crashes as soon as I use another image as input!

    So I put the problematic image into the test project: it works fine in Debug, but it crashes (restart of the video card driver) in Release.

    So here is the question: what are the differences (optimizations) between Debug and Release for AMP code?

    By the way, is it possible to debug the AMP part in Release mode?

    Tuesday, March 20, 2012 5:12 PM

Answers

All replies

  • Hi PYB_42,

    The description of your failure (restart of the video card driver) sounds like a TDR. Read more about TDR here: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/06/handling-tdrs-in-c-amp.aspx.

    To isolate the cause of the failure, can you try running your app on the direct3d_ref device? If this works fine, you might be hitting a hardware-specific issue with the release configuration. AMP code uses the same optimization settings as the rest of your project (Project -> Properties -> Release Configuration -> C/C++ -> Optimization). This post has more details about using the direct3d_ref accelerator: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/11/direct3d-ref-accelerator-in-c-amp.aspx

    GPU debugging is only enabled in the debug configuration. You could also try using the same optimization level in the debug configuration to see if the crash reproduces. For more on GPU debugging, please read http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/17/start-gpu-debugging-in-visual-studio-11.aspx

    Another way to investigate the problem without the debugger is to catch the accelerator_view_removed exception and examine the reason for the TDR. This might give you an idea of what triggered the crash. To read more about this exception, please read http://blogs.msdn.com/b/nativeconcurrency/archive/2012/02/24/accelerator-view-removed-exception-of-c-amp.aspx

    Thanks,
    Pooja

    Wednesday, March 21, 2012 1:22 AM
  • Thanks Pooja.

    I now catch the exception, and I get "Failed to map staging buffer" when calling array_view.synchronize().

    I don't have any problem with the direct3d_ref device, so the problem must be Nvidia-specific (GeForce GTX 550 Ti).

    Maybe I have a problem similar to the one in this thread: http://social.msdn.microsoft.com/Forums/en-US/parallelcppnative/thread/28c4933d-df19-4aa2-93b8-9f9cc4e85a7a

    The working image doesn't have the same size as the one producing the problem, and since I parallelize over the pixel columns, the extent varies with the image size.

    For the working image, the size and region of interest are:

    height=1090
    width=2076
    right=1977
    left=96
    top=99
    bottom=991
    roiWidth=1881

    For the image causing the crash:

    height=1083
    width=2067
    right=1968
    left=99
    top=99
    bottom=984
    roiWidth=1869

    The extent is created with the roiWidth value.

    After some testing with the working image, changing the roiWidth / extent value doesn't cause any problem. But if I set the bottom value to the one from the other image, it also crashes.

    In the parallel_for_each, I iterate with a for loop from top to bottom values over the pixel lines.

    parallel_for_each(e, [=] (index<1> idx) restrict(amp)
    {
        index<1> x = idx + left;
        const int index = idx[0] * 3;
        avgBrightness(0 + index) = 0;
        avgBrightness(1 + index) = 0;
        avgBrightness(2 + index) = 0;
        for (unsigned int y = top; y <= bottom; y++)
        {
            avgBrightness(0 + index) += (float)CompensationInternal::GetPixel(pixelsR, x[0], y, width);
            avgBrightness(1 + index) += (float)CompensationInternal::GetPixel(pixelsG, x[0], y, width);
            avgBrightness(2 + index) += (float)CompensationInternal::GetPixel(pixelsB, x[0], y, width);
        }
    });

    With any image, I get a TDR if bottom = 984, and it works with bottom = 985!

    I really don't understand why one more iteration removes the crash, especially as I write to the same memory in each iteration.

    Is the Release-mode optimization changing the for loop so much that it causes a crash when synchronizing the avgBrightness array_view?

    • Edited by PYB_42 Wednesday, March 21, 2012 9:32 AM
    Wednesday, March 21, 2012 9:17 AM
  • Hi PYB_42,

    Can you please post a more complete example so we can investigate further?

    Can you check the reason for TDR? You can examine this by calling get_view_removed_reason() on the exception. Does this reproduce in debug configuration with optimization set to /O2? If yes, the exception message may have more details there.

    Thanks,
    Pooja

    Wednesday, March 21, 2012 7:18 PM
  • Hi,

    Here is what I got in release mode. Unfortunately, I don't know where to look up the meaning of the number given as the reason.

    TDR exception received: Failed to map staging buffer.
    Reason: -2005270522
    Error code:887a0005

    In debug, with /O2 and without /RTC1, I also get the TDR, but with more details (although they don't help me much):

    TDR exception received: Failed to map staging buffer.
    ID3D11Device::RemoveDevice: Device removal has been triggered for the following reason (DXGI_ERROR_DEVICE_HUNG: The Device took an unreasonable amount of time to execute its commands, or the hardware crashed/hung. As a result, the TDR (Timeout Detection and Recovery) mechanism has been triggered. The current Device Context was executing commands when the hang occurred. The application may want to respawn and fallback to less aggressive use of the display hardware).
    ID3D11DeviceContext::Map: Returning DXGI_ERROR_DEVICE_REMOVED, when a Resource was trying to be mapped with READ or READWRITE.

    Reason: -2005270522
    Error code:887a0005

    Here is the method causing the TDR. It computes the average pixel value over a specified ROI of an image (an RGB image with 16 bits per channel, so it is in fact stored as ushort arrays). Since accumulating into a single global variable from a kernel isn't efficient, each thread computes the average of a whole pixel column, and these averages are merged afterwards on the CPU.

    void CompensationInternal::CalcAverages(const unsigned int height, const unsigned int width,
                                            const unsigned int right, const unsigned int left,
                                            const unsigned int top, const unsigned int bottom,
                                            array_view<const unsigned int> pixelsR,
                                            array_view<const unsigned int> pixelsG,
                                            array_view<const unsigned int> pixelsB,
                                            float* outAvgBrightness)
    {
        const int size = height * width;
        const int roiWidth = right - left;
        const int roiHeight = bottom - top;

        const unsigned int roiSize = roiWidth * roiHeight;

        float* avg = new float[roiWidth * 3];
        array_view<float, 1> avgBrightness(3 * roiWidth, avg);

        avgBrightness.discard_data();

        std::cout << "roiWidth=" << roiWidth << std::endl;

        index<1> origin(0);
        extent<1> e(roiWidth);
        try
        {
            // Run code on the GPU
            parallel_for_each(e, [=] (index<1> idx) restrict(amp)
            {
                index<1> x = idx + left;
                const int index = idx[0] * 3;
                avgBrightness(0 + index) = 0;
                avgBrightness(1 + index) = 0;
                avgBrightness(2 + index) = 0;
                for (unsigned int y = top; y <= bottom; y++)
                {
                    avgBrightness(0 + index) += (float)CompensationInternal::GetPixel(pixelsR, x[0], y, width);
                    avgBrightness(1 + index) += (float)CompensationInternal::GetPixel(pixelsG, x[0], y, width);
                    avgBrightness(2 + index) += (float)CompensationInternal::GetPixel(pixelsB, x[0], y, width);
                }
            });
            // Copy data from GPU to CPU
            avgBrightness.synchronize();
        }
        catch (accelerator_view_removed& ex)
        {
            std::cout << "TDR exception received: " << ex.what() << std::endl;
            std::cout << "Reason: " << ex.get_view_removed_reason() << std::endl;
            std::cout << "Error code:" << std::hex << ex.get_error_code() << std::endl;
        }

        // Merge the per-column sums on the CPU
        for (int x = 0; x < roiWidth; x++)
        {
            const int index = x * 3;

            outAvgBrightness[0] += avg[index + 0];
            outAvgBrightness[1] += avg[index + 1];
            outAvgBrightness[2] += avg[index + 2];
        }

        outAvgBrightness[0] /= roiSize;
        outAvgBrightness[1] /= roiSize;
        outAvgBrightness[2] /= roiSize;
        delete[] avg;
    }


    unsigned int CompensationInternal::GetPixel(array_view<const unsigned int> img, const unsigned int x, const unsigned int y, const unsigned int width) restrict(amp)
    {
        return read_ushort(img, y * width + x);
    }

    // Read ushort at index idx from array arr.
    template <typename T>
    unsigned int read_ushort(T& arr, int idx) restrict(cpu, amp)
    {
        return (arr[idx >> 1] & (0xFFFF << ((idx & 0x1) << 4))) >> ((idx & 0x1) << 4);
    }


    I get a TDR depending on the value of bottom in

    for (unsigned int y = top; y <= bottom; y++)

    When I get a TDR, just adding 1 to bottom removes it! The TDR doesn't occur during the loop but on

    avgBrightness.synchronize();

    Thanks

    Thursday, March 22, 2012 7:54 AM
  • Hi PYB_42,

    Thanks very much for providing the complete example.

    We tried this on a variety of hardware and were able to reproduce the TDR on a card similar to yours. We have reported this to the hardware vendor.

    Thank you for reporting this.

    -Pooja


    Pooja Nagpal

    • Marked as answer by PYB_42 Monday, March 26, 2012 6:42 AM
    Saturday, March 24, 2012 1:01 AM
  • Hi,

    Thanks for testing the code. I will not use /O2 for the moment.

    Thanks

    Monday, March 26, 2012 6:42 AM