PC freezes while running C++ AMP code

  • Question

  • When running C++ AMP code that takes more than a few seconds, my whole PC freezes: I am not even able to move the mouse cursor. Other tasks are still ongoing, e.g. I can still hear my music, but the whole screen is blocked until the execution of the AMP code finishes, at which point everything comes back to normal. The accelerator I'm running on is an HD 7770 with the latest drivers, the OS is Windows 7 64-bit, and I'm using the RC version of Visual Studio.

    Is this normal behaviour, perhaps caused by too high a load on the GPU, or is it some other issue? Has anyone else encountered something like this, or does anyone have a suggestion on how to alleviate it?

    Thanks and regards

    Thursday, June 28, 2012 4:03 PM

Answers

  • Hi VladMi

    We haven’t seen a response from you in 10 days, so we will close this thread.

    Our best guess based on the info provided is that your kernel, including the copies, takes 7 seconds with the default queuing mode. That should have caused your driver to reset due to TDR, but someone or something on your machine has turned TDR off, so instead of a driver reset you are observing a screen that does not update for that duration.

    Please check your TDR settings and follow Pooja’s advice and come back to us if you have new info…
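    For reference, a minimal sketch of checking the setting programmatically rather than in regedit; the GraphicsDrivers key and the TdrLevel value are the documented TDR registry settings, and if the value is absent the default (TDR enabled) applies:

        #include <windows.h>
        #include <cstdio>

        // Query the TDR level from the registry (link against Advapi32.lib).
        // TdrLevel 0 means TDR is off; 3 (the default) means recover on timeout.
        int main()
        {
            DWORD tdrLevel = 0;
            DWORD size = sizeof(tdrLevel);
            LONG rc = RegGetValueW(HKEY_LOCAL_MACHINE,
                                   L"SYSTEM\\CurrentControlSet\\Control\\GraphicsDrivers",
                                   L"TdrLevel", RRF_RT_REG_DWORD,
                                   nullptr, &tdrLevel, &size);

            if (rc == ERROR_SUCCESS)
                printf("TdrLevel = %lu\n", tdrLevel);
            else
                printf("TdrLevel not set; the default (TDR enabled) applies.\n");
            return 0;
        }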

    Cheers
    Daniel


    http://www.danielmoth.com/Blog/

    Friday, July 13, 2012 7:20 AM

All replies

  • Hi VladMi

    No, this is not normal behavior, and I have not experienced that.

    BTW, given that you can hear sound, it doesn't appear to be a system freeze; it's more a display rendering glitch where you can't see what is going on. I.e. if you move your mouse you cannot see it move, but when the screen rendering resumes you see that the mouse is indeed in a new location.

    When you execute a kernel on the GPU, e.g. via the parallel_for_each from C++ AMP, it should be a short-lived computation, e.g. milliseconds. In fact, Windows has a mechanism to protect you from inadvertently executing something for more than 2 seconds (the default, but configurable, value). This mechanism is called TDR (Timeout Detection and Recovery), and you can read more about it from the links here:
    http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/06/handling-tdrs-in-c-amp.aspx
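    As an aside, when a TDR does occur, the C++ AMP runtime surfaces it as a concurrency::accelerator_view_removed exception. A minimal sketch of catching it (the doubling kernel is just placeholder work, not your code):

        #include <amp.h>
        #include <iostream>
        using namespace concurrency;

        void double_all(array_view<float, 1> data)
        {
            try
            {
                parallel_for_each(data.extent, [=](index<1> idx) restrict(amp)
                {
                    data[idx] *= 2.0f; // placeholder work
                });
                data.synchronize(); // forces completion, so a TDR surfaces here
            }
            catch (const accelerator_view_removed& ex)
            {
                // The OS reset the display driver; this view is no longer usable.
                std::cout << "TDR occurred: " << ex.what() << std::endl;
            }
        }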

    Having said that, you are not complaining about screen flicker, or the driver resetting, or a runtime_exception, right? Since you are not reporting any of those, I conclude that you are not running into a TDR, and from that it follows that your C++ AMP code is not taking more than 2 seconds. That is at odds with your opening statement that your C++ AMP code takes more than a few seconds. Hmmm.

    So, can you please check whether TDR is enabled on your system, and can you share some code that shows how you measured that the parallel_for_each takes a few seconds?

    Also, how much data are you copying to the GPU? Again, a short but complete repro of your code will help us diagnose what is going wrong on your system.

    Finally, do you have another system with a DirectX 11 GPU to try this on so we can determine if this is a system-specific issue?

    Cheers
    Daniel


    http://www.danielmoth.com/Blog/

    Friday, June 29, 2012 12:59 AM
  • Hello Daniel, thank you for your prompt response.

    About the TDR: I'm not seeing any screen flicker, nor any info bubble about a driver reset or anything like that, so I would also think it is not getting triggered.

    I have tried to time different parts of my AMP code and got an interesting result: the screen freeze does not seem to come from the parallel_for_each part, as that takes at most 10 ms, but from exiting the scope containing the AMP code, so I'd think it's something related to a destructor of an array_view perhaps; this takes about 7 seconds.

    Here is a bit of code; the project is a genetic algorithm which I tried to run on AMP:

    	DWORD tStart = timeGetTime();
    	DWORD dt;
    
    	{
    		void *userData = p.rind[0][0].userData();
    		PData *pData = (PData*)userData;
    
    		array_view<float, 1> data_view(pData->dataCount, pData->data);
    
    		std::vector<float> geneArray;
    		GARealGenome* genome = (GARealGenome*)&p.individual(0);
    		int noGenes = genome->length();
    		int popSize = p.size();
    		int geneArrayLen = noGenes * popSize;
    		geneArray.reserve(geneArrayLen); geneArray.resize(geneArrayLen);
    
    		for( int i = 0; i < popSize; i++ )
    			for( int j = 0; j < noGenes; j++ )
    			{
    				genome = (GARealGenome*)&p.individual(i);
    				geneArray[j + noGenes * i] = genome->gene(j);
    			}
    
    		array_view<float, 1> genes_view( geneArrayLen, &geneArray[0] );
    
    		std::vector<float> scores;
    		scores.reserve(popSize); scores.resize(popSize);
    		array_view<float, 1> scores_view(popSize, &scores[0]);
    
    		dt = timeGetTime() - tStart;
    		printf("Dt1:%d \n", dt);
    
    		tStart = timeGetTime();
    
    		extent<1> e(popSize);
    		parallel_for_each( e, 
    				[=](index<1> idx) restrict(amp) {
    					scores_view[idx] = MaximizeObjective(genes_view, idx, data_view);	
    			}
    			);
    
    		dt = timeGetTime() - tStart;
    		printf("Dt2:%d \n", dt);
    					
    		tStart = timeGetTime();
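    		// exiting this scope destructs the array_views; pending results are
    		// synchronized back to the host at that point, which is likely what
    		// Dt3 below measures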
    	}
    
    	dt = timeGetTime() - tStart;
    	printf("Dt3:%d \n", dt);

    Dt3 shows about 7 secs each time. The data_view array is a bit large, about 1 million floats, but I also tried shrinking it to 10k and the time still stayed at about 7 secs.

    I am currently out of ideas as to what this could be; it is possible that I'm doing something really wrong. Any advice would be appreciated.

    Thank you and kind regards.


    Saturday, June 30, 2012 7:29 AM
  • VladMi, the timing code above does not wait for the computation and the data transfer to the host to complete. The default queuing mode is 'automatic' for the accelerator_view used by your kernel. This means the computation may or may not have completed when the Dt2 time is measured. It also means that synchronization of the data back to the host may not have completed by then, and instead occurred implicitly when the array_view was destructed. You can read more about queuing modes and array_view synchronization here:

        http://blogs.msdn.com/b/nativeconcurrency/archive/2011/11/23/understanding-accelerator-view-queuing-mode-in-c-amp.aspx

        http://blogs.msdn.com/b/nativeconcurrency/archive/2012/02/20/synchronizing-array-view-in-c-amp.aspx
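    For illustration, a minimal sketch of opting into the 'immediate' queuing mode on a view you create yourself, so commands are submitted to the device as soon as they are issued (the extent, data, and kernel below are placeholders):

        #include <amp.h>
        #include <vector>
        using namespace concurrency;

        int main()
        {
            // With the default 'automatic' mode the runtime batches commands, so
            // parallel_for_each may return before the device has started working.
            accelerator_view av = accelerator().create_view(queuing_mode_immediate);

            std::vector<float> v(1024, 1.0f);
            array_view<float, 1> data(1024, v);

            parallel_for_each(av, data.extent, [=](index<1> idx) restrict(amp)
            {
                data[idx] += 1.0f;
            });

            av.wait();          // block until all submitted commands complete
            data.synchronize(); // copy results back to v
            return 0;
        }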

    Here is one way to separate out the timing of computation and copying of the data:

    accelerator_view av = accelerator().default_view; // the view a bare parallel_for_each uses

    array_view<float, 1> genes_view( geneArrayLen, &geneArray[0] );

    std::vector<float> scores;
    scores.reserve(popSize); scores.resize(popSize);
    array_view<float, 1> scores_view(popSize, &scores[0]);

    // Avoid copying in data, since it looks like the initial data in scores_view
    // is not used inside the parallel_for_each.
    // Refer: blogs.msdn.com/b/nativeconcurrency/archive/2012/02/16/writeonly-becomes-discard-data-for-c-amp-array-view.aspx
    scores_view.discard_data();

    dt = timeGetTime() - tStart;
    printf("Dt1:%d \n", dt);

    tStart = timeGetTime();

    extent<1> e(popSize);
    parallel_for_each( e,
            [=](index<1> idx) restrict(amp) {
                scores_view[idx] = MaximizeObjective(genes_view, idx, data_view);
            }
        );

    // Time for the computation to complete. This includes the copy-in time for
    // genes_view. You can further separate out the time to copy in data by
    // using a concurrency::array.
    av.wait();
    dt = timeGetTime() - tStart;
    printf("Dt2:%d \n", dt);

    tStart = timeGetTime();

    // Time to copy the data back to the host
    scores_view.synchronize();
    dt = timeGetTime() - tStart;
    printf("Dt2_synchronize:%d \n", dt);

    You can also use the Concurrency Visualizer to get an idea of where the time is spent. (Refer to http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/09/analyzing-c-amp-code-with-the-concurrency-visualizer.aspx.)

    You mentioned that changing the size of the data doesn't affect the performance of your app. This probably means that the computation is expensive, but it likely takes less than 2 seconds since you don't see a TDR. After the new measurements, where do you see the 7 seconds being spent? Could you also wrap your code in a try/catch to see whether you get any exception?


    Pooja Nagpal



    Tuesday, July 3, 2012 2:32 AM