Compute Shader programs outperform AMP programs?
-
Wednesday, April 11, 2012 13:50
Hi,
Recently, I hit some limitations when working with AMP (limitations of my knowledge of AMP, no doubt), so I looked at some Compute Shader code to see how it is done there. Digging further, I found that, on my PC:
- The Compute Shader N-Body simulation outperforms the AMP N-Body simulation by about 30%.
- Basic vector addition is about equally fast in both technologies.
- The Compute Shader version of matrix multiplication from the “C++ AMP for the DirectCompute Programmer” guide runs about three times as fast as the AMP analogue.
Details of my experiments can be found on my blog. Now, of course, I’m confused. Have I done something wrong? Why would you build an N-body simulation that does not at least match the performance of the technology it is supposed to replace (or outgrow, or open up to a larger community)?
Do you have an example application with performance that compares favorably to a Compute Shader program?
AMP is built on DirectCompute (as far as GPU computing is concerned, I suppose). Can it surpass Compute Shader performance?
Regards,
Marc.
All replies
-
Wednesday, April 11, 2012 21:12 Owner
Hi Marc
We are not seeing the results that you have claimed – we are seeing comparable performance between DirectCompute and C++ AMP. Let's explore this with you here in this forum, and then you can update your blog post later.
Let’s break this down by tackling one workload at a time. When we are done we can move to the next.
For the matrix multiply, for the 4608 size (for both approaches, obviously):
- What perf results are you getting with C++ AMP and what with HLSL?
- Can you share your source code including how you measure the performance? This is critical, and where most folks make mistakes.
- Also can you make sure you are building with optimizations (please share your command line)? Ideally you’d be able to share the Visual Studio project.
- Are you measuring this on a discrete card without a display to ensure there is no contention with the GPU for other activities for both cases?
- What card(s) are you measuring it on?
- Can you check that you have the latest driver for the card? (a key point that I always forget myself on new machines)
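A minimal sketch of a measurement harness in the spirit of the checklist above, assuming that one-time start-up costs (runtime initialization, shader or kernel compilation) should be kept out of the reported figure. The names `time_runs_ms` and `median_ms` are hypothetical, and portable `std::chrono` is used here purely for illustration; whether the first call really carries extra cost in a given program is something to verify, not a given.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper (not from this thread): runs `work` once untimed as a
// warm-up, then times `iterations` further runs and returns each time in ms.
std::vector<double> time_runs_ms(const std::function<void()>& work, int iterations)
{
    using clock = std::chrono::steady_clock;

    work();  // warm-up run: absorbs one-time costs, deliberately not timed

    std::vector<double> times;
    times.reserve(static_cast<std::size_t>(iterations));
    for (int i = 0; i < iterations; ++i) {
        const auto start = clock::now();
        work();  // must not return before the results are back on the host
        const auto stop = clock::now();
        times.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return times;
}

// Median of the measured times; less sensitive to one slow run than the mean.
double median_ms(std::vector<double> times)
{
    std::sort(times.begin(), times.end());
    const std::size_t n = times.size();
    if (n == 0) return 0.0;
    return (n % 2 == 1) ? times[n / 2]
                        : 0.5 * (times[n / 2 - 1] + times[n / 2]);
}
```

Passing the same callable for both the C++ AMP and the HLSL variant keeps the measured region identical, which is the point of the questions above.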
Cheers
Daniel
http://www.danielmoth.com/Blog/
- Edited by DanielMoth (Microsoft Employee, Owner) Wednesday, April 11, 2012 22:15
-
Thursday, April 12, 2012 09:00
Hi Daniel,
Thank you for your reply. Below you’ll find the answers first, then a few screenshots. The code is essentially a cut and paste from the “C++ AMP for the DirectCompute Programmer” guide. You can, of course, have the complete VS projects. Is there a way to transfer a zip (25 KB)?
Answers to the matrix multiplication questions, for size 4608:
- Performance results for the AMP program: 11,750.6 ms, 8.3 gFLOPS, the size cannot be raised. Performance for the CS program: 3,031.62 ms, 32.3 gFLOPS, the size can be raised to 7616. In both cases the time result is an average over 10 calls of the “mm” function (the mm function is in the guide).
- Yes, I can share the code; except for some minor adaptations it’s all Microsoft code. Performance (time) is measured by timing the execution of the “mm” function. The task is therefore equal for both programs, since the AMP program is a rewrite of the Compute Shader program (according to the guide). The timing code is from the Parallel Computing in C++ and Native Code blog.
- Both programs are built with /O2 optimization – the default for release builds. I also ran a test with /Ox, but got similar results. It’s no trouble to share the VS project, including code and all. How can the package be delivered?
- Measurements are for a discrete card. A display is attached, so there may be contention with the GPU. This then holds for both programs. The results do not show large variance, though. The screen shots below illustrate this.
- The card is a Club 3D Radeon HD 5750 Noiseless Edition.
- The latest compatible driver has been installed – not all driver updates are compatible. I will make an effort to update the drivers.
Screenshot of the AMP program results:
Screenshot of the Compute Shader program results:
On the other hand: the projects are standard, the code is finite. So, to reproduce the software:
In Windows 7 create two solutions (one for the AMP version, one for the CS version) for Win32 console applications in VS11 beta, with the DirectX SDK June 2010 installed. Code for the timer, in both solutions, can be found at the end.
AMP Code:
#include "stdafx.h"
#include <amp.h>
#include <iostream>
#include "timer.h"

using namespace concurrency;

Timer tCompute;
double cumElapsed;
const int ARRAY_SIZE = 4608;
const int NrIterations = 10;

float A[ARRAY_SIZE * ARRAY_SIZE];
float B[ARRAY_SIZE * ARRAY_SIZE];
float C[ARRAY_SIZE * ARRAY_SIZE];

void mm(const float * A, const float * B, float * C, int size);

int _tmain(int argc, _TCHAR* argv[])
{
    int k = 0;
    for (int i=0; i<ARRAY_SIZE * ARRAY_SIZE; ++i)
    {
        A[i] = (i%2==0) ? 1.0f : 0.0f;
        if (k==0 || i%k == 0)
        {
            B[i] = 1.0f;
            k += ARRAY_SIZE + 1;
        }
    }

    std::cout << "Data size (Kb): " << (3 * ARRAY_SIZE * ARRAY_SIZE * 4) / 1024 << "\n";

    for (int i=0; i<NrIterations; ++i)
    {
        tCompute.Start();
        mm(A, B, C, ARRAY_SIZE);
        tCompute.Stop();
        cumElapsed += tCompute.Elapsed();
        std::cout << "Iteration " << i << ": " << tCompute.Elapsed() << " ms\n";
    }
    std::cout << "done\n";

    double avElapsed = cumElapsed / ((float) NrIterations);
    std::cout << "Average time per iteration: " << avElapsed << "\n";

    double gflops = ((ARRAY_SIZE / 1000.0) * (ARRAY_SIZE / 1000.0)) * (ARRAY_SIZE / avElapsed);
    std::cout << "Average gFLOPS: " << gflops << "\n";

    char c;
    std::cin >> c;
    return 0;
}

void mm(const float * A, const float * B, float * C, int size)
{
    array_view<const float, 2> d_A(size, size, A);
    array_view<const float, 2> d_B(size, size, B);
    array_view<float, 2> d_C(size, size, C);
    d_C.discard_data();

    parallel_for_each(d_C.extent.tile<16, 16>(),
        [=] (tiled_index<16, 16> t_idx) restrict(amp)
    {
        int row = t_idx.local[0];
        int col = t_idx.local[1];
        tile_static float local_a[16][16];
        tile_static float local_b[16][16];
        float sum = 0.0f;
        for (int i = 0; i < size; i += 16)
        {
            local_a[row][col] = d_A(t_idx.global[0], i + col);
            local_b[row][col] = d_B(i + row, t_idx.global[1]);
            t_idx.barrier.wait();
            for (int k = 0; k < 16; k++)
            {
                sum += local_a[row][k] * local_b[k][col];
            }
            t_idx.barrier.wait();
        }
        d_C[t_idx.global] = sum;
    });
}

Compute Shader code
C++:
#include "stdafx.h"
#include <d3d11.h>
#include <d3dcompiler.h>
#include <d3dx11.h>
#include <iostream>
#include "timer.h"

Timer tCompute;
double cumElapsed;
const int ARRAY_SIZE = 4608; // up to 7616
const int NrIterations = 10;

float A[ARRAY_SIZE * ARRAY_SIZE];
float B[ARRAY_SIZE * ARRAY_SIZE];
float C[ARRAY_SIZE * ARRAY_SIZE];

#ifndef SAFE_RELEASE
#define SAFE_RELEASE(p) { if (p) { (p)->Release(); (p)=NULL; } }
#endif

HRESULT CreateComputeShader( LPCWSTR pSrcFile, LPCSTR pFunctionName,
                             ID3D11Device* pDevice, ID3D11ComputeShader** ppShaderOut );
void mm(const float * A, const float * B, float * C, int size);

int _tmain(int argc, _TCHAR* argv[])
{
    int k = 0;
    for (int i=0; i<ARRAY_SIZE * ARRAY_SIZE; ++i)
    {
        A[i] = (i%2==0) ? 1.0f : 0.0f;
        if (k==0 || i%k == 0)
        {
            B[i] = 1.0f;
            k += ARRAY_SIZE + 1;
        }
    }

    int dataSize = (3 * ARRAY_SIZE * ARRAY_SIZE * 4) / 1024;
    std::cout << "Data size (Kb): " << dataSize << "\n";

    for (int i=0; i<NrIterations; ++i)
    {
        tCompute.Start();
        mm(A, B, C, ARRAY_SIZE);
        tCompute.Stop();
        cumElapsed += tCompute.Elapsed();
        std::cout << "Iteration " << i << ": " << tCompute.Elapsed() << " ms\n";
    }
    std::cout << "done\n";

    double avElapsed = cumElapsed / ((float) NrIterations);
    std::cout << "Average time per iteration: " << avElapsed << "\n";

    double gflops = ARRAY_SIZE * ((ARRAY_SIZE * ARRAY_SIZE) / (1000000.0 * avElapsed));
    std::cout << "Average gFLOPS: " << gflops << "\n";

    char c;
    std::cin >> c;
    return 0;
}

void mm(const float * A, const float * B, float * C, int size)
{
    HRESULT hr;
    ID3D11Device *device;
    const D3D_FEATURE_LEVEL featureLevels[] = { D3D_FEATURE_LEVEL_11_0 };
    ID3D11DeviceContext *deviceContext;
    D3D_FEATURE_LEVEL featureLevel;
    hr = D3D11CreateDevice(NULL, D3D_DRIVER_TYPE_HARDWARE, NULL,
                           D3D11_CREATE_DEVICE_SINGLETHREADED, featureLevels, 1,
                           D3D11_SDK_VERSION, &device, &featureLevel, &deviceContext );

    ID3D11ComputeShader *shader;
    CreateComputeShader( L"mm.hlsl", "mm", device, &shader );
    //hr = device->CreateComputeShader(binary, binarySize, NULL, &shader);

    // Buffer and view descriptions shared by A, B and C
    D3D11_BUFFER_DESC desc;
    ZeroMemory( &desc, sizeof(desc) );
    desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;
    desc.ByteWidth = sizeof(float) * size * size;
    desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
    desc.StructureByteStride = sizeof(float);

    D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc;
    ZeroMemory(&srvDesc, sizeof(srvDesc));
    srvDesc.ViewDimension = D3D11_SRV_DIMENSION_BUFFEREX;
    srvDesc.Format = DXGI_FORMAT_UNKNOWN;
    srvDesc.BufferEx.NumElements = size * size;

    D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc;
    ZeroMemory(&uavDesc, sizeof(uavDesc));
    uavDesc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
    uavDesc.Format = DXGI_FORMAT_UNKNOWN;
    uavDesc.Buffer.NumElements = size * size;

    D3D11_SUBRESOURCE_DATA InitData;

    ID3D11Buffer *d_A;
    InitData.pSysMem = A;
    hr = device->CreateBuffer(&desc, &InitData, &d_A);
    ID3D11ShaderResourceView *d_A_SRV;
    hr = device->CreateShaderResourceView(d_A, &srvDesc, &d_A_SRV);

    ID3D11Buffer *d_B;
    InitData.pSysMem = B;
    hr = device->CreateBuffer(&desc, &InitData, &d_B);
    ID3D11ShaderResourceView *d_B_SRV;
    hr = device->CreateShaderResourceView(d_B, &srvDesc, &d_B_SRV);

    ID3D11Buffer *d_C;
    hr = device->CreateBuffer(&desc, NULL, &d_C);
    ID3D11UnorderedAccessView *d_C_UAV;
    hr = device->CreateUnorderedAccessView(d_C, &uavDesc, &d_C_UAV);

    // Constant buffer carrying the matrix size
    struct ConstantBufferStruct { int size, padding[3]; };
    ZeroMemory(&desc, sizeof(desc));
    desc.ByteWidth = sizeof(ConstantBufferStruct);
    desc.Usage = D3D11_USAGE_DEFAULT;
    desc.BindFlags = D3D11_BIND_CONSTANT_BUFFER;
    ID3D11Buffer *constantBuffer;
    hr = device->CreateBuffer(&desc, NULL, &constantBuffer);
    ConstantBufferStruct constantValues = { size };
    deviceContext->UpdateSubresource(constantBuffer, 0, NULL, &constantValues, 0, 0);

    deviceContext->CSSetConstantBuffers(0, 1, &constantBuffer);
    ID3D11UnorderedAccessView* rw_views[1] = { d_C_UAV };
    deviceContext->CSSetUnorderedAccessViews(0, 1, rw_views, NULL);
    ID3D11ShaderResourceView* ro_views[2] = { d_A_SRV, d_B_SRV };
    deviceContext->CSSetShaderResources(0, 2, ro_views);
    deviceContext->CSSetShader(shader, NULL, 0);

    // Staging buffer for reading the result back on the CPU
    d_C->GetDesc( &desc );
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
    desc.Usage = D3D11_USAGE_STAGING;
    desc.BindFlags = 0;
    desc.MiscFlags = 0;
    ID3D11Buffer *readBackBuffer;
    hr = device->CreateBuffer(&desc, NULL, &readBackBuffer);
    D3D11_MAPPED_SUBRESOURCE MappedResource = {0};

    // Note: as posted, this copy is issued before Dispatch, so the staging
    // buffer receives d_C as it was before this dispatch ran.
    deviceContext->CopyResource(readBackBuffer, d_C );
    deviceContext->Dispatch(size/16, size/16, 1);

    deviceContext->Map(readBackBuffer, 0, D3D11_MAP_READ, 0, &MappedResource);
    memcpy(C, MappedResource.pData, size * size * sizeof(float));
    deviceContext->Unmap(readBackBuffer, 0);

    readBackBuffer->Release();
    constantBuffer->Release();
    d_A->Release();
    d_A_SRV->Release();
    d_B->Release();
    d_B_SRV->Release();
    d_C->Release();
    d_C_UAV->Release();
    shader->Release();
    deviceContext->Release();
    device->Release();
}

//--------------------------------------------------------------------------------------
// Tries to find the location of the shader file
// This is a trimmed down version of DXUTFindDXSDKMediaFileCch. It only addresses the
// following issue to allow the sample correctly run from within Sample Browser directly
//
// When running the sample from the Sample Browser directly, the executables are located
// in $(DXSDK_DIR)\Samples\C++\Direct3D11\Bin\x86 or x64\, however the shader file is
// in the sample's own dir
//--------------------------------------------------------------------------------------
HRESULT FindDXSDKShaderFileCch( __in_ecount(cchDest) WCHAR* strDestPath,
                                int cchDest,
                                __in LPCWSTR strFilename )
{
    if( NULL == strFilename || strFilename[0] == 0 || NULL == strDestPath || cchDest < 10 )
        return E_INVALIDARG;

    // Get the exe name, and exe path
    WCHAR strExePath[MAX_PATH] = { 0 };
    WCHAR strExeName[MAX_PATH] = { 0 };
    WCHAR* strLastSlash = NULL;
    GetModuleFileName( NULL, strExePath, MAX_PATH );
    strExePath[MAX_PATH - 1] = 0;
    strLastSlash = wcsrchr( strExePath, TEXT( '\\' ) );
    if( strLastSlash )
    {
        wcscpy_s( strExeName, MAX_PATH, &strLastSlash[1] );

        // Chop the exe name from the exe path
        *strLastSlash = 0;

        // Chop the .exe from the exe name
        strLastSlash = wcsrchr( strExeName, TEXT( '.' ) );
        if( strLastSlash )
            *strLastSlash = 0;
    }

    // Search in directories:
    //      .\
    //      %EXE_DIR%\..\..\%EXE_NAME%

    wcscpy_s( strDestPath, cchDest, strFilename );
    if( GetFileAttributes( strDestPath ) != 0xFFFFFFFF )
        return true;

    swprintf_s( strDestPath, cchDest, L"%s\\..\\..\\%s\\%s", strExePath, strExeName, strFilename );
    if( GetFileAttributes( strDestPath ) != 0xFFFFFFFF )
        return true;

    // On failure, return the file as the path but also return an error code
    wcscpy_s( strDestPath, cchDest, strFilename );

    return E_FAIL;
}

HRESULT CreateComputeShader( LPCWSTR pSrcFile, LPCSTR pFunctionName,
                             ID3D11Device* pDevice, ID3D11ComputeShader** ppShaderOut )
{
    HRESULT hr;

    // Finds the correct path for the shader file.
    // This is only required for this sample to be run correctly from within the Sample Browser,
    // in your own projects, these lines could be removed safely
    WCHAR str[MAX_PATH];
    hr = FindDXSDKShaderFileCch( str, MAX_PATH, pSrcFile );
    if ( FAILED(hr) )
        return hr;

    DWORD dwShaderFlags = D3DCOMPILE_ENABLE_STRICTNESS;
#if defined( DEBUG ) || defined( _DEBUG )
    // Set the D3DCOMPILE_DEBUG flag to embed debug information in the shaders.
    // Setting this flag improves the shader debugging experience, but still allows
    // the shaders to be optimized and to run exactly the way they will run in
    // the release configuration of this program.
    dwShaderFlags |= D3DCOMPILE_DEBUG;
#endif

    const D3D_SHADER_MACRO defines[] =
    {
#ifdef USE_STRUCTURED_BUFFERS
        "USE_STRUCTURED_BUFFERS", "1",
#endif

#ifdef TEST_DOUBLE
        "TEST_DOUBLE", "1",
#endif
        NULL, NULL
    };

    // We generally prefer to use the higher CS shader profile when possible as CS 5.0 is better performance on 11-class hardware
    LPCSTR pProfile = ( pDevice->GetFeatureLevel() >= D3D_FEATURE_LEVEL_11_0 ) ? "cs_5_0" : "cs_4_0";

    ID3DBlob* pErrorBlob = NULL;
    ID3DBlob* pBlob = NULL;
    hr = D3DX11CompileFromFile( str, defines, NULL, pFunctionName, pProfile,
                                dwShaderFlags, NULL, NULL, &pBlob, &pErrorBlob, NULL );
    if ( FAILED(hr) )
    {
        if ( pErrorBlob )
            OutputDebugStringA( (char*)pErrorBlob->GetBufferPointer() );

        SAFE_RELEASE( pErrorBlob );
        SAFE_RELEASE( pBlob );

        return hr;
    }

    hr = pDevice->CreateComputeShader( pBlob->GetBufferPointer(), pBlob->GetBufferSize(), NULL, ppShaderOut );

#if defined(DEBUG) || defined(PROFILE)
    if ( *ppShaderOut )
        (*ppShaderOut)->SetPrivateData( WKPDID_D3DDebugObjectName, lstrlenA(pFunctionName), pFunctionName );
#endif

    SAFE_RELEASE( pErrorBlob );
    SAFE_RELEASE( pBlob );

    return hr;
}

HLSL code:
cbuffer CB : register(b0)
{
    int size;
};

StructuredBuffer<float> d_A : register(t0);
StructuredBuffer<float> d_B : register(t1);
RWStructuredBuffer<float> d_C : register(u0);

groupshared float local_a[16][16];
groupshared float local_b[16][16];

[numthreads(16, 16, 1)]
void mm(uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID)
{
    int row = GTid.y;
    int col = GTid.x;
    float sum = 0.0f;
    for (int i = 0; i < size; i += 16)
    {
        local_a[row][col] = d_A[DTid.y * size + i + col];
        local_b[row][col] = d_B[(i + row) * size + DTid.x];
        AllMemoryBarrierWithGroupSync();
        for (int k = 0; k < 16; k++)
        {
            sum += local_a[row][k] * local_b[k][col];
        }
        AllMemoryBarrierWithGroupSync();
    }
    d_C[DTid.y * size + DTid.x] = sum;
}

Timer.cpp (timer.cpp and timer.h should be in both solutions)
#include "stdafx.h"
#include "timer.h"

// Initialize the resolution of the timer
LARGE_INTEGER Timer::m_freq = \
    (QueryPerformanceFrequency(&Timer::m_freq), Timer::m_freq);

// Calculate the overhead of the timer
LONGLONG Timer::m_overhead = Timer::GetOverhead();

Timer.h
#pragma once
#include <windows.h>

struct Timer
{
    void Start()
    {
        QueryPerformanceCounter(&m_start);
    }

    void Stop()
    {
        QueryPerformanceCounter(&m_stop);
    }

    // Returns elapsed time in milliseconds (ms)
    double Elapsed()
    {
        return (m_stop.QuadPart - m_start.QuadPart - m_overhead) \
            * 1000.0 / m_freq.QuadPart;
    }

private:

    // Returns the overhead of the timer in ticks
    static LONGLONG GetOverhead()
    {
        Timer t;
        t.Start();
        t.Stop();
        return t.m_stop.QuadPart - t.m_start.QuadPart;
    }

    LARGE_INTEGER m_start;
    LARGE_INTEGER m_stop;
    static LARGE_INTEGER m_freq;
    static LONGLONG m_overhead;
};

Then compile the solution (release configuration) and run without debugging.
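Both kernels in this post use the same 16×16 tiling: each tile of the output walks the shared dimension in steps of 16, stages one tile of A and one tile of B in fast local storage, and accumulates partial sums. The following is a CPU-only sketch of that scheme for checking results, not part of either posted program; `mm_tiled_cpu` and `fill_inputs_as_posted` are hypothetical names. It relies on an observation about the posted initialization loop: the loop sets B to the identity matrix, so a correct multiply must return C equal to A. The sketch assumes `size` is a multiple of 16.

```cpp
#include <cstddef>
#include <vector>

const int TILE = 16;  // tile edge, matching the 16x16 tiles in the kernels

// Same initialization as the posted programs: A alternates 1, 0, 1, 0, ...
// and B receives a 1 at indices 0, size+1, 2*(size+1), ... (the diagonal).
void fill_inputs_as_posted(std::vector<float>& A, std::vector<float>& B, int size)
{
    int k = 0;
    for (int i = 0; i < size * size; ++i) {
        A[i] = (i % 2 == 0) ? 1.0f : 0.0f;
        B[i] = 0.0f;
        if (k == 0 || i % k == 0) {
            B[i] = 1.0f;
            k += size + 1;
        }
    }
}

// CPU reference of the tiled multiply. The two inner phases correspond to
// the code before and after the first barrier in the kernels.
void mm_tiled_cpu(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int size)
{
    float local_a[TILE][TILE];
    float local_b[TILE][TILE];
    float sum[TILE][TILE];

    for (int tr = 0; tr < size; tr += TILE) {        // tile row origin
        for (int tc = 0; tc < size; tc += TILE) {    // tile column origin
            for (int r = 0; r < TILE; ++r)
                for (int c = 0; c < TILE; ++c)
                    sum[r][c] = 0.0f;

            for (int i = 0; i < size; i += TILE) {
                // Load phase: stage one tile of A and one tile of B.
                for (int r = 0; r < TILE; ++r)
                    for (int c = 0; c < TILE; ++c) {
                        local_a[r][c] = A[(tr + r) * size + i + c];
                        local_b[r][c] = B[(i + r) * size + tc + c];
                    }
                // Compute phase: accumulate the partial inner products.
                for (int r = 0; r < TILE; ++r)
                    for (int c = 0; c < TILE; ++c)
                        for (int k = 0; k < TILE; ++k)
                            sum[r][c] += local_a[r][k] * local_b[k][c];
            }

            for (int r = 0; r < TILE; ++r)
                for (int c = 0; c < TILE; ++c)
                    C[(tr + r) * size + tc + c] = sum[r][c];
        }
    }
}
```

Comparing the GPU result against A after each run is a cheap way to confirm that the timed call actually produced the product.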
Compiler commandline:
/analyze- /O2 /Gd /nologo /MD /Gm- /Yu"stdafx.h" /GL /Fa"Release\" /Oi /Oy- /Zc:forScope /Fo"Release\" /Gy /Fp"Release\Guide_CS_MM.pch" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /WX- /errorReport:queue /GS /Fd"Release\vc110.pdb" /fp:precise /W3 /Zi /Zc:wchar_t /EHsc
Linker commandline:
/NXCOMPAT /MANIFEST /OPT:REF /SUBSYSTEM:CONSOLE /TLBID:1 /NOLOGO /INCREMENTAL:NO /ManifestFile:"Release\Guide_CS_MM.exe.intermediate.manifest" /DYNAMICBASE /MACHINE:X86 /PDB:"C:\visual studio 11\Projects\Guide_CS_MM\Release\Guide_CS_MM.pdb" /DEBUG /ERRORREPORT:QUEUE /MANIFESTUAC:"level='asInvoker' uiAccess='false'" "d3d11.lib" "d3dcompiler.lib" "d3dx11.lib" "d3dx9.lib" "dxerr.lib" "dxguid.lib" "winmm.lib" "comctl32.lib" /OPT:ICF /SAFESEH /PGD:"C:\visual studio 11\Projects\Guide_CS_MM\Release\Guide_CS_MM.pgd" /LTCG /OUT:"C:\visual studio 11\Projects\Guide_CS_MM\Release\Guide_CS_MM.exe"
Regards,
Marc.
- Edited by OlManMarc Thursday, April 12, 2012 11:53 Source code and command lines added
-
Thursday, April 12, 2012 23:57 Owner
Hi Marc
Thank you for sharing your code.
We tried your code on an NVIDIA GTX580, an ATI 5870, and an ATI 5770. There is no performance difference between the HLSL and the C++ AMP code.
One thing that *is* different is that you are compiling with /Zi. Can you please remove that from the command line (or project properties) and try to measure again?
The other thing that *may be* different is the driver. You did not share your driver version, but the latest that we are using from ATI is 8.951.0.0. Can you please ensure you are using that latest driver and if not try to install it and try again?
I am looking forward to seeing your results after the above adjustments, thanks for trying.
Cheers
Daniel
http://www.danielmoth.com/Blog/
- Edited by DanielMoth (Microsoft Employee, Owner) Friday, April 13, 2012 16:18
-
Sunday, April 15, 2012 15:38
Hi Daniel,
Thank you very much for your information. Removing the /Zi flag yields a small improvement for the AMP version, but the updated driver (8.951.0.0) really changed everything. This message reviews the consequences for all three programs.
Program                          AMP          CS
Guide
  Average time (ms, 10 it.)      2,650        2,995
  gFLOPS                         36.9         32.7
  Max. Data Load (Kb)            714,432      691,200
Vector Addition
  Average time (ms, 10 it.)      6,017        8,155
  gFLOPS                         0.03         0.02
  Max. Data Load (Kb)            1,781,248    2,039,056
N-Body Simulation
  Number of Particles            16,128       16,128
  Frame rate                     44.4         63.4
  gFLOPS                         229          329
Notes:
1. Programs from the Guide.
As you can see in the table (timing for ARRAY_SIZE = 4608), the AMP version is now much faster; it is even a bit faster than the CS version. Moreover, the limit on the maximum data load has greatly improved; the AMP program can handle an ARRAY_SIZE of 7808, the CS program an ARRAY_SIZE of 7680 elements.
2. Vector Addition
Timing has been done for an array size of 76,000,000 elements. Timing runs from creating the views up to and including copying the results back to CPU memory, and is averaged over 10 iterations. The AMP program has become faster, and the Compute Shader program is about the same (I adapted the CS program to also copy the results back to CPU memory – it didn’t do that). There is also still the difference in maximum data load (AMP: 76,000,000 vs. CS: 87,000,000).
At this point I propose to drop this pair of programs from the discussion. They do both represent basic vector addition, but the implementations are unrelated.
3. N-Body Simulation
The optimum is still reached by both programs at 16,128 particles. The AMP version shows a small improvement, but the Compute Shader version shows a significant improvement. The AMP frame rate is now about 30% lower than that of the Compute Shader version (44.4 vs. 63.4).
In this case the implementations of the programs are related: the AMP version (tiling, 1GPU) implements the Gravity shader in C++ code. You stated that these programs perform comparably on your PCs. Can you suggest any action so that I can reproduce that performance on my PC?
- Edited by OlManMarc Sunday, April 15, 2012 15:39
-
Sunday, April 15, 2012 19:43 Owner
Hi Marc
Thanks for confirming. So, two out of your three experiments support our generic claim: C++ AMP has comparable performance to DirectCompute.
I hope you can update your blog post accordingly.
FYI, the /Zi difference will not be there in our post-Beta release: RC. In fact, RC has performance improvements across the board. Also FYI, you will hear us talk more about performance at RTM, when the v1 product is done, rather than prematurely at this stage. C++ AMP has comparable performance with all other GPU programming models, according to our measurements.
Now, as for NBODY, I’ll get back to you when the engineer who is assigned to it is back in the office. We have a few variants of nbody floating around, and we are not sure which one *really* corresponds to the directcompute code since we have been evolving both the host and the compute code from the first time it was ported last year (by an engineer who is no longer on our team no less). So please be patient on the nbody front for a complete response.
Cheers
Daniel
http://www.danielmoth.com/Blog/
-
Monday, April 16, 2012 09:35
Hi Daniel,
I've updated the blog post, and it now clearly states that AMP has comparable performance to Compute Shader programs.
I will, nonetheless, remain very interested in the N-Body simulation case, since the Compute Shader version is clearly strongly optimized. Do you agree to leaving this thread open until more information about the N-Body simulation is available?
Will you publish documentation that shows how C++ AMP code relates to optimized compute shaders?
Regards,
Marc.
-
Monday, April 16, 2012 12:13 Owner
Hi Marc
Thanks for updating your post. Latest driver and omitting /Zi (for Beta) are the answer to the perf discrepancy you observed. So that answers part of your questions.
You are absolutely right, we do have the nbody perf discrepancy still open, thank you again for reporting it. I want to find out what is going on there as much as you do, if not more – stay tuned on this thread.
Cheers
Daniel
http://www.danielmoth.com/Blog/
-
Tuesday, April 17, 2012 02:16 Owner
Hi Marc
I have good news on the NBODY front too.
- The code for C++ AMP and for DirectCompute are NOT the same in the NBODY sample that you have. Apologies we claimed that they were, but they are not (if the original engineer was around, I would query them about it ;-). For example, in your downloaded code, compare the tile size for one with the other; compare the usage of float_3 instead of float_4; compare the usage of float_4 instead of some other custom structure; and so on.
- Good news. Internally, we have made the two versions resemble each other much more closely, and now we are seeing comparable performance here (when measuring the same number of bodies in both, with the same resolution, foreground app, etc.).
- We will update the public sample on the blog in the next few days, and then you’ll be able to observe it for yourself (as well as diff the changes we made).
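To make one of the differences listed above concrete, the sketch below compares the memory layout of a three-component and a four-component float element. The structs are plain stand-ins with hypothetical names, not the C++ AMP short-vector types, and the thread does not establish how much of the N-body gap came from this particular difference; the sketch only shows that the two choices lead to different element sizes and boundaries.

```cpp
#include <cstddef>

// Plain stand-ins for three- and four-component float elements
// (hypothetical; not the C++ AMP float_3 / float_4 types).
struct Float3 { float x, y, z; };
struct alignas(16) Float4 { float x, y, z, w; };

static_assert(sizeof(Float3) == 12, "three packed floats occupy 12 bytes");
static_assert(sizeof(Float4) == 16, "four floats occupy 16 bytes");

// Byte offset of element i in a tightly packed array of T.
template <typename T>
constexpr std::size_t element_offset(std::size_t i)
{
    return i * sizeof(T);
}

// True if element i of a packed array of T starts on a 16-byte boundary.
template <typename T>
constexpr bool starts_on_16_byte_boundary(std::size_t i)
{
    return element_offset<T>(i) % 16 == 0;
}
```

Every `Float4` element starts on a 16-byte boundary, while only every fourth `Float3` element does, so two ports that differ in this choice are not accessing memory in the same way.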
Feel free to resolve this thread and update your blog post when you get your hands on our latest (step 3 above).
Thank you very much for reporting this, so we can make our C++ AMP nbody sample even faster by more accurately porting the code. Really appreciate it.
Cheers
Daniel
http://www.danielmoth.com/Blog/
- Proposed as answer by Zhu, Weirong Saturday, April 21, 2012 22:43
- Marked as answer by OlManMarc Monday, April 23, 2012 06:09
-
Monday, April 23, 2012 06:09
Hi Daniel,
The updated AMP NBody sample indeed has a performance that is comparable with the compute shader version. A quick comparison of the code (old - new NBody and the original) shows (to me) a clean port that draws a clear line between AMP code and DirectX code.
Really convincing. I'm impressed how such subtle changes can make such a big difference.
Regards,
Marc.

