How can I improve performance of Direct3D when I'm writing to a single vertex buffer thousands of times per frame?

    Question

  • I am trying to write an OpenGL wrapper that will allow me to use all of my existing graphics code (written for OpenGL) and will route the OpenGL calls to Direct3D equivalents. This has worked surprisingly well so far, except performance is turning out to be quite a problem.

    Now, I admit I am most likely using D3D in a way it was never designed to be used. I am updating a single vertex buffer thousands of times per render loop. Every time I draw a "sprite", I send four vertices to the GPU along with texture coordinates, etc., and once the number of "sprites" on screen at one time reaches around 1,000 to 1,500, the frame rate of my app drops below 10 FPS.

    Using the VS2012 Performance Analysis (which is awesome, btw), I can see that the ID3D11DeviceContext->Draw method is taking up the bulk of the time: Screenshot Here

    Is there some setting I'm not using correctly while setting up my vertex buffer, or during the draw call? Is it really, really bad to use the same vertex buffer for all of my sprites? If so, what other options do I have that wouldn't drastically alter the architecture of my existing graphics code base (which is built around the OpenGL paradigm: send EVERYTHING to the GPU every frame)?

    The biggest FPS killer in my game is displaying a lot of text on the screen. Each character is a textured quad, and each one requires a separate update to the vertex buffer and a separate call to Draw. If D3D or the hardware doesn't like many calls to Draw, how else can you draw a lot of text to the screen at one time?

    Let me know if there is any more code you'd like to see to help me diagnose this problem.

    Thanks!

    Here's the hardware I'm running on:

    • Core i7 @ 3.5GHz
    • 16 gigs of RAM
    • GeForce GTX 560 Ti

    And here's the software I'm running:

    • Windows 8 Release Preview
    • VS 2012
    • DirectX 11

    Here is the draw method:

    void OpenGL::Draw(const std::vector<OpenGLVertex>& vertices)
    {
       auto matrix = *_matrices.top();
       _constantBufferData.view = DirectX::XMMatrixTranspose(matrix);

       _context->UpdateSubresource(_constantBuffer, 0, NULL, &_constantBufferData, 0, 0);
       _context->IASetInputLayout(_inputLayout);
       _context->VSSetShader(_vertexShader, nullptr, 0);
       _context->VSSetConstantBuffers(0, 1, &_constantBuffer);

       D3D11_PRIMITIVE_TOPOLOGY topology = D3D11_PRIMITIVE_TOPOLOGY_TRIANGLESTRIP;

       // Set shader texture resource in the pixel shader.
       ID3D11ShaderResourceView* texture = _textures[_currentTextureId];
       _context->PSSetShader(_pixelShaderTexture, nullptr, 0);
       _context->PSSetShaderResources(0, 1, &texture);

       D3D11_MAPPED_SUBRESOURCE mappedResource;
       D3D11_MAP mapType = D3D11_MAP::D3D11_MAP_WRITE_DISCARD;
       auto hr = _context->Map(_vertexBuffer, 0, mapType, 0, &mappedResource);
       if (SUCCEEDED(hr))
       {
          OpenGLVertex* pData = reinterpret_cast<OpenGLVertex*>(mappedResource.pData);
          memcpy(&(pData[_currentVertex]), &vertices[0], sizeof(OpenGLVertex) * vertices.size());
          _context->Unmap(_vertexBuffer, 0);
       }

       UINT stride = sizeof(OpenGLVertex);
       UINT offset = 0;
       _context->IASetVertexBuffers(0, 1, &_vertexBuffer, &stride, &offset);
       _context->IASetPrimitiveTopology(topology);
       _context->Draw(vertices.size(), _currentVertex);
       _currentVertex += (int)vertices.size();
    }
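    For reference, the usual D3D11 idiom for a dynamic vertex buffer that is written many times per frame is to append with D3D11_MAP_WRITE_NO_OVERWRITE and fall back to D3D11_MAP_WRITE_DISCARD only when the write cursor would run past the end of the buffer; mapping with WRITE_DISCARD on every draw, as above, invalidates the whole buffer each time. A minimal CPU-side sketch of that cursor logic (PlanAppend and its types are illustrative helpers, not part of the D3D API):

    ```cpp
    #include <cstddef>

    // Stand-ins for the two D3D11_MAP flags used with dynamic buffers.
    enum MapMode { MapWriteDiscard, MapWriteNoOverwrite };

    struct AppendPlan {
        MapMode mode;    // which map flag to pass to ID3D11DeviceContext::Map
        size_t  offset;  // vertex index at which to memcpy the new data
    };

    // Append `count` vertices to a buffer holding `capacity` vertices.
    // NO_OVERWRITE appends behind the GPU without stalling; DISCARD is
    // issued only when we wrap back to the front of the buffer.
    AppendPlan PlanAppend(size_t& cursor, size_t count, size_t capacity) {
        if (cursor + count > capacity) {
            cursor = count;                   // restart at the front after DISCARD
            return AppendPlan{MapWriteDiscard, 0};
        }
        AppendPlan p{MapWriteNoOverwrite, cursor};
        cursor += count;
        return p;
    }
    ```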


    And here is the method that creates the vertex buffer:

    void OpenGL::CreateVertexBuffer()
    {
       D3D11_BUFFER_DESC bd;
       ZeroMemory(&bd, sizeof(bd));
       bd.Usage = D3D11_USAGE_DYNAMIC;
       bd.ByteWidth = _maxVertices * sizeof(OpenGLVertex);
       bd.BindFlags = D3D11_BIND_VERTEX_BUFFER;
       bd.CPUAccessFlags = D3D11_CPU_ACCESS_FLAG::D3D11_CPU_ACCESS_WRITE;
       bd.MiscFlags = 0;
       bd.StructureByteStride = 0;
       D3D11_SUBRESOURCE_DATA initData;
       ZeroMemory(&initData, sizeof(initData));
       _device->CreateBuffer(&bd, NULL, &_vertexBuffer);
    }

    Here is my vertex shader code:

    cbuffer ModelViewProjectionConstantBuffer : register(b0)
    {
        matrix model;
        matrix view;
        matrix projection;
    };
    
    struct VertexShaderInput
    {
        float3 pos : POSITION;
        float4 color : COLOR0;
        float2 tex : TEXCOORD0;
    };
    
    struct VertexShaderOutput
    {
        float4 pos : SV_POSITION;
        float4 color : COLOR0;
        float2 tex : TEXCOORD0;
    };
    
    VertexShaderOutput main(VertexShaderInput input)
    {
        VertexShaderOutput output;
        float4 pos = float4(input.pos, 1.0f);
    
        // Transform the vertex position into projected space.
        pos = mul(pos, model);
        pos = mul(pos, view);
        pos = mul(pos, projection);
        output.pos = pos;
    
        // Pass through the color without modification.
        output.color = input.color;
        output.tex = input.tex;
    
        return output;
    }



    Monday, August 27, 2012 2:52 PM

Answers

  • Don't draw each quad as a separate strip. Accumulate them in system memory as indexed triangles, then upload the whole index and vertex buffer to the GPU at one time. Flush the system-memory buffer when the texture changes or when you're done drawing quads. Even batching one string at a time like this would probably be sufficient, because it would cut the draw calls by an order of magnitude. OpenGL is probably doing something similar under the covers.
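    A minimal system-memory sketch of the batching described above, assuming a hypothetical SpriteBatch type; Flush() stands in for the single Map/Unmap plus DrawIndexed a real implementation would issue:

    ```cpp
    #include <cstdint>
    #include <vector>

    struct Vertex { float x, y, u, v; };

    // Illustrative batcher: quads accumulate in system memory as indexed
    // triangles. Flush() is where a real implementation would upload both
    // arrays and issue ONE DrawIndexed, instead of one Draw per quad.
    struct SpriteBatch {
        std::vector<Vertex>   vertices;
        std::vector<uint16_t> indices;
        int currentTexture = -1;
        int drawCalls = 0;  // stands in for ID3D11DeviceContext::DrawIndexed

        void AddQuad(int texture, const Vertex q[4]) {
            if (texture != currentTexture) {  // texture change forces a flush
                Flush();
                currentTexture = texture;
            }
            uint16_t base = static_cast<uint16_t>(vertices.size());
            vertices.insert(vertices.end(), q, q + 4);
            // Two triangles per quad, preserving strip winding: 0-1-2, 2-1-3.
            const uint16_t tri[6] = {0, 1, 2, 2, 1, 3};
            for (uint16_t i : tri) indices.push_back(base + i);
        }

        void Flush() {
            if (indices.empty()) return;
            ++drawCalls;  // real code: Map/Unmap both buffers, then DrawIndexed
            vertices.clear();
            indices.clear();
        }
    };
    ```

    With this structure, 1,000 same-texture quads cost one draw call instead of 1,000; the call count then scales with the number of texture switches rather than the number of sprites.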
    Monday, August 27, 2012 3:44 PM

All replies

  • You were absolutely correct. I accumulated triangle lists in system memory until I detected that the texture was being changed, at which point I would draw all stored triangles at the same time. This gave a huge performance boost. Thanks!
    Monday, August 27, 2012 7:59 PM