C++ AMP bitonic sort performance

  • General discussion

  • So, at almost exactly the same time as I finished coding up my own bitonic sort algorithm with C++ AMP, this blog post was put up with sample code. I ran it against mine (using the Concurrency Visualizer SDK - it's great!) and was surprised that mine was faster, despite a complete lack of tiling and a somewhat hacky approach (at least it feels that way to me). Mine also supports reverse sorting, non-power-of-two length arrays and non-primitive datatypes, since those were features that I needed.

    I've put my code up here, since it's too long to include directly. For the 512*512 elements in the other sample, mine took about 18ms (including checking the result) and the other took about 24ms (including an accl.wait() but not copying the data back to the CPU). Both used deferred accelerators and measured the second of two runs. I tried special-casing the last four passes of each stage (j = {16, 8, 4, 2}) and using a tiled grid, but the runtime got worse. I'd be happy to hear any suggestions for improving the performance, though I don't really have the need to squeeze any more power out of it right now (most of my actual datasets are small enough that copying to the CPU and using std::sort is as quick or quicker).

    (Just sharing this for other people's interest. It's completely undocumented, but feel free to use and abuse however you see fit.)
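
    For anyone curious what "a complete lack of tiling" amounts to, here is a minimal, hypothetical sketch of a single untiled compare-exchange pass (not my actual code, and written against the shipped C++ AMP API rather than the dev-preview grid types; k is the current sort-block size and j the compare distance):

    #include <amp.h>
    using namespace concurrency;

    // One global-memory pass of a bitonic sort: every element is compared
    // with its partner at distance j, and the lower-indexed thread of each
    // pair performs the swap. No tile_static memory is involved.
    template <typename T>
    void bitonic_pass(array<T, 1>& data, unsigned k, unsigned j)
    {
        parallel_for_each(data.extent, [=, &data](index<1> idx) restrict(amp)
        {
            unsigned i = idx[0];
            unsigned partner = i ^ j;            // index of the element to compare against
            if (partner > i)                     // handle each pair exactly once
            {
                bool ascending = ((i & k) == 0); // sort direction for this block
                T a = data[i], b = data[partner];
                if ((a > b) == ascending)        // swap when out of order
                {
                    data[i] = b;
                    data[partner] = a;
                }
            }
        });
    }

    The full sort just loops this kernel: k = 2, 4, ..., n and, within each stage, j = k/2 down to 1, so every pass is a separate dispatch that goes to global memory.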


    Tuesday, November 15, 2011 6:12 AM

All replies

  • Hi

    Awesome to see your Bitonic Sort implementation - if we'd known you were doing one, we wouldn't have done ours :-)

    Joking aside, you are right, there is no reason for a performance difference between the two.

    FWIW, we measured in one of our hardware environments your code and ours, and we found them comparable (our example being marginally faster).

    1. Can you please confirm that you are measuring performance on a *release* configuration? As an aside, our debug configuration is truly worthless for performance measurements, since we do so much more in that configuration to make the Visual Studio debugging experience shine.

    2. Assuming you are measuring in a release configuration, can you share how you are measuring the performance, i.e. can you share some perf-measuring code, or else explain exactly which tools you are using to measure the perf difference? It is way too easy to be measuring the wrong thing, and I am confident you are not, but it doesn't hurt to be sure.

    3. Finally, can you tell us which exact card you are measuring this on? We run all our measurements on AMD and NVIDIA cards, but it is good to try your code in our perf lab on the exact card, so this would help.

    Looking forward to the additional info from you so we can understand where the discrepancy is...thanks again.

    Cheers

    Daniel


    http://www.danielmoth.com/Blog/
    Wednesday, November 16, 2011 6:52 AM
  • Equally, if I'd known you were doing one, I wouldn't have bothered either. I've uploaded my testing project here, which is meant to be run under the Concurrency Visualizer. For convenience, I refer to mine as "parallel_sort" and yours as "bitonic_sort_amp". Just for fun, I threw std_sort in there as well. On my hardware, std_sort comes out at 20ms, versus 16ms for parallel_sort and 21ms for bitonic_sort_amp -- is this considered comparable? They're very similar timings, though they all seem to be repeatable.

    I had thought there was a big reason for a performance difference -- specifically, parallel_sort only uses random global memory accesses, and "should" have been much slower...

    For your numbered points:

    1. Absolutely running under release mode. Interestingly, if you run the two algorithms under debug mode (with vcamp.lib instead of vcampd.lib), parallel_sort runs an order of magnitude slower, while bitonic_sort_amp is far less affected. According to CV, most of the extra time is spent in nvwgf2um.dll; both CPU and GPU utilisation are at 100% (one logical CPU core out of eight, one GPU out of two).

    2. I'm using CV spans from where the arrays (not vectors) are ready to go until after the accelerator_view::wait() call; a rough sketch of that setup is at the end of this post. Copying data back to the CPU and validating it isn't included in the measurement, though it's still in the source. I'm using floats in the code linked above, but get identical results for ints. Each algorithm is run 5 times; the first run is always 3-4 times slower, but the other four runs are typically within 1ms of each other.

    3. I'm running on a GeForce GT 540M -- the mobile version -- I assume there's only one set of specs for this card, but in brief it's 2GB dedicated + 2GB shared, 96 CUDA cores and 672MHz graphics, 1344MHz processor and 900MHz memory clocks. It's also paired with the Intel integrated graphics, but I believe that only comes into play as a pass-through when rendering to the screen (whole laptop is a Dell XPS 15).

    I suspect the vastly different results under debug will be the best hint here, though I don't know exactly what it means. I tried running the two sorts on different amounts of data, but wasn't confident that I could change it without breaking some of the assumptions throughout bitonic_sort_amp. There's probably also a difference in the number of kernels being dispatched, though I haven't figured out how many (or even which one has more).
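
    For reference, this is roughly what the measurement in point 2 looks like. It's a sketch only, using the Concurrency Visualizer SDK's marker_series/span classes from cvmarkersobj.h; run_my_sort is a stand-in for whichever sort is being timed, not a function from the project:

    #include <amp.h>
    #include <tchar.h>
    #include <cvmarkersobj.h>                    // Concurrency Visualizer SDK
    using namespace concurrency;
    using namespace Concurrency::diagnostic;

    void run_my_sort(array<float, 1>& data);     // hypothetical: the sort under test

    void measure_sort(array<float, 1>& data, accelerator_view av)
    {
        marker_series series(_T("sort timing"));
        {
            span s(series, _T("parallel_sort")); // span opens once the array is ready
            run_my_sort(data);                   // queue the GPU kernels
            av.wait();                           // block until all queued work has completed
        }                                        // span closes here (RAII); copy-back and
                                                 // validation happen outside the span
    }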

    Wednesday, November 16, 2011 9:13 AM
  • Hi All,

    Interesting conversation going on here....  Actually, there should be a performance difference between the two implementations.  Zooba's implementation does not use GPU shared memory, since it has no references to tile_static.  BharathM's implementation uses shared memory, i.e., lines 37 and 68 of BitonicSort.cpp, in order to minimize global memory accesses. Also, BharathM's implementation adds a matrix transposition step in order to sort more than 512 elements.
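
    For anyone who hasn't seen the pattern, here is a rough sketch (not BharathM's code) of what the tile_static approach buys you: the low-order passes of a stage run entirely against a staged copy in shared memory, so each element costs one global read and one global write instead of one per pass. It assumes a tile size of 512, an array length that is a multiple of the tile size, and a stage size k of at least the tile size.

    #include <amp.h>
    using namespace concurrency;

    static const int SORT_TILE_SIZE = 512;       // assumed tile size, not from the sample

    // Illustration only: the passes with j < SORT_TILE_SIZE of one bitonic
    // stage (block size k >= SORT_TILE_SIZE), run against tile_static memory.
    void bitonic_tile_passes(array<float, 1>& data, unsigned k)
    {
        parallel_for_each(data.extent.tile<SORT_TILE_SIZE>(),
            [=, &data](tiled_index<SORT_TILE_SIZE> tidx) restrict(amp)
        {
            tile_static float sh[SORT_TILE_SIZE];
            unsigned i = tidx.global[0];
            unsigned l = tidx.local[0];
            sh[l] = data[i];                     // one global read per element
            tidx.barrier.wait();

            for (unsigned j = SORT_TILE_SIZE / 2; j > 0; j /= 2)
            {
                unsigned partner = l ^ j;        // partner stays inside this tile
                if (partner > l)
                {
                    bool ascending = ((i & k) == 0);
                    float a = sh[l], b = sh[partner];
                    if ((a > b) == ascending) { sh[l] = b; sh[partner] = a; }
                }
                tidx.barrier.wait();             // every compare-exchange hits shared memory
            }

            data[i] = sh[l];                     // one global write per element
        });
    }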

    Since I've been studying bitonic sorting implementations on GPUs for almost 2 months now (see blog), I decided to look at this.  I created a single program with both implementations and ran it on my AMD APU A8-3850.  In order to compare apples with apples and oranges with oranges, so to speak, I made sure to include the time for copying data from the CPU to the GPU and back.  BharathM's implementation contains a constructor for array<>, which copies data from the CPU to the GPU.  At the end of BharathM's implementation, there is a call to concurrency::copy(), which copies data from the GPU back to the CPU.  I tried to make sure the copy steps are the same, but I could have made a mistake.

    In addition, I added a version I wrote that uses just registers instead of GPU shared memory.  This implementation is based on Peters' paper [Peters, H., O. Schulz-Hildebrandt, et al. (2011). "Fast in-place, comparison-based sorting with CUDA: a study with bitonic sort." Concurrency and Computation: Practice and Experience 23(7): 681-693.], but uses a "normalized" bitonic sorting network.  This version uses "degree 5" sorting networks per thread, though degree 4 is better.  The paper is pretty good.
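
    The idea, very roughly (a hypothetical degree-2 illustration, not my code or the paper's, and deliberately simplified by launching one thread per element and letting only the group leader do work): each thread keeps 2^d elements in registers and performs d consecutive passes on them without touching memory in between, so d passes cost one round of global reads and writes instead of d.

    #include <amp.h>
    using namespace concurrency;

    template <typename T>
    void swap_if(T& a, T& b, bool ascending) restrict(amp)
    {
        if ((a > b) == ascending) { T t = a; a = b; b = t; }
    }

    // Two consecutive passes (distances j and j/2, both < k, j >= 2) of a
    // normalized bitonic network, done per thread in registers.
    template <typename T>
    void bitonic_passes_degree2(array<T, 1>& data, unsigned k, unsigned j)
    {
        parallel_for_each(data.extent, [=, &data](index<1> idx) restrict(amp)
        {
            unsigned i0 = idx[0];
            if ((i0 & j) != 0 || (i0 & (j / 2)) != 0)
                return;                          // only the lowest index of each group of four works

            bool asc = ((i0 & k) == 0);          // same direction for all four elements

            T a = data[i0],     b = data[i0 ^ (j / 2)],
              c = data[i0 ^ j], d = data[i0 ^ j ^ (j / 2)];

            swap_if(a, c, asc); swap_if(b, d, asc);   // the pass at distance j
            swap_if(a, b, asc); swap_if(c, d, asc);   // the pass at distance j/2

            data[i0] = a;             data[i0 ^ (j / 2)] = b;
            data[i0 ^ j] = c;         data[i0 ^ j ^ (j / 2)] = d;
        });
    }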

    Throwing out the first run of each, and compiling in release mode, the runtime for Zooba's implementation is slower than BharathM's, at least on my machine with the 512*512 problem size.  The register implementation is faster than both of the others.

    Zooba: 0.060 s; BharathM: 0.042 s; Domino: 0.040 s.

    Changing the problem size does seem to create some interesting results.  With a 512*512*64 sized array, BharathM's implementation is actually slower than either of the others.

    Zooba: 1.96 s; BharathM: 3.46 s; Domino: 1.44 s.

    The complete project is on my web site.

    Ken

    Wednesday, November 16, 2011 8:24 PM
  • Hi Ken

    I downloaded and tried your code, and it's certainly faster (dramatically so if you use a deferred queuing mode), though the verification of the degree version failed on the second and third runs both times (indices 512, 4, 512 and 2).

    Immediate -- Zooba: 33ms; BharathM: 24ms; Domino: 16ms

    Deferred -- Zooba: 18ms; BharathM: 23ms; Domino: 3ms

    With the larger data set it's almost unbelievably fast, though again suffers from the occasional verification failure (index 1):

    Deferred *64 -- Zooba: 1.57s; BharathM: 2.24s; Domino: 0.08s

     

    Interestingly, I just noticed that with my setup, the default accelerator under CV is my GT540M, but running directly (with or without the VS debugger attached) the default is D3D ref. I suspect this is due to the dual-card setup ("Optimus") and I don't seem to have a workaround except to use CV. A known dev preview issue, perhaps? If I run dxdiag it defaults to using the integrated graphics and reports DDI version 10.1; when I force the GT540M (through NVIDIA's context menu) it still reports the integrated graphics by name and drivers, but the memory as ~4GB and DDI version as 11, which matches the GT540M. Happy to run any diagnostic tools you like to provide more info on my configuration, it looks like someone in the mix is lying about things.

    Thursday, November 17, 2011 4:33 AM
  • Interesting...I noticed that the code works fine on my two different machines as is, even in an infinite loop running for 10 or so minutes.  But, I noticed that I don't use any accelerator(_view) in the code, except for the constructor of the array.  So, I decided to add an accelerator_view to all of the code (array<> constructor and parallel_for_each).  I now get a similar, spurious behavior after a random number of iterations in an infinite for-loop.  I'm wondering if there is a problem in resource allocation in the program or lower down in the drivers or runtime. Thanks for the info.  I will investigate.
    Thursday, November 17, 2011 12:27 PM
  • Oops...my problem. C++ AMP is fine :). I didn't set up the input and output arrays correctly, so it got random data.  Also, I didn't specify the accelerator on the parallel_for_each.  Amazing the thing "worked" before.  The code seems to work now (here).  --Ken
    Thursday, November 17, 2011 2:48 PM
  • Hi All,

    Unfortunately I couldn't get a GT540M laptop today to test and understand the perf issue. While I work internally to find out more about it, I did improve performance on my end with a workaround.

    Here is another matrix transpose implementation which improved performance on my AMD APU (based on HD 6550D):

    // Staged transpose: each tile copies a TRANSPOSE_TILE_SIZE x TRANSPOSE_TILE_SIZE
    // block into tile_static memory, then writes it back out transposed, so the
    // global reads and writes each stay contiguous within a tile.
    template <typename _type>
    void transpose_kernel(array<_type, 1>& data_in, array<_type, 1>& data_out, unsigned width, unsigned height, tiled_index<TRANSPOSE_TILE_SIZE, TRANSPOSE_TILE_SIZE> tidx) restrict(amp)
    {
        tile_static _type transpose_shared_data[TRANSPOSE_TILE_SIZE * TRANSPOSE_TILE_SIZE];

        // Stage one element per thread into shared memory.
        transpose_shared_data[tidx.local[0] * tidx.get_tile_extent()[1] + tidx.local[1]] = data_in[tidx.global[0] * width + tidx.global[1]];
        tidx.barrier.wait();

        // Coordinates of this thread's element in the transposed output.
        unsigned x = tidx.global[0] - tidx.local[0] + tidx.local[1];
        unsigned y = tidx.global[1] - tidx.local[1] + tidx.local[0];

        // Read the transposed element out of shared memory and write it to global memory.
        data_out[y * height + x] = transpose_shared_data[tidx.local[1] * TRANSPOSE_TILE_SIZE + tidx.local[0]];
    }
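
    In case it helps, this is roughly how that kernel would be dispatched (a sketch only, written against the extent-based API, so the preview's grid/tiled_grid types differ slightly; it assumes width and height are multiples of TRANSPOSE_TILE_SIZE):

    template <typename _type>
    void transpose(array<_type, 1>& data_in, array<_type, 1>& data_out,
                   unsigned width, unsigned height)
    {
        // One thread per element of the height x width input matrix,
        // grouped into TRANSPOSE_TILE_SIZE x TRANSPOSE_TILE_SIZE tiles.
        extent<2> e(height, width);
        parallel_for_each(e.tile<TRANSPOSE_TILE_SIZE, TRANSPOSE_TILE_SIZE>(),
            [=, &data_in, &data_out](tiled_index<TRANSPOSE_TILE_SIZE, TRANSPOSE_TILE_SIZE> tidx) restrict(amp)
        {
            transpose_kernel(data_in, data_out, width, height, tidx);
        });
    }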


    And another minor change to avoid a warning in bitonic_sort_kernel:
    Replace

        _type result = ((sh_data[tidx.local[0] & ~j] <= sh_data[tidx.local[0] | j]) == (bool)(ulevelmask & global_idx)) ? sh_data[tidx.local[0] ^ j] : sh_data[tidx.local[0]];

    in bitonic_sort_kernel with

        bool mask = (ulevelmask & global_idx) != 0 ? true : false;
        bool sh_data_b = (sh_data[local_idx & ~j] <= sh_data[local_idx | j]) ? true : false;
        _type result = (sh_data_b == mask) ? sh_data[local_idx ^ j] : sh_data[local_idx];

    Hope this improves performance on NVIDIA. Will let you know if there are any new issues we discover with my implementation.


    Since this is ported from another sample, it inherits the same rigidity in terms of its input size and GPU thread/tile configuration.

    Friday, November 18, 2011 4:11 AM
  • Hi

    Bharath is looking at the bitonic sort kernel itself as he thought the performance was comparable between yours and his (although I believe we have a clear winner with Ken's example - our samples are meant to show the API and we do not claim that they are super tuned).

    Regarding your comment on the accelerators, here are some comments:

    1. There are better ways than dxdiag to see the DX11 capabilities of your hardware, please see this post: http://www.danielmoth.com/Blog/What-DX-Level-Does-My-Graphics-Card-Support-Does-It-Go-To-11.aspx
    2. When you run under the VS GPU debugger (currently only a Win8 feature), the default accelerator becomes Ref. This is super slow and only useful for correctness debugging.
    3. When not under the debugger, the runtime picks an accelerator for you, typically what it thinks is best.
    4. In your Optimus system, it sounds like something goes wrong and the runtime thinks that Ref is better than the hardware. We'll try to repro this and investigate in house, thanks.
    5. On an Optimus system I believe you can right-click on an executable and choose which GPU you want it to use, so that could be a workaround for you. Another is to use NVIDIA's control panel and set a profile for your binary to always use that card.
    6. Regardless of all of the above, you can specify the accelerator_view to use (either in the array ctor or in the p_f_e) and then you don't have to rely on the default accelerator (I haven't tried this on an Optimus system); there's a rough sketch of this below.
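
    As a sketch of point 6 (written against the API as it shipped, so the developer preview's names may differ slightly): enumerate the accelerators, pick a non-emulated one, and pass its view around explicitly.

    #include <amp.h>
    #include <iostream>
    #include <vector>
    using namespace concurrency;

    // Pick the first hardware (non-emulated) accelerator, falling back to the default.
    accelerator_view pick_hardware_view()
    {
        std::vector<accelerator> accs = accelerator::get_all();
        for (size_t n = 0; n < accs.size(); ++n)
        {
            if (!accs[n].is_emulated)
            {
                std::wcout << accs[n].description << L"  " << accs[n].device_path << std::endl;
                return accs[n].default_view;
            }
        }
        return accelerator().default_view;
    }

    // Usage: create the arrays on that view and pass it to parallel_for_each, e.g.
    //   accelerator_view av = pick_hardware_view();
    //   array<float, 1> data(n, host_data.begin(), av);
    //   parallel_for_each(av, data.extent, ...);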

    Cheers

    Daniel


    http://www.danielmoth.com/Blog/


    Saturday, November 19, 2011 1:42 AM
  • There's definitely some messing around going on with this system, I assume it's due to Optimus.

    The only way to instantiate an accelerator without crashing seems to be to use the default device. Doing this and running the executable with the GT540M works and uses the hardware, BUT the accelerator device_path shows the path to the Intel device (other properties are correct). Running with the integrated card (the default) uses Ref and the device path is "ref".  Enumerating all accelerators only shows the GT540M if the process has been started with it.

    Explicitly providing the device path with either the path to the GT540M or the integrated card causes a runtime_error, regardless of which card is set. 

    I'm running on Windows 7, just FYI. I haven't been really happy with Optimus, mostly because it doesn't seem keen enough to use the proper card (say, in scenarios like this) without specifying individual overrides, though it does keep battery usage down. How does CUDA handle this situation?

    Sunday, November 20, 2011 7:12 AM
  • Hi,

    On my laptop (W7 64-bit) I have both an NVIDIA 3000M and Intel integrated graphics (and of course Optimus). In CUDA, when I enumerate the GPU devices I see just the NVIDIA card, so without doing anything CUDA runs on the 3000M device.

    G.

    Sunday, November 20, 2011 6:03 PM
  • Hi

    Yes, we have an open thread with NVIDIA inquiring about this, and are writing a blog post around Optimus and what needs to be done for your EXE to be detected as requiring the high-end graphics card. Stay tuned to our blog for when we have more details to share...

    Cheers

    Daniel


    http://www.danielmoth.com/Blog/
    Monday, November 21, 2011 10:29 PM
  • Looks like part of the problem was badly installed drivers and a string-escaping failure on my part.

    I did as clean an install as I could of the latest drivers (leaving out the 3D stuff that I can't use anyway) and reset all the settings. I also set it to use the GT540M by default, which now works fine when run from VS or directly.

    Remembering to escape the backslashes in the device path fixes the runtime errors, but the device path for running C++ AMP on the GT540M is actually the path to the integrated card. I've seen a number of issues reported on the NVIDIA forums and gaming forums about the drivers incorrectly reporting which device is in use, so I suspect this is theirs to fix.
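
    For anyone who hits the same runtime_error, the gotcha was simply the string literal (the path below is a placeholder, not a real device path):

    // Wrong: the backslashes are interpreted as escape sequences.
    // accelerator acc(L"PCI\VEN_XXXX&DEV_XXXX\1&ABCDEF");

    // Right: escape each backslash in the device path.
    accelerator acc(L"PCI\\VEN_XXXX&DEV_XXXX\\1&ABCDEF");
    accelerator_view av = acc.default_view;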

    Sunday, November 27, 2011 1:36 AM
  • Cool, you are writing sorts. It's the best learning experience for the GPU. But unless you have a specific need for it, I wouldn't put more effort into writing a bitonic sort. It has very poor work efficiency. The advantage of bitonic sort is that it doesn't require random access into the arrays, which is great if all you have are non-indexable registers. However, with shared memory on the GPU you should definitely use a radix sort. For very small arrays that are stored in registers (like you may have in a pixel shader) bitonic is probably better, but once you have multiple blocks working on the same array, the work efficiency of bitonic sort makes achieving good throughput impossible.
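
    To put rough numbers on that: for n = 2^18 keys (the 512*512 case), a bitonic network needs log2(n)*(log2(n)+1)/2 = 171 compare-exchange passes over the whole array, while an LSD radix sort of 32-bit keys with 8-bit digits needs only four passes, each a histogram, a prefix scan and a stable scatter. Here's a minimal sequential sketch of that structure, just to show the shape; the GPU versions parallelize each of the three phases per block:

    #include <cstdint>
    #include <vector>

    // One least-significant-digit pass over an 8-bit digit.
    void radix_pass(const std::vector<uint32_t>& in, std::vector<uint32_t>& out, int shift)
    {
        size_t count[256] = {};                        // histogram of the current digit
        for (size_t i = 0; i < in.size(); ++i)
            ++count[(in[i] >> shift) & 0xFF];

        size_t offset = 0;                             // exclusive prefix scan -> start offsets
        for (int d = 0; d < 256; ++d)
        {
            size_t c = count[d];
            count[d] = offset;
            offset += c;
        }

        for (size_t i = 0; i < in.size(); ++i)         // stable scatter by digit
            out[count[(in[i] >> shift) & 0xFF]++] = in[i];
    }

    // Full sort: four passes over the four 8-bit digits of a 32-bit key.
    void radix_sort(std::vector<uint32_t>& data)
    {
        std::vector<uint32_t> tmp(data.size());
        for (int shift = 0; shift < 32; shift += 8)
        {
            radix_pass(data, tmp, shift);
            data.swap(tmp);
        }
    }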

    Since you like sorts read through my GPU radix sort tutorial:

    http://www.moderngpu.com/sort/mgpusort.html

    On a GTX 570 it sorts a 40,000,000-element array of 32-bit ints in 30ms, for a sustained throughput of 1.31bn keys/s.

    The library is CUDA, although I did a DirectCompute version before this. My implementation uses the Fermi byte-permute instruction quite extensively. As this isn't exposed in AMP, you'll have to use more standard scan operations.

    There's a general description of radix sort and GPU tradeoffs here:

    http://www.moderngpu.com/sort/algo.html

    The remaining pages in that section dissect the actual implementation. A line-by-line translation would work for the most part, except for the byte permute hacks.

    There are some likely optimizations that can be made by leveraging texture cache to do block transpose instead of using shared memory, but afaik AMP doesn't support texture cache at all (however DirectCompute did, and so does OpenCL 1.2).

    I also have some of my radix select (k'th smallest element search) implementation available and documented. It's the symmetric case to sort - instead of starting from the least significant radix digit, you start from the most significant one. Radix techniques are very useful on GPU because they are inherently data parallel.

    sean

     

    Monday, November 28, 2011 6:56 PM