A replacement for the Quicksort function in the C++ library

    General discussion

  • Hi every one,

    I'd like to introduce and share a new Triple State Quicksort algorithm which was the result of my research in sorting algorithms during the last few years. The new algorithm reduces the number of swaps to about two thirds (2/3) of classical Quicksort. A multitude of other improvements are implemented. Test results against the std::sort() function shows an average of 43% improvement in speed throughout various input array types. It does this by trading space for performance at the price of n/2 temporary extra spaces. The extra space is allocated automatically and efficiently in a way that reduces memory fragmentation and optimizes performance.

    Triple State Algorithm
    The classical way of doing Quicksort is as follows:

    - Choose one element p. Called pivot. Try to make it close to the median.
    - Divide the array into two parts. A lower (left) part that is all less than p. And a higher (right) part that is all greater than p.
    - Recursively sort the left and right parts using the same method above.
    - Stop recursion when a part reaches a size that can be trivially sorted.

     The difference between the various implementations is in how they choose the pivot p, and where equal elements to the pivot are placed. There are several schemes as follows:

    [ <=p | ? | >=p ]
    [ <p | >=p | ? ]
    [ <=p | =p | ? | >p ]
    [ =p | <p | ? | >p ]  Then swap = part to middle at the end
    [ =p | <p | ? | >p | =p ]  Then swap = parts to middle at the end

    Where the goal (or the ideal goal) of the above schemes (at the end of a recursive stage) is to reach the following:

    [ <p | =p | >p ]

    The above would allow exclusion of the =p part from further recursive calls thus reducing the number of comparisons. However, there is a difficulty in reaching the above scheme with minimal swaps. All previous implementation of Quicksort could not immediately put =p elements in the middle using minimal swaps, first because p might not be in the perfect middle (i.e. median), second because we don’t know how many elements are in the =p part until we finish the current recursive stage.

    The new Triple State method first enters a monitoring state 1 while comparing and swapping. Elements equal to p are immediately copied to the middle if they are not already there, following this scheme:

    [ <p | ? | =p | ? | >p ]

    Then when either the left (<p) part or the right (>p) part meet the middle (=p) part, the algorithm will jump to one of two specialized states. One state handles the case for a relatively small =p part. And the other state handles the case for a relatively large =p part. This method adapts to the nature of the input array better than the ordinary classical Quicksort.

    Further reducing number of swaps
    A typical quicksort loop scans from left, then scans from right. Then swaps. As follows:

    while (l<=r)
       while (ar[l]<p)
       while (ar[r]>p)
       if (l<r)
       {  Swap(ar[l],ar[r]);
          l++; r--;
       else if (l==r)
       {	l++; r--; break;

    The Swap macro above does three copy operations:

    Temp=ar[l]; ar[l]=ar[r]; ar[r]=temp;

    There exists another method that will almost eliminate the need for that third temporary variable copy operation. By copying only the first ar[r] that is less than or equal to p, to the temp variable, we create an empty space in the array. Then we proceed scanning from left to find the first ar[l] that is greater than or equal to p. Then copy ar[r]=ar[l]. Now the empty space is at ar[l]. We scan from right again then copy ar[l]=ar[r] and continue as such. As long as the temp variable hasn’t been copied back to the array, the empty space will remain there juggling left and right. The following code snippet explains.

    // Pre-scan from the right
    while (ar[r]>p)
    temp = ar[r];
    // Main loop
    while (l<r)
       while (l<r && ar[l]<p)
       if (l<r) ar[r--] = ar[l];
       while (l<r && ar[r]>p)
       if (l<r) ar[l++] = ar[r];
    // After loop finishes, copy temp to left side
    ar[r] = temp; l++;
    if (temp==p) r--;

    (For simplicity, the code above does not handle equal values efficiently. Refer to the complete code for the elaborate version).

    This method is not new, a similar method has been used before (read:

    However it has a negative side effect on some common cases like nearly sorted or nearly reversed arrays causing undesirable shifting that renders it less efficient in those cases. However, when used with the Triple State algorithm combined with further common cases handling, it eventually proves more efficient than the classical swapping approach.

    Run time tests
    Here are some test results, done on an i5 2.9Ghz with 6Gb of RAM. Sorting a random array of integers. Each test is repeated 5000 times. Times shown in milliseconds.

    size		std::sort()		Triple State QuickSort
    5000		2039			1609
    6000		2412			1900
    7000		2733			2220
    8000		2993			2484
    9000		3361			2778
    10000		3591			3093

    It gets even faster when used with other types of input or when the size of each element is large. The following test is done for random large arrays of up to 1000000 elements where each element size is 56 bytes. Test is repeated 25 times.

    size		std::sort()	Triple State QuickSort
    100000		1607		424
    200000		3165		845
    300000		4534		1287
    400000		6461		1700
    500000		7668		2123
    600000		9794		2548
    700000		10745		3001
    800000		12343		3425
    900000		13790		3865
    1000000		15663		4348

    Further extensive tests has been done following Jon Bentley’s framework of tests for the following input array types:

    sawtooth: ar[i] = i % arange
    random: ar[i] = GenRand() % arange + 1
    stagger: ar[i] = (i* arange + i) % n
    plateau: ar[i] = min(i, arange)
    shuffle: ar[i] = rand()%arange? (j+=2): (k+=2)

    I also add the following two input types, just to add a little torture:

    Hill: ar[i] = min(i<(size>>1)? i:size-i,arange);
    Organ Pipes: (see full code for details)

    Where each case above is sorted then reordered in 6 deferent ways then sorted again after each reorder as follows:

    Sorted, reversed, front half reversed, back half reversed, dithered, fort.

    Note: GenRand() above is a certified random number generator based on Park-Miller method. This is to avoid any non-uniform behavior in C++ rand().

    The complete test results can be found here:

    Theoretical Analysis
    A Classical Quicksort algorithm performs less than 2n*ln(n) comparisons on the average (check JACEK CICHON’s paper) and less than 0.333n*ln(n) swaps on the average (check Wild and Nebel’s paper). Triple state will perform about the same number of comparisons but with less swaps of about 0.222n*ln(n) in theory. In practice however, Triple State Quicksort will perform even less comparisons in large arrays because of a new 5 stage pivot selection algorithm that is used. Here is the detailed theoretical analysis:

    Using SSE2 instruction set
    SSE2 uses the 128bit sized XMM registers that can do memory copy operations in parallel since there are 8 registers of them. SSE2 is primarily used in speeding up copying large memory blocks in real-time graphics demanding applications.
    In order to use SSE2, copied memory blocks have to be 16byte aligned. Triple State Quicksort will automatically detect if element size and the array starting address are 16byte aligned and if so, will switch to using SSE2 instructions for extra speedup. This decision is made only once when the function is called so it has minor overhead.

    Few other notes
    - The standard C++ sorting function in almost all platforms religiously takes a “call back pointer” to a comparison function that the user/programmer provides. This is obviously for flexibility and to allow closed sourced libraries. Triple State defaults to using a call back function. However, call back functions have bad overhead when called millions of times. Using inline/operator or macro based comparisons will greatly improve performance. An improvement of about 30% to 40% can be expected. Thus, I seriously advise against using a call back function when ever possible. You can disable the call back function in my code by #undefining CALL_BACK precompiler directive.

    - Like most other efficient implementations, Triple State switches to insertion sort for tiny arrays, whenever the size of a sub-part of the array is less than TINY_THRESH directive. This threshold is empirically chosen. I set it to 15. Increasing this threshold will improve the speed when sorting nearly sorted and reversed arrays, or arrays that are concatenations of both cases (which are common). But will slow down sorting random or other types of arrays. To remedy this, I provide a dual threshold method that can be enabled by #defining DUAL_THRESH directive. Once enabled, another threshold TINY_THRESH2 will be used which should be set lower than TINY_THRESH. I set it to 9. The algorithm is able to “guess” if the array or sub part of the array is already sorted or reversed, and if so will use TINY_THRESH as it’s threshold, otherwise it will use the smaller threshold TINY_THRESH2. Notice that the “guessing” here is NOT fool proof, it can miss. So set both thresholds wisely.

    - You can #define the RANDOM_SAMPLES precompiler directive to add randomness to the pivoting system to lower the chances of the worst case happening at a minor performance hit.

    -When element size is very large (320 bytes or more). The function/algorithm uses a new “late swapping” method. This will auto create an internal array of pointers, sort the pointers array, then swap the original array elements to sorted order using minimal swaps for a maximum of n/2 swaps. You can change the 320 bytes threshold with the LATE_SWAP_THRESH directive.

    - The function provided here is optimized to the bone for performance. It is one monolithic piece of complex code that is ugly, and almost unreadable. Sorry about that, but inorder to achieve improved speed, I had to ignore common and good coding standards a little. I don’t advise anyone to code like this, and I my self don’t. This is really a special case for sorting only. So please don’t trip if you see weird code, most of it have a good reason.

    Finally, I would like to present the new function to Microsoft and the community for further investigation and possibly, inclusion in VC++ or any C++ library as a replacement for the sorting function.

    You can find the complete VC++ project/code along with a minimal test program here:

    Important: To fairly compare two sorting functions, both should either use or NOT use a call back function. If one uses and another doesn’t, then you will get unfair results, the one that doesn’t use a call back function will most likely win no matter how bad it is!!

    Ammar Muqaddas

    • Edited by S0lo Friday, March 06, 2015 4:49 PM
    Thursday, March 05, 2015 12:06 PM

All replies

  • I just did I few changes to the test program that I forgot to do for a better random generation.

    Thursday, March 05, 2015 3:22 PM
  • interesting.  I've done my own qsorts before, usually on pointer arrays.  '1 of 5' optimization probably does more to the algorithm than anything, and having 'fewer swaps', if you ONLY use pointers rather than swapping entire structures or BLOB data, may be negated by the extra code.  You should profile it on an array of pointers to see if this is the case, with and without the '1 of 5' optimization.

    If you REALLY want speed, try my multi-threaded qsort algorithm.  sample code here: Sort Demo source and info

    it should build with MFC using an earlier version of devstudio, but not sure about the latest.  you might have to tweek it.  it also builds on non-windows systems using 'wxWidgets' which is one reason I like doing MFC development, since porting MFC to wxWidgets is a relatively straightforward process once you understand the fundamental changes.  In any case, I thought it was worth a mention.  multi-core with multi-thread qsort would be about 2.5 times faster with 4 cores, on average.  that's an even MORE significant speedup.

    (updated) I added an MFC version download to the web page, for VS 2010
    Friday, March 06, 2015 7:24 PM
  • Thanks for your interest.

    Excuse my ignorance as I'm not sure what you meant by "1 of 5" optimization. Did you mean median of 5 ?

    Regarding swapping pointers, yes it is common sense and rather common among programmers to swap pointers instead of swapping large data types, at the small price of indirect access to the actual data through the pointers.

    However, there is a rather unobvious and quite terrible side effect of using this trick. After the pointer array is sorted, sequential (sorted) access to the actual data throughout the remaining of the program will suffer heavily because of cache misses. Memory is being accessed randomly because the pointers still point to the unsorted data causing many many cache misses, which will render the program itself slow, although the sort was fast!!.

    Multi-threaded qsort is a good idea in principle and easy to implement obviously because qsort itself is recursive. The thing is Multi-threaded qsort is actually just stealing CPU time from other cores that might be busy running other apps, this might slow down other apps, which might not be ideal for servers. The thing researchers usually try to do is to do the improvement in the algorithm it self.

    I Will try to look at your sorting code, lets see if I can compile it.

    • Edited by S0lo Friday, March 06, 2015 9:30 PM
    Friday, March 06, 2015 8:11 PM
  • Just want to say I did a few fixes to the code. The major one being a compile error that occurs when both USE_SortN and USE_UNALIGNED_SSE2 are defined.

    @big bad bombastic bob

    I wasn't able to compile your code, mainly because of different VS versions, tried to rebuild from scratch but got a linker error. I'm completely MFC illiterate. The code looks interesting though. The app looks very cool.

    Friday, March 13, 2015 10:42 AM
  • I've added an "Array Based" version of the algorithm. Same link above.

    This version is simpler and easier to understand but does NOT include further C/C++ related optimization for better element swapping. These optimizations are included in the other pointer based (Pointer to Character "PC") version. Therefore, this array version will usually run slower. It is provided for simplicity and easiness, for programmers who want to understand the algorithm or port it to other languages.

    Algorithmically, it is exactly the same as the pointer based version, doing the same number of comparisons and swaps/copies.

    Since it's array based, this version depends on the compiler to do the swapping/copying. This means that element size MUST be known at compile time.

    I also did fix a compile error that occurs when RANDOM_SAMPLES is defined.

    Hope this helps someone.
    • Edited by S0lo Thursday, April 30, 2015 5:46 PM
    Thursday, April 30, 2015 12:14 PM
  • I just finished a research paper describing the algorithm in full length. You can find it at Cornell Library:

    • Edited by S0lo Tuesday, May 05, 2015 8:52 AM
    Tuesday, May 05, 2015 8:51 AM
  • I've fixed a bug in the pointer based version of the function, that happens in Late Swapping when the element size is very large.

    Whoever is using the function, please download the latest version. Same link above.

    Tuesday, September 15, 2015 10:25 AM
  • @S0lo - I've downloaded your code and compiled it in VS2k15. I turned off the callback comparison function and removed the extra parameter to std::sort() to ensure just internal comparisons were being made. std::sort() is consistently faster than 3-state quicksort using your test app. Also keep in mind that std::sort also uses std::move() internally now too. Maybe that helps it.

    In addition, I copied in my personal optimized version of quicksort which uses sentinals, internal stack (instead of recursion), median of 3, cutoff to my personal optimized version of insertion sort, and the 3 way partitioning scheme described by sedgewick (for equal elements). This version blows the doors off of std::sort() and literally completes in half the time your 3 state quicksort implementation does.

    Just for the fun of it, I translated the jdk 8 version of dual-pivot quicksort from java to c++ and added it to your program. It performs similarly to my optimized version of quicksort, slower for this specific dataset, but blows away std::sort() and yours. I didn't bother optimizing the dual pivot version (just a straight translation), but there is room for it to be improved.

    My optimized version and my translation of dual pivot are both about 150 lines of code each. What in the world is your version doing? literally thousands of lines of spaghetti code. For what it's worth, I was able to improve the speed of yours by not executing insertion sort over and over like your implementation does, just 1 pass at the end across the whole dataset instead.... no need for all of the specialized sorting networks for tiny partitions.

    Ok. So I am very interested in sorting algorithms and your claim caught my attention. I've read through your paper and very briefly through your code, but I am doubting the code you have posted will produce the results that you claim... unless you have very poor implementations of the other algorithms. Can you share the source for the program used to build your "complete results?" I'd like to see my version of quicksort given a run for its money, but so far your algorithm doesn't measure up to STL, much less an optimal quicksort.

    These are the results on my machine with the two additional versions of quicksort that I added for reference: (also swapped "clock" for "queryperformancecounter" to get better timing accuracy)

    Sorting 10000 random elements, 4000 times.....
    std::sort() run time: 2962.09 ms
    Triple State Quicksort run time: 3141.77 ms
    herb Quicksort run time: 1568.32 ms
    Dual Pivot Quicksort run time: 1893.14 ms

    btw - I have written a multithreaded version of my optimal quicksort that would likely bring the time down to around 800-1000ms. Theoretically, multithreading yours wouldn't be difficult, but given the code, your algorithm would be a nightmare.

    Monday, November 07, 2016 7:10 AM
  • @test123we, Thanks for your reply.

    I just tried the same, turning off CALL_BACK and removing the compare function from std:sort. My triple state still performed faster.

    Sorting 10000 random elements, 4000 times.....
    std::sort() run time: 6657 ms
    Triple State Quicksort run time: 5175 ms

    Which brings me to the question. What are doing?!!. First make sure of the following

    1. Download the PC version (not array version). File named: 

    2. To disable call back. You have to #undef CALL_BACK. Don't just comment the line #define CALL_BACK. 

    3. You probably know that when you remove the compare function from std:sort() and CALL_BACK. It will use the > < operators. This will work well for integers, BUT if your using larger elements (like the elem I have), it has to end up calling a costume operator, which may end up being inlined in one sorting function but not the other. Which will affect the fairness of the test. OR, if there is no costume operator, the compiler may try to cast the value to an integer then compare using int > < operators, while the other function may be using a - (minus) operator then comparing to zero. Fairness problem again.

    All my tests on the paper are based on USING A COMMON (OR SIMILAR) CALLBACK FUNCTION. This is essential for a controlled fair test. If you don't do that, you have to make sure each comparison takes nearly the same amount of CPU time in both functions that you are testing.

    Also, practically speaking. There is no point of sorting mere integers. This is not what happens in real world databases or even an excel sheet. You usually sort a list of records (large elements), where the key value to compare may be an integer (or a string) but the record is much larger. A well written library function needs to handle, a multitude of situations, large elements, none random input, sorted, reversed, mixed, large ranges, small ranges..etc. After you take the averages and deviation of all tests, only then you can judge the efficiency of an algorithm. And still, thats only a good guess. 

    • Edited by S0lo Monday, November 07, 2016 12:18 PM
    Monday, November 07, 2016 12:17 PM
  • Cool, thanks for the guide
    Monday, November 07, 2016 1:25 PM