flat assembler
Message board for the users of flat assembler.

Index > Macroinstructions > Efficient generic sorting algorithms

Author
Thread Post new topic Reply to topic
l_inc



Joined: 23 Oct 2009
Posts: 881
l_inc 10 Jan 2015, 01:35
Hello, everyone. I've been working on basic sorting algorithms in fasm macrolanguage for the last two weeks. Now I'd like to share some results.

Following macros are generic sorting algorithms. They could be used to sort whatever values during compile time by providing a custom comparison macro. E.g., if you need to sort an array of pointers to strings you'd do something like this:
Code:
include 'shsort.inc'

irp s,'we','are','gonna','sort','this','sentence'
{
    forward local p
        p db s,0
    common label pStrings dword
    forward dd p
        common countof.pStrings = ($-pStrings)/4
}
macro ifless s1*,s2*
{
    local c1,c2
    while 1
        load c1 from s1+%-1
        load c2 from s2+%-1
        if c1 = 0 | c1 <> c2
            break
        end if
    end while
    if c1 < c2
}
mShellSort pStrings,countof.pStrings,ifless,4    

I.e., the first argument must be a pointer to an array (with an optional specification of an addressing space), the second argument is a number of elements in the array, the third argument is a name of a custom comparison macro that ends with an if-statement header, and the forth argument is the size of an individual array element (1,2,4 or 8 ). I guess, the most commonly used example for the above sorting is the standard export building macro in fasm, which btw. uses a somewhat inferior implementation of a binary Shell Sort.

The following macros are implemented:
1) mInsertionSort, which is the only algorithm here having quadratic complexity. It's advantages however are its simplicity together with the stability property.
2) mBinShellSort, which is a non-algorithmic improvement of the sorting used by the fasm standard export building macros. This implementation is 10 to 40 percent faster on average.
3) mCombSort. It's somewhat faster than the previous algorithm.
4) mShellSort. This is presumably the fastest Shell Sort known until now. Contrary to the descending powers of two used by mBinShellSort for gaps this implementation uses the Tokuda sequence.
5) mHeapSort. A heavily optimized Heap Sort well known for its guaranteed n*log(n) complexity.
6) mQuickSort. A heavily optimized iterative Quick Sort. Besides the fact that recursive isn't really possible, iterative implementations are also more efficient. This implementation uses 0x20 pairs of dwords to simulate the stack. It also uses the split [x<=pivot; pivot<=x]. The split together with the choice of pivot in the middle avoids degradation on trivial inputs such as sorted arrays ([x<pivot; x==pivot; pivot<x] becomes 50 to 100 percent worse) or arrays with many equal elements ([x<=pivot; pivot<x] becomes quadratic). However I'd still generally avoid Quick Sort as there's always a killer sequence that makes it quadratic and hence unacceptably slow (see the graph below).
7) mMergeSort. A heavily optimized Merge Sort. Very fast. And the second stable (!) in this set of macros. It's only disadvantage is that it has linear (auxiliary) space complexity, which means for this implementation half the largest array sorted during the compilation. The half-array approach makes it somewhat adaptive.

I could hardly imagine that the macros can be improved any further with respect to performance. Smile Except for mInsertionSort that could be using binary search, but remain quadratic, and maybe mQuickSort with respect to the choice of the pivot, which is IMHO pointless for reasons mentioned above. The performance of all the macros was tested with the following inputs:
Code:
macro caseBigRandom      { file 'rdrand_8M_0.bin':0, ARRAY_COUNT*4                       }
macro caseBigSorted      { file 'rdrand_8M_1.bin':0, ARRAY_COUNT*4                       }
macro caseBigReversed    { file 'rdrand_8M_2.bin':0, ARRAY_COUNT*4                       }
macro caseBigSplitSorted { file 'rdrand_8M_3.bin':0, ARRAY_COUNT*2
                           file 'rdrand_8M_3.bin':$400000, ARRAY_COUNT*2                 }
macro caseSplitCopy   { times ARRAY_COUNT dd % mod (ARRAY_COUNT/2)                       }
macro caseManyCopy    { times ARRAY_COUNT dd % mod $FF                                   }
macro caseAllCopy     { times ARRAY_COUNT dd $01234567                                   }    

, where
"rdrand_8M_0.bin" contains rdrand output from my Haswell,
"rdrand_8M_1.bin" — presorted rdrand output,
"rdrand_8M_2.bin" — reversely presorted rdrand output, and
"rdrand_8M_3.bin" — output of rdrand that was split into 2 chunks, then each was sorted separately, and the result was concatenated back.
Each file is 8MB large. The last one comes a bit close to the Quick Sort killer sequence, and caseSplitCopy is even closer.

Additionally all macros have been tested for correctness using random inputs of all sizes from 2 to 1024. Some neat pictures are attached (more pictures in the archive*). The binary input files (random data) are not attached (mainly because of size, and because it's pointless to attach them). To shortly summarize: Merge Sort is the winner.

* the 7z archive inside is actually an *.ods, which I had to repack because of the board limitation for attachment size. Hence, you'll need to reverse that to see the performance data and more graphs.


Description:
Filesize: 132.9 KB
Viewed: 4663 Time(s)

random.duration.png


Description:
Filesize: 119.28 KB
Viewed: 4663 Time(s)

splitcopy.ccount.png


Description:
Download
Filename: results.zip
Filesize: 526.29 KB
Downloaded: 457 Time(s)


_________________
Faith is a superposition of knowledge and fallacy
Post 10 Jan 2015, 01:35
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20432
Location: In your JS exploiting you and your system
revolution 10 Jan 2015, 02:23
How about a radix sort? That would be good to see as a preprocessor macro.
Post 10 Jan 2015, 02:23
View user's profile Send private message Visit poster's website Reply with quote
l_inc



Joined: 23 Oct 2009
Posts: 881
l_inc 10 Jan 2015, 02:50
revolution
Radix sort requires O(n) memory, is not better than n*log(n) and most importantly is not general purpose (hence cannot be made as generic as the above algorithms). Therefore I don't think it makes sense to implement it.

_________________
Faith is a superposition of knowledge and fallacy
Post 10 Jan 2015, 02:50
View user's profile Send private message Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3499
Location: Bulgaria
JohnFound 10 Jan 2015, 07:39
l_inc, this is really great! I always knew you are FASM macro genious! Smile
Post 10 Jan 2015, 07:39
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
l_inc



Joined: 23 Oct 2009
Posts: 881
l_inc 10 Jan 2015, 14:56
JohnFound
Implementing existing algorithms is rather a mechanical coding work than a sign of geniality, but still thank you. Smile I appreciate appreciation.

_________________
Faith is a superposition of knowledge and fallacy
Post 10 Jan 2015, 14:56
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.