flat assembler
Message board for the users of flat assembler.
Index
> Main > Sorting Algorithm, RadixSort Goto page 1, 2 Next |
Author |
|
madmatt 10 Apr 2006, 09:39
I have a quicksort in my TurboAsmDx Library, VERY FAST. I can sort signed and unsigned integers (ascending and descending), also single and double floats (ascending and descending). It uses recursion, so you can see how to do that in fasm assembly too. Here is the code for unsigned integer, descending.
Code: proc QuickSortDUD uses ebx, array, left, right locals pivot dd ? lhold dd ? rhold dd ? endl mov eax, [left] mov [lhold], eax mov eax, [right] mov [rhold], eax mov ecx, [array] mov edx, [left] mov eax, [ecx + edx*4] mov [pivot],eax mov eax, [left] .while eax, l, [right] mov ecx, [array] mov edx, [right] mov eax, [pivot] QuickSortDUD_contloop0: .if [ecx + edx*4], a, eax ;ge mov eax, [left] .if eax, l, [right] sub [right], 1 .else jmp QuickSortDUD_endloop0 .endif .else jmp QuickSortDUD_endloop0 .endif mov eax, [pivot] mov edx, [right] jmp QuickSortDUD_contloop0 QuickSortDUD_endloop0: mov eax, [left] .if eax, ne, [right] mov edx, [right] mov eax, [ecx + edx*4] mov edx, [left] mov [ecx + edx*4], eax add [left], 1 .endif mov ecx, [array] mov edx, [left] mov eax, [pivot] QuickSortDUD_contloop1: .if [ecx + edx*4], b, eax ;le mov eax, [left] .if eax, l, [right] add [left], 1 .else jmp QuickSortDUD_endloop1 .endif .else jmp QuickSortDUD_endloop1 .endif mov eax, [pivot] mov edx, [left] jmp QuickSortDUD_contloop1 QuickSortDUD_endloop1: mov eax, [left] .if eax, ne, [right] mov edx, [left] mov eax, [ecx + edx*4] mov edx, [right] mov [ecx + edx*4], eax sub [right], 1 .endif mov eax, [left] .endw mov ecx, [array] mov edx, [left] mov eax, [pivot] mov [ecx + edx*4], eax mov [pivot], edx mov eax, [lhold] mov [left], eax mov eax, [rhold] mov [right], eax mov eax, [pivot] .if [left], l, eax sub eax, 1 stdcall QuickSortDUD, [array], [left], eax .endif .if [right], g, eax add eax, 1 stdcall QuickSortDUD, [array], eax, [right] .endif return endp |
|||
10 Apr 2006, 09:39 |
|
Octavio 10 Apr 2006, 13:27
i did a quick test,
my qsort takes 9.7 seconds to sort the array on a P3 930Mhz i did the test on my OS, with windows you don´t know wich task is runing. |
|||
10 Apr 2006, 13:27 |
|
Madis731 10 Apr 2006, 16:07
I found a Q+I-sort (quick & insert) on the net and converted it. If I manage to make it work in your benchmark, I'll update my post.
EDIT: OK, I did it, but I see you want esi and edi preserved and you need some value in eax?! Well, at least it works somehow. Maybe mine is 4x faster because I wrote 40M instead of 10M, but my proc wants to know the length not the number of elements... Code: 2DC59F4 Result 1: 3636: 2DC5A00 Result 2: 2964: 1 Result 3: 2884: 1 Result 4: 2924: And some code in the attachment:
|
|||||||||||
10 Apr 2006, 16:07 |
|
shism2 10 Apr 2006, 22:16
Ok so why doesnt someone put all the algos here into 1 program and test the times.
|
|||
10 Apr 2006, 22:16 |
|
madmatt 10 Apr 2006, 22:17
Using my function above (QuickSortDUD [still as yet un-optimized, got it working and moved on to other things ]) I can sort 10,000,000 Random Integers in a little less than 1.5 seconds. A bit tough to get an accurate count with the timers that I have setup now. I did this test on a Celeron 2.7 ghz, with windowsxp sp2.
Last edited by madmatt on 10 Apr 2006, 22:18; edited 1 time in total |
|||
10 Apr 2006, 22:17 |
|
shism2 10 Apr 2006, 22:17
Code: mov esi, VectorToSort mov edi, LastElement + 4 call MainRSort push 0 call [ExitProcess] ; ESI = Initial address ; EDI = Last address proc MainRSort mov ebp, 3 ; Sort byte 3 call PreRSort mov ebp, 2 ; Sort byte 2 call PreRSort mov ebp, 1 ; Sort byte 1 call PreRSort xor ebp, ebp ; Sort byte 0 call PreRSort ret endp proc PreRSort push esi cmp ebp, 3 jz @@Sort3 cmp ebp, 2 jz @@Sort2 cmp ebp, 1 ; Select masks and rotation values jz @@Sort1 @@Sort0: mov eax, 0FFFFFF00h mov ebx, 0000000FFh mov ecx, 2 jmp @@Next00 @@Sort3: xor eax, eax mov ebx, 0FF000000h mov ecx, 0Ah jmp @@Next00 @@Sort2: mov eax, 0FF000000h mov ebx, 000FF0000h mov ecx, 12h jmp @@Next00 @@Sort1: mov eax, 0FFFF0000h mov ebx, 00000FF00h mov ecx, 1Ah @@Next00: mov [TempMask1], eax mov [TempMask2], ebx mov [ShiftMask], ecx @@Loop0: push edi mov ebx, esi mov edx, [ebx] xor ecx, ecx push esi mov esi, [TempMask1] and edx, esi xor ecx, ecx @@LoopCount: mov eax, [ebx] and eax, esi cmp eax, edx jnz @@Sort add ecx, 4 add ebx, 4 cmp ebx, edi jnz @@LoopCount @@Sort: pop esi cmp ecx, 4 jbe @@AlreadySorted mov edi, ebx push ebx call RSort pop ebx @@AlreadySorted: mov esi, ebx pop edi cmp esi, edi jb @@Loop0 pop esi ret endp proc RSort mov ebx, 100h*4 - 8 mov ecx, VectorTemp1 fldz ; Big optimization! We store a 0 Loop0: fst qword [ecx] ; in the FPU and overwrite with QWORDs add ecx, 8 ; instead of DWORDs. sub ebx, 8 jnz Loop0 fstp qword [ecx] push esi mov edx, [TempMask2] mov ecx, [ShiftMask] @@Next1: mov ebx, [esi] and ebx, edx rol ebx, cl mov eax, [VectorTemp1+ebx] add eax, 4 mov [VectorTemp1+ebx], eax add esi, 4 cmp esi, edi jnz @@Next1 pop esi mov ebx, VectorTemp1 mov ecx, 100h xor edx, edx @@Loop1: mov eax, [ebx] mov [ebx], edx add edx, eax add ebx, 4 ; Addition of indexes dec ecx jnz @@Loop1 push esi mov edx, [TempMask2] @@Next2: mov ebx, [esi] mov ecx, [ShiftMask] mov eax, ebx and ebx, edx rol ebx, cl mov ecx, [VectorTemp1+ebx] mov [VectorTemp2+ecx], eax add ecx, 4 mov [VectorTemp1+ebx], ecx add esi, 4 cmp esi, edi jnz @@Next2 pop esi push esi mov ebx, VectorTemp2 @@Loop2: mov eax, [ebx] mov [esi], eax add esi, 4 add ebx, 4 cmp esi, edi ; Copy the data. This part overwrites jnz @@Loop2 ; the unsorted vector with the sorted pop esi ; one at VectorTemp2. If you don't need ret ; it, you can eliminate this. endp TempMask1 dd 0 TempMask2 dd 0 ShiftMask dd 0 VectorTemp1 dd 256 dup (0) VectorTemp2 dd 256 dup (0) VectorToSort dd 100h ; Sample array of elements to sort. dd 99h dd 98h dd 87h dd 76h dd 65h dd 54h dd 43h dd 32h dd 21h LastElement dd 55676134h |
|||
10 Apr 2006, 22:17 |
|
Madis731 10 Apr 2006, 22:23
All algos are hard to test in all conditions. The quicksort can be as bad as some bubblesort, if you make the conditions hard for it. Although, sometimes it can quit earlier. There are very different algoritms for different memory, move, exchange, etc, costs. There's also a flash-sort that does the sorting in about O(n) time
depenging on the "ONE", the size of the memory used is about 110%-200% of the original data. EDIT: Yeah sorry, O(n), because O(1) is a look-up table Last edited by Madis731 on 11 Apr 2006, 21:21; edited 1 time in total |
|||
10 Apr 2006, 22:23 |
|
shism2 10 Apr 2006, 22:47
flash sort????
|
|||
10 Apr 2006, 22:47 |
|
r22 11 Apr 2006, 19:34
FlashSort and RadixSort both have O(n) time.
http://www.neubert.net/Flacodes/FLACodes.html When I benchmarked the JAVA RadixSort vs the JAVA FlashSort there timings were identical. JAVA version are on average 4x slower than the ASM versions. It would make testing the ASM versions easier if they followed a calling convention. |
|||
11 Apr 2006, 19:34 |
|
Madis731 12 Apr 2006, 07:49
Hi, I just had a vision
I don't remember what was the stride length in modern computers and is it even constant, but you should be able to sort one stride at a time and then insert sort all other strides. Would it be faster? i.e. stride is 32-bytes (this is just an example): You take 8 DWORDs and sort them for free any way you want. Because they are in one cache stride, you will only get a level 1 cache penalty which is 3 clocks. Okay, lets scale horizontally now - we have many cache strides (P4s have upto 4MB of cache nowdays) and we can read an array of strides in by accessing the first memory location of each of these. This can be done by dummy cmp [mem],0 or something. Maybe use some SSE instruction. We can even move chunks of 32-bytes between strides, because CPU architecture is optimized this way so REP MOVSD is less than 8 clocks when ECX=8. I don't know if this can be done faster with XMM. ----- Okay, another thought here: What if you read in four of the XMM registers (that is 128 bits or 8 bytes). Then you apply an algorithm on xmm0..xmm3 to sort the eight counterparts. I know that SSE is meant to deal with floats, but who cares We're still using XOR EAX,EAX to "move" 0 into eax. ----- Would any of these work or the code to make it work will be bloat? |
|||
12 Apr 2006, 07:49 |
|
r22 12 Apr 2006, 18:06
I don't see how that could work any faster. Because you'd probably have to break the strides up with you were insertion sorting them
Say we sort 2 at a time 100 5 60 20 6 80 (5, 100) (20, 60) (6, 55) To get to 5, 6, 20, 55, 60 would cost a lot of time Unless your thinking of something different. |
|||
12 Apr 2006, 18:06 |
|
shism2 13 Apr 2006, 01:28
How about a greator or less sort... By subtracitng lets say 9 from 6.. you get 3 but if you subtract 6-9 you get -3 . So
If you subtract a number from that number and the outcome is positive. the First number is greater than the second. If the outcome is negative the second number is greater. mov ecx,9 mov eax,7 sub ecx,eax js ecxless |
|||
13 Apr 2006, 01:28 |
|
r22 13 Apr 2006, 05:25
CMP reg,reg does the same thing, only better, because it doesn't change teh contents of the register.
Some sort of CMOVxx might be an interesting way to optimize a sorting algorithm that uses a lot of CMP. As the branchless technique is proven to be a lot faster. |
|||
13 Apr 2006, 05:25 |
|
shism2 13 Apr 2006, 14:23
Is there anyway to add many memory destinations at the same time example
1m dd 4 2m dd 5 3m dd 2 add [1m],[2m],[3m] = output (11) |
|||
13 Apr 2006, 14:23 |
|
f0dder 23 Apr 2006, 11:07
shism2, your code GPFs on me (somewhere in RSort I think).
Madis731, your code seems to fail to sort properly. Attached is my correctness + benchmark test, enjoy Quote:
EDIT: forgot to state my machine specs. AMD64x2 4400+, 4x512meg ram, ASUS A8N SLI Premium motherboard (for stability and passively cooled chipset, I don't run SLI). EDIT2: I also just realized the program says "signed", but test tests are unsigned
_________________ - carpe noctem Last edited by f0dder on 23 Apr 2006, 11:33; edited 2 times in total |
|||||||||||
23 Apr 2006, 11:07 |
|
okasvi 23 Apr 2006, 11:24
results:
Quote: C:\Documents and Settings\Administrator\My Documents\Downloads\sort_test>driver amd sempron 2800+, sucky i know... |
|||
23 Apr 2006, 11:24 |
|
UCM 23 Apr 2006, 15:39
sort_test wrote:
AMD Athlon X2 4200+ p.s. if you think this is a tad slow for my processor consider that i am running firefox, 7zip, regedt32.exe, 8 instances of explorer, 1 instance of Prime95 per core (2), PvPGN, OpenVPN, gaim, Skype, Apache, OpenOffice.org, Avast! Antivirus and 6 instances of Notepad. _________________ This calls for... Ultra CRUNCHY Man! Ta da!! *crunch* |
|||
23 Apr 2006, 15:39 |
|
shism2 23 Apr 2006, 16:22
That's funny fodder because it doesn't produce a GPF for me
|
|||
23 Apr 2006, 16:22 |
|
Vasilev Vjacheslav 23 Apr 2006, 20:22
maybe someone will be interested in this program (include source code)
http://www.codingcrew.de/marty/downloads/sorting_algorithms_in_assembly.zip |
|||
23 Apr 2006, 20:22 |
|
Goto page 1, 2 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.