flat assembler
Message board for the users of flat assembler.
 Home   FAQ   Search   Register 
 Profile   Log in to check your private messages   Log in 
flat assembler > Main > Test function (IsSorted) optimization

Author
Thread Post new topic Reply to topic
x64



Joined: 08 Jan 2018
Posts: 14

Test function (IsSorted) optimization

This is a function that I use with my sorting functions, I use it to test if a buffer is sorted or not, it is quite fast as it is, but I want to learn and understand more, can it be improved somewhere, maybe everywhere for all I know?


Quote:

; Entry:
; rcx = The total number of members in the buffer
; rdx = Address of a buffer to be tested
; Return:
; rax = TRUE if sorted or FALSE if unsorted


Code:
align 4
proc _IsSortedUDwA uses rbx rsi rdi r12 r13 r14 r15CountpBuf
        
        ; If Count < 2, we treat the buffer as sorted
        mov             rax1
        cmp             rcx2
        jb              .return
        
        mov             r8d, [rdx]
        lea             rdx, [rdx+4]
        sub             rcx1
        mov             rsirdx
        mov             rdirdx
        add             rsi4
        add             rdi8
        xor             raxrax
        
align 4
.Loop:
        mov             r9d, [rdx]
        lea             rdx, [rdx+12]
        cmp             r9dr8d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r10d, [rsi]
        lea             rsi, [rsi+12]
        cmp             r10dr9d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r11d, [rdi]
        lea             rdi, [rdi+12]
        cmp             r11dr10d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             ebx, [rdx]
        lea             rdx, [rdx+12]
        cmp             ebxr11d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r12d, [rsi]
        lea             rsi, [rsi+12]
        cmp             r12debx
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r13d, [rdi]
        lea             rdi, [rdi+20]
        cmp             r13dr12d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r14d, [rdx]
        lea             rdx, [rdx+8]
        cmp             r14dr13d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r15d, [rsi]
        lea             rsi, [rsi+8]
        cmp             r15dr14d
        sbb             rax0
        sub             rcx1
        jz              .done

        shr             rax64
        mov             r8dr15d
        jnc             .Loop
        jmp             .return
        
.done:
        not             rax
        shr             rax63
.return:        
        ret
endp


Post 08 Jan 2018, 21:48
View user's profile Send private message Reply with quote
DimonSoft



Joined: 03 Mar 2010
Posts: 102
Location: Belarus

Re: Test function (IsSorted) optimization


x64 wrote:
This is a function that I use with my sorting functions, I use it to test if a buffer is sorted or not, it is quite fast as it is, but I want to learn and understand more, can it be improved somewhere, maybe everywhere for all I know?


Quote:

; Entry:
; rcx = The total number of members in the buffer
; rdx = Address of a buffer to be tested
; Return:
; rax = TRUE if sorted or FALSE if unsorted



From your description I’d say (even before reading the code itself) that you can improve the function by making it work for arbitrary data items. It may not be useful improvement in your particular case but may be useful in general.

Another thing I’d like to mention is that sorted array is just one case out of N! where N is the number of elements, i.e. you may be optimizing rare case here which is not a bottleneck for many sorting algorithms. So in case you control the array contents from the very beginning and sorted arrays are quite common in your task, it may be better to store “Sorted” flag next to the array thus decreasing the complexity of the check from O(N) to O(1). But then again, it depends.
Post 09 Jan 2018, 06:09
View user's profile Send private message Reply with quote
x64



Joined: 08 Jan 2018
Posts: 14

Hi there DimonSoft. I don't know what to "name" the function, it is a, I guess I could call it a "garbage" function that I use during debugging, just to check that my sort functions did the job. But I thought it might be useful learning to see what could be done better, even if it is just a "debugging" function.

I tried to use some of the same functionality in my quicksort algorithms to try to remove the "bottlenecks" of it, but I couldn't remove everything, and it was too time consuming to work more on it, it was not worth it.

Here are some of the things I managed to remove:

1. Already sorted list would be done immediately
2. By shuffling each member (quite fast), for partially sorted lists helped tremendously
3. By counting the pivot at the same time you check if it is sorted, you could skip large repeating occurances and get smaller partitions in the end

But I gave up, I couldn't remove all bottlenecks so it did not satisfy me to have one or two bottlenecks left that I couldn't deal with.

I basically wanted to take QuickSort and "chop" off the enormous spikes it produces into a stable sort algorithm.

Doing a one full pass over the buffer before the sorting starts might not be such a bad idea, because lets say the buffer is 10 MB, it's done very quickly, in matter of microseconds. But one would have to gather a lot of data in that one pass.
Post 10 Jan 2018, 06:45
View user's profile Send private message Reply with quote
x64



Joined: 08 Jan 2018
Posts: 14

In that one pass that I did I gathered the following data:

* Sorted state of the buffer
* The number of times the chosen pivot value repeated itself (So I could basically use MemFill with the pivot value rather than checking and moving them)
* Naturally you would get the number of non-pivot values from the last point and the loop could be a counter rather than checking if left and right sides have crossed which greatly simplifies the loop

All values smaller than pivot ended up at the bottom, greater at the top, and MemFill the pivot in the center of the buffer, and if the pivot value repeated itself, I would only partition values that is not a pivot, so the buffer gets smaller faster if the pivot repeats itself.

Let's say the buffer has 1000 members and the pivot you choose repeats itself 800 times, then the sub partitions would only be 100 members each on the second round. But if one make one full pass it should be done in reverse so that the whole beginning of the buffer fills up the cache for easy reference once the sorting starts.
Post 10 Jan 2018, 07:10
View user's profile Send private message Reply with quote
DimonSoft



Joined: 03 Mar 2010
Posts: 102
Location: Belarus

Re: Test function (IsSorted) optimization

Found a minute to take a look at the code. I don’t have much experience in 64-bit programming (not interested yet) but I’ll try to give a few mode/bitness-agnostic tips.


Code:
        mov             r8d, [rdx]
        lea             rdx, [rdx+4]
        sub             rcx1
        mov             rsirdx
        mov             rdirdx
        add             rsi4
        add             rdi8

[/quote]
I haven’t chacked but add rdx, 4 might take less bytes to encode. dec rcx might also be smaller. It also seems like the last four instructions would take less space if changed to lea rsi, [rdx + 4] / lea rdi, [rdx + 4]. Smaller code has better chances to be cache-friendly, the last case replaces 2 interleaved pairs of instructions with dependency between them with two independent instructions which may be better for pipelining. But this code is executed only once per call so the (possible) performance improvement might not be observable.


Code:
.Loop:
        mov             r9d, [rdx]
        lea             rdx, [rdx+12]
        cmp             r9dr8d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r10d, [rsi]
        lea             rsi, [rsi+12]
        cmp             r10dr9d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r11d, [rdi]
        lea             rdi, [rdi+12]
        cmp             r11dr10d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             ebx, [rdx]
        lea             rdx, [rdx+12]
        cmp             ebxr11d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r12d, [rsi]
        lea             rsi, [rsi+12]
        cmp             r12debx
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r13d, [rdi]
        lea             rdi, [rdi+20]
        cmp             r13dr12d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r14d, [rdx]
        lea             rdx, [rdx+8]
        cmp             r14dr13d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r15d, [rsi]
        lea             rsi, [rsi+8]
        cmp             r15dr14d
        sbb             rax0
        sub             rcx1
        jz              .done

        shr             rax64
        mov             r8dr15d
        jnc             .Loop


Is this just loop unrolling? Did it actually show performance boost?

Suggestions about replacing lea with add are still valid here: if the loop code is smaller it has better chances to fit in caches and internal CPU storage used to prefetch several instructions at once (can’t remembar its name right now).

Why don’t you use r15d instead of r8d before the loop? Or r8d instead of r15d? Then you can read directly to such a register before the last comparison of loop interation and avoid mov r8d, r15d.


Code:
        shr             rax64


Isn’t this instruction a no-op due to the masking done to shift counts? What do you expect it to do?


Code:
        not             rax
        shr             rax63


Is it a strange way to convert rax=non-zero to 1, rax=0 to 0? Here you’re actually setting rax to its inverted most-significant bit. From your procedure description, you return either TRUE or FALSE. For debugging function it is enough to return 0 for FALSE, non-zero for TRUE since you’ll be using something like test rax, rax / jcc ... in the caller anyway. That is the way a lot of Windows API functions are documented to work (although they might currently return 0 or 1 only).
Post 10 Jan 2018, 17:46
View user's profile Send private message Reply with quote
x64



Joined: 08 Jan 2018
Posts: 14

This code runs 2% faster. (Can this one be improved?)


Code:

align 4
proc _IsSortedUDwA uses rbx rsi rdi r12 r13 r14CountpBuf
        
        ; If Count < 2, we treat the buffer as sorted
        cmp             rcx2
        lea             rcx, [rcx-1]
        mov             r8d, [rdx]
        mov             rax1
        jb              .return

        irps regrsi rdi r10 r11 r12 r13 r14 { mov regrdx }
        OFFS EQU 4
        irps regrdx rsi rdi r10 r11 r12 r13 r14 {
          add regOFFS
          OFFS EQU OFFS+4
        }
        xor             raxrax
        
align 4
.Loop:
        mov             r9d, [rdx]
        lea             rdx, [rdx+32]
        cmp             r9dr8d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r8d, [rsi]
        lea             rsi, [rsi+32]
        cmp             r8dr9d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r9d, [rdi]
        lea             rdi, [rdi+32]
        cmp             r9dr8d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r8d, [r10]
        lea             r10, [r10+32]
        cmp             r8dr9d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r9d, [r11]
        lea             r11, [r11+32]
        cmp             r9dr8d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r8d, [r12]
        lea             r12, [r12+32]
        cmp             r8dr9d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r9d, [r13]
        lea             r13, [r13+32]
        cmp             r9dr8d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r8d, [r14]
        lea             r14, [r14+32]
        cmp             r8dr9d
        sbb             rax0
        sub             rcx1
        jz              .done

        shr             rax64
        jnc             .Loop
        jmp             .return
        
.done:
        not             rax
        shr             rax63
.return:        
        ret
endp




Last edited by x64 on 11 Jan 2018, 18:00; edited 2 times in total
Post 10 Jan 2018, 19:32
View user's profile Send private message Reply with quote
x64



Joined: 08 Jan 2018
Posts: 14

You're right, dec is smaller but I had to bring back the sub again because the dec gave me a tiny few more cycles to the execution time. Yes the function is unrolled 8 times. The return value is for the sake of cleanliness (TRUE/FALSE)
Post 10 Jan 2018, 19:42
View user's profile Send private message Reply with quote
DimonSoft



Joined: 03 Mar 2010
Posts: 102
Location: Belarus


x64 wrote:
You're right, dec is smaller but I had to bring back the sub again because the dec gave me a tiny few more cycles to the execution time. Yes the function is unrolled 8 times. The return value is for the sake of cleanliness (TRUE/FALSE)


My previous tips (except for inc/dec) are still there for you to try. I’m especially worried about shift by 64 bits. You seem to rely on CF after the shift, but if you read Intel Software Developer’s Manual carefully you’ll see that CF is undefined after shifts where count is >= the destination operand size.

BTW, I wonder if you’re really interested in rax being a negated counter (sbb rax, 0). You function still just returns 0 or 1. Why not just branch on carry?


Code:
        mov             raxTRUE
.Loop:
        ...
        cmp             r9dr8d
        jc              .Unsorted
        sub             rcx1
        jz              .EndProc
        ...
.Unsorted:
        xor             raxrax
.EndProc:
        ret


Branching may be slower but that should be tested anyway. Note that your code not rax / shr rax, 63 relies on the most-significant bit being 1 for unsorted arrays. Although you’ll never have an array for which the counter would underflow such code asks for my attention as not-very-future-proof.

BTW, why would one optimize performance for a debugging-only function? Is it a bottleneck in your production code? Does it make your debugging extremely slow? I mean, any optimizations are better when done in the context of particular task and requirements.
Post 11 Jan 2018, 05:56
View user's profile Send private message Reply with quote
x64



Joined: 08 Jan 2018
Posts: 14

Isn't it better to test if it underflows once per 8 iterations if you consider that a line of cache today is never below 32 bytes? (Once you've grabbed the first dword, there i no point in stopping there)

I will try the jc version later to see if it gives any improvements.
Post 11 Jan 2018, 18:59
View user's profile Send private message Reply with quote
DimonSoft



Joined: 03 Mar 2010
Posts: 102
Location: Belarus


x64 wrote:
Isn't it better to test if it underflows once per 8 iterations if you consider that a line of cache today is never below 32 bytes? (Once you've grabbed the first dword, there i no point in stopping there)


What I tried to do was to replace counting+conversion with more straightforward logic: using sbb just to get a boolean seemed strange to me. And, as soon as unsorted neighbours are found, there’s no need to continue the loop, so I suggested to jump out of it immediately.

Forward branch is statically predicted not to be taken, so such jc’s will get right predictions all the time.
Post 11 Jan 2018, 19:15
View user's profile Send private message Reply with quote
x64



Joined: 08 Jan 2018
Posts: 14


Code:

        mov             raxTRUE

align 4
.Loop:
        mov             r9d, [rdx]
        lea             rdx, [rdx+32]
        cmp             r9dr8d
        jc              .Unsorted
        sub             rcx1
        jz              .return

        mov             r8d, [rsi]
        lea             rsi, [rsi+32]
        cmp             r8dr9d
        jc              .Unsorted
        sub             rcx1
        jz              .return

        mov             r9d, [rdi]
        lea             rdi, [rdi+32]
        cmp             r9dr8d
        jc              .Unsorted
        sub             rcx1
        jz              .return

        mov             r8d, [r10]
        lea             r10, [r10+32]
        cmp             r8dr9d
        jc              .Unsorted
        sub             rcx1
        jz              .return

        mov             r9d, [r11]
        lea             r11, [r11+32]
        cmp             r9dr8d
        jc              .Unsorted
        sub             rcx1
        jz              .return

        mov             r8d, [r12]
        lea             r12, [r12+32]
        cmp             r8dr9d
        jc              .Unsorted
        sub             rcx1
        jz              .return

        mov             r9d, [r13]
        lea             r13, [r13+32]
        cmp             r9dr8d
        jc              .Unsorted
        sub             rcx1
        jz              .return

        mov             r8d, [r14]
        lea             r14, [r14+32]
        cmp             r8dr9d
        jc              .Unsorted
        sub             rcx1
        jnz             .Loop

.Unsorted:
        xor             raxrax
.return:
        ret
endp




It did went up by 41% in running time. Question
Post 11 Jan 2018, 19:24
View user's profile Send private message Reply with quote
x64



Joined: 08 Jan 2018
Posts: 14

Another 3% speed improvement from the last one. What could be wrong with this one, what can be done better?

Code:

align 4
.Loop:
        mov             r9d, [rdx]
        lea             rdx, [rdx+32]
        cmp             r9dr8d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r8d, [rsi]
        lea             rsi, [rsi+32]
        cmp             r8dr9d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r9d, [rdi]
        lea             rdi, [rdi+32]
        cmp             r9dr8d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r8d, [r10]
        lea             r10, [r10+32]
        cmp             r8dr9d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r9d, [r11]
        lea             r11, [r11+32]
        cmp             r9dr8d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r8d, [r12]
        lea             r12, [r12+32]
        cmp             r8dr9d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r9d, [r13]
        lea             r13, [r13+32]
        cmp             r9dr8d
        sbb             rax0
        sub             rcx1
        jz              .done

        mov             r8d, [r14]
        lea             r14, [r14+32]
        cmp             r8dr9d
        sbb             rax0
        sub             rcx1
        jz              .done

        test            raxrax
        jns             .Loop

.done:
        not             rax
        shr             rax63
.return:
        ret
endp


Post 11 Jan 2018, 19:30
View user's profile Send private message Reply with quote
DimonSoft



Joined: 03 Mar 2010
Posts: 102
Location: Belarus


x64 wrote:

Code:

        mov             raxTRUE

align 4
.Loop:
        mov             r9d, [rdx]
        lea             rdx, [rdx+32]
        cmp             r9dr8d
        jc              .Unsorted
        sub             rcx1
        jz              .return

        mov             r8d, [rsi]
        lea             rsi, [rsi+32]
        cmp             r8dr9d
        jc              .Unsorted
        sub             rcx1
        jz              .return

        mov             r9d, [rdi]
        lea             rdi, [rdi+32]
        cmp             r9dr8d
        jc              .Unsorted
        sub             rcx1
        jz              .return

        mov             r8d, [r10]
        lea             r10, [r10+32]
        cmp             r8dr9d
        jc              .Unsorted
        sub             rcx1
        jz              .return

        mov             r9d, [r11]
        lea             r11, [r11+32]
        cmp             r9dr8d
        jc              .Unsorted
        sub             rcx1
        jz              .return

        mov             r8d, [r12]
        lea             r12, [r12+32]
        cmp             r8dr9d
        jc              .Unsorted
        sub             rcx1
        jz              .return

        mov             r9d, [r13]
        lea             r13, [r13+32]
        cmp             r9dr8d
        jc              .Unsorted
        sub             rcx1
        jz              .return

        mov             r8d, [r14]
        lea             r14, [r14+32]
        cmp             r8dr9d
        jc              .Unsorted
        sub             rcx1
        jnz             .Loop

.Unsorted:
        xor             raxrax
.return:
        ret
endp




It did went up by 41% in running time. Question


It didn’t. It went up by up to 100% of running time since now it depends on the actual values in the array: it stops as soon as it finds that neighbours are in the wrong order, so it consumes nearly “zero” time if the first two items are in the wrong order and could’ve even become slower for sorted arrays. This needs to be measured.

Your last attempt seems to have new bugs introduced. 7 times out of 8 you jump to .return label if rcx becomes zero. But at the end of the loop body you perform a jnz .Loop. If rcx has become zero right before it, you go to the next instruction and return 0 instead of 1.

In your last attempt you may check a few useless comparisons after you find the first wrong-ordered pair. If your measurements are correct, this seems to have lower performance impact than correctly predicted branches. That’s quite possible, but recheck your experiment once again.

You may also want to zero out rax before the loop, jc out of it when you find wrong-ordered neighbours and then, after your .done/.Unsorted label just use setnc al.

And don’t forget you can try to implement the same logic with SIMD instructions. SSE2 seems to be supported on all 64-bit platforms.


Last edited by DimonSoft on 12 Jan 2018, 05:38; edited 2 times in total
Post 11 Jan 2018, 19:35
View user's profile Send private message Reply with quote
x64



Joined: 08 Jan 2018
Posts: 14

The bad order gets detected by mine too. I tested (with the carry bug) with already sorted buffers, so the carry bug doesn't inflict the time results in any of the tests. After the carry bug was removed, the timings didn't change.

Maybe if you place an out of order sequence in within the first eight dwords, your function would gain an advantage, but is it really worth it if you consider the maximum extra time is 41%? What happens if the very last elements are out of order?

Please show me your way of doing it, if you don't mind.
Post 11 Jan 2018, 19:40
View user's profile Send private message Reply with quote
DimonSoft



Joined: 03 Mar 2010
Posts: 102
Location: Belarus


x64 wrote:
The bad order gets detected by mine too. I tested (with the carry bug) with already sorted buffers, so the carry bug doesn't inflict the time results in any of the tests. After the carry bug was removed, the timings didn't change.

Maybe if you place an out of order sequence in within the first eight dwords, your function would gain an advantage, but is it really worth it if you consider the maximum extra time is 41%? What happens if the very last elements are out of order?

Please show me your way of doing it, if you don't mind.


Sorry, forgot about the branch on CF at the end of the original loop implementation. In that case you don’t gain much algorithmically. Definitely, you won’t get large performance boost by possibly skipping up to 7 steps but if you measure debug-targeted functions even such case may be interesting to you.

Besides, note the edit to my previous post. Replacing cmp+sbb with cmp+jc let me simplify logic and suggest a more straightforward way to get boolean result. As well as avoid a new bug with end-of-loop branch. Not that it is “The Absolutely Right Thing To Do” but I find writing more readable code more important since it helps me to avoid bugs and optimize the code on the algorithmic level which is generally more effective than sucking out every single tick from low-level optimizations. After all, the whole low-level optimization might disappear after you put a procedure into real project with production code due to something like instruction pairing, cache misses, <put_your_favourite_crazy_hardware_optimization_stuff_here>.

I wonder what array size do you use for testing and what absolute values do you get in your measurements.
Post 11 Jan 2018, 20:03
View user's profile Send private message Reply with quote
x64



Joined: 08 Jan 2018
Posts: 14

It's just a garbage function, and it is highly dependent on memory speeds, it's just for learning not much else. One can take what one learns and apply it elsewhere later. But even if it is highly memory dependent, there is still a "compact" room for optimization, it just seem like it's irrelevant when dealing with large amounts of memory.

I use pretty much any array size but for the tests I've done here, I've used an array of 65536 dwords (262,144 bytes) and done huge amounts of repeats on that array, so it would have ended up pretty high in the cache.

As the function stands now, I can do 65,536 dwords in 22 microseconds. (Sorted array, so it actually parses the entire array every time)

With an array size of 10 million dwords I get 3.4 ms (Lowest) 3.5 ms (Average) 3.7 ms (Maximum) those are from performing the test multiple times, not by changing the order of the array.
Post 11 Jan 2018, 21:18
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >

Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Powered by phpBB © 2001-2005 phpBB Group.

Main index   Download   Documentation   Examples   Message board
Copyright © 2004-2017, Tomasz Grysztar.