flat assembler
Message board for the users of flat assembler.

 Index > Main > Test function (IsSorted) optimization
x64

Joined: 08 Jan 2018
Posts: 14
x64
This is a function that I use with my sorting functions to test whether a buffer is sorted or not. It is quite fast as it is, but I want to learn and understand more: can it be improved somewhere, maybe everywhere for all I know?

Quote:

; Entry:
; rcx = The total number of members in the buffer
; rdx = Address of a buffer to be tested
; Return:
; rax = TRUE if sorted or FALSE if unsorted
Code:
```
align 4
proc _IsSortedUDwA uses rbx rsi rdi r12 r13 r14 r15, Count, pBuf

; If Count < 2, we treat the buffer as sorted
mov             rax, 1
cmp             rcx, 2
jb              .return

mov             r8d, [rdx]
lea             rdx, [rdx+4]
sub             rcx, 1
mov             rsi, rdx
mov             rdi, rdx
xor             rax, rax

align 4
.Loop:
mov             r9d, [rdx]
lea             rdx, [rdx+12]
cmp             r9d, r8d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r10d, [rsi]
lea             rsi, [rsi+12]
cmp             r10d, r9d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r11d, [rdi]
lea             rdi, [rdi+12]
cmp             r11d, r10d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             ebx, [rdx]
lea             rdx, [rdx+12]
cmp             ebx, r11d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r12d, [rsi]
lea             rsi, [rsi+12]
cmp             r12d, ebx
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r13d, [rdi]
lea             rdi, [rdi+20]
cmp             r13d, r12d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r14d, [rdx]
lea             rdx, [rdx+8]
cmp             r14d, r13d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r15d, [rsi]
lea             rsi, [rsi+8]
cmp             r15d, r14d
sbb             rax, 0
sub             rcx, 1
jz              .done

shr             rax, 64
mov             r8d, r15d
jnc             .Loop
jmp             .return

.done:
not             rax
shr             rax, 63
.return:
ret
endp
```
08 Jan 2018, 21:48
DimonSoft

Joined: 03 Mar 2010
Posts: 937
Location: Belarus
DimonSoft
x64 wrote:
This is a function that I use with my sorting functions to test whether a buffer is sorted or not. It is quite fast as it is, but I want to learn and understand more: can it be improved somewhere, maybe everywhere for all I know?

Quote:

; Entry:
; rcx = The total number of members in the buffer
; rdx = Address of a buffer to be tested
; Return:
; rax = TRUE if sorted or FALSE if unsorted

From your description I’d say (even before reading the code itself) that you can improve the function by making it work for arbitrary data items. It may not be a useful improvement in your particular case, but it may be useful in general.

Another thing I’d like to mention is that a sorted array is just one case out of N!, where N is the number of elements, i.e. you may be optimizing a rare case here which is not a bottleneck for many sorting algorithms. So in case you control the array contents from the very beginning and sorted arrays are quite common in your task, it may be better to store a “Sorted” flag next to the array, thus decreasing the complexity of the check from O(N) to O(1). But then again, it depends.
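A sketch of that flag idea in C, just to illustrate (the type and function names are invented for this example, not taken from any real code): the mutators keep the flag honest, so the check itself becomes O(1).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container carrying a "Sorted" flag next to the data. */
typedef struct {
    uint32_t *data;
    size_t    count;
    int       sorted;   /* maintained by every write */
} flagged_array;

void fa_set(flagged_array *a, size_t i, uint32_t v)
{
    a->data[i] = v;
    /* Cheap local re-check: if the new value breaks order with either
       neighbour, the array can no longer be sorted.  (A conservative
       version could simply clear the flag on every write.) */
    if ((i > 0 && a->data[i - 1] > v) ||
        (i + 1 < a->count && v > a->data[i + 1]))
        a->sorted = 0;
}

int fa_is_sorted(const flagged_array *a)
{
    return a->sorted;   /* O(1) instead of an O(N) scan */
}
```

Note the flag can stay conservatively 0 even if a later write happens to restore order; that trade-off is what keeps the update cheap.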
09 Jan 2018, 06:09
x64

Joined: 08 Jan 2018
Posts: 14
x64
Hi there DimonSoft. I don't know what to "name" the function. I guess I could call it a "garbage" function that I use during debugging, just to check that my sort functions did the job. But I thought it might be useful learning to see what could be done better, even if it is just a "debugging" function.

I tried to use some of the same functionality in my quicksort algorithms to try to remove their "bottlenecks", but I couldn't remove everything, and it was too time consuming to work more on it; it was not worth it.

Here are some of the things I managed to remove:

1. Already sorted list would be done immediately
2. By shuffling each member (quite fast), for partially sorted lists helped tremendously
3. By counting the pivot at the same time you check if it is sorted, you could skip large repeating occurrences and get smaller partitions in the end

But I gave up, I couldn't remove all bottlenecks so it did not satisfy me to have one or two bottlenecks left that I couldn't deal with.

I basically wanted to take QuickSort and "chop" off the enormous spikes it produces into a stable sort algorithm.

Doing one full pass over the buffer before the sorting starts might not be such a bad idea, because let's say the buffer is 10 MB: the pass is done very quickly, in a matter of microseconds. But one would have to gather a lot of data in that one pass.
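Point 3 above can be sketched in C as a Dutch-national-flag style three-way partition (all names here are invented for illustration; this is not the actual code from the thread). Every copy of the pivot lands in the middle in one pass, so a frequently repeated pivot shrinks both sub-partitions:

```c
#include <stddef.h>
#include <stdint.h>

/* Three-way partition: a[0..lt) < pivot, a[lt..gt) == pivot, a[gt..n) > pivot. */
void partition3(uint32_t *a, size_t n, uint32_t pivot,
                size_t *lt_end, size_t *gt_begin)
{
    size_t lo = 0, i = 0, hi = n;
    while (i < hi) {
        if (a[i] < pivot) {
            uint32_t t = a[lo]; a[lo] = a[i]; a[i] = t;
            lo++; i++;
        } else if (a[i] > pivot) {
            hi--;                              /* grow the > region downward */
            uint32_t t = a[hi]; a[hi] = a[i]; a[i] = t;
        } else {
            i++;                               /* pivot stays in the middle */
        }
    }
    *lt_end = lo;
    *gt_begin = hi;
}
```

Only the `< pivot` and `> pivot` regions need to be recursed into, which is the "smaller partitions" effect described in point 3.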
10 Jan 2018, 06:45
x64

Joined: 08 Jan 2018
Posts: 14
x64
In that one pass that I did I gathered the following data:

* Sorted state of the buffer
* The number of times the chosen pivot value repeated itself (So I could basically use MemFill with the pivot value rather than checking and moving them)
* Naturally you would get the number of non-pivot values from the last point and the loop could be a counter rather than checking if left and right sides have crossed which greatly simplifies the loop

All values smaller than the pivot ended up at the bottom, greater at the top, with MemFill writing the pivot into the center of the buffer. If the pivot value repeated itself, I would only partition values that are not the pivot, so the buffer gets smaller faster when the pivot repeats itself.

Let's say the buffer has 1000 members and the pivot you choose repeats itself 800 times; then the sub-partitions would only be 100 members each on the second round. But if one makes one full pass, it should be done in reverse so that the whole beginning of the buffer fills up the cache for easy reference once the sorting starts.
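The one-pass gather described above could be sketched in C like this (type and function names are invented for illustration): one scan yields both the sorted state and the pivot's repeat count.

```c
#include <stddef.h>
#include <stdint.h>

/* Results of the single pre-sort pass over the buffer. */
typedef struct {
    int    sorted;       /* 1 if non-descending (unsigned dwords) */
    size_t pivot_count;  /* occurrences of the chosen pivot value */
} scan_info;

scan_info scan_buffer(const uint32_t *buf, size_t count, uint32_t pivot)
{
    scan_info info = { 1, 0 };
    for (size_t i = 0; i < count; i++) {
        if (buf[i] == pivot)
            info.pivot_count++;
        if (i > 0 && buf[i - 1] > buf[i])
            info.sorted = 0;     /* found an out-of-order pair */
    }
    return info;
}
```

With the 1000-member example above, `pivot_count == 800` immediately tells the partitioner that only 200 values need to be moved around the MemFill'd center.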
10 Jan 2018, 07:10
DimonSoft

Joined: 03 Mar 2010
Posts: 937
Location: Belarus
DimonSoft
Found a minute to take a look at the code. I don’t have much experience in 64-bit programming (not interested yet) but I’ll try to give a few mode/bitness-agnostic tips.

Code:
```
        mov             r8d, [rdx]
lea             rdx, [rdx+4]
sub             rcx, 1
mov             rsi, rdx
mov             rdi, rdx
```
I haven’t checked, but add rdx, 4 might take fewer bytes to encode. dec rcx might also be smaller. It also seems like the last four instructions would take less space if changed to lea rsi, [rdx + 4] / lea rdi, [rdx + 4]. Smaller code has better chances to be cache-friendly, and the last change replaces two interleaved pairs of dependent instructions with two independent instructions, which may be better for pipelining. But this code is executed only once per call, so the (possible) performance improvement might not be observable.

Code:
```
.Loop:
mov             r9d, [rdx]
lea             rdx, [rdx+12]
cmp             r9d, r8d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r10d, [rsi]
lea             rsi, [rsi+12]
cmp             r10d, r9d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r11d, [rdi]
lea             rdi, [rdi+12]
cmp             r11d, r10d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             ebx, [rdx]
lea             rdx, [rdx+12]
cmp             ebx, r11d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r12d, [rsi]
lea             rsi, [rsi+12]
cmp             r12d, ebx
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r13d, [rdi]
lea             rdi, [rdi+20]
cmp             r13d, r12d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r14d, [rdx]
lea             rdx, [rdx+8]
cmp             r14d, r13d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r15d, [rsi]
lea             rsi, [rsi+8]
cmp             r15d, r14d
sbb             rax, 0
sub             rcx, 1
jz              .done

shr             rax, 64
mov             r8d, r15d
jnc             .Loop
```

Is this just loop unrolling? Did it actually show performance boost?

Suggestions about replacing lea with add are still valid here: if the loop code is smaller, it has better chances to fit in the caches and in the internal CPU storage used to prefetch several instructions at once (can’t remember its name right now).

Why don’t you use r15d instead of r8d before the loop? Or r8d instead of r15d? Then you can read directly into such a register before the last comparison of the loop iteration and avoid the mov r8d, r15d.

Code:
`        shr             rax, 64    `

Isn’t this instruction a no-op due to the masking done to shift counts? What do you expect it to do?

Code:
```
        not             rax
        shr             rax, 63
```

Is it a strange way to convert rax=non-zero to 1, rax=0 to 0? Here you’re actually setting rax to its inverted most significant bit. From your procedure description, you return either TRUE or FALSE. For a debugging function it is enough to return 0 for FALSE and non-zero for TRUE, since you’ll be using something like test rax, rax / jcc ... in the caller anyway. That is the way a lot of Windows API functions are documented to work (although they might currently return 0 or 1 only).
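For reference, the not/shr pair translates to C like this (a sketch of the same bit trick; the function name is invented, and it assumes rax is either 0 or a negative counter, as the sbb chain guarantees):

```c
#include <stdint.h>

/* The not rax / shr rax, 63 epilogue: rax stays 0 while no out-of-order
   pair is seen and goes negative (sbb rax, 0 on carry) otherwise, so the
   inverted sign bit is exactly the TRUE/FALSE result. */
uint64_t not_shr_63(uint64_t rax)
{
    return ~rax >> 63;   /* 1 if the MSB of rax was clear, 0 if it was set */
}
```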
10 Jan 2018, 17:46
x64

Joined: 08 Jan 2018
Posts: 14
x64
This code runs 2% faster. (Can this one be improved?)

Code:
```
align 4
proc _IsSortedUDwA uses rbx rsi rdi r12 r13 r14, Count, pBuf

; If Count < 2, we treat the buffer as sorted
cmp             rcx, 2
lea             rcx, [rcx-1]
mov             r8d, [rdx]
mov             rax, 1
jb              .return

irps reg, rsi rdi r10 r11 r12 r13 r14 { mov reg, rdx }
OFFS EQU 4
irps reg, rdx rsi rdi r10 r11 r12 r13 r14 {
        add             reg, OFFS
        OFFS EQU OFFS+4
}
xor             rax, rax

align 4
.Loop:
mov             r9d, [rdx]
lea             rdx, [rdx+32]
cmp             r9d, r8d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r8d, [rsi]
lea             rsi, [rsi+32]
cmp             r8d, r9d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r9d, [rdi]
lea             rdi, [rdi+32]
cmp             r9d, r8d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r8d, [r10]
lea             r10, [r10+32]
cmp             r8d, r9d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r9d, [r11]
lea             r11, [r11+32]
cmp             r9d, r8d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r8d, [r12]
lea             r12, [r12+32]
cmp             r8d, r9d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r9d, [r13]
lea             r13, [r13+32]
cmp             r9d, r8d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r8d, [r14]
lea             r14, [r14+32]
cmp             r8d, r9d
sbb             rax, 0
sub             rcx, 1
jz              .done

shr             rax, 64
jnc             .Loop
jmp             .return

.done:
not             rax
shr             rax, 63
.return:
ret
endp
```

Last edited by x64 on 11 Jan 2018, 18:00; edited 2 times in total
10 Jan 2018, 19:32
x64

Joined: 08 Jan 2018
Posts: 14
x64
You're right, dec is smaller, but I had to bring back the sub again because the dec gave me a tiny few more cycles of execution time. Yes, the function is unrolled 8 times. The return value is for the sake of cleanliness (TRUE/FALSE).
10 Jan 2018, 19:42
DimonSoft

Joined: 03 Mar 2010
Posts: 937
Location: Belarus
DimonSoft
x64 wrote:
You're right, dec is smaller, but I had to bring back the sub again because the dec gave me a tiny few more cycles of execution time. Yes, the function is unrolled 8 times. The return value is for the sake of cleanliness (TRUE/FALSE).

My previous tips (except for inc/dec) are still there for you to try. I’m especially worried about the shift by 64 bits. You seem to rely on CF after the shift, but if you read the Intel Software Developer’s Manual carefully, you’ll see that CF is undefined after shifts where the count is >= the destination operand size.

BTW, I wonder if you’re really interested in rax being a negated counter (sbb rax, 0). Your function still just returns 0 or 1. Why not just branch on carry?

Code:
```
        mov             rax, TRUE
.Loop:
...
cmp             r9d, r8d
jc              .Unsorted
sub             rcx, 1
jz              .EndProc
...
.Unsorted:
xor             rax, rax
.EndProc:
ret
```

Branching may be slower, but that should be tested anyway. Note that your not rax / shr rax, 63 code relies on the most significant bit being 1 for unsorted arrays. Although you’ll never have an array for which the counter would underflow, such code draws my attention as not very future-proof.

BTW, why would one optimize performance for a debugging-only function? Is it a bottleneck in your production code? Does it make your debugging extremely slow? I mean, any optimizations are better when done in the context of particular task and requirements.
11 Jan 2018, 05:56
x64

Joined: 08 Jan 2018
Posts: 14
x64
Isn't it better to test whether it underflows once per 8 iterations, if you consider that a cache line today is never below 32 bytes? (Once you've grabbed the first dword, there is no point in stopping there.)

I will try the jc version later to see if it gives any improvements.
11 Jan 2018, 18:59
DimonSoft

Joined: 03 Mar 2010
Posts: 937
Location: Belarus
DimonSoft
x64 wrote:
Isn't it better to test whether it underflows once per 8 iterations, if you consider that a cache line today is never below 32 bytes? (Once you've grabbed the first dword, there is no point in stopping there.)

What I tried to do was to replace counting+conversion with more straightforward logic: using sbb just to get a boolean seemed strange to me. And, as soon as unsorted neighbours are found, there’s no need to continue the loop, so I suggested to jump out of it immediately.

A forward branch is statically predicted not to be taken, so such jc’s will get correct predictions all the time.
11 Jan 2018, 19:15
x64

Joined: 08 Jan 2018
Posts: 14
x64
Code:
```
        mov             rax, TRUE

align 4
.Loop:
mov             r9d, [rdx]
lea             rdx, [rdx+32]
cmp             r9d, r8d
jc              .Unsorted
sub             rcx, 1
jz              .return

mov             r8d, [rsi]
lea             rsi, [rsi+32]
cmp             r8d, r9d
jc              .Unsorted
sub             rcx, 1
jz              .return

mov             r9d, [rdi]
lea             rdi, [rdi+32]
cmp             r9d, r8d
jc              .Unsorted
sub             rcx, 1
jz              .return

mov             r8d, [r10]
lea             r10, [r10+32]
cmp             r8d, r9d
jc              .Unsorted
sub             rcx, 1
jz              .return

mov             r9d, [r11]
lea             r11, [r11+32]
cmp             r9d, r8d
jc              .Unsorted
sub             rcx, 1
jz              .return

mov             r8d, [r12]
lea             r12, [r12+32]
cmp             r8d, r9d
jc              .Unsorted
sub             rcx, 1
jz              .return

mov             r9d, [r13]
lea             r13, [r13+32]
cmp             r9d, r8d
jc              .Unsorted
sub             rcx, 1
jz              .return

mov             r8d, [r14]
lea             r14, [r14+32]
cmp             r8d, r9d
jc              .Unsorted
sub             rcx, 1
jnz             .Loop

.Unsorted:
xor             rax, rax
.return:
ret
endp
```

It went up by 41% in running time.
11 Jan 2018, 19:24
x64

Joined: 08 Jan 2018
Posts: 14
x64
Another 3% speed improvement from the last one. What could be wrong with this one, what can be done better?
Code:
```
align 4
.Loop:
mov             r9d, [rdx]
lea             rdx, [rdx+32]
cmp             r9d, r8d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r8d, [rsi]
lea             rsi, [rsi+32]
cmp             r8d, r9d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r9d, [rdi]
lea             rdi, [rdi+32]
cmp             r9d, r8d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r8d, [r10]
lea             r10, [r10+32]
cmp             r8d, r9d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r9d, [r11]
lea             r11, [r11+32]
cmp             r9d, r8d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r8d, [r12]
lea             r12, [r12+32]
cmp             r8d, r9d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r9d, [r13]
lea             r13, [r13+32]
cmp             r9d, r8d
sbb             rax, 0
sub             rcx, 1
jz              .done

mov             r8d, [r14]
lea             r14, [r14+32]
cmp             r8d, r9d
sbb             rax, 0
sub             rcx, 1
jz              .done

test            rax, rax
jns             .Loop

.done:
not             rax
shr             rax, 63
.return:
ret
endp
```
11 Jan 2018, 19:30
DimonSoft

Joined: 03 Mar 2010
Posts: 937
Location: Belarus
DimonSoft
x64 wrote:
Code:
```
        mov             rax, TRUE

align 4
.Loop:
mov             r9d, [rdx]
lea             rdx, [rdx+32]
cmp             r9d, r8d
jc              .Unsorted
sub             rcx, 1
jz              .return

mov             r8d, [rsi]
lea             rsi, [rsi+32]
cmp             r8d, r9d
jc              .Unsorted
sub             rcx, 1
jz              .return

mov             r9d, [rdi]
lea             rdi, [rdi+32]
cmp             r9d, r8d
jc              .Unsorted
sub             rcx, 1
jz              .return

mov             r8d, [r10]
lea             r10, [r10+32]
cmp             r8d, r9d
jc              .Unsorted
sub             rcx, 1
jz              .return

mov             r9d, [r11]
lea             r11, [r11+32]
cmp             r9d, r8d
jc              .Unsorted
sub             rcx, 1
jz              .return

mov             r8d, [r12]
lea             r12, [r12+32]
cmp             r8d, r9d
jc              .Unsorted
sub             rcx, 1
jz              .return

mov             r9d, [r13]
lea             r13, [r13+32]
cmp             r9d, r8d
jc              .Unsorted
sub             rcx, 1
jz              .return

mov             r8d, [r14]
lea             r14, [r14+32]
cmp             r8d, r9d
jc              .Unsorted
sub             rcx, 1
jnz             .Loop

.Unsorted:
xor             rax, rax
.return:
ret
endp
```

It went up by 41% in running time.

It didn’t. It went up by up to 100% of the running time, since now it depends on the actual values in the array: it stops as soon as it finds neighbours in the wrong order, so it consumes nearly “zero” time if the first two items are in the wrong order, and it could even have become slower for sorted arrays. This needs to be measured.

Your last attempt seems to have new bugs introduced. 7 times out of 8 you jump to .return label if rcx becomes zero. But at the end of the loop body you perform a jnz .Loop. If rcx has become zero right before it, you go to the next instruction and return 0 instead of 1.

In your last attempt you may perform a few useless comparisons after you find the first wrong-ordered pair. If your measurements are correct, this seems to have a lower performance impact than correctly predicted branches. That’s quite possible, but recheck your experiment once again.

You may also want to zero out rax before the loop, jc out of it when you find wrong-ordered neighbours and then, after your .done/.Unsorted label just use setnc al.

And don’t forget you can try to implement the same logic with SIMD instructions. SSE2 seems to be supported on all 64-bit platforms.
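A sketch of that SIMD idea in C with SSE2 intrinsics (the function name and structure are invented for illustration; pcmpgtd is a signed compare, so the sign bit is flipped to get the unsigned ordering the asm uses):

```c
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Non-descending check on unsigned dwords, 4 adjacent pairs per step. */
int is_sorted_u32_sse2(const uint32_t *buf, size_t count)
{
    if (count < 2)
        return 1;                                  /* treat as sorted */
    const __m128i bias = _mm_set1_epi32((int32_t)0x80000000);
    size_t i = 0;
    /* Compare buf[i..i+3] with buf[i+1..i+4] in one shot. */
    for (; i + 4 < count; i += 4) {
        __m128i prev = _mm_loadu_si128((const __m128i *)(buf + i));
        __m128i next = _mm_loadu_si128((const __m128i *)(buf + i + 1));
        /* XOR with the sign bit turns signed pcmpgtd into unsigned >. */
        __m128i gt = _mm_cmpgt_epi32(_mm_xor_si128(prev, bias),
                                     _mm_xor_si128(next, bias));
        if (_mm_movemask_epi8(gt))
            return 0;                              /* some prev > next */
    }
    for (; i + 1 < count; i++)                     /* scalar tail */
        if (buf[i] > buf[i + 1])
            return 0;
    return 1;
}
```

Whether this beats the unrolled scalar version would have to be measured, of course; like the scalar loop, it is most likely limited by how fast the dwords arrive from memory.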

Last edited by DimonSoft on 12 Jan 2018, 05:38; edited 2 times in total
11 Jan 2018, 19:35
x64

Joined: 08 Jan 2018
Posts: 14
x64
The bad order gets detected by mine too. I tested (with the carry bug) with already sorted buffers, so the carry bug doesn't affect the time results in any of the tests. After the carry bug was removed, the timings didn't change.

Maybe if you place an out-of-order sequence within the first eight dwords, your function would gain an advantage, but is it really worth it if you consider the maximum extra time is 41%? What happens if the very last elements are out of order?

Please show me your way of doing it, if you don't mind.
11 Jan 2018, 19:40
DimonSoft

Joined: 03 Mar 2010
Posts: 937
Location: Belarus
DimonSoft
x64 wrote:
The bad order gets detected by mine too. I tested (with the carry bug) with already sorted buffers, so the carry bug doesn't affect the time results in any of the tests. After the carry bug was removed, the timings didn't change.

Maybe if you place an out-of-order sequence within the first eight dwords, your function would gain an advantage, but is it really worth it if you consider the maximum extra time is 41%? What happens if the very last elements are out of order?

Please show me your way of doing it, if you don't mind.

Sorry, forgot about the branch on CF at the end of the original loop implementation. In that case you don’t gain much algorithmically. Definitely, you won’t get a large performance boost by possibly skipping up to 7 steps, but if you measure debug-targeted functions, even such a case may be interesting to you.

Besides, note the edit to my previous post. Replacing cmp+sbb with cmp+jc let me simplify the logic and suggest a more straightforward way to get a boolean result, as well as avoid the new bug with the end-of-loop branch. Not that it is “The Absolutely Right Thing To Do”, but I find writing more readable code more important, since it helps me avoid bugs and optimize the code on the algorithmic level, which is generally more effective than squeezing out every single tick with low-level optimizations. After all, the whole low-level optimization might disappear after you put a procedure into a real project with production code, due to something like instruction pairing, cache misses, <put_your_favourite_crazy_hardware_optimization_stuff_here>.

I wonder what array size you use for testing and what absolute values you get in your measurements.
11 Jan 2018, 20:03
x64

Joined: 08 Jan 2018
Posts: 14
x64
It's just a garbage function, and it is highly dependent on memory speeds; it's just for learning, not much else. One can take what one learns and apply it elsewhere later. But even if it is highly memory dependent, there is still a "compact" room for optimization; it just seems irrelevant when dealing with large amounts of memory.

I use pretty much any array size, but for the tests I've done here I've used an array of 65,536 dwords (262,144 bytes) and done huge amounts of repeats on that array, so it would have ended up pretty high in the cache.

As the function stands now, I can do 65,536 dwords in 22 microseconds. (Sorted array, so it actually parses the entire array every time)

With an array size of 10 million dwords I get 3.4 ms (lowest), 3.5 ms (average), 3.7 ms (maximum); those are from performing the test multiple times, not from changing the order of the array.
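For what it's worth, both timings quoted above work out to roughly the same throughput, which supports the memory-bandwidth-bound picture. A quick C sketch of the arithmetic (the helper name is invented for this example):

```c
/* Throughput implied by a timing: dwords scanned per second, in GB/s
   (4 bytes per dword). */
double gb_per_s(double dwords, double seconds)
{
    return dwords * 4.0 / seconds / 1e9;
}
```

Plugging in the numbers: 65,536 dwords in 22 µs gives about 11.9 GB/s, and 10 million dwords in 3.4 ms gives about 11.8 GB/s, i.e. essentially identical rates despite the thousandfold size difference.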
11 Jan 2018, 21:18