flat assembler
Message board for the users of flat assembler.
DimonSoft 09 Jan 2018, 06:09
x64 wrote:
> This is a function that I use with my sorting functions, I use it to test if a buffer is sorted or not, it is quite fast as it is, but I want to learn and understand more, can it be improved somewhere, maybe everywhere for all I know?

From your description I'd say (even before reading the code itself) that you can improve the function by making it work for arbitrary data items. It may not be a useful improvement in your particular case, but it may be useful in general.

Another thing I'd like to mention is that a sorted array is just one case out of N!, where N is the number of elements, i.e. you may be optimizing a rare case here, one which is not a bottleneck for many sorting algorithms. So if you control the array contents from the very beginning and sorted arrays are quite common in your task, it may be better to store a "Sorted" flag next to the array, thus decreasing the complexity of the check from O(N) to O(1). But then again, it depends.
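A minimal sketch of what that flag could look like in fasm (the descriptor layout, the field names and the struct macro from the standard Win64 includes are all assumptions, not anything from x64's code):

Code:
; Hypothetical descriptor kept next to the buffer; whatever code mutates
; the array is responsible for clearing the flag again.
struct ARRAYDESC
        Count   dq ?
        Sorted  db ?            ; non-zero once the array is known sorted
ends

; O(1) fast path before the O(N) scan (rbx = descriptor address)
        cmp     byte [rbx + ARRAYDESC.Sorted], 0
        jnz     .AlreadySorted
        ; ...fall through to the full scan only when the flag is clear...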
x64 10 Jan 2018, 06:45
Hi there DimonSoft. I don't know what to "name" the function; I guess I could call it a "garbage" function that I use during debugging, just to check that my sort functions did the job. But I thought it might be useful learning to see what could be done better, even if it is just a "debugging" function.
I tried to use some of the same functionality in my quicksort algorithms to try to remove their "bottlenecks", but I couldn't remove everything, and it was too time consuming to work more on it; it was not worth it. Here are some of the things I managed to remove (a combined pass for the third point is sketched below):

1. An already sorted list would be done immediately.
2. Shuffling each member (quite fast) helped tremendously for partially sorted lists.
3. By counting the pivot at the same time you check if the list is sorted, you can skip large repeating occurrences and get smaller partitions in the end.

But I gave up; I couldn't remove all the bottlenecks, so it did not satisfy me to have one or two left that I couldn't deal with. I basically wanted to take QuickSort and "chop off" the enormous spikes it produces, turning it into a stable sort algorithm. Doing one full pass over the buffer before the sorting starts might not be such a bad idea: let's say the buffer is 10 MB, the pass is done very quickly, in a matter of microseconds. But one would have to gather a lot of data in that one pass.
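For what it's worth, point 3 can be folded into a single scan along these lines (a rough sketch with hypothetical register roles; the last element's pivot test is left out for brevity):

Code:
; rbx = pBuf, rcx = Count-1, r9d = chosen pivot
; rax accumulates borrows (stays 0 while sorted), r10 counts pivot hits
        xor     rax, rax
        xor     r10, r10
.Pass:
        mov     r8d, [rbx]
        cmp     r8d, r9d
        sete    dl
        movzx   edx, dl
        add     r10, rdx        ; r10 += (value == pivot)
        mov     edx, [rbx+4]
        cmp     edx, r8d        ; is the next element below the current one?
        sbb     rax, 0          ; rax -= CF
        add     rbx, 4
        sub     rcx, 1
        jnz     .Pass
; rax = 0 -> already sorted; r10 = pivot repeat count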
x64 10 Jan 2018, 07:10
In that one pass I gathered the following data (the pivot fill is sketched below):

* The sorted state of the buffer.
* The number of times the chosen pivot value repeated itself (so I could basically use MemFill with the pivot value rather than checking and moving those elements).
* Naturally you get the number of non-pivot values from the last point, and the loop can run on a counter rather than checking whether the left and right sides have crossed, which greatly simplifies the loop.

All values smaller than the pivot ended up at the bottom, greater at the top, and MemFill writes the pivot into the center of the buffer. If the pivot value repeated itself, I would only partition values that are not the pivot, so the buffer gets smaller faster when the pivot repeats. Let's say the buffer has 1000 members and the pivot you choose repeats itself 800 times; then the sub-partitions would only be 100 members each on the second round.

But if one makes one full pass, it should be done in reverse, so that the whole beginning of the buffer fills up the cache for easy reference once the sorting starts.
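For the MemFill step, a single rep stosd can write the whole pivot run once the counting pass has produced the two counts. A sketch under assumed register roles (not x64's actual code):

Code:
; rbx = pBuf, rcx = number of elements < pivot,
; r8  = pivot repeat count, r9d = pivot value
        lea     rdi, [rbx + rcx*4]      ; start of the pivot run
        mov     eax, r9d
        mov     rcx, r8
        rep     stosd                   ; "MemFill" the whole run at once
; recurse only into [0, #less) and [#less + #equal, Count)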
DimonSoft 10 Jan 2018, 17:46
Found a minute to take a look at the code. I don’t have much experience in 64-bit programming (not interested yet) but I’ll try to give a few mode/bitness-agnostic tips.
Code:
    mov r8d, [rdx]
    lea rdx, [rdx+4]
    sub rcx, 1
    mov rsi, rdx
    mov rdi, rdx
    add rsi, 4
    add rdi, 8

I haven't checked, but add rdx, 4 might take fewer bytes to encode. dec rcx might also be smaller. It also seems like the last four instructions would take less space if changed to lea rsi, [rdx+4] / lea rdi, [rdx+8]. Smaller code has better chances of being cache-friendly, and the last change replaces two interleaved pairs of dependent instructions with two independent instructions, which may be better for pipelining. But this code is executed only once per call, so the (possible) performance improvement might not be observable.

Code:
.Loop:
    mov r9d, [rdx]
    lea rdx, [rdx+12]
    cmp r9d, r8d
    sbb rax, 0
    sub rcx, 1
    jz .done
    mov r10d, [rsi]
    lea rsi, [rsi+12]
    cmp r10d, r9d
    sbb rax, 0
    sub rcx, 1
    jz .done
    mov r11d, [rdi]
    lea rdi, [rdi+12]
    cmp r11d, r10d
    sbb rax, 0
    sub rcx, 1
    jz .done
    mov ebx, [rdx]
    lea rdx, [rdx+12]
    cmp ebx, r11d
    sbb rax, 0
    sub rcx, 1
    jz .done
    mov r12d, [rsi]
    lea rsi, [rsi+12]
    cmp r12d, ebx
    sbb rax, 0
    sub rcx, 1
    jz .done
    mov r13d, [rdi]
    lea rdi, [rdi+20]
    cmp r13d, r12d
    sbb rax, 0
    sub rcx, 1
    jz .done
    mov r14d, [rdx]
    lea rdx, [rdx+8]
    cmp r14d, r13d
    sbb rax, 0
    sub rcx, 1
    jz .done
    mov r15d, [rsi]
    lea rsi, [rsi+8]
    cmp r15d, r14d
    sbb rax, 0
    sub rcx, 1
    jz .done
    shr rax, 64
    mov r8d, r15d
    jnc .Loop

Is this just loop unrolling? Did it actually show a performance boost? The suggestions about replacing lea with add are still valid here: if the loop code is smaller, it has better chances to fit in the caches and in the internal CPU storage used to prefetch several instructions at once (can't remember its name right now).

Why don't you use r15d instead of r8d before the loop? Or r8d instead of r15d? Then you can read directly into such a register before the last comparison of the loop iteration and avoid the mov r8d, r15d.

Code:
    shr rax, 64

Isn't this instruction a no-op due to the masking done to shift counts? What do you expect it to do?

Code:
    not rax
    shr rax, 63

Is this a strange way to convert rax=non-zero to 1, rax=0 to 0? Here you're actually setting rax to its inverted most significant bit. From your procedure description, you return either TRUE or FALSE. For a debugging function it is enough to return 0 for FALSE and non-zero for TRUE, since you'll be using something like test rax, rax / jcc ... in the caller anyway. That is the way a lot of Windows API functions are documented to work (although they might currently return 0 or 1 only).
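For anyone following along: the cmp/sbb pair above is a branchless accumulator. Traced on illustrative values:

Code:
    xor rax, rax        ; start at 0, assume sorted
    cmp r9d, r8d        ; CF = 1 exactly when r9d < r8d (out of order)
    sbb rax, 0          ; rax = rax - 0 - CF
; rax stays 0 while every pair is in order and walks down
; through -1, -2, ... as violations are found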
x64 10 Jan 2018, 19:32
This code runs 2% faster. (Can this one be improved?)
Code:
align 4
proc _IsSortedUDwA uses rbx rsi rdi r12 r13 r14, Count, pBuf
    ; If Count < 2, we treat the buffer as sorted
    cmp rcx, 2
    lea rcx, [rcx-1]
    mov r8d, [rdx]
    mov rax, 1
    jb .return
    irps reg, rsi rdi r10 r11 r12 r13 r14 { mov reg, rdx }
    OFFS EQU 4
    irps reg, rdx rsi rdi r10 r11 r12 r13 r14 {
        add reg, OFFS
        OFFS EQU OFFS+4
    }
    xor rax, rax
align 4
.Loop:
    mov r9d, [rdx]
    lea rdx, [rdx+32]
    cmp r9d, r8d
    sbb rax, 0
    sub rcx, 1
    jz .done
    mov r8d, [rsi]
    lea rsi, [rsi+32]
    cmp r8d, r9d
    sbb rax, 0
    sub rcx, 1
    jz .done
    mov r9d, [rdi]
    lea rdi, [rdi+32]
    cmp r9d, r8d
    sbb rax, 0
    sub rcx, 1
    jz .done
    mov r8d, [r10]
    lea r10, [r10+32]
    cmp r8d, r9d
    sbb rax, 0
    sub rcx, 1
    jz .done
    mov r9d, [r11]
    lea r11, [r11+32]
    cmp r9d, r8d
    sbb rax, 0
    sub rcx, 1
    jz .done
    mov r8d, [r12]
    lea r12, [r12+32]
    cmp r8d, r9d
    sbb rax, 0
    sub rcx, 1
    jz .done
    mov r9d, [r13]
    lea r13, [r13+32]
    cmp r9d, r8d
    sbb rax, 0
    sub rcx, 1
    jz .done
    mov r8d, [r14]
    lea r14, [r14+32]
    cmp r8d, r9d
    sbb rax, 0
    sub rcx, 1
    jz .done
    shr rax, 64
    jnc .Loop
    jmp .return
.done:
    not rax
    shr rax, 63
.return:
    ret
endp

Last edited by x64 on 11 Jan 2018, 18:00; edited 2 times in total
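A hypothetical call site, assuming the Win64 register convention the proc header implies (rcx = Count, rdx = pBuf); the buffer and count names here are placeholders:

Code:
    mov rcx, [BufCount]     ; number of dwords (placeholder name)
    lea rdx, [Buffer]       ; pointer to the dword array (placeholder name)
    call _IsSortedUDwA
    test rax, rax
    jz .NotSorted           ; rax = 0 -> an out-of-order pair exists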
x64 10 Jan 2018, 19:42
You're right, dec is smaller, but I had to bring back the sub again because dec added a few extra cycles to the execution time. Yes, the function is unrolled 8 times. The return value is for the sake of cleanliness (TRUE/FALSE).
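For reference, the size difference is one byte; a plausible (unverified here) explanation for the extra cycles is that dec, unlike sub, leaves CF unmodified, which can cost a flags merge on some microarchitectures when a later instruction reads the combined flags:

Code:
; dec rcx      -> 48 FF C9       (3 bytes, CF untouched)
; sub rcx, 1   -> 48 83 E9 01    (4 bytes, CF updated)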
DimonSoft 11 Jan 2018, 05:56
x64 wrote:
> You're right, dec is smaller, but I had to bring back the sub again because dec added a few extra cycles to the execution time. Yes, the function is unrolled 8 times. The return value is for the sake of cleanliness (TRUE/FALSE).

My previous tips (except for inc/dec) are still there for you to try. I'm especially worried about the shift by 64 bits. You seem to rely on CF after the shift, but if you read the Intel Software Developer's Manual carefully, you'll see that CF is undefined after shifts where the count is >= the destination operand size.

BTW, I wonder if you're really interested in rax being a negated counter (sbb rax, 0). Your function still just returns 0 or 1. Why not just branch on carry?

Code:
    mov rax, TRUE
.Loop:
    ...
    cmp r9d, r8d
    jc .Unsorted
    sub rcx, 1
    jz .EndProc
    ...
.Unsorted:
    xor rax, rax
.EndProc:
    ret

Branching may be slower, but that should be tested anyway. Note that your not rax / shr rax, 63 code relies on the most significant bit being 1 for unsorted arrays. Although you'll never have an array for which the counter would underflow, such code strikes me as not very future-proof.

BTW, why would one optimize performance for a debugging-only function? Is it a bottleneck in your production code? Does it make your debugging extremely slow? I mean, any optimization is better when done in the context of a particular task and its requirements.
x64 11 Jan 2018, 18:59
Isn't it better to test whether it underflows once per 8 iterations, considering that a cache line today is never smaller than 32 bytes? (Once you've grabbed the first dword, there is no point in stopping there.)

I will try the jc version later to see if it gives any improvements.
DimonSoft 11 Jan 2018, 19:15
x64 wrote:
> Isn't it better to test whether it underflows once per 8 iterations, considering that a cache line today is never smaller than 32 bytes? (Once you've grabbed the first dword, there is no point in stopping there.)

What I tried to do was to replace counting+conversion with more straightforward logic: using sbb just to get a boolean seemed strange to me. And as soon as unsorted neighbours are found, there's no need to continue the loop, so I suggested jumping out of it immediately. A forward branch is statically predicted not taken, so such jc's will be predicted correctly all the time.
x64 11 Jan 2018, 19:24
Code:
    mov rax, TRUE
align 4
.Loop:
    mov r9d, [rdx]
    lea rdx, [rdx+32]
    cmp r9d, r8d
    jc .Unsorted
    sub rcx, 1
    jz .return
    mov r8d, [rsi]
    lea rsi, [rsi+32]
    cmp r8d, r9d
    jc .Unsorted
    sub rcx, 1
    jz .return
    mov r9d, [rdi]
    lea rdi, [rdi+32]
    cmp r9d, r8d
    jc .Unsorted
    sub rcx, 1
    jz .return
    mov r8d, [r10]
    lea r10, [r10+32]
    cmp r8d, r9d
    jc .Unsorted
    sub rcx, 1
    jz .return
    mov r9d, [r11]
    lea r11, [r11+32]
    cmp r9d, r8d
    jc .Unsorted
    sub rcx, 1
    jz .return
    mov r8d, [r12]
    lea r12, [r12+32]
    cmp r8d, r9d
    jc .Unsorted
    sub rcx, 1
    jz .return
    mov r9d, [r13]
    lea r13, [r13+32]
    cmp r9d, r8d
    jc .Unsorted
    sub rcx, 1
    jz .return
    mov r8d, [r14]
    lea r14, [r14+32]
    cmp r8d, r9d
    jc .Unsorted
    sub rcx, 1
    jnz .Loop
.Unsorted:
    xor rax, rax
.return:
    ret
endp

Running time went up by 41% with this version.
x64 11 Jan 2018, 19:30
Another 3% speed improvement over the last one. What could be wrong with this one? What can be done better?
Code:
align 4
.Loop:
    mov r9d, [rdx]
    lea rdx, [rdx+32]
    cmp r9d, r8d
    sbb rax, 0
    sub rcx, 1
    jz .done
    mov r8d, [rsi]
    lea rsi, [rsi+32]
    cmp r8d, r9d
    sbb rax, 0
    sub rcx, 1
    jz .done
    mov r9d, [rdi]
    lea rdi, [rdi+32]
    cmp r9d, r8d
    sbb rax, 0
    sub rcx, 1
    jz .done
    mov r8d, [r10]
    lea r10, [r10+32]
    cmp r8d, r9d
    sbb rax, 0
    sub rcx, 1
    jz .done
    mov r9d, [r11]
    lea r11, [r11+32]
    cmp r9d, r8d
    sbb rax, 0
    sub rcx, 1
    jz .done
    mov r8d, [r12]
    lea r12, [r12+32]
    cmp r8d, r9d
    sbb rax, 0
    sub rcx, 1
    jz .done
    mov r9d, [r13]
    lea r13, [r13+32]
    cmp r9d, r8d
    sbb rax, 0
    sub rcx, 1
    jz .done
    mov r8d, [r14]
    lea r14, [r14+32]
    cmp r8d, r9d
    sbb rax, 0
    sub rcx, 1
    jz .done
    test rax, rax
    jns .Loop
.done:
    not rax
    shr rax, 63
.return:
    ret
endp
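Tracing the epilogue's two cases (no new logic, just the arithmetic spelled out):

Code:
; rax = 0  (sorted)  : not rax = 0FFFFFFFFFFFFFFFFh, MSB = 1 -> shr 63 -> 1 (TRUE)
; rax = -k (unsorted): MSB of rax is 1, so MSB of not rax = 0 -> shr 63 -> 0 (FALSE)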
DimonSoft 11 Jan 2018, 19:35
x64 wrote:
> Running time went up by 41% with this version.

It didn't. It went up by up to 100% of the running time, since now it depends on the actual values in the array: it stops as soon as it finds neighbours in the wrong order, so it consumes nearly "zero" time if the first two items are in the wrong order, and it could even have become slower for sorted arrays. This needs to be measured.

Your last attempt seems to have new bugs introduced. Seven times out of eight you jump to the .return label when rcx becomes zero. But at the end of the loop body you perform a jnz .Loop. If rcx has become zero right before it, you fall through to the next instruction and return 0 instead of 1.

In your latest version you may also perform a few useless comparisons after you find the first wrong-ordered pair. If your measurements are correct, this seems to have a lower performance impact than correctly predicted branches. That's quite possible, but recheck your experiment once again.

You may also want to zero out rax before the loop, jc out of it when you find wrong-ordered neighbours, and then, after your .done/.Unsorted label, just use setnc al. And don't forget you can try to implement the same logic with SIMD instructions; SSE2 is part of the x86-64 baseline, so it is available on all 64-bit platforms.

Last edited by DimonSoft on 12 Jan 2018, 05:38; edited 2 times in total
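A minimal sketch of the setnc variant (unrolling collapsed to a single block for clarity; register roles copied from the code above):

Code:
    xor eax, eax        ; writing eax also zeroes the upper half of rax
align 4
.Loop:
    mov r9d, [rdx]
    lea rdx, [rdx+32]
    cmp r9d, r8d
    jc .done            ; CF = 1 -> wrong order, bail out
    sub rcx, 1
    jnz .Loop           ; sub leaves CF = 0 here, since rcx never underflows
.done:
    setnc al            ; CF = 0 only when the scan ran to completion
    ret

And a rough SSE2 sketch of the same check (a hedged sketch: the unaligned tail and a non-multiple-of-4 count are not handled; pcmpgtd is a signed compare, hence the sign-bit bias to make it behave unsigned):

Code:
align 16
SignBias dd 4 dup (80000000h)

    movdqa xmm4, dqword [SignBias]
    pxor xmm0, xmm0             ; violation accumulator
.Loop:
    movdqu xmm1, [rdx]          ; a[i]   .. a[i+3]
    movdqu xmm2, [rdx+4]        ; a[i+1] .. a[i+4]
    pxor xmm1, xmm4             ; bias both inputs so the signed
    pxor xmm2, xmm4             ; compare acts like an unsigned one
    pcmpgtd xmm1, xmm2          ; -1 in each lane where a[i] > a[i+1]
    por xmm0, xmm1              ; remember any violation
    add rdx, 16
    sub rcx, 4
    ja .Loop
    pmovmskb eax, xmm0
    test eax, eax               ; 0 -> no violations -> sorted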
x64 11 Jan 2018, 19:40
The bad order gets detected by mine too. I tested (with the carry bug) with already sorted buffers, so the carry bug doesn't affect the time results in any of the tests. After the carry bug was removed, the timings didn't change.

Maybe if you place an out-of-order sequence within the first eight dwords, your function would gain an advantage, but is it really worth it, considering the maximum extra time is 41%? What happens if the very last elements are out of order? Please show me your way of doing it, if you don't mind.
DimonSoft 11 Jan 2018, 20:03
x64 wrote:
> The bad order gets detected by mine too. I tested (with the carry bug) with already sorted buffers, so the carry bug doesn't affect the time results in any of the tests. After the carry bug was removed, the timings didn't change.

Sorry, I forgot about the branch on CF at the end of the original loop implementation. In that case you don't gain much algorithmically. Definitely, you won't get a large performance boost by possibly skipping up to 7 steps, but if you measure debug-targeted functions, even such a case may be interesting to you.

Besides, note the edit to my previous post. Replacing cmp+sbb with cmp+jc let me simplify the logic and suggest a more straightforward way to get a boolean result, as well as avoid a new bug with the end-of-loop branch. Not that it is "The Absolutely Right Thing To Do", but I find writing more readable code more important, since it helps me avoid bugs and optimize the code at the algorithmic level, which is generally more effective than squeezing every single tick out of low-level optimizations. After all, the whole low-level optimization might disappear once you put a procedure into a real project with production code, due to something like instruction pairing, cache misses, <put_your_favourite_crazy_hardware_optimization_stuff_here>.

I wonder what array size you use for testing and what absolute values you get in your measurements.
x64 11 Jan 2018, 21:18
It's just a garbage function, and it is highly dependent on memory speeds; it's just for learning, not much else. One can take what one learns and apply it elsewhere later. But even if it is highly memory dependent, there is still some "compact" room for optimization; it just seems irrelevant when dealing with large amounts of memory.

I use pretty much any array size, but for the tests I've done here I've used an array of 65,536 dwords (262,144 bytes) and done huge numbers of repeats on that array, so it would have ended up pretty high in the cache. As the function stands now, I can do 65,536 dwords in 22 microseconds (a sorted array, so it actually parses the entire array every time).

With an array size of 10 million dwords I get 3.4 ms (lowest), 3.5 ms (average) and 3.7 ms (maximum); those are from performing the test multiple times, not from changing the order of the array.
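For scale (simple arithmetic on the numbers above, assuming each figure covers one full pass): 262,144 bytes / 22 µs ≈ 11.9 GB/s, and 10,000,000 dwords = 40,000,000 bytes / 3.4 ms ≈ 11.8 GB/s. Both measurements work out to roughly the same ~12 GB/s of scan throughput, which supports the point that the function is essentially memory-bound at both sizes.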