flat assembler
Message board for the users of flat assembler.
jcc vs cmov - which is faster?
Nikolay Petrov 30 Jun 2009, 08:55
Of course not, but where possible, yes.
Some processors have very little memory, and using methods like these on them would be an extravagance.
_________________
regards
r22 30 Jun 2009, 16:27
Thought 1: Very odd that the LUT is the slowest. It's likely the penalty of the byte * operations and the fact that the elements in the LUT are not aligned.
Code:
align 16
NikolayPetrovR1:
    movzx eax, byte [esp+4]
    mov eax, [_is_dec_tableR+eax*4]
    ret
align 16
NikolayPetrovR2:
    mov eax, [esp+4] ;; movzx "should" be faster ...
    and eax, 0FFh
    mov eax, [_is_dec_tableR+eax*4]
    ret
align 16
_is_dec_tableR:
    dd 48 dup(0)
    dd 10 dup(1)
    dd 198 dup(0)

Thought 2: I think the real problem is that LUT should be used in parallel rather than one function call at a time.
Code:
mov esi,LUT
movzx eax,byte[Array+0]
movzx ebx,byte[Array+4]
movzx ecx,byte[Array+8]
movzx edx,byte[Array+12]
mov eax,[esi+eax*4]
mov ebx,[esi+ebx*4]
mov ecx,[esi+ecx*4]
mov edx,[esi+edx*4]

Thought 3: Could the fact that the LUT is in the Code section instead of Data change how the data is cached on the processor?

Thought 4: It's lunch time so I can indulge my curiosity for a few moments.
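For readers who don't follow fasm syntax, here is a rough C sketch of the two ideas above: a table with one dword per entry (Thought 1) and four independent lookups issued back to back (Thought 2). The names `is_dec_table`, `init_is_dec_table`, `is_dec_lut` and `is_dec_lut4` are made up for illustration and do not appear in the thread. Whether a compiler produces the same instruction sequence as the hand-written code is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table mirroring _is_dec_tableR:
   48 zeros, 10 ones ('0'..'9'), 198 zeros, one dword per entry. */
static uint32_t is_dec_table[256];

static void init_is_dec_table(void)
{
    for (int c = 0; c < 256; c++)
        is_dec_table[c] = (c >= '0' && c <= '9') ? 1u : 0u;
}

/* One lookup per call, in the spirit of NikolayPetrovR1. */
static uint32_t is_dec_lut(uint8_t c)
{
    return is_dec_table[c];
}

/* Four independent lookups in a row, loosely following Thought 2.
   Like the assembly, it reads the bytes at offsets 0, 4, 8 and 12,
   so `array` must hold at least 13 bytes. */
static void is_dec_lut4(const uint8_t *array, uint32_t out[4])
{
    assert(array != NULL && out != NULL);
    uint32_t a = is_dec_table[array[0]];
    uint32_t b = is_dec_table[array[4]];
    uint32_t c = is_dec_table[array[8]];
    uint32_t d = is_dec_table[array[12]];
    out[0] = a;
    out[1] = b;
    out[2] = c;
    out[3] = d;
}
```

Call `init_is_dec_table()` once before any lookup. The four loads in `is_dec_lut4` do not depend on each other, which is the property Thought 2 is after.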
r22 30 Jun 2009, 16:48
Code:
format PE console 4.0
entry _start

include 'win32ax.inc'

section '__text__' code readable executable

macro tester func
{
    local ..loop

    invoke Sleep, 1000

    xor eax, eax
    cpuid
    call [GetTickCount]
    mov [timestart], eax

    mov ebx, $80000000
align 16
..loop:
    push ebx ; Instead of push 0 to "sabotage" the branch predictor at my1 a bit
    call func
    add esp, 4
    sub ebx, 1
    jnz ..loop

    ; Serialize
    xor eax, eax
    cpuid
    call [GetTickCount]
    sub eax, [timestart]

    push eax
    call @f
    db `func, 0
@@:
    push fmt
    call [printf]
    add esp, 12
align 16
}

_start:
    invoke GetCurrentProcess
    invoke SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
    invoke GetCurrentThread
    invoke SetThreadPriority, eax, THREAD_PRIORITY_TIME_CRITICAL

    tester my1
    tester my2
    tester locos1
    tester borsucs
    tester locos2
    tester NikolayPetrov
    tester r22_codeLUT
    tester r22_dataLUT

    push _pause
    call [system]
    add esp,4
    xor eax, eax
    ret

align 16
my1:
    mov eax, [esp+4]
    cmp al, '0'
    jb .false
    cmp al, '9'
    ja .false
    mov eax, 1
    ret
.false:
    xor eax, eax
    ret

align 16
my2:
    mov edx, [esp+4]
    xor ecx, ecx
    mov eax, 1
    cmp edx, '0'
    cmovb eax, ecx
    cmp edx, '9'
    cmova eax, ecx
    ret

align 16
locos1:
    mov eax, [esp+4]
    sub eax, '0'
    cmp eax, 9
    setbe al
    movzx eax, al
    ret

align 16
borsucs:
    mov eax, [esp+4]
    sub eax, '0'
    cmp eax, 10
    sbb eax, eax
    ret

align 16
locos2:
    mov eax, -'0'
    add eax, [esp+4]
    cmp eax, 10
    sbb eax, eax
    ret

align 16
NikolayPetrov:
    movzx eax, byte [esp+4]
    movzx eax, byte [_is_dec_table+eax]
    ret

align 16
_is_dec_table:
    db 48 dup(0)
    db 10 dup(1)
    db 198 dup(0)

align 16
r22_dataLUT:
    movzx eax,byte[esp+4]
    mov eax,[_is_dec_DATA+eax*4] ; table in the data section
    ret

align 16
r22_codeLUT:
    movzx eax,byte[esp+4]
    mov eax,[_is_dec_CODE+eax*4] ; table in the code section
    ret

align 16
_is_dec_CODE:
    dd 48 dup(0)
    dd 10 dup(1)
    dd 198 dup(0)

section '__data__' data readable writable

align 16
_is_dec_DATA:
    dd 48 dup(0)
    dd 10 dup(1)
    dd 198 dup(0)

fmt db "%s: ", "%dms", 10, 0
timestart dd 0
_pause db 'pause',0

section '_import_' import readable

library msvcrt, 'msvcrt.dll',\
        kernel32, 'kernel32.dll'

import msvcrt,\
       printf, 'printf',\
       system, 'system'

include 'api/kernel32.inc'

Looks like this optimization depends on the processor.
Intel Core2 Q8200
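As a reading aid, the three non-table approaches in the listing above can be sketched in C roughly as follows. The function names are hypothetical, and a C compiler is free to emit branches or branchless code for any of them, so this only illustrates the logic, not the timings. `check_variants_agree` is a small self-check added for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Two explicit comparisons, in the spirit of my1 (jb/ja). */
static uint32_t is_dec_branch(uint32_t c)
{
    if (c < '0') return 0;
    if (c > '9') return 0;
    return 1;
}

/* Subtract '0' and do a single unsigned compare, like locos1:
   values below '0' wrap around to huge unsigned numbers and fail the test. */
static uint32_t is_dec_sub(uint32_t c)
{
    return (c - '0') <= 9u;
}

/* Like borsucs/locos2: "sbb eax, eax" after "cmp eax, 10" yields
   0xFFFFFFFF for a digit and 0 otherwise, i.e. a mask rather than 0/1. */
static uint32_t is_dec_mask(uint32_t c)
{
    return (uint32_t)0 - (uint32_t)((c - '0') < 10u);
}

/* Self-check: all three agree on "digit or not" for every byte value. */
static void check_variants_agree(void)
{
    for (uint32_t c = 0; c < 256; c++) {
        uint32_t expect = is_dec_branch(c);
        assert(is_dec_sub(c) == expect);
        assert((is_dec_mask(c) != 0) == (expect != 0));
    }
}
```

Note that the mask variant returns all ones instead of 1, so callers that test for "non-zero" work with every version, while callers that compare against 1 do not.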
LocoDelAssembly 30 Jun 2009, 17:01
Code:
my1: 10547ms
my2: 7563ms
locos1: 7562ms
borsucs: 6484ms
locos2: 6484ms
NikolayPetrov: 12016ms
r22_codeLUT: 12016ms
r22_dataLUT: 12016ms

AMD Athlon64 Venice (S939)
bitRAKE 01 Jul 2009, 01:24
Code:
my1: 7719ms
my2: 5609ms
locos1: 5563ms
borsucs: 5547ms
locos2: 5547ms
NikolayPetrov: 5546ms
r22_codeLUT: 5547ms
r22_dataLUT: 5547ms
LocoDelAssembly 01 Jul 2009, 01:54
For owners of multi-core/multi-thread machines the benchmark won't run very well, because the part that adjusts the affinity mask is missing. I'll have to rescue that code from asmcommunity or write it again.
manfred 01 Jul 2009, 08:34
Code:
my1: 8031ms
my2: 5750ms
locos1: 5750ms
borsucs: 4938ms
locos2: 4938ms
NikolayPetrov: 9156ms
r22_codeLUT: 9157ms
r22_dataLUT: 9157ms

The manufacturer is important, not the number of cores.
_________________
Sorry for my English...
Borsuc 01 Jul 2009, 14:12
AMDs are poor at caching; that's why the lookup table code doesn't do so well.
Nikolay Petrov 01 Jul 2009, 18:06
In the test example, in this loop:
Code:
align 16
..loop:
    push ebx ; Instead of push 0 to "sabotage" the branch predictor at my1 a bit
    call func
    add esp, 4
    sub ebx, 1
    jnz ..loop

comment out "push ebx" and "add esp, 4". In the procs, replace "mov reg, [esp+4]" with "mov reg, ebx" or "movzx reg, bl", and see the results.
_________________
regards
LocoDelAssembly 01 Jul 2009, 19:29
Nikolay Petrov, now I get the expected results. I have changed the benchmark so that all of the procs adhere to a fastcall-style calling convention (the argument is passed in eax). Also, I've fixed some of them because they were working on the entire dword instead of the lower byte as (I believe) they should.
New times:
Code:
my1: 11609ms
my2: 7563ms
locos1: 6484ms
borsucs: 6484ms
NikolayPetrov: 6484ms
r22_codeLUT: 6484ms
r22_dataLUT: 6468ms
--
my1: 11594ms
my2: 7563ms
locos1: 6485ms
borsucs: 6485ms
NikolayPetrov: 6485ms
r22_codeLUT: 6485ms
r22_dataLUT: 6485ms
--
my1: 11594ms
my2: 7641ms
locos1: 6469ms
borsucs: 6469ms
NikolayPetrov: 6485ms
r22_codeLUT: 6485ms
r22_dataLUT: 6485ms
--
my1: 11594ms
my2: 7563ms
locos1: 6485ms
borsucs: 6485ms
NikolayPetrov: 6485ms
r22_codeLUT: 6485ms
r22_dataLUT: 6485ms

New code:
Code:
format PE console 4.0
entry start

include 'win32ax.inc'

section '__text__' code readable executable

proc start
    local ProcessAffinityMask:DWORD, SystemAffinityMask:DWORD

    invoke GetCurrentProcess
    mov ebx, eax
    invoke SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
    invoke GetCurrentThread
    invoke SetThreadPriority, eax, THREAD_PRIORITY_TIME_CRITICAL

;;;; On multi-{core|CPU} systems it is required to set affinity as well but I don't have one :D
;;;; f0dder insisted on adding it anyway so here it is :P
    invoke GetProcessAffinityMask, ebx, addr ProcessAffinityMask, addr SystemAffinityMask
    test eax, eax
    jz .beginTests

; Let's choose only one of the processors allowed for this process
    mov esi, 1
    bsf ecx, [ProcessAffinityMask]
    shl esi, cl

; Now, since Win9x/Me is smart enough to have GetProcessAffinityMask but not its counterpart we must check its existence first
    invoke GetModuleHandle, 'KERNEL32.DLL'
    test eax, eax
    jz .beginTests

    invoke GetProcAddress, eax, 'SetProcessAffinityMask'
    test eax, eax
    jz .beginTests

    stdcall eax, ebx, esi

macro tester func
{
    local ..loop

    invoke Sleep, 1000

    xor eax, eax
    cpuid
    invoke GetTickCount
    mov [timestart], eax

    mov ebx, $80000000
align 16
..loop:
    mov eax, ebx
    call func
    dec ebx
    jnz ..loop

    ; Serialize
    xor eax, eax
    cpuid
    invoke GetTickCount
    sub eax, [timestart]

    push eax
    call @f
    db `func, 0
@@:
    push fmt
    call [printf]
    add esp, 12
align 16
}
;;;;;;;;;;;;;;;;;;;;;;;;;;
.beginTests:
    tester my1
    tester my2
    tester locos1
    tester borsucs
    tester NikolayPetrov
    tester r22_codeLUT
    tester r22_dataLUT

    push _pause
    call [system]
    add esp,4
    invoke ExitProcess, 0
endp
;;;;;;;;;;;;;;;;;;;;;;;;;;

align 16
my1:
    cmp al, '0'
    jb .false
    cmp al, '9'
    ja .false
    mov eax, 1
    ret
.false:
    xor eax, eax
    ret

align 16
my2:
    mov dl, al
    xor ecx, ecx
    mov eax, 1
    cmp dl, '0'
    cmovb eax, ecx
    cmp dl, '9'
    cmova eax, ecx
    ret

align 16
locos1:
    mov eax, ebx ; note: still tests the full dword (the loop counter in ebx), not just al
    sub eax, '0'
    cmp eax, 9
    setbe al
    movzx eax, al
    ret

align 16
borsucs:
    sub al, '0'
    cmp al, 10
    sbb eax, eax
    ret

align 16
NikolayPetrov:
    movzx eax, al
    movzx eax, byte [_is_dec_table + eax]
    ret

align 16
_is_dec_table:
    db 48 dup(0)
    db 10 dup(1)
    db 198 dup(0)

align 16
r22_dataLUT:
    movzx eax, al
    mov eax, [_is_dec_DATA + eax*4]
    ret

align 16
r22_codeLUT:
    movzx eax, al
    mov eax, [_is_dec_CODE+eax*4]
    ret

align 16
_is_dec_CODE:
    dd 48 dup(0)
    dd 10 dup(1)
    dd 198 dup(0)

section '__data__' data readable writable

align 16
_is_dec_DATA:
    dd 48 dup(0)
    dd 10 dup(1)
    dd 198 dup(0)

timestart dd 0
fmt db "%s: ", "%dms", 10, 0
_pause db "pause", 0

section '_import_' import readable

library msvcrt, 'msvcrt.dll',\
        kernel32, 'kernel32.dll'

import msvcrt,\
       printf, 'printf',\
       system, 'system'

include 'api/kernel32.inc'

I have rescued the affinity mask thing too from here (hello f0dder :P).

OK, any ideas why reading the parameter from memory produces such a huge overhead in the LUT-based procs on AMD processors? (If that was the real cause.)
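The processor-selection step in the listing above (`mov esi, 1` / `bsf ecx, [ProcessAffinityMask]` / `shl esi, cl`) builds a mask with only the lowest allowed CPU bit set. A minimal C sketch of the same arithmetic, using a hypothetical helper name and assuming a non-zero 32-bit mask, could look like this:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: keep only the lowest set bit of the affinity mask.
   For a non-zero mask this gives the same value as 1 << bsf(mask). */
static uint32_t lowest_cpu_mask(uint32_t process_affinity_mask)
{
    /* bsf leaves its destination undefined for a zero source, so rule that out. */
    assert(process_affinity_mask != 0);
    /* x & -x isolates the lowest set bit; written with ~x + 1 to stay unsigned. */
    return process_affinity_mask & (~process_affinity_mask + 1u);
}
```

For example, a process allowed on CPUs 2 and 3 (mask 0x0C) would be pinned to CPU 2 (mask 0x04). The resulting value is what the listing passes to SetProcessAffinityMask.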
r22 01 Jul 2009, 20:01
Mem.LUT + (Mem.Stack -> Reg) -> Reg

Back-to-back memory reads from two different sources.
Using the data from the first read in the second read.
Partial stall penalty?

Borsuc hit the nail on the head: AMD's caching logic (or just the amount of cache) is inferior to Intel's.
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.