flat assembler
Message board for the users of flat assembler.
Index
> Main > BT instruction? Goto page 1, 2 Next |
Author |
|
Plue 18 Jan 2007, 19:23
Are there any cases where BT is better than a normal TEST?
_________________ Roses are red Violets are blue Some poems rhyme And some don't. |
|||
18 Jan 2007, 19:23 |
|
MazeGen 19 Jan 2007, 07:33
BTW, Agner Fog's manual (optimizing_assembly.pdf) doesn't recommend BT family of instructions with memory operand:
Agner Fog wrote:
I think that by "other processors" he means also Core and Core 2. |
|||
19 Jan 2007, 07:33 |
|
f0dder 19 Jan 2007, 09:49
Hm, does he give any reason for avoiding? (too lazy to look it up myself right now). As long as it's "just" performance, no big sweat
|
|||
19 Jan 2007, 09:49 |
|
MazeGen 19 Jan 2007, 10:40
Yeah, it is just performance
Code: P4 Uops/Microcode/Latency BT reg, imm 3/0/4 BT reg, reg 2/0/4 BT mem, imm 4/0/4 BT mem, reg 4/12/12 |
|||
19 Jan 2007, 10:40 |
|
LocoDelAssembly 19 Jan 2007, 15:39
Code: BT mem16/32/64, reg16/32/64 0Fh A3h mm-xxx-xxx VectorPath 8 TEST mem16/32/64, reg16/32/64 85h mm-xxx-xxx DirectPath 4 Is bad for AMD64 too, though, how much clocks you need to prepare a mask and effective address for test an arbitrary bit? |
|||
19 Jan 2007, 15:39 |
|
MazeGen 19 Jan 2007, 17:48
I know you meant something different, Loco, but let's take optimizing this loop for fun as an exercise, maybe we learn something new. Assumptions:
In the original loop, the latency of chain BT+ADC should be 4+6 = 10 uops. Now, what latency would take the same loop with TEST and SETNZ instruction? Code: add ebx, esi neg esi xor eax, eax xor ebx, ebx .loop: test byte [ebx+esi], BIT_TO_TEST setnz bl add eax, ebx add esi, 1 jnz .loop The latency of chain TEST+SETNZ+ADD should be 5*+5+0.25 = 10.25 uops. Arrrgh, the TEST loop should be a bit slower , but it would need some real testing to make sure * the latency of test mem, imm is not listed in the manual so tried to estimate it EDIT: JZ depends on the last ADD so changed the throughput EDIT2: I completely removed uops for the last ADD since it is always independent instruction from the core of the loop |
|||
19 Jan 2007, 17:48 |
|
LocoDelAssembly 19 Jan 2007, 19:40
I corrected your code a little because it destroy EBX. I also corrected my code because BT doesn't accept byte ptr...
So the test now is testing a given bit on a DWORD My test Code: format PE GUI 4.0 BIT_TO_TEST = 5 MASK = 1 shl BIT_TO_TEST mov edi, 10 align 16 start: xor eax, eax cpuid rdtsc mov [BT_Start], eax mov ebx, buff mov esi, 64 call BT.Count xor eax, eax cpuid rdtsc mov [BT_End], eax align 16 xor eax, eax cpuid rdtsc mov [TEST_Start], eax mov ebx, buff mov esi, 64 call TEST.Count xor eax, eax cpuid rdtsc mov [TEST_End], eax dec edi jnz start mov eax, [BT_End] sub eax, [BT_Start] mov edx, [TEST_End] sub edx, [TEST_Start] ; EAX = Number of cycles for BT instruction ; EDX = Number of cycles for TEST instruction int3 align 16 BT.Count: add ebx, esi neg esi xor eax, eax .loop: bt dword [ebx+esi], BIT_TO_TEST adc eax, 0 add esi, 4 ; instead of INC to make Pentium IV happy... jnz .loop retn align 16 TEST.Count: add ebx, esi neg esi xor eax, eax xor edx, edx .loop: test dword [ebx+esi], MASK setnz dl add eax, edx add esi, 4 jnz .loop retn align 64 ; AMD64 cache line size buff rd 64/4 BT_Start dd ? BT_End dd ? TEST_Start dd ? TEST_End dd ? BT = 116 cycles on every run of the program (114 cycles if I put "mov eax, eax" just before BT.Count label and remove the alignment ) TEST = 118 cycles on every run of the program (115 cycles if I remove the alignment for TEST.Count label ) Someone knows why my Athlon64 dislikes alignment??? |
|||
19 Jan 2007, 19:40 |
|
MazeGen 19 Jan 2007, 20:14
Thanks for coding the test!
I have the very same results on my Turion64 Mobile... (including playing with the alignment) It is too late now, I will play with it next week. |
|||
19 Jan 2007, 20:14 |
|
r22 20 Jan 2007, 00:34
On my amd3800+ using a millisecond timer and loops of 4billion on a stationary memory address there doesn't seem to be any speed difference in the BT vs TEST.
Here's another variation of TEST you can try using CMOVNZ TEST xor eax,eax ;;;xor edx,edx not needed .loop: lea edx,[eax+1] ;; 1clock test dword [ebx+esi], MASK cmovnz eax,edx ;; 1clock |
|||
20 Jan 2007, 00:34 |
|
LocoDelAssembly 20 Jan 2007, 03:45
After adding your code for testing the counter for BT was messed up but after replacing all the RETs by REP RET solved the problem and again BT is 116 cycles, TEST 118 cycles and CMOVNZ 114 cycles.
Now a little more real life example, a chars set type Code: format PE GUI 4.0 mov edi, 10 align 16 start: BT_TESTING: xor eax, eax cpuid rdtsc mov [BT_Start], eax mov ebx, string + string.size mov esi, -string.size .loop: mov eax, string movzx edx, byte [ebx+esi] call BT.includes? add esi, 1 jnz .loop xor eax, eax cpuid rdtsc mov [BT_End], eax align 16 TEST_TESTING: xor eax, eax cpuid rdtsc mov [TEST_Start], eax mov ebx, string + string.size mov esi, -string.size .loop: mov eax, string movzx edx, byte [ebx+esi] call TEST.includes? add esi, 1 jnz .loop xor eax, eax cpuid rdtsc mov [TEST_End], eax dec edi jnz start mov eax, [BT_End] sub eax, [BT_Start] mov edx, [TEST_End] sub edx, [TEST_Start] ; EAX = Number of cycles for BT instruction ; EDX = Number of cycles for TEST instruction int3 align 16 BT.includes?:; EAX = Pointer to CharSet | EDX = char to add bt dword [eax], edx sbb eax, eax ; With the commented code takes 1 cycle more so I use SBB instead ; setc al ; movsx eax, al ret align 16 TEST.includes?:; EAX = Pointer to CharSet | EDX = char to add mov ecx, edx mov edx, 1 shl edx, cl shr ecx, 5 mov eax, [eax+ecx] ; With the commented code takes 1 cycle more ; test [eax+ecx], edx ; sete al ; movsx eax, al and eax, edx ret align 64 ; AMD64 cache line size aCharSet rb 256 / 8 string db "flat assembler is GREAT " string.size = $ - string BT_Start dd ? BT_End dd ? TEST_Start dd ? TEST_End dd ? BT = 297 cycles TEST = 270 cycles I feel so disappointed... In all cases my test CPU is an Athlon64 3200+ Venice 2.0 Ghz (S939). PS: BTW, someone knows why I need to set EDI to at least 5 to get constant results? |
|||
20 Jan 2007, 03:45 |
|
r22 20 Jan 2007, 04:55
To get accurate timings with RDTSC you need longer running loops. The non loop instructions for initialization and such will throw off the timing if the loop doesn't run long enough.
~300 nanoseconds isn't a good benchmark for timing, you should shoot for loop counts that take at least a second to run., BUT if you do that you'll have to save the full 64bits of the RDTSC (why do you use cpuid before it?). rdtsc mov ebx,eax mov ebp,edx ... rdtsc sub eax,ebx sbb edx,ebp push eax push edx push <'%x%x',0> call [printf] |
|||
20 Jan 2007, 04:55 |
|
LocoDelAssembly 20 Jan 2007, 05:42
Quote:
For serialization. The problem of running too long time is that an interruption will obviously occur and results will never be constant. With 4<EDI<11 I get constant timings (but presents strange situations like being better to have unaligned labels). I'll give a try using long time loops just to see what happens but the results will be irremediably poisoned by the unpredictable environment of interruptions. Cheers |
|||
20 Jan 2007, 05:42 |
|
f0dder 20 Jan 2007, 15:01
Always boost priority to timecritical and set thread affinity to one-cpu-only while doing tests of this nature... while it's still not perfect, you get much more reliable results. (thread affinity is needed to avoid negative deltas on multicore AMD64 CPUs).
|
|||
20 Jan 2007, 15:01 |
|
LocoDelAssembly 20 Jan 2007, 16:10
Here with the suggestions
Code: format PE GUI 4.0 LOOP_TIMES = 20000000 mov edi, 5 call [GetCurrentThread] push 1 push eax call [SetThreadAffinityMask] call [GetCurrentProcess] push REALTIME_PRIORITY_CLASS push eax call [SetPriorityClass] align 16 start: BT_TESTING: mov ebp, LOOP_TIMES xor eax, eax cpuid rdtsc mov [BT_LowStart], eax mov [BT_HighStart], edx .loop: mov ebx, string + string.size mov esi, -string.size .innerLoop: mov eax, string movzx edx, byte [ebx+esi] call BT.includes? add esi, 1 jnz .innerLoop sub ebp, 1 jnz .loop xor eax, eax cpuid rdtsc mov [BT_LowEnd], eax mov [BT_HighEnd], edx align 16 TEST_TESTING: mov ebp, LOOP_TIMES xor eax, eax cpuid rdtsc mov [TEST_LowStart], eax mov [TEST_HighStart], edx .loop: mov ebx, string + string.size mov esi, -string.size .innerLoop: mov eax, string movzx edx, byte [ebx+esi] call TEST.includes? add esi, 1 jnz .innerLoop sub ebp, 1 jnz .loop xor eax, eax cpuid rdtsc mov [TEST_LowEnd], eax mov [TEST_HighEnd], edx dec edi jnz start mov eax, [BT_LowEnd] mov ebx, [BT_HighEnd] sub eax, [BT_LowStart] sbb ebx, [BT_HighStart] mov ecx, [TEST_LowEnd] mov edx, [TEST_HighEnd] sub ecx, [TEST_LowStart] sbb edx, [TEST_HighStart] mov edi, eax mov esi, ebx sub edi, ecx sbb esi, edx ; EBX:EAX = Number of cycles for BT instruction ; EDX:ECX = Number of cycles for TEST instruction ; ESI:EDI = EBX:EAX - EDX:ECX int3 align 16 BT.includes?:; EAX = Pointer to CharSet | EDX = char to add bt dword [eax], edx sbb eax, eax ; With the commented code takes 1 cycle more so I use SBB instead ; setc al ; movsx eax, al ret align 16 TEST.includes?:; EAX = Pointer to CharSet | EDX = char to add mov ecx, edx mov edx, 1 shl edx, cl shr ecx, 5 mov eax, [eax+ecx] ; With the commented code takes 1 cycle more ; test [eax+ecx], edx ; sete al ; movsx eax, al and eax, edx ret align 64 ; AMD64 cache line size aCharSet rb 256 / 8 string db "flat assembler is GREAT " string.size = $ - string BT_LowStart dd ? BT_HighStart dd ? BT_LowEnd dd ? BT_HighEnd dd ? TEST_LowStart dd ? TEST_HighStart dd ? TEST_LowEnd dd ? TEST_HighEnd dd ? include 'win32a.inc' align 4 data import library kernel,'KERNEL32.DLL' import kernel,\ GetCurrentProcess, 'GetCurrentProcess',\ GetCurrentThread, 'GetCurrentThread',\ SetPriorityClass, 'SetPriorityClass',\ SetThreadAffinityMask, 'SetThreadAffinityMask' end data Again BT is worst, it is aproximatelly 500,000,000 cycles slower. About affinity sorry I'm not sure how to set it correctly nor I have a multi-core CPU. [EDIT] Added Affinity Mask, thanks f0dder[/EDIT] Last edited by LocoDelAssembly on 20 Jan 2007, 16:49; edited 1 time in total |
|||
20 Jan 2007, 16:10 |
|
f0dder 20 Jan 2007, 16:29
SetThreadAffinity(GetCurrentThread(), 1) - limits affinity to first processor (second parameter is a bitmap of what processors it's allowed to run on).
Results from my AMD64x2 4400+ --- EBX:EAX = 00000001:818E2A2A EDX:ECX = 00000001:59BEA13F ESI:EDI = 00000000:27CF88EB 0x27CF88EB / 20000000 ~ 33 cycles slower per iteration.
_________________ - carpe noctem |
|||||||||||
20 Jan 2007, 16:29 |
|
LocoDelAssembly 20 Jan 2007, 16:40
And the first code
Code: format PE GUI 4.0 BIT_TO_TEST = 5 MASK = 1 shl BIT_TO_TEST LOOP_TIMES = 5000000 mov edi, 5 call [GetCurrentProcess] push REALTIME_PRIORITY_CLASS push eax call [SetPriorityClass] align 16 start: BT_TESTING: mov ebp, LOOP_TIMES xor eax, eax cpuid rdtsc mov [BT_Start], eax .loop: mov ebx, buff mov esi, 64 call BT.Count sub ebp, 1 jnz .loop xor eax, eax cpuid rdtsc mov [BT_End], eax align 16 TEST_TESTING: mov ebp, LOOP_TIMES xor eax, eax cpuid rdtsc mov [TEST_Start], eax .loop: mov ebx, buff mov esi, 64 call TEST.Count sub ebp, 1 jnz .loop xor eax, eax cpuid rdtsc mov [TEST_End], eax align 16 R22_TESTING: mov ebp, LOOP_TIMES xor eax, eax cpuid rdtsc mov [LEA_Start], eax .loop: mov ebx, buff mov esi, 64 call LEA.Count sub ebp, 1 jnz .loop xor eax, eax cpuid rdtsc mov [LEA_End], eax dec edi jnz start mov eax, [BT_End] sub eax, [BT_Start] mov edx, [TEST_End] sub edx, [TEST_Start] mov ecx, [LEA_End] sub ecx, [LEA_Start] ; EAX = Number of cycles for BT instruction ; EDX = Number of cycles for TEST instruction ; ECX = Number of cycles for r22's suggestion int3 align 16 BT.Count: add ebx, esi neg esi xor eax, eax .loop: bt dword [ebx+esi], BIT_TO_TEST adc eax, 0 add esi, 4 ; instead of INC to make Pentium IV happy... jnz .loop rep retn align 16 TEST.Count: add ebx, esi neg esi xor eax, eax xor edx, edx .loop: test dword [ebx+esi], MASK setnz dl add eax, edx add esi, 4 jnz .loop rep retn align 16 LEA.Count: add ebx, esi neg esi xor eax, eax .loop: lea edx,[eax+1] ;; 1clock test dword [ebx+esi], MASK cmovnz eax,edx ;; 1clock add esi, 4 jnz .loop rep retn align 64 ; AMD64 cache line size buff rd 64/4 BT_Start dd ? BT_End dd ? TEST_Start dd ? TEST_End dd ? LEA_Start dd ? LEA_End dd ? include 'win32a.inc' align 4 data import library kernel,'KERNEL32.DLL' import kernel,\ GetCurrentProcess, 'GetCurrentProcess',\ SetPriorityClass, 'SetPriorityClass' end data 1st = r22's code 2nd = BT code 3rd = TEST code But, removing the alignment for TEST.Count 1st = r22's code 2nd = TEST code 3rd = BT code Removing the alignment for TEST.Count and BT.Count 1st = BT code 2nd = r22's code 3rd = TEST code And removing the alignment for TEST.Count, BT.Count and LEA.Count Same as above Or transferring control to unaligned addresses fakes the TSC or sometimes it's better to transfer control to unaligned addresses (something very opposed to the optimization literature...). Later I'll replace RDTSC by GetTickCount to see if I get a more expectable result |
|||
20 Jan 2007, 16:40 |
|
LocoDelAssembly 20 Jan 2007, 17:12
Tested replacing all the occurencies of RDTSC with CALL [GetTickCount] but exactly same results. (This time I used LOOP_TIMES = 50000000 and added the affinity mask code -unnecesarelly for my CPU-).
f0dder, I decided to not use SetThreadPriority because it auments the cycles to ~600,000,000. I edited the post so you can see my final code. |
|||
20 Jan 2007, 17:12 |
|
MCD 21 Jan 2007, 20:43
Has anyone thought about calling Sleep 0 before each cpuid/rdtsc to restart thread time-slice yet?
|
|||
21 Jan 2007, 20:43 |
|
LocoDelAssembly 21 Jan 2007, 21:54
Tested, reduced the count to ~450,000,000 cycles for the char set example and the other example remains the same (maybe fewer milliseconds but the top speed table didn't change).
|
|||
21 Jan 2007, 21:54 |
|
Goto page 1, 2 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.