flat assembler
Message board for the users of flat assembler.

Index > Main > BT instruction?

Goto page 1, 2  Next
Author
Thread Post new topic Reply to topic
Plue



Joined: 15 Dec 2005
Posts: 151
Plue 18 Jan 2007, 19:23
Are there any cases where BT is better than a normal TEST?

_________________
Roses are red
Violets are blue
Some poems rhyme
And some don't.
Post 18 Jan 2007, 19:23
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 18 Jan 2007, 20:03
In case of a "BT mem32, reg32" you can test 2^32 bits just one BT while with TEST you need to first calculate the effective address of the bit to test and then testing with the mask.

I don't know if exists another case.

Well, supposing a very silly situation in which you need to count how many bytes has the interested bit set then with BT could be this branchless (except for the loop...) way:

Code:
  add  ebx, esi 
  neg esi
  xor eax, eax

.loop:
  bt  byte [ebx+esi], BIT_TO_TEST 
  adc eax, 0
  add esi, 1 ; instead of INC to make Pentium IV happy...
  jnz .loop

; EAX has the number of bytes with BIT_TO_TEST set at this point
    


Silly example but shows why sometimes could be choiced just because it set CF.
Post 18 Jan 2007, 20:03
View user's profile Send private message Reply with quote
MazeGen



Joined: 06 Oct 2003
Posts: 977
Location: Czechoslovakia
MazeGen 19 Jan 2007, 07:33
BTW, Agner Fog's manual (optimizing_assembly.pdf) doesn't recommend BT family of instructions with memory operand:
Agner Fog wrote:

16.6 Bit test (all processors)
BT, BTC, BTR, and BTS instructions should preferably be replaced by instructions like TEST,
AND, OR, XOR, or shifts on P1, PMMX and P4. Bit tests with a memory operand should be
avoided on other Intel processors.
BTC, BTR, and BTS use 2 uops on AMD processors. Bit
test instructions are useful when optimizing for size.

I think that by "other processors" he means also Core and Core 2.
Post 19 Jan 2007, 07:33
View user's profile Send private message Visit poster's website Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 19 Jan 2007, 09:49
Hm, does he give any reason for avoiding? (too lazy to look it up myself right now). As long as it's "just" performance, no big sweat Smile
Post 19 Jan 2007, 09:49
View user's profile Send private message Visit poster's website Reply with quote
MazeGen



Joined: 06 Oct 2003
Posts: 977
Location: Czechoslovakia
MazeGen 19 Jan 2007, 10:40
Yeah, it is just performance Smile
Code:
P4 Uops/Microcode/Latency

BT reg, imm 3/0/4
BT reg, reg 2/0/4
BT mem, imm 4/0/4
BT mem, reg 4/12/12
    
Post 19 Jan 2007, 10:40
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 19 Jan 2007, 15:39
Code:
BT   mem16/32/64, reg16/32/64 0Fh A3h mm-xxx-xxx VectorPath 8
TEST mem16/32/64, reg16/32/64 85h     mm-xxx-xxx DirectPath 4    


Is bad for AMD64 too, though, how much clocks you need to prepare a mask and effective address for test an arbitrary bit?
Post 19 Jan 2007, 15:39
View user's profile Send private message Reply with quote
MazeGen



Joined: 06 Oct 2003
Posts: 977
Location: Czechoslovakia
MazeGen 19 Jan 2007, 17:48
I know you meant something different, Loco, but let's take optimizing this loop for fun as an exercise, maybe we learn something new. Assumptions:

  • we optimize for P4 processor (I'm still not familiar with those new parameters of Core processors)
  • we use Agner Fog's manual to determine instruction's parameters
  • we assume the instruction is already decoded and stored in the cache (we don't take "Uops" and "Microcode" into account) so only Latency (the number of uops between dependent instructions) and Reciprocal throughput (same for independent instruction, marked bold) are important

In the original loop, the latency of chain BT+ADC should be 4+6 = 10 uops. Now, what latency would take the same loop with TEST and SETNZ instruction?
Code:
 add ebx, esi 
 neg esi
 xor eax, eax
 xor ebx, ebx

.loop:
  test byte [ebx+esi], BIT_TO_TEST 
  setnz bl
  add eax, ebx
  add esi, 1
  jnz .loop
    

The latency of chain TEST+SETNZ+ADD should be 5*+5+0.25 = 10.25 uops.

Arrrgh, the TEST loop should be a bit slower Wink, but it would need some real testing to make sure Smile

* the latency of test mem, imm is not listed in the manual so tried to estimate it

EDIT: JZ depends on the last ADD so changed the throughput
EDIT2: I completely removed uops for the last ADD since it is always independent instruction from the core of the loop
Post 19 Jan 2007, 17:48
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 19 Jan 2007, 19:40
I corrected your code a little because it destroy EBX. I also corrected my code because BT doesn't accept byte ptr...

So the test now is testing a given bit on a DWORD

My test

Code:
format PE GUI 4.0
BIT_TO_TEST = 5
MASK        = 1 shl BIT_TO_TEST

  mov     edi, 10

align 16
start:
  xor     eax, eax
  cpuid
  rdtsc
  mov     [BT_Start], eax

  mov     ebx, buff
  mov     esi, 64
  call    BT.Count

  xor     eax, eax
  cpuid
  rdtsc
  mov     [BT_End], eax

align 16
  xor     eax, eax
  cpuid
  rdtsc
  mov     [TEST_Start], eax

  mov     ebx, buff
  mov     esi, 64
  call    TEST.Count

  xor     eax, eax
  cpuid
  rdtsc
  mov     [TEST_End], eax

  dec     edi
  jnz     start


  mov     eax, [BT_End]
  sub     eax, [BT_Start]

  mov     edx, [TEST_End]
  sub     edx, [TEST_Start]

  ; EAX = Number of cycles for BT instruction
  ; EDX = Number of cycles for TEST instruction
  int3

align 16

BT.Count:
  add     ebx, esi
  neg     esi
  xor     eax, eax

.loop: 
  bt      dword [ebx+esi], BIT_TO_TEST
  adc     eax, 0
  add     esi, 4 ; instead of INC to make Pentium IV happy...
  jnz     .loop

  retn

align 16

TEST.Count:
  add     ebx, esi
  neg     esi
  xor     eax, eax
  xor     edx, edx

.loop:
  test    dword [ebx+esi], MASK
  setnz   dl
  add     eax, edx
  add     esi, 4
  jnz     .loop

  retn

align 64 ; AMD64 cache line size
buff rd 64/4

BT_Start   dd ?
BT_End     dd ?

TEST_Start dd ?
TEST_End   dd ?    

BT = 116 cycles on every run of the program (114 cycles if I put "mov eax, eax" just before BT.Count label and remove the alignment Confused)
TEST = 118 cycles on every run of the program (115 cycles if I remove the alignment for TEST.Count label Confused)

Someone knows why my Athlon64 dislikes alignment???
Post 19 Jan 2007, 19:40
View user's profile Send private message Reply with quote
MazeGen



Joined: 06 Oct 2003
Posts: 977
Location: Czechoslovakia
MazeGen 19 Jan 2007, 20:14
Thanks for coding the test! Surprised

I have the very same results on my Turion64 Mobile... Shocked (including playing with the alignment)

It is too late now, I will play with it next week.
Post 19 Jan 2007, 20:14
View user's profile Send private message Visit poster's website Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 20 Jan 2007, 00:34
On my amd3800+ using a millisecond timer and loops of 4billion on a stationary memory address there doesn't seem to be any speed difference in the BT vs TEST.

Here's another variation of TEST you can try using CMOVNZ
TEST
xor eax,eax
;;;xor edx,edx not needed
.loop:
lea edx,[eax+1] ;; 1clock
test dword [ebx+esi], MASK
cmovnz eax,edx ;; 1clock
Post 20 Jan 2007, 00:34
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 20 Jan 2007, 03:45
After adding your code for testing the counter for BT was messed up but after replacing all the RETs by REP RET solved the problem and again BT is 116 cycles, TEST 118 cycles and CMOVNZ 114 cycles.

Now a little more real life example, a chars set type
Code:
format PE GUI 4.0

  mov     edi, 10 

align 16
start:

BT_TESTING:
  xor     eax, eax
  cpuid 
  rdtsc 
  mov     [BT_Start], eax 

  mov     ebx, string + string.size
  mov     esi, -string.size

.loop:
  mov     eax, string
  movzx   edx, byte [ebx+esi]
  call    BT.includes?

  add     esi, 1
  jnz     .loop

  xor     eax, eax
  cpuid 
  rdtsc 
  mov     [BT_End], eax

align 16
TEST_TESTING:

  xor     eax, eax
  cpuid 
  rdtsc 
  mov     [TEST_Start], eax

  mov     ebx, string + string.size
  mov     esi, -string.size

.loop:
  mov     eax, string
  movzx   edx, byte [ebx+esi]
  call    TEST.includes?

  add     esi, 1
  jnz     .loop

  xor     eax, eax
  cpuid 
  rdtsc 
  mov     [TEST_End], eax




  dec     edi
  jnz     start 

  mov     eax, [BT_End]
  sub     eax, [BT_Start] 

  mov     edx, [TEST_End] 
  sub     edx, [TEST_Start]

  ; EAX = Number of cycles for BT instruction
  ; EDX = Number of cycles for TEST instruction
  int3

align 16
BT.includes?:; EAX = Pointer to CharSet | EDX = char to add
  bt      dword [eax], edx

  sbb     eax, eax
; With the commented code takes 1 cycle more so I use SBB instead
;  setc    al
;  movsx   eax, al
  ret

align 16
TEST.includes?:; EAX = Pointer to CharSet | EDX = char to add
  mov     ecx, edx
  mov     edx, 1
  shl     edx, cl
  shr     ecx, 5
  mov     eax, [eax+ecx]

; With the commented code takes 1 cycle more
;  test    [eax+ecx], edx

;  sete    al
;  movsx   eax, al

  and     eax, edx
  ret

align 64 ; AMD64 cache line size
aCharSet rb 256 / 8

string   db "flat assembler is GREAT Very Happy"

string.size = $ - string


BT_Start    dd ?
BT_End      dd ?

TEST_Start  dd ?
TEST_End    dd ?    


BT = 297 cycles
TEST = 270 cycles

I feel so disappointed...

In all cases my test CPU is an Athlon64 3200+ Venice 2.0 Ghz (S939).

PS: BTW, someone knows why I need to set EDI to at least 5 to get constant results?
Post 20 Jan 2007, 03:45
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 20 Jan 2007, 04:55
To get accurate timings with RDTSC you need longer running loops. The non loop instructions for initialization and such will throw off the timing if the loop doesn't run long enough.
~300 nanoseconds isn't a good benchmark for timing, you should shoot for loop counts that take at least a second to run.,
BUT if you do that you'll have to save the full 64bits of the RDTSC (why do you use cpuid before it?).

rdtsc
mov ebx,eax
mov ebp,edx
...
rdtsc
sub eax,ebx
sbb edx,ebp
push eax
push edx
push <'%x%x',0>
call [printf]
Post 20 Jan 2007, 04:55
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 20 Jan 2007, 05:42
Quote:

(why do you use cpuid before it?).


For serialization. The problem of running too long time is that an interruption will obviously occur and results will never be constant. With 4<EDI<11 I get constant timings (but presents strange situations like being better to have unaligned labels).

I'll give a try using long time loops just to see what happens but the results will be irremediably poisoned by the unpredictable environment of interruptions.

Cheers
Post 20 Jan 2007, 05:42
View user's profile Send private message Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 20 Jan 2007, 15:01
Always boost priority to timecritical and set thread affinity to one-cpu-only while doing tests of this nature... while it's still not perfect, you get much more reliable results. (thread affinity is needed to avoid negative deltas on multicore AMD64 CPUs).
Post 20 Jan 2007, 15:01
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 20 Jan 2007, 16:10
Here with the suggestions
Code:
format PE GUI 4.0
LOOP_TIMES = 20000000

  mov     edi, 5

  call    [GetCurrentThread]

  push    1
  push    eax
  call    [SetThreadAffinityMask]

  call    [GetCurrentProcess]

  push    REALTIME_PRIORITY_CLASS
  push    eax
  call    [SetPriorityClass]

align 16
start:

BT_TESTING:
  mov     ebp, LOOP_TIMES

  xor     eax, eax
  cpuid 
  rdtsc 
  mov     [BT_LowStart], eax
  mov     [BT_HighStart], edx

.loop:
  mov     ebx, string + string.size
  mov     esi, -string.size

.innerLoop:
  mov     eax, string
  movzx   edx, byte [ebx+esi]
  call    BT.includes?

  add     esi, 1
  jnz     .innerLoop

  sub     ebp, 1
  jnz     .loop

  xor     eax, eax
  cpuid 
  rdtsc 
  mov     [BT_LowEnd], eax
  mov     [BT_HighEnd], edx

align 16
TEST_TESTING:
  mov     ebp, LOOP_TIMES

  xor     eax, eax
  cpuid 
  rdtsc 
  mov     [TEST_LowStart], eax
  mov     [TEST_HighStart], edx

.loop:
  mov     ebx, string + string.size
  mov     esi, -string.size

.innerLoop:
  mov     eax, string
  movzx   edx, byte [ebx+esi]
  call    TEST.includes?

  add     esi, 1
  jnz     .innerLoop

  sub     ebp, 1
  jnz     .loop

  xor     eax, eax
  cpuid 
  rdtsc 
  mov     [TEST_LowEnd], eax
  mov     [TEST_HighEnd], edx




  dec     edi
  jnz     start 

  mov     eax, [BT_LowEnd]
  mov     ebx, [BT_HighEnd]
  sub     eax, [BT_LowStart]
  sbb     ebx, [BT_HighStart]

  mov     ecx, [TEST_LowEnd]
  mov     edx, [TEST_HighEnd]
  sub     ecx, [TEST_LowStart]
  sbb     edx, [TEST_HighStart]

  mov     edi, eax
  mov     esi, ebx
  sub     edi, ecx
  sbb     esi, edx

  ; EBX:EAX = Number of cycles for BT instruction
  ; EDX:ECX = Number of cycles for TEST instruction
  ; ESI:EDI = EBX:EAX - EDX:ECX
  int3

align 16
BT.includes?:; EAX = Pointer to CharSet | EDX = char to add
  bt      dword [eax], edx

  sbb     eax, eax
; With the commented code takes 1 cycle more so I use SBB instead
;  setc    al
;  movsx   eax, al
  ret

align 16
TEST.includes?:; EAX = Pointer to CharSet | EDX = char to add
  mov     ecx, edx
  mov     edx, 1
  shl     edx, cl
  shr     ecx, 5
  mov     eax, [eax+ecx]

; With the commented code takes 1 cycle more
;  test    [eax+ecx], edx

;  sete    al
;  movsx   eax, al

  and     eax, edx
  ret

align 64 ; AMD64 cache line size
aCharSet rb 256 / 8

string   db "flat assembler is GREAT Very Happy"

string.size = $ - string


BT_LowStart     dd ?
BT_HighStart    dd ?
BT_LowEnd       dd ?
BT_HighEnd      dd ?

TEST_LowStart   dd ?
TEST_HighStart  dd ?
TEST_LowEnd     dd ?
TEST_HighEnd    dd ?

include 'win32a.inc'

align 4
data import
  library kernel,'KERNEL32.DLL'
  import kernel,\
         GetCurrentProcess, 'GetCurrentProcess',\
         GetCurrentThread,  'GetCurrentThread',\
         SetPriorityClass,  'SetPriorityClass',\
         SetThreadAffinityMask, 'SetThreadAffinityMask'

end data    


Again BT is worst, it is aproximatelly 500,000,000 cycles slower.

About affinity sorry I'm not sure how to set it correctly nor I have a multi-core CPU.

[EDIT] Added Affinity Mask, thanks f0dder[/EDIT]


Last edited by LocoDelAssembly on 20 Jan 2007, 16:49; edited 1 time in total
Post 20 Jan 2007, 16:10
View user's profile Send private message Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 20 Jan 2007, 16:29
SetThreadAffinity(GetCurrentThread(), 1) - limits affinity to first processor (second parameter is a bitmap of what processors it's allowed to run on).

Results from my AMD64x2 4400+
---
EBX:EAX = 00000001:818E2A2A
EDX:ECX = 00000001:59BEA13F
ESI:EDI = 00000000:27CF88EB
0x27CF88EB / 20000000 ~ 33 cycles slower per iteration.


Description:
Download
Filename: test.asm
Filesize: 3.09 KB
Downloaded: 547 Time(s)


_________________
Image - carpe noctem
Post 20 Jan 2007, 16:29
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 20 Jan 2007, 16:40
And the first code
Code:
format PE GUI 4.0
BIT_TO_TEST = 5 
MASK        = 1 shl BIT_TO_TEST 

LOOP_TIMES  = 5000000

  mov     edi, 5

  call    [GetCurrentProcess]

  push    REALTIME_PRIORITY_CLASS
  push    eax
  call    [SetPriorityClass]

align 16
start:

BT_TESTING:
  mov     ebp, LOOP_TIMES

  xor     eax, eax
  cpuid 
  rdtsc 
  mov     [BT_Start], eax 

.loop:
  mov     ebx, buff
  mov     esi, 64 
  call    BT.Count 

  sub     ebp, 1
  jnz     .loop

  xor     eax, eax 
  cpuid 
  rdtsc 
  mov     [BT_End], eax 

align 16
TEST_TESTING:
  mov     ebp, LOOP_TIMES

  xor     eax, eax 
  cpuid 
  rdtsc 
  mov     [TEST_Start], eax 

.loop:
  mov     ebx, buff 
  mov     esi, 64 
  call    TEST.Count

  sub     ebp, 1
  jnz     .loop

  xor     eax, eax 
  cpuid 
  rdtsc 
  mov     [TEST_End], eax 

align 16
R22_TESTING:
  mov     ebp, LOOP_TIMES

  xor     eax, eax 
  cpuid 
  rdtsc 
  mov     [LEA_Start], eax

.loop:
  mov     ebx, buff 
  mov     esi, 64 
  call    LEA.Count

  sub     ebp, 1
  jnz     .loop

  xor     eax, eax 
  cpuid 
  rdtsc 
  mov     [LEA_End], eax

  dec     edi
  jnz     start 


  mov     eax, [BT_End] 
  sub     eax, [BT_Start] 

  mov     edx, [TEST_End] 
  sub     edx, [TEST_Start]

  mov     ecx, [LEA_End]
  sub     ecx, [LEA_Start]

  ; EAX = Number of cycles for BT instruction 
  ; EDX = Number of cycles for TEST instruction
  ; ECX = Number of cycles for r22's suggestion
  int3 

align 16
BT.Count:
  add     ebx, esi 
  neg     esi 
  xor     eax, eax 

.loop:  
  bt      dword [ebx+esi], BIT_TO_TEST 
  adc     eax, 0 
  add     esi, 4 ; instead of INC to make Pentium IV happy... 
  jnz     .loop 

  rep     retn

align 16
TEST.Count: 
  add     ebx, esi 
  neg     esi 
  xor     eax, eax 
  xor     edx, edx 

.loop: 
  test    dword [ebx+esi], MASK 
  setnz   dl 
  add     eax, edx 
  add     esi, 4 
  jnz     .loop 

  rep     retn

align 16
LEA.Count:
  add     ebx, esi 
  neg     esi 
  xor     eax, eax

.loop:
  lea     edx,[eax+1] ;; 1clock
  test    dword [ebx+esi], MASK
  cmovnz  eax,edx ;; 1clock
  add     esi, 4
  jnz     .loop

  rep     retn

align 64 ; AMD64 cache line size
buff rd 64/4 

BT_Start   dd ? 
BT_End     dd ? 

TEST_Start dd ?
TEST_End   dd ?

LEA_Start  dd ?
LEA_End    dd ?

include 'win32a.inc'

align 4
data import
  library kernel,'KERNEL32.DLL'

  import kernel,\
         GetCurrentProcess, 'GetCurrentProcess',\
         SetPriorityClass,  'SetPriorityClass'

end data    


1st = r22's code
2nd = BT code
3rd = TEST code

But, removing the alignment for TEST.Count

1st = r22's code
2nd = TEST code
3rd = BT code

Removing the alignment for TEST.Count and BT.Count

1st = BT code
2nd = r22's code
3rd = TEST code

And removing the alignment for TEST.Count, BT.Count and LEA.Count

Same as above

Or transferring control to unaligned addresses fakes the TSC or sometimes it's better to transfer control to unaligned addresses (something very opposed to the optimization literature...).

Later I'll replace RDTSC by GetTickCount to see if I get a more expectable result
Post 20 Jan 2007, 16:40
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 20 Jan 2007, 17:12
Tested replacing all the occurencies of RDTSC with CALL [GetTickCount] but exactly same results. (This time I used LOOP_TIMES = 50000000 and added the affinity mask code -unnecesarelly for my CPU-).

f0dder, I decided to not use SetThreadPriority because it auments the cycles to ~600,000,000. I edited the post so you can see my final code.
Post 20 Jan 2007, 17:12
View user's profile Send private message Reply with quote
MCD



Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD 21 Jan 2007, 20:43
Has anyone thought about calling Sleep 0 before each cpuid/rdtsc to restart thread time-slice yet?
Post 21 Jan 2007, 20:43
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 21 Jan 2007, 21:54
Tested, reduced the count to ~450,000,000 cycles for the char set example and the other example remains the same (maybe fewer milliseconds but the top speed table didn't change).
Post 21 Jan 2007, 21:54
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.