flat assembler
Message board for the users of flat assembler.

jcc vs cmov - which is faster?

Nikolay Petrov



Of course not, but if possible, yes. :)
Some processors have little memory, and using methods like this on them is an extravagance.

_________________
regards
Post 30 Jun 2009, 08:55
LocoDelAssembly
Nikolay Petrov, surprisingly, your code turned out to be the slowest on my PC (Athlon64 Venice).

Code:
C:\Documents and Settings\Hernan\Escritorio>test.exe
my1: 10531ms
my2: 7563ms
locos1: 7562ms
borsucs: 6484ms
locos2: 6484ms
NikolayPetrov: 12016ms

C:\Documents and Settings\Hernan\Escritorio>test.exe
my1: 10531ms
my2: 7563ms
locos1: 7562ms
borsucs: 6484ms
locos2: 6484ms
NikolayPetrov: 12015ms

C:\Documents and Settings\Hernan\Escritorio>test.exe
my1: 10531ms
my2: 7563ms
locos1: 7563ms
borsucs: 6469ms
locos2: 6484ms
NikolayPetrov: 12016ms    


Here is the code, so you can check whether I did something wrong with yours:
Code:
format PE console 4.0

entry _start

include 'win32ax.inc'

section '__text__' code readable executable

macro tester func
{
  local ..loop

  invoke Sleep, 1000

  xor eax, eax
  cpuid

  call [GetTickCount]
  mov [timestart], eax
  mov ebx, $80000000

  align 16
  ..loop:
    push ebx ; Instead of push 0 to "sabotage" the branch predictor at my1 a bit
    call func
    add esp, 4
    sub ebx, 1
    jnz ..loop

; Serialize
  xor eax, eax
  cpuid

  call [GetTickCount]
  sub eax, [timestart]
  push eax
  call @f
  db `func, 0
@@:
  push fmt
  call [printf]
  add esp, 12
  align 16
}

_start:

  invoke  GetCurrentProcess
  invoke  SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
  invoke  GetCurrentThread
  invoke  SetThreadPriority, eax, THREAD_PRIORITY_TIME_CRITICAL

  tester my1
  tester my2
  tester locos1
  tester borsucs
  tester locos2
  tester NikolayPetrov
  xor eax, eax
  ret
  
align 16
my1:
  mov eax, [esp+4]
  cmp al, '0'
  jb .false
  cmp al, '9'
  ja .false
  mov eax, 1
  ret
  .false:
  xor eax, eax
  ret
  
align 16
my2:
  mov edx, [esp+4]
  xor ecx, ecx
  mov eax, 1
  cmp edx, '0'
  cmovb eax, ecx
  cmp edx, '9'
  cmova eax, ecx
  ret
  
align 16
locos1:
  mov   eax, [esp+4]
  sub   eax, '0'
  cmp   eax, 9
  setbe al
  movzx eax, al
  ret
  
align 16
borsucs:
  mov   eax, [esp+4]
  sub   eax, '0'
  cmp   eax, 10
  sbb   eax, eax
  ret
  
align 16
locos2:
  mov   eax, -'0'
  add   eax, [esp+4]
  cmp   eax, 10
  sbb   eax, eax
  ret

align 16
NikolayPetrov:
    movzx   eax, byte [esp+4]
    movzx   eax, byte [_is_dec_table+eax]
    ret

align 16
_is_dec_table:
    db 48 dup(0)
    db 10 dup(1)
    db 198 dup(0)

  
section '__data__' data readable writable

  fmt db "%s: ", "%dms", 10, 0
  timestart dd 0
  
section '_import_' import readable

  library msvcrt, 'msvcrt.dll',\
    kernel32, 'kernel32.dll'
    
  import msvcrt,\
    printf, 'printf'
    
  include 'api/kernel32.inc'    


I was expecting borsucs == locos2 == NikolayPetrov, not my1 < NikolayPetrov. :?
Post 30 Jun 2009, 15:57
r22



Thought 1: Very odd that the LUT is the slowest. It's likely the penalty of the byte-sized memory accesses and the fact that the elements in the LUT are not aligned.

Code:
align 16
NikolayPetrovR1:
    movzx   eax, byte [esp+4]
    mov   eax, [_is_dec_tableR+eax*4]
    ret

align 16
NikolayPetrovR2:
    mov   eax, [esp+4] ;; movzx "should" be faster ...
    and    eax, 0FFh
    mov   eax, [_is_dec_tableR+eax*4]
    ret

align 16
_is_dec_tableR:
    dd 48 dup(0)
    dd 10 dup(1)
    dd 198 dup(0) 
    


Thought 2: I think the real problem is that LUT should be used in parallel rather than one function call at a time.

Code:
mov esi,LUT
movzx eax,byte[Array+0]
movzx ebx,byte[Array+4]
movzx ecx,byte[Array+8]
movzx edx,byte[Array+12]
mov eax,[esi+eax*4]
mov ebx,[esi+ebx*4]
mov ecx,[esi+ecx*4]
mov edx,[esi+edx*4]
    


Thought 3: Could the fact that the LUT is in the Code section instead of Data change how the data is cached on the processor?

Thought 4: It's lunch time so I can indulge my curiosity for a few moments.
Post 30 Jun 2009, 16:27
r22



Quote:

my1: 7766ms
my2: 6062ms
locos1: 5578ms
borsucs: 5579ms
locos2: 5593ms
NikolayPetrov: 5578ms
r22_codeLUT: 5594ms
r22_dataLUT: 5578ms
Press any key to continue . . .


Quote:

my1: 7750ms
my2: 6000ms
locos1: 5547ms
borsucs: 5547ms
locos2: 5547ms
NikolayPetrov: 5547ms
r22_codeLUT: 5547ms
r22_dataLUT: 5547ms
Press any key to continue . . .


Code:
format PE console 4.0

entry _start

include 'win32ax.inc'

section '__text__' code readable executable

macro tester func
{
  local ..loop

  invoke Sleep, 1000

  xor eax, eax
  cpuid

  call [GetTickCount]
  mov [timestart], eax
  mov ebx, $80000000

  align 16
  ..loop:
    push ebx ; Instead of push 0 to "sabotage" the branch predictor at my1 a bit
    call func
    add esp, 4
    sub ebx, 1
    jnz ..loop

; Serialize
  xor eax, eax
  cpuid

  call [GetTickCount]
  sub eax, [timestart]
  push eax
  call @f
  db `func, 0
@@:
  push fmt
  call [printf]
  add esp, 12
  align 16
}

_start:

  invoke  GetCurrentProcess
  invoke  SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
  invoke  GetCurrentThread
  invoke  SetThreadPriority, eax, THREAD_PRIORITY_TIME_CRITICAL

  tester my1
  tester my2
  tester locos1
  tester borsucs
  tester locos2
  tester NikolayPetrov
  tester r22_codeLUT
  tester r22_dataLUT
  push _pause
  call [system]
  add esp,4
  xor eax, eax
  ret
  
align 16
my1:
  mov eax, [esp+4]
  cmp al, '0'
  jb .false
  cmp al, '9'
  ja .false
  mov eax, 1
  ret
  .false:
  xor eax, eax
  ret
  
align 16
my2:
  mov edx, [esp+4]
  xor ecx, ecx
  mov eax, 1
  cmp edx, '0'
  cmovb eax, ecx
  cmp edx, '9'
  cmova eax, ecx
  ret
  
align 16
locos1:
  mov   eax, [esp+4]
  sub   eax, '0'
  cmp   eax, 9
  setbe al
  movzx eax, al
  ret
  
align 16
borsucs:
  mov   eax, [esp+4]
  sub   eax, '0'
  cmp   eax, 10
  sbb   eax, eax
  ret
  
align 16
locos2:
  mov   eax, -'0'
  add   eax, [esp+4]
  cmp   eax, 10
  sbb   eax, eax
  ret

align 16
NikolayPetrov:
    movzx   eax, byte [esp+4]
    movzx   eax, byte [_is_dec_table+eax]
    ret

align 16
_is_dec_table:
    db 48 dup(0)
    db 10 dup(1)
    db 198 dup(0)

align 16
r22_dataLUT:
        movzx eax,byte[esp+4]
        mov eax,[_is_dec_DATA+eax*4]   ; table in the data section
        ret

align 16
r22_codeLUT:
        movzx eax,byte[esp+4]
        mov eax,[_is_dec_CODE+eax*4]   ; table in the code section
        ret

align 16
_is_dec_CODE:
    dd 48 dup(0)
    dd 10 dup(1)
    dd 198 dup(0)

section '__data__' data readable writable

align 16
_is_dec_DATA:
    dd 48 dup(0)
    dd 10 dup(1)
    dd 198 dup(0)

  fmt db "%s: ", "%dms", 10, 0
  timestart dd 0
  _pause db 'pause',0

section '_import_' import readable

  library msvcrt, 'msvcrt.dll',\
    kernel32, 'kernel32.dll'
    
  import msvcrt,\
    printf, 'printf',\
    system, 'system'
    
  include 'api/kernel32.inc'
    


Looks like this optimization depends on the processor.

Intel Core2 Q8200
Post 30 Jun 2009, 16:48
LocoDelAssembly
Code:
my1: 10547ms
my2: 7563ms
locos1: 7562ms
borsucs: 6484ms
locos2: 6484ms
NikolayPetrov: 12016ms
r22_codeLUT: 12016ms
r22_dataLUT: 12016ms    

AMD Athlon64 Venice (S939)
Post 30 Jun 2009, 17:01
bitRAKE



Code:
my1: 7719ms
my2: 5609ms
locos1: 5563ms
borsucs: 5547ms
locos2: 5547ms
NikolayPetrov: 5546ms
r22_codeLUT: 5547ms
r22_dataLUT: 5547ms    
...not a huge difference on my L5410 Xeon - all code just runs great. :D
Post 01 Jul 2009, 01:24
LocoDelAssembly
For owners of multi-core/multi-thread machines the benchmark won't run very well: the part that adjusts the affinity mask is missing. I'll have to rescue that code from asmcommunity or write it again.
Post 01 Jul 2009, 01:54
manfred



Code:
my1: 8031ms
my2: 5750ms
locos1: 5750ms
borsucs: 4938ms
locos2: 4938ms
NikolayPetrov: 9156ms
r22_codeLUT: 9157ms
r22_dataLUT: 9157ms    
Athlon 64 X2 5000+ "Windsor".
The manufacturer is important, not the number of cores.

_________________
Sorry for my English...
Post 01 Jul 2009, 08:34
Borsuc



AMDs are poor at caching; that's why the lookup table code isn't so good.
Post 01 Jul 2009, 14:12
Nikolay Petrov



In the test example, in this loop:
align 16
..loop:
push ebx ; Instead of push 0 to "sabotage" the branch predictor at my1 a bit
call func
add esp, 4
sub ebx, 1
jnz ..loop
comment out "push ebx" and "add esp, 4".
In the procs, replace "mov reg, [esp+4]" with "mov reg, ebx" or "movzx reg, bl", and see the results.
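
A minimal sketch of what that change might look like, assuming the tester macro from the benchmark earlier in the thread. The name my2_reg is hypothetical (it does not appear in the thread), and the sketch makes no claim about the timings it would give on any particular CPU:

```asm
; Loop body inside the tester macro, with the stack argument removed.
; The counter in ebx now doubles as the argument.
  align 16
  ..loop:
    ; push ebx        ; commented out: nothing is passed on the stack
    call func
    ; add esp, 4      ; commented out: nothing to pop
    sub ebx, 1
    jnz ..loop

; my2 adapted to read its argument from bl instead of [esp+4].
; my2_reg is a hypothetical name used only for this illustration.
align 16
my2_reg:
  movzx edx, bl       ; was: mov edx, [esp+4]
  xor ecx, ecx
  mov eax, 1
  cmp edx, '0'
  cmovb eax, ecx      ; below '0' -> result 0
  cmp edx, '9'
  cmova eax, ecx      ; above '9' -> result 0
  ret
```

Note that "movzx reg, bl" restricts the test to the low byte, whereas "mov reg, ebx" keeps the whole dword, so the two substitutions do not test exactly the same thing.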

_________________
regards
Post 01 Jul 2009, 18:06
LocoDelAssembly
Nikolay Petrov, now I get the expected results. I have changed the benchmark so that all of the procs take their argument in a register (a fastcall-style convention) instead of on the stack. I've also fixed some of them, because they were working on the entire dword instead of the lower byte, as (I believe) they should.
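
To illustrate the dword-versus-byte point: because the loop counter is passed as the argument, a proc that compares the full 32-bit value answers a different question than one that looks only at the low byte. A byte-only range check could be sketched as follows. It assumes the argument arrives in eax (only al is used), and locos1_byte is a hypothetical name, not a proc from the benchmark:

```asm
; Byte-only variant of the subtract-and-compare digit test.
; locos1_byte is a hypothetical name used only for this illustration.
align 16
locos1_byte:
  sub   al, '0'       ; al = c - '0', wraps around to a large value when c < '0'
  cmp   al, 9         ; unsigned compare: in range only if al <= 9
  setbe al            ; al = 1 for '0'..'9', otherwise 0
  movzx eax, al       ; widen the result to a full dword
  ret
```

With the dword form ("sub eax, '0'" followed by "cmp eax, 9"), the upper 24 bits of the counter take part in the comparison, so almost every iteration yields 0 regardless of the low byte.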

New times:
Code:
my1: 11609ms
my2: 7563ms
locos1: 6484ms
borsucs: 6484ms
NikolayPetrov: 6484ms
r22_codeLUT: 6484ms
r22_dataLUT: 6468ms
--
my1: 11594ms
my2: 7563ms
locos1: 6485ms
borsucs: 6485ms
NikolayPetrov: 6485ms
r22_codeLUT: 6485ms
r22_dataLUT: 6485ms
--
my1: 11594ms
my2: 7641ms
locos1: 6469ms
borsucs: 6469ms
NikolayPetrov: 6485ms
r22_codeLUT: 6485ms
r22_dataLUT: 6485ms
--
my1: 11594ms
my2: 7563ms
locos1: 6485ms
borsucs: 6485ms
NikolayPetrov: 6485ms
r22_codeLUT: 6485ms
r22_dataLUT: 6485ms    


New code:
Code:
format PE console 4.0
entry start

include 'win32ax.inc'

section '__text__' code readable executable

proc start
local ProcessAffinityMask:DWORD, SystemAffinityMask:DWORD

      invoke   GetCurrentProcess
      mov      ebx, eax
      invoke   SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
      invoke   GetCurrentThread
      invoke   SetThreadPriority, eax, THREAD_PRIORITY_TIME_CRITICAL

;;;; On multi-{core|CPU} systems it is required to set the affinity as well, but I don't have one :D
;;;; f0dder insisted on adding it anyway, so here it is :P

      invoke   GetProcessAffinityMask, ebx, addr ProcessAffinityMask, addr SystemAffinityMask
      test     eax, eax
      jz       .beginTests

; Let's choose only one of the processors allowed for this process
      mov      esi, 1
      bsf      ecx, [ProcessAffinityMask]
      shl      esi, cl

; Now, since Win9x/Me is smart enough to have GetProcessAffinityMask but not its counterpart, we must check that it exists first

      invoke   GetModuleHandle, 'KERNEL32.DLL'
      test     eax, eax
      jz       .beginTests

      invoke   GetProcAddress, eax, 'SetProcessAffinityMask'
      test     eax, eax
      jz       .beginTests

      stdcall  eax, ebx, esi



macro tester func
{
local ..loop

      invoke  Sleep, 1000

      xor     eax, eax
      cpuid

      invoke  GetTickCount
      mov     [timestart], eax
      mov     ebx, $80000000

  align 16
  ..loop:
        mov     eax, ebx
        call    func

        dec     ebx
        jnz     ..loop

; Serialize
      xor     eax, eax
      cpuid

      invoke  GetTickCount
      sub     eax, [timestart]

      push    eax
      call    @f
      db      `func, 0

@@:
      push    fmt
      call    [printf]
      add     esp, 12

align 16
}

;;;;;;;;;;;;;;;;;;;;;;;;;;

.beginTests:
      tester  my1
      tester  my2
      tester  locos1
      tester  borsucs
      tester  NikolayPetrov
      tester  r22_codeLUT
      tester  r22_dataLUT

      push    _pause
      call    [system]

      add     esp,4
      invoke  ExitProcess, 0
endp


;;;;;;;;;;;;;;;;;;;;;;;;;;


align 16
my1:
      cmp     al, '0'
      jb      .false

      cmp     al, '9'
      ja      .false

      mov     eax, 1

      ret

.false:
      xor     eax, eax

      ret


  
align 16
my2:
      mov     dl, al
      xor     ecx, ecx
      mov     eax, 1

      cmp     dl, '0'
      cmovb   eax, ecx

      cmp     dl, '9'
      cmova   eax, ecx

      ret



align 16
locos1:
      mov     eax, ebx        ; NOTE: still tests the whole dword from ebx, not just al like the other procs
      sub     eax, '0'

      cmp     eax, 9
      setbe   al

      movzx   eax, al

      ret
  


align 16
borsucs:
      sub     al, '0'
      cmp     al, 10
      sbb     eax, eax

      ret
  


align 16
NikolayPetrov:
      movzx   eax, al
      movzx   eax, byte [_is_dec_table + eax]

      ret
align 16
_is_dec_table:
      db      48 dup(0)
      db      10 dup(1)
      db      198 dup(0)



align 16
r22_dataLUT:
      movzx   eax, al
      mov     eax, [_is_dec_DATA + eax*4]

      ret



align 16
r22_codeLUT:
      movzx   eax, al
      mov     eax, [_is_dec_CODE+eax*4]

      ret
align 16
_is_dec_CODE:
      dd      48 dup(0)
      dd      10 dup(1)
      dd      198 dup(0)

section '__data__' data readable writable

align 16
_is_dec_DATA:
      dd      48 dup(0)
      dd      10 dup(1)
      dd      198 dup(0)

timestart dd 0

fmt    db "%s: ", "%dms", 10, 0
_pause db "pause", 0

section '_import_' import readable

  library msvcrt, 'msvcrt.dll',\
    kernel32, 'kernel32.dll'
    
  import msvcrt,\
    printf, 'printf',\
    system, 'system'
    
  include 'api/kernel32.inc'    


I have rescued the affinity mask thing too, from here (hello f0dder :P).

OK, any ideas why reading the parameter from memory produces such a huge overhead in the LUT-based procs on AMD processors? (If that was the real cause.)
Post 01 Jul 2009, 19:29
r22



Mem.LUT + (Mem.Stack -> Reg) -> Reg

Back-to-back memory reads from two different sources, using the data from the first read in the second read.

Partial stall penalty?

Borsuc hit the nail on the head: AMD's caching logic (or just the amount of cache) is inferior to Intel's.
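
The chain written above as "Mem.LUT + (Mem.Stack -> Reg) -> Reg" can be seen by putting the two versions of the lookup proc from this thread side by side. This is only an illustration of the dependency being described, using the table name from the benchmark; it does not by itself establish why the Athlon 64 results differ:

```asm
; Stack-argument version (first benchmark): two dependent loads.
    movzx   eax, byte [esp+4]               ; load 1: argument that "push ebx" has just stored
    movzx   eax, byte [_is_dec_table+eax]   ; load 2: its address depends on the result of load 1

; Register-argument version (rewritten benchmark): one load.
    movzx   eax, al                         ; no memory access, argument is already in a register
    movzx   eax, byte [_is_dec_table+eax]   ; the only load
```

In the first version the table read cannot start until the stack read has delivered its byte, and that stack read is itself a 1-byte load from a location written by a 4-byte push immediately before the call.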
Post 01 Jul 2009, 20:01


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.
