flat assembler
Message board for the users of flat assembler.

Index > Main > mul instruction latency on

tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 14 Jan 2010, 21:28
IIRC, Intel CPUs can do a 32-bit multiply in 4 cycles and a 64-bit multiply in 10-14 cycles. Seeing as both addition and multiplication circuits can be designed with a propagation delay of O(log n), I see no reason for this discrepancy (AMD CPU latencies are 3 and 5 cycles, respectively). Could it be that:
a) Intel whipped up a lousy multiplier for their 64-bit CPUs but is planning to fix it in later versions,
b) the latency will forever be this high, but they will rely on good pipelining for performance, or
c) the multiply instruction is not seen as that important for mainstream apps, and thus has not been improved?

PS: I've already asked this question on their site with no response.
Borsuc



Joined: 29 Dec 2005
Posts: 2465
Location: Bucharest, Romania
Borsuc 14 Jan 2010, 23:31
A bit off topic but I thought addition was done with combinatorial logic, thus 1 clock cycle.
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 14 Jan 2010, 23:37
Quote:

A bit off topic but I thought addition was done with combinatorial logic, thus 1 clock cycle.

Yes, but what he means here is how long it takes for the signals to propagate through the entire circuit until the correct result appears at the output. With some adders, that time grows logarithmically with the operand width (n is the number of bits here).
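
As a rough illustration of the O(log n) point, here is a sketch of a Kogge-Stone style parallel-prefix adder, written in Python for brevity. The function name is made up for this example, and it only models the logic levels of the carry network, not any particular CPU's adder.

```python
def kogge_stone_add(a, b, n):
    """Add two n-bit integers with a Kogge-Stone parallel-prefix carry network.

    Returns (sum modulo 2**n, number of prefix levels used).
    """
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    g = a & b          # generate: bit i produces a carry on its own
    p = a ^ b          # propagate: bit i passes an incoming carry along
    half_sum = p       # kept for the final XOR with the carries
    levels = 0
    dist = 1
    while dist < n:
        # Every bit position is combined with the position `dist` below it,
        # all in parallel, so each loop iteration is one logic level.
        g = (g | (p & (g << dist))) & mask
        p = (p & (p << dist)) & mask
        dist *= 2
        levels += 1
    carries = (g << 1) & mask   # carry into bit i is the group generate of bits 0..i-1
    return half_sum ^ carries, levels


if __name__ == "__main__":
    for width in (8, 16, 32, 64):
        total, depth = kogge_stone_add(123, 45, width)
        print(width, "bits:", total, "in", depth, "levels")
```

Doubling the width from 32 to 64 bits adds a single level (5 to 6), which is the sense in which the delay grows with log n rather than n.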

tthsqe, I'm really surprised by the time it takes Intel to multiply 64x64 to get 128 bits (are you completely sure about this?). Given how long it takes, it is probably micro-coded, and that is where the major delay comes from.
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 15 Jan 2010, 00:05
tthsqe, I've just checked Agner Fog's manuals because the Intel Optimization manual has very little information about this. Core i7 has a 3-cycle latency for "(i)MUL r64" (probably a typo, since r32 takes 5 cycles) and can start a new one every 2 cycles.

The latency on the Pentium4E is 11 cycles for (i)MUL r64, and 16 cycles on the Pentium4 model 2 (but that is for (i)MUL r8/32). On Core2, (i)MUL r64 takes 7 cycles.

In all cases it seems to be above AMD; this may be due to either a micro-coded design or a pipeline that is simply split into short steps.

PS: BTW, can the Phenom II still get the multiplication done in 4 cycles at 3.0 GHz? Perhaps they had to sacrifice a little performance to attain that clock speed.
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 15 Jan 2010, 00:54
Yeah, that 3-cycle latency does not sound right. I guess I'll just have to test it out myself when I get back.
I first noticed this when looking at the GMP multiplication algorithms. For 64-bit systems, I think the same mul-based code is being used, but you can see the differences between a Core i7 and an Opteron at http://gmplib.org/gmpbench.html.
For testing purposes, would this be accurate?
Code:
   rdtsc
   mov [T],eax
   mov [T+4],edx
   mov rax,somenum
   mov rdx,somenum'
   mov ecx,1000
.1:
   rept 10 {mul rdx}
   dec ecx
   jnz .1
   rdtsc
   sub eax,[T]
   sbb edx,[T+4]
   mov ecx,10000
   div ecx
    
windwakr



Joined: 30 Jun 2004
Posts: 827
windwakr 15 Jan 2010, 01:11
Where's the serialization before the rdtsc?

I'd take a read through this if I were you:
http://www.ccsl.carleton.ca/~jamuir/rdtscpm1.pdf

You should also lock the thread to one core if you're on a multicore system, and set it to high priority.

LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 15 Jan 2010, 01:13
I don't know how to test timings at clock-cycle level, but I think you'll need to do these things:
* Fix thread affinity to one core
* Warm up the processor, because on newer Intel processors the TSC runs at a fixed speed instead of incrementing more slowly in low-power mode, so a core that is still clocked down would make the instruction look slower than it is.
* Add CPUID or some other serializing instruction so the time is taken when the multiplications have actually finished (and perhaps you'll need one before entering the loop, but I'm not sure exactly where it should go).

PS: Well, windwakr posted while I was previewing.
edfed



Joined: 20 Feb 2006
Posts: 4330
Location: Now
edfed 15 Jan 2010, 01:19
to measure time at CPU level, you can do it with RDTSC.
put rdtsc before the function that will use the code you want to measure.
put the code you want to measure inside the function.
put rdtsc again after it and compute the difference between the two timings.
you will know how many CPU cycles one instruction takes in a function.
and the effect of the alignment of this instruction can be analysed too.

the code to do it should not be very hard to write.

also, i read one time in the IA manuals that:

timing at instruction level can be measured with single-step mode (int 1, i think, is generated each time an instruction is executed).

after that, i read somewhere else that there are MSRs that will do the timing measurement for instructions, and also count the number of times an instruction is executed...
but i don't know more about them.
windwakr



Joined: 30 Jun 2004
Posts: 827
windwakr 15 Jan 2010, 04:29
I'm a bit sleepy, can someone take a look over my code?

This is for 32-bit, as I have no 64-bit OS, but it should be easy enough to modify into 64-bit.

It SHOULD return the number of clocks for 1000 muls....SHOULD....but like I said, I'm a little tired, so who knows.

EDIT:*CODE REMOVED* Hope no one saw it, I was way too tired to be coding anything....



Last edited by windwakr on 15 Jan 2010, 06:55; edited 2 times in total
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 15 Jan 2010, 06:19
ok, implementing everyone's suggestions I get:
Code:
format PE GUI 4.0
entry start
include 'win32a.inc'
Runs = 10000
LoopUnroll equ 10
HIGH_PRIORITY_CLASS = 0x080

t1.l equ ebp
t1.h equ esi
t2.l equ edi
t2.h equ ebx

section '.text' code readable executable
start:
   invoke GetCurrentProcess
   invoke SetPriorityClass,eax,HIGH_PRIORITY_CLASS

   mov esi,100000
.warmup:
   irps reg, eax ebx ecx edx {xor reg,reg}
   dec esi
   jnz .warmup

   cpuid
   rdtsc
   irps reg, eax ebx ecx edx {xor reg,reg}
   cpuid
   rdtsc
   mov t1.l,eax
   mov t1.h,edx
   irps reg, eax ebx ecx edx {xor reg,reg}
   cpuid
   rdtsc
   mov t2.l,eax
   push edx

   mov eax,0xf9d8e41
   mov edx,0x59db234
   mov ecx,Runs
.1: 
   rept LoopUnroll {mul edx}
   dec ecx 
   jnz .1

   irps reg, eax ebx ecx edx {xor reg,reg}
   cpuid
   rdtsc
   pop t2.h
   sub eax,t2.l
   sbb edx,t2.h
   sub t1.l,t2.l
   sbb t1.h,t2.h
   add eax,t1.l
   adc edx,t1.h

   sub esp,4*2
   mov dword[esp-4],Runs*LoopUnroll
   mov [esp],eax
   mov [esp+4],edx
   fild qword[esp]
   fidiv dword[esp-4]
   fstp qword[esp]
   push messageformat
   push message
   call [sprintf]
   add esp,4*4

   invoke MessageBox,0,message,caption,0
   invoke ExitProcess,0

section '.data' data readable writeable
message rb 256
messageformat db '%f',0
caption db 'results',0
section '.idata' import data readable writeable
library kernel,'KERNEL32.DLL',\
        user,'USER32.DLL',\
        msvcrt,'MSVCRT.DLL'
import kernel,\
       ExitProcess,'ExitProcess',\
       GetCurrentProcess,'GetCurrentProcess',\
       SetPriorityClass,'SetPriorityClass'
import user,MessageBox,'MessageBoxA'
import msvcrt,sprintf,'sprintf'    


I am still getting unstable values between 7.3 and 7.5 on an Atom.
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 15 Jan 2010, 06:21
windwakr, your loop is too small: if LoopUnroll is set to 1, the timing goes up to 8.9 on mine.
EDIT: your code gives anywhere from 8.9 to 9.5 on mine.
windwakr



Joined: 30 Jun 2004
Posts: 827
windwakr 15 Jan 2010, 07:25
Ya, as I was dozing off to sleep I saw all the errors in my code. Had to get up and try to fix them. Try this:
Code:
format PE GUI 4.0
entry start

include 'win32a.inc'

section '.text' code readable executable

  start:
        invoke GetCurrentProcess
        invoke SetPriorityClass,eax,HIGH_PRIORITY_CLASS
        invoke GetCurrentThread
        push eax
        invoke SetThreadAffinityMask,eax,1
        pop eax
        invoke  SetThreadPriority,eax,THREAD_PRIORITY_HIGHEST

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Measure the overhead of the timing instructions (lifted from the Intel doc :) with some modifications)
        cpuid
        rdtsc
        mov [subtime1], eax
        mov [subtime2], edx
        mov ecx,1000
     .2:
        dec ecx
        jnz .2
        cpuid
        rdtsc
        sub eax, [subtime1]
        sbb edx, [subtime2]   ; high dword must take the borrow from the low dword
        mov [subtime1], eax
        mov [subtime2], edx

        cpuid
        rdtsc
        mov [subtime1], eax
        mov [subtime2], edx
        mov ecx,1000
     .3:
        dec ecx
        jnz .3
        cpuid
        rdtsc
        sub eax, [subtime1]
        sbb edx, [subtime2]   ; high dword must take the borrow from the low dword
        mov [subtime1], eax
        mov [subtime2], edx

        cpuid
        rdtsc
        mov [subtime1], eax
        mov [subtime2], edx
        mov ecx,1000
     .4:
        dec ecx
        jnz .4
        cpuid
        rdtsc
        sub eax, [subtime1]
        sbb edx, [subtime2]   ; high dword must take the borrow from the low dword
        mov [subtime1], eax
        mov [subtime2], edx
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
        mov ecx,1
        jmp .1
.0:
        dec [pass1]
        mov ebx,0
        cpuid
        rdtsc
        mov [time],eax
        mov [time2],edx
        mov ecx,1000
  .1:
        mul edx
        mul edx
        mul edx
        mul edx
        mul edx
        mul edx
        mul edx
        mul edx
        mul edx
        mul edx
        dec ecx
        jnz .1
        cpuid
        rdtsc
        sub eax,[time]
        sbb edx,[time2]       ; 64-bit subtract: borrow from the low dword
        sub eax,[subtime1]
        sbb edx,[subtime2]    ; same for the overhead correction
        mov [time],eax
        mov [time2],edx

        cmp [pass1],0
        ja .0

        fild qword [time]
        fidiv [tenthou]
        fstp qword[esp]
        push fmt
        push buf
        call [sprintf]
        invoke MessageBox,NULL,buf,title,MB_OK
        invoke ExitProcess,0

section '.data' data readable writeable

  pass1 dd 10000
  subtime1 dd 0
  subtime2 dd 0
  outpt dd 0
  tenthou dd 10000
  time dd 0
  time2 dd 0
  title db '...',0
  fmt db '~%f clocks',0
  buf rb 256


section '.idata' import data readable writeable

  library kernel32,'KERNEL32.DLL',\
          user32,'USER32.DLL',\
          msvcrt,'MSVCRT.DLL'

  include 'api\kernel32.inc'
  include 'api\user32.inc'
  import msvcrt,sprintf,'sprintf'
    


Gives me a smaller number than yours does, by about 0.3 clocks... BUT that could just be another lack-of-sleep-induced error in my code. :D
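
One thing worth checking in any 32-bit listing like this: when a 64-bit TSC difference is computed in two 32-bit halves, the high half has to take the borrow from the low half (`sbb` rather than a plain `sub`). The following Python sketch emulates both variants on 32-bit halves; the function names and the sample counter values are made up for the illustration.

```python
MASK32 = 0xFFFFFFFF


def sub64_with_borrow(end_lo, end_hi, start_lo, start_hi):
    """Emulate `sub eax,[lo]` followed by `sbb edx,[hi]`."""
    lo = end_lo - start_lo
    borrow = 1 if lo < 0 else 0          # carry flag after the low subtract
    hi = end_hi - start_hi - borrow
    return lo & MASK32, hi & MASK32


def sub64_without_borrow(end_lo, end_hi, start_lo, start_hi):
    """Emulate `sub eax,[lo]` followed by a plain `sub edx,[hi]`."""
    lo = (end_lo - start_lo) & MASK32
    hi = (end_hi - start_hi) & MASK32    # borrow from the low half is lost
    return lo, hi


if __name__ == "__main__":
    # Hypothetical counter values: the low dword wraps between the two reads.
    start = (0xFFFFFF00, 0x00000001)     # (low, high)
    end = (0x00000100, 0x00000002)
    print(sub64_with_borrow(end[0], end[1], start[0], start[1]))
    print(sub64_without_borrow(end[0], end[1], start[0], start[1]))
```

The two variants only disagree when the low dword wraps between the two reads, so a missing borrow tends to show up as an occasional wildly wrong sample rather than a constant offset.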

bitRAKE



Joined: 21 Jul 2003
Posts: 4024
Location: vpcmpistri
bitRAKE 15 Jan 2010, 07:53
Try using MUL EAX/ECX instead of MUL EDX - which forces every MUL to wait on the result of the upper DWORD. All tests are useful. Many processors have the result in EAX earlier than EDX is ready.
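
The effect described here can be illustrated with a toy dependency model, sketched in Python. It assumes each MUL reads EAX plus its explicit operand, delivers the low half to EAX after `lat_lo` cycles and the high half to EDX after `lat_hi` cycles, and that nothing else limits execution. The function name and the latencies 3 and 4 are invented for the example and are not measurements of any processor.

```python
def cycles_per_mul(operand, lat_lo, lat_hi, count=1000):
    """Average cycles per instruction for `count` back-to-back `mul <operand>`.

    Toy model: a MUL starts once EAX and its explicit operand are ready,
    writes EAX after lat_lo cycles and EDX after lat_hi cycles.
    """
    ready = {"eax": 0, "ecx": 0, "edx": 0}   # cycle at which each register is valid
    for _ in range(count):
        start = max(ready["eax"], ready[operand])   # wait for both inputs
        ready["eax"] = start + lat_lo
        ready["edx"] = start + lat_hi
    return max(ready["eax"], ready["edx"]) / count


if __name__ == "__main__":
    # Hypothetical latencies: low half after 3 cycles, high half after 4.
    print("mul edx:", cycles_per_mul("edx", 3, 4))
    print("mul ecx:", cycles_per_mul("ecx", 3, 4))
```

In this model a chain of `mul edx` settles at the high-half latency, while `mul ecx` or `mul eax` settles at the low-half latency, so the two tests measure different things.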
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 15 Jan 2010, 07:56
Well, based on previous windwakr code, I hope I'm not carrying critical errors

Code:
format PE GUI 4.0
entry start
include 'win32ax.inc'
Runs = 1
LoopUnroll equ 1
HIGH_PRIORITY_CLASS = 0x080
WARMUP_TIME = 1000
SAMPLES = 3 ; IF YOU NEED HIGHER THAN 8 YOU'LL NEED TO EDIT THE CODE TOO
ALIGNMENT = 4

t1.l equ ebp
t1.h equ esi
t2.l equ edi
t2.h equ ebx

section '.text' code readable executable
start:
      invoke   GetCurrentProcess
      mov      ebx, eax
      invoke   SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
      invoke   GetCurrentThread
      invoke   SetThreadPriority, eax, THREAD_PRIORITY_TIME_CRITICAL

      invoke   GetProcessAffinityMask, ebx, addr ProcessAffinityMask, addr SystemAffinityMask
      test     eax, eax
      jz       .begin

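      bsf      ecx, [ProcessAffinityMask]
      mov      eax, 1                      ; do not rely on the API's nonzero return value being exactly 1
      shl      eax, cl                     ; mask selecting the lowest allowed core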

      invoke   SetProcessAffinityMask, ebx, eax

.begin:
   invoke GetTickCount
   mov    ebx, eax
   add    ebx, 1000
   sub    esp, 4*2

.warmup:
   invoke GetTickCount
   sub    eax, ebx
   jc     .warmup

   invoke Sleep, 30 ; An attempt to begin with a fresh time slice. I hope Windows won't take the opportunity to downgrade processor's speed for this little pause.

.begin_test:

repeat 2
if % and 1
   virtual at $
else
  while  ($+LENGTH) mod ALIGNMENT ; VERY STUPID I KNOW, MY BRAIN ISN'T WORKING MUCH REALLY...
    nop
  end while
end if

   cpuid
   rdtsc
   xor  eax, eax
   cpuid
   rdtsc
   mov t1.l,eax
   mov t1.h,edx
   xor eax, eax
   cpuid
   rdtsc
   mov t2.l,eax
   push edx

   mov eax,0xf9d8e41
   mov edx,0x59db234
   mov ecx,Runs

if % and 1
  LENGTH = $-$$
  end virtual
end if

end repeat

.1: ; Aligned to ALIGNMENT bytes
   rept LoopUnroll {mul edx}
   dec ecx 
   jnz .1

   xor eax, eax
   cpuid
   rdtsc
   pop t2.h
   sub eax,t2.l
   sbb edx,t2.h
   sub t1.l,t2.l
   sbb t1.h,t2.h
   add eax,t1.l
   adc edx,t1.h

   mov dword[esp-4],Runs*LoopUnroll
   mov [esp],eax
   mov [esp+4],edx
   fild qword[esp]
   fidiv dword[esp-4]

   dec  [samples]
   jnz  .begin_test

rept 2
{
   fcomi st, st1
   ja    @f
   fxch  st1
@@:
   fstp  st0
}

   fstp qword[esp]
   push messageformat
   push message
   call [sprintf]
   add esp, 4*4 + 4*2

   invoke MessageBox,0,message,caption,0
   invoke ExitProcess,0
ret

section '.data' data readable writeable
message rb 256
messageformat db '%f',0
caption db 'results',0

align 4
samples dd SAMPLES

ProcessAffinityMask dd ?
SystemAffinityMask dd ?

section '.idata' import data readable writeable
library kernel,'KERNEL32.DLL',\
        user,'USER32.DLL',\
        msvcrt,'MSVCRT.DLL'
import kernel,\
       ExitProcess,'ExitProcess',\
       GetCurrentThread, 'GetCurrentThread',\
       GetCurrentProcess,'GetCurrentProcess',\
       GetTickCount, 'GetTickCount',\
       GetProcessAffinityMask, 'GetProcessAffinityMask',\
       SetProcessAffinityMask, 'SetProcessAffinityMask',\
       SetPriorityClass, 'SetPriorityClass',\
       SetThreadPriority, 'SetThreadPriority',\
       Sleep, 'Sleep'

import user,MessageBox,'MessageBoxA'
import msvcrt,sprintf,'sprintf'    


I've run that more than 10 times on my XP machine, and every run gave me 3.000000.

BTW, when I change LoopUnroll from 3 to N it starts to take 2 cycles and then grows from there. There is something wrong, but it is almost 5 AM here, so I'll let others figure that out :)

One more note: don't make the test loops very long. Try to ensure the code runs within a single time slice (or quantum), and avoid interrupts as much as possible while it executes, because otherwise your results will be contaminated (it is even questionable whether setting high priority is really needed here).
bitRAKE



Joined: 21 Jul 2003
Posts: 4024
Location: vpcmpistri
bitRAKE 15 Jan 2010, 10:06
Hm...I either get 7.000 or 0.000 through several runs.

Yet, my own code returns the following:

MUL EDX 4.4890
MUL RDX 6.5664

It's tricky timing a single instruction. I can fill the bytes following the MUL with NOPs (up to eight bytes total) without affecting the timing. The numbers seem good, as Agner Fog lists the latency for my processor at 5 and 7 cycles for 32-bit and 64-bit, respectively.
bitshifter



Joined: 04 Dec 2007
Posts: 796
Location: Massachusetts, USA
bitshifter 15 Jan 2010, 10:37
It's better to count uops than to use rdtsc.
Agner Fog gives a few versions of each tester.
rdtsc will give funny results because of the processor's clock multiplier.
So use the PMC version instead of the TSC version.
If it is configured correctly for your chip, then everyone's numbers match up.
And don't forget that priming the cache and ignoring the first pass is a good idea.
There is a sample test in his PMC tester that everyone can use as a baseline.
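
The multiplier remark can be made concrete. On processors whose TSC ticks at a fixed reference rate, a TSC delta is not a count of core clock cycles unless the core happens to run at that same rate. A minimal sketch of the conversion, in Python for brevity; the function name and the frequencies are hypothetical, and a real tester would have to obtain both frequencies from the system:

```python
def tsc_to_core_cycles(tsc_ticks, tsc_hz, core_hz):
    """Convert a TSC delta to core clock cycles.

    Assumes the TSC ticks at the constant rate tsc_hz while the core
    ran at the constant rate core_hz during the measurement.
    """
    return tsc_ticks * core_hz / tsc_hz


if __name__ == "__main__":
    # Hypothetical case: TSC at 2.0 GHz, core clocked down to 1.0 GHz.
    # A reading of 7 TSC ticks then corresponds to 3.5 core cycles.
    print(tsc_to_core_cycles(7.0, 2.0e9, 1.0e9))
```

Performance counters that count core cycles directly avoid this conversion, which is one reason a PMC-based tester can give numbers that agree between machines.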
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 15 Jan 2010, 20:24
bitshifter, but will counting uops give the correct latency? Perhaps the instruction issues several uops distributed over several ports, so the latency could be shorter in that case?

This is my latest code based on code from here and bitRAKE's link:
Code:
format PE GUI 4.0
entry start
include 'win32ax.inc'
HIGH_PRIORITY_CLASS = 0x080
WARMUP_TIME = 1000
SAMPLES = 7
ALIGNMENT = 64

macro   prolog
{
local ..begin, ..end

while  ($+..end-..begin) mod ALIGNMENT
        nop
end while

..begin:
        xor     eax, eax
        cpuid
        rdtsc
        mov     [tsc_lo], eax
        mov     [tsc_hi], edx
        xor     eax, eax
        cpuid
..end:
}

macro   epilog
{
        xor     eax, eax
        cpuid
        rdtsc
        sub     eax, [tsc_lo]
        sbb     edx, [tsc_hi]
}

macro empty
{
  xor ecx, edx
}

; Please also try this (the program reports -1 here):
;macro empty
;{
;  push edx
;  pop  edx
;}

; And this one too (the program reports 1 cycle here)
;macro empty
;{
;}

macro mult
{
  rept 1 ; Tried several numbers but it always reported 3*n cycles (provided the XOR ECX, EDX variant of "empty" macro is called)
  \{
    mul edx
  \}
  empty ; Removing this makes the code not take multiples of 3 cycles
}

section '.text' code readable executable
start:
      invoke   GetCurrentProcess
      mov      ebx, eax
      invoke   SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
      invoke   GetCurrentThread
      invoke   SetThreadPriority, eax, THREAD_PRIORITY_TIME_CRITICAL

      invoke   GetProcessAffinityMask, ebx, addr ProcessAffinityMask, addr SystemAffinityMask
      test     eax, eax
      jz       .begin

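      bsf      ecx, [ProcessAffinityMask]
      mov      eax, 1                      ; do not rely on the API's nonzero return value being exactly 1
      shl      eax, cl                     ; mask selecting the lowest allowed core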

      invoke   SetProcessAffinityMask, ebx, eax

.begin:
   invoke GetTickCount
   mov    ebx, eax
   add    ebx, 1000

.warmup:
   invoke GetTickCount
   sub    eax, ebx
   jc     .warmup

irp algo, empty, mult
{
local .begin_test, .find_min, .next

   mov    esi, -SAMPLES

align ALIGNMENT
.begin_test:
   prolog

     algo

   epilog

   mov  dword [timings+SAMPLES*8+esi*8], eax
   mov  dword [timings+SAMPLES*8+esi*8+4], edx

   inc  esi
   jnz  .begin_test

   mov  esi, -SAMPLES
   mov  ebx, dword [timings]
   mov  ecx, dword [timings+4]
   jmp  .next

.find_min:
   mov  eax, ebx
   mov  edx, ecx

   sub  eax, dword [timings+SAMPLES*8+esi*8]
   sbb  edx, dword [timings+SAMPLES*8+esi*8+4]
   jc   .next

   mov  ebx, dword [timings+SAMPLES*8+esi*8]
   mov  ecx, dword [timings+SAMPLES*8+esi*8+4]

.next:
   inc  esi
   jnz  .find_min
   push ecx
   push ebx
}

   fild qword[esp]
   fild qword[esp+8]
   fsubp
   fstp qword[esp]
   push messageformat
   push message
   call [sprintf]
   add esp, 4*2 + 8*2

   invoke MessageBox,0,message,caption,0
   invoke ExitProcess,0


section '.data' data readable writeable
messageformat db '%f',0
caption db 'results',0

align 8
tsc_lo  dd ?
tsc_hi  dd ?

timings rq SAMPLES

ProcessAffinityMask dd ?
SystemAffinityMask dd ?

message rb 256

section '.idata' import data readable writeable
library kernel,'KERNEL32.DLL',\
        user,'USER32.DLL',\
        msvcrt,'MSVCRT.DLL'
import kernel,\
       ExitProcess,'ExitProcess',\
       GetCurrentThread, 'GetCurrentThread',\
       GetCurrentProcess,'GetCurrentProcess',\
       GetTickCount, 'GetTickCount',\
       GetProcessAffinityMask, 'GetProcessAffinityMask',\
       SetProcessAffinityMask, 'SetProcessAffinityMask',\
       SetPriorityClass, 'SetPriorityClass',\
       SetThreadPriority, 'SetThreadPriority',\
       Sleep, 'Sleep'

import user,MessageBox,'MessageBoxA'
import msvcrt,sprintf,'sprintf'    


This always gives me 3 cycles on my Athlon64 Venice with WinXP SP3 (tested more than 20 times for sure). Check the comments for extra info. Clearly, instruction latency is very hard to measure properly :(
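
The post-processing in the listing above (take the minimum of SAMPLES timings for the `empty` variant and for the `mult` variant, then report the difference) can be sketched in Python as follows. The function name and the sample values are invented for the illustration.

```python
def estimate_cycles(empty_samples, mult_samples):
    """Difference of the fastest run with MUL and the fastest run without it.

    Taking the minimum discards samples inflated by interrupts, cache
    misses or a context switch, which only ever add time.
    """
    return min(mult_samples) - min(empty_samples)


if __name__ == "__main__":
    # Hypothetical TSC deltas; one sample in each list was disturbed.
    empty = [212, 209, 209, 340, 209, 210, 209]
    mult = [215, 212, 212, 212, 298, 212, 213]
    print(estimate_cycles(empty, mult))
```

Because the two minima come from separate runs, the difference can still come out as 0 or even negative when the baseline variant happens to schedule worse than the measured one, which matches the odd readings mentioned in the code comments.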

Perhaps processors with RDTSCP can get better measurements? (It waits until all previous instructions have executed before reading the counter, although it is not a fully serializing instruction.)

[edit]On an old Athlon 750 MHz SlotA it takes 4 cycles but I don't know the official timing so I'm not sure if it was measured correctly.[/edit]
r22



Joined: 27 Dec 2004
Posts: 805
r22 15 Jan 2010, 20:58
I get 0.00000 or 7.00000 (mostly 0) on an Intel Core2 Q8200
bitshifter



Joined: 04 Dec 2007
Posts: 796
Location: Massachusetts, USA
bitshifter 15 Jan 2010, 21:02
The PMC tester has to be configured for your processor.
I tried your tester and got 10.000000 on a 2.8 GHz Celeron.
That doesn't mean it's coded wrong, just that it's not a good measurement.
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 15 Jan 2010, 21:24
Tried it on an AMD Phenom II at 3.2 GHz and got 0 and 2 (mostly 2). The code is shit, but I won't invest more time in this :)
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
