flat assembler
Message board for the users of flat assembler.
Main > mul instruction latency on
Borsuc 14 Jan 2010, 23:31
A bit off topic but I thought addition was done with combinatorial logic, thus 1 clock cycle.
LocoDelAssembly 14 Jan 2010, 23:37
Quote:
Yes, but he means how long it takes for the signals to propagate through the entire circuit until the correct result is output. With some adders, the time increases logarithmically as more bits are added (n is the number of bits here). tthsqe, I'm really surprised at the time it takes Intel to multiply 64x64 to get 128 bits (are you completely sure about this?). Given the time it takes, it is probably micro-coded, and that is where the major delay comes from.
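To put rough numbers on the propagation-delay point, here is a small C sketch (not from the thread) that compares the number of stages a carry has to cross in a ripple-carry adder with the number of levels in a parallel-prefix adder. The function names and the "one unit of delay per stage" model are assumptions made for illustration; real adder circuits differ in detail.

```c
#include <assert.h>

/* Hypothetical model: in a ripple-carry adder the carry passes through
   one full-adder stage per bit, so the worst case is n stages. */
unsigned ripple_carry_depth(unsigned n_bits)
{
    assert(n_bits > 0);
    return n_bits;
}

/* Hypothetical model: a parallel-prefix adder combines carries in a
   tree, so the carry network needs ceil(log2(n)) levels. */
unsigned prefix_adder_depth(unsigned n_bits)
{
    assert(n_bits > 0);
    unsigned levels = 0;
    unsigned long long span = 1;   /* bits covered after `levels` levels */
    while (span < n_bits) {
        span *= 2;
        levels++;
    }
    return levels;
}
```

Under this model a 64-bit ripple-carry adder has 64 stages on its critical path while the prefix adder has 6 levels, which is the logarithmic growth mentioned above.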
LocoDelAssembly 15 Jan 2010, 00:05
tthsqe, I've just checked Agner Fog's manuals because the Intel Optimization manual has very little information about this. The Core i7 has a latency of 3 cycles for "(i)MUL r64" (probably a typo, since r32 takes 5 cycles) and can execute a new one every 2 cycles.
The latency on the Pentium4E is 11 cycles for (i)MUL r64, and 16 cycles on the Pentium4 model 2 (but for (i)MUL r8/32). On Core2, (i)MUL r64 takes 7 cycles. In all cases it seems to be above AMD; this may be due either to a micro-coded design or to the multiplier simply being pipelined in short steps.
PS: BTW, can the PhenomII still get the multiplication done in 4 cycles at 3.0 GHz? Perhaps they had to sacrifice a little performance to attain that clock speed.
tthsqe 15 Jan 2010, 00:54
Yeh, that 3 cycle latency does not sound right. I guess I'll just have to test it out myself when I get back.
I first noticed this when looking at the GMP multiplication algorithms. For 64-bit systems, I think the same code using mul is being used, but you can see the differences between a Core i7 and an Opteron at http://gmplib.org/gmpbench.html. For testing purposes, should this be accurate?
Code:
    rdtsc
    mov [T],eax
    mov [T+4],edx
    mov rax,somenum
    mov rdx,somenum'
    mov ecx,1000
.1:
    rept 10 {mul rdx}
    dec ecx
    jnz .1
    rdtsc
    sub eax,[T]
    sbb edx,[T+4]
    mov ecx,10000
    div ecx
windwakr 15 Jan 2010, 01:11
Where's the serialization before the rdtsc?
I'd take a read through this if I were you: http://www.ccsl.carleton.ca/~jamuir/rdtscpm1.pdf
You should also lock it to one core if you're on a multicore system and set the thread to high priority.
LocoDelAssembly 15 Jan 2010, 01:13
I don't know how to test timings at the clock-cycle level, but I think you'll need to do these things:
* Fix thread affinity to one core.
* Warm up the processor, because on newer Intel processors the TSC runs at a fixed speed instead of incrementing more slowly in low-power mode.
* Add CPUID or some other serializing instruction so that the time is taken when the multiplications have actually finished (perhaps you'll also need one before entering the loop, but I'm not sure about the exact position it should go).
PS: Well, windwakr posted while I was previewing.
edfed 15 Jan 2010, 01:19
To measure time at the CPU level, you can use RDTSC.
Put rdtsc before the function that will use the code you want to measure, put the code you want to measure inside the function, and compute the difference between the two timings. You will then know how many CPU cycles one instruction takes in a function, and the effect of the alignment of this instruction can be analysed too. The code to do it should not be very hard to write.
Also, I once read in the IA manuals that timing at the instruction level can be measured with the step-by-step instruction mode (a single-step debug exception, int 1, I think, is generated each time an instruction is executed). Later, I read somewhere else that there are MSRs that will do the timing measurement for instructions and also count the number of times an instruction is executed... but I don't know more about it.
windwakr 15 Jan 2010, 04:29
I'm a bit sleepy, can someone take a look over my code?
This is for 32-bit, as I have no 64-bit OS, but it should be easy enough to modify into 64-bit. It SHOULD return the number of clocks for 1000 muls... SHOULD... but like I said, I'm a little tired, so who knows.
EDIT: *CODE REMOVED* Hope no one saw it, I was way too tired to be coding anything...
Last edited by windwakr on 15 Jan 2010, 06:55; edited 2 times in total
tthsqe 15 Jan 2010, 06:19
ok, implementing everyone's suggestions I get:
Code:
format PE GUI 4.0
entry start

include 'win32a.inc'

Runs = 10000
LoopUnroll equ 10
HIGH_PRIORITY_CLASS = 0x080

t1.l equ ebp
t1.h equ esi
t2.l equ edi
t2.h equ ebx

section '.text' code readable executable

start:
    invoke GetCurrentProcess
    invoke SetPriorityClass,eax,HIGH_PRIORITY_CLASS

    ; warm-up loop before any timing is taken
    mov esi,100000
.warmup:
    irps reg, eax ebx ecx edx {xor reg,reg}
    dec esi
    jnz .warmup

    ; first cpuid/rdtsc pair, result discarded
    cpuid
    rdtsc

    ; t1 = timestamp before an empty cpuid/rdtsc interval
    irps reg, eax ebx ecx edx {xor reg,reg}
    cpuid
    rdtsc
    mov t1.l,eax
    mov t1.h,edx

    ; t2 = timestamp right before the timed loop (high dword kept on the stack)
    irps reg, eax ebx ecx edx {xor reg,reg}
    cpuid
    rdtsc
    mov t2.l,eax
    push edx

    mov eax,0xf9d8e41
    mov edx,0x59db234
    mov ecx,Runs
.1:
    rept LoopUnroll {mul edx}
    dec ecx
    jnz .1

    irps reg, eax ebx ecx edx {xor reg,reg}
    cpuid
    rdtsc
    pop t2.h
    sub eax,t2.l            ; edx:eax = end - t2 (loop plus timing overhead)
    sbb edx,t2.h
    sub t1.l,t2.l           ; t1 = t1 - t2 = -(timing overhead)
    sbb t1.h,t2.h
    add eax,t1.l            ; edx:eax = loop time minus timing overhead
    adc edx,t1.h

    ; average = total cycles / (Runs*LoopUnroll)
    sub esp,4*2
    mov dword[esp-4],Runs*LoopUnroll
    mov [esp],eax
    mov [esp+4],edx
    fild qword[esp]
    fidiv dword[esp-4]
    fstp qword[esp]
    push messageformat
    push message
    call [sprintf]
    add esp,4*4
    invoke MessageBox,0,message,caption,0
    invoke ExitProcess,0

section '.data' data readable writeable

message rb 256
messageformat db '%f',0
caption db 'results',0

section '.idata' import data readable writeable

library kernel,'KERNEL32.DLL',\
        user,'USER32.DLL',\
        msvcrt,'MSVCRT.DLL'

import kernel,\
       ExitProcess,'ExitProcess',\
       GetCurrentProcess,'GetCurrentProcess',\
       SetPriorityClass,'SetPriorityClass'

import user,MessageBox,'MessageBoxA'
import msvcrt,sprintf,'sprintf'

I am still getting unstable values between 7.3 and 7.5 on an Atom.
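The arithmetic at the end of the listing above (loop time minus timing overhead, divided by the number of mul instructions) can be sketched in C as follows. This is only an illustration: `cycles_per_instruction` and its parameter names are hypothetical, and the three timestamps stand for the values the listing keeps in t1, t2 and the final rdtsc.

```c
#include <assert.h>
#include <stdint.h>

/* t1:    timestamp before an empty cpuid/rdtsc interval
   t2:    timestamp right before the timed loop
   t_end: timestamp after the timed loop
   The result may be negative if the overhead estimate exceeds the loop time. */
double cycles_per_instruction(uint64_t t1, uint64_t t2, uint64_t t_end,
                              uint32_t runs, uint32_t unroll)
{
    assert(runs > 0 && unroll > 0);
    uint64_t overhead = t2 - t1;       /* cost of one cpuid/rdtsc pair      */
    uint64_t loop     = t_end - t2;    /* timed loop plus the same overhead */
    int64_t  net      = (int64_t)(loop - overhead);
    return (double)net / ((double)runs * (double)unroll);
}
```

For example, with made-up timestamps 1000, 1200 and 751400 and the listing's Runs = 10000 and LoopUnroll = 10, the overhead is 200 cycles, the net time is 750000 cycles, and the function returns 7.5.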
tthsqe 15 Jan 2010, 06:21
windwakr, your loop is too small - if LoopUnroll is set to 1, the timing goes up to 8.9 on mine.
EDIT: your code gives anywhere from 8.9 to 9.5 on mine.
windwakr 15 Jan 2010, 07:25
Ya, as I was dozing off to sleep I saw all the errors in my code. Had to get up and try to fix them. Try this:
Code:
format PE GUI 4.0
entry start

include 'win32a.inc'

section '.text' code readable executable

start:
    invoke GetCurrentProcess
    invoke SetPriorityClass,eax,HIGH_PRIORITY_CLASS
    invoke GetCurrentThread
    push eax
    invoke SetThreadAffinityMask,eax,1
    pop eax
    invoke SetThreadPriority,eax,THREAD_PRIORITY_HIGHEST

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;Measure the overhead of the timing instructions....lifted from the intel doc with some modifications
    cpuid
    rdtsc
    mov [subtime1], eax
    mov [subtime2], edx
    mov ecx,1000
.2:
    dec ecx
    jnz .2
    cpuid
    rdtsc
    sub eax, [subtime1]
    sub edx, [subtime2]
    mov [subtime1], eax
    mov [subtime2], edx

    cpuid
    rdtsc
    mov [subtime1], eax
    mov [subtime2], edx
    mov ecx,1000
.3:
    dec ecx
    jnz .3
    cpuid
    rdtsc
    sub eax, [subtime1]
    sub edx, [subtime2]
    mov [subtime1], eax
    mov [subtime2], edx

    cpuid
    rdtsc
    mov [subtime1], eax
    mov [subtime2], edx
    mov ecx,1000
.4:
    dec ecx
    jnz .4
    cpuid
    rdtsc
    sub eax, [subtime1]
    sub edx, [subtime2]
    mov [subtime1], eax
    mov [subtime2], edx
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

    mov ecx,1
    jmp .1
.0:
    dec [pass1]
    mov ebx,0
    cpuid
    rdtsc
    mov [time],eax
    mov [time2],edx
    mov ecx,1000
.1:
    mul edx
    mul edx
    mul edx
    mul edx
    mul edx
    mul edx
    mul edx
    mul edx
    mul edx
    mul edx
    dec ecx
    jnz .1
    cpuid
    rdtsc
    sub eax,[time]
    sub edx,[time2]
    sub eax,[subtime1]
    sub edx,[subtime2]
    mov [time],eax
    mov [time2],edx
    cmp [pass1],0
    ja .0

    fild qword [time]
    fidiv [tenthou]
    fstp qword[esp]
    push fmt
    push buf
    call [sprintf]
    invoke MessageBox,NULL,buf,title,MB_OK
    invoke ExitProcess,0

section '.data' data readable writeable

pass1 dd 10000
subtime1 dd 0
subtime2 dd 0
outpt dd 0
tenthou dd 10000
time dd 0
time2 dd 0
title db '...',0
fmt db '~%f clocks',0
buf rb 256

section '.idata' import data readable writeable

library kernel32,'KERNEL32.DLL',\
        user32,'USER32.DLL',\
        msvcrt,'MSVCRT.DLL'

include 'api\kernel32.inc'
include 'api\user32.inc'

import msvcrt,sprintf,'sprintf'

Gives me a smaller number than yours does, by about 0.3 clocks... BUT that could just be another lack-of-sleep-induced error in my code.
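One detail worth checking in the listing above: the high halves of the timestamps are subtracted with `sub` rather than `sbb`, so a borrow out of the low dword is dropped. The C sketch below models both variants on 32-bit halves. The helper names are made up for this illustration, and it says nothing about how often the borrow actually occurs in a given run.

```c
#include <assert.h>
#include <stdint.h>

/* Models "sub eax,[lo]" followed by "sbb edx,[hi]":
   the borrow from the low half is propagated into the high half. */
uint64_t diff_with_borrow(uint32_t end_lo, uint32_t end_hi,
                          uint32_t start_lo, uint32_t start_hi)
{
    uint32_t lo     = end_lo - start_lo;
    uint32_t borrow = end_lo < start_lo;
    uint32_t hi     = end_hi - start_hi - borrow;
    return ((uint64_t)hi << 32) | lo;
}

/* Models "sub eax,[lo]" followed by "sub edx,[hi]":
   the borrow from the low half is ignored. */
uint64_t diff_without_borrow(uint32_t end_lo, uint32_t end_hi,
                             uint32_t start_lo, uint32_t start_hi)
{
    uint32_t lo = end_lo - start_lo;
    uint32_t hi = end_hi - start_hi;
    return ((uint64_t)hi << 32) | lo;
}

/* Worked example: start = 0x1FFFFFFF0, end = 0x200000010,
   so the true difference is 0x20. */
void borrow_example(void)
{
    assert(diff_with_borrow(0x10u, 2u, 0xFFFFFFF0u, 1u) == 0x20u);
    assert(diff_without_borrow(0x10u, 2u, 0xFFFFFFF0u, 1u) == 0x100000020ull);
}
```

The two variants only disagree when the low dword of the counter wraps between the two reads, and in that case the high dword of the `sub`/`sub` result is off by one.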
bitRAKE 15 Jan 2010, 07:53
Try using MUL EAX/ECX instead of MUL EDX - which forces every MUL to wait on the result of the upper DWORD. All tests are useful. Many processors have the result in EAX earlier than EDX is ready.
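A rough way to picture this suggestion is the idealized C model below (not from the thread). It assumes hypothetical, separate latencies for the low and high halves of the result and that nothing else limits the loop. The struct, the function names and any numbers fed into it are assumptions for illustration, not measured values.

```c
#include <assert.h>

typedef struct {
    unsigned lo_latency;   /* cycles until EAX (low half) is ready  */
    unsigned hi_latency;   /* cycles until EDX (high half) is ready */
} mul_timing;

/* MUL EDX reads both EAX (implicit operand) and EDX, so the next MUL
   can only start once the slower of the two halves is available. */
unsigned long chain_via_edx(unsigned long n_muls, mul_timing t)
{
    assert(t.lo_latency > 0 && t.hi_latency > 0);
    unsigned step = t.lo_latency > t.hi_latency ? t.lo_latency : t.hi_latency;
    return n_muls * step;
}

/* MUL EAX or MUL ECX only reads EAX from the previous MUL (ECX is never
   written by MUL), so only the low half is on the critical path. */
unsigned long chain_via_eax(unsigned long n_muls, mul_timing t)
{
    assert(t.lo_latency > 0 && t.hi_latency > 0);
    return n_muls * t.lo_latency;
}
```

With made-up latencies of 3 cycles for the low half and 4 for the high half, ten chained `mul edx` instructions would take 40 cycles in this model, while ten chained `mul eax` or `mul ecx` instructions would take 30.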
LocoDelAssembly 15 Jan 2010, 07:56
Well, based on previous windwakr code, I hope I'm not carrying critical errors
Code:
format PE GUI 4.0
entry start

include 'win32ax.inc'

Runs = 1
LoopUnroll equ 1
HIGH_PRIORITY_CLASS = 0x080
WARMUP_TIME = 1000
SAMPLES = 3 ; IF YOU NEED HIGHER THAN 8 YOU'LL NEED TO EDIT THE CODE TOO
ALIGNMENT = 4

t1.l equ ebp
t1.h equ esi
t2.l equ edi
t2.h equ ebx

section '.text' code readable executable

start:
    invoke GetCurrentProcess
    mov ebx, eax
    invoke SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
    invoke GetCurrentThread
    invoke SetThreadPriority, eax, THREAD_PRIORITY_TIME_CRITICAL
    invoke GetProcessAffinityMask, ebx, addr ProcessAffinityMask, addr SystemAffinityMask
    test eax, eax
    jz .begin
    bsf ecx, [ProcessAffinityMask]
    shl eax, cl
    invoke SetProcessAffinityMask, ebx, eax

.begin:
    invoke GetTickCount
    mov ebx, eax
    add ebx, 1000
    sub esp, 4*2
.warmup:
    invoke GetTickCount
    sub eax, ebx
    jc .warmup

    invoke Sleep, 30 ; An attempt to begin with a fresh time slice. I hope Windows won't take the opportunity to downgrade processor's speed for this little pause.

.begin_test:
repeat 2
  if % and 1
    virtual at $
  else
    while ($+LENGTH) mod ALIGNMENT ; VERY STUPID I KNOW, MY BRAIN ISN'T WORKING MUCH REALLY...
      nop
    end while
  end if

    cpuid
    rdtsc

    xor eax, eax
    cpuid
    rdtsc
    mov t1.l,eax
    mov t1.h,edx

    xor eax, eax
    cpuid
    rdtsc
    mov t2.l,eax
    push edx

    mov eax,0xf9d8e41
    mov edx,0x59db234
    mov ecx,Runs

  if % and 1
    LENGTH = $-$$
    end virtual
  end if
end repeat

.1: ; Aligned to ALIGNMENT bytes
    rept LoopUnroll {mul edx}
    dec ecx
    jnz .1

    xor eax, eax
    cpuid
    rdtsc
    pop t2.h
    sub eax,t2.l
    sbb edx,t2.h
    sub t1.l,t2.l
    sbb t1.h,t2.h
    add eax,t1.l
    adc edx,t1.h

    mov dword[esp-4],Runs*LoopUnroll
    mov [esp],eax
    mov [esp+4],edx
    fild qword[esp]
    fidiv dword[esp-4]

    dec [samples]
    jnz .begin_test

rept 2
{
    fcomi st, st1
    ja @f
    fxch st1
@@:
    fstp st0
}
    fstp qword[esp]
    push messageformat
    push message
    call [sprintf]
    add esp, 4*4 + 4*2
    invoke MessageBox,0,message,caption,0
    invoke ExitProcess,0
    ret

section '.data' data readable writeable

message rb 256
messageformat db '%f',0
caption db 'results',0

align 4
samples dd SAMPLES
ProcessAffinityMask dd ?
SystemAffinityMask dd ?

section '.idata' import data readable writeable

library kernel,'KERNEL32.DLL',\
        user,'USER32.DLL',\
        msvcrt,'MSVCRT.DLL'

import kernel,\
       ExitProcess,'ExitProcess',\
       GetCurrentThread, 'GetCurrentThread',\
       GetCurrentProcess,'GetCurrentProcess',\
       GetTickCount, 'GetTickCount',\
       GetProcessAffinityMask, 'GetProcessAffinityMask',\
       SetProcessAffinityMask, 'SetProcessAffinityMask',\
       SetPriorityClass, 'SetPriorityClass',\
       SetThreadPriority, 'SetThreadPriority',\
       Sleep, 'Sleep'

import user,MessageBox,'MessageBoxA'
import msvcrt,sprintf,'sprintf'

I've run that more than 10 times on my XP and every time it gave me 3.000000. BTW, when I change LoopUnroll from 3 to N it starts to take 2 cycles and then grows from there. There is something wrong, but it is almost 5 AM here so I'll let others figure that out.
One more note: in the tests don't make very long loops. You should try to ensure the code runs within a single time slice (or quantum) and also avoid interrupts as much as possible while executing, because otherwise your results will be contaminated (it is even questionable whether setting high priority is really needed here).
bitRAKE 15 Jan 2010, 10:06
Hm...I either get 7.000 or 0.000 through several runs.
Yet, my own code returns the following:
MUL EDX 4.4890
MUL RDX 6.5664
It's tricky timing a single instruction. I can fill the bytes following the MUL with NOPs (up to eight bytes total) without affecting the timing. The numbers seem good, as Agner Fog lists the latency for my processor at 5 and 7 cycles for 32-bit and 64-bit, respectively.
bitshifter 15 Jan 2010, 10:37
It's better to count uops than to use rdtsc.
Agner Fog gives a few versions of each tester. The rdtsc will give funny results because of the processor's multiplier factor, so use the PMC version instead of the TSC version. If it is configured correctly for your chip, then everyone's numbers match up. And don't forget that priming the cache and ignoring the first pass is good. There is a sample test in his PMC tester that everyone can baseline from.
LocoDelAssembly 15 Jan 2010, 20:24
bitshifter, but counting uops will give the correct latency? Perhaps the instruction issues several uops distributed to several ports so the latency could be shorter in that case?
This is my latest code based on code from here and bitRAKE's link:
Code:
format PE GUI 4.0
entry start

include 'win32ax.inc'

HIGH_PRIORITY_CLASS = 0x080
WARMUP_TIME = 1000
SAMPLES = 7
ALIGNMENT = 64

macro prolog
{
  local ..begin, ..end
  while ($+..end-..begin) mod ALIGNMENT
    nop
  end while
..begin:
    xor eax, eax
    cpuid
    rdtsc
    mov [tsc_lo], eax
    mov [tsc_hi], edx
    xor eax, eax
    cpuid
..end:
}

macro epilog
{
    xor eax, eax
    cpuid
    rdtsc
    sub eax, [tsc_lo]
    sbb edx, [tsc_hi]
}

macro empty
{
    xor ecx, edx
}

; Please also try this (the program reports -1 here):
;macro empty
;{
;  push edx
;  pop edx
;}

; And this one too (the program reports 1 cycle here)
;macro empty
;{
;}

macro mult
{
  rept 1 ; Tried several numbers but it always reported 3*n cycles (provided the XOR ECX, EDX variant of "empty" macro is called)
  \{
    mul edx
  \}
  empty ; Removing this makes the code not take multiples of 3 cycles
}

section '.text' code readable executable

start:
    invoke GetCurrentProcess
    mov ebx, eax
    invoke SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
    invoke GetCurrentThread
    invoke SetThreadPriority, eax, THREAD_PRIORITY_TIME_CRITICAL
    invoke GetProcessAffinityMask, ebx, addr ProcessAffinityMask, addr SystemAffinityMask
    test eax, eax
    jz .begin
    bsf ecx, [ProcessAffinityMask]
    shl eax, cl
    invoke SetProcessAffinityMask, ebx, eax

.begin:
    invoke GetTickCount
    mov ebx, eax
    add ebx, 1000
.warmup:
    invoke GetTickCount
    sub eax, ebx
    jc .warmup

irp algo, empty, mult
{
  local .begin_test, .find_min, .next

    mov esi, -SAMPLES
    align ALIGNMENT
.begin_test:
    prolog
    algo
    epilog
    mov dword [timings+SAMPLES*8+esi*8], eax
    mov dword [timings+SAMPLES*8+esi*8+4], edx
    inc esi
    jnz .begin_test

    mov esi, -SAMPLES
    mov ebx, dword [timings]
    mov ecx, dword [timings+4]
    jmp .next
.find_min:
    mov eax, ebx
    mov edx, ecx
    sub eax, dword [timings+SAMPLES*8+esi*8]
    sbb edx, dword [timings+SAMPLES*8+esi*8+4]
    jc .next
    mov ebx, dword [timings+SAMPLES*8+esi*8]
    mov ecx, dword [timings+SAMPLES*8+esi*8+4]
.next:
    inc esi
    jnz .find_min

    push ecx
    push ebx
}

    fild qword[esp]
    fild qword[esp+8]
    fsubp
    fstp qword[esp]
    push messageformat
    push message
    call [sprintf]
    add esp, 4*2 + 8*2
    invoke MessageBox,0,message,caption,0
    invoke ExitProcess,0

section '.data' data readable writeable

messageformat db '%f',0
caption db 'results',0

align 8
tsc_lo dd ?
tsc_hi dd ?
timings rq SAMPLES
ProcessAffinityMask dd ?
SystemAffinityMask dd ?
message rb 256

section '.idata' import data readable writeable

library kernel,'KERNEL32.DLL',\
        user,'USER32.DLL',\
        msvcrt,'MSVCRT.DLL'

import kernel,\
       ExitProcess,'ExitProcess',\
       GetCurrentThread, 'GetCurrentThread',\
       GetCurrentProcess,'GetCurrentProcess',\
       GetTickCount, 'GetTickCount',\
       GetProcessAffinityMask, 'GetProcessAffinityMask',\
       SetProcessAffinityMask, 'SetProcessAffinityMask',\
       SetPriorityClass, 'SetPriorityClass',\
       SetThreadPriority, 'SetThreadPriority',\
       Sleep, 'Sleep'

import user,MessageBox,'MessageBoxA'
import msvcrt,sprintf,'sprintf'

This always gives me 3 cycles on my Athlon64 Venice with WinXP SP3 (tested more than 20 times for sure). Check the comments for extra info. Clearly enough, instruction latency is very hard to measure properly.
Perhaps processors with RDTSCP can get better measurements? (This one waits for all previous instructions to execute before reading the counter, although it is not a fully serializing instruction.)
[edit]On an old Athlon 750 MHz SlotA it takes 4 cycles, but I don't know the official timing so I'm not sure if it was measured correctly.[/edit]
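The bookkeeping in the listing above (take several samples of the "empty" sequence and of the "mult" sequence, keep the minimum of each, and report the difference) can be sketched in C like this. The function names are hypothetical and the sample values in the usage note are invented; only the min-then-subtract logic mirrors the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest value in a non-empty array of cycle counts. */
uint64_t min_sample(const uint64_t *samples, size_t count)
{
    assert(count > 0);
    uint64_t best = samples[0];
    for (size_t i = 1; i < count; i++) {
        if (samples[i] < best)
            best = samples[i];
    }
    return best;
}

/* Minimum of the "mult" timings minus minimum of the "empty" timings.
   The result is signed because the baseline can come out larger than
   the measured sequence. */
int64_t net_cycles(const uint64_t *empty, const uint64_t *mult, size_t count)
{
    return (int64_t)min_sample(mult, count) - (int64_t)min_sample(empty, count);
}
```

For instance, invented "empty" samples of 212, 205 and 230 cycles and "mult" samples of 215, 208 and 300 cycles give minimums of 205 and 208, so the reported figure is 3. Taking the minimum discards samples inflated by interrupts or cache misses, but it cannot remove a systematic bias in the baseline, which may be why the alternative "empty" macros report values such as -1.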
r22 15 Jan 2010, 20:58
LocoDelAssembly wrote:
bitshifter, but counting uops will give the correct latency? Perhaps the instruction issues several uops distributed to several ports so the latency could be shorter in that case?
I get 0.00000 or 7.00000 (mostly 0) on an Intel Core2 Q8200
bitshifter 15 Jan 2010, 21:02
The PMC tester has to be configured for your processor.
I tried your tester and get 10.000000 on a 2.8 Celeron. It doesn't mean that it's coded wrong, just that it's not a good measurement.
LocoDelAssembly 15 Jan 2010, 21:24
Tried on an AMD PhenomII 3.2 GHz and I got 0 and 2 (mostly 2). The code is shit, but I won't invest more time in this.
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.