flat assembler
Message board for the users of flat assembler.
Inagawa 02 May 2012, 21:27
This test is no longer relevant. I simply wanted to know how many of you are on post-Nehalem architectures. In the end I decided to choose compatibility over comfort (the finished macro here)
A big thank you to all who helped me!
Last edited by Inagawa on 03 May 2012, 15:12; edited 3 times in total
|||
Enko 02 May 2012, 21:48
CPU not supported. AMD Athlon 1600+
|||
Inagawa 02 May 2012, 21:55
Enko: Anyone with a CPU older than about 3 years (a rough guess) won't be able to run this.
typedef: Don't I seem like a trustworthy guy? Jokes aside, I will release the source code once I gather enough information from the program, though I can swear by my good conscience there isn't anything even remotely malicious about the code; it is simply there to help me tweak the final version.
Edit: Seems you need at least a Nehalem arch. to run this.
|||
r22 02 May 2012, 22:30
Code:
-=== 1048575 repetitions x 20 runs ===-
====> Run 001 = 1283 cycles
====> Run 002 = 1283 cycles
====> Run 003 = 1283 cycles
====> Run 004 = 1283 cycles
====> Run 005 = 1283 cycles
====> Run 006 = 1283 cycles
====> Run 007 = 1283 cycles
====> Run 008 = 1283 cycles
====> Run 009 = 1283 cycles
====> Run 010 = 1283 cycles
====> Run 011 = 1283 cycles
====> Run 012 = 1283 cycles
====> Run 013 = 1283 cycles
====> Run 014 = 1283 cycles
====> Run 015 = 1283 cycles
====> Run 016 = 1283 cycles
====> Run 017 = 1283 cycles
====> Run 018 = 1283 cycles
====> Run 019 = 1283 cycles
====> Run 020 = 1283 cycles
====> Average: 1283 cycles

AMD Phenom II X6 1055T 2.8 GHz
|||
LocoDelAssembly 02 May 2012, 23:27
[edit]WARNING: The disassembly comes from an older version of the program.[/edit]
Code:
.data:00401000 ;
.data:00401000 ; +-------------------------------------------------------------------------+
.data:00401000 ; ¦ This file is generated by The Interactive Disassembler (IDA) ¦
.data:00401000 ; ¦ Copyright (c) 2010 by Hex-Rays SA, <support@hex-rays.com> ¦
.data:00401000 ; ¦ Licensed to: Freeware version ¦
.data:00401000 ; +-------------------------------------------------------------------------+
.data:00401000 ;
.data:00401000 ; Input MD5 : F48ABD8AB8116CA38179EDF1D28B2F71
.data:00401000
.data:00401000 ; File Name : C:\Users\Hernan\AppData\Local\Temp\Rar$DR54.080\PC.EXE
.data:00401000 ; Format : Portable executable for 80386 (PE)
.data:00401000 ; Imagebase : 400000
.data:00401000 ; Section 1. (virtual address 00001000)
.data:00401000 ; Virtual size : 00000020 ( 32.)
.data:00401000 ; Section size in file : 00000200 ( 512.)
.data:00401000 ; Offset to raw data for section: 00000400
.data:00401000 ; Flags C0000040: Data Readable Writable
.data:00401000 ; Alignment : default
.data:00401000
.data:00401000 Ideal
.data:00401000 p686
.data:00401000 pmmx
.data:00401000 model flat
.data:00401000
.data:00401000 ; ---------------------------------------------------------------------------
.data:00401000
.data:00401000 ; Segment type: Pure data
.data:00401000 ; Segment permissions: Read/Write
.data:00401000 segment _data para public 'DATA' use32
.data:00401000 assume cs:_data
.data:00401000 ;org 401000h
.data:00401000 dword_401000 dd 0 ; DATA XREF: .text:004030B2w
.data:00401000 ; .text:00403187r ...
.data:00401004 dword_401004 dd 0 ; DATA XREF: .text:004030BCw
.data:00401004 ; .text:0040317Fr ...
.data:00401008 dword_401008 dd 0 ; DATA XREF: .text:004030C6w
.data:00401008 ; .text:00403121r ...
.data:0040100C dword_40100C dd 0 ; DATA XREF: .text:loc_4030D0w
.data:0040100C ; .text:00403119r ...
.data:00401010 db 0
.data:00401011 db 0
.data:00401012 db 0
.data:00401013 db 0
.data:00401014 dword_401014 dd 0 ; DATA XREF: .text:004031B5w
.data:00401014 ; .text:004031BBw ...
.data:00401018 dword_401018 dd 0 ; DATA XREF: .text:loc_403248r
.data:0040101C dword_40101C dd 0 ; DATA XREF: .text:00403090w
.data:0040101C ; sub_4031F1+Br
.data:00401020 align 200h
.data:00401020 ends _data
.data:00401020
.data:00402000 ; Section 2. (virtual address 00002000)
.data:00402000 ; Virtual size : 00000001 ( 1.)
.data:00402000 ; Section size in file : 00000200 ( 512.)
.data:00402000 ; Offset to raw data for section: 00000600
.data:00402000 ; Flags C0000040: Data Readable Writable
.data:00402000 ; Alignment : default
.data:00402000 ; ---------------------------------------------------------------------------
.data:00402000
.data:00402000 ; Segment type: Pure data
.data:00402000 ; Segment permissions: Read/Write
.data:00402000 segment _data para public 'DATA' use32
.data:00402000 assume cs:_data
.data:00402000 ;org 402000h
.data:00402000 db 90h, 1FFh dup(0)
.data:00402000 ends _data
.data:00402000
.text:00403000 ; Section 3. (virtual address 00003000)
.text:00403000 ; Virtual size : 000002A0 ( 672.)
.text:00403000 ; Section size in file : 00000400 ( 1024.)
.text:00403000 ; Offset to raw data for section: 00000800
.text:00403000 ; Flags 60000020: Text Executable Readable
.text:00403000 ; Alignment : default
.text:00403000 ; ---------------------------------------------------------------------------
.text:00403000
.text:00403000 ; Segment type: Pure code
.text:00403000 ; Segment permissions: Read/Execute
.text:00403000 segment _text para public 'CODE' use32
.text:00403000 assume cs:_text
.text:00403000 ;org 403000h
.text:00403000 assume es:nothing, ss:nothing, ds:_data, fs:nothing, gs:nothing
.text:00403000
.text:00403000 ; ¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦ S U B R O U T I N E ¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦
.text:00403000
.text:00403000
.text:00403000 public start
.text:00403000 proc start near
.text:00403000 mov eax, 80000001h
.text:00403005 cpuid
.text:00403007 bt edx, 1Bh
.text:0040300B jb short loc_40304D
.text:0040300D call loc_403036
.text:0040300D ; ---------------------------------------------------------------------------
.text:00403012 aSorryYourCpuIs db 'Sorry, your CPU is not supported.',0Dh,0Ah,0
.text:00403036 ; ---------------------------------------------------------------------------
.text:00403036
.text:00403036 loc_403036: ; CODE XREF: start+Dp
.text:00403036 call [ds:printf]
.text:00403036 endp start
.text:00403036
.text:0040303C add esp, 4
.text:0040303F call [ds:getchar]
.text:00403045 push 0
.text:00403047 call [ds:ExitProcess]
.text:0040304D
.text:0040304D loc_40304D: ; CODE XREF: start+Bj
.text:0040304D push eax
.text:0040304E push ecx
.text:0040304F push edx
.text:00403050 push ebx
.text:00403051 push ebp
.text:00403052 push esi
.text:00403053 push edi
.text:00403054 push 14h
.text:00403056 push 0FFFFFh ; dwPriorityClass
.text:0040305B call loc_403087
.text:0040305B ; ---------------------------------------------------------------------------
.text:00403060 aIRepetitionsXI db '-=== %i repetitions x %i runs ===-',0Dh,0Ah
.text:00403060 db 0Dh,0Ah,0
.text:00403087 ; ---------------------------------------------------------------------------
.text:00403087
.text:00403087 loc_403087: ; CODE XREF: .text:0040305Bp
.text:00403087 call [ds:printf]
.text:0040308D add esp, 0Ch
.text:00403090 mov [ds:dword_40101C], 14h
.text:0040309A call [ds:GetCurrentProcess]
.text:004030A0 push 20h
.text:004030A2 push eax ; hProcess
.text:004030A3
.text:004030A3 loc_4030A3:
.text:004030A3 call [ds:SetPriorityClass]
.text:004030A9 xor edi, edi
.text:004030AB push edi
.text:004030AC mov ebx, 1
.text:004030B1 push ebx
.text:004030B2 mov [ds:dword_401000], 0FFFFFFFFh
.text:004030BC mov [ds:dword_401004], 0FFFFFFFFh
.text:004030C6 mov [ds:dword_401008], 0FFFFFFFFh
.text:004030D0
.text:004030D0 loc_4030D0:
.text:004030D0 mov [ds:dword_40100C], 0FFFFFFFFh
.text:004030DA nop
.text:004030DB nop
.text:004030DC nop
.text:004030DD nop
.text:004030DE nop
.text:004030DF nop
.text:004030E0
.text:004030E0 loc_4030E0: ; CODE XREF: sub_4031F1+12j
.text:004030E0 mov esi, 0FFFFFh
.text:004030E5 xor eax, eax
.text:004030E7 cpuid
.text:004030E9 xor eax, eax
.text:004030EB cpuid
.text:004030ED xor eax, eax
.text:004030EF cpuid
.text:004030F1 push 0 ; dwMilliseconds
.text:004030F3 call [ds:Sleep]
.text:004030F9 push esi
.text:004030FA nop
.text:004030FB nop
.text:004030FC nop
.text:004030FD nop
.text:004030FE nop
.text:004030FF nop
.text:00403100
.text:00403100 loc_403100: ; CODE XREF: .text:00403135j
.text:00403100 xor eax, eax
.text:00403102 cpuid
.text:00403104 rdtsc
.text:00403106 push edx
.text:00403107 push eax
.text:00403108 xor eax, eax
.text:0040310A cpuid
.text:0040310C xor eax, eax
.text:0040310E cpuid
.text:00403110 invlpg cl
.text:00403113 pop ecx
.text:00403114 sub eax, ecx
.text:00403116 pop ecx
.text:00403117 sbb edx, ecx
.text:00403119 cmp edx, [ds:dword_40100C]
.text:0040311F jnz short loc_403129
.text:00403121 cmp eax, [ds:dword_401008]
.text:00403127 jnb short loc_403134
.text:00403129
.text:00403129 loc_403129: ; CODE XREF: .text:0040311Fj
.text:00403129 mov [ds:dword_401008], eax
.text:0040312E mov [ds:dword_40100C], edx
.text:00403134
.text:00403134 loc_403134: ; CODE XREF: .text:00403127j
.text:00403134 dec esi
.text:00403135 jnz short loc_403100
.text:00403137 pop esi
.text:00403138 push esi
.text:00403139 push 0 ; dwMilliseconds
.text:0040313B call [ds:Sleep]
.text:00403141 nop
.text:00403142 nop
.text:00403143 nop
.text:00403144 nop
.text:00403145 nop
.text:00403146 nop
.text:00403147 nop
.text:00403148 nop
.text:00403149 nop
.text:0040314A nop
.text:0040314B nop
.text:0040314C nop
.text:0040314D nop
.text:0040314E nop
.text:0040314F nop
.text:00403150
.text:00403150 loc_403150: ; CODE XREF: .text:0040319Bj
.text:00403150 push esi
.text:00403151 push edi
.text:00403152 invlpg cl
.text:00403155 push edx
.text:00403156 push eax
.text:00403157 xor eax, eax
.text:00403159 cpuid
.text:0040315B mov ecx, 12Ch
.text:00403160
.text:00403160 loc_403160: ; CODE XREF: .text:0040316Ej
.text:00403160 mov eax, 0FFFFFFFFh
.text:00403165 mov edx, 0FFFFFFFFh
.text:0040316A rcr eax, 1Fh
.text:0040316D dec ecx
.text:0040316E jnz short loc_403160
.text:00403170 xor eax, eax
.text:00403172 cpuid
.text:00403174 invlpg cl
.text:00403177 pop ecx
.text:00403178 sub eax, ecx
.text:0040317A pop ecx
.text:0040317B sbb edx, ecx
.text:0040317D pop edi
.text:0040317E pop esi
.text:0040317F cmp edx, [ds:dword_401004]
.text:00403185 jnz short loc_40318F
.text:00403187 cmp eax, [ds:dword_401000]
.text:0040318D jnb short loc_40319A
.text:0040318F
.text:0040318F loc_40318F: ; CODE XREF: .text:00403185j
.text:0040318F mov [ds:dword_401000], eax
.text:00403194 mov [ds:dword_401004], edx
.text:0040319A
.text:0040319A loc_40319A: ; CODE XREF: .text:0040318Dj
.text:0040319A dec esi
.text:0040319B jnz short loc_403150
.text:0040319D pop esi
.text:0040319E mov eax, [ds:dword_401000]
.text:004031A3 sub eax, [ds:dword_401008]
.text:004031A9 mov edx, [ds:dword_401004]
.text:004031AF sbb edx, [ds:dword_40100C]
.text:004031B5 add [ds:dword_401014], eax
.text:004031BB adc [ds:dword_401014], edx
.text:004031C1 pop ebx
.text:004031C2 cmp eax, 0
.text:004031C5 jle short loc_4031FB
.text:004031C7 pop edi
.text:004031C8 inc edi
.text:004031C9 push edi
.text:004031CA push eax
.text:004031CB push eax
.text:004031CC push ebx
.text:004031CD push 3
.text:004031CF call sub_4031F1
.text:004031CF ; ---------------------------------------------------------------------------
.text:004031D4 aRun0IICycles db '====> Run %0*i = %i cycles',0Dh,0Ah,0
.text:004031F1
.text:004031F1 ; ¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦ S U B R O U T I N E ¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦
.text:004031F1
.text:004031F1
.text:004031F1 proc sub_4031F1 near ; CODE XREF: .text:004031CFp
.text:004031F1 call [ds:printf]
.text:004031F7 add esp, 10h
.text:004031FA pop eax
.text:004031FB
.text:004031FB loc_4031FB: ; CODE XREF: .text:004031C5j
.text:004031FB inc ebx
.text:004031FC cmp ebx, [ds:dword_40101C]
.text:00403202 push ebx
.text:00403203 jle loc_4030E0
.text:00403209 pop ebx
.text:0040320A pop edi
.text:0040320B test edi, edi
.text:0040320D jnz short loc_403243
.text:0040320F push eax ; char *
.text:00403210 call loc_403237
.text:00403210 ; ---------------------------------------------------------------------------
.text:00403215 aTheCodeWasProb db 'The code was probably too short',0Dh,0Ah,0
.text:00403237 ; ---------------------------------------------------------------------------
.text:00403237
.text:00403237 loc_403237: ; CODE XREF: sub_4031F1+1Fp
.text:00403237 call [ds:printf]
.text:00403237 endp sub_4031F1
.text:00403237
.text:0040323D add esp, 4
.text:00403240 pop eax
.text:00403241 jmp short loc_40327C
.text:00403243 ; ---------------------------------------------------------------------------
.text:00403243
.text:00403243 loc_403243: ; CODE XREF: sub_4031F1+1Cj
.text:00403243 mov eax, [ds:dword_401014]
.text:00403248
.text:00403248 loc_403248:
.text:00403248 mov edx, [ds:dword_401018]
.text:0040324E idiv edi
.text:00403250 push eax ; dwPriorityClass
.text:00403251 call loc_403273
.text:00403251 ; ---------------------------------------------------------------------------
.text:00403256 aAverageICycles db 0Ah
.text:00403256 db 0Dh,'====> Average: %i cycles',0Dh,0Ah,0
.text:00403273 ; ---------------------------------------------------------------------------
.text:00403273
.text:00403273 loc_403273: ; CODE XREF: .text:00403251p
.text:00403273 call [ds:printf]
.text:00403279 add esp, 8
.text:0040327C
.text:0040327C loc_40327C: ; CODE XREF: .text:00403241j
.text:0040327C call [ds:GetCurrentProcess]
.text:00403282 push 20h
.text:00403284 push eax
.text:00403285 call [ds:SetPriorityClass]
.text:0040328B pop edi
.text:0040328C pop esi
.text:0040328D pop ebp
.text:0040328E pop ebx
.text:0040328F pop edx
.text:00403290 pop ecx
.text:00403291 pop eax
.text:00403292 call [ds:getchar]
.text:00403298 push 0
.text:0040329A call [ds:ExitProcess]
.text:0040329A ; ---------------------------------------------------------------------------
.text:004032A0 dd 2 dup(0)
.text:004032A8 dd 6 dup(0)
.text:004032C0 dd 0
.text:004032C4 dd 4Fh dup(0)
.text:004032C4 ends _text
.text:004032C4
.idata:0040406C ;
.idata:0040406C ; Imports from KERNEL32.DLL
.idata:0040406C ;
.idata:0040406C ; Section 4. (virtual address 00004000)
.idata:0040406C ; Virtual size : 000000EC ( 236.)
.idata:0040406C ; Section size in file : 00000200 ( 512.)
.idata:0040406C ; Offset to raw data for section: 00000C00
.idata:0040406C ; Flags C0000040: Data Readable Writable
.idata:0040406C ; Alignment : default
.idata:0040406C ; ---------------------------------------------------------------------------
.idata:0040406C
.idata:0040406C ; Segment type: Externs
.idata:0040406C ; _idata
.idata:0040406C ; void __stdcall ExitProcess(UINT uExitCode)
.idata:0040406C extrn ExitProcess:dword ; DATA XREF: .text:00403047r
.idata:0040406C ; .text:0040329Ar
.idata:00404070 ; HANDLE GetCurrentProcess(void)
.idata:00404070 extrn GetCurrentProcess:dword ; DATA XREF: .text:0040309Ar
.idata:00404070 ; .text:loc_40327Cr
.idata:00404074 ; BOOL __stdcall SetPriorityClass(HANDLE hProcess,DWORD dwPriorityClass)
.idata:00404074 extrn SetPriorityClass:dword ; DATA XREF: .text:loc_4030A3r
.idata:00404074 ; .text:00403285r
.idata:00404078 ; void __stdcall Sleep(DWORD dwMilliseconds)
.idata:00404078 extrn Sleep:dword ; DATA XREF: .text:004030F3r
.idata:00404078 ; .text:0040313Br
.idata:0040407C
.idata:00404080
.idata:004040CC ;
.idata:004040CC ; Imports from MSVCRT.DLL
.idata:004040CC ;
.idata:004040CC ; int printf(const char *,...)
.idata:004040CC extrn printf:dword ; DATA XREF: start:loc_403036r
.idata:004040CC ; .text:loc_403087r ...
.idata:004040D0 ; int getchar(void)
.idata:004040D0 extrn getchar:dword ; DATA XREF: .text:0040303Fr
.idata:004040D0 ; .text:00403292r
.idata:004040D4
.idata:004040D4
.idata:004040D4

My results:
Code:
-=== 1048575 repetitions x 20 runs ===-
====> Run 001 = 2120 cycles
====> Run 002 = 2120 cycles
====> Run 003 = 2120 cycles
====> Run 004 = 2120 cycles
====> Run 005 = 2120 cycles
====> Run 006 = 2120 cycles
====> Run 007 = 2120 cycles
====> Run 008 = 2120 cycles
====> Run 009 = 2120 cycles
====> Run 010 = 2120 cycles
====> Run 011 = 2120 cycles
====> Run 012 = 2120 cycles
====> Run 013 = 2120 cycles
====> Run 014 = 2120 cycles
====> Run 015 = 2120 cycles
====> Run 016 = 2120 cycles
====> Run 017 = 2120 cycles
====> Run 018 = 2120 cycles
====> Run 019 = 2120 cycles
====> Run 020 = 2120 cycles
====> Average: 2120 cycles

Last edited by LocoDelAssembly on 03 May 2012, 00:09; edited 1 time in total
|||
Inagawa 02 May 2012, 23:43
Thanks a lot for the runs. I have updated the version so it gives me more info and I don't have to bother you with constant tweaks. (It's in the first post.)
|||
typedef 03 May 2012, 01:32
Code:
-=== 1048575 repetitions x 20 runs ===-
====> Run 001 = 5 cycles
====> Run 002 = 5 cycles
====> Run 003 = 5 cycles
====> Run 004 = 5 cycles
====> Run 005 = 5 cycles
====> Run 006 = 5 cycles
====> Run 007 = 5 cycles
====> Run 008 = 5 cycles
====> Run 009 = 5 cycles
====> Run 010 = 5 cycles
====> Run 011 = 5 cycles
====> Run 012 = 5 cycles
====> Run 013 = 5 cycles
====> Run 014 = 5 cycles
====> Run 015 = 5 cycles
====> Run 016 = 5 cycles
====> Run 017 = 5 cycles
====> Run 018 = 5 cycles
====> Run 019 = 5 cycles
====> Run 020 = 5 cycles
====> Average: 5 cycles

HPE h8z series
AMD FX-8150 eight-core processor [3.6GHz, 8MB L2/8MB L3 Cache]
16GB DDR3-1333MHz SDRAM [4 DIMMs]
120TB SATA SSD RAID 0 (2 x 160GB HDD)
600W Power supply
3GB AMD Radeon HD 7950 [Dual Bracket, DVI, HDMI, 2x mini-DP]
Blu-ray player/writer & SuperMulti DVD burner
Win7[64Bit]
15-in-1 memory card reader, 4 USB 2.0 (front), 2 USB 3.0 (top)
|||
typedef 03 May 2012, 01:33
|||
Enko 03 May 2012, 02:52
CPU not supported. This time an Intel T4500.
|||
revolution 03 May 2012, 03:42
LocoDelAssembly wrote:
|||
Inagawa 03 May 2012, 05:37
Of course I am using it. I'll post the code
Code:
;=============================================================================
;=== PERFORMANCE COUNTER MACRO ===============================================
;=============================================================================
section '.data' data readable writeable
;
; Optimal alignment for a QWORD
;
align 8
__Count            dq 0
__Overhead         dq 0
__LoopCounter      dd 0
__Average          dq 0
__AverageLoopCount dd 0

macro StartPerformanceCounter InnerLoopCount = 0xFFFFF, AverageLoopCount = 20, ProcessPriority = 32
{
    local AverageLoop, OverheadLoop, WorkingLoop, Lower, Higher, RDTSCP_Compliant
    ;==============================================
    ;=== INITIALIZE ===============================
    ;
    ; Preserve the registers
    ;
    push eax ecx edx ebx ebp esi edi
    ;
    ; First, check if the RDTSCP instruction is supported,
    ; exit if it's not.
    ;
    mov eax, 0x80000001
    cpuid
    bt edx, 0x1B
    jc RDTSCP_Compliant
    cinvoke printf, <'Sorry, your CPU is not supported.', 13, 10>
    cinvoke getchar
    invoke ExitProcess, 0
  RDTSCP_Compliant:
    ;
    ; Output the info string
    ;
    cinvoke printf, <'-=== %i repetitions x %i runs ===-', 13, 10, 13, 10>, InnerLoopCount, AverageLoopCount
    ;
    ; Initialize the AverageLoopCount
    ;
    mov [__AverageLoopCount], AverageLoopCount
    ;
    ; Set the process priority
    ;
    invoke GetCurrentProcess
    invoke SetPriorityClass, eax, ProcessPriority
    ;
    ; Initialize the successful loop counter
    ;
    xor edi, edi
    push edi
    ;
    ; Initialize the AverageLoop counter
    ;
    mov ebx, 1
    push ebx
    ;
    ; Initialize the working variables.
    ; __Count and __Overhead start at the maximum so the first
    ; sample is always lower; __Average is a running sum and
    ; therefore starts at zero.
    ;
    mov DWORD [__Count], -1
    mov DWORD [__Count+4], -1
    mov DWORD [__Overhead], -1
    mov DWORD [__Overhead+4], -1
    mov DWORD [__Average], 0
    mov DWORD [__Average+4], 0
    ;===============================================
    ;=== THE AVERAGE LOOP ===========================
    ;
    ; This loop repeats the whole process of calculating
    ; an overhead and getting the cycle results.
    ; It reports the cycle-count after each run and
    ; the Average of all runs at the end.
    ;
    ; Align 16 is recommended for P6+
    ;
    align 16
  AverageLoop:
    ;
    ; Address visible outside of this macro.
    ;
    __AverageLoop equ AverageLoop
    ;
    ; Initialize ESI (inner loop count)
    ;
    mov esi, InnerLoopCount
    ;
    ; Intel suggests warming up the CPUID
    ;
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    ;
    ; Start a new time slice for the overhead run
    ;
    invoke Sleep, 0
    ;
    ; Save the inner loop count
    ;
    push esi
    ;===============================================
    ;=== THE OVERHEAD LOOP =========================
    ;
    ; This loop measures the overhead to be subtracted
    ; from the final cycle count.
    ;
    ; Align 16 is recommended for P6+
    ;
    align 16
  OverheadLoop:
    ;
    ; Serialize
    ; Read the TimeStampCounter
    ;
    xor eax, eax
    cpuid
    rdtsc
    ;
    ; Save the HO 32 bits of starting count
    ; Save the LO 32 bits of starting count
    ;
    push edx
    push eax
    ;
    ; Force the instructions to finish
    ;
    xor eax, eax
    cpuid
    ;
    ; Call RDTSCP again
    ;
    xor eax, eax
    cpuid
    rdtscp
    ;
    ; Restore the LO 32 bits of starting count
    ; Subtract the LO bits
    ;
    pop ecx
    sub eax, ecx
    ;
    ; Restore the HO 32 bits of starting count
    ; Subtract with carry the HO bits
    ;
    pop ecx
    sbb edx, ecx
    ;
    ; Check if this loop has lower cycle count
    ;
    cmp edx, DWORD [__Overhead+4]
    jne Lower
    cmp eax, DWORD [__Overhead]
    jnb Higher
  Lower:
    ;
    ; Save the lowest cycle count
    ;
    mov DWORD [__Overhead], eax
    mov DWORD [__Overhead+4], edx
  Higher:
    ;
    ; Repeat until ESI (inner loop count) is at zero
    ;
    dec esi
    jnz OverheadLoop
    ;=== THE OVERHEAD LOOP END =====================
    ;===============================================
    ;
    ; Reinitialize ESI (inner loop count) for the next run
    ;
    pop esi
    push esi
    ;
    ; Start a new time slice for the working loop
    ;
    invoke Sleep, 0
    ;===============================================
    ;=== THE WORKING LOOP ==========================
    ;
    ; This loop counts the cycles between the Start
    ; and End macro
    ;
    ; Align 16 is recommended for P6+
    ;
    align 16
  WorkingLoop:
    ;
    ; Address visible outside of this macro.
    ;
    __WorkingLoop equ WorkingLoop
    ;
    ; Preserve the registers
    ; They have to be stored before the call
    ; to RDTSCP, to avoid influencing the timing
    ;
    push esi edi
    ;
    ; Call RDTSCP again
    ;
    rdtscp
    ;
    ; Save the HO 32 bits of starting count
    ; Save the LO 32 bits of starting count
    ;
    push edx
    push eax
    ;
    ; Force the instructions to finish
    ;
    xor eax, eax
    cpuid
}

macro EndPerformanceCounter
{
    local Lower, Higher, Exit, UnsuccessfulLoop, OutputAverage
    ;
    ; Call RDTSCP again
    ;
    xor eax, eax
    cpuid
    rdtscp
    ;
    ; Restore the LO 32 bits of starting count
    ; Subtract the LO bits
    ;
    pop ecx
    sub eax, ecx
    ;
    ; Restore the HO 32 bits of starting count
    ; Subtract with carry the HO bits
    ;
    pop ecx
    sbb edx, ecx
    ;
    ; Return the original registers.
    ; They have to be popped here to avoid
    ; messing up the RDTSCP code
    ;
    pop edi esi
    ;
    ; Check if this loop has lower cycle count
    ;
    cmp edx, DWORD [__Count+4]
    jne Lower
    cmp eax, DWORD [__Count]
    jnb Higher
  Lower:
    ;
    ; Save the lowest cycle count
    ;
    mov DWORD [__Count], eax
    mov DWORD [__Count+4], edx
  Higher:
    ;
    ; Repeat until ESI (inner loop count) is at zero
    ;
    dec esi
    jnz __WorkingLoop
    ;=== THE WORKING LOOP END ======================
    ;===============================================
    ;
    ; Reinitialize ESI (inner loop count) for the next run
    ;
    pop esi
    ;
    ; Update the results.
    ;
    mov eax, DWORD [__Count]
    sub eax, DWORD [__Overhead]
    mov edx, DWORD [__Count+4]
    sbb edx, DWORD [__Overhead+4]
    ;
    ; Add to the Average (64-bit add: the carry from the
    ; low dword goes into the high dword)
    ;
    add DWORD [__Average], eax
    adc DWORD [__Average+4], edx
    ;
    ; Print the result, unless it is zero or negative
    ;
    pop ebx
    cmp eax, 0
    jle UnsuccessfulLoop
    pop edi
    inc edi
    push edi
    push eax
    cinvoke printf, <'====> Pass %0*i = %i cycles', 13, 10>, 3, ebx, eax
    pop eax
    ;
    ; Some code sequences are too short to be properly
    ; measured. In that case the loop is skipped entirely.
    ;
  UnsuccessfulLoop:
    ;
    ; Repeat the AverageLoop until EBX reaches the max limit
    ;
    inc ebx
    cmp ebx, [__AverageLoopCount]
    push ebx
    jng __AverageLoop
    ;=== THE AVERAGE LOOP END =======================
    ;===============================================
    ;
    ; Restore EBX and EDI
    ;
    pop ebx
    pop edi
    ;
    ; If EDI (successful loop count) is zero, there
    ; is no point in computing an average
    ;
    ; In such case, output an error message
    ;
    test edi, edi
    jnz OutputAverage
    push eax
    cinvoke printf, <'The code was probably too short', 13, 10>
    pop eax
    jmp Exit
    ;
    ; Compute the Average of all runs. (Average is the sum
    ; of all the successful runs divided by the number of runs (EDI))
    ;
  OutputAverage:
    ;
    ; Compute the Average and output the result
    ;
    mov eax, DWORD [__Average]
    mov edx, DWORD [__Average+4]
    idiv edi
    cinvoke printf, <10, 13, '====> Average: %i cycles', 13, 10, 13, 10>, eax
    ;
    ; Exit the macro prematurely. It is probably because the
    ; tested code was too short, or the run just didn't run
    ; "right". Restart the macro several times and you will
    ; get results
    ;
  Exit:
    ;
    ; Restore the process priority
    ;
    invoke GetCurrentProcess
    invoke SetPriorityClass, eax, NORMAL_PRIORITY_CLASS
    ;
    ; Return the original registers
    ;
    pop edi esi ebp ebx edx ecx eax
}

You simply call it like so
Code:
StartPerformanceCounter
;You can also specify the number of reps, how many passes (runs) there are and what process priority this test is using.
;StartPerformanceCounter 155550, 50, REALTIME_PRIORITY_CLASS
;StartPerformanceCounter,,REALTIME_PRIORITY_CLASS ; Watch the commas!
mov ecx, 300 ; The code to be timed
@@:
mov eax, -1
mov edx, -1
rcr eax, 31
dec ecx
jnz @B
EndPerformanceCounter

I have been working long and hard on this macro. I used the basic structure of a MASM32 version of a much simpler PerfCounter. There is no register contamination between the macros, so you shouldn't be able to break it by doing anything inside it.
I hope this will be useful to someone for timing their code. It would also be sweet if anyone with knowledge of RDTSCP checked the code. Please keep in mind that I'm a rookie, before you try to bash my head in for possibly coding something wrong.
Last edited by Inagawa on 03 May 2012, 06:55; edited 1 time in total
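The bookkeeping that the macro does around the timed code (keep the lowest sample, subtract the measured overhead, average only the runs that gave a positive result) may be easier to follow in C. This is only an illustrative model, assuming each sample is already a full 64-bit cycle count; `run_stats`, `lowest`, `record_run` and `average_cycles` are hypothetical names that do not appear in the macro, and a run is treated as unsuccessful whenever the count does not exceed the overhead, which simplifies the macro's `cmp eax, 0` test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the macro's bookkeeping; names are illustrative. */
typedef struct {
    uint64_t sum;        /* running sum of successful runs (role of __Average) */
    uint32_t successful; /* number of runs that counted (role of EDI) */
} run_stats;

/* Keep the lowest sample, as the Lower/Higher labels do in both loops. */
uint64_t lowest(const uint64_t *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    uint64_t best = UINT64_MAX; /* same idea as initializing __Count to -1 */
    for (size_t i = 0; i < n; i++) {
        if (samples[i] < best)
            best = samples[i];
    }
    return best;
}

/* Subtract the overhead; returns 1 if the run counted, 0 if it was skipped. */
int record_run(run_stats *s, uint64_t count, uint64_t overhead)
{
    assert(s != NULL);
    if (count <= overhead)
        return 0; /* code too short to measure: run is skipped */
    s->sum += count - overhead;
    s->successful++;
    return 1;
}

/* Average over successful runs only; 0 when no run succeeded. */
uint64_t average_cycles(const run_stats *s)
{
    assert(s != NULL);
    return s->successful ? s->sum / s->successful : 0;
}
```

Taking the minimum rather than the mean inside each run is what filters out samples inflated by interrupts or task switches; only the per-run minima are averaged at the end.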
|||
bzdashek 03 May 2012, 06:18
Thanks for the source, Inagawa. It doesn't run on my Atom.
Did you consider using the QueryPerformanceCounter API, like AsmGuru62 suggested in one of your topics? It also returns a QWORD (in the form of a LARGE_INTEGER structure), and you don't have to write such a long macro.
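One thing to keep in mind with that route: QueryPerformanceCounter reports ticks of a fixed-frequency clock, not CPU cycles, so the result has to be scaled by the value QueryPerformanceFrequency returns. A minimal sketch of that conversion in portable C follows; `qpc_ticks_to_ns` is a hypothetical helper, and the two Windows calls themselves are left out so the arithmetic can be checked anywhere.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a QueryPerformanceCounter tick delta to nanoseconds.
 * start/end are the QuadPart values of two counter readings and
 * ticks_per_second is what QueryPerformanceFrequency reports.
 * Hypothetical helper name; only the arithmetic is shown.
 */
uint64_t qpc_ticks_to_ns(uint64_t start, uint64_t end, uint64_t ticks_per_second)
{
    assert(ticks_per_second > 0 && end >= start);
    uint64_t ticks = end - start;
    /* Split into whole seconds and a remainder so the multiplication
       stays within 64 bits for frequencies below roughly 18 GHz. */
    uint64_t seconds = ticks / ticks_per_second;
    uint64_t rest = ticks % ticks_per_second;
    return seconds * 1000000000ULL + rest * 1000000000ULL / ticks_per_second;
}
```

With a 3 MHz counter, for example, a delta of 3 ticks comes out as 1000 ns, which also shows the limit of this approach: anything shorter than one tick cannot be resolved, so short code sequences have to be repeated many times.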
|||
Inagawa 03 May 2012, 06:38
I am planning on updating this macro to automatically use QPC on processors that do not have RDTSCP in their arsenal. I have to figure out the source of weird results I've been getting from it, though.
You should be able to run it on Intel Atom by deleting the p on the RDTSCP instructions.
|||
revolution 03 May 2012, 06:59
Inagawa wrote: You should be able to run it on Intel Atom by deleting the p on the RDTSCP instructions.
|||
Inagawa 03 May 2012, 07:11
But I will gladly listen to anything you can teach me! Simply saying I am using it wrong won't really help me. I barely understand the Intel Software Developer's Manual. All I got out of it was something about RDTSCP calling the CPUID before the ReadTimeStamp, but I still have to call it afterwards to force the execution to be serial, no?
If you have a way to improve my code and my understanding, I'm more than willing to listen and learn. I have tried to understand Intel's code to the best of my abilities.
"The solution to the problem presented in Section 0 is to add a CPUID instruction just after the RDTSCP and the two mov instructions (to store in memory the value of edx and eax). The implementation is as follows:"
Code:
asm volatile ("CPUID\n\t"
              "RDTSC\n\t"
              "mov %%edx, %0\n\t"
              "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low)::
              "%rax", "%rbx", "%rcx", "%rdx");
/***********************************/
/*call the function to measure here*/
/***********************************/
asm volatile("RDTSCP\n\t"
             "mov %%edx, %0\n\t"
             "mov %%eax, %1\n\t"
             "CPUID\n\t": "=r" (cycles_high1), "=r" (cycles_low1)::
             "%rax", "%rbx", "%rcx", "%rdx");

"In the code above, the first CPUID call implements a barrier to avoid out-of-order execution of the instructions above and below the RDTSC instruction. Nevertheless, this call does not affect the measurement since it comes before the RDTSC (i.e., before the timestamp register is read)."
I deciphered the ugly mess of code, read the text and then tried to apply that. So - could you please help me improve the code? I'd be grateful.
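For reference, the four values the quoted snippet stores (`cycles_high`, `cycles_low`, `cycles_high1`, `cycles_low1`) still have to be joined into two 64-bit timestamps and subtracted. A minimal C sketch of that step, which is the same arithmetic the macro does with `sub`/`sbb` on the two halves; `tsc_join` and `tsc_elapsed` are hypothetical names, and the sketch assumes both readings were taken on the same core so the second is not smaller than the first.

```c
#include <assert.h>
#include <stdint.h>

/* Join the EDX (high) and EAX (low) halves into one 64-bit timestamp. */
uint64_t tsc_join(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | low;
}

/*
 * Elapsed cycles between two readings. One 64-bit subtraction does what
 * "sub eax, ecx" followed by "sbb edx, ecx" does on the two halves.
 */
uint64_t tsc_elapsed(uint32_t high0, uint32_t low0,
                     uint32_t high1, uint32_t low1)
{
    uint64_t start = tsc_join(high0, low0);
    uint64_t end = tsc_join(high1, low1);
    assert(end >= start); /* assumption: same core, counter did not go backwards */
    return end - start;
}
```

The borrow matters whenever the low half wraps between the two readings: a start of 0:0xFFFFFFF0 and an end of 1:0x00000010 is a difference of 0x20 cycles, not a huge number.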
|||
bzdashek 03 May 2012, 08:47
Inagawa, Intel has updated their CPUID manual, take a look at this:
http://www.intel.com/Assets/PDF/appnote/241618.pdf
In the appendix part they have examples, which are very useful.
|||
Picnic 03 May 2012, 13:43
Pentium E6800 3.33GHz not supported.
|||
revolution 03 May 2012, 13:54
Inagawa wrote: So - could you please help me improve the code? I'd be grateful.
|||
Inagawa 03 May 2012, 14:32
You're probably right on this one. I have modified the code to check for CPUID, and then check for RDTSC.
I'll post the finished code in a new topic in Macroinstructions.
bzdashek: Thanks, I had a look into it, definitely useful for CPUID related stuff.
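The check order mentioned here (CPUID first, then the timestamp features) comes down to a few bit tests. A rough C sketch of the decision, assuming the caller has already confirmed that CPUID exists and passes in the raw register values; `pick_timer`, `timer_kind` and the macro names are hypothetical. The bit positions are the documented ones: leaf 1 EDX bit 4 for RDTSC and leaf 0x80000001 EDX bit 27 for RDTSCP, the latter being the bit tested by `bt edx, 0x1B` in the first version.

```c
#include <assert.h>
#include <stdint.h>

/* Feature bits as documented for CPUID; the macro names are illustrative. */
#define CPUID_LEAF1_EDX_TSC    (1u << 4)   /* leaf 1, EDX bit 4: RDTSC */
#define CPUID_EXT1_EDX_RDTSCP  (1u << 27)  /* leaf 0x80000001, EDX bit 27: RDTSCP */

/* Same bit that "bt edx, 0x1B" tests. */
static_assert(CPUID_EXT1_EDX_RDTSCP == 0x08000000u, "RDTSCP is EDX bit 27");

typedef enum { TIMER_NONE, TIMER_RDTSC, TIMER_RDTSCP } timer_kind;

/*
 * leaf1_edx    : EDX after CPUID with EAX = 1
 * max_ext_leaf : EAX after CPUID with EAX = 0x80000000
 * ext1_edx     : EDX after CPUID with EAX = 0x80000001
 * Assumes the caller has already verified that CPUID itself is available.
 */
timer_kind pick_timer(uint32_t leaf1_edx, uint32_t max_ext_leaf, uint32_t ext1_edx)
{
    /* Leaf 0x80000001 is only meaningful if the CPU reports it as supported. */
    if (max_ext_leaf >= 0x80000001u && (ext1_edx & CPUID_EXT1_EDX_RDTSCP))
        return TIMER_RDTSCP;
    if (leaf1_edx & CPUID_LEAF1_EDX_TSC)
        return TIMER_RDTSC;
    return TIMER_NONE;
}
```

Checking the highest extended leaf before trusting leaf 0x80000001 is the part that is easy to skip: a CPU that does not implement that leaf can return unrelated data in EDX.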
|||
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.