flat assembler
Message board for the users of flat assembler.

Index > Windows > Speed test results for loops

Goto page 1, 2  Next

Have you utilized this in your programs?
Yes
60%
 60%  [ 3 ]
No
40%
 40%  [ 2 ]
Didn't even realize you could?!
0%
 0%  [ 0 ]
Total Votes : 5

Author
Thread Post new topic Reply to topic
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 14 Jan 2008, 01:34
I devoted some time today to making a few simple programs to test whether concepts like loop unrolling and an odd conditional loop that I run into often. The first test of loop unrolling I have here:
Code:
;This will test whether unrolling a loop will cause a speed boost, several different unroll's

;Results:
;One roll =            11 1/2 seconds
;Two rolls =           9 seconds
;Five rolls =          8 1/2 seconds
;Ten rolls =           7 1/2 seconds
;Twenty rolls =       ~7 1/4 seconds
;Hundred rolls =      ~7 1/4 seconds
;Thousand rolls =     ~7 1/4 seconds

;Analysis:
;The breaking point appears to be around ten-twenty
;rolls of this simple loop (consists of four
;arithmetic operations).  After that, the speed
;difference is unmeasurable.

;Applications:
;The loop should be unrolled between two-five times
;for optimum relation between speed and code size.

format PE console
include '%fasminc%\win32ax.inc'
entry Efficiency

section '.code' code readable executable
Efficiency:
        pusha
        mov ebp,4000000000

        ;Testing Loop
    @@: add ebx,5
        sub eax,5
        sub edx,7
        dec ebp
        jnz @b

        ;Ending
        push Done
        call [printf]
        add esp,4
        call [getch]
        popa
        push 0
        call [ExitProcess]

section '.data' data readable writeable
Done DB 'Done.',0

section '.idata' import data readable writeable

library Kernel,'Kernel32.dll',\
        Msvcrt,'Msvcrt.dll'

import Kernel,\
       ExitProcess,'ExitProcess'

import Msvcrt,\
       printf,'printf',\
       getch,'_getch'
    

I used four simple operations as the loop, and the results are at the top. It shows that loops unrolled between two and five times have a significant increase in speed, and beyond that (even up to a thousand!) the speed barely changes at all. Here's the second test, which may be a bit confusing:
Code:
 ;This will test the 1,2,3,4 loop structure for a comparison tree
;There will be a comparison tree of four different options
;The optimization will be having four different loops instead of the comparison tree

;The point is of using a cmp tree to locate the correct structure

;Results:
;Unoptimized - 15 seconds at 20000000
;optimized - ~4.5 seconds at 20000000

;Analysis:
;The optimized version was over three times faster

format PE console
include '%fasminc%\win32ax.inc'
entry Efficiency

section '.code' code readable executable
Efficiency:
        pusha
        mov ecx,20000000
        ;
        ;jmp .2 ;comment for un-optimized
        ;
    .O: xor ebp,ebp
        ;First Round
    .1: cmp ebp,15
        ja @f
        ;<=15
        add ebx,5
        sub eax,5
        sub edx,7
        inc esi
        jmp .e
        ;Second Round
    @@: cmp ebp,31
        ja @f
        ;<=31
        add ebx,5
        sub eax,5
        sub edx,7
        inc esi
        jmp .e
        ;Third Round
    @@: cmp ebp,47
        ja @f
        ;<=47
        add ebx,5
        sub eax,5
        sub edx,7
        inc esi
        jmp .e
    @@: ;Fourth Round
        ;<=63
        add ebx,5
        sub eax,5
        sub edx,7
        inc esi
    .e: ;End inner loop
        inc ebp
        cmp ebp,64
        jnz .1
        ;End outer loop
        dec ecx
        jnz .O
        jmp .ee

    .2: ;Optimized version
   .O2: xor ebp,ebp
        ;First Round
   .12: ;<=15
        add ebx,5
        sub eax,5
        sub edx,7
        inc esi
        inc ebp
        cmp ebp,15
        jna .12
        ;Second Round
    @@: ;<=31
        add ebx,5
        sub eax,5
        sub edx,7
        inc esi
        inc ebp
        cmp ebp,31
        jna @b
    @@: ;Third Round
        ;<=47
        add ebx,5
        sub eax,5
        sub edx,7
        inc esi
        inc ebp
        cmp ebp,47
        jna @b
    @@: ;Fourth Round
        ;<=63
        add ebx,5
        sub eax,5
        sub edx,7
        inc esi
        inc ebp
        cmp ebp,63
        jna @b
   .e2: ;End outer loop
        dec ecx
        jnz .O2

        ;Ending
   .ee: push Done
        call [printf]
        add esp,4
        call [getch]
        popa
        push 0
        call [ExitProcess]

section '.data' data readable writeable
Done DB 'Done.',0

section '.idata' import data readable writeable

library Kernel,'Kernel32.dll',\
        Msvcrt,'Msvcrt.dll'

import Kernel,\
       ExitProcess,'ExitProcess'

import Msvcrt,\
       printf,'printf',\
       getch,'_getch'
    

The concept is that in the hashing functions that I was working on had C code much like this:
Code:
for i from 0 to 63
        if 0 d i d 15 then
            f := (b and c) or ((not b) and d)
            g := i
        else if 16 d i d 31
            f := (d and b) or ((not d) and c)
            g := (5×i + 1) mod 16
        else if 32 d i d 47
            f := b xor c xor d
            g := (3×i + 5) mod 16
        else if 48 d i d 63
            f := c xor (b or (not d))
            g := (7×i) mod 16
    

All it was is a switch block encased in a loop, where the loop counter variable would be the block's context. I looked at this and then realized that if I seperated the four if blocks into four small loops that it will take a few less comparisons as the loop goes on. What I didn't realize is how dramtic the results were!! Depending on how many lines of code were in each block, it would EXTREMELY make it more efficient. For anyone reading this, it could make a very large difference in most major hashing functions such as MD family and the SHA family, both of which employ this exact structure as their main loop. The running time can now be reduced, and with the smaller code size in between the blocks, quite dramatically. I suggest everyone updates their libraries when they check this out! (that is, if I'm not the only person who saw this)... There's no doubt many other algorithms that use this, maybe in ciphers or most other hashing spec's. Have fun!
[/code]
Post 14 Jan 2008, 01:34
View user's profile Send private message Visit poster's website Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 14 Jan 2008, 07:49
How come people still use include '%fasminc%\win32ax.inc'
I always have to replace every occurrence of "%fasminc%\" with "" when I get a source. What's the deal? Isn't typing include 'win32ax.inc' faster?

Maybe I'm too stubborn to use fasminc instead of include that was FASM a long time ago Wink

Okay, lets get to the point now. Loops are great if you don't overuse them Very Happy
Usually unrolling over 4x is hardly useful, but if you've got dec ecx jmp loop then a 8-times unroll could help, but there are other clever solutions for that (i.e. sub ecx,8 ; shr LOOP_COUNTER,3 Very Happy).
The first code gets me 4015ms (that's the code posted, I don't know how you unrolled them)
The second code un-opt gave me 5547ms and
the opt version resulted in 1563ms.

I copy-pasted the 3 instructions in the first example and:
Code:
;Results:              Yours               Mine
;One roll =            11 1/2 seconds    ; 4015ms
;Two rolls =           9 seconds         ; 3218ms (Four rolls 3094ms)
;Five rolls =          8 1/2 seconds     ; 3094...
    

It doesn't get better after 4 unrolls.

Actually these are all NOP loops, meaning the ADD/SUB operations take only one uop and its as fast as times 3 NOP. With these empty spin-wait loops there is no real test data and you can't rely on these results. What you should do is add "fat" to these sources and THEN see if unrolling helps. Right now unrolling helps only because the jump instruction can't squeeze any faster through the time frame given. On Core 2 the set of instructions goes like:
Code:
    ; Instructions        ports
    @@: add ebx,5       ; 0     they're all different registers
        sub eax,5       ;  1
        sub edx,7       ;   5   the end of CLK 1
        dec ebp         ; 0     CLK 2
        jnz @b          ;   5   can only go to port 5
    ; Virtual (simulated) instructions in the stream:
   ;@@: add ebx,5       ;  1    because 0 & 5 are taken
   ;    sub eax,5       ; 0     start of CLK 3
   ;    sub edx,7       ;  1
   ;    dec ebp         ;   5
   ;    jnz @b          ;   5   CLK 4
   ;@@: add ebx,5       ; .... etc.
    

Why this code ALWAYS stays at 2 clocks per loop iteration is that every ADD/SUB/INC/DEC etc. has one clock latency and you can't start another operation on the same registers until the clock after the next one. It finally settles on the 2 clk/iter.


Last edited by Madis731 on 14 Jan 2008, 09:17; edited 1 time in total
Post 14 Jan 2008, 07:49
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4024
Location: vpcmpistri
bitRAKE 14 Jan 2008, 09:14
Code:
unroll
  1    6531 ms
  2    5094 ms
  4    4687 ms
  8    4235 ms
 16   4031 ms
 32   3953 ms
 64   3875 ms
128  3844 ms    
I guess if there is code cache availble may as well use it up, lol.
Post 14 Jan 2008, 09:14
View user's profile Send private message Visit poster's website Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 14 Jan 2008, 09:26
I think it takes a lot of time to warm my cache (T7200, 4MB L2) up in these cases. You must have a LOT of critical fast-compuation-needed code/data if you pollute your cache in this way Razz
Post 14 Jan 2008, 09:26
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4024
Location: vpcmpistri
bitRAKE 14 Jan 2008, 10:21
A little unrolling got me a ~3 cycle/limb multiply on my Athlon!
Code:
mul ebp
add ebx,eax
mov eax,[esi][j]
mov [esi][j][-4],ebx
mov ebx,ecx
adc ebx,edx    
Must be byte offsets on the memory accesses and all data in cache. ECX is zero and works faster than immediate data. I'd jump into the unrolled loop and let it fly...Notice how far in advance EAX is loaded and the use of EDX is late.
Post 14 Jan 2008, 10:21
View user's profile Send private message Visit poster's website Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 14 Jan 2008, 12:27
Which syntax is this and what does it do? There's no ebp init and what is "j"?
Post 14 Jan 2008, 12:27
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4024
Location: vpcmpistri
bitRAKE 14 Jan 2008, 15:27
That is the inner loop to a single limb multiply - a limb is just a dword in GMP-speak. j changes based on loop unroll to enable the full byte addressing range $F..80 to $7C to be used. EDI is the loop counter - only register left besides ESP, lol.

The naive inner loop might look like:
Code:
.0: mov eax,[esi]
    mul [uint32]
    add eax,ebx ; carry
    adc edx,0
    mov ebx,edx ; next carry
    mov [esi],eax

    add esi,4
    dec ecx
    jne .0    
A big integer of ECX dwords is multiplied by a single dword - in place.
Post 14 Jan 2008, 15:27
View user's profile Send private message Visit poster's website Reply with quote
OzzY



Joined: 19 Sep 2003
Posts: 1029
Location: Everywhere
OzzY 14 Jan 2008, 23:44
Madis731 wrote:
How come people still use include '%fasminc%\win32ax.inc'
I always have to replace every occurrence of "%fasminc%\" with "" when I get a source. What's the deal? Isn't typing include 'win32ax.inc' faster?


I think it's still required, no?
If I don't put the '%fasminc%' I get "file not found".
Post 14 Jan 2008, 23:44
View user's profile Send private message Reply with quote
handyman



Joined: 04 Jun 2007
Posts: 40
Location: USA - KS
handyman 15 Jan 2008, 01:48
Quote:

I think it's still required, no?
If I don't put the '%fasminc%' I get "file not found".


It depends on what you have in your Fasmw.ini file [Environment] section.
If you put an 'Include=' entry in then you do not need the '%fasminc%', or you can put both the 'Include=' and 'Fasminc=' entries in and have it both ways if you make sure that the path values after the '='s are the same.
Post 15 Jan 2008, 01:48
View user's profile Send private message Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 15 Jan 2008, 06:52
@Ozzy: So you're saying the the Beer-example doesn't compile - from the FASM package?
Post 15 Jan 2008, 06:52
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 16 Jan 2008, 01:22
K... Just a little side-question here, I almost gave up on a recent project because I couldn't figure out why my 64-bit shifting/rotating arithmetic in SHA-512 wasn't working. I would use this to add the numbers:
Code:
add eax,[Var]
adc ebx,[Var+4]
    

Except I used the Var variables literally. I considered [var] to be the low dword, and var+4 to be the high dword (later in memory). It took me a long time to realize I was adding them in the wrong order. It's supposed to be:
Code:
add ebx,[Var+4]
adc eax,[Var]
    

Am I right?
Post 16 Jan 2008, 01:22
View user's profile Send private message Visit poster's website Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 16 Jan 2008, 08:05
Depending on how you use them. The way add rax,[Var] works is that the least significant byte is stored in the first byte so you would do:
Code:
add eax,[Var]
adc ebx,[Var+4]
    

But if you are handling it in another way, then its totally upto you. Though the AMD64 variant is much easier because you can later "upgrade" it to use 64-bit registers or add it with XMM (128-bit) in the future.
Post 16 Jan 2008, 08:05
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 16 Jan 2008, 16:50
Yeah, in mem my numbers are stored as:

Var1 DD 0,0

So that when I need to add two together, I use

Code:
add [Var1+4],[OtherVar+4]
adc [Var1],[OtherVar]
    

Totally hypothetically though Smile The only reason for my backwards logic is that I literally store the left and right portions of the 64-bit number in memory just as they look. That's why I use the carry bit on the first number, because the low bit of the first number for me is the overflow. Little confusing, but should work if I ever get around to testing it. School and studying have deprived me of almost all time for my stuff.

I will take Vid's "BigNum" page into action when I get the chance, he says to store them like this:

Var1 DD -LowerDword-
DD -HighDword-

so in memory it's easier to use. What do you guys say about how to store large numbers (for the ease of accessing and manipulating like shifting)?
Post 16 Jan 2008, 16:50
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20303
Location: In your JS exploiting you and your system
revolution 16 Jan 2008, 17:04
AlexP wrote:
What do you guys say about how to store large numbers (for the ease of accessing and manipulating like shifting)?
Don't confuse yourself by mixing endian. By using little endian throughout there is no chance of confusion about what order to do things in.

I always like to see this:
Code:
add eax,[Var]
adc ebx,[Var+4]    
It looks much cleaner and nicer. I wrote my SHA512 in little endian and only used bswap on the raw input before entering the hash function.
Post 16 Jan 2008, 17:04
View user's profile Send private message Visit poster's website Reply with quote
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 16 Jan 2008, 23:01
bswap? Well, I have an entire function devoted to converting all the user data to opposite endian-ness before operating on it. It works for me, is bswap an instruction or something? I would love to have access to it!!
Another side question, I'm making a command-line parser for myself, what is the API to get the name of the current program? MSDN is so flooded with forums and comments that most API's I hear about will not come up. What is the best one? If you're wonderin', I use GetCommandLineA and then locate the first /,\,-, or whatever starts off the flags. I've decided for now to use a string as a parameter to the parsing function, which will be what it will search for. So what API is the one to use?
Post 16 Jan 2008, 23:01
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 16 Jan 2008, 23:05
TFM wrote:
bswap reverses the byte order of a 32-bit general register: bits 0 through 7 are swapped with bits 24 through 31, and bits 8 through 15 are swapped with bits 16 through 23. This instruction is provided for converting little-endian values to big-endian format and vice versa.

bswap edx ; swap bytes in register
Post 16 Jan 2008, 23:05
View user's profile Send private message Reply with quote
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 17 Jan 2008, 00:22
Wow.. All this time I didn't know about it. So I could use rep bswap? No, that might act weird... Thanks for telling me Smile
Post 17 Jan 2008, 00:22
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20303
Location: In your JS exploiting you and your system
revolution 17 Jan 2008, 01:33
AlexP: RTFM please. Don't make us your google monkeys. And I nearly forget to mention, RTFM. BTW: check out my website, it talks about rep and bswap extensively.
Post 17 Jan 2008, 01:33
View user's profile Send private message Visit poster's website Reply with quote
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 17 Jan 2008, 03:24
luv GoogleMonkeys. Do u just wait around on forums waiting for someone with questions to come along??? That site's funny, I'll have to make fun of someone else I know who does that Smile Thanks for trying revolution, grow a soul lol
Post 17 Jan 2008, 03:24
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4024
Location: vpcmpistri
bitRAKE 17 Jan 2008, 04:49
When I first starting programming on x86 (from the Motorola processors: 6502, 680x0, 560x0) I thought the little endian thing was really stupid. Hated it actually - had to change my thinking around quite a bit. Now, everything is done little endian - big endian seems backwards. Why would anyone store the high byte first? No algorithm works better with the high byte first - usually, nothing can be done until the size is known. Little endian works no matter what the data size is - algorithm is the same.

So, I highly recommend little endian no matter what you're doing.
Post 17 Jan 2008, 04:49
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.