flat assembler
Message board for the users of flat assembler.
Speed test results for loops
Madis731 14 Jan 2008, 07:49
How come people still use include '%fasminc%\win32ax.inc'?
I always have to replace every occurrence of "%fasminc%\" with "" when I get a source. What's the deal? Isn't typing include 'win32ax.inc' faster? Maybe I'm too stubborn to use fasminc instead of include - that was FASM a long time ago.

Okay, let's get to the point now. Loops are great if you don't overuse them. Unrolling beyond 4x is usually of little use, but if you've got a "dec ecx / jnz" style loop then an 8-times unroll can help - and there are other clever solutions for that (e.g. sub ecx,8, or shr LOOP_COUNTER,3).

The first code gets me 4015ms (that's the code as posted - I don't know how you unrolled them). The second code un-optimized gave me 5547ms and the optimized version resulted in 1563ms. I copy-pasted the 3 instructions in the first example and:

Code:
;Results:     Yours             Mine
;One roll   = 11 1/2 seconds  ; 4015ms
;Two rolls  = 9 seconds       ; 3218ms (Four rolls 3094ms)
;Five rolls = 8 1/2 seconds   ; 3094...

It doesn't get better after 4 unrolls. Actually these are all NOP loops, meaning the ADD/SUB operations take only one uop each and it's as fast as "times 3 nop". With these empty spin-wait loops there is no real test data, so you can't rely on these results. What you should do is add some "fat" to these sources and THEN see whether unrolling helps. Right now unrolling helps only because the jump instruction can't squeeze through the time frame any faster.

On Core 2 the set of instructions goes like:

Code:
; Instructions       ports
@@: add ebx,5        ; 0   they're all different registers
    sub eax,5        ; 1
    sub edx,7        ; 5   the end of CLK 1
    dec ebp          ; 0   CLK 2
    jnz @b           ; 5   can only go to port 5
; Virtual (simulated) instructions in the stream:
;@@: add ebx,5       ; 1   because 0 & 5 are taken
;    sub eax,5       ; 0   start of CLK 3
;    sub edx,7       ; 1
;    dec ebp         ; 5
;    jnz @b          ; 5   CLK 4
;@@: add ebx,5       ; ... etc.

The reason this code ALWAYS stays at 2 clocks per loop iteration is that every ADD/SUB/INC/DEC has one clock of latency, so another operation on the same register can't start until the following clock. It finally settles at 2 clk/iteration.

Last edited by Madis731 on 14 Jan 2008, 09:17; edited 1 time in total
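To make the "sub ecx,8" counting idea concrete, here is a minimal FASM sketch of an 8x unroll; ITERATIONS and the nop body are placeholders, not the code from the actual test:

Code:
; hypothetical 8x unroll using the "sub ecx,8" counting idea;
; ITERATIONS is a placeholder constant and must be a multiple of 8
        mov     ecx,ITERATIONS
align 16
@@:
        rept 8 { nop }          ; stand-in for the real loop body
        sub     ecx,8           ; retire 8 iterations per pass
        jnz     @b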
bitRAKE 14 Jan 2008, 09:14
Code:
unroll      time
     1   6531 ms
     2   5094 ms
     4   4687 ms
     8   4235 ms
    16   4031 ms
    32   3953 ms
    64   3875 ms
   128   3844 ms
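The post doesn't say how these times were taken; a common way to get millisecond figures like this on Win32 is to bracket the test loop with GetTickCount. A rough sketch, assuming the usual win32ax.inc setup and a start_ms dword I've made up (the iteration count is arbitrary):

Code:
        invoke  GetTickCount
        mov     [start_ms],eax         ; start_ms dd ? lives in the data section
        mov     ecx,100000000          ; arbitrary iteration count
@@:     dec     ecx
        jnz     @b
        invoke  GetTickCount
        sub     eax,[start_ms]         ; eax = elapsed milliseconds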
Madis731 14 Jan 2008, 09:26
I think it takes a lot of time to warm my cache (T7200, 4MB L2) up in these cases. You must have a LOT of critical fast-computation-needed code/data if you pollute your cache this way.
bitRAKE 14 Jan 2008, 10:21
A little unrolling got me a ~3 cycle/limb multiply on my Athlon!
Code:
    mul ebp
    add ebx,eax
    mov eax,[esi][j]
    mov [esi][j][-4],ebx
    mov ebx,ecx
    adc ebx,edx
Madis731 14 Jan 2008, 12:27
Which syntax is this and what does it do? There's no ebp init and what is "j"?
bitRAKE 14 Jan 2008, 15:27
That is the inner loop of a single-limb multiply - a limb is just a dword in GMP-speak. j changes with the loop unroll so that the full signed-byte displacement range (-$80 to +$7C) can be used. EDI is the loop counter - the only register left besides ESP, lol.
The naive inner loop might look like:

Code:
.0: mov eax,[esi]
    mul dword [uint32]
    add eax,ebx        ; carry
    adc edx,0
    mov ebx,edx        ; next carry
    mov [esi],eax
    add esi,4
    dec ecx
    jne .0
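For context, a hedged sketch of how that inner loop might be driven; the register setup, the 'number' and LIMB_COUNT names, and the extra top limb are my assumptions, not bitRAKE's actual routine:

Code:
; multiply a LIMB_COUNT-dword number at 'number' by the dword at [uint32], in place
; ('number', LIMB_COUNT and the reserved extra top limb are assumptions)
mul_by_limb:
        mov     esi,number
        mov     ecx,LIMB_COUNT
        xor     ebx,ebx                ; incoming carry = 0
.0:     mov     eax,[esi]
        mul     dword [uint32]
        add     eax,ebx                ; add carry from the previous limb
        adc     edx,0
        mov     ebx,edx                ; next carry
        mov     [esi],eax
        add     esi,4
        dec     ecx
        jne     .0
        mov     [esi],ebx              ; final carry becomes the extra top limb
        ret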
OzzY 14 Jan 2008, 23:44
Madis731 wrote:
How come people still use include '%fasminc%\win32ax.inc'?

I think it's still required, no? If I don't put the '%fasminc%' I get "file not found".
handyman 15 Jan 2008, 01:48
Quote:
If I don't put the '%fasminc%' I get "file not found".
It depends on what you have in the [Environment] section of your Fasmw.ini file. If you put an 'Include=' entry there, then you do not need the '%fasminc%'; or you can put both 'Include=' and 'Fasminc=' entries in and have it both ways, as long as the path values after the '='s are the same.
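For example, an [Environment] section along these lines makes both forms resolve to the same directory; the path here is only an illustration:

Code:
[Environment]
Include=C:\fasm\INCLUDE
Fasminc=C:\fasm\INCLUDE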
Madis731 15 Jan 2008, 06:52
@OzzY: So you're saying the Beer example from the FASM package doesn't compile?
AlexP 16 Jan 2008, 01:22
K... Just a little side question here: I almost gave up on a recent project because I couldn't figure out why my 64-bit shifting/rotating arithmetic in SHA-512 wasn't working. I would use this to add the numbers:
Code:
    add eax,[Var]
    adc ebx,[Var+4]

Except I used the Var variables literally: I considered [Var] to be the low dword, and Var+4 to be the high dword (later in memory). It took me a long time to realize I was adding them in the wrong order. It's supposed to be:

Code:
    add ebx,[Var+4]
    adc eax,[Var]

Am I right?
Madis731 16 Jan 2008, 08:05
That depends on how you use them. The way add rax,[Var] works is that the least significant byte is stored first in memory, so you would do:
Code:
    add eax,[Var]
    adc ebx,[Var+4]

But if you are handling it in another way, then it's totally up to you. Though the AMD64 variant is much easier, because you can later "upgrade" it to use 64-bit registers or add it with XMM (128-bit) in the future.
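Spelled out as a hedged example (the value and the eax = low / ebx = high register convention are just for illustration), the little-endian layout and the matching add/adc order look like this:

Code:
Var     dd $89ABCDEF           ; low  dword, at Var
        dd $01234567           ; high dword, at Var+4
; add the 64-bit value at Var into the pair ebx:eax (eax = low, ebx = high)
        add     eax,[Var]      ; low halves first - this sets the carry
        adc     ebx,[Var+4]    ; high halves plus that carry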
AlexP 16 Jan 2008, 16:50
Yeah, in memory my numbers are stored as:

Var1 DD 0,0

So that when I need to add two together, I use:

Code:
    add [Var1+4],[OtherVar+4]   ; (hypothetically - x86 has no memory-to-memory add)
    adc [Var1],[OtherVar]

Totally hypothetically, though. The only reason for my backwards logic is that I literally store the left and right portions of the 64-bit number in memory just as they look. That's why I use the carry bit on the first number: the low bit of the first number, for me, is the overflow. A little confusing, but it should work if I ever get around to testing it. School and studying have deprived me of almost all time for my stuff. I will put Vid's "BigNum" page into action when I get the chance; he says to store them like this:

Code:
Var1 DD -LowerDword-
     DD -HighDword-

so in memory it's easier to use. What do you guys say about how to store large numbers (for the ease of accessing and manipulating, like shifting)?
revolution 16 Jan 2008, 17:04
AlexP wrote:
What do you guys say about how to store large numbers (for the ease of accessing and manipulating, like shifting)?

I always like to see this:

Code:
    add eax,[Var]
    adc ebx,[Var+4]
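Since shifting was the use case asked about: with the same low-dword-first layout, a 64-bit shift can stay entirely in memory. A small sketch, with BigNum and its value made up for illustration:

Code:
BigNum  dd $FFFFFFFF           ; low  dword
        dd $00000000           ; high dword
; shift the 64-bit value at BigNum left by one bit
        shl     dword [BigNum],1       ; bit 31 of the low dword goes to CF
        rcl     dword [BigNum+4],1     ; CF comes in as bit 0 of the high dword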
AlexP 16 Jan 2008, 23:01
bswap? Well, I have an entire function devoted to converting all the user data to the opposite endianness before operating on it. It works for me, but is bswap an instruction or something? I would love to have access to it!!
Another side question: I'm making a command-line parser for myself - what is the API to get the name of the current program? MSDN is so flooded with forums and comments that most APIs I hear about won't come up. What is the best one? If you're wondering, I use GetCommandLineA and then locate the first /, \, -, or whatever starts off the flags. I've decided for now to pass a string to the parsing function, which will be what it searches for. So which API is the one to use?
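For reference, a minimal FASM sketch of the GetCommandLineA approach described above; the choice of '/' as the switch character and the labels are placeholders:

Code:
; get the raw command line and find the first '/' switch, if any
parse_cmdline:
        invoke  GetCommandLineA        ; returns a pointer to the command line in eax
        mov     esi,eax
.scan:  lodsb                          ; al = next character, esi advances
        test    al,al
        jz      .no_switch             ; hit the terminating zero - no switch found
        cmp     al,'/'
        jne     .scan
        ; esi now points just past the first '/'
.no_switch:
        ret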
LocoDelAssembly 16 Jan 2008, 23:05
TFM wrote:
bswap reverses the byte order of a 32-bit general register: bits 0 through 7 are swapped with bits 24 through 31, and bits 8 through 15 are swapped with bits 16 through 23. This instruction is provided for converting little-endian values to big-endian format and vice versa.
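In FASM terms that is simply (the value is just an example):

Code:
        mov     eax,$12345678
        bswap   eax             ; eax is now $78563412 - byte order reversed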
AlexP 17 Jan 2008, 00:22
Wow.. All this time I didn't know about it. So I could use rep bswap? No, that might act weird... Thanks for telling me
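Right - the rep prefix only applies to the string instructions (movs, stos, lods, scas, cmps), so byte-swapping a whole buffer needs an explicit loop. A small sketch, with 'buffer' and 'count' as made-up placeholders:

Code:
; reverse the byte order of 'count' dwords starting at 'buffer'
        mov     esi,buffer
        mov     ecx,count
@@:     mov     eax,[esi]
        bswap   eax
        mov     [esi],eax
        add     esi,4
        dec     ecx
        jnz     @b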
revolution 17 Jan 2008, 01:33
AlexP: RTFM please. Don't make us your google monkeys. And I nearly forgot to mention: RTFM. BTW: check out my website, it talks about rep and bswap extensively.
AlexP 17 Jan 2008, 03:24
luv GoogleMonkeys. Do u just wait around on forums waiting for someone with questions to come along??? That site's funny - I'll have to make fun of someone else I know who does that. Thanks for trying, revolution; grow a soul lol
bitRAKE 17 Jan 2008, 04:49
When I first started programming on x86 (coming from the Motorola processors: 6502, 680x0, 560x0) I thought the little-endian thing was really stupid. Hated it, actually - I had to change my thinking around quite a bit. Now everything is done little-endian, and big-endian seems backwards. Why would anyone store the high byte first? No algorithm works better with the high byte first - usually nothing can be done until the size is known. Little-endian works no matter what the data size is; the algorithm is the same.

So, I highly recommend little endian no matter what you're doing.