flat assembler
Message board for the users of flat assembler.

struct's, macro's and fpu Vector's
Author
bitshifter

Joined: 04 Dec 2007
Posts: 791
Location: Massachusetts, USA
bitshifter 13 Nov 2008, 04:10
Hello, I am studying fasm structs, macros and FPU vector operations.

(assuming usage of win32ax.inc and msvcrt.dll)

Here is my double precision Vector structure.
Code:
```
struct Vector
x dq ?
y dq ?
z dq ?
ends
```

This macro calculates the length of a Vector.
Code:
```
macro GetVectorLength v
{
fld   qword[v]
fmul  qword[v]      ; x*x
fld   qword[v+8]
fmul  qword[v+8]    ; y*y
faddp st1,st0       ; x*x + y*y
fld   qword[v+16]
fmul  qword[v+16]   ; z*z
faddp st1,st0       ; x*x + y*y + z*z
fsqrt
; leaves length on fpu stack
}
```
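For reference, the value the macro is meant to leave on the FPU stack is the Euclidean length of the vector, written here with the structure's field names:

```latex
\[
  \lVert v \rVert = \sqrt{x^2 + y^2 + z^2}
\]
```

So the three squared components have to be added together before the square root is taken.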

This macro normalizes a Vector to unit length.
Code:
```
macro NormalizeVector v
{
GetVectorLength v
fld  qword[v]
fdiv st0,st1 ; possible divide by zero noted.
fstp qword[v]
fld  qword[v+8]
fdiv st0,st1 ; possible divide by zero noted.
fstp qword[v+8]
fld  qword[v+16]
fdiv st0,st1 ; possible divide by zero noted.
fstp qword[v+16]
fstp st0 ; remove length from fpu stack
}
```
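One possible way to deal with the divide by zero noted in the comments is to test the length before dividing. This is only a sketch: NormalizeVectorSafe is a name made up here, the length is computed inline so the macro stands on its own, and fcomip assumes a P6-class or later CPU.

```asm
; Hypothetical variant, not the macro from the post above.
; A zero-length vector is left unchanged instead of being divided by zero.
macro NormalizeVectorSafe v
{
  local done
  fld   qword[v]
  fmul  st0,st0          ; x*x
  fld   qword[v+8]
  fmul  st0,st0          ; y*y
  faddp st1,st0          ; x*x + y*y
  fld   qword[v+16]
  fmul  st0,st0          ; z*z
  faddp st1,st0          ; x*x + y*y + z*z
  fsqrt                  ; st0 = length
  fldz                   ; st0 = 0.0, st1 = length
  fcomip st1             ; compare 0.0 with length, set ZF/PF/CF, pop the 0.0
  je    done             ; ZF set: length is zero (or NaN), skip the divides
  fld   qword[v]
  fdiv  st0,st1          ; x / length
  fstp  qword[v]
  fld   qword[v+8]
  fdiv  st0,st1          ; y / length
  fstp  qword[v+8]
  fld   qword[v+16]
  fdiv  st0,st1          ; z / length
  fstp  qword[v+16]
done:
  fstp  st0              ; remove length from fpu stack
}
```

It is used the same way as the original, for example `NormalizeVectorSafe vec`. Whether silently skipping a zero vector is the right behaviour depends on the caller.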

To test these you can do this.
Code:
```
; in data section
vec Vector 1.2,3.4,5.6
buf rb 256
fmt db '%f %f %f',0

; in code section
NormalizeVector vec
cinvoke sprintf,buf,fmt,double[vec.x],double[vec.y],double[vec.z]
invoke MessageBox,0,buf,0,0
```

So what I would like to know is where I can find CPU cycle counts for a given asm instruction.
I have some Intel manuals, but I couldn't find this in them.

The reason is, let's say I come up with a new way to write GetVectorLength, like...
Code:
```
macro GetVectorLength v
{
fld   qword[v]
fld   st0
fmulp st1,st0       ; x*x
fld   qword[v+8]
fld   st0
fmulp st1,st0       ; y*y
faddp st1,st0       ; x*x + y*y
fld   qword[v+16]
fld   st0
fmulp st1,st0       ; z*z
faddp st1,st0       ; x*x + y*y + z*z
fsqrt
; leaves length on fpu stack
}
```

What I see is that it accesses the vector's variables less often but uses a few more FPU instructions, and without profiling the code I can't just guess which version is faster.
If you have any comments or constructive criticism, my ears are open.
bitRAKE

Joined: 21 Jul 2003
Posts: 3957
Location: vpcmipstrm
bitRAKE 13 Nov 2008, 05:32
Check out Agner Fog's optimization manuals and tools. Basically, there are many factors affecting the actual performance of an algorithm, and as the complexity increases, the best that can be done is to test and document the conditions under which you are experiencing a particular metric. Providing a test program allows others to offer results for other configurations. Cycle counts can be approximated with RDTSC, and performance counters help to see what is going on under the hood.

http://www.agner.org/optimize/
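As a rough illustration of the RDTSC approach, here is a minimal sketch that times a vector-length calculation in a loop and shows the average. It assumes win32ax.inc with its automatic imports, which is why it uses wsprintf from user32 instead of sprintf from msvcrt. REPS, len, buf and fmt are names made up for the example. The result includes loop overhead, and on newer CPUs the time-stamp counter ticks at a fixed rate rather than at the current core clock, so treat the number as an approximation only.

```asm
; Rough timing sketch, not a rigorous benchmark.
include 'win32ax.inc'

struct Vector
  x dq ?
  y dq ?
  z dq ?
ends

REPS = 1000000                  ; hypothetical repeat count

.data
  vec Vector 1.2,3.4,5.6
  len dq ?
  buf rb 64
  fmt db 'about %u ticks per iteration (loop overhead included)',0

.code
  start:
        xor     eax,eax
        cpuid                   ; serialize; clobbers eax, ebx, ecx, edx
        rdtsc                   ; edx:eax = time-stamp counter
        mov     esi,eax         ; keep start value in edi:esi
        mov     edi,edx
        mov     ecx,REPS
  .again:
        fld     qword[vec.x]
        fmul    st0,st0         ; x*x
        fld     qword[vec.y]
        fmul    st0,st0         ; y*y
        faddp   st1,st0
        fld     qword[vec.z]
        fmul    st0,st0         ; z*z
        faddp   st1,st0
        fsqrt
        fstp    qword[len]      ; store result, keep the fpu stack balanced
        dec     ecx
        jnz     .again
        rdtsc
        sub     eax,esi         ; edx:eax = elapsed ticks
        sbb     edx,edi
        mov     ecx,REPS
        div     ecx             ; eax = average ticks per iteration
        cinvoke wsprintf,buf,fmt,eax
        invoke  MessageBox,0,buf,"RDTSC",MB_OK
        invoke  ExitProcess,0

.end start
```

To compare two versions of a routine, swap the loop body and run each build several times on the same machine; single runs vary a lot because of interrupts, caching and frequency scaling.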

I just watched a talk I thought was decent: Machine Architecture: Things Your Programming Language Never Told You, http://www.nwcpp.org/Meetings/2007/09.html. It applies to assembly language, too.
bitshifter

Joined: 04 Dec 2007
Posts: 791
Location: Massachusetts, USA
bitshifter 14 Nov 2008, 02:01
Great set of resources!
The block tester says that my abs() is 3 times faster than C/C++ std::abs()!
One question: the charts in the FPU section mention clock cycle overlap if the next instruction is also an FPU instruction.
Can someone explain to me how this works?
f0dder

Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 14 Nov 2008, 02:20
bitshifter wrote:
The block tester says that my abs() is 3 times faster than C/C++ std::abs()!

Which C++ implementation (and version), though? And does it involve CALLing a routine, or are compiler intrinsics used? Are you calling fabs() on float or double values? (On float, it likely involves float->double->fabs()->float on some compilers.)

Not trying to bash you, just saying that there's a lot to benchmarking (different CPUs being somewhat of a pain sometimes), and there's no such thing as "C/C++" when discussing speed. You need to name the full platform, since there are quite a few differences in standard-library quality and compiler code generation.

Sometimes I wish we were back at the Pentium [U,V] pipes, it was easier to optimize back then. There's a lot of rules these days, sometimes it almost feels that optimizing x86 code these days is nondeterministic (yeah yeah, I suck).

_________________
- carpe noctem