flat assembler
Message board for the users of flat assembler.

Index > Main > I'm sure this was discussed a few times before but...

kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
It's been a while since I've had time to program, and I've heard here before that MMX, SSE, and SSE2 are somehow faster than the FPU. By the looks of things, though, they don't use the traditional floating point numbers that most APIs these days (at least the ones I've seen) use. I'm curious whether the speed gain still remains after all the instructions used to convert the numbers back and forth, since I've been looking over the docs (though not too deeply) and haven't seen any binary-to-packed or packed-to-binary conversion instructions. It seems silly, because one would think the conversions would slow things down, yet while venturing elsewhere I've seen people offer libraries using MMX, SSE, and SSE2 for a speed gain over the standard functions. So I'm guessing that either the FPU is much slower than I had anticipated, or I'm missing something. And I'm sure we've all figured out that nine times out of ten, any game worth its salt is going to use trig. If we can speed up some of those calculations, they won't be as bad to make, and maybe we could gain some cycles that way, which would enable more complex collision detection in better time and should greatly increase performance. The real question is: are MMX, SSE, and SSE2 really worth the investment for collision detection and/or movement, or are they only convenient when doing lots of calculations before converting back to the format the functions want to use? Just after everyone's thoughts...
Post 17 Nov 2007, 04:58
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
It's too late here, but aren't SSE scalars already compatible with the FPU? AFAIK, floats are supported by SSE and doubles by SSE2; there is no support for "long double", but how many APIs require and/or return that data type?
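A quick C sketch of that compatibility, using the compiler's SSE intrinsics (the function name is mine): a plain C float feeds the scalar addss instruction directly, with no conversion step anywhere.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: add two plain C floats with the scalar SSE
   addss instruction. No format conversion happens anywhere: the C
   float is already the IEEE 754 single that SSE operates on. */
float sse_scalar_add(float a, float b)
{
    __m128 va = _mm_load_ss(&a);   /* movss: one float into an xmm reg */
    __m128 vb = _mm_load_ss(&b);
    float r;
    _mm_store_ss(&r, _mm_add_ss(va, vb));  /* addss, then movss back */
    return r;
}
```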
Post 17 Nov 2007, 05:30
kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
To be honest, I never tried. Since it says it's designed for "packed numbers", I figured chances were I'd just be wasting my time trying: if I assumed, out of stupidity, that non-packed numbers worked, it would hurt me in the long run. Or do the instructions that load the registers from memory automatically convert to packed and back for you?
Post 17 Nov 2007, 05:34
bitRAKE



Joined: 21 Jul 2003
Posts: 3055
Location: vpcmipstrm
They are only "packed" insofar as the data is consecutive in memory and some operations are performed in parallel on multiple operands; the single- and double-precision floats are the same IEEE data types the FPU uses. You can even use the scalar instructions to operate on just one float at a time, but parallel should be the goal for speed.

There are conversion instructions to/from integers, but the 32-bit and 64-bit floats are the same data types everywhere, so no float format conversion is needed.
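For illustration, a hedged C sketch of those integer conversions via intrinsics (cvtsi2ss and cvttss2si under the hood; the function name is mine):

```c
#include <xmmintrin.h>  /* SSE scalar conversion intrinsics */

/* Hypothetical helper: round-trip an int through a float using the
   SSE conversion instructions (cvtsi2ss in, cvttss2si back out,
   truncating). Exact for |x| < 2^24, where a single holds the int. */
int int_float_roundtrip(int x)
{
    __m128 f = _mm_cvtsi32_ss(_mm_setzero_ps(), x); /* int -> float */
    return _mm_cvttss_si32(f);                      /* float -> int */
}
```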
Post 17 Nov 2007, 06:21
kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
Quote:
There are conversion instructions for to/from integers, but the 32-bit and 64-bit floats are the same data type.


What I need is to be able to take a standard single-precision floating point number in the same format C uses, convert it for SSE, and then convert it all back. The question is: will the conversion macro it appears I'll have to write (unless someone shows me where the pack and unpack float instructions are, because I can't see them) end up slower than just using the FPU instead?
Post 17 Nov 2007, 12:26
bitRAKE



Joined: 21 Jul 2003
Posts: 3055
Location: vpcmipstrm
Just use a structure of four floats and use compiler directives to ensure the data is aligned to a 16-byte boundary. The IEEE floats of Intel (FPU and SSE/SSE2; see Vol. 1 of the manual) and of C/C++ are the same.
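A minimal C sketch of such a structure, assuming a C11 compiler (the type name is mine):

```c
#include <stdalign.h>   /* C11 alignas */

/* A 16-byte-aligned bundle of four ordinary C floats. Its 16 bytes
   can be loaded into an xmm register with a single movaps, and each
   element is still a plain IEEE 754 single. */
typedef struct {
    alignas(16) float v[4];   /* x, y, z, t */
} vec4;
```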

I'd like to recommend Vol. 4 of Intel's manuals, even though I don't code in C/C++ myself (I can read it, but don't bother to know more than that). Vol. 4 specifically addresses your question, and it's from the source.
Post 17 Nov 2007, 16:44
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
kohlrak: just using SSE isn't going to magically speed up your stuff; you need to change how you think about things a bit.

With regular x87 FPU code, you're used to operating on one number at a time. With SSE code, you work on a whole bunch of numbers at a time (the second S is for SIMD). So you need to change your processing methods.

Also, SSE doesn't have trig functions, so you'll need to handle that stuff differently: lookup tables, Taylor series, and so on. Some of the clever heads here can tell you about that.
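As a sketch of the Taylor-series route: a short Maclaurin polynomial for sin(x) in plain C (the function name is mine; a SIMD version would evaluate the same polynomial on four packed floats per instruction).

```c
#include <math.h>

/* Hypothetical sin replacement: a 4-term Maclaurin series in Horner
   form, sin x ~= x - x^3/3! + x^5/5! - x^7/7!. Accurate to roughly
   1e-6 for |x| < 1; use range reduction for larger arguments. */
float sin_taylor(float x)
{
    float x2 = x * x;
    return x * (1.0f - x2 / 6.0f * (1.0f - x2 / 20.0f * (1.0f - x2 / 42.0f)));
}
```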

What I'm trying to say is that there isn't an "SSEMUL" which is magically faster than "FMUL"; you get the speed increase by reworking code, taking advantage of SIMD, alignment, etc.
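A small C illustration of that reworking, using SSE intrinsics (names are mine; n is assumed to be a multiple of 4, and movups-style unaligned loads keep the sketch simple, although aligned data plus movaps would be preferable):

```c
#include <xmmintrin.h>

/* Hypothetical reworked loop: the win isn't a faster multiply, it's
   four multiplies per mulps instead of one fmul at a time. */
void mul_arrays(float *dst, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* movups: 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_mul_ps(va, vb)); /* 4 products */
    }
}
```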
Post 17 Nov 2007, 17:09
Xorpd!



Joined: 21 Dec 2006
Posts: 161
SSE gives you more register file locations to hold in-flight floating point values: 16 for double precision in 32-bit mode, 32 for single precision in 32-bit mode, 32 for double precision in x64, and 64 for single precision in x64, versus 8 for x87 at any precision. Also, SSE has a two-register ISA rather than the one-register ISA of x87. This means no fxch, but you do need a lot of movapd, because a two-register ISA is not as cool as a three-register one, which is supposed to be coming to x86 in a couple of years.

These characteristics make it possible to write faster code in SSE than in x87. A concrete example may be seen in Kümmel's Mandelbrot Benchmark, which has source code with both an x87 and an SSE implementation. Even though a P4 can only issue 64 bits' worth of operations each to the floating point add and floating point multiply pipelines, SSE does better on this class of processor than x87, because one instruction stream of SSE code is like pipelining two instruction streams' worth of x87 code; not to mention that, according to Agner Fog's instruction tables, a P4 can only issue one SSE or x87 floating point operation per clock cycle, because they all go through port 1.

BTW, the above-mentioned manual is quite helpful in that it lays out all the SSE instructions in a compact format, so you can see them and then look them up in a reference such as the Intel® 64 and IA-32 Architectures Software Developer's Manual, Vol. 2A and 2B. You will also want to check out Intel's or Agner Fog's more general manuals, or maybe some material from AMD, for which I don't have time to hunt down URLs just now.

It takes some time to get used to SSE, just because there are so many new instructions. As general exercises in floating point optimization, you might attempt to add a second instruction stream to the x87 version of the Kümmel Mandelbrot Benchmark or to its 32-bit SSE version. This has been done, in a different and incompatible style, in the quickman source as well as in the 64-bit version of KMB.

Hope this is enough info to get you started.
Post 17 Nov 2007, 21:28
kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
Quote:
Also, SSE doesn't have trig functions, so you'll need to handle that stuff differently: lookup tables, taylor series, ... - but some of the clever heads can tell you about that


That can be looked up easily, I'm sure. In my search for an explanation I've come across numerous documents about it, but nothing that really explained (at least from my skimming) why it's faster or how exactly to use it. I would imagine the idea is to re-create the functions from scratch.

Quote:
Just use a structure of four floats and use compiler directives to ensure the data is aligned to 16 byte boundary. The IEEE floats of Intel (FPU and SSE/2, see Vol.1 of manual) and C/C++ are the same.


So you're saying the ever-mentioned "packed" format is nothing more than a bunch of numbers stuck together like some kind of array (assuming it just goes by a different name for this usage)? From past posts in which I asked about packed, I was left with the idea that packed was one of three possible types (packed, unpacked, and traditional).

Example of my interpretation (format: traditional - packed - unpacked), with spaces separating bytes:

5844562 - 5 84 45 62 - 5 8 4 4 5 6 2

I'm not sure what gave me that interpretation, but after coming to that conclusion, I decided packed numbers weren't worth dealing with. Perhaps that sheds some light on the confusion.
Post 18 Nov 2007, 01:10
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
Packed, when talking about SIMD instructions, does indeed simply mean "stuffed together in an array", possibly with some alignment requirements as well (i.e., 16-byte alignment or a crash).

You might be thinking about BCD numbers?
Post 18 Nov 2007, 01:14
kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
Therefore, for a figurative example: packed would be making an array in the structure, to deal with many values at once where the same operation is used, while unpacked would be making an array of the structure, where things using common operations are kept separate for ease of organization?

Therefor, if i wanted to use these operations, i would do:

xpos1 dq 0f
xpos2 dq 0f
ypos1 dq 0f
ypos2 dq 0f

instead of

xpos1 dq 0f
ypos1 dq 0f
xpos2 dq 0f
ypos2 dq 0f

?

Or is it considered packed as long as i don't have something like

xpos dd 0f
ypos dd 0f
angle db 0f
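The two layouts above can be sketched as C structs (names are mine, and floats are used for simplicity): in the first ("structure of arrays") all four x's sit in one 16-byte run, so a single movaps can grab them for one parallel add; in the second ("array of structures") x and y are interleaved, so a 16-byte load grabs a mix of components.

```c
/* Structure of arrays: xpos1..4 adjacent, then ypos1..4 adjacent.
   One 16-byte load fetches four x's ready for a packed add. */
struct soa { float x[4]; float y[4]; };

/* Array of structures: x and y interleaved per point. A 16-byte
   load here fetches x1, y1, x2, y2 - a mix of components. */
struct aos { struct { float x, y; } p[4]; };
```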
Post 18 Nov 2007, 01:24
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
kohlrak: do you know vector/matrix math? That is the area where MMX/SSE is utilized.

And yeah, the word "packed" in this context just means "stuffed together", as f0dder said; don't speculate too much about it.
Post 18 Nov 2007, 01:46
kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
To be honest, they haven't taught me that stuff in math yet (or maybe they never will, since our math classes are so primitive). At this point I just want to see an example of packed and unpacked, or at least of how I can utilize packed in my code. In other words, do you happen to have a simple example handy (perhaps one that does a simple add or subtract using SIMD and outputs with printf)? Though I will admit I've heard of matrix math before. I heard it was the solution to an ongoing problem of mine with a 3D engine I'd like to work on. Now that I have an idea, I'll try to start messing with it.

I thank all of those who have contributed to this post.
Post 18 Nov 2007, 01:56
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
I think you would need some background on vector/matrix math to utilize MMX/SSE.

A typical vector consisting of four floats, to be used with SSE:
Code:
x dd 1.0
y dd 0.0
z dd -1.0
t dd 1.0
    


I never understood what the 4th number is for; my math knowledge is quite bad too. Hope I didn't screw anything up in this post.
Post 18 Nov 2007, 02:04
kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
Well, for future reference, I quickly wrote a crappy example that shows how to use it (thanks to some messing around, and using your floats as a basis, I get it now).

Code:
format PE console
entry main
include 'win32ax.inc'

macro outfloat out {
         finit
         fld [out]
         fstp [output]
         cinvoke printf, <"%f",10,0>, DWORD [output], DWORD [output+4]
}


section '.code' readable writeable executable
x dd 1.0f
y dd 0.0f
z dd -1.0f
t dd 1.0f
output dq 0f
main:
movaps xmm0, DQWORD [x]
movaps  DQWORD [x], xmm0 ;Just to show that it outputs fine when un-modified.
outfloat x
outfloat y
outfloat z
outfloat t
addps xmm0, xmm0 ;Simple addition
movaps  DQWORD [x], xmm0
outfloat x
outfloat y
outfloat z
outfloat t
cinvoke system, <"PAUSE", 0> ;To hold screen visible.
invoke ExitProcess, 0

section '.idata' readable import data

library kernel, 'kernel32.dll',\
        crtdll, 'crtdll.dll'

import kernel, ExitProcess, "ExitProcess"

import crtdll, printf, 'printf',\
               system, 'system'    


Though now I wonder what movups does differently, or whether it's simply for things like taking one dword and filling the remaining lanes with 0s, or what...
Post 18 Nov 2007, 02:42
View user's profile Send private message Visit poster's website AIM Address Yahoo Messenger MSN Messenger Reply with quote
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
You're welcome.

I believe movups/movaps is about whether the data is aligned (on 16 bytes) or not.

By the way, soon there will be FASMLIB 0.8.0 to print floating point numbers; no more libc.
Post 18 Nov 2007, 02:59
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
Quote:

i believe movups/movaps is about whether data is aligned (on 16 bytes) or not.

Yep, but note that this is not just a hint: using movaps on data that is not actually 16-byte aligned causes a GP fault even in real mode (regardless of whether alignment checking is enabled).
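A hedged C sketch of the difference via intrinsics (the function name is mine): _mm_loadu_ps (movups) accepts any address, whereas swapping in _mm_load_ps (movaps) would raise the GP fault just described whenever the pointer isn't 16-byte aligned.

```c
#include <xmmintrin.h>

/* Hypothetical helper: sum four floats at an arbitrary, possibly
   unaligned address. movups tolerates any address; movaps does not. */
float sum4(const float *p)
{
    __m128 v  = _mm_loadu_ps(p);          /* movups: unaligned-safe */
    __m128 hi = _mm_movehl_ps(v, v);      /* bring z,t into low lanes */
    __m128 s  = _mm_add_ps(v, hi);        /* x+z, y+t */
    __m128 s2 = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1)); /* total */
    float r;
    _mm_store_ss(&r, s2);
    return r;
}
```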
Post 18 Nov 2007, 03:13
bitRAKE



Joined: 21 Jul 2003
Posts: 3055
Location: vpcmipstrm
Wow! That is an incredibly useful resource.
Post 19 Nov 2007, 07:09
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
MMX/SSE works in parallel. The 'packed' data simply means "stuffed into an array". The idea in MMX/SSE is to process multiple things at once.

E.g. with the 'normal' way, you would do something like:

add ax, 1
add bx, 2
add cx, 3
add dx, 4

This means you do 4 additions to 4 variables (registers). Note that all of these are independent of each other, so you could do them in parallel. In the 'MMX' way, it can look like this:

paddw mm0, [data]

where '[data]' is a memory location holding 64-bit data: the four constants "stuffed together" (needed because, unfortunately, MMX doesn't support immediate constants).

The value at 'data' can look like this:

1 + (2 shl 16) + (3 shl 32) + (4 shl 48) = 0x0004000300020001

This assumes, obviously, that you have your 'variables' set up in mm0 to match your 'data', e.g.:

mm0 represented like:

Code:
bits:
64......48......32......16.....0
--------------------------------
|  dx  ||  cx  ||  bx  ||  ax  |    

Obviously 'ax', 'bx', 'cx' and 'dx' are only used here to distinguish the variable names; they aren't actually registers, just the respective data we added with the 'normal' way before.

As you can see, we need to put the data into a single register (something like an array), and the instruction then operates on each of the components in parallel. paddw operates on words (16 bits), paddd on dwords, paddb on bytes, etc.

SSE works on the same basic principle (the instructions are different, obviously), but it also handles floats (which you were looking for) as well as integers.

Hope this helps.
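The paddw example above, redone as a hedged C sketch using the SSE2 intrinsic equivalent (the function name is mine): the SSE2 form of paddw adds word lanes in parallel, and here only the low four lanes (64 bits) are used, matching the MMX example.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>

/* Hypothetical helper: add four 16-bit deltas to four 16-bit values
   with one paddw-style instruction instead of four scalar adds. */
void addw4(uint16_t v[4], const uint16_t delta[4])
{
    __m128i a = _mm_loadl_epi64((const __m128i *)v);      /* low 64 bits */
    __m128i d = _mm_loadl_epi64((const __m128i *)delta);
    _mm_storel_epi64((__m128i *)v, _mm_add_epi16(a, d));  /* paddw */
}
```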
Post 24 Nov 2007, 21:51

Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.
