flat assembler
Message board for the users of flat assembler.
I'm sure this was discussed a few times before but...
LocoDelAssembly 17 Nov 2007, 05:30
It's too late here but, aren't SSE scalars already compatible with the FPU? AFAIK, floats are supported by SSE and doubles by SSE2; there is no support for "long double", but how many APIs require and/or return that data type?
kohlrak 17 Nov 2007, 05:34
To be honest, I never tried. I automatically assumed that since it says it's designed for "packed numbers", chances were I'd just be wasting my time trying: if I, out of stupidity, assumed that non-packed numbers worked, it would hurt me in the long run. Or do the instructions that load the registers from memory automatically convert to packed format and back for you?
bitRAKE 17 Nov 2007, 06:21
They are only packed insofar as the data is consecutive in memory and some operations are performed in parallel on multiple operands - the single and double floats are the same data type. You can even use the scalar instructions to operate on just one float at a time, but parallel should be the goal for speed.
There are conversion instructions to/from integers, but the 32-bit and 64-bit floats are the same data type.
kohlrak 17 Nov 2007, 12:26
Quote: There are conversion instructions to/from integers, but the 32-bit and 64-bit floats are the same data type.
What I need is to be able to take a standard single-precision floating point number in the same format C uses, move it into SSE, and then convert it all back. The question is: will the conversion macro it appears I'll have to write (unless someone shows me where the pack and unpack floats instruction is, because I can't find it) give a result slower than just using the FPU instead?
bitRAKE 17 Nov 2007, 16:44
Just use a structure of four floats and use compiler directives to ensure the data is aligned to a 16-byte boundary. The IEEE floats of Intel (FPU and SSE/2, see Vol. 1 of the manual) and of C/C++ are the same.
I'd like to recommend Vol. 4 of Intel's manuals, as I don't code in C/C++ - I can read it, but don't bother to know more than that. Vol. 4 specifically addresses your question, and it's from the source.
f0dder 17 Nov 2007, 17:09
kohlrak: just using SSE isn't going to magically speed up your stuff; you need to change how you think about things a bit.
With regular x87 FPU code, you're used to operating on one number at a time. With SSE code, you work on a whole bunch of numbers at a time (the second S is for SIMD). So you need to change your processing methods. Also, SSE doesn't have trig functions, so you'll need to handle that stuff differently: lookup tables, Taylor series, ... - but some of the clever heads here can tell you about that. What I'm trying to say is that there isn't an "SSEMUL" which is magically faster than "FMUL"; you get the speed increase by reworking code, taking advantage of SIMD, alignment, etc.
Xorpd! 17 Nov 2007, 21:28
SSE gives you more register file locations to hold in-flight floating point values: 16 for double precision in 32-bit mode, 32 for single precision in 32-bit mode, 32 for double precision on x64, and 64 for single precision on x64, vs 8 for x87 at any precision. Also, SSE has a two-register ISA rather than the one-register ISA of x87. This means no fxch, but you do need a lot of movapd, because a two-register ISA is not as cool as three-register, which is supposed to be coming to x86 in a couple of years.

These characteristics make it possible to write faster code in SSE than x87. A concrete example may be seen in Kümmel's Mandelbrot Benchmark, which has source code with both an x87 and an SSE implementation. Even though a P4 can only issue 64 bits' worth of operations each to the floating point add and floating point multiply pipelines, SSE does better on this class of processor than x87, because one instruction stream of SSE code is like pipelining two instruction streams' worth of x87 code - not to mention that, according to Agner Fog's instruction tables, a P4 can only issue one SSE or x87 floating point operation per clock cycle, because they all go through port 1.

BTW, the above-mentioned manual is quite helpful in that it has all the SSE instructions laid out in a compact format, so you can see them and then look them up in a reference such as the Intel® 64 and IA-32 Architectures Software Developer's Manual Vol. 2A and Vol. 2B. You will also want to check out Intel's or Agner Fog's more general manuals, or maybe some material from AMD, which I don't have time to hunt down URLs for just now. It takes some time to get used to SSE, just because there are so many new instructions.

As general exercises in floating point optimization, you might attempt to add a second instruction stream to the x87 version of Kümmel's Mandelbrot Benchmark, or to the 32-bit SSE version. This has been done, in a different and incompatible style, in the quickman source as well as in the 64-bit version of KMB. Hope this is enough info to get you started.
kohlrak 18 Nov 2007, 01:10
Quote: Also, SSE doesn't have trig functions, so you'll need to handle that stuff differently: lookup tables, Taylor series, ... - but some of the clever heads can tell you about that
That can be looked up easily, I'm sure. In my search for an explanation I've come across numerous documents about it, but nothing that really explained (at least from my skimming) why it's faster or how exactly to use it. I would imagine the idea is to re-create the functions from scratch.
Quote: Just use a structure of four floats and use compiler directives to ensure the data is aligned to a 16-byte boundary. The IEEE floats of Intel (FPU and SSE/2, see Vol. 1 of the manual) and C/C++ are the same.
So you're saying the ever-mentioned "packed" format is nothing more than a bunch of numbers stuck together like some kind of array (only assuming it has a different name for different usage)? From past posts in which I asked about packed, I was left with the idea that packed was one of three possible types (packed, unpacked, and traditional). Example of my interpretation (format: Traditional - Packed - Unpacked, with spaces being separate bytes):
Code:
5844562 - 5 84 45 82 - 5 8 4 4 5 8 2
Not sure what gave me that interpretation, but after coming to that conclusion I decided packed numbers weren't worth dealing with. Perhaps that can shed some light on any tension.
f0dder 18 Nov 2007, 01:14
Packed, when talking about SIMD instructions, indeed simply means "stuffed together in an array", possibly with some alignment requirements as well (i.e., 16-byte alignment or crash).
You might be thinking of BCD numbers?
kohlrak 18 Nov 2007, 01:24
Therefore, for a figurative example: packed would be making an array in the structure to deal with many values at once where the same operation is used, while unpacked would be making an array of the structure where things using common operations are separate for ease of organization?
Therefore, if I wanted to use these operations, I would do:
Code:
xpos1 dq 0f
xpos2 dq 0f
ypos1 dq 0f
ypos2 dq 0f
instead of
Code:
xpos1 dq 0f
ypos1 dq 0f
xpos2 dq 0f
ypos2 dq 0f
? Or is it considered packed as long as I don't have something like
Code:
xpos  dd 0f
ypos  dd 0f
angle db 0f
vid 18 Nov 2007, 01:46
kohlrak: do you know vector / matrix math? That is the area where MMX / SSE is utilized.
And yeah, the word "packed" in this context just means "stuffed together", as f0dder said - don't speculate too much about it.
kohlrak 18 Nov 2007, 01:56
To be honest, they haven't taught me that stuff in math yet (or maybe they never will, since our math classes are so primitive). At this point I just want to see an example of packed and unpacked, or at least how I can utilize packed in my code. In other words, do you happen to have a simple example handy (perhaps one that does a simple add or subtract operation using SIMD and outputs with printf)? Though I will admit to having heard of matrix math before. I heard it was the solution to an ongoing problem of mine with a 3D engine I'd like to work on. Now that I have an idea, I'll try to start messing with it.
I thank all of those who have contributed to this post.
vid 18 Nov 2007, 02:04
I think you would need some background in vector / matrix math to utilize MMX / SSE.
A typical vector consisting of four floats, to be used with SSE:
Code:
x dd 1.0
y dd 0.0
z dd -1.0
t dd 1.0
I never understood what the 4th number is for; my math knowledge is quite bad too. Hope I didn't screw anything up in this post
kohlrak 18 Nov 2007, 02:42
Well, for future reference, I quickly wrote a crappy example that explains how to use it (thanks to some messing around, and using your floats as a basis, I get it now).
Code:
format PE console
entry main
include 'win32ax.inc'

macro outfloat out {
        finit
        fld     [out]
        fstp    [output]
        cinvoke printf, <"%f",10,0>, DWORD [output], DWORD [output+4]
}

section '.code' readable writeable executable

x      dd 1.0f
y      dd 0.0f
z      dd -1.0f
t      dd 1.0f
output dq 0f

main:
        movaps  xmm0, DQWORD [x]
        movaps  DQWORD [x], xmm0  ; Just to show that it outputs fine when un-modified.
        outfloat x
        outfloat y
        outfloat z
        outfloat t
        addps   xmm0, xmm0        ; Simple addition
        movaps  DQWORD [x], xmm0
        outfloat x
        outfloat y
        outfloat z
        outfloat t
        cinvoke system, <"PAUSE",0>  ; To hold screen visible.
        invoke  ExitProcess, 0

section '.idata' readable import data

library kernel, 'kernel32.dll',\
        crtdll, 'crtdll.dll'

import kernel, ExitProcess, "ExitProcess"
import crtdll, printf, 'printf',\
       system, 'system'
Though now I wonder what movups does differently, or if it's simply for taking things like one dword and filling up the space for the other dwords with 0s, or what...
vid 18 Nov 2007, 02:59
You're welcome.
I believe movups/movaps is about whether the data is aligned (on 16 bytes) or not. By the way, soon there will be FASMLIB 0.8.0 to print floating point numbers - no more libc
LocoDelAssembly 18 Nov 2007, 03:13
Quote:
Yep, but note that this is not just a kind of hint: using movaps on data that is not actually aligned causes a GP fault even in real mode (no matter whether alignment checking is enabled or not).
bitRAKE 19 Nov 2007, 07:09
Xorpd! wrote: Agner Fog's instruction tables
Borsuc 24 Nov 2007, 21:51
MMX/SSE works in parallel. The 'packed' data simply means "stuffed in an array". The idea in MMX/SSE is to take on multiple things at once.
E.g. the 'normal' way, you would do something like:
Code:
add ax, 1
add bx, 2
add cx, 3
add dx, 4
This means you do 4 additions to 4 variables (registers). Note that all of these are independent of each other, so you could do them in parallel. In the 'MMX' way, it can look like this:
Code:
paddw mm0, [data]
where '[data]' is a memory location with 64-bit data representing the four constants "stuffed together" (this is because, unfortunately, MMX can't support immediate constants). The value in 'data' can look something like this:
Code:
1 + (2 shl 16) + (3 shl 32) + (4 shl 48) = 0x0004000300020001
This means, obviously, that you had your 'variables' set up in mm0 according to your 'data', e.g. mm0 represented like:
Code:
bits: 64......48......32......16.....0
      --------------------------------
      |  dx  ||  cx  ||  bx  ||  ax  |
As you can see, we need to put that data into a single register (something like an array) and then operate on each of the components separately. paddw operates on words (16 bits) in parallel, paddd on dwords, paddb on bytes, etc...
SSE is the same basic principle (obviously the instructions are different), but also works for floats (which you were looking for) as well as integers.
Hope this helps
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.