learning mmx/sse/sse2

vivik

Joined: 29 Oct 2016
Posts: 671

vivik 01 Jan 2017, 05:47

Hello. I want to learn mmx/sse/sse2.

Please give me tutorials and code examples.

Also please tell me, which parts of mmx are outdated and replaced by sse? And which parts of sse are outdated and replaced by sse2? Just so I don't break my head over dead instructions.

I probably wouldn't go into sse3 and beyond yet, not in this thread at least. Feel free to share something interesting about it though, I hope it wouldn't confuse me further.


Xorpd!

Joined: 21 Dec 2006
Posts: 161

Xorpd! 01 Jan 2017, 17:23

The original MMX instruction set had only packed integer instructions in 64-bit mmx registers aliased to the FPU registers. These instructions were too lame to be generally useful. SSE introduced the 128-bit xmm registers, but only had single precision data types. SSE2 added double precision data and expanded the integer operations from MMX to the 128-bit xmm registers. So the dead instructions in the family come back as improved ones in succeeding generations. In AVX most floating-point instructions expand to 256-bit ymm registers (the integer ones follow in AVX2) and gain a 3-operand syntax.

Another advantage of the SSEx instructions is that in 64-bit code you get twice as many registers to work with. Are you working in 64 bits I hope? I assume Windows ... perhaps pre-Windows 10: the only cool feature of Windows 10 that I have seen is "Project to this PC" but I don't know whether it's worth the pain.

What is your math background? Have you experimented much with graphics in FASM as yet? You know, it's important in programs that produce graphical output to get the graphics part working first because debugging is easier that way. If something is going wrong you can tell just by looking at the picture that your program has rendered.

It might take a couple of iterations, in both floating-point and graphical capabilities, to get up to speed. There is a more intuitive serial approach to both, and also a massively parallel approach to both that can greatly increase throughput, but it's a big leap to get there from zero.

I don't have any tutorials and my only examples are in the long Mandelbrot thread and are really ugly complicated code. You can look things up in Intel's manuals and there is lots of useful information on Agner Fog's web pages.


rugxulo

Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)

rugxulo 01 Jan 2017, 17:26

I haven't done much SIMD programming, so I'm not the best person to respond. But I can give easy advice.

MMX (and AMD's 3DNow!) are totally obsolete and shouldn't be used for new code on new machines. They basically reuse FPU resources, and the FPU is deprecated in favor of SSE2 (which is mandatory on AMD64).

So bare minimum would be SSE2, which is since P4 (2000) or AMD64 (2003). I forget which Windows started requiring it, maybe 8? (Apparently yes. So that's since late 2012.) I think even MSVC targets it by default now, so a lot of software defaults to it as well.

As for SSE3 and beyond (SSSE3, SSE4.1/4.2), it's fairly common and might be useful. Of course, you can always use CPUID to check for availability before using such specialized instructions (if it even makes a noticeable difference).

Even though I don't understand AVX and it's still relatively new and underutilized, you may also wish to look into that. It's probably here to stay.
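For the CPUID check mentioned above, here is a minimal sketch in C. It assumes a GCC or Clang toolchain, whose `<cpuid.h>` header provides `__get_cpuid`; the function names `has_sse2` and `has_sse3` are made up for this example. The bit positions (leaf 1, EDX bit 26 for SSE2, ECX bit 0 for SSE3) are the ones listed in the Intel manual.

```c
#include <cpuid.h>  /* GCC/Clang helper header (assumed toolchain) */

/* Returns 1 if CPUID leaf 1 reports SSE2 (EDX bit 26), else 0. */
int has_sse2(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;               /* leaf 1 not available */
    return (int)((edx >> 26) & 1u);
}

/* Returns 1 if CPUID leaf 1 reports SSE3 (ECX bit 0), else 0. */
int has_sse3(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)(ecx & 1u);
}
```

A program would typically call these once at startup and pick a code path, rather than checking before every use.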


vivik

Joined: 29 Oct 2016
Posts: 671

vivik 02 Jan 2017, 07:24

ok, so for now I can forget MMX exists, great.

@Xorpd!
>Are you working in 64 bits I hope?
No, 32 for now. I want my program to work even on ancient stuff.

>What is your math background?
Well, I have a very vague idea of how 3d rendering works, and know a bit about vectors and matrix operations. Should be ok, but I need to refresh a lot.
Also, I don't know many math terms, especially English math terms.

>Have you experimented much with graphics in FASM as yet?
I'm about to start. Thanks for suggestion.

>my only examples are in the long Mandelbrot thread and are really ugly complicated code
which one, where can I find it?


Xorpd!

Joined: 21 Dec 2006
Posts: 161

Xorpd! 02 Jan 2017, 10:12

The Mandelbrot thread is https://board.flatassembler.net/topic.php?t=5122 .

Yeah, get started with some simple graphics and animation.

As far as math goes I was thinking more in terms of Euler angles and Hamiltonian mechanics.

I was kind of an early adopter of x64 a decade ago, but even at that time most processors sold were 64-bit capable, people just weren't running a 64-bit OS. Pre-64-bit hardware is kind of rare these days, and much of the late 32-bit hardware has died off due to the capacitor plague as well. Nowadays you can hit the dumpsters around campus housing at the end of a semester and find 64-bit computers for free, so I wouldn't consider poverty to be a limitation as far as 64-bit computing goes.

4K monitors are another story, but anyone who gets less than that for programming or gaming in 2017 is just an idiot. So I encourage you to go with x64 and 4K; by the time you get something useful out the door these will be old news.

But don't take my attitudes too harshly: I recall many people getting hooked on Farmville and that didn't require any highfalutin' math or fancy hardware, although it had the potential to look much cooler in 4K. Ditto for Angry Birds. What made the game more than anything was the imagination and concepts of its creators and that's what you should be cultivating and enabling as a first priority.


vivik

Joined: 29 Oct 2016
Posts: 671

vivik 02 Jan 2017, 12:07

Ok, that's an interesting point of view.

Now a question about SSE itself: What does it mean to pack, unpack and shuffle something?


revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20873
Location: In your JS exploiting you and your system

revolution 02 Jan 2017, 13:06

One SSE register holds more than one value. E.g. one 128-bit register can hold 16 bytes, 8 words, 4 dwords, or 2 qwords, so you "pack" them together into one 128-bit register.
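To make the three terms from the question concrete, here is a small sketch using SSE/SSE2 compiler intrinsics in C (the function names are invented for this example; the intrinsics map to `packssdw`, `punpcklbw` and `shufps`). Roughly: "pack" narrows elements so more of them fit in one register, "unpack" interleaves elements from two registers, and "shuffle" rearranges elements within a register.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in SSE) */
#include <stdint.h>

/* pack: 8 x int32 (two registers) -> 8 x int16 (one register),
   with signed saturation (packssdw). */
void pack_demo(const int32_t in[8], int16_t out[8]) {
    __m128i lo = _mm_loadu_si128((const __m128i *)in);
    __m128i hi = _mm_loadu_si128((const __m128i *)(in + 4));
    _mm_storeu_si128((__m128i *)out, _mm_packs_epi32(lo, hi));
}

/* unpack: interleave the low 8 bytes of a and b:
   a0 b0 a1 b1 ... a7 b7 (punpcklbw). */
void unpack_demo(const uint8_t a[16], const uint8_t b[16], uint8_t out[16]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_unpacklo_epi8(va, vb));
}

/* shuffle: reverse the order of 4 floats (shufps). */
void reverse_demo(const float in[4], float out[4]) {
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3)));
}
```

For instance, `pack_demo` turns 70000 into 32767 because the value does not fit in 16 bits and the instruction saturates instead of wrapping.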


vivik

Joined: 29 Oct 2016
Posts: 671

vivik 02 Jan 2017, 13:51

What's the difference between comiss and ucomiss?

quote from https://flatassembler.net/docs.php?article=manual#2.1.15

>comiss and ucomiss compare the single precision values and set the ZF, PF and CF flags to show the result. The destination operand must be a SSE register, the source operand can be a 32-bit memory location or SSE register.


revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20873
Location: In your JS exploiting you and your system

revolution 02 Jan 2017, 14:45

IIRC one is ordered and the other is not ordered.

Floating point values have some special encodings for SNaN, QNaN (often referred to simply as NaN) and infinity. NaNs can't be ordered in the normal numbering system (is SNaN greater than 7 or less than 7?) so there are the unordered compares for that. See the CPU docs if you need to know exactly what state the flags are in for each type of compare with unordered values.

BTW: Do you have the Intel or the AMD instruction set docs? If not, get them. They are free to download.
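As a small illustration of "unordered", here is a sketch in plain C (helper names are made up). With a NaN operand, `<`, `>` and `==` are all false, and C99's `isunordered` reports the pair as unordered. On x86 with SSE math, compilers typically implement such comparisons with `comiss`/`ucomiss`, though that is an assumption about the compiler and not something the C code guarantees. Per the Intel manual, both instructions set the flags the same way; `comiss` also signals an invalid-operation exception for a QNaN operand, while `ucomiss` does so only for an SNaN.

```c
#include <math.h>  /* isunordered, NAN (C99) */

/* Returns 1 if a and b cannot be ordered, i.e. at least one is NaN. */
int is_unordered(float a, float b) {
    return isunordered(a, b) ? 1 : 0;
}

/* Counts how many of a<b, a>b, a==b hold.
   Exactly one holds for ordinary numbers, none when a NaN is involved. */
int count_true_relations(float a, float b) {
    return (a < b) + (a > b) + (a == b);
}
```

So "is NaN greater than 7 or less than 7?" gets the answer "neither, and not equal either".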


vivik

Joined: 29 Oct 2016
Posts: 671

vivik 02 Jan 2017, 15:28

Why are there so many instructions that work on only one value instead of all 4? Is sse trying to partially replace the fpu?

About the intel manual: search gives me something with the number 64 in the title, I need something older than that. (And huge books are scary.)


Xorpd!

Joined: 21 Dec 2006
Posts: 161

Xorpd! 02 Jan 2017, 19:35

Yes, SSE is more or less trying to completely replace FPU.

Don't worry about the 64-bit stuff. Some integer instructions and instruction formats were removed in 64-bit mode to clear out space for new instructions, but for floating point code the only obvious difference is that you have twice as many registers available and twice as big integer registers in 64-bit mode compared to 32-bit mode. Thus Intel doesn't provide separate manuals for the 32-bit instruction set, just one manual with notes about the differences on a per-instruction basis.
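To show the "one value instead of all 4" distinction from the question, here is a short sketch with SSE intrinsics in C (function names invented for the example). `_mm_add_ps` corresponds to the packed `addps`, and `_mm_add_ss` to the scalar `addss`, which adds only the lowest element and passes the other three through from the first operand. The scalar forms are what lets SSE stand in for the FPU on ordinary one-at-a-time float math.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Packed add (addps): all four lanes are added. */
void add_packed(const float a[4], const float b[4], float out[4]) {
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* Scalar add (addss): only lane 0 is added,
   lanes 1..3 are copied unchanged from a. */
void add_scalar(const float a[4], const float b[4], float out[4]) {
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the packed version gives {11, 22, 33, 44} and the scalar version gives {11, 2, 3, 4}.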


vivik

Joined: 29 Oct 2016
Posts: 671

vivik 03 Jan 2017, 06:37

"mfence" and others are for multithreaded apps only?

"prefetch" and "prefetchw" are actually useful? If they are, then where?


vivik

Joined: 29 Oct 2016
Posts: 671

vivik 03 Jan 2017, 09:23

How do I benchmark my code?

With rdtsc?


rugxulo

Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)

rugxulo 03 Jan 2017, 22:40

A quick search on this forum for "RDTSCP" shows less than a page's worth of links, so that's probably your best bet. N.B. IIRC, that is slightly better ("serializing variant", according to Wikipedia) but requires 2010-ish CPUs or newer.

Also, here are some random links from a quick Google search:

http://www.felixcloutier.com/x86/RDTSCP.html
https://unix4lyfe.org/benchmarking/


vivik

Joined: 29 Oct 2016
Posts: 671

vivik 04 Jan 2017, 08:49

Oh wow, another instruction for that. And it's not even mentioned in fasm programmer's manual. Interesting, thanks.

hmm...

https://en.wikipedia.org/wiki/Time_Stamp_Counter

>Starting with the Pentium Pro, Intel processors have practiced out-of-order execution, where instructions are not necessarily performed in the order they appear in the program. This can cause the processor to execute RDTSC later than a simple program expects, producing a misleading cycle count.[4] The programmer can solve this problem by inserting a serializing instruction, such as CPUID, to force every preceding instruction to complete before allowing the program to continue, or by using the RDTSCP instruction, which is a serializing variant of the RDTSC instruction.

So I can't precisely count execution time of a single instruction... That's actually pretty important. I'll probably count execution time of a large pack of functions for now, that should be precise enough for that.
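The "time a larger chunk" idea could be sketched like this in C. It assumes GCC or Clang on x86-64, where `<x86intrin.h>` provides `__rdtsc` and `_mm_lfence`; `min_cycles` and `bump_counter` are hypothetical names for this example. The `lfence` instructions around `rdtsc` are one common way to keep the timestamp reads from drifting across the measured code, and taking the minimum over several runs filters out interrupts and cache warm-up.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, _mm_lfence (GCC/Clang, assumed) */

typedef void (*work_fn)(void *);

/* Runs fn(arg) `repeats` times and returns the smallest TSC delta seen. */
uint64_t min_cycles(work_fn fn, void *arg, int repeats) {
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < repeats; i++) {
        _mm_lfence();                 /* wait for earlier instructions */
        uint64_t start = __rdtsc();
        _mm_lfence();                 /* don't start the work early */
        fn(arg);
        _mm_lfence();                 /* wait for the work to finish */
        uint64_t stop = __rdtsc();
        if (stop - start < best)
            best = stop - start;
    }
    return best;
}

/* Trivial sample workload: increments the int that arg points to. */
void bump_counter(void *arg) {
    ++*(int *)arg;
}
```

Usage would be something like `min_cycles(bump_counter, &calls, 100)`. Note that on current CPUs the TSC ticks at a fixed reference rate, so the result is in reference ticks, not necessarily core clock cycles.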


revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20873
Location: In your JS exploiting you and your system

revolution 04 Jan 2017, 09:34

vivik: Relying on external resources like Wikipedia and the fasm manual is an incomplete (and possibly error prone) way to discover the instructions and their actions. Did you get the real manuals from the real sources Intel and/or AMD?


vivik

Joined: 29 Oct 2016
Posts: 671

vivik 04 Jan 2017, 11:04

yes

but I'm terrible at reading documentation. It feels like reading it all will take me 2 months or so. I'd like to start practicing right now, and slowly get down to details.

I will read official docs, in time.


Xorpd!

Joined: 21 Dec 2006
Posts: 161

Xorpd! 04 Jan 2017, 11:41

Definitely start practicing right away. There is just no way to make sense out of computer documentation without running test programs on your own to make sure your interpretation of what you have read is correct.

But get all of Agner Fog's 5 PDFs. You wouldn't be talking about 'timing a single instruction' if you had read some of his stuff. An instruction has latency (the time it takes before the results of the instruction are available to further instructions), throughput (the number of copies of an instruction that can be issued in the same clock cycle), penalties (for cache misses, branch misprediction) and resource usage (ports, write buffers), so the cost of a single instruction depends on the context in which it is issued.

An expensive-looking instruction can end up being free (like floating point multiplications in a fast Fourier transform) and an innocuous-looking instruction can turn out to be really expensive (like a read of uncached data).
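A rough way to see the latency versus throughput distinction is to sum an array two ways, sketched here in plain C (function names are made up). In the first loop every add has to wait for the previous one, so the loop is limited by add latency. In the second, four independent running sums let the CPU overlap the adds, so it is limited more by throughput. Whether and how much faster the second is depends on the CPU and compiler, so no timings are claimed here.

```c
#include <stddef.h>

/* One dependency chain: each add waits for the previous result. */
float sum_one_chain(const float *v, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Four independent chains: adds from different chains can overlap. */
float sum_four_chains(const float *v, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++)   /* leftover elements */
        s0 += v[i];
    return (s0 + s1) + (s2 + s3);
}
```

The four sums are also exactly what a single `addps` on an xmm register does in one instruction. Because float addition is not associative, the two functions can differ in the last bits for general input; they agree exactly for small integer values.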


vivik

Joined: 29 Oct 2016
Posts: 671

vivik 05 Jan 2017, 08:28

What do you think, is this code a good place to use sse?
How would you approach writing the sse code here?

Code:

    int char_width = 9;//--const
    int char_height = 16;//--const

    int whatchar_y = (int)c / 32;
    int whatchar_x = (int)c % 32;

    float f_what_lr = char_width / 512.0; //--const
    float what_left = whatchar_x * f_what_lr;
    float what_right = what_left + f_what_lr;

    float f_what_tb = char_height / 512.0;//--const
    float what_top = whatchar_y * f_what_tb;
    what_top = 1.0 - what_top; //dat flip :p
    float what_bottom = what_top - f_what_tb; //notice minus here

    float f_where_lr = f_what_lr * 2.0;//--const
    float where_left = -1.0 + wherechar_x * f_where_lr;
    float where_right = where_left + f_where_lr;

    float f_where_tb = f_what_tb * 2.0;//--const
    float where_top = 1.0 - wherechar_y * f_where_tb; //opposite sign
    float where_bottom = where_top - f_where_tb;
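One possible approach to the "what" half of this code, as a sketch only: the four outputs are each "integer times constant plus constant", so they can be computed with one packed multiply and one packed add. The function name `glyph_uv_sse` and the output layout {left, right, top, bottom} are assumptions for this example, and it rewrites `what_left + f_what_lr` as `(x + 1) * f_what_lr`, which gives the same result here because 9/512 and 16/512 are exact in binary. For a single character the gain over scalar code is probably negligible; it would matter more when filling a whole buffer of characters.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Computes {what_left, what_right, what_top, what_bottom} for character c,
   using the same constants as the scalar code above (9x16 glyphs, 512 atlas). */
void glyph_uv_sse(int c, float out[4]) {
    const float lr = 9.0f / 512.0f;    /* f_what_lr */
    const float tb = 16.0f / 512.0f;   /* f_what_tb */
    float x = (float)(c % 32);         /* whatchar_x */
    float y = (float)(c / 32);         /* whatchar_y */

    /* _mm_set_ps takes lanes high-to-low: lanes 0..3 = x, x+1, y, y+1 */
    __m128 cell   = _mm_set_ps(y + 1.0f, y, x + 1.0f, x);
    __m128 scale  = _mm_set_ps(-tb, -tb, lr, lr);
    __m128 offset = _mm_set_ps(1.0f, 1.0f, 0.0f, 0.0f);

    /* left = x*lr, right = (x+1)*lr, top = 1 - y*tb, bottom = 1 - (y+1)*tb */
    _mm_storeu_ps(out, _mm_add_ps(_mm_mul_ps(cell, scale), offset));
}
```

The "where" half has the same shape (scale 2*lr and 2*tb, offsets -1 and 1), so it could reuse the same multiply-add pattern with different constant vectors.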

