flat assembler
Message board for the users of flat assembler.

Intel released documentation for SSSE3 (Supplemental SSE3)

MazeGen
Joined: 06 Oct 2003
Posts: 975
Location: Czechoslovakia
http://developer.intel.com/design/Pentium4/documentation.htm#manuals

(revision 021)

Quote:
SUPPLEMENTAL STREAMING SIMD EXTENSIONS 3 (SSSE3) INSTRUCTIONS
SSSE3 provide 32 instructions (represented by 14 mnemonics) to accelerate computations on packed integers. These include:
• Twelve instructions that perform horizontal addition or subtraction operations.
• Six instructions that evaluate absolute values.
• Two instructions that perform multiply and add operations and speed up the evaluation of dot products.
• Two instructions that accelerate packed-integer multiply operations and produce integer values with scaling.
• Two instructions that perform a byte-wise, in-place shuffle according to the second shuffle control operand.
• Six instructions that negate packed integers in the destination operand if the signs of the corresponding element in the source operand is less than zero.
• Two instructions that align data from the composite of two operands.
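As a rough illustration of what some of the operations listed in the quote compute, here is a scalar Python sketch of five of them on 128-bit registers. The function names mirror the Intel mnemonics, but the list representation (lane 0 first) and the helper `wrap16` are assumptions made for this example; it is a reference model for reading the list above, not a description of how the hardware does it.

```python
# Scalar reference models of a few SSSE3 operations (128-bit forms).
# Registers are modelled as Python lists, element 0 being the lowest lane.

def wrap16(x):
    """Truncate to 16 bits and reinterpret the result as a signed word."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def phaddw(dst, src):
    """PHADDW: add adjacent pairs of words; sums from dst fill the low half."""
    both = dst + src                      # 8 + 8 signed words
    return [wrap16(both[i] + both[i + 1]) for i in range(0, 16, 2)]

def pabsw(src):
    """PABSW: absolute value of each signed word (result read as unsigned)."""
    return [abs(x) for x in src]

def psignw(dst, src):
    """PSIGNW: negate dst where src < 0, keep it where src > 0.

    The real instruction also zeroes the lane where src == 0.
    """
    return [wrap16(-d) if s < 0 else (0 if s == 0 else d)
            for d, s in zip(dst, src)]

def pshufb(dst, mask):
    """PSHUFB: pick bytes of dst by index; a set top bit in mask yields 0."""
    return [0 if m & 0x80 else dst[m & 0x0F] for m in mask]

def palignr(dst, src, imm8):
    """PALIGNR: shift the 32-byte value dst:src right by imm8 bytes,
    keeping the low 16 bytes."""
    shift = min(imm8, 32)                 # larger shifts give all zeros
    combined = src + dst + [0] * 16       # src is the low half; zeros shift in
    return combined[shift:shift + 16]
```

For instance, `pshufb(list(range(16)), list(range(15, -1, -1)))` reverses the byte order of a register in one step, which is the sort of job that otherwise takes a sequence of shifts, masks and ORs.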
Post 28 Sep 2006, 11:52
MazeGen
Since the manual mentions only Core Duo and Core Solo, it seems that SSSE3 has been present since Core 1.
Post 28 Sep 2006, 11:59
tom tobias
Joined: 09 Sep 2003
Posts: 1320
Location: usa
Quote:

Two instructions that perform multiply and add operations and speed up the evaluation of dot products.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 10, NO. 8, DECEMBER 2000
Packed Integer Wavelet Transform Constructed by Lifting Scheme
Chengjiang Lin, Bo Zhang, and Yuan F. Zheng [aka Lin,Zhang,Zheng--LZZ]

This is one of the few references I could find, to explain the purported advantages of using packed integers rather than unpacked, i.e. conventional, integers.
According to LZZ, if one needs only (for example) 16 bits to manipulate (in their situation, a graphical image), then, by inserting data from four graphical points into a memory location or register 64 bits wide, one could manipulate four graphical points with a single instruction, instead of requiring, in this example, four separate instructions to accomplish the same goal.
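The packing idea can be sketched in a few lines of Python: four 16-bit values share one 64-bit integer, and a single masked addition updates all four lanes at once. The names `pack`, `unpack` and `packed_add` are made up for this illustration and are not taken from the LZZ paper; the lane-wise wrap-around add is the same kind of operation that MMX PADDW performs in hardware.

```python
LANES = 4
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1
HIGH_BITS = 0x8000800080008000           # top bit of every 16-bit lane
LOW_BITS = 0x7FFF7FFF7FFF7FFF            # remaining 15 bits of every lane

def pack(values):
    """Pack four 16-bit values into one 64-bit integer, lane 0 lowest."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word

def unpack(word):
    """Split a 64-bit integer back into its four 16-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]

def packed_add(a, b):
    """Add all four lanes with one addition, wrapping inside each lane.

    The low 15 bits of each lane are added normally, so a carry can reach
    at most the lane's own top bit; the top bits are then combined with
    XOR, which discards any carry that would spill into the next lane.
    """
    low = (a & LOW_BITS) + (b & LOW_BITS)
    return low ^ ((a ^ b) & HIGH_BITS)

pixels = pack([100, 200, 300, 400])
offset = pack([1, 1, 1, 1])
print(unpack(packed_add(pixels, offset)))   # [101, 201, 301, 401]
```

The unpacked version needs four additions for the same result, which is the fourfold saving the paper's argument rests on.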
My question is this: Has anyone measured the time needed for the initial transform, i.e. with the four times slower method--UNPACKED integers, to learn how much time the cpu is sitting there doing nothing, EVEN WITH THE FOUR TIMES SLOWER method?
In my own research, nearly 15 years ago now, on cpu's LONG since obsolete, I found that the cpu was just sitting around 90% of the time, and actually doing something useful (in my research, a Fast Fourier Transform on 16 bit data, 256 points per frame) only 10% of the time. Yes, I can imagine performing this operation USING PACKED integers, instead of the plain vanilla integers I used, but then, so what? What do I achieve? The cpu today, already 1000 times faster than the 486 I used, would then be sitting there doing nothing 98.3% of the time with a single "packed" operand, instead of 97.6% of the time using four separate operands.
Is it really worth the obscurity, and the difficulty of debugging, to use these new instructions? What task, i.e. which REAL WORLD task, can now be implemented, by virtue of having these new instructions available, that COULD NOT HAVE BEEN performed last year, or last week, or last decade? What has been GAINED as a result of these new instructions?
To me, this is simply another illustration of the thirty year old debate between RISC and CISC. Naturally, Intel favors CISC (Complex Instruction Set Computing). I believe, very firmly, in RISC architecture (Reduced Instruction Set Computing), where the dot products, for example, are computed by the programmer in software (slow execution speed), not by the cpu in hardware (fast execution speed). I fail to understand which application, in the real world, will now be possible that heretofore could not have been accomplished without these new instructions.
:)
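The idle-time argument above comes down to a small piece of arithmetic, sketched here in Python. The model is an assumption made for illustration: the workload is fixed, and all of the busy time is sped up by the same factor. The function names are hypothetical.

```python
def idle_after_speedup(idle_before, speedup):
    """Idle fraction once the busy part of the time runs `speedup` times faster."""
    busy_before = 1.0 - idle_before
    return 1.0 - busy_before / speedup

def implied_speedup(idle_before, idle_after):
    """Speedup of the busy part implied by a change in idle fraction."""
    return (1.0 - idle_before) / (1.0 - idle_after)

# Starting from the 97.6% idle figure quoted in the post:
print(round(idle_after_speedup(0.976, 4), 3))   # 0.994
print(round(implied_speedup(0.976, 0.983), 2))  # 1.41
```

Under this model a full fourfold gain would move 97.6% idle to about 99.4%, while a move to 98.3% corresponds to a gain of roughly 1.4x. Either way the cpu spends most of its time waiting, which is the point being made.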
Post 28 Sep 2006, 12:43
LocoDelAssembly
Your code has a bug
Joined: 06 May 2005
Posts: 4633
Location: Argentina
Wikipedia says it has been present since Core 2 and the Xeon 5100 Series: http://en.wikipedia.org/wiki/SSSE3
Post 28 Sep 2006, 12:54
Tomasz Grysztar
Assembly Artist
Joined: 16 Jun 2003
Posts: 7718
Location: Kraków, Poland
Post 28 Sep 2006, 13:18
f0dder
Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
tom tobias: the point is, of course, to get the job done faster when you need it. Yes, a CPU will usually sit idle a lot of the time, but using SSE instructions means that when I click "do heavy calculations", the job is done perhaps twice as fast as without SSE (in the extreme cases).

As for RISC vs. CISC, even the traditional RISC architectures have become more CISCy (adding the same kind of instruction sets as SSE, see AltiVec), so... whatever :)
Post 28 Sep 2006, 13:22
rugxulo
Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
I think the extra instructions are just icing on the cake: use 'em if you really need to, but otherwise you kinda have to avoid 'em for compatibility anyways. Most people (e.g., me) only use a subset of the available instructions anyways. So, it doesn't matter much to me.

It's like GNU extensions: use 'em if you have to, but remember, it's not necessarily ANSI/POSIX/etc.

Perhaps NASA needs the extra computing power to find more dwarf planets. :)
Post 28 Sep 2006, 18:18
tom tobias
f0dder explained the rationale for including the new SSSE3 instructions:
Quote:
to get the job done faster when you need it ...
using SSE instructions means that ...
the job is done perhaps twice as fast as if not using SSE ...

rugxulo contributed:
Quote:
Perhaps NASA needs the extra computing power ...

Yes, 2x faster, or even 200x faster, but the point is, how much time did the cpu spend computing dot products the old fashioned way? Isn't it better to augment the instruction set FOR A REASON, rather than simply as a marketing tool to increase sales? Most of the dot products I have computed originate with mechanical and electrical engineering applications, none of which, in my experience, demand instantaneous solution.
I am asking here a very simple question, maybe too simple: Where is the need for these new instructions? Which task will we now be able to compute (thanks to these new SSSE3 instructions) that we were previously UNABLE to complete due to time limitations, i.e. because the cpu operated too slowly with the conventional instructions/software?
Ok, here's a variation on this theme: Let's go back two or three decades, and ask this question: Which programs of recent vintage employ "exotic" instructions, i.e. instructions introduced after the initial release of the 8086? Well, some of those instructions are ESSENTIAL for running contemporary software, for instance, instructions permitting access to protected mode. What about the FPU instructions, integrated onto the same silicon, commencing with the 486? Sure, we use those daily, and perhaps we may conclude, without data to support this conclusion, that an off chip location for the fpu would prevent some of these programs from operating in real time. I am not certain of this.
I found, comparing execution times of a 386 + off chip math coprocessor versus a 486 (on chip fpu), that the FFT computation on the 386 required, as f0dder suggested, TWICE as long. If the FFT had required, instead of 256 points, 1024 or 4096 points, perhaps the 386 + off chip fpu would have failed to complete the task in real time, and perhaps the 486, contrarily, would have continued to function in real time, thereby confirming that inclusion of these "new" instructions, the fpu on chip, offered MORE than a simple marketing advantage.
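The real-time question raised here can be put as a budget: a frame of N samples must be transformed before the next frame has finished arriving. The Python sketch below counts only radix-2 butterflies, (N/2)·log2(N) per transform, and ignores everything else. The sample rate and per-butterfly times are hypothetical numbers chosen so that one machine is twice as slow as the other; they are not measurements of a 386 or a 486.

```python
import math

def fft_load(n_points, sample_rate_hz, seconds_per_butterfly):
    """Fraction of each frame period spent on one radix-2 FFT of n_points."""
    frame_period = n_points / sample_rate_hz          # time to acquire a frame
    butterflies = (n_points // 2) * int(math.log2(n_points))
    return butterflies * seconds_per_butterfly / frame_period

def keeps_up(n_points, sample_rate_hz, seconds_per_butterfly):
    """True if the transform finishes within one frame period."""
    return fft_load(n_points, sample_rate_hz, seconds_per_butterfly) <= 1.0

SAMPLE_RATE = 10_000      # Hz, hypothetical
FAST = 10e-6              # seconds per butterfly, hypothetical
SLOW = 2 * FAST           # the "twice as long" machine

for n in (256, 1024, 4096):
    print(n, keeps_up(n, SAMPLE_RATE, FAST), keeps_up(n, SAMPLE_RATE, SLOW))
```

With these made-up figures the slower machine keeps up at 256 points but not at 4096, while the faster one keeps up at all three sizes. Because the load grows only with log2(N), a factor of two in speed decides the outcome only when the slower machine was already close to its limit.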
Maybe the same circumstance exists here with the new SSSE3 instructions, and I simply lack sufficient experience to recognize which applications demand faster dot product computation, else risk failure...
:?
Post 29 Sep 2006, 09:26


Copyright © 1999-2020, Tomasz Grysztar.
