flat assembler
Message board for the users of flat assembler.

Intel released SSE4 documentation
Hunter



Joined: 07 Jun 2006
Posts: 41
Hunter
Intel released SSE4 documentation:
http://download.intel.com/design/processor/manuals/D91561.pdf

Is support for it planned in FASM?
Post 25 Sep 2007, 10:38
asmfan



Joined: 11 Aug 2006
Posts: 392
Location: Russian
asmfan
Hunter wrote:
Intel released SSE4 documentation

This is already yesterday's news... because AMD has released SSE5 :)
AMD64 Architecture Tech Docs
AMD64 Technology 128-Bit SSE5 Instruction Set

_________________
Any offers?
Post 25 Sep 2007, 12:25
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 7752
Location: Kraków, Poland
Tomasz Grysztar
Seems like a lot of fun to add DREX support to fasm. :)
Post 25 Sep 2007, 12:44
MazeGen



Joined: 06 Oct 2003
Posts: 977
Location: Czechoslovakia
MazeGen
Yeah, it seems that AMD has a taste for changing the instruction encoding, which could kill SSE5 as a whole :P
Post 25 Sep 2007, 13:09
tom tobias



Joined: 09 Sep 2003
Posts: 1320
Location: usa
tom tobias
Tomasz wrote:
Seems like a lot of fun to add the DREX support into fasm

http://amdzone.com/index.php?name=PNphpBB2&file=viewtopic&t=11125&postdays=0&postorder=asc&start=60
from that reference:
AMD wrote:

SSE5 introduces the DREX prefix byte which is similar to the REX prefix byte that AMD introduced with AMD64. Note that none of the SSE's including SSE4 introduce any prefix bytes.

SSE5 also adds genuine 3-way instructions with two sources and an independent destination. None of the SSE's including SSE4 have 3-way instructions.

SSE5 has conditional moves. Again, no conditional moves in SSE including SSE4.

SSE5 is a true ISA extension that matches a lot of the power of advanced instruction sequences like what is available on Itanium or what was available with Altivec on the G5 Macs.
Post 25 Sep 2007, 17:29
peter



Joined: 09 May 2006
Posts: 63
peter
Has anybody tried to learn how to program for SSE4? I tried to implement strcmp with SSE4.2 instructions:

http://smallcode.weblogs.us/2007/11/23/strcmp-and-strlen-using-sse-42-instructions/

The problem is that you cannot test it, because there are no processors with SSE4 support yet :)
Post 23 Nov 2007, 14:51
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
http://softwarecommunity.intel.com/articles/eng/1193.htm
Quote:
SDK for 45nm Next Generation Intel® Core™2 Processor Family and Intel® SSE4 (Penryn SDK): A collection of documentation and tools for developing software for Penryn and Intel SSE4. The Penryn SDK includes presentations and whitepapers on Penryn and Intel SSE4, and an Intel SSE4 emulator that allows you to start developing applications with Intel SSE4 today!
Post 23 Nov 2007, 16:19
peter



Joined: 09 May 2006
Posts: 63
peter
LocoDelAssembly, thanks for the link.
Post 23 Nov 2007, 23:50
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
SSE4 emulator? Sounds like blazing speed to me :P

Joking aside, such a thing could be useful if it can realistically simulate timings as well (i.e., you will of course get slow speed from the emulator, but it would help if it could show what speed you would clock at on real hardware...)
Post 24 Nov 2007, 00:01
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
But unfortunately it will not emulate at cycle level. In fact, the library just installs an invalid opcode exception handler to emulate SSE4 instructions. It is only useful for checking that your algorithm meets its specification; for timings you will have to wait for Penryn :)

PS: Actually, Penryn will implement only a subset of SSE4.
Post 24 Nov 2007, 01:12
peter



Joined: 09 May 2006
Posts: 63
peter
Unfortunately, the emulator does not support the string-processing instructions (SSE4.2). Only SSE4 is supported :(

Well, processors with SSE4.2 will be available some day. We should learn how to use them. It's not easy, as you can see from the Intel papers and my article, but the instructions look very tempting. Imagine CRC32 calculation or strlen in hardware 8)
Post 24 Nov 2007, 08:44
asmfan



Joined: 11 Aug 2006
Posts: 392
Location: Russian
asmfan
peter wrote:
Imagine CRC32 calculation or strlen in hardware 8)

I doubt they will be fast. It's RISC vs. CISC all over again.

_________________
Any offers?
Post 24 Nov 2007, 08:50
peter



Joined: 09 May 2006
Posts: 63
peter
There will be dedicated hardware for CRC32 and string manipulation, so it should be really fast. See this discussion.
Post 24 Nov 2007, 10:07
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
If I'm right, the CRC calculation it does is not the most conventional one: it also reflects the bits first.
Code:
CRC32 instruction for 64-bit source operand and 64-bit destination operand:
TEMP1[63-0] = BIT_REFLECT64 (SRC[63-0])
TEMP2[31-0] = BIT_REFLECT32 (DEST[31-0])
TEMP3[95-0] = TEMP1[63-0] << 32
TEMP4[95-0] = TEMP2[31-0] << 64
TEMP5[95-0] = TEMP3[95-0] XOR TEMP4[95-0]
TEMP6[31-0] = TEMP5[95-0] MOD2 11EDC6F41H
DEST[31-0] = BIT_REFLECT (TEMP6[31-0])
DEST[63-32] = 00000000H

CRC32 instruction for 32-bit source operand and 32-bit destination operand:
TEMP1[31-0] = BIT_REFLECT32 (SRC[31-0])
TEMP2[31-0] = BIT_REFLECT32 (DEST[31-0])
TEMP3[63-0] = TEMP1[31-0] << 32
TEMP4[63-0] = TEMP2[31-0] << 32
TEMP5[63-0] = TEMP3[63-0] XOR TEMP4[63-0]
TEMP6[31-0] = TEMP5[63-0] MOD2 11EDC6F41H
DEST[31-0] = BIT_REFLECT (TEMP6[31-0])

CRC32 instruction for 16-bit source operand and 32-bit destination operand:
TEMP1[15-0] = BIT_REFLECT16 (SRC[15-0])
TEMP2[31-0] = BIT_REFLECT32 (DEST[31-0])
TEMP3[47-0] = TEMP1[15-0] << 32
TEMP4[47-0] = TEMP2[31-0] << 16
TEMP5[47-0] = TEMP3[47-0] XOR TEMP4[47-0]
TEMP6[31-0] = TEMP5[47-0] MOD2 11EDC6F41H
DEST[31-0] = BIT_REFLECT (TEMP6[31-0])

CRC32 instruction for 8-bit source operand and 64-bit destination operand:
TEMP1[7-0] = BIT_REFLECT8(SRC[7-0])
TEMP2[31-0] = BIT_REFLECT32 (DEST[31-0])
TEMP3[39-0] = TEMP1[7-0] << 32
TEMP4[39-0] = TEMP2[31-0] << 8
TEMP5[39-0] = TEMP3[39-0] XOR TEMP4[39-0]
TEMP6[31-0] = TEMP5[39-0] MOD2 11EDC6F41H
DEST[31-0] = BIT_REFLECT (TEMP6[31-0])
DEST[63-32] = 00000000H

CRC32 instruction for 8-bit source operand and 32-bit destination operand:
TEMP1[7-0] = BIT_REFLECT8(SRC[7-0])
TEMP2[31-0] = BIT_REFLECT32 (DEST[31-0])
TEMP3[39-0] = TEMP1[7-0] << 32
TEMP4[39-0] = TEMP2[31-0] << 8
TEMP5[39-0] = TEMP3[39-0] XOR TEMP4[39-0]
TEMP6[31-0] = TEMP5[39-0] MOD2 11EDC6F41H
DEST[31-0] = BIT_REFLECT (TEMP6[31-0])

Notes:
BIT_REFLECT64: DST[63-0] = SRC[0-63]
BIT_REFLECT32: DST[31-0] = SRC[0-31]
BIT_REFLECT16: DST[15-0] = SRC[0-15]
BIT_REFLECT8: DST[7-0] = SRC[0-7]
MOD2: Remainder from Polynomial division modulus 2    
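
To make the pseudocode above concrete, here is a minimal Python sketch that follows it step by step. The helper names (bit_reflect, mod2, crc32_step, crc32c) are made up for illustration and do not come from the Intel document. The initial value FFFFFFFF and the final inversion in crc32c are the usual CRC-32C convention wrapped around the instruction, not something the instruction does by itself. The sketch models the arithmetic only and says nothing about speed.

```python
POLY = 0x11EDC6F41  # 33-bit polynomial from the pseudocode (leading x^32 term included)


def bit_reflect(value, width):
    """Reverse the order of the low `width` bits of `value` (BIT_REFLECTn)."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def mod2(dividend, poly=POLY):
    """Remainder of polynomial division over GF(2) (the MOD2 operator)."""
    poly_bits = poly.bit_length()
    while dividend.bit_length() >= poly_bits:
        # Align the divisor with the highest set bit and subtract (XOR) it.
        dividend ^= poly << (dividend.bit_length() - poly_bits)
    return dividend


def crc32_step(dest, src, width):
    """One CRC32 step with a `width`-bit source (8, 16, 32 or 64)."""
    temp1 = bit_reflect(src, width)
    temp2 = bit_reflect(dest & 0xFFFFFFFF, 32)
    temp5 = (temp1 << 32) ^ (temp2 << width)  # TEMP3 XOR TEMP4
    temp6 = mod2(temp5)
    return bit_reflect(temp6, 32)


def crc32c(data):
    """Assumed wrapper: conventional CRC-32C over bytes, one 8-bit step each."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = crc32_step(crc, byte, 8)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    print(f"{crc32c(b'123456789'):08X}")
```

Run on the ASCII string "123456789", this prints E3069283, the standard CRC-32C check value. One 32-bit step over a little-endian dword gives the same result as four 8-bit steps over its bytes, which is what you would expect if the wider forms just process more data per instruction.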
Post 24 Nov 2007, 14:22
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
IIRC the PDF mentions things like software iSCSI as possible uses for the CRC32 instructions... so I hope/assume Intel hasn't fucked up this (very) CISCy instruction.
Post 24 Nov 2007, 14:31
peter



Joined: 09 May 2006
Posts: 63
peter
Bit reflection is used in all CRC32 algorithms, so it's not unusual.

What's actually bad is that they used the polynomial 11EDC6F41 from the iSCSI standard. Another polynomial (04C11DB7, reflected EDB88320) is used in almost all programs (WinZip, WinRAR, WinHex, etc.) and in the Ethernet standard.
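
A note on the notation, since the constants are written in different forms: 11EDC6F41 is 33 bits wide and includes the leading x^32 term, while 04C11DB7 is the 32-bit form with that term dropped (with it, the value would be 104C11DB7). A small Python sketch of the relation between the normal and reflected 32-bit forms follows; reflect32 is just an illustrative helper name.

```python
def reflect32(value):
    """Reverse the bit order of a 32-bit value."""
    return int(f"{value:032b}"[::-1], 2)


IEEE_NORMAL = 0x04C11DB7        # Ethernet / zip polynomial, x^32 term dropped
ISCSI_NORMAL = 0x11EDC6F41 & 0xFFFFFFFF  # iSCSI polynomial with x^32 dropped

if __name__ == "__main__":
    print(f"{reflect32(IEEE_NORMAL):08X}")   # EDB88320
    print(f"{reflect32(ISCSI_NORMAL):08X}")  # 82F63B78
```

So the reflected constant a table-driven software routine would use for the iSCSI polynomial is 82F63B78, in the same way EDB88320 is used for the Ethernet one.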


Last edited by peter on 25 Nov 2007, 02:34; edited 1 time in total
Post 25 Nov 2007, 02:11
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
peter wrote:
Bit reflection is used in all CRC32 algorithms, so it's not unusual.

What's actually bad is that they used polynomial 11EDC6F41 from iSCSI standard. Another polynomial (EDB88320, reflected 04C11DB7) is used in almost all programs (WinZip, WinRar, WinHex, etc).

I guess that shows what they intend the instructions for :)

Is the poly hardcoded in the CPU (along with LUTs), or can it be set programmatically?

_________________
carpe noctem
Post 25 Nov 2007, 02:16
peter



Joined: 09 May 2006
Posts: 63
peter
Post 25 Nov 2007, 02:35
Madis731



Joined: 25 Sep 2003
Posts: 2140
Location: Estonia
Madis731
I actually hoped that the poly could be set up somewhere, or at least passed as a DWORD parameter with the instruction, oh well :P
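
For comparison, this is roughly what a software routine with a selectable polynomial looks like: a minimal bitwise sketch in Python, not tuned for speed. The function and constant names are illustrative only, and the FFFFFFFF initial value with final inversion is the common convention assumed for both variants.

```python
import zlib

CRC32_IEEE_REFLECTED = 0xEDB88320  # Ethernet / zip polynomial, reflected
CRC32C_REFLECTED = 0x82F63B78      # iSCSI polynomial, reflected


def crc32_reflected(data, reflected_poly):
    """Bitwise reflected CRC-32 with the polynomial passed as a parameter."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold the polynomial in if that bit was set.
            crc = (crc >> 1) ^ (reflected_poly if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    msg = b"123456789"
    print(f"{crc32_reflected(msg, CRC32_IEEE_REFLECTED):08X}")  # CBF43926
    print(f"{crc32_reflected(msg, CRC32C_REFLECTED):08X}")      # E3069283
    # Cross-check the Ethernet/zip variant against zlib.
    print(crc32_reflected(msg, CRC32_IEEE_REFLECTED) == zlib.crc32(msg))
```

With the polynomial fixed in hardware, only the second variant maps onto the CRC32 instruction; the first still needs a software routine along these lines.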
Post 26 Nov 2007, 08:17
Copyright © 1999-2020, Tomasz Grysztar.