flat assembler
Message board for the users of flat assembler.

Sandy Bridge AVX

Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Just look at the speed. I no longer wonder why it's 21% faster clock-for-clock.
http://agner.org/optimize/instruction_tables.pdf
Page 81 shows PMIN/AVG/MIN/MAX/ABS/SIGN... with a throughput of 2 instructions per clock (0.5 CPI).
The median throughput of the XMM instruction subset is around 1 clock.
Micro- and macro-op fusion are better (more instructions are fused), and ports 2 and 3 are functionally equivalent, so better scheduling can be achieved.

Cannot wait for these fixed chips to start coming in so we can all test these 'babies' :)

I smell a new flamewar "MOV vs. XOR":
Agner Fog wrote:

Instructions that need no execution unit
The abovementioned special cases where registers are set to zero by instructions such as
XOR EAX,EAX are handled at the register rename/allocate stage without using any
execution unit. This makes the use of these zeroing instructions extremely efficient, with a
throughput of four zeroing instructions per clock cycle. The carry flag can be zeroed with
CLC in the same efficient way.


EDIT: Chapter 8 of "microarchitecture.pdf" is a very good read http://agner.org/optimize
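
A minimal sketch of what the quoted passage means in practice, assuming 64-bit code assembled with fasm. The byte counts are the standard encodings; how each instruction is scheduled on a given chip is for the quoted document to say, not this snippet.

```asm
; Sketch only: zeroing a register and clearing the carry flag.
use64
        xor     eax, eax        ; 2 bytes; the zeroing idiom from the quote; also modifies the flags
        mov     eax, 0          ; 5 bytes (B8 + imm32); leaves the flags untouched
        clc                     ; 1 byte; clears only the carry flag
```

In 64-bit mode both the XOR and the MOV form also clear the upper half of rax, so the visible difference between them is code size and the effect on the flags.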

_________________
My updated idol :D http://www.agner.org/optimize/
Post 08 Feb 2011, 07:27
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17249
Location: In your JS exploiting you and your system
Post 08 Feb 2011, 13:09
tom tobias



Joined: 09 Sep 2003
Posts: 1320
Location: usa
Madis731 wrote:
I smell a new flamewar "MOV vs. XOR":
Not really.

My argument was always about (a) readability, not speed of execution, and (b) saving project time (avoiding mistakes), rather than reducing executable latency.

In my opinion, if one goes back into the Intel archives, one will discover that already with the 386, it was faster to write XOR than MOV.

Maybe I err, on that point. I don't remember any more. I still prefer code which is readable, rather than obscure. I still think that writing XOR to clear a register is illogical, since there is but a single operand, and the whole point of Boolean operations is to interface two or more entities.

Accordingly, I will not argue with Agner Fog, or anyone else. If speed of execution is measurable, with the task at hand, then it should be possible to compare two versions of the same program, otherwise identical, with one version replacing all the XORs with MOV instructions.

In my opinion, a large fraction of humans engaged in such a test would fail to identify, in a double-blind testing scenario, which version presented MOV, and which XOR.

It may be possible to measure a latency difference between the two versions, with an electronic timer, but I doubt that ordinary folks on the FASM forum could distinguish the two iterations, one with MOV, one with XOR.

:)
tom
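
The two-version comparison described above could be sketched as a timing routine like the one below. Everything in it is an assumption made for illustration: the routine name time_zeroing, the iteration count, and the choice of r9d as the register being cleared. RDTSC is not a serializing instruction and the loop overhead will dominate, so treat any number it produces as a rough indication only.

```asm
; Hypothetical routine: returns the elapsed time-stamp-counter ticks in rax.
; Uses only rax, rcx, rdx, r8 and r9, which are volatile in the common 64-bit ABIs.
use64
time_zeroing:
        rdtsc                           ; edx:eax = time-stamp counter
        shl     rdx, 32
        or      rax, rdx                ; rax = full 64-bit start value
        mov     r8, rax                 ; keep the start value
        mov     ecx, 100000000          ; iteration count (arbitrary)
.again:
        xor     r9d, r9d                ; variant A; use "mov r9d, 0" here to build variant B
        dec     ecx
        jnz     .again
        rdtsc
        shl     rdx, 32
        or      rax, rdx                ; rax = full 64-bit end value
        sub     rax, r8                 ; rax = end - start
        ret
```

Building the routine twice, once per variant, and comparing the returned tick counts over several runs is the "electronic timer" version of the test.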
Post 09 Feb 2011, 01:46
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17249
Location: In your JS exploiting you and your system
Hi tom, welcome back.

Would you have less resistance to XOR reg,reg if Intel had simply made a new mnemonic for it (like "clear reg" or "zero reg"), with the CPU still internally using the XOR circuitry to do the job?

Intel already did this with xchg ax,ax: they made a new mnemonic for that and called it nop. I wonder how many flame wars Intel unwittingly suppressed by making that simple cosmetic change?

tom you are 100% correct that for most user interactive programs the time difference is not perceivable on human time scales. But for other tasks that involve lengthy computations things like the xor reg,reg are valuable. We need to look at the task at hand to decide if xor reg,reg is warranted in that particular situation. Anyhow, I know you already know this. I just post it here for others to consider.

[edit] changed opcode to the more correct term - mnemonic.


Last edited by revolution on 09 Feb 2011, 03:30; edited 1 time in total
Post 09 Feb 2011, 02:33
Xorpd!



Joined: 21 Dec 2006
Posts: 161
tom tobias wrote:
My argument was always about (a) readability, not speed of execution, and (b) saving project time (avoiding mistakes), rather than reducing executable latency.

Then I think XOR wins in this space:
Code:
; MOV
   mov rax, 0
   movq xmm0, rax
; XOR
   xorpd xmm0, xmm0    

To understand this sequence you have to know that movq zero-extends the source to fill all the bits of its destination, so I don't think the MOV version is as readable.
Post 09 Feb 2011, 03:15
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
How about not waiting for Intel to make another change, but starting from FASM and telling the story in macro language:
Code:
macro clr r
{
  xor r,r
}
clr rax
clr ebx
clr cx
clr dl
; other options:
; ctz - Clear To Zero
; rst - Reset
    
Post 09 Feb 2011, 12:51
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
Tom, if you want readability over speed, what the hell are you doing with assembly?

And, really, "xor reg, reg" is idiomatic x86 code - apart from absolute beginners, nobody has trouble with the "readability" of that. If they have after a few weeks of learning x86 assembly, they should be doing something else.
Post 13 Feb 2011, 20:03
Alphonso



Joined: 16 Jan 2007
Posts: 294
Umm, I always thought it was MOV that was harder to read...
Code:
        push 1234
        pop  eax
        mov  ebx,eax    ;WTF, 1234 is still in eax    
:lol:
Post 14 Feb 2011, 15:34
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17249
Location: In your JS exploiting you and your system
Maybe it is pop that is unreadable:
Code:
push 1234
pull eax ;ah, pull makes more sense than pop.    
But perhaps more explicitly we should have:
Code:
store 4bytes[esp-4],1234 : esp=esp-4
load eax,4bytes[esp] : esp=esp+4    
Post 14 Feb 2011, 15:44
edfed



Joined: 20 Feb 2006
Posts: 4237
Location: 2018
the best improvement that intel can make is to launch a RISC X86, on a single DIP16 MCU, programmable using ICprog and JDM programmer (RCD)

about clr, it is certain that it would be faster than xor if done with real circuits,
because xor is a TWO level logic gate, and clr is a direct binary assignment.

unfortunately, there is no free one-byte opcode.


Last edited by edfed on 14 Feb 2011, 16:12; edited 1 time in total
Post 14 Feb 2011, 16:04
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
edfed, has it occurred to you that "xor reg, thesamereg" just might be handled differently from arbitrary two-register xor?
Post 14 Feb 2011, 16:12
edfed



Joined: 20 Feb 2006
Posts: 4237
Location: 2018
no, i don't think it is handled differently, because i think there are many more important things to design in the IA architecture. but maybe it is the case, maybe they made some logic just to make xor r,r very fast...
what about "or r,r" and "and r,r", which just update the flags?
Post 14 Feb 2011, 16:15
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17249
Location: In your JS exploiting you and your system
edfed wrote:
no, i don't think it is handled differently, because i think there are many more important things to design in the IA architecture. but maybe it is the case, maybe they made some logic just to make xor r,r very fast...
Yes, "they" did. It is clearly stated in the Intel manuals that xor r1,r1 has dedicated circuitry.
edfed wrote:
what about "or r,r" and "and r,r", which just update the flags?
You should use test r1,r1 instead. OR uses a writeback resource for the register; TEST doesn't write the register back.
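
A small sketch of that substitution, assuming 32-bit code and a hypothetical label is_zero. Both instructions set ZF, SF and PF from the value in eax and clear CF and OF; the only architectural difference is that OR also writes its (unchanged) result to the destination.

```asm
; Sketch only: checking a register for zero without changing it.
use32
        test    eax, eax        ; flags from eax AND eax; the result is discarded
        jz      is_zero
        or      eax, eax        ; same flags, but eax is also written back
        jz      is_zero
is_zero:                        ; hypothetical branch target
```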
Post 15 Feb 2011, 00:13
Copyright © 1999-2020, Tomasz Grysztar.
