flat assembler
Message board for the users of flat assembler.

Why is CPU design hard? And MMU plus other random ramblings

Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Madis731
I was really into this different-opcode-same-function thing, and I guess I gained from it, but Intel now nullifies the point of separate unaligned instructions (or aligned, whichever way you think of it).
Anandtech link
Anandtech wrote:

With Nehalem, Intel has not only reduced the performance penalty of the unaligned op but also made it so that if you use the unaligned op on aligned data, there’s absolutely no performance degradation.


Why on earth didn't they start with it???
Code:
movdq   xmm0,[400001h] ;=> to decoding
mov     tmp0,400001h
and     tmp1,tmp0,~15 ; 400001h and ~15 = 400000h
cmpe    tmp1,tmp0 ; tmp1 != tmp0, oops, it wasn't aligned; 2 uops lost to and and cmpe
; => to execute A) loadea or B) loadneu
loadea  xmm0,@tmp0 ;if aligned
loadneu xmm0,@tmp0 ;if not...
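
The alignment test that the pseudo-uops above perform can also be written out in software. What follows is only a minimal sketch, assuming fasm syntax and a Linux x86-64 target; the label `buf` and the exit-status convention are made up for illustration and do not come from the post.

```asm
; Minimal sketch, assuming fasm and Linux x86-64.
; 'buf' and the exit-status convention are hypothetical.
format ELF64 executable 3
entry start

segment readable executable
start:
        mov     esi,buf+1       ; deliberately misaligned address
        xor     edi,edi         ; exit status 0 = aligned path taken
        test    esi,15          ; low 4 address bits clear => 16-byte aligned
        jnz     .unaligned
        movdqa  xmm0,[rsi]      ; aligned load (would fault on buf+1)
        jmp     .exit
.unaligned:
        movdqu  xmm0,[rsi]      ; unaligned load, valid at any address
        mov     edi,1           ; exit status 1 = unaligned path taken
.exit:
        mov     eax,60          ; sys_exit
        syscall

segment readable writeable
align 16
buf     dq 0,0,0,0              ; 32 bytes, 16-byte aligned
```

Because `buf+1` has a nonzero low nibble, the program takes the `movdqu` path and exits with status 1. A CPU that handles alignment internally effectively does this test for free, which is the point being made about Nehalem.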
    
Post 23 Aug 2008, 22:49
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17247
Location: In your JS exploiting you and your system
revolution
Madis731 wrote:
Why on earth didn't they start with it???
I guess these things are a lot more complicated than they appear to the lay-cpu-programmer.
Post 24 Aug 2008, 00:03
MCD



Joined: 21 Aug 2004
Posts: 604
Location: Germany
MCD
revolution wrote:
I guess these things are a lot more complicated than they appear to the lay-cpu-programmer.

Of course, but my guess is that they make things more complicated than needed, and then they are stuck because they need to keep backward compatibility. And things get even funnier if a CPU developer changes its development policy. From what I've heard, Intel has replaced its entire core development teams multiple times in its history. It's no wonder the x86 looks like patchwork.

_________________
MCD - the inevitable return of the Mad Computer Doggy

-||__/
.|+-~
.|| ||
Post 24 Aug 2008, 05:26
asmfan



Joined: 11 Aug 2006
Posts: 392
Location: Russian
asmfan
Madis731, I believe lddqu performs several split cache line loads if the address is unaligned.
Post 24 Aug 2008, 08:07
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Madis731
They talk about MOVDQU here and cache boundary penalties. AMD pulled it off back in Nov 2007... it's really about "making it" rather than "trying to make it work":
Intel Software Community wrote:

However, when testing the same operations on AMD processors, we noticed there was no measurable penalty at all for unaligned loads across cache lines, and furthermore, the page line penalty was a mere 5-10%.

the link


Okay, another one to think about:
Code:
;mem = 1234567h; just so we're on the same page, it's NOT aligned
mov eax,[mem]
mov ebx,[mem+4]
mov ecx,[mem+8]
mov edx,[mem+12]
;By this time Core 2 has spent 4 cycles (but sometimes my tests show 8).
    

...now let's change this to an unaligned xmm move
Code:
;mem = 1234567h
movdqu xmm0,[mem]
;Surprise - this also takes 4 clocks ??!?
    

While I'm watching Agner's manuals closely, I see the 4 uops going to ports 0/1/5, which it seems are 4 unaligned dword moves. LDDQU uses 2 moves, one of which is aligned (simply mem&~15 with lower part cut off). Seems that movdqu isn't much different from mov reg,mem only taking less opcodes.

But wait, I've seen this 2 uop move somewhere ^o) :
Code:
;mem = 1234567h
mov eax,[mem]
mov ebx,[mem+8]
;By this time Core 2 has spent only 2! cycles
    

Okay, to conclude: is MOVDQU ~ 4x MOV REG32,MEM32 in the same way that LDDQU ~ 2x MOV REG64,MEM64?
Post 24 Aug 2008, 09:32
lazer1



Joined: 24 Jan 2006
Posts: 185
lazer1
mattst88 wrote:
I have wondered this too.

lazer1 wrote:
SSE does this all the time having different instructions which
only differ at the interpretative level, IDENTICAL at the h/w level.


How do you know they are the same at the hardware level?


what I meant is that the h/w specification is IDENTICAL,

the only difference is in the mind of the programmer


in fact I use them when they could contain integers where
neither meaning is valid.

add is a much better opcode as it adds both unsigned and signed,
which are identical because the condition codes are well designed.

Quote:


add x, y
ja .positive ; for unsigned add

add x,y
jg .positive ; for signed add
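
The claim that a single `add` serves both interpretations can be checked by sampling the carry flag (unsigned overflow) and the overflow flag (signed overflow) after the same instruction. This is a minimal sketch, assuming fasm syntax and a Linux x86-64 target; packing the flags into the exit status is just an illustrative convention, not something from the post.

```asm
; Minimal sketch, assuming fasm and Linux x86-64.
; One ADD sets CF (unsigned view) and OF (signed view) at the same time.
format ELF64 executable 3
entry start

segment readable executable
start:
        mov     al,0FFh         ; 255 unsigned, -1 signed
        add     al,1            ; 8-bit result is 0 in both views
        setc    bl              ; bl = 1: unsigned 255+1 wrapped around
        seto    bh              ; bh = 0: signed -1+1 = 0 fits
        mov     al,7Fh          ; 127 in both views
        add     al,1            ; 80h: 128 unsigned, -128 signed
        setc    cl              ; cl = 0: unsigned 127+1 fits
        seto    ch              ; ch = 1: signed 127+1 overflowed
        ; pack the four samples into the exit status, one bit each
        movzx   edi,bl          ; bit 0
        movzx   eax,bh
        shl     eax,1
        or      edi,eax         ; bit 1
        movzx   eax,cl
        shl     eax,2
        or      edi,eax         ; bit 2
        movzx   eax,ch
        shl     eax,3
        or      edi,eax         ; bit 3
        mov     eax,60          ; sys_exit
        syscall                 ; exit status 9 = bits 0 and 3 set
```

The hardware adder is the same in both cases; only the choice of flag (and hence of conditional jump, such as `ja` versus `jg`) reflects the programmer's interpretation.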

Post 25 Aug 2008, 15:45
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Madis731
lazer1 wrote:
what I meant is that the h/w specification is IDENTICAL,
the only difference is in the mind of the programmer
in fact I use them when they could contain integers where
neither meaning is valid.[---]

Did you even *read* the links posted above and the discussion?
It's clearly stated that it saves you a few uops here and there when you use the same type of instruction that's in the xmm's at that time. Okay, sorry, I may have misled you with my previous discussion, but it's related and it should be understandable:
Code:
movdqa xmm0,[mem128.a]   ;integer-typed load
movaps xmm1,[mem128.b]   ;single-precision-typed load
mulpd  xmm1,xmm0         ;BAD!!! double-precision op on mismatched types

;..but

movapd xmm0,[mem128.a]
movapd xmm1,[mem128.b]
mulpd  xmm1,xmm0         ;VERY GOOD!

;even better...

movapd xmm1,[mem128.a]
;movapd xmm0,xmm1 ;if you really want to preserve it
mulpd  xmm1,[mem128.b]
    

If you don't respect the type, then you pay a penalty, usually some uops, but sometimes an entire clock. Conversion instructions help you jump from one op-type to another. Really, if you don't care about op-typing, just use the shortest, which is always *PS (SHUFPS, MOVUPS, etc.)

_________________
My updated idol Very Happy http://www.agner.org/optimize/
Post 25 Aug 2008, 16:00
lazer1



Joined: 24 Jan 2006
Posts: 185
lazer1
Quote:

Seems that movdqu isn't much different from mov reg,mem only taking less opcodes.


if 2 code fragments are identical speed but different sizes,

then the smaller one is preferable as it eats up less cache.


ie better cache utilisation, whether it be L1 or L2



if you install Ubuntu it has

"Ubuntu 8.04.1, memtest86+" in the boot menu,

for my Sempron, it says the L1 cache is 128K and L2 cache is 256K

(you can also find these sizes via asm: CPUID function 80000005h,
ecx bits 31 to 24 give the L1 data cache size in KB,

and CPUID function 80000006h ecx bits 16 to 31 give L2 size in KB,

for AMD CPUs, probably the same for Intel.
)
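
The CPUID queries described above can be sketched as follows. This is a minimal example, assuming fasm syntax and a Linux x86-64 target; returning the L1 data cache size through the exit status is an illustrative convention only. The bit positions are the ones AMD documents for these functions. As far as I know, Intel documents function 80000005h as reserved (returning zeros) while reporting the L2 size in the same ECX bits of 80000006h, so treat the "same for Intel" part as unverified.

```asm
; Minimal sketch, assuming fasm and Linux x86-64.
; Exit status = L1 data cache size in KB (0 if the CPU does not report it).
format ELF64 executable 3
entry start

segment readable executable
start:
        xor     edi,edi                 ; default exit status: 0
        mov     eax,80000000h
        cpuid                           ; eax = highest extended function
        cmp     eax,80000006h
        jb      .exit                   ; extended cache functions not available
        mov     eax,80000005h
        cpuid
        shr     ecx,24                  ; ecx[31:24] = L1 data cache size in KB
        mov     edi,ecx
        mov     eax,80000006h
        cpuid
        shr     ecx,16                  ; ecx[31:16] = L2 cache size in KB
        mov     esi,ecx                 ; kept in esi; may not fit an 8-bit exit status
.exit:
        mov     eax,60                  ; sys_exit
        syscall
```

Note that function 80000005h reports the L1 instruction cache separately in EDX, so a "128K L1" figure may be the sum of two 64K halves, and code only competes for the instruction half.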

anyway, if a program is less than 128K the entire program
will fit in all the caches!

smaller opcodes will increase the percentage of the program
that can fit in the L1 and L2 caches.
Post 25 Aug 2008, 16:13
lazer1



Joined: 24 Jan 2006
Posts: 185
lazer1
Madis731 wrote:
lazer1 wrote:
what I meant is that the h/w specification is IDENTICAL,
the only difference is in the mind of the programmer
in fact I use them when they could contain integers where
neither meaning is valid.[---]

Did you even *read* the links posted above and the discussion?
It's clearly stated that it saves you a few uops here and there when you use the same type of instruction that's in the xmm's at that time.


yes but you are looking at Intel's sloppy works ie their cpus,

I am looking at it from a cpu design POV,

from a CPU design POV the mov_unaligned_128
should IGNORE the type,

it should be a typeless instruction.

WHY should the type matter if the instruction IGNORES the type,

that is sloppy cpu implementation.


mov is literally a wiring problem, you connect bit m of source
with bit m of dest with a wire,

how the type can matter is something only Intel can do.

the type flip-flops shouldnt be wired up to the mov_unaligned_128
logic gates or wires,

the words "efficient" and "Intel" can only be in the same

sentence if there is also the word "not"
Post 25 Aug 2008, 16:28
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17247
Location: In your JS exploiting you and your system
revolution
lazer1 wrote:
the words "efficient" and "Intel" can only be in the same

sentence if there is also the word "not"
You mean like this?:

Intel CPU are much more efficient, when compared to AMD, which are not nearly as efficient.
Post 25 Aug 2008, 20:25
lazer1



Joined: 24 Jan 2006
Posts: 185
lazer1
revolution wrote:
lazer1 wrote:
the words "efficient" and "Intel" can only be in the same

sentence if there is also the word "not"
You mean like this?:

Intel CPU are much more efficient, when compared to AMD, which are not nearly as efficient.


yes! Razz

but there is a problem with the word order Mad


I was thinking more of Intel design, it is so pedantic
and overcomplicated. considering that ARM RISC was
around almost 20 years ago.

long mode is by AMD and is near perfect design relative to the constraints,

I would have gone much further, design a brand new fully orthogonal
instruction set. I dont mind having the compatibility mode, but for
64 bit you need a brand new instruction set.

as for SSE I would have gone for a vectorised register set:

have 1 HUGE register of say 2048 bits, then
allow subsets to be accessed eg:

256 x 8 : 256 byte registers
128 x 16 : 128 word vector registers
64 x 32 : 64 dword vector registers
32 x 64 : 32 long vector registers
16 x 128 : 16 vector registers of 128 bits
8 x 256 : 8 vector registers of 256 bits
4 x 512 : 4 vector registers of 512 bits
2 x 1024 : 2 vector registers of 1024 bits

and also vector subsets,

its just different 2^m aligned m-bit subsets of the same 2048 bits.

eg:

3 bits to encode the vector width: 2^(3+m) bits, 0<= m <= 7
2 bits to encode the vector field width: 2^(m + 3) bits, 0<= m <= 3
8 bits for source register, 8 for dest register,

that leaves 11 bits for the opcode and other info for a 32 bit instruction set.
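
The bit budget above can be checked with assemble-time arithmetic. This is a minimal sketch, assuming fasm syntax and a Linux x86-64 target; the field ordering, the constant names and the sample opcode value 5 are all hypothetical, since the post only gives the field widths.

```asm
; Minimal sketch, assuming fasm and Linux x86-64.
; Field order, names and the sample opcode are hypothetical.
format ELF64 executable 3
entry start

VW_BITS  = 3                            ; vector width selector
FW_BITS  = 2                            ; field width selector
REG_BITS = 8                            ; one register index
OP_BITS  = 32 - VW_BITS - FW_BITS - 2*REG_BITS   ; 32 - 3 - 2 - 16 = 11

DST_SHIFT = 0
SRC_SHIFT = DST_SHIFT + REG_BITS        ; 8
FW_SHIFT  = SRC_SHIFT + REG_BITS        ; 16
VW_SHIFT  = FW_SHIFT + FW_BITS          ; 18
OP_SHIFT  = VW_SHIFT + VW_BITS          ; 21

segment readable executable
start:
        mov     edi,OP_BITS             ; exit status 11 = bits left for the opcode
        mov     eax,60                  ; sys_exit
        syscall

segment readable
; opcode 5, vector width m=4 (128 bits), field width m=2 (32 bits), src 7, dest 3
sample  dd (5 shl OP_SHIFT) or (4 shl VW_SHIFT) or (2 shl FW_SHIFT) or (7 shl SRC_SHIFT) or (3 shl DST_SHIFT)
```

The exit status of 11 matches the figure in the post. The 8-bit register index is only fully used for the 256 byte-wide registers; wider vectors need fewer index bits.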

for things like pshufd I would have the shuffle bits eg in rcx
and allow shuffling of up to 16 fields, 4 bits per field x 16 = 64.

shuffle16 dest, src ; vectors of 16 bytes or words or ...
shuffle8 dest, src ; vectors of 8 bytes or words or ...
shuffle4 dest, src ; vectors of 4 bytes or words

this is a very orthogonal scheme, we can go for a

load-store architecture where no maths ops allowed on memory,

you can only do mov with memory, nothing else.

thus shuffle16 and all maths ops ONLY act on registers,

shuffle16 reg, mem ; WRONG!

instead:

mov reg, mem
shuffle16 reg1, reg

where eg rcx is loaded with the shuffle bits outside the loop.

orthogonality ideas have been around A LONG TIME,
eg the Motorola 68000 was an orthogonal instruction set
from more than 20 years ago.
Post 25 Aug 2008, 21:09
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17247
Location: In your JS exploiting you and your system
revolution
lazer1: Clearly you have never designed a CPU; it is not just a technical challenge but also a psychological challenge. Intel tried some bold things with the Itanium, but it failed, not for technical reasons, but because the programmers didn't like it. Sad but true. So go ahead and design your uber-CPU, but be aware that if the general community doesn't like it then it won't sell.
Post 25 Aug 2008, 21:15
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Quote:

Intel tried some bold things with the Itanium, but it failed, not for technical reasons, but because the programmers didn't like it.

Is that what really happened, or was it that Intel never targeted the general public in the first place? The pricing was very high, I saw no computer store selling even the Itanium alone (even though the very same computer store was selling Opteron in some cases), and the Pentium 4 was aggressively pushed onto the market at the same time with the biggest MoreGigahertzIsBetter(TM) marketing ever. It also had the technical disadvantage of running x86 code slower than AMD64, but I think the other things weighed more in its "failure" (in quotes because I don't know if it has really been losing market share since its release).

lazer1, that 2048-bit mega-register could be costly. The access modes you are proposing look more like a 2048-bit SRAM than real registers; the addressing logic could be prohibitively costly or even impossible to achieve within a CPU cycle without any latency.

And to all: several times people have said "if I were designing a CPU...". Let me tell you that Sun released the HDL code of the SPARC. I even have a copy of it (OpenSPARCT1.1.6.tar.bz2.bz2); I don't remember exactly where I got it (but it was from Sun's site, of course). I suggest downloading it, at least to convince yourselves of how complex a CPU is. Also, you can check http://www.fpga4fun.com/spoc.html for a simpler design; if you can't grasp it then don't try with OpenSPARC Razz (even though I did, and failed miserably, of course)
Post 25 Aug 2008, 21:49
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17247
Location: In your JS exploiting you and your system
revolution
LocoDelAssembly wrote:
Quote:

Intel tried some bold things with the Itanium, but it failed, not for technical reasons, but because the programmers didn't like it.

Is that what really happened, or was it that Intel never targeted the general public in the first place? The pricing was very high, I saw no computer store selling even the Itanium alone (even though the very same computer store was selling Opteron in some cases), and the Pentium 4 was aggressively pushed onto the market at the same time with the biggest MoreGigahertzIsBetter(TM) marketing ever. It also had the technical disadvantage of running x86 code slower than AMD64, but I think the other things weighed more in its "failure" (in quotes because I don't know if it has really been losing market share since its release).
My view is that without programmer support (ie. no software to run on the thing) no one wants to buy it. A computer without software is not much use. Retailers don't want to stock a poor product. Admittedly Intel did aim the Itanium at the server market, but the programmers found it a heavy burden to get code running well, the thing was just too complex to understand how to make it run nicely. Hence, they preferred the much simpler, and faster, x86 code they already had. Technically the Itanium is pretty good, but if you have seen how to code for it you will understand why it is a real pig to use.
Post 25 Aug 2008, 21:59
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
I forgot these links: http://www.youtube.com/view_play_list?p=803563859BF7ED8C (Electrical - Digital Circuits and Systems), http://www.youtube.com/view_play_list?p=D2350A83B752C861 (Electronics - Digital VLSI System Design).

I can't comment on the quality of those courses because this is not my field at all (do I even have a field?) and I have not had enough time to watch all the videos, but I've enjoyed the 6 or 7 videos I've seen.

revolution, yes, the lack of software is a BIG problem but, since most of the servers out there run Linux and typically all the software inside is open, why not just recompile, or even use a distro already compiled for Itanium? Surely all the ASM optimizations can't be used, it would be all C at the beginning, but is that a real concern (for the target audience)? I think the pricing weighed a lot; they should have started with a much lower price to motivate people to change, rather than putting Itanium on the market as if it were a jewel, ignoring the fact that there were well-established architectures already that needed to be pushed out of the market first by more competitive pricing or other means.

And yes, programming triplets of instructions is not very handy, especially because the bundles restrict the possible instruction triplet combinations Smile But again, Assembly programmers are really scarce compared to C programmers; I think that the complexity of the Itanium Assembly was not one of the most important causes of its failure.
Post 25 Aug 2008, 22:23
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17247
Location: In your JS exploiting you and your system
revolution
LocoDelAssembly wrote:
revolution, yes, the lack of software is a BIG problem but, since most of the servers out there run Linux and typically all the software inside is open, why not just recompile, or even use a distro already compiled for Itanium? Surely all the ASM optimizations can't be used, it would be all C at the beginning, but is that a real concern (for the target audience)? I think the pricing weighed a lot; they should have started with a much lower price to motivate people to change, rather than putting Itanium on the market as if it were a jewel, ignoring the fact that there were well-established architectures already that needed to be pushed out of the market first by more competitive pricing or other means.
Actually that is the point, why pay premium price for a system that can only use stock C code and will then give you overall lower performance than the x86 systems?
LocoDelAssembly wrote:
And yes, programming triplets of instructions is not very handy, especially because the bundles restrict the possible instruction triplet combinations Smile But again, Assembly programmers are really scarce compared to C programmers; I think that the complexity of the Itanium Assembly was not one of the most important causes of its failure.
Also the point, C code still needs to be put into assembly code. The compiler software is incredibly complex when the Itanium considerations have to be included. You still need assembly at some point, whether it comes from a compiler or a human makes little difference.
Post 25 Aug 2008, 22:29
lazer1



Joined: 24 Jan 2006
Posts: 185
lazer1
Quote:

My view is that without programmer support (ie. no software to run on the thing) no one wants to buy it. A computer without software is not much use. Retailers don't want to stock a poor product. Admittedly Intel did aim the Itanium at the server market, but the programmers found it a heavy burden to get code running well, the thing was just too complex to understand how to make it run nicely. Hence, they preferred the much simpler, and faster, x86 code they already had. Technically the Itanium is pretty good, but if you have seen how to code for it you will understand why it is a real pig to use.


well, the important thing with everything is: KEEP THINGS SIMPLE!

the hardware should be fast with normal unoptimised asm.
if it isnt it will fail, because most people are too lazy to
bother with complicated optimisations.


thats what I like about the RISC idea where you REMOVE AS MUCH
functionality as you can get away with. the R in RISC means
"reduced" ie less things. Originally they found that most compilers
didnt use most instructions of CISC. And thus it would pay off
to remove the rarely used things. Compilers typically
just use a TINY subset of an instruction set eg most compilers
are unlikely to use binary coded decimal instructions.

the original ARM a pioneer RISC didnt even have mult and div,
you had to do those in software. Their idea was to only
have instructions which could be done in 1 cycle.

some RISCs dont even have a stack pointer, you just use any
register. when you call a subroutine the args go to registers
and the return address also to a register. For leaf functions
that is faster as memory isnt accessed to call a subroutine.

if it isnt a leaf function then the function can store the return
address on a stack in software.

PPC has a lot of leaf function optimisations, ie ideas which
make leaf functions faster.

creating an innovation for servers is doomed because there
are relatively few servers as each server typically is serving
probably hundreds of people.

servers are simpler than home computers as servers
are mainly file managers with shell scripts. The gfx you
see when you connect to a server is done by your own
computer. The server just emits html which IE or Firefox
INTERPRETS as graphics.

eg this forum will be on a server, but the graphics you see
is done by IE and Windows XP or by Firefox and Linux.

the server just generates the html via script programs,
a php script for this forum.


I dont know anything about the Itanium other than that it exists,

but I think to succeed it needs backwards compatibility with
existing x86. That way people can buy the cpu and
some people can experiment coding.

backwards compatibility can be done via emulation.

one thing I dont understand is that x86 is emulated
above a RISC core, so why dont they make the
RISC core accessible to programmers?
Post 26 Aug 2008, 01:07
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17247
Location: In your JS exploiting you and your system
revolution
lazer1: There are many many trade-offs that must be considered when designing a CPU. Blanket statements like "RISC is better" are just not true in every case.

Also, servers need to be fast and reliable. They are generally much more complicated than a desktop machine.
Post 26 Aug 2008, 01:18
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
lazer1 wrote:
the hardware should be fast with normal unoptimised asm. if it isnt it will fail, because most people are too lazy to bother with complicated optimisations.
Keep in mind that most running machine code is generated by compilers, not humans... this isn't a "compilers generate better code" argument, I'm just saying that most running code will be generated by compilers - thus CPUs should focus on that. You're still right that a lot of x86 instructions aren't widely used, though, and that the architecture could be better Smile

I'm not sure about the whole RISC vs. CISC debate though... internally, these days x86 CPUs execute relatively simple micro-ops, decoded from the more complicated x86 instruction set... reduced-instruction architectures get SIMD instructions... and there's weird stuff like Itanium's EPIC. The lines begin to blur.

Personally I don't think you should strive for a minimal instruction set, but a comfortable one, without useless instructions. LOOP is a nice instruction in theory, but when it takes longer than "dec reg / jnz label", it's useless - I don't see why it can't be broken into u-ops that are at least as effective, though. And striving for 1-instruction-1-cycle seems silly, it would probably either mean no SIMD instructions, or having other instructions run artificially slow.
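
For reference, the two loop forms being compared look like this. It is a minimal sketch, assuming fasm syntax and a Linux x86-64 target; it only shows that both forms compute the same result and says nothing about their relative speed, which depends on the CPU.

```asm
; Minimal sketch, assuming fasm and Linux x86-64.
; Sums 10+9+...+1 once with LOOP and once with DEC/JNZ.
format ELF64 executable 3
entry start

segment readable executable
start:
        xor     eax,eax
        mov     ecx,10          ; writing ecx zero-extends into rcx
.with_loop:
        add     eax,ecx
        loop    .with_loop      ; decrements rcx, jumps while rcx != 0, leaves flags alone

        xor     edx,edx
        mov     ecx,10
.with_dec:
        add     edx,ecx
        dec     ecx             ; modifies ZF, SF, OF, AF, PF (but not CF)
        jnz     .with_dec

        xor     edi,edi
        cmp     eax,edx
        cmovne  eax,edi         ; force 0 if the two sums disagree
        mov     edi,eax         ; exit status 55 when both loops agree
        mov     eax,60          ; sys_exit
        syscall
```

The one semantic difference is the flags: `loop` leaves them untouched while `dec` changes most of them, which is one thing a micro-op expansion of `loop` has to preserve.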

x86-64 is nice, but there's still too much x86 legacy - both in the CPU, but also the rest of the PC platform. Would be nice throwing it all aside, and doing backward compatibility with emulation, but that's just not going to happen. Itanium had backward compatibility, but it didn't go through (pricing, marketing, lack of performance, etc.)

Also, keep in mind that there's a lot of different kinds of "servers", and that Itanium wasn't just created for servers, but also workstations. You know, big and complicated number-crunching stuff, not just simple serving of web content.
Post 26 Aug 2008, 01:27
lazer1



Joined: 24 Jan 2006
Posts: 185
lazer1
revolution wrote:
lazer1: There are many many trade-offs that must be considered when designing a CPU. Blanket statements like "RISC is better" are just not true in every case.


but x86 IS based on a RISC core,

thats like a human saying humans can go to the moon,

no they cannot, its the machine which goes to the moon,
the human just presses some buttons.

you could put a snail in the same machine and it would
also go to the moon.

x86 is a zombie cpu, your x86 is in fact a RISC machine
PRETENDING to be an x86.

Quote:

Also, servers need to be fast and reliable. They are generally much more complicated than a desktop machine.


the hardware is much more complicated, but the s/w functionality is
much simpler.

thats the whole idea of a server,

a home computer deals with printer, files, graphics, the user,etc


with the server paradigm you have:

a printer server to deal with the printer, it does NOTHING else
a file server deals with files, it does NOTHING else
a terminal deals with the user, it just forwards all the work
mainframes deal with the actual processing,

the functionality of a desktop machine is SPLIT UP into
different machines with the Unix server paradigm.

thus the desktop machine is MUCH MORE COMPLICATED,
it does EVERYTHING

with the server paradigm each machine just forwards
everything outside its specialism to the appropriate server.

that is why Linux is so complicated, it is based on Unix which
was never meant to be a desktop. It can be used as a desktop
but it is VERY INEFFICIENT because it has to pretend
the desktop is lots of different machines,

eg Unix graphics is typically transmitted across an ethernet
from a mainframe to the users terminal.

terminal = monitor + mouse + keyboard + simple computer
for forwarding things to the mainframe and receiving graphics
from the mainframe.

if you want to see the MOST EFFICIENT desktop ever its
68k AmigaOS.
Post 26 Aug 2008, 12:28

Copyright © 1999-2020, Tomasz Grysztar.
