flat assembler
Message board for the users of flat assembler.
To build a new processor from scratch
bitRAKE 19 Jun 2009, 06:18
The whole advancement on x86 has been kludge on kludge.
It's not so bad really,
Code:
; parse the decimal digits of a UTF-16 string at rcx into rax
QWORD.atouiW:
        xor eax,eax             ; 33 C0
        jmp .0                  ; EB 0D
@@:     cmp edx,9               ; 83 FA 09
        ja @F                   ; 77 14
        lea rax,[rax+rax*4]     ; 48 8D 04 80
        lea rax,[rdx+rax*2]     ; 48 8D 04 42
.0:     movzx edx,word [rcx]    ; 0F B7 11
        add rcx,2               ; 48 83 C1 02
        sub edx,'0'             ; 83 EA 30
        jae @B                  ; 73 E7
@@:     retn                    ; C3
The current design is very good at predicting code fetch, so the optimization needed in the decode stage has been minimal. IIRC, Intel already has patents for caching macro-/micro-ops (instead of the current instruction cache) to reduce decode pressure, but doesn't use them.
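For readers who do not follow the assembly, here is a minimal Python model of what the routine above appears to do. The name `atoui_w` and the list-of-code-units input are assumptions made for illustration; the assembly itself has no length check and relies on a terminating non-digit such as NUL.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def atoui_w(units):
    """Model of QWORD.atouiW: accumulate leading decimal digits of UTF-16 code units."""
    rax = 0
    for unit in units:
        # sub edx,'0' wraps like a 32-bit register, so one unsigned
        # comparison rejects both units below '0' and units above '9'.
        edx = (unit - ord("0")) & MASK32
        if edx > 9:                        # jae @B falls through, or ja @F is taken
            break
        rax = (rax + rax * 4) & MASK64     # lea rax,[rax+rax*4]  -> rax * 5
        rax = (edx + rax * 2) & MASK64     # lea rax,[rdx+rax*2]  -> rax * 10 + digit
    return rax

print(atoui_w([ord(c) for c in "1234"] + [0]))
```

The two `lea` instructions together multiply by ten and add the digit without using `mul`, which is why the loop body stays so short.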
hopcode 19 Jun 2009, 12:21
bitRAKE wrote: (Three more bytes than the 32-bit version.)
OK, that's not so bad. And... I am taking the positive side of the whole thing. Thank you for your answer.
bitRAKE 21 Jun 2009, 18:22
I had more time to think about your post...
Quote: is it not weird/stupid that after checking the prefix
I've always imagined the processor is as you describe:
Quote: If i could have done it, I would have let run all the software till now in an interpreted (read almost as "debugged") "legacy" processor shell, and the new software in a native way, where the logic take advantages from the actual knowledge, and from the experience of years.
Quote: If we... run on 64 bit machine we need back-compatibility, not a way
hopcode 22 Jun 2009, 02:47
The matter is complex. I will print the thread to think it over calmly. For example,
Quote: Shouldn't prefix bytes just be consumed until a non-prefix byte is present - regardless of the instruction?
Yes. And the matter is how to check this within Intel's prefixing mechanism. They developed the prefixes 66h/67h/F2h/F3h. Then, for new SIMD instructions like CVTDQ2PD, which is considered a 2-byte opcode instruction, chapter 2.1.2 (instruction reference, A-M) says
Quote:
They forgot (thank God!!) the 67h this time!!! The 67h is IMHO still a good prefix because
-1) it is there for backward compatibility
-2) they have not retouched it yet.
And now, where is the common sense again: F3h was REP (a prefix!!!) on old machines; now it can be a mandatory prefix, so one has to read the following byte knowing that the previous one is not 100% a REP prefix.
On old machines several things could follow F3, but F3 could be discarded.
On new machines several things can follow F3, but F3 cannot simply be neglected/discarded as an invalid/superfluous REP!!!
This invalidates the position you state:
Quote: Shouldn't prefix bytes just be consumed until a non-prefix byte is present - regardless of the instruction?
It cannot be "regardless", because the mandatory prefix for CVTDQ2PD is by design part of the instruction. I am not against their inventiveness; they can invent what they like. But mine is a common-sense concept. If I call the USA, I dial
PREFIX (USA) 001
PREFIX (STATE) 213 Hollywood
NUMBER YYYYYYY
I dial 001-213-YYYYYYY to call Hollywood. But what if one day someone establishes that 001-415 remains Palo Alto when calling from Germany, with 001 discarded if I call from Palo Alto, while 001-213, on the contrary, becomes a mandatory prefix for a hotline?
Quote: Shouldn't prefix bytes just be consumed until a non-prefix byte is present - regardless of the instruction?
Should I dial Hollywood by its prefixes regardless of the number 001-213? What if I call Hollywood from Palo Alto? What prefix should I dial? 213?
Regards,
hopcode
bitRAKE 22 Jun 2009, 13:22
By "consume" I do not mean that the prefix should be discarded or ignored. Once a non-prefix byte is reached, the analysis of the prefixes is performed. I do understand, though, how this makes disassembly more difficult than if prefixes were a binary switch.
F3 66 F3 47 0F E6 3C F7    cvtdq2pd xmm15,[r15+r14*8]
...executes just fine! The nearest prefixes before E6 select the instruction: first 0F (special), then, past the REX byte (REX bytes are prefixes in long mode), F3 (of the group 66/F2/F3). IMHO, handling 0F as a special type of prefix works better because E6 3C is a valid instruction on its own. Hopefully, no one got the bright idea to align instructions with bogus prefixes (AMD); that would kill forward compatibility once more advanced decode semantics are introduced.
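The "consume first, analyse afterwards" idea can be sketched in Python as below. This is a hedged illustration, not a decoder: the function names are hypothetical, only the 0F E6 row of the opcode map is included, and the priority rule (the last F2/F3 wins, otherwise 66) is an assumption chosen because it reproduces the result reported above.

```python
LEGACY_PREFIXES = {0xF0, 0xF2, 0xF3, 0x66, 0x67,
                   0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65}

# Only the 0F E6 entry of the two-byte opcode map, keyed by mandatory prefix.
OPCODE_0F_E6 = {0x66: "cvttpd2dq", 0xF3: "cvtdq2pd", 0xF2: "cvtpd2dq"}

def scan_prefixes(code, long_mode=True):
    """Consume prefix bytes; decide what they mean only after the scan stops."""
    seen, rex, pos = [], None, 0
    while pos < len(code):
        byte = code[pos]
        if byte in LEGACY_PREFIXES:
            seen.append(byte)
            rex = None                  # a REX not directly before the opcode is ignored
        elif long_mode and 0x40 <= byte <= 0x4F:
            rex = byte                  # 40h-4Fh are REX prefixes only in long mode
        else:
            break                       # first non-prefix byte ends the scan
        pos += 1
    # Assumed priority: the last F2/F3 seen wins, otherwise 66 if present.
    mandatory = next((b for b in reversed(seen) if b in (0xF2, 0xF3)), None)
    if mandatory is None and 0x66 in seen:
        mandatory = 0x66
    return seen, rex, mandatory, pos

def decode_0f_e6(code, long_mode=True):
    """Return the mnemonic for a prefixed 0F E6 instruction, or None."""
    _, _, mandatory, pos = scan_prefixes(code, long_mode)
    if code[pos:pos + 2] != bytes([0x0F, 0xE6]):
        return None
    return OPCODE_0F_E6.get(mandatory)

sample = bytes.fromhex("F3 66 F3 47 0F E6 3C F7")
print(decode_0f_e6(sample))
```

With `long_mode=False` the scan stops at 47, which is then an ordinary opcode rather than a prefix, so the same bytes no longer form a single instruction.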
hopcode 23 Jun 2009, 11:35
Quote:
0x403000: F366        REPZ
0x403002: F347        REPZ INC EDI
and... surprise!!
0x403004: 0FE63CF7    CVTTPD2DQ XMM7,XMMWORD PTR [ESI*8+EDI]
It understands 99% of the 64-bit code. This shows how the 4 bytes F3 66 F3 47 are nothing but a waste of resources.
Quote: no one got the bright idea to align instructions with bogus prefixes
Think about POP CS (valid on some 186 machine). It is the 0Fh!! It would become the future famous prefix 0Fh. I imagine that they eliminated it because they needed a simple nibble. OK, but it was no longer supported. That is fine. But they left 40h->5Fh untouched: INC/DEC/PUSH/POP for all r32 registers.
5Fh - 40h + 1 = 32 lost bytes in the main table.
They did not implement them with a MOD/RM byte, with, for example, only one 40h + MOD for the instructions and the r32 regs!!!
So for the instruction cvtdq2pd, the prefixes all come before!! All bogus! Why? They could have implemented it with postfixes, because
1-) it doesn't need backward compatibility (PEBrowse shows it 99% correct)
2-) F3 (REP) was not eliminated from the table, unlike the POP CS instruction.
And consider that in 64-bit mode they could have implemented the same instruction simply by reusing the dismissed ones (BOUND/AAM etc.). It executes fine, but it doesn't follow the economy of technology. (Please do not take this last statement of mine as necessarily negative.)
Regards,
hopcode
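The two readings of the same bytes can be reproduced with a small Python sketch of ModRM/SIB decoding. It is deliberately narrow and its names are hypothetical: it handles only the mod=00, rm=100 (SIB) form used here, and it assumes an XMM destination as in the 0F E6 instructions.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
         "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]

def operands(modrm, sib, rex=0, long_mode=False):
    """Decode 'xmmN,[base+index*scale]' from a ModRM byte plus SIB byte."""
    if not long_mode:
        rex = 0                                   # no REX outside long mode
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0 or rm != 4:
        raise ValueError("sketch handles only mod=00 with a SIB byte")
    scale, index, base = 1 << (sib >> 6), (sib >> 3) & 7, sib & 7
    rex_r, rex_x, rex_b = (rex >> 2) & 1, (rex >> 1) & 1, rex & 1
    if base == 5 or (index == 4 and not rex_x):
        raise ValueError("disp32-only and no-index forms are not covered")
    # REX.R, REX.X and REX.B each supply a fourth bit for one register field.
    reg |= rex_r << 3
    index |= rex_x << 3
    base |= rex_b << 3
    names = REG64 if long_mode else REG32
    return f"xmm{reg},[{names[base]}+{names[index]}*{scale}]"

print(operands(0x3C, 0xF7))                              # 32-bit reading
print(operands(0x3C, 0xF7, rex=0x47, long_mode=True))    # long mode with REX 47
```

The ModRM and SIB bytes are identical in both calls; only the REX byte 47 (R, X and B set) turns xmm7/edi/esi into xmm15/r15/r14.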
hopcode 02 Sep 2009, 01:55
I have had a (perhaps already existing) idea.
All the processors I know have these peculiarities, and they always show the same difficulties:
1) they do not distinguish data from code
2) they try to cache data and/or code
but they have to
1a) work on different instruction lengths/data sizes
2a) execute branching algos etc. to improve speed/caching
I am wondering whether there exists a processor, a "structured" processor, I would say, in which we find the following feature of opcodes, for example 32-bit: a prefix of 2 bytes (it may be 4 too, why not?)
Code:
|15|14|13|12|11|10|09|08|07|06|05|04|03|02|01|00|  bits in the 2-byte prefix
 ___________________________________ ___________
 i) instruction's part               d) 4-bit description's part
In 4 bits we could describe the type of i) and of the following b) body of the instruction (4 bytes). So the instruction ALWAYS has a length of 4+2=6 bytes. As an example, the d) part gives us 0->15 pieces of information/flags:
Code:
0000 -> data follows in the b) part
0001 -> code follows in the b) part
0010 -> mix of data+code in the b) part
0100 -> RAW POINTER TO INSTRUCTION follows in the b) part,
        that is, for example, one of reg/reg type (mov eax,ebx);
        they are NOT OPCODED! They are POINTERS TO ALREADY stored opcodes/operations,
        "called" from a list in an INSTRUCTION DESCRIPTOR TABLE (similar to interrupts).
        In this case, 4 bytes could describe a lot of registers
        and information/packed instructions
1000 -> OPCODED instruction follows in the b) part,
        as we know them today on processors.
....
If I find, for example, a word with the LSBits set to 0000, I will know that in the b) part I will find only data from a presumed data section, even if it is code to be executed (in this case -> exceptions).
This would be important for these reasons:
1r) strong structuring of the program
2r) more security
3r) no brainfucked algos to perform caching
4r) PROGRAMMERS optimize their code
5r) ...all the advantages of a "structured" execution instead of the slow parsing of an unstructured/error-prone sequence of opcodes
Does such a "structured" processor exist?
Regards,
hopcode
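To make the proposed 6-byte format concrete, here is a minimal Python sketch that splits one such instruction into its parts. It follows the layout described above (12-bit i) part, 4-bit d) part in the low bits, 4-byte body); the little-endian byte order, the function name and the "reserved" label for unlisted descriptor values are assumptions, since the post does not specify them.

```python
import struct

# Descriptor values listed in the post; anything else is treated as reserved.
DESCRIPTORS = {
    0b0000: "data",
    0b0001: "code",
    0b0010: "data+code",
    0b0100: "pointer into instruction descriptor table",
    0b1000: "opcoded instruction",
}

def split_instruction(raw):
    """Split a fixed 6-byte instruction into (i part, descriptor name, body)."""
    if len(raw) != 6:
        raise ValueError("every instruction is exactly 2 + 4 = 6 bytes")
    prefix, body = struct.unpack("<HI", raw)   # assumed little-endian
    d_part = prefix & 0xF                      # bits 3..0: description
    i_part = prefix >> 4                       # bits 15..4: instruction part
    return i_part, DESCRIPTORS.get(d_part, "reserved"), body

raw = struct.pack("<HI", (0x123 << 4) | 0b0100, 0xDEADBEEF)
print(split_instruction(raw))
```

Because the length is fixed and the descriptor sits at a known bit position, the split needs no scanning at all, which is the property the proposal is after.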
revolution 02 Sep 2009, 03:00
The Itanium has a 5 bit prefix before each bundle to describe the contents of the following 123 bits (128 bits total).
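As a rough illustration of that layout, the Python sketch below splits a 128-bit bundle, given as an integer, into the 5-bit template and the remaining 123 bits viewed as three 41-bit instruction slots. The function name is hypothetical, and placing the template in the lowest five bits is stated here as the working assumption of the sketch.

```python
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1

def split_bundle(bundle):
    """Split a 128-bit bundle into (template, (slot0, slot1, slot2))."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("a bundle is exactly 128 bits")
    template = bundle & 0x1F                                  # 5-bit template
    slots = tuple((bundle >> (5 + SLOT_BITS * n)) & SLOT_MASK
                  for n in range(3))                          # 3 x 41 = 123 bits
    return template, slots

bundle = 0x10 | (1 << 5) | (2 << 46) | (3 << 87)
print(split_bundle(bundle))
```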
hopcode 02 Sep 2009, 03:41
revolution wrote: The Itanium has a 5 bit prefix before each bundle to describe the contents of the following 123 bits (128 bits total).
Itanium is recent (but I don't know anything about it)... So Intel itself has implemented a sort of fixed prefixing... I cannot believe it (btw, thanks for the interesting links "From Sand to Silicon: the Making of a Chip").
Regards,
hopcode
booter 28 Dec 2009, 11:24
hopcode wrote: I have had an (perhaps already existing) idea.
IBM mainframe (360, 370, etc.)
BTW, its assembler is 1000 times easier!
hopcode 29 Dec 2009, 05:50
I have browsed the net, and here is what I have found:
for IBM S/390, an Assembler tutorial
http://csc.colstate.edu/woolbright/WOOLBRIG.htm
for IBM S/370, an Assembler tutorial
http://cbttape.org/~jmorrison/s370asm/html/tut-contents.html
That is very interesting.
booter wrote: its assembler is 1000 times easier
At first glance, yes and no (because I am not used to it). For example, from this table:
Code:
BC B"1000",LAB  or  BC 8,LAB     Branch if condition code is zero
BC B"0100",LAB  or  BC 4,LAB     Branch if condition code is one
BC B"0010",LAB  or  BC 2,LAB     Branch if condition code is two
BC B"0001",LAB  or  BC 1,LAB     Branch if condition code is three
BC B"1001",LAB  or  BC 9,LAB     Branch if condition code is three or zero
BC B"1011",LAB  or  BC 11,LAB    Branch if condition code is three, two or zero
NO: it is not so intuitive
Code:
BC B"1000",LAB  or  BC 8,LAB     Branch if condition code is zero
but YES: two branch conditions on one line, and adding
Code:
BC B"1001",LAB  or  BC 9,LAB     ;Branch if condition code is three or zero
A R5,20(R7,R8)                   ;the word 20 bytes past the sum of R7 and R8 is added to R5
reads just like the well-known Intel ones. As we can see from here
http://csc.colstate.edu/woolbright/INSTORG.HTM
and from IBM S/360, "an imaginary journey through the evolution of the S/370 hardware design project"
http://cbttape.org/~jmorrison/s370asm/html/tut-S370-design-001.html
the instruction format is highly structured. One could learn to read opcodes from memory in a few days!! The whole thing sounds very inspiring to me. When I have a couple of free hours, perhaps at the end of January, I will install the Hercules emulator to test something written in ASM for IBM S/360/370.
Thank you for your precious reference,
Regards,
hopcode
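The BC mask table and the `A R5,20(R7,R8)` address can be modelled in a few lines of Python. This is only a sketch: the function names and the register-list representation are hypothetical, and the 24-bit address width is an assumption matching S/360-style addressing.

```python
def bc_taken(mask, cc):
    """BC mask,target branches when the mask bit selected by the condition code is set."""
    # Mask bit 8 corresponds to cc 0, 4 to cc 1, 2 to cc 2 and 1 to cc 3.
    return bool(mask & (0b1000 >> cc))

def rx_address(d2, x2, b2, regs, address_bits=24):
    """Effective address of an RX operand written D2(X2,B2)."""
    # Register number 0 as index or base means "no register contributes".
    index = regs[x2] if x2 else 0
    base = regs[b2] if b2 else 0
    return (d2 + index + base) & ((1 << address_bits) - 1)

regs = [0] * 16
regs[7], regs[8] = 0x1000, 0x200
print([cc for cc in range(4) if bc_taken(9, cc)])    # condition codes that branch for mask 9
print(hex(rx_address(20, 7, 8, regs)))               # operand address of A R5,20(R7,R8)
```

Seen this way, mask 9 is simply the OR of the masks for condition codes zero and three, which is how one BC covers two conditions on one line.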
hopcode 06 Jan 2010, 05:34
Something else about the weirdness of Intel opcodes...
In an early '86 stage they implemented, in the group F6, different instructions which don't follow the MOD byte rules; example A:
Code:
F6C1 10    TEST CL,10
F6D1       NOT CL
But at almost the same time, the developers of the math co-processor (more INTEL-ligent persons, I think), starting from the fact that the MOD byte is a good way to re-functionalize already functional opcodes, adopted it, in that same early stage, to encode FPU instructions as follows; example B:
Code:
D805 88204000    FADD DWORD PTR DS:[402088]
D8C1             FADD ST,ST(1)
where the MOD byte is fully functional as a MOD byte. Needless to say, this was a really forward-compatible feature. As I said: they had their own reasons to invent such a chaotic patchwork as shown in example A, in which one must check the MOD byte not to find information about what follows, but to separate the TEST instruction from the other ones of the group. This is what I call "ballast", to get rid of.
As you know, assembler is not an option. It is the Way to talk to machines. And as you know, bad fundamentals lead to squared impossibility!!! Paradoxically, by following their patchwork layout, I have found a new way to read the length of instructions in my rebuilt LDE. At the moment I am not publishing it, because it is in the test phase and because I am too busy in userland. I think it could be the smallest LDE in the world, but I will only claim that when I publish it, in my di-fasm thread. Anyway, I remain curious to know their (good/bad) reasons for that patchwork, whatever they are, because they are (needless to say) part of their success.
Greetings,
hopcode
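The point about example A, that a length decoder has to look inside the MOD byte before it knows whether an immediate follows, can be sketched in Python as below. The function name and tables are hypothetical, only register operands (mod=11) are handled, and /1 is left out because Intel's opcode map does not document it.

```python
REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]

# Group selected by the reg field of the byte following F6.
GROUP_F6 = {0: "test", 2: "not", 3: "neg", 4: "mul", 5: "imul", 6: "div", 7: "idiv"}

def decode_f6(code):
    """Return (mnemonic, register, length) for an F6 instruction with mod=11."""
    if code[0] != 0xF6:
        raise ValueError("not an F6 group instruction")
    modrm = code[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 3:
        raise ValueError("sketch handles register operands only")
    if reg not in GROUP_F6:
        raise ValueError("/1 is not covered by this sketch")
    mnemonic = GROUP_F6[reg]
    # Only TEST carries an imm8, so the length depends on the reg field.
    length = 2 + (1 if mnemonic == "test" else 0)
    return mnemonic, REG8[rm], length

print(decode_f6(bytes.fromhex("F6C110")))
print(decode_f6(bytes.fromhex("F6D1")))
```

A table-driven length decoder therefore needs a special case for F6 (and for its word-sized sibling F7), rather than a single length entry per opcode byte.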
bitRAKE 06 Jan 2010, 08:13
Similar patchwork can be seen in the 680x0 encoding.
http://aggregate.org/ has some research on assembler/compiler creation, looking at the problem from several perspectives.
If we were engineering extensions to x86 from an existing design, we would have detailed knowledge of how each encoding flows through. Every design will have delays to synchronize resources. In some places we would see complex or seldom-used instructions which would not be impacted by neighboring encodings.
Just like managing a group of people toward a coordinated goal: some have special talents, others just get the job done, and a few are just plain slow (despite incentives). I wish social scheduling were as simple as one of the computer scheduling algorithms, but they are historically related. Complexity results for scheduling problems.
Maybe peel off the decoder to reveal the macro-/micro-ops used by the core, emulating the x86 decoder in software (a la Transmeta (owned one) or Itanium). This could be done every few years as cores abstract away previous encodings. The common abstraction does open the door for other players, despite Intel's success.
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.