OK, this may be not only unstructured but utterly chaotic; still, it is my pleasure to provide some comments.
I did not make any notes, so this is all just from memory - please forgive me if I mess up some small details.
- When I was starting fasm's x86-64 implementation back in 2004, I was quite impressed with the design of the REX prefix: they so elegantly reused the 4xh opcode space (which in 16/32-bit modes held the short encodings of "INC reg" and "DEC reg") to get a prefix with exactly the four bits that were needed for all the long mode stuff.
However, now I must confess that I am even more impressed with the cleverness of the VEX prefix design.
They used the opcodes of the LDS and LES instructions (which do not exist in long mode, so there they are available), but in such a way that in 16/32-bit modes the postbyte has its two highest bits set - which means, as anyone with x86 opcode experience should know, that both operands are registers. Such an encoding was never valid for LDS/LES, as they accepted only the "reg,mem" syntax. So this is an encoding that was unused, and the VEX prefixes do not cause any conflict in 16/32-bit modes either.
- How did they manage to ensure that the two highest bits are always set in modes other than long mode? Well, they put there bits that outside of long mode always have the value 0 (like the highest bit of a 4-bit register selector: outside long mode only 3-bit register selection is allowed, so that bit is indeed always 0), and then they inverted those fields (as in a NOT operation; the bit order is not reversed). Simple!
We've got two VEX prefixes. The one that uses the opcode of the LDS instruction is the two-byte one (the first byte is C5h): the highest bit of its postbyte is the R bit (the same one as in REX, but inverted), and the four bits just below it are the additional register selector (the non-destructive source for three-operand instructions), inverted as a whole. This way the two highest bits hold the inverted R bit (which is the highest bit of a register selector) and the highest bit of the inverted third register selector. They both have the value 1 (inverted from 0) outside long mode, so we always get a non-LDS encoding this way.
The VEX that uses the opcode of LES is the three-byte one (so the C4h byte is followed by two more), and the two highest bits of the byte that directly follows C4h are the inverted R and X from REX. Again, both are always 1 outside long mode. No conflicts!
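The bit layout described above can be sketched in a few lines of Python. This is only an illustration, not anything taken from fasm: the function names are made up here, and the field names (vvvv for the extra register selector, pp for the implied prefix, mmmmm for the opcode plane) and the exact positions of L and pp follow Intel's documentation rather than anything stated in this post.

```python
def vex2(r, vvvv, l, pp):
    """Byte that follows C5h in the two-byte VEX (hypothetical helper)."""
    return ((~r & 1) << 7) | ((~vvvv & 0xF) << 3) | ((l & 1) << 2) | (pp & 3)


def vex3(r, x, b, mmmmm, w, vvvv, l, pp):
    """The two bytes that follow C4h in the three-byte VEX (hypothetical helper)."""
    byte1 = ((~r & 1) << 7) | ((~x & 1) << 6) | ((~b & 1) << 5) | (mmmmm & 0x1F)
    byte2 = ((w & 1) << 7) | ((~vvvv & 0xF) << 3) | ((l & 1) << 2) | (pp & 3)
    return byte1, byte2


def looks_like_register_form(postbyte):
    """True when the two highest bits are set, i.e. what would be the
    register-register form of a postbyte, which LDS/LES never accepted."""
    return postbyte >> 6 == 0b11


# Outside long mode only registers 0-7 exist, so R = X = 0 and vvvv < 8.
# After inversion the two highest bits are therefore always 1.
for reg in range(8):
    assert looks_like_register_form(vex2(0, reg, 0, 0))
assert looks_like_register_form(vex3(0, 0, 0, 1, 0, 0, 0, 0)[0])
```

With R=1 or vvvv >= 8 (possible only in long mode) the top bits can be cleared, which is harmless there because LDS and LES do not exist in long mode.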
- The VEX prefix can in general be used with the SSE instructions that operate on XMM registers. MMX instructions that operate on MM registers cannot be encoded with VEX (their slowly proceeding deprecation already started with x86-64, where the FPU stack size was kept at 8 and thus no MM8-MM15 registers were introduced; I would not even be surprised if in the future we see a processor with SSE/AVX, but no FPU/MMX anymore).
- Even though VEX is itself 2-3 bytes long, it does not make the instruction code longer. How did they manage that? First, VEX is able to encode all the bits from REX - W, R, X, and B - so you don't use REX anymore when you have VEX. Second, there are two bits in VEX that encode whether the instruction opcode is preceded by 66h, 0F2h, 0F3h, or no prefix, so this prefix is also encoded in VEX and you don't provide it as a separate byte. Third, there is a bit field in the three-byte VEX that specifies the opcode plane where the instruction resides - either the space of opcodes that were preceded by the 0Fh byte, or by the 0Fh,38h or 0Fh,3Ah combination (the field is 5 bits long, so there is plenty of space for future extensions) - therefore the instruction code that follows VEX is just a single byte and it is no longer preceded by stuff like 66h+0Fh; VEX is enough. And since VEX is used in general with SSE instructions, which always had at least two bytes of this stuff, VEX generally succeeds in keeping the instruction code at least as short as it was without VEX. And if an SSE instruction required REX because of the R bit (as when the destination register is in the XMM8-XMM15 range), with VEX it can even become shorter by one byte (because the R bit can be encoded with the two-byte VEX, so we get rid of REX and of the 66h and 0Fh bytes, and use just the two-byte VEX instead).
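To make the byte counting concrete, here is a small Python sketch that assembles the register-register form of ADDPD both ways. The helper names are invented for this illustration, and the opcode value (58h in the 0Fh plane with a 66h prefix) and the pp code for 66h (01) come from Intel's documentation, not from this post.

```python
def legacy_addpd(dst, src):
    """ADDPD xmm_dst, xmm_src with the legacy SSE encoding (registers 0-15)."""
    rex = 0x40 | ((dst >> 3) << 2) | (src >> 3)      # REX.R from dst, REX.B from src
    modrm = 0xC0 | ((dst & 7) << 3) | (src & 7)      # register-register postbyte
    code = [0x66]
    if rex != 0x40:                                  # REX is needed only for XMM8-XMM15
        code.append(rex)
    return bytes(code + [0x0F, 0x58, modrm])


def vex_addpd(dst, src1, src2):
    """VADDPD xmm_dst, xmm_src1, xmm_src2 with the two-byte VEX."""
    if src2 > 7:
        raise ValueError("the B bit would require the three-byte VEX")
    r = dst >> 3
    vex = ((~r & 1) << 7) | ((~src1 & 0xF) << 3) | 0b01   # L=0, pp=01 stands for 66h
    modrm = 0xC0 | ((dst & 7) << 3) | (src2 & 7)
    return bytes([0xC5, vex, 0x58, modrm])


print(legacy_addpd(0, 1).hex(), vex_addpd(0, 0, 1).hex())   # both 4 bytes
print(legacy_addpd(8, 1).hex(), vex_addpd(8, 8, 1).hex())   # 5 bytes vs 4 bytes
```

In the second pair the legacy form needs 66h, REX, 0Fh, the opcode and the postbyte, while the VEX form folds the first three into C5h plus one byte.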
- Old SSE instructions that work with the new VEX encoding have their mnemonic preceded with the letter V when you want to use VEX. This is not only because the new encoding wouldn't work on older hardware - there is also a subtle difference in what the instruction does. Analogously to how 32-bit operations in long mode zero the upper bits of the 64-bit register, instructions that use the VEX encoding and operate on a 128-bit XMM register zero the upper bits of the 256-bit YMM register that the given XMM is a part of, while the legacy SSE instructions encoded without VEX do not modify the upper bits of the YMM register (just like the 32-bit instructions in legacy mode of the x86-64 architecture do not modify the upper bits of a 64-bit register).
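The difference in how the upper half is treated can be shown with a toy model, where a YMM register is just a Python integer of up to 256 bits. The function names are hypothetical; this only models the write-back rule described above, not any actual instruction.

```python
MASK128 = (1 << 128) - 1


def write_xmm_legacy(ymm, value128):
    """Legacy SSE write: the low 128 bits are replaced, the upper 128 bits are kept."""
    return (ymm & ~MASK128) | (value128 & MASK128)


def write_xmm_vex(ymm, value128):
    """VEX-encoded 128-bit write: the upper 128 bits of the YMM register become zero."""
    return value128 & MASK128


ymm0 = (0xABCD << 128) | 0x1111
assert write_xmm_legacy(ymm0, 0x2222) == (0xABCD << 128) | 0x2222
assert write_xmm_vex(ymm0, 0x2222) == 0x2222
```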
- And this is still not the end of the differences between the same instruction encoded with and without VEX. With VEX you get another bit field that encodes a third register operand, and thus a VEX-encoded SSE instruction usually takes three parameters - one destination and two sources. Of course, if you want the same operation as with old SSE, you just specify the destination to be the same as the first source. And there are some exceptions, like the SQRT operation, where there is just one source anyway, so we stay with two operands (and the register field in VEX is not used).
- There are even instructions that take four independent register operands. In such a case the fourth register is encoded in the top four bits of the immediate byte. VBLENDVPD is an example of such an instruction.
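As an illustration of where the four registers end up, the following Python sketch assembles the register-only XMM form of VBLENDVPD. The function name is made up, and the opcode (4Bh in the 0Fh,3Ah plane, with the 66h prefix implied by pp=01 and W=0) is taken from Intel's documentation rather than from this post.

```python
def vblendvpd_xmm(dst, src1, src2, mask):
    """VBLENDVPD xmm_dst, xmm_src1, xmm_src2, xmm_mask (registers 0-15)."""
    for reg in (dst, src1, src2, mask):
        if not 0 <= reg <= 15:
            raise ValueError("register number out of range")
    # Byte after C4h: inverted R (from dst), inverted X (unused, so 1),
    # inverted B (from src2), and opcode plane 3 = 0Fh,3Ah.
    byte1 = ((~(dst >> 3) & 1) << 7) | (1 << 6) | ((~(src2 >> 3) & 1) << 5) | 0b00011
    # Next byte: W=0, inverted src1 selector, L=0 (128-bit), pp=01 for 66h.
    byte2 = ((~src1 & 0xF) << 3) | 0b01
    modrm = 0xC0 | ((dst & 7) << 3) | (src2 & 7)
    is4 = mask << 4                      # fourth register in the top four bits
    return bytes([0xC4, byte1, byte2, 0x4B, modrm, is4])


print(vblendvpd_xmm(0, 1, 2, 3).hex())   # c4e3714bc230
```

So the destination sits in the postbyte register field, the first source in the VEX register selector, the second source in the postbyte r/m field, and the fourth register in the immediate byte.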
- The L bit in VEX chooses between the 128-bit and 256-bit operation (that is, between operating on an XMM register or on the wider 256-bit YMM register that contains it). This is generally allowed only with operations on packed floating-point values. Operations on single floating-point values obviously don't need a special syntax to operate on YMM (when you use the VEX encoding, the upper portion of YMM is zeroed anyway). And the operations on packed integers (that is: MMX instructions that were later updated to work on XMM registers, and now are updated again to use the VEX encoding) are not allowed to operate on YMM, for a reason unknown to me (perhaps that would be too expensive to implement in hardware?).
- Among other bits inherited from REX, VEX contains the W bit. Of course, in the case of instructions that use general-purpose registers it may still sometimes be used in the traditional way, to select between a 32-bit and a 64-bit register, as in the case of the VCVTSD2SI instruction. But for the majority of AVX instructions this bit serves no purpose.
A hint that Intel had some ideas about how to use this bit in other cases can be found in the encodings of instructions that they dropped, but which AMD still plans to implement in the XOP extension with exactly the same encodings that Intel intended for them. They are the FMA4 and VPERMIL2PD/VPERMIL2PS instructions.
These instructions take four register operands and, as usual with such instructions in AVX, the second source (that is: the third operand) can be a register or memory, and the fourth register is encoded in the immediate byte. However, when W=1 is set in VEX, the same instruction is encoded with the fourth operand as the r/m and the third operand as the additional register in the immediate byte.
So this is an example showing that for four-operand instructions the W bit would allow choosing between the "reg,reg,r/m,reg" and "reg,reg,reg,r/m" versions of an instruction. However, after Intel dropped those instructions, there is no longer any hint of such a purpose for the W bit in Intel's AVX. It survived in AMD's XOP, however.
- And since XOP has been mentioned... AMD had a hard time, with its SSE5 being knocked out by AVX before it even made it into silicon (well, that may have been fortunate, actually) and with having to search for some way of encoding AMD-specific instructions that would be as good as VEX, but separate from the Intel-claimed territory.
They designed the XOP prefix encoding, which mimics the 3-byte VEX encoding. It has all the same bit fields as VEX, all in the same places, and the first byte is 8Fh. As you may know, 8Fh is an opcode extension, with the POP instruction encoded by the postbyte register field value of 0. This means that to get an unused encoding, we need to have at least 1 there, and for this reason the five-bit field that resides in the lowest bits of the second XOP byte (which mimics the five-bit field in VEX that selects between the various opcode planes) always has the value 8 or 9 (with space for a few more - way less than in the VEX case, however). This is the only real difference between the XOP and VEX prefix features. The XOP prefix keeps the inversion of some of the bit fields the same as in the case of VEX, even though it is not needed in the case of XOP.
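The reason why a plane number of 8 or more is enough can be checked mechanically. In the sketch below (the helper names are invented for this illustration) the byte after 8Fh is read both as a postbyte and as an XOP byte: the postbyte register field occupies bits 5-3, and the two highest bits of the five-bit plane field overlap with its two lowest bits.

```python
def postbyte_reg_field(byte):
    """Bits 5-3 of a postbyte: the field that selects POP (value 0) after 8Fh."""
    return (byte >> 3) & 7


def xop_plane(byte):
    """The five lowest bits of the byte after 8Fh, read as the XOP plane field."""
    return byte & 0x1F


# Every byte whose plane field is 8 or more has a non-zero register field,
# so it can never be mistaken for the postbyte of POP.
assert all(postbyte_reg_field(b) != 0 for b in range(256) if xop_plane(b) >= 8)

# With a plane below 8 a collision is possible, e.g. C0h is the postbyte of POP EAX.
assert postbyte_reg_field(0xC0) == 0 and xop_plane(0xC0) < 8
```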
Generally: XOP prefix is made to mimic VEX, but because of that it is not as elegant as VEX is.
- AMD uses the W bit in the XOP prefix to select which of the source operands is encoded as r/m - just as it originally was with some of the AVX instructions (the ones that AMD has now included in the XOP specification, after Intel dropped them from AVX).
- The correct detection of some AVX features requires the XGETBV instruction - this instruction is not a part of AVX itself and was introduced into the Intel 64 architecture in parallel with the design of AVX. This is a catch I did not notice at first, and so initially I implemented AVX without having XGETBV recognized by fasm. I've been fooled.
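For context, the decision that a detection routine has to make can be sketched as follows. This is not something described in this post: the bit positions (OSXSAVE and AVX in ECX of CPUID leaf 1, the SSE and AVX state bits in XCR0) follow Intel's documentation, the names are made up, and since Python cannot execute CPUID or XGETBV itself, the register values are passed in as plain integers.

```python
CPUID1_ECX_OSXSAVE = 1 << 27   # OS has enabled XSAVE/XGETBV
CPUID1_ECX_AVX = 1 << 28       # processor supports AVX
XCR0_SSE_STATE = 1 << 1        # OS saves and restores XMM state
XCR0_AVX_STATE = 1 << 2        # OS saves and restores the upper halves of YMM


def avx_usable(cpuid1_ecx, read_xcr0):
    """Decide whether AVX can be used.

    cpuid1_ecx: value of ECX after CPUID with EAX=1.
    read_xcr0:  callable standing in for XGETBV with ECX=0; it is called
                only when OSXSAVE is set, because XGETBV is not available
                otherwise.
    """
    needed = CPUID1_ECX_OSXSAVE | CPUID1_ECX_AVX
    if (cpuid1_ecx & needed) != needed:
        return False
    state = XCR0_SSE_STATE | XCR0_AVX_STATE
    return (read_xcr0() & state) == state


print(avx_usable(CPUID1_ECX_OSXSAVE | CPUID1_ECX_AVX, lambda: 0b111))   # True
print(avx_usable(CPUID1_ECX_OSXSAVE | CPUID1_ECX_AVX, lambda: 0b011))   # False
```

The second call models a processor that has AVX but an operating system that does not preserve the YMM state, which is the case that CPUID alone cannot reveal.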