flat assembler
Message board for the users of flat assembler.

JCC Erratum fix in FASM?

hellomachine



Joined: 18 May 2023
Posts: 22
Location: I don't even exist
hellomachine 24 May 2023, 19:47
Hello. There is an issue known as the "JCC erratum" that is handled by padding code (similar to 'align', but not the same). It matters a lot, because it affects how a function performs compared to code generated by C compilers (GCC and others).
I wrote a function and everything worked fine, but it was 1.5x slower than the same function in C (same behaviour, but generated by GCC).
Is there a solution for the "JCC erratum" in FASM? (There is one in the GNU assembler.)
Reference:
https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf

I think it's possible to create a macro for this problem. Please pay attention to this, it is very important for performance!

_________________
JESUS _F_ CHRIST
macomics



Joined: 26 Jan 2021
Posts: 805
Location: Russia
macomics 24 May 2023, 22:14
For high-level languages that cannot directly operate on the location of code in memory, these optimizations are useful. But you write in assembler and you can arrange the code as you like. Nothing prevents you from performing code alignment according to certain rules. Write macros that solve this problem, and you can put them in this topic.

Judging by the list of processors, in 11-12 gens this problem has already been cured.

hellomachine 25 May 2023, 06:17
macomics wrote:
For high-level languages that cannot directly operate on the location of code in memory, these optimizations are useful. But you write in assembler and you can arrange the code as you like. Nothing prevents you from performing code alignment according to certain rules. Write macros that solve this problem, and you can put them in this topic.

Judging by the list of processors, in 11-12 gens this problem has already been cured.


Yes, but most CPUs in use are older generations (check datacenters and you will find that most CPUs are older gens). As for assembly and how to write code, I'm sure over 95% of assembly programmers don't know anything about code alignment (each branch target (label) should be aligned on a 16-byte boundary), so their functions end up slower than the ones generated by a C/C++ compiler. And even with the branch targets (labels and so on) aligned to 16 bytes, there are other problems like this 'JCC erratum'.

I'm doing my best to draw attention to code alignment for those who care about assembly (and being a god).

I'm creating a macro that takes the offset of a test|cmp (where the compare starts) and the offset of the end of its jcc instruction, then checks whether the test|cmp and the jcc are in the same 32-byte chunk of code.

But it's better to expand your code (for example, convert 32-BYTE ops to 64-BYTE) to achieve this.



Last edited by hellomachine on 25 May 2023, 07:36; edited 1 time in total

hellomachine 25 May 2023, 07:24
Done, I created a macro to fix the 'JCC erratum' problem.

Code:
macro fix.jcc_erratum _offset_end_ {
 local padding, rem
  padding = 0
  rem = 32 - ($ and 31)
 if rem < (_offset_end_ - $)
  padding = rem
 end if
 repeat padding
  db 0x2e
 end repeat }
    


I hope Mr. Tomasz Grysztar considers this solution an important one.

How to use this macro:

Put this macro before each cmp|test (any instruction that is followed by a JCC instruction: jg, jl, jb, je, ...).
After the JCC instruction, put a @@ label to mark the end of the cmp|test + JCC range, and pass it to the macro.

For example:

Code:
func1:
        fix.jcc_erratum @f
        test eax, eax
        jz   .some_label
 @@:    ...    
    


Multi usage example:

Code:
func1:
        fix.jcc_erratum @f
        test eax, 1
        jz   .some_label
@@: mov edx, 0
        fix.jcc_erratum @f
        test eax, 2
        jz   .some_label2
@@: ...
    


This macro is meant to make sure a cmp|test and its JCC instruction are in the same 32-byte chunk of code.
You can disable it (by CPU detection or otherwise) if the CPU is not on the list of affected CPUs.

macomics 25 May 2023, 07:34
That's just the document says not about specific offsets, but about the intersection of the cache line. But the length of the line depends on the processor. And this cannot be taken into account in macros. You will have to do the alignment with explicit indication of this length. For processors of the 11th-12th gen, the line length is 64 bytes, before the 8th gen, the lines were 16-byte, and in these gens, 32-byte lines were made. And this led to these errors in the execution of the code.

No matter how much you would like, but it will not work correctly to do everything for all processors. You will have to do something manually anyway. Macros are not omnipotent. To begin with, you start with a pair of test+jcc commands, but there is still a huge set of commands that should not cross the boundaries of cache lines.

cmp+jcc,jmp,call,ret,...

macomics 25 May 2023, 07:39
hellomachine wrote:
Code:
macro fix.jcc_erratum _offset_end_ {
 local padding, rem
  padding = 0
  rem = 32 - ($ and 31)
 if rem < (_offset_end_ - $)
  padding = rem
 end if
 repeat padding
  db 0x2e
 end repeat }    
And now check the action of this macro on the following sequence of commands
Code:
align 64
func:
    nop
    fix.jcc_erratum offset_end
    test byte [rsi], 3
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    jnz @f
offset_end:    
or
Code:
align 64
func:
    nop
    scas byte [rdi]
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    jnz @f
offset_end:    
Processors limit the length of an instruction to 15 bytes. Also, some models limit the number of prefixes per instruction.

hellomachine 25 May 2023, 07:50
macomics wrote:
That's just the document says not about specific offsets, but about the intersection of the cache line. But the length of the line depends on the processor. And this cannot be taken into account in macros. You will have to do the alignment with explicit indication of this length. For processors of the 11th-12th gen, the line length is 64 bytes, before the 8th gen, the lines were 16-byte, and in these gens, 32-byte lines were made. And this led to these errors in the execution of the code.

No matter how much you would like, but it will not work correctly to do everything for all processors. You will have to do something manually anyway. Macros are not omnipotent. To begin with, you start with a pair of test+jcc commands, but there is still a huge set of commands that should not cross the boundaries of cache lines.

cmp+jcc,jmp,call,ret,...


Of course there are many things to handle, but here we are handling the JCC problem by putting a CMP and its JCC in the same cache line (32-byte). And as I said, you can disable it by undefining the macro and defining an empty one if the target CPU is not an old one. So I think we can handle everything with macros.
It is certainly better to expand an operation (for example, convert a 32-BYTE op to 64-BYTE to have more code for padding), but that takes time for those who just want to focus on how their code and functions work rather than on code-alignment performance (they just use this macro and are done).
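
A minimal sketch of the "define an empty one" idea, assuming FASM 1 syntax where a later macro definition with the same name overrides the earlier one. The placement (after the original definition, before any use) and the tiny func1 body are assumptions made only for illustration:

```asm
; Assumed to appear after the original fix.jcc_erratum definition.
; The empty body turns every later fix.jcc_erratum line into a no-op.
macro fix.jcc_erratum _offset_end_ { }

use64
func1:
        fix.jcc_erratum @f       ; now emits no padding bytes
        test eax, eax
        jz   func1
 @@:    ret
```

The call sites stay untouched, so the same source can be built with or without the padding.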


hellomachine 25 May 2023, 07:56
macomics wrote:
hellomachine wrote:
Code:
macro fix.jcc_erratum _offset_end_ {
 local padding, rem
  padding = 0
  rem = 32 - ($ and 31)
 if rem < (_offset_end_ - $)
  padding = rem
 end if
 repeat padding
  db 0x2e
 end repeat }    
And now check the action of this macro on the following sequence of commands
Code:
align 64
func:
    nop
    fix.jcc_erratum offset_end
    test byte [rsi], 3
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    jnz @f
offset_end:    
or
Code:
align 64
func:
    nop
    scas byte [rdi]
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    jnz @f
offset_end:    
Processors limit the length of an instruction to 15 bytes. Also, some models limit the number of prefixes per instruction.


This macro will not add anything between a cmp and its JCC! It adds the padding before the start of the cmp.
And I don't think it would be a good idea to put anything between a cmp and its JCC.

Peter_Cordes



Joined: 25 May 2023
Posts: 4
Peter_Cordes 25 May 2023, 08:49
macomics wrote:
That's just the document says not about specific offsets, but about the intersection of the cache line.


No, the JCC erratum is based on 32-byte boundaries because that's how the Sandybridge-family uop cache is indexed, with up to 3 lines from the same 32-byte chunk of x86 machine code.

Read Agner Fog's microarch PDF ( https://agner.org/optimize/microarchitecture.pdf), specifically the Sandybridge chapter, since that's when the uop cache was new. ( https://www.realworldtech.com/sandy-bridge/ )

Quote:
But the length of the line depends on the processor. And this cannot be taken into account in macros. You will have to do the alignment with explicit indication of this length. For processors of the 11th-12th gen, the line length is 64 bytes, before the 8th gen, the lines were 16-byte, and in these gens, 32-byte lines were made. And this led to these errors in the execution of the code.


No, Intel CPUs have used 64-byte cache lines since Pentium-M and Pentium 4. Before that it was 32-byte lines, in Pentium 3 and earlier. I'm not aware of any Intel CPUs having 16-byte cache lines. They fetch code from L1i cache in 16-byte chunks, but that's not a whole line.

Fortunately that's not relevant, only Skylake had an erratum that microcode worked around by disabling the uop-cache for chunks where a branch touched the end of the 32-byte chunk. See https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf and https://stackoverflow.com/questions/61256646/how-can-i-mitigate-the-impact-of-the-intel-jcc-erratum-on-gcc

(Skylake was 6th-gen, i7-6xxx. The same microarchitecture was used up until Ice Lake, so some 10th-gen CPUs are non-Skylake. Intel hasn't consistently used different generation numbers for different microarchitectures recently, once they got stalled on re-spins of Skylake. e.g. Comet Lake from 2019 was still Skylake and was sold as 10th-gen, alongside Ice Lake 10th-gen in some laptops. https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(client) vs. https://en.wikichip.org/wiki/intel/microarchitectures/comet_lake )

Quote:

No matter how much you would like, but it will not work correctly to do everything for all processors.


Agreed that this macro is too simplistic, though. As you say, the max length of any single instruction (including `0x2e` prefixes) is 15 bytes, and this macro if used naively will try to pad far too many prefixes.

It will also kill performance on some non-Skylake CPUs: Silvermont family especially can only decode efficiently when an instruction has at most 3 prefixes, including the 0F escape byte for early Silvermont at least. That's why Intel recommends that compilers/assemblers implement the workaround by padding multiple instructions separately, not one very long instruction.

Also yes, the macro needs to know the start and end of the branch instructions since they're not all the same length, so a label at the start and end of the branch could work as inputs.
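
As a sketch of the "label at the start and end of the branch" idea, a macro that only checks (rather than pads) could look like the following. The name check.jcc_erratum and the labels pair_start / pair_end are made up for illustration, and it assumes flat binary output with org 0 so that label values are plain numbers:

```asm
; Hypothetical helper (not from the thread): warn when a branch is affected.
;   _start_  label at the first byte of the cmp/test (or of a lone jmp/call/ret)
;   _end_    label placed right after the last byte of the branch
macro check.jcc_erratum _start_, _end_ {
  ; Affected if the branch crosses a 32-byte boundary or ends exactly on one,
  ; i.e. its first byte and the byte after it lie in different 32-byte chunks.
  if ((_start_) shr 5) <> ((_end_) shr 5)
    display 'JCC erratum: branch touches a 32-byte boundary', 13, 10
  end if
}

use64
org 0                            ; flat binary, labels are absolute numbers

        times 28 nop             ; occupies 0x00..0x1B
pair_start:
        test eax, eax            ; 0x1C..0x1D
        jz   pair_start          ; 0x1E..0x1F, ends on the boundary
pair_end:                        ; 0x20
        check.jcc_erratum pair_start, pair_end
```

With 28 NOPs the pair ends exactly on the boundary, so the message should be displayed; with 26 NOPs the pair ends at 0x1E and the check stays silent. It only reports the problem; where to put the padding is still up to the programmer.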

hellomachine 25 May 2023, 09:04
Peter_Cordes wrote:
macomics wrote:
That's just the document says not about specific offsets, but about the intersection of the cache line.


No, the JCC erratum is based on 32-byte boundaries because that's how the Sandybridge-family uop cache is indexed, with up to 3 lines from the same 32-byte chunk of x86 machine code.

Read Agner Fog's microarch PDF ( https://agner.org/optimize/microarchitecture.pdf), specifically the Sandybridge chapter, since that's when the uop cache was new. ( https://www.realworldtech.com/sandy-bridge/ )

Quote:
But the length of the line depends on the processor. And this cannot be taken into account in macros. You will have to do the alignment with explicit indication of this length. For processors of the 11th-12th gen, the line length is 64 bytes, before the 8th gen, the lines were 16-byte, and in these gens, 32-byte lines were made. And this led to these errors in the execution of the code.


No, Intel CPUs have used 64-byte cache lines since Pentium-M and Pentium 4. Before that it was 32-byte lines, in Pentium 3 and earlier. I'm not aware of any Intel CPUs having 16-byte cache lines. They fetch code from L1i cache in 16-byte chunks, but that's not a whole line.

Fortunately that's not relevant, only Skylake had an erratum that microcode worked around by disabling the uop-cache for chunks where a branch touched the end of the 32-byte chunk. See https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf and https://stackoverflow.com/questions/61256646/how-can-i-mitigate-the-impact-of-the-intel-jcc-erratum-on-gcc

(Skylake was 6th-gen, i7-6xxx. The same microarchitecture was used up until Ice Lake, so some 10th-gen CPUs are non-Skylake. Intel hasn't consistently used different generation numbers for different microarchitectures recently, once they got stalled on re-spins of Skylake. e.g. Comet Lake from 2019 was still Skylake and was sold as 10th-gen, alongside Ice Lake 10th-gen in some laptops. https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(client) vs. https://en.wikichip.org/wiki/intel/microarchitectures/comet_lake )

Quote:

No matter how much you would like, but it will not work correctly to do everything for all processors.


Agreed that this macro is too simplistic, though. As you say, the max length of any single instruction (including `0x2e` prefixes) is 15 bytes, and this macro if used naively will try to pad far too many prefixes.

It will also kill performance on some non-Skylake CPUs: Silvermont family especially can only decode efficiently when an instruction has at most 3 prefixes, including the 0F escape byte for early Silvermont at least. That's why Intel recommends that compilers/assemblers implement the workaround by padding multiple instructions separately, not one very long instruction.

Also yes, the macro needs to know the start and end of the branch instructions since they're not all the same length, so a label at the start and end of the branch could work as inputs.


Thank you. For the macro, I'll try to add a "raise exception" for when there is a problem with the prefix length (for example, the error message 'Something wrong with fix.jcc_erratum Macro, please handle your code padding manually').


Peter_Cordes 25 May 2023, 09:26
hellomachine wrote:

Code:
func1:
        fix.jcc_erratum @f
        test eax, eax
        jz   .some_label
 @@:    ...    
    




That's the wrong place to put the padding. That lengthens the test/JCC macro-fused instruction pair so it still starts before the end of a 32-byte block and ends after.

I tested with

Code:
;format ELF64

macro fix.jcc_erratum _offset_end_ {
 local padding, rem
  padding = 0
  rem = 32 - ($ and 31)
 if rem < (_offset_end_ - $)
  padding = rem
 end if
 repeat padding
  db 0x2e
 end repeat }

;section '.text' executable align 32

align 32 

   times 26 nop
func1:
        add eax, eax
        fix.jcc_erratum @f
        test eax, eax
        jz  func1 
 @@:    add edx, edx
    


With `ndisasm -b16 jcc.bin`, I get

Code:
00000000  90                nop
...
00000018  90                nop
00000019  90                nop
0000001A  6601C0            add eax,eax
0000001D  2E2E2E6685C0      cs test eax,eax       ;;;; TEST/JZ spanning 32-byte boundary
00000023  74F5              jz 0x1a
00000025  6601D2            add edx,edx
    


You need the padding to be before an earlier instruction, like the `add eax,eax`. But you still need to tell the macro the length of the test/jnz or the call [mem] or whatever. And as I mentioned before, the max instruction length of 15 bytes can't be exceeded, and you don't want more than 3 total prefixes on any one instruction, so this isn't something you can easily do with a single macro.
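
A minimal sketch of what padding an earlier instruction could look like, written by hand rather than by a macro. It assumes 64-bit code (use64) in a flat binary at org 0, not the 16-bit decode shown above, and the count of 27 NOPs is an assumption chosen so that exactly 3 bytes of padding are needed:

```asm
use64
org 0

        times 27 nop             ; 0x00..0x1A
func1:
        db 0x2e, 0x2e, 0x2e      ; 0x1B..0x1D: three CS prefixes, decoded as part of the add
        add  eax, eax            ; 0x1E..0x1F
        test eax, eax            ; 0x20..0x21: the fused pair now starts a new 32-byte chunk
        jz   func1               ; 0x22..0x23
        add  edx, edx
```

Without the db line the add would sit at 0x1B..0x1C, the test at 0x1D..0x1E and the jz at 0x1F..0x20, so the pair would cross the boundary. If more than 3 bytes were needed, the rest would have to go on yet another earlier instruction (or a NOP), which is the part a single macro call cannot decide on its own.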


---

Also, is there a way to use this for output formats other than flat binary? With `format ELF` or `format ELF64`, I got this error, because it doesn't know the absolute address of `$` as an assemble-time constant. (And despite the ALIGN 32, it doesn't realize that $ and 31 can be computed at assemble time)

Code:
jcc.asm [6] fix.jcc_erratum [3]:
  rem = 32 - ($ and 31)
processed: rem?1=32-($ and 31)
error: invalid use of symbol.
    


If there isn't a workaround for this, it's not very usable for more real-world stuff, even if you can change the design to do anything useful.

hellomachine 25 May 2023, 09:39
I'm working on an example for ELF64 EXECUTABLE ...

macomics 25 May 2023, 10:16
Peter_Cordes wrote:
No, the JCC erratum is based on 32-byte boundaries because that's how the Sandybridge-family uop cache is indexed, with up to 3 lines from the same 32-byte chunk of x86 machine code.

Read Agner Fog's microarch PDF ( https://agner.org/optimize/microarchitecture.pdf), specifically the Sandybridge chapter, since that's when the uop cache was new. ( https://www.realworldtech.com/sandy-bridge/ )

Quote:

But the length of the line depends on the processor. And this cannot be taken into account in macros. You will have to do the alignment with explicit indication of this length. For processors of the 11th-12th gen, the line length is 64 bytes, before the 8th gen, the lines were 16-byte, and in these gens, 32-byte lines were made. And this led to these errors in the execution of the code.



No, Intel CPUs have used 64-byte cache lines since Pentium-M and Pentium 4. Before that it was 32-byte lines, in Pentium 3 and earlier. I'm not aware of any Intel CPUs having 16-byte cache lines. They fetch code from L1i cache in 16-byte chunks, but that's not a whole line.

Fortunately that's not relevant, only Skylake had an erratum that microcode worked around by disabling the uop-cache for chunks where a branch touched the end of the 32-byte chunk. See https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf and https://stackoverflow.com/questions/61256646/how-can-i-mitigate-the-impact-of-the-intel-jcc-erratum-on-gcc

(Skylake was 6th-gen, i7-6xxx. The same microarchitecture was used up until Ice Lake, so some 10th-gen CPUs are non-Skylake. Intel hasn't consistently used different generation numbers for different microarchitectures recently, once they got stalled on re-spins of Skylake. e.g. Comet Lake from 2019 was still Skylake and was sold as 10th-gen, alongside Ice Lake 10th-gen in some laptops. https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(client) vs. https://en.wikichip.org/wiki/intel/microarchitectures/comet_lake )
Thanks for the helpful links. I agree that Intel has a hodgepodge in their names. But I still know another 486 DX2 where a 16-bit cache was implemented, but there was no instruction cache (this was actually one of my old computers). Here I agree that I confused the data cache and the instruction cache.

hellomachine, one more thing I wanted to show you (in my previous examples). You use the prefix 0x2E (CS), but in instructions with memory accesses, this can also cause problems.

Peter_Cordes 25 May 2023, 10:43
macomics wrote:
But I still know another 486 DX2 where a 16-bit cache was implemented, but there was no instruction cache (this was actually one of my old computers). Here I agree that I confused the data cache and the instruction cache.


In P5 Pentium and 486, L1 cache was unified, not split instruction / data.
You're right that 486 used 16-byte cache lines; I found a couple independent mentions of that being true.
P5 Pentium used 32-byte lines, so did PPro / PII / PIII.

Quote:

hellomachine, one more thing I wanted to show you (in my previous examples). You use the prefix 0x2E (CS), but in instructions with memory accesses, this can also cause problems.


I think this was intended for x86-64, or 32-bit with a flat memory model. In that case, CS prefixes are fine except on instructions that use FS or GS, e.g. for thread-local storage.

A separate risk is a CS prefix with an opcode that can *never* have a memory operand, where it's plausible some future CPU might interpret that prefix+opcode as a different instruction, or give the prefix a special meaning. e.g. like Pentium 4 branch hints on JCC instructions, which were 2E and 3E, the same bytes as CS and DS prefixes https://stackoverflow.com/questions/14332848/intel-x86-0x2e-0x3e-prefix-branch-prediction-actually-used

As Intel notes, they're "reserved for future use". But due to compilers and assemblers using Intel's suggestion of 2E prefixes in front of many different non-memory instructions, binaries now exist in the wild assuming these prefixes will be ignored. Some of these are commercially-important, so there's a big incentive for CPU designers not to interpret them in a way that stops code from working, or makes it slow. Like with the `rep ret` thing that GCC used to work around AMD's K8 / K10 branch-prediction limitations for a 1-byte ret following another branch.
https://stackoverflow.com/questions/20526361/what-does-rep-ret-mean

Also related re: using other prefixes for padding, such as 66h https://stackoverflow.com/questions/74954166/using-the-operand-size-override-prefix-0x66-for-instruction-alignment

https://stackoverflow.com/questions/61319623/are-intel-tsx-prefixes-executed-safely-on-amd-as-nop - generally CPUs ignore prefixes they don't understand, and it's only retroactively that Intel / AMD document older CPUs as ignoring it. Similar to how REP NOP was re-purposed as PAUSE: https://stackoverflow.com/questions/7086220/what-does-rep-nop-mean-in-x86-assembly-is-it-the-same-as-the-pause-instru/33639300#33639300 and how REP BSF runs as TZCNT on newer CPUs, with interesting backward-compat opportunities for code that knows it has a non-zero value, but cares about running faster on AMD CPUs where BSF is slow.

Anyway, with Intel officially recommending 2E prefixes for this, it's fine except in homebrew osdev cases with non-zero segment bases, or if the instruction also had an FS or GS prefix.

hellomachine 25 May 2023, 10:56
Peter_Cordes wrote:

Also, is there a way to use this for output formats other than flat binary? With `format ELF` or `format ELF64`, I got this error, because it doesn't know the absolute address of `$` as an assemble-time constant. (And despite the ALIGN 32, it doesn't realize that $ and 31 can be computed at assemble time)

Code:
jcc.asm [6] fix.jcc_erratum [3]:
  rem = 32 - ($ and 31)
processed: rem?1=32-($ and 31)
error: invalid use of symbol.
    


You need to use org (for example, org 0) in an object file, but in an executable (format ELF64 EXECUTABLE) it's fine and you get no error.
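
A possible alternative to org 0 for object files, as an untested sketch: measure offsets from a label at the start of the 32-byte-aligned section, on the assumption that FASM treats the difference of two labels in the same section as a plain number. The macro name check.jcc_erratum.rel and the labels text_base, pair_start and pair_end are made up for illustration:

```asm
format ELF64

; Hypothetical variant (not from the thread): chunk test on offsets from
; _base_, so the absolute address of the code is never needed.
macro check.jcc_erratum.rel _base_, _start_, _end_ {
  if (((_start_) - (_base_)) shr 5) <> (((_end_) - (_base_)) shr 5)
    display 'JCC erratum: branch touches a 32-byte boundary', 13, 10
  end if
}

section '.text' executable align 32
text_base:                       ; assumed: first byte of the aligned section

        times 28 nop             ; offsets 0x00..0x1B
pair_start:
        test eax, eax            ; offsets 0x1C..0x1D
        jz   pair_start          ; offsets 0x1E..0x1F, ends on the boundary
pair_end:
        check.jcc_erratum.rel text_base, pair_start, pair_end
```

The reason to try this instead of org 0 is that the labels stay relocatable. With org 0 they appear to become plain numbers, which I would expect to matter for instructions like mov rdi, .hello_world once the object is linked, but treat that as an assumption to verify.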

I also added a switch to turn it on or off (for when we don't want the JCC erratum fix), and if the padding is greater than 3 there will be a FASM error (not done yet, I'm working on it).

Code:
format ELF64

enable.jcc_erratum_fix = 1

macro fix.jcc_erratum _offset_end_ {
 if enable.jcc_erratum_fix=1
  local padding, rem
   padding = 0
   rem = 32 - ($ and 31)
  if rem < (_offset_end_ - $)
   padding = rem
  end if
  if padding <= 3
   repeat padding
    db 0x2e
   end repeat
  else
   ; Stop FASM and Print Error => Something wrong with your code, please handle code-padding manually.
  end if
 end if  }


section '.text' executable align 32

org 0

 align 16
main:
        mov     rdi, .hello_world
        call    strlen

        xor     edi, edi
        mov     eax, 60 ; SYS_EXIT
        syscall
        hlt
 .hello_world db 'Hello, World. My name is Mr. Nobody.', 0x00

 align 16
strlen:
        mov     r8, rdi
        test    dil, 7
        jz      SHORT .loop
        ; unaligned (8-BYTE) handle ...
  align 16
 .loop:
        mov     rax, QWORD [rdi]
        lea ecx, [ecx+12312312] ; something to push test al, al and its jump, near to end of a 32 bytes chunk.
        xor ecx, ecx ; something to push test al, al and its jump, near to end of a 32 bytes chunk.
        xor     rcx, rcx ; something to push test al, al and its jump, near to end of a 32 bytes chunk.
        fix.jcc_erratum @f
        test    al, al
        jz      SHORT .ret
 @@:    fix.jcc_erratum @f
        test    ah, ah
        jz      SHORT .ret.1
 @@:    fix.jcc_erratum @f
        test    eax, 0xff0000
        jz      SHORT .ret.2
 @@:    fix.jcc_erratum @f
        test    eax, 0xff000000
        jz      SHORT .ret.3
 @@:    shr     rax, 32
        fix.jcc_erratum @f
        test    al, al
        jz      SHORT .ret.4
 @@:    fix.jcc_erratum @f
        test    ah, ah
        jz      SHORT .ret.5
 @@:    fix.jcc_erratum @f
        test    eax, 0xff0000
        jz      SHORT .ret.6
 @@:    fix.jcc_erratum @f
        test    eax, 0xff000000
        jz      SHORT .ret.7
 @@:    lea     rdi, [rdi+8]
        jmp     SHORT .loop
  align 16
 .ret:  mov     rax, rdi
        sub     rax, r8
        ret
  align 16
 .ret.1: lea    rax, [rdi+1]
        sub     rax, r8
        ret
  align 16
 .ret.2: lea    rax, [rdi+2]
        sub     rax, r8
        ret
  align 16
 .ret.3: lea    rax, [rdi+3]
        sub     rax, r8
        ret
  align 16
 .ret.4: lea    rax, [rdi+4]
        sub     rax, r8
        ret
  align 16
 .ret.5: lea    rax, [rdi+5]
        sub     rax, r8
        ret
  align 16
 .ret.6: lea    rax, [rdi+6]
        sub     rax, r8
        ret
  align 16
 .ret.7: lea    rax, [rdi+7]
        sub     rax, r8
        ret                  
    


The result:
Code:
0000000000000000 <.text>:
   0:   48 c7 c7 16 00 00 00    mov    $0x16,%rdi
   7:   e8 34 00 00 00          callq  0x40
   c:   31 ff                   xor    %edi,%edi
   e:   b8 3c 00 00 00          mov    $0x3c,%eax
  13:   0f 05                   syscall 
  15:   f4                      hlt    
  16:   48                      rex.W
  17:   65 6c                   gs insb (%dx),%es:(%rdi)
  19:   6c                      insb   (%dx),%es:(%rdi)
  1a:   6f                      outsl  %ds:(%rsi),(%dx)
  1b:   2c 20                   sub    $0x20,%al
  1d:   57                      push   %rdi
  1e:   6f                      outsl  %ds:(%rsi),(%dx)
  1f:   72 6c                   jb     0x8d
  21:   64 2e 20 4d 79          fs and %cl,%cs:0x79(%rbp)
  26:   20 6e 61                and    %ch,0x61(%rsi)
  29:   6d                      insl   (%dx),%es:(%rdi)
  2a:   65 20 69 73             and    %ch,%gs:0x73(%rcx)
  2e:   20 4d 72                and    %cl,0x72(%rbp)
  31:   2e 20 4e 6f             and    %cl,%cs:0x6f(%rsi)
  35:   62                      (bad)  
  36:   6f                      outsl  %ds:(%rsi),(%dx)
  37:   64 79 2e                fs jns 0x68
  3a:   00 90 90 90 90 90       add    %dl,-0x6f6f6f70(%rax)
  40:   49 89 f8                mov    %rdi,%r8
  43:   40 f6 c7 07             test   $0x7,%dil
  47:   74 07                   je     0x50
  49:   90                      nop
  4a:   90                      nop
  4b:   90                      nop
  4c:   90                      nop
  4d:   90                      nop
  4e:   90                      nop
  4f:   90                      nop
  50:   48 8b 07                mov    (%rdi),%rax
  53:   67 8d 89 f8 de bb 00    lea    0xbbdef8(%ecx),%ecx
  5a:   31 c9                   xor    %ecx,%ecx
  5c:   48 31 c9                xor    %rcx,%rcx
  5f:   2e 84 c0                cs test %al,%al
  62:   74 3c                   je     0xa0
  64:   84 e4                   test   %ah,%ah
  66:   74 48                   je     0xb0
  68:   a9 00 00 ff 00          test   $0xff0000,%eax
  6d:   74 51                   je     0xc0
  6f:   a9 00 00 00 ff          test   $0xff000000,%eax
  74:   74 5a                   je     0xd0
  76:   48 c1 e8 20             shr    $0x20,%rax
  7a:   84 c0                   test   %al,%al
  7c:   74 62                   je     0xe0
  7e:   2e 2e 84 e4             cs cs test %ah,%ah
  82:   74 6c                   je     0xf0
  84:   a9 00 00 ff 00          test   $0xff0000,%eax
  89:   74 75                   je     0x100
  8b:   a9 00 00 00 ff          test   $0xff000000,%eax
  90:   74 7e                   je     0x110
  92:   48 8d 7f 08             lea    0x8(%rdi),%rdi
  96:   eb b8                   jmp    0x50
  98:   90                      nop
  99:   90                      nop
  9a:   90                      nop
  9b:   90                      nop
  9c:   90                      nop
  9d:   90                      nop
  9e:   90                      nop
  9f:   90                      nop
  a0:   48 89 f8                mov    %rdi,%rax
  a3:   4c 29 c0                sub    %r8,%rax
  a6:   c3                      retq   
  a7:   90                      nop
  a8:   90                      nop
  a9:   90                      nop
  aa:   90                      nop
  ab:   90                      nop
  ac:   90                      nop
  ad:   90                      nop
  ae:   90                      nop
  af:   90                      nop
  b0:   48 8d 47 01             lea    0x1(%rdi),%rax
  b4:   4c 29 c0                sub    %r8,%rax
  b7:   c3                      retq   
  b8:   90                      nop
  b9:   90                      nop
  ba:   90                      nop
  bb:   90                      nop
  bc:   90                      nop
  bd:   90                      nop
  be:   90                      nop
    


Here I added two dummy instructions to push the test al, al and its jz close to a 32-byte boundary. With this macro, a single 0x2E prefix is added to push the test al, al toward a new 32-byte chunk, and some of the other fix.jcc_erratum invocations emit nothing, since everything was already OK there.

_________________
JESUS _F_ CHRIST
Post 25 May 2023, 10:56
Peter_Cordes



Joined: 25 May 2023
Posts: 4
Peter_Cordes 25 May 2023, 11:08
Oh right, FASM making an executable directly would know the address, and interesting that org 0 works when making an ELF .o. NASM only allows ORG when making a flat binary.

Remember that the 3-prefix limit includes other prefixes instructions might already need, like REX and/or 66h operand-size for test r8w, r8w.

Quote:

Code:
  5f:   2e 84 c0                cs test %al,%al
  62:   74 3c                   je     0xa0
    



That macro-fused test/je spans a 32-byte boundary, so will trigger slow performance on Skylake CPUs. (The ...40 to ...5f block can't get cached in the uop cache, has to use legacy decode).
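To make the boundary arithmetic concrete, here is a minimal Python sketch (not from the thread; the function name `touches_32b_boundary` is made up for illustration). It assumes the condition described in Intel's whitepaper: a jump, or a macro-fused pair treated as one unit, is affected when it crosses a 32-byte boundary or ends exactly on one.

```python
def touches_32b_boundary(start: int, length: int) -> bool:
    """True if the byte range [start, start + length) crosses or ends on a 32-byte boundary."""
    end = start + length              # address of the first byte after the unit
    return start // 32 != end // 32   # different chunk index means a boundary is touched

# cs test %al,%al at 0x5f (3 bytes) plus je at 0x62 (2 bytes), taken as one fused unit
print(touches_32b_boundary(0x5F, 3 + 2))   # True
# test %ah,%ah at 0x64 (2 bytes) plus je at 0x66 (2 bytes)
print(touches_32b_boundary(0x64, 2 + 2))   # False
```

The prefix counts as part of the test, so the pair still starts at 0x5f and still runs past 0x60.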

Like I said earlier, you need to lengthen an *earlier* instruction, not the test/jcc itself. Macro-fusion during decode makes test+jcc effectively a single instruction that tests-and-branches as a single uop for the pipeline. Intel's whitepaper on the JCC erratum explains this.
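As a rough illustration of "lengthen an earlier instruction", the following Python sketch computes how many bytes would have to be added somewhere before the fused pair so that it starts at the next 32-byte chunk. The helper name `padding_before` is hypothetical, and the example assumes the test would be 2 bytes (84 c0) without the prefix the macro added.

```python
def padding_before(start: int, length: int, chunk: int = 32) -> int:
    """Bytes to add to earlier instructions so the unit no longer touches a chunk boundary."""
    if not 0 < length < chunk:
        raise ValueError("unit must be shorter than one chunk")
    end = start + length              # address of the first byte after the unit
    if start // chunk == end // chunk:
        return 0                      # already fully inside one chunk
    return chunk - start % chunk      # shift the unit to the start of the next chunk

# Unprefixed test %al,%al (2 bytes) + je (2 bytes) starting at 0x5f
print(padding_before(0x5F, 4))   # 1
```

One extra byte on an earlier instruction (for example a prefix on one of the xor instructions) would be enough in this case. Any padding also shifts all the code after it, so later test/jcc pairs have to be re-checked.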

https://uica.uops.info/ can assemble a loop body in NASM or GAS syntax (or hex machine code) and analyze it for various Intel microarchitectures, including warning about JCC-erratum problems for Skylake. For FASM, you can hexdump and copy/paste the machine code.

It's useless to use this macro in front of a test/jcc or cmp/jcc. (Except maybe if the TEST starts near the end of a 64-byte cache line, and the JCC starts in the next cache line. Then macro-fusion might not happen, IIRC.)
Post 25 May 2023, 11:08
hellomachine



Joined: 18 May 2023
Posts: 22
Location: I don't even exist
hellomachine 25 May 2023, 11:17
Peter_Cordes wrote:
Oh right, FASM making an executable directly would know the address, and interesting that org 0 works when making an ELF .o. NASM only allows ORG when making a flat binary.

Remember that the 3-prefix limit includes other prefixes instructions might already need, like REX and/or 66h operand-size for test r8w, r8w.

Quote:

Code:
  5f:   2e 84 c0                cs test %al,%al
  62:   74 3c                   je     0xa0
    



That macro-fused test/je spans a 32-byte boundary, so will trigger slow performance on Skylake CPUs. (The ...40 to ...5f block can't get cached in the uop cache, has to use legacy decode).

Like I said earlier, you need to lengthen an *earlier* instruction, not the test/jcc itself. Macro-fusion during decode makes test+jcc effectively a single instruction that tests-and-branches as a single uop for the pipeline. Intel's whitepaper on the JCC erratum explains this.

https://uica.uops.info/ can assemble a loop body in NASM or GAS syntax (or hex machine code) and analyze it for various Intel microarchitectures, including warning about JCC-erratum problems for Skylake. For FASM, you can hexdump and copy/paste the machine code.

It's useless to use this macro in front of a test/jcc or cmp/jcc. (Except maybe if the TEST starts near the end of a 64-byte cache line, and the JCC starts in the next cache line. Then macro-fusion might not happen, IIRC.)


Got it. You're right. Thank you. I'll try other ways (FASM virtual or ...) tomorrow, and if that doesn't work, well, we need to be more careful about the way we are coding.

At least we can make a macro (like this one) that detects the 'JCC Erratum' and prints a warning message at compile time, so we know about it and can go fix it ...
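Until such a macro exists, a similar warning can be produced after assembling by scanning the disassembly. This is only a sketch in Python, not a FASM macro: the names `jcc_warnings` and `LINE` are made up, it expects objdump-style lines like the ones posted above, and as a simplification it only treats a test or cmp directly before a conditional jump as macro-fused.

```python
import re

# Matches objdump-style lines such as "  5f:   2e 84 c0    cs test %al,%al"
LINE = re.compile(r"^\s*([0-9a-f]+):\s+((?:[0-9a-f]{2} )+)\s*(.*)$")
SEGMENT_PREFIXES = {"cs", "ds", "es", "ss", "fs", "gs"}

def mnemonic(text):
    """First word of the disassembly text that is not a segment-override prefix."""
    for word in text.split():
        if word not in SEGMENT_PREFIXES:
            return word
    return ""

def jcc_warnings(listing):
    """Return one warning per jcc (or test/cmp + jcc pair) that touches a 32-byte boundary."""
    warnings = []
    prev = None                       # (address, length, mnemonic) of the previous instruction
    for line in listing.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        addr = int(m.group(1), 16)
        length = len(m.group(2).split())
        op = mnemonic(m.group(3))
        if op.startswith("j") and op != "jmp":
            start = addr
            # Simplification: only an adjacent test/cmp is counted as fused with the jcc.
            if prev and prev[2].startswith(("test", "cmp")) and prev[0] + prev[1] == addr:
                start = prev[0]
            end = addr + length       # first byte after the jcc
            if start // 32 != end // 32:
                warnings.append(
                    f"warning: jcc unit {start:#x}..{end - 1:#x} touches a 32-byte boundary")
        prev = (addr, length, op)
    return warnings

listing = """
  5c:   48 31 c9                xor    %rcx,%rcx
  5f:   2e 84 c0                cs test %al,%al
  62:   74 3c                   je     0xa0
  64:   84 e4                   test   %ah,%ah
  66:   74 48                   je     0xb0
"""
for w in jcc_warnings(listing):
    print(w)   # warning: jcc unit 0x5f..0x63 touches a 32-byte boundary
```

It does not look at jmp, call or ret, so treat it as a quick check and not a complete detector.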

_________________
JESUS _F_ CHRIST
Post 25 May 2023, 11:17

Copyright © 1999-2023, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.
