flat assembler
Message board for the users of flat assembler.
hellomachine 24 May 2023, 19:47
Hello, there is an issue known as the "JCC erratum" that concerns code padding (something like 'align', but different). It matters a lot, because it affects how a function performs compared to code generated by C compilers (GCC and others).

I developed a function and everything worked, but it was 1.5x slower than the same function in C (same behaviour, but generated by GCC). Is there a solution for the JCC erratum in FASM? (There is one in the GNU assembler.)

Reference: https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf

I think it's possible to create a macro for this problem. Please pay attention to this, it is very important for performance.
_________________
JESUS _F_ CHRIST
hellomachine 25 May 2023, 06:17
macomics wrote: For high-level languages that cannot directly control where code is placed in memory, these optimizations are useful. But you write in assembler and you can arrange the code as you like. Nothing prevents you from aligning code according to certain rules. Write macros that solve this problem, and you can post them in this topic.

Yes, but most CPUs in use are older generations (check datacenters and you will find that most CPUs are older generations). As for assembly and how to write code, I'm sure over 95% of assembly programmers don't know anything about code alignment (each branch target (label) should be aligned to a 16-byte boundary), so their functions end up slower than the ones generated by a C/C++ compiler. And even after aligning branch targets (labels and so on) to 16 bytes, there are other problems such as this JCC erratum. I'm doing my best to draw attention to code alignment for those who care about assembly (and being a god).

I'm creating a macro that takes the offset of a test|cmp (where the compare starts) and the offset of the jcc instruction, then checks whether the test|cmp and the jcc are in the same 32-byte chunk of code. But it's better to expand your code (for example, convert 32-BYTE ops to 64-BYTE) to achieve this.

Last edited by hellomachine on 25 May 2023, 07:36; edited 1 time in total
hellomachine 25 May 2023, 07:24
Done, I created a macro to fix the 'JCC erratum' problem.

Code:
macro fix.jcc_erratum _offset_end_
{
    local padding, rem
    padding = 0
    rem = 32 - ($ and 31)           ; bytes left in the current 32-byte chunk
    if rem < (_offset_end_ - $)     ; the cmp|test + jcc does not fit
        padding = rem
    end if
    repeat padding
        db 0x2e                     ; CS segment-override prefix used as padding
    end repeat
}

I hope Mr. Tomasz Grysztar considers this solution an important one.

How to use this macro: put it before each cmp|test (any instruction that is followed by a JCC (jg, jl, jb, je, ...) instruction). After the JCC instruction, place a @@ label to mark the end of the cmp|test + JCC range, and pass that label to the macro. For example:

Code:
func1:
    fix.jcc_erratum @f
    test eax, eax
    jz .some_label
@@:
    ...

Multiple-use example:

Code:
func1:
    fix.jcc_erratum @f
    test eax, 1
    jz .some_label
@@:
    mov edx, 0
    fix.jcc_erratum @f
    test eax, 2
    jz .some_label2
@@:
    ...

This macro makes sure that a cmp|test and its JCC instruction end up in the same 32-byte chunk of code. You can disable it (via CPU detection or similar) if the target CPU is not one of the affected CPUs.
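To make the arithmetic concrete, here is a small Python model of what the macro computes. It is only a sketch: the name `macro_padding` is made up for illustration, and it assumes that `_offset_end_ - $` equals the unpadded length of the cmp|test plus its JCC, which ignores how fasm resolves the forward reference over several passes.

```python
def macro_padding(start: int, pair_len: int, chunk: int = 32) -> int:
    """Number of 0x2E bytes the macro would emit at address `start`.

    start    -- value of $ where the macro is invoked (assumed known)
    pair_len -- unpadded byte length of cmp|test + jcc (assumed input)
    """
    rem = chunk - (start & (chunk - 1))  # same as: rem = 32 - ($ and 31)
    return rem if rem < pair_len else 0  # pad only when the pair does not fit


# A 3-byte test plus a 2-byte jz, starting 3 bytes before a boundary.
print(macro_padding(0x1D, 5))
# The same pair with plenty of room left in the chunk.
print(macro_padding(0x10, 5))
```

Note that when `rem` equals `pair_len` the model (like the macro) adds nothing, although the jump then ends exactly on the 32-byte boundary, a case that Intel's document also lists as affected.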
macomics 25 May 2023, 07:34
That's just it: the document is not about specific offsets but about crossing a cache line. But the line length depends on the processor, and a macro cannot take that into account. You will have to do the alignment with an explicit indication of this length. For processors of the 11th-12th gen the line length is 64 bytes; before the 8th gen the lines were 16 bytes, and in these generations 32-byte lines were introduced. And this led to these errors in the execution of the code.

No matter how much you would like it to, it will not be possible to get everything right for all processors. You will have to do something manually anyway. Macros are not omnipotent. To begin with, you start with a test+jcc pair, but there is still a whole set of instructions that should not cross cache-line boundaries: cmp+jcc, jmp, call, ret, ...
macomics 25 May 2023, 07:39
hellomachine wrote:
Code:
align 64
func:
    nop
    fix.jcc_erratum offset_end
    test [rsi], 3
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    jnz @f
offset_end:

Code:
align 64
func:
    nop
    scas byte [rdi]
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    jnz @f
offset_end:
hellomachine 25 May 2023, 07:50
macomics wrote: That's just it: the document is not about specific offsets but about crossing a cache line. But the line length depends on the processor, and a macro cannot take that into account. You will have to do the alignment with an explicit indication of this length. For processors of the 11th-12th gen the line length is 64 bytes; before the 8th gen the lines were 16 bytes, and in these generations 32-byte lines were introduced. And this led to these errors in the execution of the code.

Of course there are many things to handle, but here we are handling the JCC problem by putting a CMP and its JCC in the same (32-byte) cache line. And as I said, if the target CPU is not an old one, you can disable it by undefining the macro and defining an empty one. So I think we can handle everything with macros. It is certainly better to expand an operation (for example, convert a 32-BYTE op to 64-BYTE to get more code for padding), but that takes time for those who just want to focus on how their code and functions work rather than on code-alignment performance (they just use this macro and are done).
hellomachine 25 May 2023, 07:56
macomics wrote:
This macro will not add anything between a cmp and its JCC! It adds the padding before the start of the cmp. And I don't think it would be a good idea to have something between a cmp and its JCC.
Peter_Cordes 25 May 2023, 08:49
macomics wrote: That's just it: the document is not about specific offsets but about crossing a cache line.

No, the JCC erratum is based on 32-byte boundaries because that's how the Sandybridge-family uop cache is indexed, with up to 3 lines from the same 32-byte chunk of x86 machine code. Read Agner Fog's microarch PDF (https://agner.org/optimize/microarchitecture.pdf), specifically the Sandybridge chapter, since that's when the uop cache was new. (https://www.realworldtech.com/sandy-bridge/)

Quote: But the line length depends on the processor, and a macro cannot take that into account. You will have to do the alignment with an explicit indication of this length. For processors of the 11th-12th gen the line length is 64 bytes; before the 8th gen the lines were 16 bytes, and in these generations 32-byte lines were introduced. And this led to these errors in the execution of the code.

No, Intel CPUs have used 64-byte cache lines since Pentium-M and Pentium 4. Before that it was 32-byte lines, in Pentium 3 and earlier. I'm not aware of any Intel CPUs having 16-byte cache lines. They fetch code from L1i cache in 16-byte chunks, but that's not a whole line.

Fortunately that's not relevant: only Skylake had an erratum that microcode worked around by disabling the uop cache for chunks where a branch touched the end of the 32-byte chunk. See https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf and https://stackoverflow.com/questions/61256646/how-can-i-mitigate-the-impact-of-the-intel-jcc-erratum-on-gcc

(Skylake was 6th-gen, i7-6xxx. The same microarchitecture was used up until Ice Lake, so some 10th-gen CPUs are non-Skylake. Intel hasn't consistently used different generation numbers for different microarchitectures recently, once they got stalled on re-spins of Skylake. e.g. Comet Lake from 2019 was still Skylake and was sold as 10th-gen, alongside Ice Lake 10th-gen in some laptops. https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(client) vs. https://en.wikichip.org/wiki/intel/microarchitectures/comet_lake )

Quote:

Agreed that this macro is too simplistic, though. As you say, the max length of any single instruction (including `0x2e` prefixes) is 15 bytes, and this macro, if used naively, will try to pad with far too many prefixes. It will also kill performance on some non-Skylake CPUs: the Silvermont family especially can only decode efficiently when an instruction has at most 3 prefixes, including the 0F escape byte, for early Silvermont at least. That's why Intel recommends that compilers/assemblers implement the workaround by padding multiple instructions separately, not one very long instruction.

Also yes, the macro needs to know the start and end of the branch instructions since they're not all the same length, so a label at the start and end of the branch could work as inputs.
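The idea of padding several instructions separately can be sketched in Python as follows. This is an illustration only, not how any assembler actually implements it: `spread_padding` and its input format are hypothetical, and the instruction lengths and prefix counts are assumed to be known already.

```python
MAX_INSN_LEN = 15   # architectural limit on the length of one x86 instruction
MAX_PREFIXES = 3    # per-instruction prefix budget mentioned above (Silvermont)


def spread_padding(insns, needed):
    """Distribute `needed` padding bytes over the instructions before a branch.

    insns  -- list of (length, prefix_count) tuples, in program order
    needed -- number of padding bytes required
    Returns the extra prefixes per instruction, or None if they do not fit.
    """
    extra = []
    for length, prefixes in insns:
        # Room left under both limits; never negative.
        room = max(0, min(MAX_PREFIXES - prefixes, MAX_INSN_LEN - length))
        take = min(room, needed)
        extra.append(take)
        needed -= take
    return extra if needed == 0 else None


# Three hypothetical instructions; the first one already carries one prefix.
print(spread_padding([(3, 1), (2, 0), (5, 0)], 5))
# Asking for more padding than the budget allows yields None.
print(spread_padding([(3, 1), (2, 0), (5, 0)], 9))
```

When the function returns None, the remaining bytes would have to come from somewhere else, for example a NOP or a manual rearrangement of the code.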
hellomachine 25 May 2023, 09:04
Peter_Cordes wrote:
Thank you. For the macro, I will try to add some kind of "raise exception" for when there is a problem with the prefix length (for example, an error message such as 'Something wrong with fix.jcc_erratum Macro, please handle your code padding manually').
Peter_Cordes 25 May 2023, 09:26
hellomachine wrote:
That's the wrong place to put the padding. That lengthens the test/JCC macro-fused instruction pair so it still starts before the end of a 32-byte block and ends after. I tested with

Code:
;format ELF64
macro fix.jcc_erratum _offset_end_
{
    local padding, rem
    padding = 0
    rem = 32 - ($ and 31)
    if rem < (_offset_end_ - $)
        padding = rem
    end if
    repeat padding
        db 0x2e
    end repeat
}

;section '.text' executable align 32
align 32
times 26 nop
func1:
    add eax, eax
    fix.jcc_erratum @f
    test eax, eax
    jz func1
@@:
    add edx, edx

With `ndisasm -b16 jcc.bin`, I get

Code:
00000000  90            nop
...
00000018  90            nop
00000019  90            nop
0000001A  6601C0        add eax,eax
0000001D  2E2E2E6685C0  cs test eax,eax   ;;;; TEST/JNZ spanning 32-byte boundary
00000023  74F5          jz 0x1a
00000025  6601D2        add edx,edx

You need the padding to be before an earlier instruction, like the `add eax,eax`. But you still need to tell the macro the length of the test/jnz or the call [mem] or whatever. And as I mentioned before, the max instruction length of 15 bytes can't be exceeded, and you don't want more than 3 total prefixes on any one instruction, so this isn't something you can easily do with a single macro.

---

Also, is there a way to use this for output formats other than flat binary? With `format ELF` or `format ELF64`, I got this error, because it doesn't know the absolute address of `$` as an assemble-time constant. (And despite the ALIGN 32, it doesn't realize that $ and 31 can be computed at assemble time.)

Code:
jcc.asm [6] fix.jcc_erratum [3]:
rem = 32 - ($ and 31)
processed: rem?1=32-($ and 31)
error: invalid use of symbol.

If there isn't a workaround for this, it's not very usable for more real-world stuff, even if you can change the design to do anything useful.
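The difference between the two placements can be checked with a few lines of Python, using the addresses and lengths from the ndisasm listing above. This is a rough sketch: `touches_boundary` is a name invented here, and it treats the fused test+jz as a single unit, as the post describes.

```python
def touches_boundary(start: int, length: int, chunk: int = 32) -> bool:
    """True if bytes [start, start+length) cross a chunk boundary or end on one."""
    end = start + length                  # address of the byte after the unit
    return start // chunk != end // chunk


add_start = 0x1A                          # address of `add eax,eax` in the listing
add_len, test_len, jz_len = 3, 3, 2       # lengths without any padding prefixes
pad = 3                                   # the three 0x2E bytes the macro emitted

# Macro as posted: the prefixes become part of TEST, so the pair starts at 0x1D.
pair_start = add_start + add_len
print(touches_boundary(pair_start, pad + test_len + jz_len))

# Same three bytes attached to the earlier ADD: the pair starts at 0x20.
pair_start = add_start + add_len + pad
print(touches_boundary(pair_start, test_len + jz_len))
```

Only the second placement moves the whole test+jz pair into the next 32-byte block.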
hellomachine 25 May 2023, 09:39
I'm working on it with an example, for ELF64 EXECUTABLE ...
macomics 25 May 2023, 10:16
Peter_Cordes wrote: No, the JCC erratum is based on 32-byte boundaries because that's how the Sandybridge-family uop cache is indexed, with up to 3 lines from the same 32-byte chunk of x86 machine code.

hellomachine, one more thing I wanted to show you (in my previous examples): you use the prefix 0x2E (CS), but in instructions with memory accesses this can also cause problems.
Peter_Cordes 25 May 2023, 10:43
macomics wrote: But I still know another 486 DX2 where a 16-bit cache was implemented, but there was no instruction cache (this was actually one of my old computers). Here I agree that I confused the data cache and the instruction cache.

In P5 Pentium and 486, L1 cache was unified, not split instruction / data. You're right that 486 used 16-byte cache lines; I found a couple independent mentions of that being true. P5 Pentium used 32-byte lines, so did PPro / PII / PIII.

Quote:

I think this was intended for x86-64, or 32-bit with a flat memory model. In that case, CS prefixes are fine except on instructions that use FS or GS, e.g. for thread-local storage.

A separate risk is a CS prefix with an opcode that can *never* have a memory operand, where it's plausible some future CPU might interpret that prefix+opcode as a different instruction, or give the prefix a special meaning. e.g. like Pentium 4 branch hints on JCC instructions, which were 2E and 3E, the same bytes as CS and DS prefixes: https://stackoverflow.com/questions/14332848/intel-x86-0x2e-0x3e-prefix-branch-prediction-actually-used

As Intel notes, they're "reserved for future use". But due to compilers and assemblers using Intel's suggestion of 2E prefixes in front of many different non-memory instructions, binaries now exist in the wild assuming these prefixes will be ignored. Some of these are commercially important, so there's a big incentive for CPU designers not to interpret them in a way that stops code from working, or makes it slow. Like with the `rep ret` thing that GCC used to work around AMD's K8 / K10 branch-prediction limitations for a 1-byte ret following another branch. https://stackoverflow.com/questions/20526361/what-does-rep-ret-mean

Also related re: using other prefixes for padding, such as 66h:
https://stackoverflow.com/questions/74954166/using-the-operand-size-override-prefix-0x66-for-instruction-alignment
https://stackoverflow.com/questions/61319623/are-intel-tsx-prefixes-executed-safely-on-amd-as-nop - generally CPUs ignore prefixes they don't understand, and it's only retroactively that Intel / AMD document older CPUs as ignoring it.

Similar to how REP NOP was re-purposed as PAUSE: https://stackoverflow.com/questions/7086220/what-does-rep-nop-mean-in-x86-assembly-is-it-the-same-as-the-pause-instru/33639300#33639300 and how REP BSF runs as TZCNT on newer CPUs, with interesting backward-compat opportunities for code that knows it has a non-zero value, but cares about running faster on AMD CPUs where BSF is slow.

Anyway, with Intel officially recommending 2E prefixes for this, it's fine except in homebrew osdev cases with non-zero segment bases, or if the instruction also had an FS or GS prefix.
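As a rough illustration of the FS/GS caveat, the following Python sketch looks at the leading legacy prefix bytes of an encoded instruction and reports whether a 2E prefix could be added without clashing with an existing FS (64h) or GS (65h) override. The helper names are hypothetical, the check assumes a flat memory model as discussed above, and it does not decode anything beyond the legacy prefixes.

```python
# Legacy prefix bytes: lock/rep, segment overrides, operand/address size.
LEGACY_PREFIXES = {0xF0, 0xF2, 0xF3, 0x26, 0x2E, 0x36, 0x3E,
                   0x64, 0x65, 0x66, 0x67}


def leading_prefixes(code: bytes) -> list:
    """Return the legacy prefix bytes at the start of one encoded instruction."""
    prefixes = []
    for byte in code:
        if byte not in LEGACY_PREFIXES:
            break                      # REX or opcode reached: stop scanning
        prefixes.append(byte)
    return prefixes


def cs_padding_ok(code: bytes) -> bool:
    """False if the instruction already carries an FS or GS override."""
    return not any(b in (0x64, 0x65) for b in leading_prefixes(code))


# test rax, rax: no segment override, padding with 2E is acceptable here.
print(cs_padding_ok(bytes.fromhex("4885c0")))
# mov rax, fs:[0] (thread-local storage access): leave it alone.
print(cs_padding_ok(bytes.fromhex("64488b042500000000")))
```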
hellomachine 25 May 2023, 10:56
Peter_Cordes wrote:
You need to use org (for example, org 0) in an object file, but in an executable (format ELF64 EXECUTABLE) it's OK and you get no error. I also added a switch to turn it on or off (for when we don't want the JCC erratum fix), and if the padding is greater than 3 there will be a FASM error (not implemented yet, I'm working on it).

Code:
format ELF64

enable.jcc_erratum_fix = 1

macro fix.jcc_erratum _offset_end_
{
    if enable.jcc_erratum_fix=1
        local padding, rem
        padding = 0
        rem = 32 - ($ and 31)
        if rem < (_offset_end_ - $)
            padding = rem
        end if
        if padding <= 3             ; check the emitted padding, not rem
            repeat padding
                db 0x2e
            end repeat
        else
            ; Stop FASM and Print Error => Something wrong with your code, please handle code-padding manually.
        end if
    end if
}

section '.text' executable align 32
org 0

align 16
main:
    mov rdi, .hello_world
    call strlen
    xor edi, edi
    mov eax, 60 ; SYS_EXIT
    syscall
    hlt
.hello_world db 'Hello, World. My name is Mr. Nobody.', 0x00

align 16
strlen:
    mov r8, rdi
    test dil, 7
    jz SHORT .loop
    ; unaligned (8-BYTE) handle
    ...
align 16
.loop:
    mov rax, QWORD [rdi]
    lea ecx, [ecx+12312312] ; something to push test al, al and its jump, near to end of a 32 bytes chunk.
    xor ecx, ecx ; something to push test al, al and its jump, near to end of a 32 bytes chunk.
    xor rcx, rcx ; something to push test al, al and its jump, near to end of a 32 bytes chunk.
    fix.jcc_erratum @f
    test al, al
    jz SHORT .ret
@@:
    fix.jcc_erratum @f
    test ah, ah
    jz SHORT .ret.1
@@:
    fix.jcc_erratum @f
    test eax, 0xff0000
    jz SHORT .ret.2
@@:
    fix.jcc_erratum @f
    test eax, 0xff000000
    jz SHORT .ret.3
@@:
    shr rax, 32
    fix.jcc_erratum @f
    test al, al
    jz SHORT .ret.4
@@:
    fix.jcc_erratum @f
    test ah, ah
    jz SHORT .ret.5
@@:
    fix.jcc_erratum @f
    test eax, 0xff0000
    jz SHORT .ret.6
@@:
    fix.jcc_erratum @f
    test eax, 0xff000000
    jz SHORT .ret.7
@@:
    lea rdi, [rdi+8]
    jmp SHORT .loop
align 16
.ret:
    mov rax, rdi
    sub rax, r8
    ret
align 16
.ret.1:
    lea rax, [rdi+1]
    sub rax, r8
    ret
align 16
.ret.2:
    lea rax, [rdi+2]
    sub rax, r8
    ret
align 16
.ret.3:
    lea rax, [rdi+3]
    sub rax, r8
    ret
align 16
.ret.4:
    lea rax, [rdi+4]
    sub rax, r8
    ret
align 16
.ret.5:
    lea rax, [rdi+5]
    sub rax, r8
    ret
align 16
.ret.6:
    lea rax, [rdi+6]
    sub rax, r8
    ret
align 16
.ret.7:
    lea rax, [rdi+7]
    sub rax, r8
    ret

The result:

Code:
0000000000000000 <.text>:
   0:   48 c7 c7 16 00 00 00    mov    $0x16,%rdi
   7:   e8 34 00 00 00          callq  0x40
   c:   31 ff                   xor    %edi,%edi
   e:   b8 3c 00 00 00          mov    $0x3c,%eax
  13:   0f 05                   syscall
  15:   f4                      hlt
  16:   48                      rex.W
  17:   65 6c                   gs insb (%dx),%es

Here I added two dummy instructions to push the test al, al and its jz close to the end of a 32-byte chunk. With this macro, a single 0x2E is added to push the test al, al into a new 32-byte chunk, and some of the other fix.jcc_erratum uses are ignored since everything was OK.
Peter_Cordes 25 May 2023, 11:08
Oh right, FASM making an executable directly would know the address, and interesting that org 0 works when making an ELF .o. NASM only allows ORG when making a flat binary.

Remember that the 3-prefix limit includes other prefixes instructions might already need, like REX and/or 66h operand-size for test r8w, r8w.

Quote:

That macro-fused test/je spans a 32-byte boundary, so will trigger slow performance on Skylake CPUs. (The ...40 to ...5f block can't get cached in the uop cache, has to use legacy decode.) Like I said earlier, you need to lengthen an *earlier* instruction, not the test/jcc itself. Macro-fusion during decode makes test+jcc effectively a single instruction that tests-and-branches as a single uop for the pipeline. Intel's whitepaper on the JCC erratum explains this.

https://uica.uops.info/ can assemble a loop body in NASM or GAS syntax (or hex machine code) and analyze it for various Intel microarchitectures, including warning about JCC-erratum problems for Skylake. For FASM, you can hexdump and copy/paste the machine code.

It's useless to use this macro in front of a test/jcc or cmp/jcc. (Except maybe if the TEST starts near the end of a 64-byte cache line, and the JCC starts in the next cache line. Then macro-fusion might not happen, IIRC.)
hellomachine 25 May 2023, 11:17
Peter_Cordes wrote: Oh right, FASM making an executable directly would know the address, and interesting that org 0 works when making an ELF .o. NASM only allows ORG when making a flat binary.

Got it. You're right. Thank you. I'll try other ways (FASM virtual or ...) tomorrow, and if that doesn't work, well, we need to be more careful about the way we code. At least we can make a macro (like this one) that detects the JCC erratum and prints a warning at compile time, so that we know about it and can go fix it ...
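The detection idea can be prototyped outside the assembler first. The Python sketch below is only a model under stated assumptions: the `Insn` record, the function name and the way instruction lengths are supplied are all hypothetical (for example, they could come from a disassembly listing), and a cmp|test flagged as fusing is treated as one unit with the jump that follows it.

```python
from dataclasses import dataclass


@dataclass
class Insn:
    text: str
    length: int                     # encoded length in bytes (assumed known)
    is_jump: bool = False           # jcc, jmp, call, ret
    fuses_with_next: bool = False   # cmp/test that macro-fuses with the next jcc


def jcc_erratum_warnings(insns, origin=0, chunk=32):
    """Report jumps (or fused pairs) that cross or end on a chunk boundary."""
    warnings = []
    addr = origin
    fused_start = None              # start address of a pending cmp/test, if any
    for insn in insns:
        end = addr + insn.length    # address of the byte after this instruction
        if insn.is_jump:
            start = addr if fused_start is None else fused_start
            if start // chunk != end // chunk:
                warnings.append(
                    f"{insn.text} at {start:#x}..{end - 1:#x} "
                    f"touches a {chunk}-byte boundary")
        fused_start = addr if insn.fuses_with_next else None
        addr = end
    return warnings


# The layout from the earlier test: 26 NOPs, then add / test / jz / add.
program = [Insn("nop", 1)] * 26 + [
    Insn("add eax, eax", 3),
    Insn("test eax, eax", 3, fuses_with_next=True),
    Insn("jz func1", 2, is_jump=True),
    Insn("add edx, edx", 3),
]
for line in jcc_erratum_warnings(program):
    print(line)
```

A report like this only points at the problem; the padding itself would still have to be placed on earlier instructions, as explained above.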
Copyright © 1999-2023, Tomasz Grysztar.