flat assembler
Message board for the users of flat assembler.
Linux > Thanks to all
revolution 29 Jun 2022, 13:13
Which OS are you using?
Furs 29 Jun 2022, 13:47
revolution wrote:
Because userspace follows the SysV x64 ABI, which does require 16-byte stack alignment but has no shadow space (unlike the MS x64 ABI). And in fact, a lot of libraries are built with that assumption. You don't have to consider those libraries "broken" or built with the "wrong flags": GCC itself assumes this by default (look in its source code), and it's the de facto standard compiler for Linux. Obviously you can turn this off, but unfortunately you'd have to rebuild the whole userland to get rid of the 16-byte alignment assumptions. The ABI does have a 128-byte red zone though, and Linux can dispatch signals, so messing with rsp is 100% a no-go on Linux. You don't need the alignment for syscalls, but you can't say the same about userland.
revolution 29 Jun 2022, 13:58
Looking at the Sys-V ABI it says this:
Section 3.1.2 wrote: Like the Intel386 architecture, the AMD64 architecture in general does not require all data accesses to be properly aligned. Misaligned data accesses are slower than aligned accesses but otherwise behave identically. The only exceptions are that __m128, __m256 and __m512 must always be aligned properly.
Section 3.2.2 wrote: The end of the input argument area shall be aligned on a 16 (32 or 64, if __m256 or __m512 is passed on stack) byte boundary.
For Linux there is no such requirement.
BTW: Section 3.2.3 "Parameter passing" is really complicated. Figures 3.5 and 3.6 show how terrible it can get.
So in that case the code probably works by good fortune, rather than good management.
I 30 Jun 2022, 00:19
Quote: Which OS are you using?
Native Fedora 36 (5.18.6-200.fc36.x86_64). Also VBox Fedora 34.
revolution 30 Jun 2022, 00:39
Mint doesn't show that behaviour.
I 01 Jul 2022, 01:53
VirtualBox Linux Mint 20.3 Cinnamon 5.2.7, Kernel 5.4.0-91-generic, Gnome 3.36.9, x11
Call to SDL_Init (misaligned by a modification at the start): fault in libc.so.6 (_IO_str?), instruction movaps [rsp], xmm0.
Run unmodified, there are two calls that are misaligned with respect to 16 bytes, each because of a push/pop used to save a value.
Call to glColor4ubv (misaligned by a single push of rax): fault in swrast_dri.so, instruction movaps [rsp+0x30], xmm0. After adding an extra push/pop to align the stack, the program runs without a fault.
Call to glShaderSource (misaligned by a single push of rbx): no fault occurred.
VirtualBox Debian GNU/Linux 11 (bullseye), Kernel 5.10.0-13-amd64, Gnome 3.38.5, Wayland
Call to SDL_SetVideoMode (misaligned by a modification at the start): fault in libvulkan.so.1.2.162, instruction movaps [rsp], xmm0. No fault occurred with SDL_Init.
Run unmodified: no fault occurred.
If I were clever enough to handle exceptions, this would maybe be easier to test. However, my own conclusion from this is that when passing control to 64-bit code that isn't mine, the stack has to be aligned to a multiple of 16 bytes.
Forgot to mention: what I saw doesn't look to me like it has to do with speed.
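The push/pop observation above comes down to simple arithmetic, which can be sketched in C. This is only a model, assuming the caller followed the SysV rule, so that rsp was 16-byte aligned before the `call` and is 8 bytes off a 16-byte boundary at function entry. The name `pad_before_call` is hypothetical and not taken from any post in this thread.

```c
#include <assert.h>

enum { STACK_ALIGN = 16, SLOT = 8 };

/* Model only (assumption: SysV-style caller). At function entry the
   return address has just been pushed, so rsp % 16 == 8.
   Given the number of bytes pushed or reserved since entry, return how
   many extra bytes must be subtracted from rsp so that rsp % 16 == 0
   right before the next `call`. */
static unsigned pad_before_call(unsigned bytes_since_entry)
{
    /* Pushes on x86-64 move rsp in 8-byte steps. */
    assert(bytes_since_entry % SLOT == 0);

    unsigned misalign = (SLOT + bytes_since_entry) % STACK_ALIGN;
    return (STACK_ALIGN - misalign) % STACK_ALIGN;
}
```

With one 8-byte push since entry the result is 0 (already aligned). With no pushes, or with two, the result is 8, which is the "extra push/pop" case described in the post.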
revolution 01 Jul 2022, 02:03
Using the SSE aligned instructions is purely for "speed" because there are perfectly functional unaligned variants that can be used instead. And those variants would work in all cases. But those variants could be "slower". So sadly in the quest for "speed" code becomes more complicated and difficult.
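As a rough illustration of the aligned and unaligned variants, here is a C sketch using the SSE2 intrinsics from `<emmintrin.h>` (x86/x86-64 only). Compilers typically turn `_mm_load_si128` into movdqa and `_mm_loadu_si128` into movdqu, though the exact instruction chosen is up to the compiler. The helper names `is_aligned16`, `copy16_any` and `copy16_aligned` are made up for this example.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics, x86/x86-64 only */
#include <stdint.h>
#include <string.h>

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Copies 16 bytes using the unaligned load/store intrinsics.
   Valid for any src/dst alignment. */
static void copy16_any(void *dst, const void *src)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, v);
}

/* Copies 16 bytes using the aligned load/store intrinsics.
   The caller must guarantee 16-byte alignment; without the assert a
   misaligned pointer would typically raise a fault at run time. */
static void copy16_aligned(void *dst, const void *src)
{
    assert(is_aligned16(dst) && is_aligned16(src));
    __m128i v = _mm_load_si128((const __m128i *)src);
    _mm_store_si128((__m128i *)dst, v);
}
```

The unaligned version accepts every pointer the aligned one accepts, which is the point made above: it works in all cases, and whether it is measurably slower needs benchmarking on the CPU in question.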
revolution 02 Jul 2022, 00:07
I wrote: VirtualBox Linux Mint 20.3 Cinnamon 5.2.7 Kernel 5.4.0-91-generic Gnome 3.36.9 x11
Sometime in the last two versions it has been "upgraded" to use a more restricted and fragile calling method.
Alphonso 02 Jul 2022, 01:43
revolution wrote: Sometime in the last two versions it has been "upgraded" to use a more restricted and fragile calling method.
The reference to speed was about the fact that there only seemed to be a very small use of aligned instructions, so I wouldn't have expected a noticeable impact on performance. A quick google shows that Visual Studio changed from generating movdqa in VS 2015 to movdqu in VS 2017; sorry for the Windows ref in a Linux thread.
https://developercommunity.visualstudio.com/t/perf-regression-movdqu-instructions-are-generated/48123
Charles Fu wrote: Basically, the decision we made is to reduce silent bad code gen scenarios caused by alignment fault with the consideration that the change won't have performance impact on modern CPU Since Nehalem.
Wonder if another reason in Linux could be to weed out "silent bad code" with in-your-face crashes?
P.S. Looks like "I" can use my old account again. This is going to be confusing.
revolution 02 Jul 2022, 03:26
Alphonso wrote:
In any OoO CPU it is basically a waste of time to be counting clock cycles by examining things like Agner Fog's cycle count tables. It is a meaningless exercise. I see the use of the aligned instructions as a consequence of that thinking. Without proper benchmarking no one can know if it hurts or helps. The code posted above showing the stack adjustments seems to nullify any possible savings (of which there were probably no savings anyway). But it is what it is. Comply or die.
Furs 02 Jul 2022, 14:02
revolution wrote:
So it avoided crashing by pure luck: those specific APIs did not use anything that required 16-byte alignment.
Furs 02 Jul 2022, 14:04
revolution wrote: Maybe just misguided. The ABI is terrible though. Mandating this alignment is a terrible idea. Instead, APIs that are "hot" (i.e. performance-critical) and make heavy use of such alignment should simply align the stack themselves. It's just 2 instructions ffs! Why should everything pay the price for what is essentially 0.01% of APIs that truly need this alignment for speed, when those APIs could simply realign the stack themselves on entry? (Most are not speed-critical, so it's whatever.) The Microsoft x64 ABI is even worse with the shadow space. It makes me mad.
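The realign-on-entry idea can be modelled in C as a single mask. The post does not say which two instructions are meant; a common pattern is to save the old stack pointer and then mask rsp (for example with `and rsp, -16`), and that is the assumption behind this sketch. The names `realign_down16` and `realign_cost` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Model of masking a stack pointer down to a 16-byte boundary,
   i.e. the arithmetic behind `and rsp, -16` (assumed pattern).
   The stack grows downwards, so rounding down never touches
   memory that the caller still owns. */
static uint64_t realign_down16(uint64_t rsp)
{
    return rsp & ~(uint64_t)15;
}

/* Bytes of stack skipped by the realignment: always 0..15. */
static uint64_t realign_cost(uint64_t rsp)
{
    uint64_t aligned = realign_down16(rsp);
    assert(aligned <= rsp);
    return rsp - aligned;
}
```

Because the amount skipped depends on the incoming rsp, a function that realigns this way has to keep the original rsp somewhere (often in rbp) to restore it before returning.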
I 03 Jul 2022, 00:46
If movdqu is as fast on aligned data as movdqa [insert revolution approved bench here], excepting very old processors, and movdqu beats movdqa hands down on unaligned data (no bench required), then isn't movdqu the better choice?
revolution 03 Jul 2022, 02:52
I think the original purpose of the dqa variants is for large swathes of data that are being processed all together. In such cases it is probably an awesome instruction to use. But for random single transfers during a call/ret function entry I think it is just being pointlessly pedantic to insist on absolute perfection with regard to stack usage, when it means the rest of the code has to be larger, and do more things, in an effort to comply with all the extra restrictions.
Last edited by revolution on 03 Jul 2022, 04:57; edited 1 time in total
bitRAKE 03 Jul 2022, 04:26
It's possible to follow the line of thought which informed the ABI parameters. Probably not the best metrics,
Does a better performing general ABI exist?
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.