flat assembler
Message board for the users of flat assembler.

Index > Linux > Thanks to all

Goto page Previous  1, 2, 3
Author
Thread Post new topic Reply to topic
I



Joined: 19 May 2022
Posts: 33
I 29 Jun 2022, 12:49
revolution wrote:
You changed rbx, so the call to SDL_SetVideoMode is not correct. The change you made doesn't affect the stack, all the parameters are passed in registers.
rbx is fine for me, it's set to just SDL_OPENGL, not SDL_FULLSCREEN which gets messed up otherwise from changing the stack pointer at the start.

Besides, the fault occurs at the very first call, SDL_Init in ld-linux-x86-64.so.2 function dbg._dl_map_object_from_fd instruction movdqa xmm3, [rbp - 0xc0] when the stack is an odd multiple of 8 before the call.
Post 29 Jun 2022, 12:49
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18940
Location: In your JS exploiting you and your system
revolution 29 Jun 2022, 13:13
Which OS are you using?
Post 29 Jun 2022, 13:13
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 1946
Furs 29 Jun 2022, 13:47
revolution wrote:
I wrote:
@revolution maybe that's it then, how can we guarantee that instruction isn't in a library function, now or sometime in the future.
It isn't part of the Linux spec, so it will be fine. Really, it will be fine. Don't worry about it.

I had another thread discussing the Windows RSP, and many there seem unconvinced also. A lot of superstition over RSP usage. Razz
Are you talking about Linux as in the kernel?

Because userspace follows the SysV x64 ABI which does have requirements of 16-byte alignment, but no shadow space (unlike the MS x64 ABI). And in fact, a lot of libraries are built with such assumption.

You don't have to consider those libraries "broken" or being built with "wrong flags": the GCC compiler itself assumes that by default (look in its source code) and it's the de-facto standard compiler for Linux. Obviously you can turn this off but unfortunately you'll have to rebuild the whole userland to avoid 16-byte alignment assumptions.

It has a 128-byte red zone though, and Linux can dispatch signals so messing with rsp is 100% a no go on Linux.

You don't need it for syscalls but you can't say the same about userland.
Post 29 Jun 2022, 13:47
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18940
Location: In your JS exploiting you and your system
revolution 29 Jun 2022, 13:58
Looking at the Sys-V ABI it says this:
Section 3.1.2 wrote:
Like the Intel386 architecture, the AMD64 architecture in general does not require all data accesses to be properly aligned. Misaligned data accesses are slower than aligned accesses but otherwise behave identically. The only exceptions are that __m128 , __m256 and __m512 must always be aligned properly.
And later:
Section 3.2.2 wrote:
The end of the input argument area shall be aligned on a 16 (32 or 64, if __m256 or __m512 is passed on stack) byte boundary.
So for the GNU stuff you need to align, but not just to 16, also perhaps to 32, or 64.

For Linux there is no such requirement.

BTW: Section 3.2.3 Parameter passing is really complicated. Figures 3.5 and 3.6 show how terrible it can get. Sad

So in that case the code probably works by good fortune, rather than good management. Razz
Post 29 Jun 2022, 13:58
View user's profile Send private message Visit poster's website Reply with quote
I



Joined: 19 May 2022
Posts: 33
I 30 Jun 2022, 00:19
Quote:
Which OS are you using?

Native Fedora 36 5.18.6-200.fc36.x86_64
Also VBox Fedora 34
Post 30 Jun 2022, 00:19
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18940
Location: In your JS exploiting you and your system
revolution 30 Jun 2022, 00:39
Mint doesn't show that behaviour.
Post 30 Jun 2022, 00:39
View user's profile Send private message Visit poster's website Reply with quote
I



Joined: 19 May 2022
Posts: 33
I 01 Jul 2022, 01:53
>>>

VirtualBox Linux Mint 20.3 Cinnamon 5.2.7 Kernel 5.4.0-91-generic Gnome 3.36.9 x11

Call to SDL_Init (misaligned by modification at start). Fault in libc.so.6 (_IO_str?] instruction movaps [rsp], xmm0.

Run unmodified then there's two calls that are unaligned to 16bytes by a push pop to save a value. Call to glColor4ubv (misaligned by single push (rax). Fault in swrast_dri.so instruction movaps [rsp+0x30], xmm0. Adding an extra push/pop to align stack, program runs without fault.

Call to glShaderSource misaligned by single push (rbx). No fault occurred.


VirtualBox Debian GNU/Linux 11 (bullseye) Kernel 5.10.0-13-amd64 Gnome 3.38.5 Wayland

Call to SDL_SetVideoMode (misaligned by modification at start). Fault in libvulkan.so.1.2.162 instruction movaps [rsp], xmm0. No fault occurred with SDL_Init.

Run unmodified no fault occurred.

If I was clever enough to be able to handle exceptions maybe easier to test. However my own conclusion from this is that if passing control to 64-bit code that isn't mine then it's going to be a multiple of 16byte alignment of the stack.

Forgot to mention, what I saw doesn't look to me like having to do with speed.
Post 01 Jul 2022, 01:53
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18940
Location: In your JS exploiting you and your system
revolution 01 Jul 2022, 02:03
Using the SSE aligned instructions is purely for "speed" because there are perfectly functional unaligned variants that can be used instead. And those variants would work in all cases. But those variants could be "slower". So sadly in the quest for "speed" code becomes more complicated and difficult.
Post 01 Jul 2022, 02:03
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18940
Location: In your JS exploiting you and your system
revolution 02 Jul 2022, 00:07
I wrote:
VirtualBox Linux Mint 20.3 Cinnamon 5.2.7 Kernel 5.4.0-91-generic Gnome 3.36.9 x11

Call to SDL_Init (misaligned by modification at start). Fault in libc.so.6 (_IO_str?] instruction movaps [rsp], xmm0
The box my colleague tested on is Mint 18.3, so that would explain the difference.

Sometime in the last two versions it has been "upgraded" to use a more restricted and fragile calling method.
Post 02 Jul 2022, 00:07
View user's profile Send private message Visit poster's website Reply with quote
Alphonso



Joined: 16 Jan 2007
Posts: 295
Alphonso 02 Jul 2022, 01:43
revolution wrote:
Sometime in the last two versions it has been "upgraded" to use a more restricted and fragile calling method.
Isn't that scary?

The reference to speed was the fact that there only seemed to be a very small use of aligned instructions so wouldn't have expect a noticeable impact on performance.

A quick google shows VS 2015 changed from using movdqa to movdqu in VS 2017, sorry for the Windows ref in a Linux thread.
https://developercommunity.visualstudio.com/t/perf-regression-movdqu-instructions-are-generated/48123
Charles Fu wrote:
Basically, the decision we made is to reduce silent bad code gen scenarios caused by alignment fault with the consideration that the change won't have performance impact on modern CPU Since Nehalem.
Nehalem launched in 2008.

Wonder if another reason in Linux could be to weed out "silent bad code" with in your face crashes?

P.S. looks like "I" can use my old account again Smile This is going to be confusing.
Post 02 Jul 2022, 01:43
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18940
Location: In your JS exploiting you and your system
revolution 02 Jul 2022, 03:26
Alphonso wrote:
revolution wrote:
Sometime in the last two versions it has been "upgraded" to use a more restricted and fragile calling method.
Isn't that scary?
Maybe just misguided.

In any OoO CPU it is basically a waste of time to be counting clock cycles by examining things like Agner Fog's cycle count tables. It is a meaningless exercise. I see the use of the aligned instructions as a consequence of that thinking. Without proper benchmarking no one can know if it hurts or helps.

The code posted above showing the stack adjustments seems to nullify any possible savings (of which there were probably no savings anyway).

But it is what it is. Comply or die.
Post 02 Jul 2022, 03:26
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 1946
Furs 02 Jul 2022, 14:02
revolution wrote:
I wrote:
VirtualBox Linux Mint 20.3 Cinnamon 5.2.7 Kernel 5.4.0-91-generic Gnome 3.36.9 x11

Call to SDL_Init (misaligned by modification at start). Fault in libc.so.6 (_IO_str?] instruction movaps [rsp], xmm0
The box my colleague tested on is Mint 18.3, so that would explain the difference.

Sometime in the last two versions it has been "upgraded" to use a more restricted and fragile calling method.
It always did. It's part of the ABI and GCC defaults to it. The only thing that was "upgraded" was possibly the vectorization so it actually uses it. But it always aligned/assumed stuff is 16-byte aligned since inception.

So it didn't crash by pure luck because it did not use anything that required 16 byte alignment in those specific APIs.
Post 02 Jul 2022, 14:02
View user's profile Send private message Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 1946
Furs 02 Jul 2022, 14:04
revolution wrote:
Maybe just misguided.

In any OoO CPU it is basically a waste of time to be counting clock cycles by examining things like Agner Fog's cycle count tables. It is a meaningless exercise. I see the use of the aligned instructions as a consequence of that thinking. Without proper benchmarking no one can know if it hurts or helps.

The code posted above showing the stack adjustments seems to nullify any possible savings (of which there were probably no savings anyway).

But it is what it is. Comply or die.
Alignment is good, it doesn't matter if the CPU can handle unaligned accesses just fine (x86 can). For speed, that is. The problem with misaligned access is when it straddles a cache boundary. This can't happen with aligned access.

The ABI is terrible though. Mandating this alignment is a terrible idea. Instead, APIs that are "hot" (i.e. for performance) and make heavy use of such alignment should simply align the stack themselves. It's just 2 instructions ffs!

Why should everything pay the price for what is essential 0.01% of APIs that truly need this alignment for speed when those APIs could simply realign the stack themselves on entry? (most are not speed-critical so it's whatever)

The Microsoft x64 ABI is even worse with the shadow space. It makes me mad. Mad
Post 02 Jul 2022, 14:04
View user's profile Send private message Reply with quote
I



Joined: 19 May 2022
Posts: 33
I 03 Jul 2022, 00:46
If movdqu is as fast on aligned data as movdqa [insert revolution approved bench here] excepting very old processors and movdqu beats movdqa hands down on unaligned data (no bench required) then isn't movqdu the better choice?
Post 03 Jul 2022, 00:46
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18940
Location: In your JS exploiting you and your system
revolution 03 Jul 2022, 02:52
I think the original purpose of the dqa variants is for large swathes of data that are being processed all together. In such cases it is probably an awesome instruction to use. But for random single transfers during a call/ret function entry I think it is just being pointlessly pedantic to insist on absolute perfection with regard to stack usage, when it means the rest of the code has to be larger, and do more things, in an effort to comply with all the extra restrictions.


Last edited by revolution on 03 Jul 2022, 04:57; edited 1 time in total
Post 03 Jul 2022, 02:52
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 3511
Location: vpcmipstrm
bitRAKE 03 Jul 2022, 04:26
It's possible to follow the line of thought which informed the ABI parameters. Probably not the best metrics,
  • all c files in the specint2000
  • just produced the objects, not linked with library
Jan Hubicka still works on the compiler.

Does a better performing general ABI exist?

_________________
¯\(°_o)/¯ unlicense.org
Post 03 Jul 2022, 04:26
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2, 3

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.

Website powered by rwasa.