flat assembler
Message board for the users of flat assembler.

(stupid ??) alignment question ??

fabbel 16 Jan 2026, 13:11
Hi
can anyone please shed light on this (potentially stupid ??) alignment question, for which I can't seem to find any satisfying answer myself:
considering the memory layout of data, I am kinda assuming the "natural alignment" of a (simple) data type is always a power of 2, typically:
* db (byte)
=> byte size = 1 (8 bits) => "natural align." =1

* dw (word)
=> byte size = 2 (16 bits) => "natural align." =2

* dq (qword)
=> byte size = 8 (64 bits) => "natural align." =8
...

but what about those then ??
* dp (pword)
=> byte size = 6 (48 bits)
=> "natural align." = ?? 8 (next power of 2 above) ? or 2 (greatest power-of-2 divisor, 6 = 3x2) ? .... or something else ?

* dt (tbyte / tword)
=> byte size = 10 (80 bits)
=> "natural align." = ?? 16 (next power of 2 above) ? 2 (greatest power-of-2 divisor, 10 = 5x2) ? .... or something else ?

... is my assumption simply wrong? am I missing something ?
... the appropriate layout may depend on platform / access logic / scheme I suppose, but is there any generally accepted convention, typically when building some structure (or memory array) including such field(s) ?
revolution 16 Jan 2026, 13:37
Fword/pword/tword have no "natural" alignment.

I see tword is often aligned to 16, like a dqword.

x86 allows all alignments (and mis-alignments) anyway, so they can be placed anywhere. Some CPUs can utilise the cache line to swizzle the bytes into place with no penalty. So on those CPUs 3 twords could be placed into each 32-byte cache line.

Try them in various configurations and see how much difference can be noticed.
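
As a rough illustration of the two layouts mentioned above (one tword per 16-byte slot versus three twords packed into 32 bytes), here is a minimal fasm sketch. The label names padded_twords, packed_twords and packed_group2 are hypothetical, and neither layout is claimed to be the better one; that still has to be measured.

```asm
format binary

; Layout A (assumption: padded): one tword per 16-byte slot, like a dqword
align 16
padded_twords:
        dt 1.0
        rb 6                    ; pad the 10-byte tword up to 16 bytes
        dt 2.0
        rb 6

; Layout B (assumption: packed): three twords per 32-byte block
align 32
packed_twords:
        dt 1.0, 2.0, 3.0        ; 3 * 10 = 30 bytes
        align 32                ; 2 bytes of padding to the next block
packed_group2:
        dt 4.0, 5.0, 6.0

; assembly-time checks of the layout arithmetic
assert padded_twords mod 16 = 0
assert packed_twords mod 32 = 0
assert packed_group2 - packed_twords = 32
```

Layout A spends 6 bytes of padding per value; layout B spends 2 bytes per three values but leaves individual twords at offsets 10 and 20 within each block.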
Jessé 16 Jan 2026, 17:44
This is actually a very good question. Alignment affects the performance of data access and, depending on the implementation, a misaligned access can even raise exception #AC (when FLAGS.AC=1 and cr0.AM=1).
I found something in Intel's Intel 64 and IA-32 manual, attached.


[Attached image: Captura de imagem_20260116_143622.png]



revolution 16 Jan 2026, 18:27
Which OSes set the AC flag? I know that Windows and Linux don't. What does the old x86 MacOS do?

But there is a situation where alignment is mandatory, even without the AC flag set. FXSAVE:
"The destination operand contains the first byte of the memory image, and it must be aligned on a 16-byte boundary."
And the same for the associated FXRSTOR.
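
A minimal sketch of what that requirement looks like in fasm, assuming a Linux ELF64 target like the test program later in this thread. The label fx_area is hypothetical; the 512-byte size is the size of the FXSAVE memory image.

```asm
format elf64 executable at 1 shl 16

SYS64_exit_group        = 231

segment readable executable

entry   $
        fxsave  [fx_area]       ; faults if fx_area is not 16-byte aligned
        fxrstor [fx_area]       ; same requirement when restoring
        xor     edi,edi         ; exit status 0
        mov     eax,SYS64_exit_group
        syscall

segment readable writeable

align 16

fx_area         rb 512          ; FXSAVE/FXRSTOR memory image

assert fx_area mod 16 = 0       ; assembly-time check of the alignment
```

Removing the align 16 and placing an odd number of bytes before fx_area should make the assert fail at assembly time, before the fault would ever be seen at run time.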
AsmGuru62 17 Jan 2026, 00:19
It would be interesting to write some code where a TBYTE is aligned and misaligned.
Then measure the time of millions of loads and stores into these fields.
I wonder which will be faster, or whether the result will be inconclusive?
revolution 17 Jan 2026, 04:32
When a storage element crosses a cache line, that can sometimes show a noticeable drop in performance. Mis-alignment within a cache line might not show any difference.

Try it.
revolution 17 Jan 2026, 17:13
revolution wrote:
"Try it."
Okay. :P
Code:
format elf64 executable at 1 shl 16

CACHE_LINE_SIZE         = 64
TWORD_LOOPS             = 1 shl 28

SYS64_write             = 1
SYS64_clock_gettime     = 228
SYS64_exit_group        = 231
CLOCK_PROCESS_CPUTIME_ID= 2

segment executable

entry   $
        ; get the OS to switch the CPU out of low power mode
        mov     eax,CACHE_LINE_SIZE
        mov     ecx,TWORD_LOOPS
        mov     esi,0
        call    read_twords
        ; the main loop
        mov     ebp,CACHE_LINE_SIZE
    .loop:
        mov     eax,SYS64_clock_gettime
        mov     edi,CLOCK_PROCESS_CPUTIME_ID
        lea     rsi,[timespec1]
        syscall
        mov     eax,CACHE_LINE_SIZE
        mov     ecx,TWORD_LOOPS
        lea     esi,[ebp - 1]
        call    read_twords
        mov     eax,SYS64_clock_gettime
        mov     edi,CLOCK_PROCESS_CPUTIME_ID
        lea     rsi,[timespec2]
        syscall
        lea     eax,[ebp - 1]
        call    print_decimal
        mov     al,' '
        call    print_char
        mov     rcx,[timespec1]
        imul    rcx,1000000000
        add     rcx,[timespec1+8]
        mov     rax,[timespec2]
        imul    rax,1000000000
        add     rax,[timespec2+8]
        sub     rax,rcx
        call    print_decimal
        mov     al,0xa
        call    print_char
        sub     ebp,1
        jnz     .loop
        mov     eax,SYS64_exit_group
        syscall

read_twords:
        ; rax = stride
        ; rcx = count
        ; rsi = offset
        lea     rsi,[test_data+rsi]
        fninit
    .loop:
        sub     rcx,1
        fld     tword[rsi]
        fld     tword[rsi+rax]
        fucompp
        jnz     .loop
        ret

print_decimal:
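        ; rax = value
        ; note: the digits are stored just below rsp (in the red zone),
        ; not in the 32 bytes reserved here
        sub     rsp,32
        mov     ecx,10
        mov     rsi,rsp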
    .loop:
        xor     edx,edx
        div     rcx
        sub     rsi,1
        add     dl,'0'
        mov     [rsi],dl
        test    rax,rax
        jnz     .loop
        mov     edi,1
        mov     rdx,rsp
        sub     rdx,rsi
        mov     eax,SYS64_write
        syscall
        add     rsp,32
        ret

print_char:
        ; al = char
        push    rax
        mov     edi,1
        mov     rsi,rsp
        mov     rdx,1
        mov     eax,SYS64_write
        syscall
        pop     rax
        ret

segment readable writeable

align 64

test_data       db      64 * 3 dup (0xc0)       ; make some valid tword floats

align 8

timespec1       rq 2
timespec2       rq 2    
Code:
~ fasm tword-alignment.asm && ./tword-alignment 
flat assembler  version 1.73.31  (16384 kilobytes memory)
3 passes, 704 bytes.
63 274265389
62 274554841
61 275768167
60 276006914
59 270888481
58 272654384
57 272630350
56 207724707
55 279166600
54 208209127
53 205075196
52 206313466
<...>
0 207314691    
Two notable slowdowns there. The first at offset 55, and the second at offsets 57-63. Clearly the CPU is reading the tword in two parts: An 8-byte read, followed by a 2-byte read.

Random alignments within a cache line never caused a timing difference. Only when crossing a cache line is there a ~35% penalty. And it only slowed when the cache line boundary was crossed by one of the two separate parts of the tword. At offset 56 the boundary falls exactly between the two parts, so neither part crosses it and there is no slowdown.

Anyhow, that is on this system, with this specific code, on this specific OS, as reported by the Linux SYS64_clock_gettime syscall. Other systems, other code, other OSes, other syscalls, will probably give entirely different results.
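
The offsets can be cross-checked with a little assembly-time arithmetic. This is only a sketch that assumes the 8-byte plus 2-byte split inferred above; the names ofs, q_last, w_first, w_last, slow and slow_count are made up for the illustration. It produces no output bytes, it only displays the offsets at which one of the two parts would straddle a 64-byte line.

```asm
CACHE_LINE_SIZE = 64
slow_count = 0

repeat CACHE_LINE_SIZE
        ofs = % - 1
        q_last  = ofs + 7       ; assumed 8-byte part: bytes ofs .. ofs+7
        w_first = ofs + 8       ; assumed 2-byte part: bytes ofs+8 .. ofs+9
        w_last  = ofs + 9
        slow = 0
        ; a part straddles a line when its first and last byte
        ; fall into different cache lines
        if ofs / CACHE_LINE_SIZE <> q_last / CACHE_LINE_SIZE
                slow = 1
        end if
        if w_first / CACHE_LINE_SIZE <> w_last / CACHE_LINE_SIZE
                slow = 1
        end if
        if slow = 1
                slow_count = slow_count + 1
                d = '0' + ofs / 10
                display d
                d = '0' + ofs mod 10
                display d,' '
        end if
end repeat
display 13,10

assert slow_count = 8           ; offset 55 plus offsets 57 to 63
```

Under that assumption the displayed offsets should be 55 and 57 to 63, which is the same set that shows the slowdown in the timings, with 56 left out because the line boundary falls between the two parts.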
AsmGuru62 17 Jan 2026, 17:57
Nice research!
revolution 18 Jan 2026, 08:15
A bit of visualisation never hurts.


[Attached image: tword-alignment.png]


Jessé 18 Jan 2026, 15:20
I ran a test using a benchmark idea that I posted here in another thread, and the difference is negligible. The average is the same, although the TSC run length (as I called it there) is always longer on unaligned access.

The benchmark code that I placed in the 'test area':

Code:
                                   ; **********************************
                                   ; ***    Start Benchmark code    ***

                                   finit
                                   fldz
                                   fldz
                                   fstp      [align_tw]
                                   fstp      [unalign_tw]

                                   lfence
                                   rdtsc
                                   push      rdx
                                   push      rax

                                   ; ### Tested code goes here ###
                              @@@  fld       [unalign_tw]
                                   fld1
                                   faddp
                                   fstp      [unalign_tw]
                                   ; ### End tested code       ###

                                   dec       r12
                                   jnz       @@b

                              @@   mfence
                                   rdtsc
                                   pop       rbp
                                   pop       rcx

                                   ; ***     End Benchmark code     ***
                                   ; **********************************

    
revolution 18 Jan 2026, 16:14
What is the alignment of unalign_tw? Is it (55 or 57-63) mod 64?