flat assembler
Message board for the users of flat assembler.

Aligned vs unaligned

Andy
Joined: 17 Oct 2011
Posts: 35
I wrote this basic code to XOR each byte of a buffer. [esp+4] is a pointer to the buffer and [esp+8] is the length of the buffer.
Code:
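mov esi, [esp + 4]   ; esi = pointer to the buffer
mov ecx, [esp + 8]   ; ecx = length in bytes (must not be 0)
next:
lodsb                ; al = [esi], then esi = esi + 1 (with DF clear)
xor al, 127
mov [esi-1], al      ; store the result back in place
dec ecx
jnz next
ret 8                ; stdcall: remove both arguments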


It works pretty well, but since I am interested in learning, I thought it would be a good idea to rewrite it using AVX or AVX2 to process more than one byte at a time. So I read a little about AVX instructions, but I am really confused about what aligned and unaligned mean, for AVX or in general.

For example, the vlddqu instruction loads 256 bits of integer data from unaligned memory into the destination register, while the vmovdqa instruction loads 256 bits of integer data from memory into the destination register and requires the address to be aligned on a 32-byte boundary. Can someone please clear things up a little and explain what this means?
Post 04 May 2022, 19:54

revolution
Joined: 24 Aug 2004
Posts: 18624
Location: In your JS exploiting you and your system
Aligned means the memory address is a multiple of the data size.
Code:
org 0x5678
dd 0x1234 ; aligned. ($ mod 4) == 0

org 0x5679
dd 0x1234 ; unaligned. ($ mod 4) == 1    
For 256-bit alignment the memory address must have the 5 lowest bits as zero.
Code:
; ... some code/data here of unknown length
align 32 ; align to 256-bits
; ... aligned data goes here    
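
The same rule can be tested at run time. This is a minimal sketch, assuming the address to check is already in esi; the label names are made up for illustration.

```asm
; assumption: esi holds the address to test
    test esi, 31          ; isolate the 5 lowest address bits
    jz   is_aligned       ; all five are zero: esi is a multiple of 32
not_aligned:
    ; esi mod 32 is 1..31, so a 32-byte aligned access is not allowed here
    jmp  done
is_aligned:
    ; safe to use instructions that require 32-byte alignment
done:
```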
Post 04 May 2022, 21:58

Andy
Is there an advantage or disadvantage to using one or the other? I assume there must be, otherwise it would be easier to simply use unaligned memory addresses everywhere.
Post 04 May 2022, 23:17

revolution
The hardware can sometimes make aligned access faster, often using a single transfer.

If the data are unaligned then the memory transactions might need to be split into multiple transfers.
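
One common case of a split is an access that straddles two cache lines. The sketch below assumes 64-byte cache lines (typical, but an assumption here, not something stated above) and a 32-byte access at the address in esi; the labels are hypothetical.

```asm
; assumptions: esi = address of a 32-byte access, cache lines are 64 bytes
    mov  eax, esi
    and  eax, 63          ; offset of the first byte inside its cache line
    cmp  eax, 32
    ja   split_access     ; offset 33..63: the last bytes spill into the next line
    ; offset 0..32: all 32 bytes sit in one line
    jmp  checked
split_access:
    ; the access touches two cache lines and may need two transfers
checked:
```

Under the same assumption, a 32-byte-aligned address always has an offset of 0 or 32, so it never takes the split path.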
Post 04 May 2022, 23:24

macomics
Joined: 26 Jan 2021
Posts: 555
Location: Russia
Given that the original loop works with bytes, there is no alignment problem. A byte is the smallest addressable unit of memory.

It is worth thinking about alignment once you decide to reduce the number of loop iterations by processing several elements in one pass. For example:
Code:
    mov edi, [esp + 4]
    mov ecx, [esp + 8]
    test edi, 1
    jz sk
    xor byte [edi], 127 ; unaligned head byte
    scas byte [edi] ; advance edi by 1 (inc or dec depending on EFLAGS.DF)
    dec ecx
    jz ex
sk:
    mov al, cl ; keep the low bit of the length for the tail test
    shr ecx, 1 ; ecx = number of whole words
    jz en
lp:
    xor word [edi], 127 or ( 127 shl 8 ) ; aligned words
    scas word [edi] ; advance edi by 2
    loop lp
en:
    and al, 1
    jz ex
    xor byte [edi], 127 ; tail byte
ex:
    retn 8 ; stdcall, like the original routine
The loop now runs half as many iterations. You can do even more if you use other (wider) registers. However, even in this example, alignment becomes important.
In 16-bit programs, alignment ensured that you would not access data crossing a segment boundary. There is no such problem in 32-bit and 64-bit programs using the flat memory model (or long mode).
But now alignment is needed so that the processor works optimally with its cache, which greatly speeds up data access (see revolution's post above). When working with single bytes, however, alignment gains nothing.
An unaligned access can also straddle a memory page boundary. This is not the same as with segments, but the next page may be unmapped, which causes a memory access error (for example 0C0000005h).
None of the situations described above can occur for data longer than one byte if the data address is aligned to the data's length, e.g. word by 2, dword by 4, fword/pword by 8, qword by 8, etc.
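
As an illustration of that last rule, fasm's align directive can place each item on a multiple of its own size. The names below are hypothetical, and the sketch assumes the section containing the data itself starts on a boundary of at least 32 bytes.

```asm
; hypothetical data items, each aligned to its own size
align 2
w16   dw 0            ; word  on a multiple of 2
align 4
d32   dd 0            ; dword on a multiple of 4
align 8
q64   dq 0            ; qword on a multiple of 8
align 32
y256  rb 32           ; 32-byte buffer for 256-bit (ymm) access
```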
Post 05 May 2022, 00:26

revolution
In the case of AVX and the unaligned/aligned instructions:

For aligned data, some CPU implementations show no difference in access timing between the two instructions, so on those CPUs it makes no difference which one you choose. Some other CPUs may impose a small penalty of a cycle or two on the unaligned instruction, which may or may not matter depending upon your application.

For unaligned data, the unaligned instruction is guaranteed to work, even if it requires multiple accesses, whereas the aligned instruction will fault.

In general the unaligned instructions are the easiest to use and give little to no penalty.
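
A short sketch of the two forms side by side, assuming esi points to at least 32 readable and writable bytes; the label is made up for illustration.

```asm
; assumption: esi points to at least 32 readable and writable bytes
    vmovdqu ymm0, yword [esi]   ; unaligned load, works for any address in esi
    vmovdqu yword [esi], ymm0   ; unaligned store, also works for any address

    ; the aligned forms need esi to be a multiple of 32,
    ; otherwise the CPU raises a general-protection exception
    test    esi, 31
    jnz     skip_aligned
    vmovdqa ymm0, yword [esi]
    vmovdqa yword [esi], ymm0
skip_aligned:
```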
Post 05 May 2022, 02:07

Andy
Many thanks guys, now things are more clear.
Post 05 May 2022, 18:52

Andy
One more question. I tried to use some AVX/AVX2 instructions to rewrite the code above, but I get a strange 0xC000001D STATUS_ILLEGAL_INSTRUCTION (EXCEPTION_ILLEGAL_INSTRUCTION) error. As far as I understand, vpbroadcastb is an AVX2 instruction and my CPU supports AVX2, so what am I missing here? Does this error mean I cannot use this instruction?

Quote:
Opcode/Instruction: VEX.256.66.0F38.W0 78 /r VPBROADCASTB ymm1, xmm2/m8
Op/En: A
64/32-bit mode support: V/V
CPUID feature flag: AVX2
Description: Broadcast a byte integer in the source operand to thirty-two locations in ymm1.


Code:
mov esi, [esp + 4]
mov ecx, [esp + 8]
mov al, 127
vpbroadcastb ymm2, al
;next:
;cmp ecx, 256
;jl exit
;vpxor ymm1, ymm2, yword[esi]
;vmovdqu yword[esi], ymm1
;sub ecx, 256
;add esi, 256
;jmp next
;exit:
ret 8    


Attachment: cpu.jpg (83.81 KB)

Post 05 May 2022, 19:58

macomics
Andy wrote:
VEX.256.66.0F38.W0 78 /r VPBROADCASTB ymm1, xmm2/m8
The source must be a memory operand (byte [source_op]) or an XMM register, not a general-purpose register (al).
Intel SDM 2c wrote:
VEX256-encoded VPBROADCASTB/W/D/Q: The source operand is 8-bit, 16-bit, 32-bit, 64-bit memory location or
the low 8-bit, 16-bit 32-bit, 64-bit data in an XMM register. The destination operand is a YMM register.

Code:
push 127
vpbroadcastb ymm2, byte [esp]    
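
Note that the push above leaves an extra dword on the stack, which has to be removed before returning. The following sketch shows two ways to load the constant with AVX2-only encodings; both variants are illustrations, not the only options.

```asm
; variant 1: through the stack, keeping esp balanced
    push 127
    vpbroadcastb ymm2, byte [esp]   ; memory source (m8) is valid AVX2
    add  esp, 4                     ; drop the temporary dword again

; variant 2: through an XMM register
    mov  eax, 127
    vmovd xmm2, eax                 ; low byte of xmm2 = 127
    vpbroadcastb ymm2, xmm2         ; xmm source is valid AVX2
```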
Post 05 May 2022, 20:33

Andy
Thank you, it works. It's so frustrating being a beginner and missing obvious details.

PS: as I said, little details; in the code above I subtract 256 from ecx and add 256 to esi when it should be 32 bytes.
Post 05 May 2022, 21:00

macomics
Andy wrote:
in the code above I subtract 256 from ecx and add 256 to esi when it should be 32 bytes
256 is the width in bits, not bytes: 256 bits / 8 bits per byte = 32 bytes. The address should be aligned to 32.
Code:
    test esi, 31
    jz ok
align_loop:
    xor byte [esi], 127
    lods byte [esi]
    test esi, 31
    loopnz align_loop
ok:
    mov dl, cl
    shr ecx, 5
    jz en
    vpbroadcastb ymm2, byte [align_loop + 2] ; dont ask, lazy
lp: ; esi aligned on 256-bits or 32-bytes
    ...
    loop lp
en:
    and dl, 31
    jz ex
    mov cl, dl ; ecx = 0 after loop e.g ecx = dl
tail_lp:
    xor byte [esi], 127
    lods byte [esi]
    loop tail_lp
ex:
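
For reference, here is one possible complete routine built on the same head/body/tail idea. It is a separate sketch, not the code elided with "..." above, and it assumes the stdcall layout used earlier in the thread ([esp+4] = buffer, [esp+8] = length) and a CPU and OS with AVX2 support. The name xor_buffer is hypothetical. Only eax, ecx and edx are modified.

```asm
; sketch: xor every byte of a buffer with 127, 32 bytes at a time
; assumptions: stdcall, [esp+4] = buffer, [esp+8] = length, AVX2 available
xor_buffer:
    mov  edx, [esp + 4]          ; edx = current address
    mov  ecx, [esp + 8]          ; ecx = bytes left
    test ecx, ecx
    jz   .done                   ; nothing to do for an empty buffer
.head:                           ; single bytes until edx is 32-byte aligned
    test edx, 31
    jz   .aligned
    xor  byte [edx], 127
    inc  edx
    dec  ecx
    jnz  .head
    jmp  .done                   ; buffer ended before reaching alignment
.aligned:
    mov  eax, ecx
    and  eax, 31                 ; eax = tail length (bytes left mod 32)
    shr  ecx, 5                  ; ecx = number of full 32-byte blocks
    jz   .tail
    push 127
    vpbroadcastb ymm2, byte [esp] ; ymm2 = 32 copies of 127
    add  esp, 4                  ; rebalance the stack
.body:
    vpxor   ymm1, ymm2, yword [edx]
    vmovdqa yword [edx], ymm1    ; aligned store is safe: edx is a multiple of 32
    add  edx, 32
    dec  ecx
    jnz  .body
    vzeroupper
.tail:
    test eax, eax
    jz   .done
.tail_byte:
    xor  byte [edx], 127
    inc  edx
    dec  eax
    jnz  .tail_byte
.done:
    ret  8
```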
    


Last edited by macomics on 05 May 2022, 22:13; edited 5 times in total
Post 05 May 2022, 21:51

tthsqe
Joined: 20 May 2009
Posts: 739
Illegal instruction exception? Why does fasm allow you to assemble an illegal instruction?
Post 05 May 2022, 22:08

tthsqe
Also, for what it is worth, I saw a 15% penalty in my application for unaligned 32-byte reads/writes (from/to actually unaligned addresses that are only multiples of 8) on *older* AVX2 processors. The penalty on newer ones was around 5%, so just barely noticeable.
Post 05 May 2022, 22:15

macomics
tthsqe wrote:
Illegal instruction exception? Why does fasm allow you to assemble an illegal instruction?
Who wrote AVX.INC?
Post 05 May 2022, 22:16

tthsqe
Not I. Also, if the forum shows a "Cool" smiley in my last post, it was meant to be "8)".
Post 05 May 2022, 22:18

Andy
macomics wrote:
Andy wrote:
in the code above I subtract 256 from ecx and add 256 to esi when it should be 32 bytes
256 is the width in bits, not bytes: 256 bits / 8 bits per byte = 32 bytes. The address should be aligned to 32.
Code:
    test esi, 31
    jz ok
align_loop:
    xor byte [esi], 127
    lods byte [esi]
    test esi, 31
    loopnz align_loop
ok:
    mov dl, cl
    shr ecx, 5
    jz en
    vpbroadcastb ymm2, byte [align_loop + 2] ; dont ask, lazy
lp: ; esi aligned on 256-bits or 32-bytes
    ...
    loop lp
en:
    and dl, 31
    jz ex
    mov cl, dl ; ecx = 0 after loop e.g ecx = dl
tail_lp:
    xor byte [esi], 127
    lods byte [esi]
    loop tail_lp
ex:
    


Probably not the best way but this is how I did it:
Code:
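mov esi, [esp + 4]              ; esi = pointer to the buffer
mov ecx, [esp + 8]              ; ecx = length in bytes
push 127
vpbroadcastb ymm2, byte [esp]   ; ymm2 = 32 copies of 127
pop ebx                         ; ebx = 127, stack balanced again
next:
cmp ecx, 32
jb single                       ; unsigned compare: fewer than 32 bytes left
vpxor ymm1, ymm2, yword[esi]    ; a VEX memory operand may be unaligned
vmovdqu yword[esi], ymm1
sub ecx, 32
add esi, 32
jmp next
single:
cmp ecx, 0
je exit
lodsb
xor al, bl
mov [esi-1], al
dec ecx
jnz single
exit:
ret 8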
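
One addition that is commonly recommended for routines that use ymm registers is vzeroupper before returning, because some CPUs penalize mixing 256-bit AVX code with legacy SSE code that may run afterwards. A sketch of the changed ending, assuming the same exit label as above:

```asm
exit:
    vzeroupper        ; zero the upper halves of all ymm registers
    ret 8
```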


Can you please explain the purpose of these sections?
Code:
ok:
    mov dl, cl
    shr ecx, 5
    jz en
    ...
en:
    and dl, 31
    jz ex
    mov cl, dl ; ecx = 0 after loop e.g ecx = dl
    
Post 06 May 2022, 00:13

Tomasz Grysztar
Joined: 16 Jun 2003
Posts: 8000
Location: Kraków, Poland
tthsqe wrote:
Illegal instruction exception? Why does fasm allow you to assemble an illegal instruction?
It's an AVX-512 instruction.

You can peel the layers with the help of fasmg, where you can include instruction sets selectively:
Code:
include 'cpu/p6.inc'
include 'cpu/ext/avx2.inc'

vpbroadcastb ymm2, xmm0
vpbroadcastb ymm2, al    
Code:
flat assembler  version g.jmhx
test.asm [5]:
        vpbroadcastb ymm2, al
macro vpbroadcastb? [9]
Custom error: invalid combination of operands.    
But:
Code:
include 'cpu/p6.inc'
include 'cpu/ext/avx512.inc'

vpbroadcastb ymm2, xmm0
vpbroadcastb ymm2, al    
generates:
Code:
C4 E2 7D 78 D0           vpbroadcastb ymm2, xmm0
62 F2 7D 28 7A D0        vpbroadcastb ymm2, al    
and it's the same as what you get with fasm 1.
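
The second encoding starts with 62h, the EVEX prefix, so it only executes on CPUs with AVX-512 (for this form, AVX512BW together with AVX512VL). The sketch below tests the relevant CPUID feature bits. The labels are hypothetical, and a complete check would also confirm operating system support through OSXSAVE and XGETBV, which is left out here.

```asm
; assumption: the CPU reports a maximum basic CPUID leaf of at least 7
    push ebx                    ; cpuid overwrites ebx
    mov  eax, 7
    xor  ecx, ecx               ; subleaf 0
    cpuid
    mov  eax, ebx               ; keep the feature bits in a scratch register
    pop  ebx
    test eax, 1 shl 5           ; EBX bit 5 = AVX2
    jz   no_avx2
    test eax, 1 shl 30          ; EBX bit 30 = AVX512BW
    jz   avx2_only
    test eax, 1 shl 31          ; EBX bit 31 = AVX512VL
    jz   avx2_only
    ; the EVEX form (vpbroadcastb ymm2, al) is supported by the CPU
    jmp  feature_done
avx2_only:
    ; use the m8 or xmm source forms of vpbroadcastb
    jmp  feature_done
no_avx2:
    ; fall back to the byte loop
feature_done:
```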
Post 06 May 2022, 09:13

tthsqe
AVX-512 is almost dead, and it is a shame to accidentally use one of these instructions.
fasmg looks nice by the way, especially if CALM is working the way it seems to. Will have to try it soon.
Post 06 May 2022, 09:52

macomics
Andy wrote:
Can you please explain the purpose of these sections?
Code:
ok:
; S - Size bits
; s - Size bits, lower byte
; r - Remainder bits, lower byte
;ecx = SSSSSSSSSSSSSSSSSSSSSSSSsssrrrrr
    mov dl, cl; save lower 8 bits, remainder
;dl = sssrrrrr
    shr ecx, 5; ecx = ecx / 32
;ecx/32 = 00000SSSSSSSSSSSSSSSSSSSSSSSSsss
    jz en; if (ecx / 32 = 0) skip fast loop
lp:
    ...
    add esi, 32 ; you can't do that with a pointer
    loop lp ; ecx = ecx - 1 # after division, you can subtract 1.
; ecx = 0 after loop
en:
;dl = sssrrrrr
    and dl, 00011111b; 31; remainder, lower 5 bits
;dl = 000rrrrr
    jz ex; no tail
    mov cl, dl ; ecx = 0 after loop e.g ecx = dl
;ecx = 000000000000000000000000000rrrrr # tail loop counter    
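
The same split with concrete numbers, as a sketch with an assumed length of 100 bytes:

```asm
; worked example with an assumed length of 100 bytes (01100100b)
    mov ecx, 100
    mov dl, cl          ; dl  = 01100100b = 100
    shr ecx, 5          ; ecx = 3 -> three full 32-byte blocks (96 bytes)
    and dl, 31          ; dl  = 4 -> four tail bytes
    movzx ecx, dl       ; ecx = 4, ready as the tail loop counter
```

In the real routine ecx is already 0 after the block loop, so mov cl, dl gives the same result as the movzx here.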
Post 06 May 2022, 09:56

Andy
macomics wrote:
Code:
    add esi, 32 ; you can't do that with a pointer
    


Why? Isn't this the way other instructions work?

For example lodsb
Quote:
Loads a byte, word, or doubleword from the source operand into the AL, AX, or EAX register, respectively. After the byte, word, or doubleword is transferred from the memory location into the AL, AX, or EAX register, the (E)SI register is incremented or decremented automatically according to the setting of the DF flag in the EFLAGS register.


And would it be better to keep the offset in a different register and use it like this?

So instead of
Code:
vpxor ymm1, ymm2, yword[esi]
vmovdqu yword[esi], ymm1
sub ecx, 32
add esi, 32    


it would be something like
Code:
vpxor ymm1, ymm2, yword[esi + edx]
vmovdqu yword[esi + edx], ymm1
sub ecx, 32
add edx, 32    
Post 06 May 2022, 12:05

Copyright © 1999-2020, Tomasz Grysztar.