flat assembler
Message board for the users of flat assembler.
revolution 04 May 2022, 21:58
Aligned means the memory address is a multiple of the data size.
Code:
    org 0x5678
    dd 0x1234 ; aligned. ($ mod 4) == 0
    org 0x5679
    dd 0x1234 ; unaligned. ($ mod 4) == 1

Code:
    ; ... some code/data here of unknown length
    align 32 ; align to 256-bits
    ; ... aligned data goes here
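For readers who think in C, the two ideas in this post (testing whether an address is a multiple of the data size, and rounding up to the next such multiple as `align 32` does) could be sketched roughly as follows. The function names `is_aligned` and `align_up` are hypothetical, and the sketch assumes the size is a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when addr is a multiple of size.
   Assumption: size is a power of two (2, 4, 8, 16, 32, ...). */
bool is_aligned(uintptr_t addr, uintptr_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (addr & (size - 1)) == 0;
}

/* Round addr up to the next multiple of size, similar in spirit to
   what `align 32` does to the current address in the assembler.
   An address that is already aligned is returned unchanged. */
uintptr_t align_up(uintptr_t addr, uintptr_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (addr + size - 1) & ~(size - 1);
}
```

With the addresses from the example above, `is_aligned(0x5678, 4)` holds and `is_aligned(0x5679, 4)` does not.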
Andy 04 May 2022, 23:17
Is there an advantage/disadvantage to using one or the other? I assume there must be something, otherwise it would be easier to simply use unaligned memory addresses.
revolution 04 May 2022, 23:24
The hardware can sometimes make aligned access faster, often using a single transfer.
If the data are unaligned then the memory transactions might need to be split into multiple transfers.
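One way to picture the "multiple transfers" point is to count how many fixed-size lines a single access touches. The sketch below is only an illustration: the 64-byte `LINE_SIZE` is an assumption (the thread does not state any transfer or cache line size), and `lines_touched` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed line size in bytes; not taken from the thread. */
#define LINE_SIZE 64u

/* Number of LINE_SIZE-byte lines touched by an access of `size`
   bytes starting at `addr`. A result above 1 means the access
   straddles a line boundary and may need more than one transfer. */
size_t lines_touched(uintptr_t addr, size_t size)
{
    assert(size > 0);
    uintptr_t first = addr / LINE_SIZE;
    uintptr_t last = (addr + size - 1) / LINE_SIZE;
    return (size_t)(last - first + 1);
}
```

Under that assumption, a 32-byte access whose address is a multiple of 32 always stays within one line, while the same access at an arbitrary address may touch two.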
macomics 05 May 2022, 00:26
Given that the original loop works with bytes, there is no alignment problem. A byte is the smallest addressable sequence of bits.

It is worth thinking about alignment once you decide to reduce the number of loop iterations by processing several elements in one pass. For example:

Code:
        mov edi, [esp + 4]
        mov ecx, [esp + 8]
        test edi, 1
        jz sk
        xor byte [edi], 127 ; unaligned head byte
        scas byte [edi] ; inc or dec depending EFL.DF
        dec ecx
        jz ex
    sk:
        mov al, cl
        shr ecx, 1
        jz en
    lp:
        xor word [edi], 127 or ( 127 shl 8 ) ; aligned words
        scas word [edi]
        loop lp
    en:
        and al, 1
        jz ex
        xor byte [edi], 127 ; tail byte
    ex:
        retn

In 16-bit programs, this ensured that you would not access data crossing a segment boundary. There is no such problem in 32-bit and 64-bit programs using the flat or long memory model. But now alignment is needed so that the processor works optimally with its cache memory, which greatly speeds up data access (see revolution's post above). However, when working with bytes, alignment gains nothing.

An access can also land on the border between memory pages. This is not the same as with segments, but the next page may be missing, which will cause a memory access error (for example 0C0000005h).

None of the situations described above can occur for data longer than one byte if the data address is aligned to the data length, e.g. word by 2, dword by 4, fword/pword by 8, qword by 8, etc.
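The head/body/tail structure of that routine can also be written out in C, which may make the three phases easier to follow. This is a rough sketch and not a translation of the exact instructions: `xor_buffer` is a hypothetical name, the key is passed as a parameter instead of the fixed 127, and `memcpy` is used for the word accesses only to stay within C's aliasing rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* XOR every byte of buf with key in three phases:
   an optional head byte, aligned 16-bit words, an optional tail byte. */
void xor_buffer(uint8_t *buf, size_t len, uint8_t key)
{
    assert(buf != NULL || len == 0);

    /* Head: if the address is odd, handle one byte so that the
       word loop starts on an even (2-byte aligned) address. */
    if (len > 0 && ((uintptr_t)buf & 1) != 0) {
        *buf++ ^= key;
        len--;
    }

    /* Body: len / 2 aligned words, with the key repeated in both bytes
       (the counterpart of 127 or (127 shl 8)). */
    uint16_t key2 = (uint16_t)(key | (key << 8));
    for (size_t words = len >> 1; words > 0; words--) {
        uint16_t w;
        memcpy(&w, buf, sizeof w);
        w ^= key2;
        memcpy(buf, &w, sizeof w);
        buf += 2;
    }

    /* Tail: one byte is left over when the remaining length was odd. */
    if (len & 1)
        *buf ^= key;
}
```

Calling `xor_buffer(p, n, 127)` twice on the same range restores the original bytes, which is a convenient way to check the three phases against each other.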
revolution 05 May 2022, 02:07
In the case of AVX and the unaligned/aligned instructions:
For aligned data, some CPU implementations show no difference in access timing, so on those CPUs it makes no difference which instruction you choose. However, some other CPUs may impose a small penalty of a cycle or two, which may or may not make a difference depending upon your application.

For unaligned data, the unaligned instruction is guaranteed to work, even if it requires multiple accesses, but the aligned instruction will fail with an exception.

In general the unaligned instructions are the easiest to use and give little to no penalty.
Andy 05 May 2022, 18:52
Many thanks guys, now things are more clear.
Andy 05 May 2022, 19:58
One more question. I tried to use some AVX/2 instructions to rewrite the code above, but I get this strange error: 0xC000001D STATUS_ILLEGAL_INSTRUCTION / EXCEPTION_ILLEGAL_INSTRUCTION. As far as I understand, this is an AVX2 instruction and my CPU supports AVX2, so what am I missing here? Does this error mean I cannot use this instruction?

Quote:
VEX.256.66.0F38.W0 78 /r VPBROADCASTB ymm1, xmm2/m8

Code:
        mov esi, [esp + 4]
        mov ecx, [esp + 8]
        mov al, 127
        vpbroadcastb ymm2, al
        ;next:
        ;cmp ecx, 256
        ;jl exit
        ;vpxor ymm1, ymm2, yword[esi]
        ;vmovdqu yword[esi], ymm1
        ;sub ecx, 256
        ;add esi, 256
        ;jmp next
        ;exit:
        ret 8
macomics 05 May 2022, 20:33
Andy wrote:
VEX.256.66.0F38.W0 78 /r VPBROADCASTB ymm1, xmm2/m8

Intel SDM 2c wrote:
VEX256-encoded VPBROADCASTB/W/D/Q: The source operand is 8-bit, 16-bit, 32-bit, 64-bit memory location or

Code:
        push 127
        vpbroadcastb ymm2, byte [esp]
Andy 05 May 2022, 21:00
Thank you, it works. It's so frustrating being a beginner and missing obvious details.
PS: as I said, little details; in the code above I subtract 256 from ecx and add 256 to esi when it should be 32 bytes.
macomics 05 May 2022, 21:51
Andy wrote:
in the code above I substract from ecx 256 and add to esi 256 when it should be 32 Bytes

Code:
        test esi, 31
        jz ok
    align_loop:
        xor byte [esi], 127
        lods byte [esi]
        test esi, 31
        loopnz align_loop
    ok:
        mov dl, cl
        shr ecx, 5
        jz en
        vpbroadcastb ymm2, byte [align_loop + 2] ; dont ask, lazy
    lp: ; esi aligned on 256-bits or 32-bytes
        ...
        loop lp
    en:
        and dl, 31
        jz ex
        mov cl, dl ; ecx = 0 after loop e.g ecx = dl
    tail_lp:
        xor byte [esi], 127
        lods byte [esi]
        loop tail_lp
    ex:

Last edited by macomics on 05 May 2022, 22:13; edited 5 times in total
tthsqe 05 May 2022, 22:08
Illegal instruction exception? Why does fasm allow you to assemble an illegal instruction?
tthsqe 05 May 2022, 22:15
Also, for what it is worth, I saw a 15% penalty in my application for unaligned 32 byte reads/writes (from/to actually unaligned addresses that are only multiples of
macomics 05 May 2022, 22:16
tthsqe wrote:
Illegal instruction exception? Why does fasm allow you to assemble an illegal instruction?
tthsqe 05 May 2022, 22:18
Not I. Also
Andy 06 May 2022, 00:13
macomics wrote:
Probably not the best way but this is how I did it:

Code:
        mov esi, [esp + 4]
        mov ecx, [esp + 8]
        push 127
        vpbroadcastb ymm2, byte [esp]
        pop ebx
    next:
        cmp ecx, 32
        jl single
        vpxor ymm1, ymm2, yword[esi]
        vmovdqu yword[esi], ymm1
        sub ecx, 32
        add esi, 32
        jmp next
    single:
        cmp ecx, 0
        je exit
        lodsb
        xor al, bl
        mov [esi-1], al
        dec ecx
        jnz single
    exit:
        ret 8

Can you please explain the purpose of these sections?

Code:
    ok:
        mov dl, cl
        shr ecx, 5
        jz en
        ...
    en:
        and dl, 31
        jz ex
        mov cl, dl ; ecx = 0 after loop e.g ecx = dl
Tomasz Grysztar 06 May 2022, 09:13
tthsqe wrote:
Illegal instruction exception? Why does fasm allow you to assemble an illegal instruction?

You can peel the layers with the help of fasmg, where you can include instruction sets selectively:

Code:
    include 'cpu/p6.inc'
    include 'cpu/ext/avx2.inc'

    vpbroadcastb ymm2, xmm0
    vpbroadcastb ymm2, al

Code:
    flat assembler version g.jmhx
    test.asm [5]:
        vpbroadcastb ymm2, al
    macro vpbroadcastb? [9]
    Custom error: invalid combination of operands.

Code:
    include 'cpu/p6.inc'
    include 'cpu/ext/avx512.inc'

    vpbroadcastb ymm2, xmm0
    vpbroadcastb ymm2, al

Code:
    C4 E2 7D 78 D0       vpbroadcastb ymm2, xmm0
    62 F2 7D 28 7A D0    vpbroadcastb ymm2, al
tthsqe 06 May 2022, 09:52
AVX-512 is almost dead, and it is a shame to accidentally use one of these instructions.
fasmg looks nice by the way, especially if CALM is working the way it seems to. Will have to try soon.
macomics 06 May 2022, 09:56
Andy wrote:
Can you please explain the purpose of these sections?

Code:
    ok:
        ; S - Size bits
        ; s - Size bits, lower byte
        ; r - Remainder bits, lower byte
        ;ecx = SSSSSSSSSSSSSSSSSSSSSSSSsssrrrrr
        mov dl, cl ; save lower 8 bits, remainder
        ;dl = sssrrrrr
        shr ecx, 5 ; ecx = ecx / 32
        ;ecx/32 = 00000SSSSSSSSSSSSSSSSSSSSSSSSsss
        jz en ; if (ecx / 32 = 0) skip fast loop
    lp:
        ...
        add esi, 32 ; you can't do that with a pointer
        loop lp ; ecx = ecx - 1 # after division, you can subtract 1.
        ; ecx = 0 after loop
    en:
        ;dl = sssrrrrr
        and dl, 00011111b ; 31; remainder, lower 5 bits
        ;dl = 000rrrrr
        jz ex ; no tail
        mov cl, dl ; ecx = 0 after loop e.g ecx = dl
        ;ecx = 000000000000000000000000000rrrrr # tail loop counter
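The bit diagrams above boil down to one shift and one mask. As a small C sketch of the same arithmetic (the name `split_count` and the `BLOCK` constant are hypothetical, chosen only for this illustration):

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 32u /* bytes handled per pass: one 256-bit register */

/* Split a byte count into full 32-byte blocks and leftover tail bytes. */
void split_count(size_t count, size_t *blocks, size_t *tail)
{
    assert(blocks != NULL && tail != NULL);
    *blocks = count >> 5;          /* like shr ecx, 5 : count / 32 */
    *tail   = count & (BLOCK - 1); /* like and dl, 31 : count % 32 */
}
```

Because the remainder lives entirely in the low 5 bits of the count, saving just the low byte (cl into dl) before the shift is enough to recover it afterwards; `blocks * 32 + tail` always equals the original count.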
Andy 06 May 2022, 12:05
macomics wrote:
Why? Isn't this the way other instructions work? For example lodsb:

Quote:
Loads a byte, word, or doubleword from the source operand into the AL, AX, or EAX register, respectively. After the byte, word, or doubleword is transferred from the memory location into the AL, AX, or EAX register, the (E)SI register is incremented or decremented automatically according to the setting of the DF flag in the EFLAGS register.

And would it be better to keep the offset in a different register and use it like this? So instead of

Code:
        vpxor ymm1, ymm2, yword[esi]
        vmovdqu yword[esi], ymm1
        sub ecx, 32
        add esi, 32

to be something like

Code:
        vpxor ymm1, ymm2, yword[esi + edx]
        vmovdqu yword[esi + edx], ymm1
        sub ecx, 32
        add edx, 32
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.