flat assembler
Message board for the users of flat assembler.

How can x87 use cause massive avx slowdown [solved - my bad]

tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 16 Sep 2023, 16:43
Here is a speed issue revolution can get behind. I could post more details about the problem, but, absent those details, I would like to hear any thoughts/conjectures on the general problem.

The code is being generated by gcc. If I run some heavy x87 long double calculations first and then some avx calculations, the latter run 10x slower than they run on their own. Please do not ask where my motherboard was manufactured or what my ram speed is. I am looking for explanations of a different character / possible source of compiler bugs. I tried putting an fldpi (and no pop anywhere) before calling the avx, and this on its own runs at the normal speed (even though there is a stack overflow after the eighth invocation).

What could a compiler be doing with x87 math that poisons later avx calculations?


Last edited by tthsqe on 17 Sep 2023, 22:42; edited 1 time in total
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20306
Location: In your JS exploiting you and your system
revolution 16 Sep 2023, 17:23
I wouldn't be so quick to blame the compiler. It might be the CPU.

CPUs can/might (do?) share the execution units to do both x87 and AVX computations. Regularly switching between the two could cause some kind of flush/reload thing that gives a noticeable delay if done rapidly enough.

Also note that the long double is "special" because it needs more resources than the "normal" doubles that AVX supports. Perhaps the float circuitry needs to reconfigure itself to switch between lengths?

Do you remember that MMX has an instruction (EMMS) to signal the "end" of the MMX computations? The CPU will switch automatically if you forget to use it, but it would be "slower" on the older CPUs. So it wouldn't surprise me if x87/AVX is getting a similar treatment today.
Furs



Joined: 04 Mar 2016
Posts: 2493
Furs 16 Sep 2023, 17:48
What's your CPU? Micro-architecture name should be enough, not specific model, so I can look it up.
tthsqe 16 Sep 2023, 18:08
rocketlake. The switch from x87 to avx only happens once.
revolution 16 Sep 2023, 19:28
I would think it more likely it is a cache and/or alignment problem, than anything else. I base this mainly on the experience I have with performance problems, where it is nearly always cache or alignment. If you don't have those buttoned down solid then doing other analysis before that is kind of pointless IMO.

Have you tried using non-temporal stores for the x87 data the code produces? I have found those to be awesome to help relieve the workload of the cache. Depends upon what you are doing of course.

Without seeing your code or knowing the full system arch, it is really hard to do more than guess.
tthsqe 17 Sep 2023, 22:55
Oops, it turned out to be my fault: when the long double calculations ran, they wrote long doubles into temp memory that the avx code used as well. Some of the avx lanes were unused, and I was just copying the junk memory right into the avx regs. I guess gcc uses 16 bytes for the long double, and so the upper 16 bits (sign and exponent) followed by zero padding make a big fat denormal double, which apparently slows these things to a crawl.

I have noticed that infinities and nans are lightning fast at least in scalar double calculations, but these denormals are new news to me.
Make sure your 'unused' avx lanes aren't doing something stupid.
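
The layout described above can be checked with a small sketch. It is written in Python purely to show the bit patterns (the code in this thread is gcc output and fasm), and the helper names `x87_slot`, `as_double_lanes` and `is_denormal` are made up for illustration. It assumes the x86-64 convention of storing an 80-bit x87 value in a 16-byte slot, and it assumes the 6 padding bytes are zero, as drawn in the diagram in this thread.

```python
import struct
import sys

def x87_slot(sign, exponent, mantissa):
    """Pack one 80-bit x87 value into a 16-byte slot (little-endian):
    64-bit mantissa, 16-bit sign+exponent, then 6 zero padding bytes."""
    return struct.pack("<QH6x", mantissa, (sign << 15) | exponent)

def as_double_lanes(slot):
    """Reinterpret the same 16 bytes as two IEEE-754 doubles (one XMM register)."""
    return struct.unpack("<2d", slot)

def is_denormal(x):
    """True for a nonzero double below the smallest normal magnitude."""
    return x != 0.0 and abs(x) < sys.float_info.min

# 1.0 as an x87 extended value: explicit integer bit set, biased exponent 0x3FFF.
low, high = as_double_lanes(x87_slot(0, 0x3FFF, 1 << 63))
print(low, high, is_denormal(high))
```

Under these assumptions the high lane holds only the 16 sign/exponent bits in its lowest bits, so its double exponent field is zero and it is a denormal for any long double whose sign/exponent word is nonzero.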

Interestingly enough, you can absolutely trash the x87 state and your avx code will be fine (this doesn't say anything about transitions though).
revolution 17 Sep 2023, 23:20
I've never had a use case for denormals.

There exists a Flush-To-Zero flag you can set. It uses zero in place of a denormal. Perhaps it was included because of this very problem?
tthsqe 17 Sep 2023, 23:35
Ha, it wasn't a use case on my part. It was a case of accidentally putting junk into avx lanes I wasn't considering as part of the calculation.
Code:
|-- double --|-- double --|-- double --|-- double --|-- double --|-- double --| ...
|-- long double --|   0   |-- long double --|   0   |-- long double --|
    


I think flush to zero is the default, but that doesn't mean that the floating point units enjoy being fed denormal inputs.
revolution 17 Sep 2023, 23:49
That alignment is the recommended method for long doubles. So the compiler was "doing-it-right"™

And it shows the importance of actually testing and not assuming.
Furs 18 Sep 2023, 16:55
tthsqe wrote:
Ha, it wasn't a use case on my part. It was a case of accidentally putting junk into avx lanes I wasn't considering as part of the calculation.
Code:
|-- double --|-- double --|-- double --|-- double --|-- double --|-- double --| ...
|-- long double --|   0   |-- long double --|   0   |-- long double --|
    


I think flush to zero is the default, but that doesn't mean that the floating point units enjoy being fed denormal inputs.
Your test was flawed to begin with, even without alignment. Why would you read long double data into AVX in the first place? It doesn't match double's format, so it's senseless. It's not a realistic test at all.
tthsqe 18 Sep 2023, 23:58
Yes, this was a definite oops on my part. But, I did learn that, at least for rocketlake:
1. avx speed seems to be independent of whatever (possibly trashed) x87 state you are in.
2. putting junk into ignored avx lanes can cause a big slowdown when double precision operations hit denormal input.

I am aware that there are write masks now so that I can truly ignore certain lanes, but I wasn't doing that.
revolution 19 Sep 2023, 04:20
I wrote a useless fake timing test to see what effect FTZ and DAZ flags might have. It appears that the DAZ flag is the better choice (at least on my system and in this particular useless "application").
Code:
format elf64 executable

MEM_SIZE        = 1 shl 22
STEPS           = 1 shl 9

MXCSR_FZ        = 0x8000        ; Flush to zero
MXCSR_RN        = 0x0000        ; To nearest rounding mode
MXCSR_RM        = 0x2000        ; Toward negative infinity rounding mode
MXCSR_RP        = 0x4000        ; Toward positive infinity rounding mode
MXCSR_RZ        = 0x6000        ; Toward zero rounding mode
MXCSR_PM        = 0x1000        ; Precision mask
MXCSR_UM        = 0x0800        ; Underflow mask
MXCSR_OM        = 0x0400        ; Overflow mask
MXCSR_ZM        = 0x0200        ; Divide-by-zero mask
MXCSR_DM        = 0x0100        ; Denormal mask
MXCSR_IM        = 0x0080        ; Invalid operation mask
MXCSR_DZ        = 0x0040        ; Denormals are zeros
MXCSR_PE        = 0x0020        ; Precision flag
MXCSR_UE        = 0x0010        ; Underflow flag
MXCSR_OE        = 0x0008        ; Overflow flag
MXCSR_ZE        = 0x0004        ; Divide-by-zero flag
MXCSR_DE        = 0x0002        ; Denormal flag
MXCSR_IE        = 0x0001        ; Invalid operation flag

STDOUT_FILENO   = 1
SYS64_write     = 1
SYS64_brk       = 12
SYS64_exit_group= 231

segment writeable readable

mem_base        rq 1
hex_table:      db '0123456789abcdef'
mxcsr           rd 1

segment executable readable

entry $
        xor     edi,edi
        mov     eax,SYS64_brk
        syscall
        mov     [mem_base],rax
        lea     rdi,[rax + MEM_SIZE]
        mov     eax,SYS64_brk
        syscall
        cmp     rax,rdi
        mov     edi,1
        jnz     .quit
        stmxcsr [mxcsr]

; none
        mov     dl,'-'
        call    print_char
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not (MXCSR_DZ or MXCSR_FZ)
        ldmxcsr [mxcsr]
        call    fake_time

; DAZ
        mov     dl,'D'
        call    print_char
        mov     dl,'-'
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not MXCSR_FZ
        or      [mxcsr],MXCSR_DZ
        ldmxcsr [mxcsr]
        call    fake_time

; FTZ
        mov     dl,'-'
        call    print_char
        mov     dl,'F'
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not MXCSR_DZ
        or      [mxcsr],MXCSR_FZ
        ldmxcsr [mxcsr]
        call    fake_time

; FTZ + DAZ
        mov     dl,'D'
        call    print_char
        mov     dl,'F'
        call    print_char
        mov     dl,10
        call    print_char
        or      [mxcsr],MXCSR_FZ or MXCSR_DZ
        ldmxcsr [mxcsr]
        call    fake_time

        xor     edi,edi
    .quit:
        mov     eax,SYS64_exit_group
        syscall

macro time reg {
        rdtsc
        mov     e#reg,eax
        shl     rdx,32
        or      r#reg,rdx
}

fake_time:
        push    rdi rdx rcx rbx rax
        xor     rax,rax
        call    fill_mem
        time    bx
        mov     rax,STEPS
        call    fake_application
        time    cx
        sub     rcx,rbx
        mov     rax,1
        call    fill_mem
        time    bx
        mov     rax,STEPS
        call    fake_application
        time    di
        sub     rdi,rbx
        mov     rax,rcx
        call    print_hex16
        mov     dl,10
        call    print_char
        mov     rax,rdi
        call    print_hex16
        mov     dl,10
        call    print_char
        pop     rax rbx rcx rdx rdi
        ret

fill_mem:
        ;rax = value
        push    rdi rcx
        mov     rdi,[mem_base]
        mov     ecx,MEM_SIZE shr 3
        rep     stosq
        pop     rcx rdi
        ret

fake_application:
        ;rax = loops
        push    rdi rcx rax
    .outer_loop:
        mov     rdi,[mem_base]
        mov     ecx,MEM_SIZE shr 8
    .inner_loop:
        rept 16 r:0 {
                movapd  xmm#r,[rdi + 16 * r]
        }
        rept 16 r:0 {
                mulpd   xmm#r,xmm#r
        }
        add     rdi,1 shl 8
        dec     rcx
        jnz     .inner_loop
        dec     rax
        jnz     .outer_loop
        pop     rax rcx rdi
        ret

print_hex16:
        ;rax = value
        push    rbx rax rcx rdx
        mov     ecx,16
        lea     rbx,[hex_table]
    .next_nibble:
        rol     rax,4
        mov     edx,eax
        and     edx,0xf
        mov     dl,[rbx + rdx]
        call    print_char
        dec     ecx
        jnz     .next_nibble
        pop     rdx rcx rax rbx
        ret

print_char:
        ;dl = character
        push    rsi rdi rax rcx r11 rdx
        mov     eax,SYS64_write
        mov     edi,STDOUT_FILENO
        mov     rsi,rsp
        mov     edx,1
        syscall
        pop     rdx r11 rcx rax rdi rsi
        ret    
The output
Code:
--
000000001cdc3215
00000001d07753de ; <--- 16x slower!
D-
000000001ccd3fdd
000000001d7717eb
-F
000000001da1ea45
000000006b1e2ec6 ; <--- ~3.6x slower!
DF
000000001de9e62c
000000001c89f473    
The first time value is for zero, and the second is for the smallest possible denormal (0x000...001). The two flags are D=DAZ and F=FTZ. A dash means the flag is not set.

Setting the DAZ flag appears to make denormal timing the same as a normal value of zero. Your results may vary depending upon the CPU/OS and about a million other factors. By default Linux sets the mask bits so this test doesn't trigger any exceptions, it is all handled by the CPU internally.
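
The slowdown factors can be recomputed from the hex cycle counts in the output above. The following is only a small arithmetic sketch in Python; the function name `slowdown` is made up for illustration, and the numbers are copied verbatim from the output.

```python
def slowdown(baseline_hex, measured_hex):
    """Ratio of two rdtsc deltas given as hex strings."""
    return int(measured_hex, 16) / int(baseline_hex, 16)

# (flags, time for zero input, time for denormal input), copied from the output above
runs = [
    ("--", "000000001cdc3215", "00000001d07753de"),
    ("D-", "000000001ccd3fdd", "000000001d7717eb"),
    ("-F", "000000001da1ea45", "000000006b1e2ec6"),
    ("DF", "000000001de9e62c", "000000001c89f473"),
]

for flags, zero_time, denormal_time in runs:
    print(flags, round(slowdown(zero_time, denormal_time), 2))
```

With no flags set the ratio comes out at roughly 16, with FTZ only at roughly 3.6, and with DAZ set (alone or with FTZ) it stays close to 1.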

FTZ appears to only be done during calculations. DAZ appears to be for both loading a value and for calculations.
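
As far as the documented behaviour of the two MXCSR bits goes, DAZ acts on the inputs of an operation and FTZ acts on its result. A value-only sketch in Python of that difference follows; `mul_model` and `flush` are hypothetical helpers, the model is simplified (it ignores exception flags and rounding corner cases), and it says nothing about timing.

```python
import math
import sys

def is_denormal(x):
    """True for a nonzero double below the smallest normal magnitude."""
    return x != 0.0 and abs(x) < sys.float_info.min

def flush(x):
    """Replace a denormal with a zero of the same sign."""
    return math.copysign(0.0, x) if is_denormal(x) else x

def mul_model(a, b, daz=False, ftz=False):
    """Simplified value-only model of one multiply lane under DAZ/FTZ."""
    if daz:
        # DAZ: denormal inputs are treated as zero before the operation
        a, b = flush(a), flush(b)
    result = a * b
    if ftz:
        # FTZ: a denormal result is replaced by zero
        result = flush(result)
    return result

tiny = sys.float_info.min                    # smallest normal double
print(mul_model(tiny, 0.5))                  # denormal result is kept
print(mul_model(tiny, 0.5, ftz=True))        # denormal result flushed to zero
print(mul_model(tiny / 2, 2.0, ftz=True))    # denormal input still used as-is
print(mul_model(tiny / 2, 2.0, daz=True))    # denormal input treated as zero
```

In this model FTZ alone never changes what the multiplier is fed, which is at least consistent with the test above still showing a slowdown for denormal inputs when only FTZ is set.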
revolution 24 Sep 2023, 20:53
I tested with normal, denormal, infinity, qnan and snan. For this test on my system only the denormal shows any timing variation.
Code:
format elf64 executable

NORM    = 0
DENORM  = 1
INF     = (1 shl 11 - 1) shl 52
QNAN    = INF + 1 + 1 shl 51
SNAN    = INF + 1

MEM_SIZE        = 1 shl 22
STEPS           = 1 shl 7

MXCSR_FZ        = 0x8000        ; Flush to zero
MXCSR_RN        = 0x0000        ; To nearest rounding mode
MXCSR_RM        = 0x2000        ; Toward negative infinity rounding mode
MXCSR_RP        = 0x4000        ; Toward positive infinity rounding mode
MXCSR_RZ        = 0x6000        ; Toward zero rounding mode
MXCSR_PM        = 0x1000        ; Precision mask
MXCSR_UM        = 0x0800        ; Underflow mask
MXCSR_OM        = 0x0400        ; Overflow mask
MXCSR_ZM        = 0x0200        ; Divide-by-zero mask
MXCSR_DM        = 0x0100        ; Denormal mask
MXCSR_IM        = 0x0080        ; Invalid operation mask
MXCSR_DZ        = 0x0040        ; Denormals are zeros
MXCSR_PE        = 0x0020        ; Precision flag
MXCSR_UE        = 0x0010        ; Underflow flag
MXCSR_OE        = 0x0008        ; Overflow flag
MXCSR_ZE        = 0x0004        ; Divide-by-zero flag
MXCSR_DE        = 0x0002        ; Denormal flag
MXCSR_IE        = 0x0001        ; Invalid operation flag

STDOUT_FILENO   = 1
SYS64_write     = 1
SYS64_brk       = 12
SYS64_exit_group= 231

segment writeable readable

mem_base        rq 1
hex_table:      db '0123456789abcdef'
mxcsr           rd 1

segment executable readable

entry $
        xor     edi,edi
        mov     eax,SYS64_brk
        syscall
        mov     [mem_base],rax
        lea     rdi,[rax + MEM_SIZE]
        mov     eax,SYS64_brk
        syscall
        cmp     rax,rdi
        mov     edi,1
        jnz     .quit
        stmxcsr [mxcsr]

; none
        mov     dl,'-'
        call    print_char
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not (MXCSR_DZ or MXCSR_FZ)
        ldmxcsr [mxcsr]
        call    fake_time

; DAZ
        mov     dl,'D'
        call    print_char
        mov     dl,'-'
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not MXCSR_FZ
        or      [mxcsr],MXCSR_DZ
        ldmxcsr [mxcsr]
        call    fake_time

; FTZ
        mov     dl,'-'
        call    print_char
        mov     dl,'F'
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not MXCSR_DZ
        or      [mxcsr],MXCSR_FZ
        ldmxcsr [mxcsr]
        call    fake_time

; FTZ + DAZ
        mov     dl,'D'
        call    print_char
        mov     dl,'F'
        call    print_char
        mov     dl,10
        call    print_char
        or      [mxcsr],MXCSR_FZ or MXCSR_DZ
        ldmxcsr [mxcsr]
        call    fake_time

        xor     edi,edi
    .quit:
        mov     eax,SYS64_exit_group
        syscall

macro time reg {
        rdtsc
        mov     e#reg,eax
        shl     rdx,32
        or      r#reg,rdx
}

fake_time:
        push    rdx rbx rax
        irp     tst, <"N", NORM>, <"D", DENORM>, <"I", INF>, <"Q", QNAN>, <"S", SNAN> {
                match   text =, number, tst \{
                        mov     dl,text
                        call    print_char
                        mov     dl," "
                        call    print_char
                        mov     rax,number
                        call    fill_mem
                        time    bx
                        mov     rax,STEPS
                        call    fake_application
                        time    ax
                        sub     rax,rbx
                        call    print_hex16
                        mov     dl,10
                        call    print_char
                \}
        }
        pop     rax rbx rdx
        ret

fill_mem:
        ;rax = value
        push    rdi rcx
        mov     rdi,[mem_base]
        mov     ecx,MEM_SIZE shr 3
        rep     stosq
        pop     rcx rdi
        ret

fake_application:
        ;rax = loops
        push    rdi rcx rax
    .outer_loop:
        mov     rdi,[mem_base]
        mov     ecx,MEM_SIZE shr 8
    .inner_loop:
        rept 16 r:0 {
                movapd  xmm#r,[rdi + 16 * r]
        }
        rept 16 r:0 {
                mulpd   xmm#r,xmm#r
        }
        add     rdi,1 shl 8
        dec     rcx
        jnz     .inner_loop
        dec     rax
        jnz     .outer_loop
        pop     rax rcx rdi
        ret

print_hex16:
        ;rax = value
        push    rbx rax rcx rdx
        mov     ecx,16
        lea     rbx,[hex_table]
    .next_nibble:
        rol     rax,4
        mov     edx,eax
        and     edx,0xf
        mov     dl,[rbx + rdx]
        call    print_char
        dec     ecx
        jnz     .next_nibble
        pop     rdx rcx rax rbx
        ret

print_char:
        ;dl = character
        push    rsi rdi rax rcx r11 rdx
        mov     eax,SYS64_write
        mov     edi,STDOUT_FILENO
        mov     rsi,rsp
        mov     edx,1
        syscall
        pop     rdx r11 rcx rax rdi rsi
        ret    
Sample output
Code:
--                      ; no flags
N 000000001d6a0b04      ; normal
D 00000001c8ac2677      ; denormal
I 000000001b782a1e      ; infinity
Q 000000001d6c3e60      ; qnan
S 000000001cb3a943      ; snan
D-                      ; DAZ
N 000000001cf4597c
D 000000001d5ef212
I 000000001c886b2a
Q 000000001c46ce59
S 000000001c0e2237
-F                      ; FTZ
N 000000001dd31803
D 000000006617b934
I 000000001bda8deb
Q 000000001cb8e0fe
S 000000001ccbdedc
DF                      ; DAZ + FTZ
N 000000001bdf4f7f
D 000000001b9ffbf8
I 000000001c685b4b
Q 000000001bff47bb
S 000000001be52b7a    
My system is clearly not optimised to handle denormal values.
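
For readers who want to check the five test patterns, here is a sketch in Python that rebuilds the constants from the listing and decodes them by their IEEE-754 fields. The `classify` and `bits_to_double` helpers are made up for illustration. Note that fasm evaluates `shl` before `+` and `-`, while Python's `<<` binds more loosely, so the Python expressions need explicit parentheses.

```python
import struct

INF = ((1 << 11) - 1) << 52          # fasm: (1 shl 11 - 1) shl 52
PATTERNS = {
    "N": 0,                          # NORM: the listing uses +0.0 as the baseline
    "D": 1,                          # DENORM: smallest positive denormal
    "I": INF,                        # INF
    "Q": INF + 1 + (1 << 51),        # QNAN: top mantissa bit set
    "S": INF + 1,                    # SNAN: top mantissa bit clear, mantissa nonzero
}

def bits_to_double(bits):
    """Reinterpret a 64-bit integer as an IEEE-754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def classify(bits):
    """Name the class of a 64-bit double pattern from its exponent and mantissa."""
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    if exponent == 0x7FF:
        if mantissa == 0:
            return "infinity"
        return "qnan" if mantissa >> 51 else "snan"
    if exponent == 0:
        return "zero" if mantissa == 0 else "denormal"
    return "normal"

for tag, bits in PATTERNS.items():
    print(tag, f"{bits:016x}", classify(bits))
```

The "N" pattern is all zero bits, so strictly speaking the baseline measured in the listing is zero rather than a normal number.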
revolution 25 Sep 2023, 08:29
Curiously, 2 doubles per register took longer to process than 4 singles, even though that is half as many values. Below are the times taken to load and operate on 4 single floats vs 2 double floats.
Code:
; Neither DAZ nor FTZ flags
0000000199e3998c        ; 4 * single denormals
00000001cefecf08        ; 2 * double denormals

; Use FTZ flag only
000000005c18a244        ; 4 * single denormals
0000000068d47d67        ; 2 * double denormals    
For reference, the code
Code:
;       for SIZE in s d ; do fasm -d OP_SIZE=$SIZE mxcsr_sse_timing.asm && ./mxcsr_sse_timing ; done

format elf64 executable

if OP_SIZE eq d         ; double
        MANTISSA_WIDTH = 52
        EXPONENT_WIDTH = 11
        MULTIPLIER = 1
else if OP_SIZE eq s    ; single
        MANTISSA_WIDTH = 23
        EXPONENT_WIDTH = 8
        MULTIPLIER = 1 shl 32 + 1
else
        err
end if

NORM    = 0
DENORM  = 1
INF     = (1 shl EXPONENT_WIDTH - 1) shl MANTISSA_WIDTH
QNAN    = INF + 1 + 1 shl (MANTISSA_WIDTH - 1)
SNAN    = INF + 1

MEM_SIZE        = 1 shl 22
STEPS           = 1 shl 7

MXCSR_FZ        = 0x8000        ; Flush to zero
MXCSR_RN        = 0x0000        ; To nearest rounding mode
MXCSR_RM        = 0x2000        ; Toward negative infinity rounding mode
MXCSR_RP        = 0x4000        ; Toward positive infinity rounding mode
MXCSR_RZ        = 0x6000        ; Toward zero rounding mode
MXCSR_PM        = 0x1000        ; Precision mask
MXCSR_UM        = 0x0800        ; Underflow mask
MXCSR_OM        = 0x0400        ; Overflow mask
MXCSR_ZM        = 0x0200        ; Divide-by-zero mask
MXCSR_DM        = 0x0100        ; Denormal mask
MXCSR_IM        = 0x0080        ; Invalid operation mask
MXCSR_DZ        = 0x0040        ; Denormals are zeros
MXCSR_PE        = 0x0020        ; Precision flag
MXCSR_UE        = 0x0010        ; Underflow flag
MXCSR_OE        = 0x0008        ; Overflow flag
MXCSR_ZE        = 0x0004        ; Divide-by-zero flag
MXCSR_DE        = 0x0002        ; Denormal flag
MXCSR_IE        = 0x0001        ; Invalid operation flag

STDOUT_FILENO   = 1
SYS64_write     = 1
SYS64_brk       = 12
SYS64_exit_group= 231

segment writeable readable

mem_base        rq 1
hex_table:      db '0123456789abcdef'
mxcsr           rd 1

segment executable readable

entry $
        xor     edi,edi
        mov     eax,SYS64_brk
        syscall
        mov     [mem_base],rax
        lea     rdi,[rax + MEM_SIZE]
        mov     eax,SYS64_brk
        syscall
        cmp     rax,rdi
        mov     edi,1
        jnz     .quit
        stmxcsr [mxcsr]

; none
        mov     dl,'-'
        call    print_char
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not (MXCSR_DZ or MXCSR_FZ)
        ldmxcsr [mxcsr]
        call    fake_time

; DAZ
        mov     dl,'D'
        call    print_char
        mov     dl,'-'
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not MXCSR_FZ
        or      [mxcsr],MXCSR_DZ
        ldmxcsr [mxcsr]
        call    fake_time

; FTZ
        mov     dl,'-'
        call    print_char
        mov     dl,'F'
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not MXCSR_DZ
        or      [mxcsr],MXCSR_FZ
        ldmxcsr [mxcsr]
        call    fake_time

; FTZ + DAZ
        mov     dl,'D'
        call    print_char
        mov     dl,'F'
        call    print_char
        mov     dl,10
        call    print_char
        or      [mxcsr],MXCSR_FZ or MXCSR_DZ
        ldmxcsr [mxcsr]
        call    fake_time

        xor     edi,edi
    .quit:
        mov     eax,SYS64_exit_group
        syscall

macro time reg {
        rdtsc
        mov     e#reg,eax
        shl     rdx,32
        or      r#reg,rdx
}

fake_time:
        push    rdx rbx rax
        irp     tst, <"N", NORM>, <"D", DENORM>, <"I", INF>, <"Q", QNAN>, <"S", SNAN> {
                match   text =, number, tst \{
                        mov     dl,text
                        call    print_char
                        mov     dl," "
                        call    print_char
                        mov     rax,number * MULTIPLIER
                        call    fill_mem
                        time    bx
                        mov     rax,STEPS
                        call    fake_application
                        time    ax
                        sub     rax,rbx
                        call    print_hex16
                        mov     dl,10
                        call    print_char
                \}
        }
        pop     rax rbx rdx
        ret

fill_mem:
        ;rax = value
        push    rdi rcx
        mov     rdi,[mem_base]
        mov     ecx,MEM_SIZE shr 3
        rep     stosq
        pop     rcx rdi
        ret

fake_application:
        ;rax = loops
        push    rdi rcx rax
    .outer_loop:
        mov     rdi,[mem_base]
        mov     ecx,MEM_SIZE shr 8
    .inner_loop:
        match opsize, OP_SIZE {
                rept 16 r:0 \{
                        movap#opsize xmm\#r,[rdi + 16 * r]
                \}
                rept 16 r:0 \{
                        mulp#opsize xmm\#r,xmm\#r
                \}
        }
        add     rdi,1 shl 8
        dec     rcx
        jnz     .inner_loop
        dec     rax
        jnz     .outer_loop
        pop     rax rcx rdi
        ret

print_hex16:
        ;rax = value
        push    rbx rax rcx rdx
        mov     ecx,16
        lea     rbx,[hex_table]
    .next_nibble:
        rol     rax,4
        mov     edx,eax
        and     edx,0xf
        mov     dl,[rbx + rdx]
        call    print_char
        dec     ecx
        jnz     .next_nibble
        pop     rdx rcx rax rbx
        ret

print_char:
        ;dl = character
        push    rsi rdi rax rcx r11 rdx
        mov     eax,SYS64_write
        mov     edi,STDOUT_FILENO
        mov     rsi,rsp
        mov     edx,1
        syscall
        pop     rdx r11 rcx rax rdi rsi
        ret    
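
A note on `MULTIPLIER` in the listing above: `fill_mem` stores 8 bytes at a time with `rep stosq`, so for the single-precision run each qword has to hold two copies of the 32-bit pattern, and multiplying by `1 shl 32 + 1` does exactly that. A sketch in Python of the same arithmetic follows; the helper names are made up for illustration.

```python
import struct

def replicate_single(pattern32):
    """Same arithmetic as number * MULTIPLIER with MULTIPLIER = 1 shl 32 + 1:
    copy a 32-bit pattern into both halves of a 64-bit word."""
    return pattern32 * ((1 << 32) + 1)

def as_singles(word64):
    """Reinterpret a 64-bit word as two IEEE-754 single-precision floats."""
    return struct.unpack("<2f", struct.pack("<Q", word64))

word = replicate_single(1)           # DENORM pattern for OP_SIZE=s
print(f"{word:016x}", as_singles(word))
```

This works without carries because every pattern used in the listing fits in 32 bits, so the two copies never overlap.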
macgub



Joined: 11 Jan 2006
Posts: 346
Location: Poland
macgub 28 Oct 2023, 11:51
Hi!
tthsqe wrote:

Code:
|-- double --|-- double --|-- double --|-- double --|-- double --|-- double --| ...
|-- long double --|   0   |-- long double --|   0   |-- long double --|
    

.

I guess the topic is about 80-bit floats? Are packed vector operations on this data type part of the standard AVX2/AVX-512 instruction sets? What are the mnemonics of such instructions (just a few examples)?
As far as I can see in the Intel manuals, there is no information about 80-bit (or 128-bit?) floats in AVX registers. Did I miss something?
Thanks for the answer!
revolution 28 Oct 2023, 11:55
SSE and AVX don't support 80-bit floats, or anything longer than 64 bits, including integers. The maximum size is 64 bits per element.
macgub 28 Oct 2023, 18:13
@revolution - thanks for the reply!
OK - so AVX-512?
AFAIK the FPU supports 80-bit floats as internal temporary variables. (I don't know whether the original specification says these values can be popped from the FPU stack; I need to reread it.) But tthsqe posted code above that shows something like a vector operation. Is this implemented on real hardware, or is it manual emulation? 'Rocket Lake', the name used in this topic, seems to me to be the 11th generation of the Intel x64 series. Or is it something like PowerPC?
Thanks for the answer.
macgub 28 Oct 2023, 18:19
tthsqe wrote:

x87 math.. .... avx

It seems I misinterpreted tthsqe's notation.
revolution 28 Oct 2023, 18:34
The FPU has supported 80-bit floats since day one. You can freely load and store them from/to memory. There are no vector operations in the FPU. Only SSE/AVX have vector ops.
macgub 29 Oct 2023, 09:23
Thanks for the reply. As I wrote, I did not correctly understand the meaning of the ASCII diagram below:
tthsqe wrote:

Code:
|-- double --|-- double --|-- double --|-- double --|-- double --|-- double --| ...
|-- long double --|   0   |-- long double --|   0   |-- long double --|
    

.
Copyright © 1999-2024, Tomasz Grysztar.