flat assembler - How can x87 use cause massive avx slowdown [solved - my bad]

tthsqe

Joined: 20 May 2009
Posts: 773

tthsqe 16 Sep 2023, 16:43

Here is a speed issue revolution can get behind. I could post more details about the problem, but, absent those details, I would like to hear any thoughts/conjectures on the general problem.

The code is being generated by gcc. If I run some heavy x87 long double calculations first and then some avx calculations, the latter run 10x slower than they run on their own. Please do not ask where my motherboard was manufactured or what my ram speed is. I am looking for explanations of a different character / possible source of compiler bugs. I tried putting an fldpi (and no pop anywhere) before calling the avx, and this on its own runs at the normal speed (even though there is a stack overflow after the eighth invocation).

What could a compiler be doing with x87 math that poisons later avx calculations?

Last edited by tthsqe on 17 Sep 2023, 22:42; edited 1 time in total

revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20998
Location: In your JS exploiting you and your system

revolution 16 Sep 2023, 17:23

I wouldn't be so quick to blame the compiler. It might be the CPU.

CPUs can/might (do?) share the execution units between x87 and AVX computations. Regularly switching between the two could cause some kind of flush/reload that gives a noticeable delay if done rapidly enough.

Also note that the long double is "special" because it needs more resources than the "normal" doubles that AVX supports. Perhaps the float circuitry needs to reconfigure itself to switch between lengths?

Do you remember that MMX has an instruction (EMMS) to signal the "end" of the MMX computations? The CPU will switch automatically if you forget to use it, but it would be "slower" on the older CPUs. So it wouldn't surprise me if x87/AVX is getting a similar treatment today.

Furs

Joined: 04 Mar 2016
Posts: 2738

Furs 16 Sep 2023, 17:48

What's your CPU? Micro-architecture name should be enough, not specific model, so I can look it up.

tthsqe 16 Sep 2023, 18:08

rocketlake. The switch from x87 to avx only happens once.

revolution 16 Sep 2023, 19:28

I would think it more likely it is a cache and/or alignment problem, than anything else. I base this mainly on the experience I have with performance problems, where it is nearly always cache or alignment. If you don't have those buttoned down solid then doing other analysis before that is kind of pointless IMO.

Have you tried using non-temporal stores for the x87 data the code produces? I have found those to be awesome to help relieve the workload of the cache. Depends upon what you are doing of course.

Without seeing your code or knowing the full system arch, all I can really do is guess.

tthsqe 17 Sep 2023, 22:55

Oops, it turned out to be my fault. When the long double calculations ran, they wrote long doubles into temp memory that the avx code used as well. Some of the avx lanes were unused, and I was just copying the junk memory right into the avx regs. I guess gcc uses 16 bytes for each long double, so the upper 16 bits of the 80-bit value sit at the bottom of the second 8-byte half. Read as a double, that half is a big fat denormal, which apparently slows these things to a crawl.

I have noticed that infinities and nans are lightning fast at least in scalar double calculations, but these denormals are new news to me.
Make sure your 'unused' avx lanes aren't doing something stupid.

Interestingly enough, you can absolutely trash the x87 state and your avx code will be fine (this doesn't say anything about transitions though).
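
For anyone who wants to see the bit pattern, here is a minimal Python sketch of the effect (not the original gcc code). It assumes the usual x86-64 layout for a long double, 10 value bytes followed by 6 padding bytes, and it assumes the padding happens to be zero. `long_double_slot` and `is_subnormal_double` are made-up helper names.

```python
import struct

def long_double_slot(mantissa64, sign_exp16, padding=b"\x00" * 6):
    """Build the 16-byte memory image of one x87 80-bit value.

    Assumed layout: 8 bytes of mantissa, 2 bytes of sign+exponent,
    then 6 padding bytes (taken to be zero in this sketch).
    """
    return struct.pack("<QH", mantissa64, sign_exp16) + padding

def is_subnormal_double(bits):
    """True if a 64-bit pattern is a denormal: exponent field 0, fraction nonzero."""
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return exponent == 0 and fraction != 0

# 1.0 as an 80-bit extended value: explicit integer bit set, biased exponent 0x3FFF.
slot = long_double_slot(0x8000000000000000, 0x3FFF)

# What two adjacent double lanes see when they load the same 16 bytes.
low_bits, high_bits = struct.unpack("<2Q", slot)
low_value, high_value = struct.unpack("<2d", slot)

print(f"low lane : {low_bits:#018x} -> {low_value!r}")
print(f"high lane: {high_bits:#018x} -> {high_value!r}")
print("high lane is denormal:", is_subnormal_double(high_bits))
```

Under those assumptions, any nonzero sign/exponent field ends up in the low 16 bits of the second lane while the double's own exponent field stays zero, so the lane is a denormal for every long double except +0.0.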

revolution 17 Sep 2023, 23:20

I've never had a use case for denormals.

There exists a Flush-To-Zero flag you can set. It uses zero in place of a denormal. Perhaps it was included because of this very problem?

tthsqe 17 Sep 2023, 23:35

Ha, it wasn't a use case on my part. It was a case of accidentally putting junk into avx lanes I wasn't considering as part of the calculation.

Code:

|-- double --|-- double --|-- double --|-- double --|-- double --|-- double --| ...
|-- long double --|   0   |-- long double --|   0   |-- long double --|

I think flush to zero is the default, but that doesn't mean that the floating point units enjoy being fed denormal inputs.

revolution 17 Sep 2023, 23:49

That alignment is the recommended method for long doubles. So the compiler was "doing-it-right"™

And it shows the importance of actually testing and not assuming.

Furs 18 Sep 2023, 16:55

tthsqe wrote:

Ha, it wasn't a use case on my part. It was a case of accidentally putting junk into avx lanes I wasn't considering as part of the calculation.
Code:
|-- double --|-- double --|-- double --|-- double --|-- double --|-- double --| ...
|-- long double --|   0   |-- long double --|   0   |-- long double --|
    
I think flush to zero is the default, but that doesn't mean that the floating point units enjoy being fed denormal inputs.

Your test was flawed to begin with, even without alignment. Why would you read long double data into AVX in the first place? It doesn't match double's format, so it's senseless. It's not a realistic test at all.

tthsqe 18 Sep 2023, 23:58

Yes, this was a definite oops on my part. But, I did learn that, at least for rocketlake:
1. avx speed seems to be independent of whatever (possibly trashed) x87 state you are in.
2. putting junk into ignored avx lanes can cause a big slowdown when double precision operations hit denormal input.

I am aware that there are write masks now so that I can truly ignore certain lanes, but I wasn't doing that.
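
One low-tech alternative to masking is to overwrite the ignored lanes with a harmless value before the vector code touches them. The following is only a Python sketch of that idea, not the actual fix used here; the 4-lane layout, the `active` mask and the helper names are hypothetical.

```python
import math
import sys

def is_subnormal(x):
    """True for a nonzero finite double smaller in magnitude than the smallest normal."""
    return x != 0.0 and math.isfinite(x) and abs(x) < sys.float_info.min

def scrub_unused_lanes(lanes, active, filler=0.0):
    """Return a copy of `lanes` where every inactive lane holds `filler`."""
    return [x if keep else filler for x, keep in zip(lanes, active)]

# Hypothetical 4-lane vector: lanes 0 and 2 carry real data, 1 and 3 are leftovers.
junk = 0x3FFF * 2.0 ** -1074      # the kind of value a stale long double can leave behind
lanes = [1.5, junk, -2.25, junk]
active = [True, False, True, False]

# Which lanes would feed a denormal into the multiply.
print([is_subnormal(x) for x in lanes])

clean = scrub_unused_lanes(lanes, active)
print(clean, any(is_subnormal(x) for x in clean))
```

The filler does not have to be zero; any normal value that cannot underflow or overflow in the following operations would do.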

revolution 19 Sep 2023, 04:20

I wrote a useless fake timing test to see what effect FTZ and DAZ flags might have. It appears that the DAZ flag is the better choice (at least on my system and in this particular useless "application").

Code:

format elf64 executable

MEM_SIZE        = 1 shl 22
STEPS           = 1 shl 9

MXCSR_FZ        = 0x8000        ; Flush to zero
MXCSR_RN        = 0x0000        ; To nearest rounding mode
MXCSR_RM        = 0x2000        ; Toward negative infinity rounding mode
MXCSR_RP        = 0x4000        ; Toward positive infinity rounding mode
MXCSR_RZ        = 0x6000        ; Toward zero rounding mode
MXCSR_PM        = 0x1000        ; Precision mask
MXCSR_UM        = 0x0800        ; Underflow mask
MXCSR_OM        = 0x0400        ; Overflow mask
MXCSR_ZM        = 0x0200        ; Divide-by-zero mask
MXCSR_DM        = 0x0100        ; Denormal mask
MXCSR_IM        = 0x0080        ; Invalid operation mask
MXCSR_DZ        = 0x0040        ; Denormals are zeros
MXCSR_PE        = 0x0020        ; Precision flag
MXCSR_UE        = 0x0010        ; Underflow flag
MXCSR_OE        = 0x0008        ; Overflow flag
MXCSR_ZE        = 0x0004        ; Divide-by-zero flag
MXCSR_DE        = 0x0002        ; Denormal flag
MXCSR_IE        = 0x0001        ; Invalid operation flag

STDOUT_FILENO   = 1
SYS64_write     = 1
SYS64_brk       = 12
SYS64_exit_group= 231

segment writeable readable

mem_base        rq 1
hex_table:      db '0123456789abcdef'
mxcsr           rd 1

segment executable readable

entry $
        xor     edi,edi
        mov     eax,SYS64_brk
        syscall
        mov     [mem_base],rax
        lea     rdi,[rax + MEM_SIZE]
        mov     eax,SYS64_brk
        syscall
        cmp     rax,rdi
        mov     edi,1
        jnz     .quit
        stmxcsr [mxcsr]

; none
        mov     dl,'-'
        call    print_char
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not (MXCSR_DZ or MXCSR_FZ)
        ldmxcsr [mxcsr]
        call    fake_time

; DAZ
        mov     dl,'D'
        call    print_char
        mov     dl,'-'
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not MXCSR_FZ
        or      [mxcsr],MXCSR_DZ
        ldmxcsr [mxcsr]
        call    fake_time

; FTZ
        mov     dl,'-'
        call    print_char
        mov     dl,'F'
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not MXCSR_DZ
        or      [mxcsr],MXCSR_FZ
        ldmxcsr [mxcsr]
        call    fake_time

; FTZ + DAZ
        mov     dl,'D'
        call    print_char
        mov     dl,'F'
        call    print_char
        mov     dl,10
        call    print_char
        or      [mxcsr],MXCSR_FZ or MXCSR_DZ
        ldmxcsr [mxcsr]
        call    fake_time

        xor     edi,edi
    .quit:
        mov     eax,SYS64_exit_group
        syscall

macro time reg {
        rdtsc
        mov     e#reg,eax
        shl     rdx,32
        or      r#reg,rdx
}

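fake_time:
        ; Time fake_application twice: once with memory filled with zeros and
        ; once filled with the smallest denormal, then print both tick counts.
        push    rdi rdx rcx rbx rax
        xor     rax,rax                 ; fill value 0 = +0.0
        call    fill_mem
        time    bx                      ; rbx = start timestamp
        mov     rax,STEPS
        call    fake_application
        time    cx
        sub     rcx,rbx                 ; rcx = ticks for the zero run
        mov     rax,1                   ; fill value 1 = smallest denormal double
        call    fill_mem
        time    bx
        mov     rax,STEPS
        call    fake_application
        time    di
        sub     rdi,rbx                 ; rdi = ticks for the denormal run
        mov     rax,rcx
        call    print_hex16
        mov     dl,10
        call    print_char
        mov     rax,rdi
        call    print_hex16
        mov     dl,10
        call    print_char
        pop     rax rbx rcx rdx rdi
        ret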

fill_mem:
        ;rax = value
        push    rdi rcx
        mov     rdi,[mem_base]
        mov     ecx,MEM_SIZE shr 3
        rep     stosq
        pop     rcx rdi
        ret

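fake_application:
        ;rax = loops
        push    rdi rcx rax
    .outer_loop:
        mov     rdi,[mem_base]
        mov     ecx,MEM_SIZE shr 8      ; 256 bytes are consumed per inner iteration
    .inner_loop:
        ; load xmm0..xmm15 from sixteen consecutive 16-byte blocks
        rept 16 r:0 {
                movapd  xmm#r,[rdi + 16 * r]
        }
        ; square every register in place (two doubles per register)
        rept 16 r:0 {
                mulpd   xmm#r,xmm#r
        }
        add     rdi,1 shl 8
        dec     rcx
        jnz     .inner_loop
        dec     rax
        jnz     .outer_loop
        pop     rax rcx rdi
        ret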

print_hex16:
        ;rax = value
        push    rbx rax rcx rdx
        mov     ecx,16
        lea     rbx,[hex_table]
    .next_nibble:
        rol     rax,4
        mov     edx,eax
        and     edx,0xf
        mov     dl,[rbx + rdx]
        call    print_char
        dec     ecx
        jnz     .next_nibble
        pop     rdx rcx rax rbx
        ret

print_char:
        ;dl = character
        push    rsi rdi rax rcx r11 rdx
        mov     eax,SYS64_write
        mov     edi,STDOUT_FILENO
        mov     rsi,rsp
        mov     edx,1
        syscall
        pop     rdx r11 rcx rax rdi rsi
        ret

The output

Code:

--
000000001cdc3215
00000001d07753de ; <--- 16x slower!
D-
000000001ccd3fdd
000000001d7717eb
-F
000000001da1ea45
000000006b1e2ec6 ; <--- 3.6x slower!
DF
000000001de9e62c
000000001c89f473

The first time value is for zero, and the second is for the smallest possible denormal (0x000...001). The two flags are D=DAZ and F=FTZ. A dash means the flag is not set.

Setting the DAZ flag appears to make denormal timing the same as a normal value of zero. Your results may vary depending upon the CPU/OS and about a million other factors. By default Linux sets the mask bits so this test doesn't trigger any exceptions, it is all handled by the CPU internally.
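
To put numbers on the comparison, the tick counts from the output above can be turned into ratios. This is just a small Python sketch over the posted values; `slowdown` is a made-up helper name.

```python
# Tick counts copied from the output above (hex TSC deltas): (zero run, denormal run).
timings = {
    "--": (0x000000001cdc3215, 0x00000001d07753de),
    "D-": (0x000000001ccd3fdd, 0x000000001d7717eb),
    "-F": (0x000000001da1ea45, 0x000000006b1e2ec6),
    "DF": (0x000000001de9e62c, 0x000000001c89f473),
}

def slowdown(zero_ticks, denormal_ticks):
    """Ratio of the denormal run to the zero run under the same flags."""
    return denormal_ticks / zero_ticks

for flags, (zero_ticks, denormal_ticks) in timings.items():
    print(f"{flags}: {slowdown(zero_ticks, denormal_ticks):.2f}x")
```

With these figures the denormal run comes out at about 16x with no flags and about 3.6x with FTZ only, while both DAZ configurations stay close to 1x.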

FTZ appears to only be done during calculations. DAZ appears to be for both loading a value and for calculations.
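
The four configurations differ only in two MXCSR bits. As a rough illustration, the Python sketch below mimics the and/or sequences from the listing on a plain integer. It starts from 0x1F80, the architectural reset value of MXCSR (all six exception masks set, FTZ and DAZ clear), which is an assumption here because the real program reads the current value with stmxcsr. `with_flags` and `describe` are made-up names.

```python
# Bit values follow the MXCSR_* constants in the listing above.
MXCSR_FZ = 0x8000        # flush to zero: denormal results are replaced by zero
MXCSR_DZ = 0x0040        # denormals are zeros: denormal inputs are read as zero
MXCSR_DEFAULT = 0x1F80   # assumed starting value: all exceptions masked, FTZ/DAZ clear

def with_flags(mxcsr, daz, ftz):
    """Clear both flags, then set the requested ones (like the and/or pairs above)."""
    mxcsr &= ~(MXCSR_DZ | MXCSR_FZ)
    if daz:
        mxcsr |= MXCSR_DZ
    if ftz:
        mxcsr |= MXCSR_FZ
    return mxcsr

def describe(mxcsr):
    """Two-character tag in the same style as the test output: D/- then F/-."""
    return ("D" if mxcsr & MXCSR_DZ else "-") + ("F" if mxcsr & MXCSR_FZ else "-")

for daz, ftz in [(False, False), (True, False), (False, True), (True, True)]:
    value = with_flags(MXCSR_DEFAULT, daz, ftz)
    print(f"{describe(value)}  mxcsr = {value:#06x}")
```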

revolution 24 Sep 2023, 20:53

I tested with normal, denormal, infinity, qnan and snan. For this test on my system only the denormal shows any timing variation.

Code:

format elf64 executable

NORM    = 0
DENORM  = 1
INF     = (1 shl 11 - 1) shl 52
QNAN    = INF + 1 + 1 shl 51
SNAN    = INF + 1

MEM_SIZE        = 1 shl 22
STEPS           = 1 shl 7

MXCSR_FZ        = 0x8000        ; Flush to zero
MXCSR_RN        = 0x0000        ; To nearest rounding mode
MXCSR_RM        = 0x2000        ; Toward negative infinity rounding mode
MXCSR_RP        = 0x4000        ; Toward positive infinity rounding mode
MXCSR_RZ        = 0x6000        ; Toward zero rounding mode
MXCSR_PM        = 0x1000        ; Precision mask
MXCSR_UM        = 0x0800        ; Underflow mask
MXCSR_OM        = 0x0400        ; Overflow mask
MXCSR_ZM        = 0x0200        ; Divide-by-zero mask
MXCSR_DM        = 0x0100        ; Denormal mask
MXCSR_IM        = 0x0080        ; Invalid operation mask
MXCSR_DZ        = 0x0040        ; Denormals are zeros
MXCSR_PE        = 0x0020        ; Precision flag
MXCSR_UE        = 0x0010        ; Underflow flag
MXCSR_OE        = 0x0008        ; Overflow flag
MXCSR_ZE        = 0x0004        ; Divide-by-zero flag
MXCSR_DE        = 0x0002        ; Denormal flag
MXCSR_IE        = 0x0001        ; Invalid operation flag

STDOUT_FILENO   = 1
SYS64_write     = 1
SYS64_brk       = 12
SYS64_exit_group= 231

segment writeable readable

mem_base        rq 1
hex_table:      db '0123456789abcdef'
mxcsr           rd 1

segment executable readable

entry $
        xor     edi,edi
        mov     eax,SYS64_brk
        syscall
        mov     [mem_base],rax
        lea     rdi,[rax + MEM_SIZE]
        mov     eax,SYS64_brk
        syscall
        cmp     rax,rdi
        mov     edi,1
        jnz     .quit
        stmxcsr [mxcsr]

; none
        mov     dl,'-'
        call    print_char
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not (MXCSR_DZ or MXCSR_FZ)
        ldmxcsr [mxcsr]
        call    fake_time

; DAZ
        mov     dl,'D'
        call    print_char
        mov     dl,'-'
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not MXCSR_FZ
        or      [mxcsr],MXCSR_DZ
        ldmxcsr [mxcsr]
        call    fake_time

; FTZ
        mov     dl,'-'
        call    print_char
        mov     dl,'F'
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not MXCSR_DZ
        or      [mxcsr],MXCSR_FZ
        ldmxcsr [mxcsr]
        call    fake_time

; FTZ + DAZ
        mov     dl,'D'
        call    print_char
        mov     dl,'F'
        call    print_char
        mov     dl,10
        call    print_char
        or      [mxcsr],MXCSR_FZ or MXCSR_DZ
        ldmxcsr [mxcsr]
        call    fake_time

        xor     edi,edi
    .quit:
        mov     eax,SYS64_exit_group
        syscall

macro time reg {
        rdtsc
        mov     e#reg,eax
        shl     rdx,32
        or      r#reg,rdx
}

fake_time:
        push    rdx rbx rax
        irp     tst, <"N", NORM>, <"D", DENORM>, <"I", INF>, <"Q", QNAN>, <"S", SNAN> {
                match   text =, number, tst \{
                        mov     dl,text
                        call    print_char
                        mov     dl," "
                        call    print_char
                        mov     rax,number
                        call    fill_mem
                        time    bx
                        mov     rax,STEPS
                        call    fake_application
                        time    ax
                        sub     rax,rbx
                        call    print_hex16
                        mov     dl,10
                        call    print_char
                \}
        }
        pop     rax rbx rdx
        ret

fill_mem:
        ;rax = value
        push    rdi rcx
        mov     rdi,[mem_base]
        mov     ecx,MEM_SIZE shr 3
        rep     stosq
        pop     rcx rdi
        ret

fake_application:
        ;rax = loops
        push    rdi rcx rax
    .outer_loop:
        mov     rdi,[mem_base]
        mov     ecx,MEM_SIZE shr 8
    .inner_loop:
        rept 16 r:0 {
                movapd  xmm#r,[rdi + 16 * r]
        }
        rept 16 r:0 {
                mulpd   xmm#r,xmm#r
        }
        add     rdi,1 shl 8
        dec     rcx
        jnz     .inner_loop
        dec     rax
        jnz     .outer_loop
        pop     rax rcx rdi
        ret

print_hex16:
        ;rax = value
        push    rbx rax rcx rdx
        mov     ecx,16
        lea     rbx,[hex_table]
    .next_nibble:
        rol     rax,4
        mov     edx,eax
        and     edx,0xf
        mov     dl,[rbx + rdx]
        call    print_char
        dec     ecx
        jnz     .next_nibble
        pop     rdx rcx rax rbx
        ret

print_char:
        ;dl = character
        push    rsi rdi rax rcx r11 rdx
        mov     eax,SYS64_write
        mov     edi,STDOUT_FILENO
        mov     rsi,rsp
        mov     edx,1
        syscall
        pop     rdx r11 rcx rax rdi rsi
        ret

Sample output

Code:

--                      ; no flags
N 000000001d6a0b04      ; normal
D 00000001c8ac2677      ; denormal
I 000000001b782a1e      ; infinity
Q 000000001d6c3e60      ; qnan
S 000000001cb3a943      ; snan
D-                      ; DAZ
N 000000001cf4597c
D 000000001d5ef212
I 000000001c886b2a
Q 000000001c46ce59
S 000000001c0e2237
-F                      ; FTZ
N 000000001dd31803
D 000000006617b934
I 000000001bda8deb
Q 000000001cb8e0fe
S 000000001ccbdedc
DF                      ; DAZ + FTZ
N 000000001bdf4f7f
D 000000001b9ffbf8
I 000000001c685b4b
Q 000000001bff47bb
S 000000001be52b7a

My system is clearly not optimised to handle denormal values.
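
For reference, the five fill patterns can be decoded with a few lines of Python. This is only a sketch for checking the constants, with `classify` and `as_double` as made-up helpers. Note that fasm evaluates `1 shl 11 - 1` as `(1 shl 11) - 1`, so the Python version needs explicit parentheses.

```python
import math
import struct

# Same bit patterns as the fasm constants in the listing above.
NORM   = 0
DENORM = 1
INF    = ((1 << 11) - 1) << 52
QNAN   = INF + 1 + (1 << 51)
SNAN   = INF + 1

def as_double(bits):
    """Reinterpret a 64-bit integer pattern as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def classify(bits):
    """Name the IEEE 754 class of a 64-bit double pattern (sign bit ignored)."""
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    if exponent == 0x7FF:
        if fraction == 0:
            return "infinity"
        return "quiet NaN" if fraction >> 51 else "signalling NaN"
    if exponent == 0:
        return "zero" if fraction == 0 else "denormal"
    return "normal"

for name, bits in [("N", NORM), ("D", DENORM), ("I", INF), ("Q", QNAN), ("S", SNAN)]:
    print(f"{name} {bits:#018x} {classify(bits):<14} {as_double(bits)!r}")
```

One thing this makes visible is that the "N" pattern is +0.0 rather than a normal number in the strict IEEE sense, so the baseline in the test is the zero case.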

revolution 25 Sep 2023, 08:29

Curiously, the version with 2 doubles per register took longer than the version with 4 singles, even though it processes half as many elements. The timings below show the shorter time taken to load and operate on 4 single floats vs 2 double floats.

Code:

; Neither DAZ nor FTZ flags
0000000199e3998c        ; 4 * single denormals
00000001cefecf08        ; 2 * double denormals

; Use FTZ flag only
000000005c18a244        ; 4 * single denormals
0000000068d47d67        ; 2 * double denormals

For reference, the code

Code:

;       for SIZE in s d ; do fasm -d OP_SIZE=$SIZE mxcsr_sse_timing.asm && ./mxcsr_sse_timing ; done

format elf64 executable

if OP_SIZE eq d         ; double
        MANTISSA_WIDTH = 52
        EXPONENT_WIDTH = 11
        MULTIPLIER = 1
else if OP_SIZE eq s    ; single
        MANTISSA_WIDTH = 23
        EXPONENT_WIDTH = 8
        MULTIPLIER = 1 shl 32 + 1
else
        err
end if

NORM    = 0
DENORM  = 1
INF     = (1 shl EXPONENT_WIDTH - 1) shl MANTISSA_WIDTH
QNAN    = INF + 1 + 1 shl (MANTISSA_WIDTH - 1)
SNAN    = INF + 1

MEM_SIZE        = 1 shl 22
STEPS           = 1 shl 7

MXCSR_FZ        = 0x8000        ; Flush to zero
MXCSR_RN        = 0x0000        ; To nearest rounding mode
MXCSR_RM        = 0x2000        ; Toward negative infinity rounding mode
MXCSR_RP        = 0x4000        ; Toward positive infinity rounding mode
MXCSR_RZ        = 0x6000        ; Toward zero rounding mode
MXCSR_PM        = 0x1000        ; Precision mask
MXCSR_UM        = 0x0800        ; Underflow mask
MXCSR_OM        = 0x0400        ; Overflow mask
MXCSR_ZM        = 0x0200        ; Divide-by-zero mask
MXCSR_DM        = 0x0100        ; Denormal mask
MXCSR_IM        = 0x0080        ; Invalid operation mask
MXCSR_DZ        = 0x0040        ; Denormals are zeros
MXCSR_PE        = 0x0020        ; Precision flag
MXCSR_UE        = 0x0010        ; Underflow flag
MXCSR_OE        = 0x0008        ; Overflow flag
MXCSR_ZE        = 0x0004        ; Divide-by-zero flag
MXCSR_DE        = 0x0002        ; Denormal flag
MXCSR_IE        = 0x0001        ; Invalid operation flag

STDOUT_FILENO   = 1
SYS64_write     = 1
SYS64_brk       = 12
SYS64_exit_group= 231

segment writeable readable

mem_base        rq 1
hex_table:      db '0123456789abcdef'
mxcsr           rd 1

segment executable readable

entry $
        xor     edi,edi
        mov     eax,SYS64_brk
        syscall
        mov     [mem_base],rax
        lea     rdi,[rax + MEM_SIZE]
        mov     eax,SYS64_brk
        syscall
        cmp     rax,rdi
        mov     edi,1
        jnz     .quit
        stmxcsr [mxcsr]

; none
        mov     dl,'-'
        call    print_char
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not (MXCSR_DZ or MXCSR_FZ)
        ldmxcsr [mxcsr]
        call    fake_time

; DAZ
        mov     dl,'D'
        call    print_char
        mov     dl,'-'
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not MXCSR_FZ
        or      [mxcsr],MXCSR_DZ
        ldmxcsr [mxcsr]
        call    fake_time

; FTZ
        mov     dl,'-'
        call    print_char
        mov     dl,'F'
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not MXCSR_DZ
        or      [mxcsr],MXCSR_FZ
        ldmxcsr [mxcsr]
        call    fake_time

; FTZ + DAZ
        mov     dl,'D'
        call    print_char
        mov     dl,'F'
        call    print_char
        mov     dl,10
        call    print_char
        or      [mxcsr],MXCSR_FZ or MXCSR_DZ
        ldmxcsr [mxcsr]
        call    fake_time

        xor     edi,edi
    .quit:
        mov     eax,SYS64_exit_group
        syscall

macro time reg {
        rdtsc
        mov     e#reg,eax
        shl     rdx,32
        or      r#reg,rdx
}

fake_time:
        push    rdx rbx rax
        irp     tst, <"N", NORM>, <"D", DENORM>, <"I", INF>, <"Q", QNAN>, <"S", SNAN> {
                match   text =, number, tst \{
                        mov     dl,text
                        call    print_char
                        mov     dl," "
                        call    print_char
                        mov     rax,number * MULTIPLIER
                        call    fill_mem
                        time    bx
                        mov     rax,STEPS
                        call    fake_application
                        time    ax
                        sub     rax,rbx
                        call    print_hex16
                        mov     dl,10
                        call    print_char
                \}
        }
        pop     rax rbx rdx
        ret

fill_mem:
        ;rax = value
        push    rdi rcx
        mov     rdi,[mem_base]
        mov     ecx,MEM_SIZE shr 3
        rep     stosq
        pop     rcx rdi
        ret

fake_application:
        ;rax = loops
        push    rdi rcx rax
    .outer_loop:
        mov     rdi,[mem_base]
        mov     ecx,MEM_SIZE shr 8
    .inner_loop:
        match opsize, OP_SIZE {
                rept 16 r:0 \{
                        movap#opsize xmm\#r,[rdi + 16 * r]
                \}
                rept 16 r:0 \{
                        mulp#opsize xmm\#r,xmm\#r
                \}
        }
        add     rdi,1 shl 8
        dec     rcx
        jnz     .inner_loop
        dec     rax
        jnz     .outer_loop
        pop     rax rcx rdi
        ret

print_hex16:
        ;rax = value
        push    rbx rax rcx rdx
        mov     ecx,16
        lea     rbx,[hex_table]
    .next_nibble:
        rol     rax,4
        mov     edx,eax
        and     edx,0xf
        mov     dl,[rbx + rdx]
        call    print_char
        dec     ecx
        jnz     .next_nibble
        pop     rdx rcx rax rbx
        ret

print_char:
        ;dl = character
        push    rsi rdi rax rcx r11 rdx
        mov     eax,SYS64_write
        mov     edi,STDOUT_FILENO
        mov     rsi,rsp
        mov     edx,1
        syscall
        pop     rdx r11 rcx rax rdi rsi
        ret
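
The MULTIPLIER constant is what lets the same 64-bit fill routine produce single-precision test values: multiplying a 32-bit pattern by (1 shl 32 + 1) copies it into both halves of the qword. A small Python sketch of that trick follows, with made-up helper names; fasm reads `1 shl 32 + 1` as `(1 shl 32) + 1`, hence the parentheses.

```python
import struct

MULTIPLIER_SINGLE = (1 << 32) + 1       # fasm: 1 shl 32 + 1
DENORM = 1
INF_SINGLE = ((1 << 8) - 1) << 23       # EXPONENT_WIDTH = 8, MANTISSA_WIDTH = 23

def qword_as_singles(qword):
    """Reinterpret one 64-bit fill value as the two float32 lanes it covers."""
    return struct.unpack("<2f", struct.pack("<Q", qword))

fill = DENORM * MULTIPLIER_SINGLE
print(f"{fill:#018x}", qword_as_singles(fill))          # both lanes denormal
print(qword_as_singles(INF_SINGLE * MULTIPLIER_SINGLE)) # both lanes infinity

# Without the multiplier only the low single would be denormal.
print(qword_as_singles(DENORM))

# Elements per 16-byte XMM register for mulps vs mulpd.
print(16 // 4, "singles vs", 16 // 8, "doubles")
```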

macgub

Joined: 11 Jan 2006
Posts: 367
Location: Poland

macgub 28 Oct 2023, 11:51

Hi!

tthsqe wrote:

Code:

|-- double --|-- double --|-- double --|-- double --|-- double --|-- double --| ...
|-- long double --|   0   |-- long double --|   0   |-- long double --|

I guess the topic is about 80-bit floats? Are packed vector operations on that data type part of the standard AVX2/AVX-512 instruction sets? What are the mnemonics of such instructions (just a few examples)?
As far as I can tell from the Intel manuals, there is no information about 80-bit (or 128-bit?) floats in AVX registers. Did I miss something?
Thanks for any answer!

revolution 28 Oct 2023, 11:55

SSE and AVX don't support 80-bit floats, or any element longer than 64 bits, integers included. The maximum size is 64 bits per element.

macgub 28 Oct 2023, 18:13

@revolution - thanks for the reply!
OK - so AVX-512?
AFAIK the FPU supports 80-bit floats as internal temporary values. (I don't know whether the original specification says these values can be popped from the FPU stack - I need to reread it.) But tthsqe posted code above that shows something like a vector operation. Is this implemented in real hardware, or is it manual emulation? 'Rocket Lake', the name used in this topic, looks to me like the 11th generation of Intel's x64 series. Or is it something like PowerPC?
Thanks for any answer.

macgub 28 Oct 2023, 18:19

tthsqe wrote:

x87 math.. .... avx

It seems I misinterpreted tthsqe's notation...

revolution 28 Oct 2023, 18:34

The FPU has supported 80-bit floats since day one. You can freely load and store them from/to memory. There are no vector operations in the FPU. Only SSE/AVX have vector ops.

macgub 29 Oct 2023, 09:23

Thanks for the reply. As I wrote, I did not correctly understand the meaning of the ASCII diagram below:

tthsqe wrote:

Code:

|-- double --|-- double --|-- double --|-- double --|-- double --|-- double --| ...
|-- long double --|   0   |-- long double --|   0   |-- long double --|
