flat assembler
Message board for the users of flat assembler.
How can x87 use cause massive avx slowdown [solved - my bad]
revolution 16 Sep 2023, 17:23
I wouldn't be so quick to blame the compiler. It might be the CPU.
CPUs can/might (do?) share execution units between x87 and AVX computations. Regularly switching between the two could trigger some kind of flush/reload behaviour that gives a noticeable delay if done rapidly enough. Also note that the long double is "special" because it needs more resources than the "normal" doubles that AVX supports. Perhaps the float circuitry needs to reconfigure itself to switch between lengths? Remember that MMX has an instruction to signal the "end" of the MMX computations. The CPU will switch automatically if you forget to use it, but it would be "slower" on the older CPUs. So it wouldn't surprise me if x87/AVX gets similar treatment today.
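The MMX instruction I mean is emms, and AVX has an analogue for the AVX/SSE transition case. A generic sketch, not from anyone's actual code here:
Code:
        ; sketch only: the "end of block" signalling instructions
        paddw   mm0,mm1                 ; MMX work shares the x87 register file
        emms                            ; signal the end of the MMX block so x87 code runs at full speed
        ; ...
        vmulpd  ymm0,ymm0,ymm1          ; AVX work
        vzeroupper                      ; zero the upper YMM halves to avoid AVX/SSE transition penalties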
Furs 16 Sep 2023, 17:48
What's your CPU? The micro-architecture name should be enough, no need for the specific model, so I can look it up.
tthsqe 16 Sep 2023, 18:08
Rocket Lake. The switch from x87 to AVX only happens once.
revolution 16 Sep 2023, 19:28
I would think it is more likely a cache and/or alignment problem than anything else. I base this mainly on my experience with performance problems, where it is nearly always cache or alignment. If you don't have those buttoned down solid then doing other analysis first is kind of pointless IMO.
Have you tried using non-temporal stores for the x87 data the code produces? I have found those to be awesome for relieving the workload of the cache. Depends upon what you are doing, of course. Without seeing your code or knowing the full system arch, it is really hard to guess.
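A sketch of what I mean, assuming the results have been moved into xmm0 and rdi points to a 16-byte aligned output buffer (both assumptions, not from the OP's code):
Code:
        movntpd [rdi],xmm0      ; non-temporal store: bypasses the cache hierarchy
        add     rdi,16
        ; ... more streaming stores ...
        sfence                  ; make the streaming stores globally visible before later normal stores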
tthsqe 17 Sep 2023, 22:55
oops, turned out to be my fault - when the long double calculations ran, they wrote long doubles into temp memory that the avx code used as well. Some of the avx lanes were unused, and I was just copying the junk memory right into the avx regs. I guess gcc uses 16 bytes for the long double, and so the junk in the upper bytes makes a big fat denormal double, which apparently slows these things to a crawl.
I have noticed that infinities and nans are lightning fast, at least in scalar double calculations, but these denormals are news to me. Make sure your 'unused' avx lanes aren't doing something stupid. Interestingly enough, you can absolutely trash the x87 state and your avx code will be fine (this doesn't say anything about transitions though).
revolution 17 Sep 2023, 23:20
I've never had a use case for denormals.
There exists a Flush-To-Zero flag you can set. It uses zero in place of a denormal. Perhaps it was included because of this very problem?
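For SSE/AVX it lives in MXCSR: bit 15 is FTZ and bit 6 is DAZ. A minimal sketch (mxcsr here is just a scratch dword in memory; the full test program further down does the same thing with named constants):
Code:
        stmxcsr [mxcsr]         ; read the current MXCSR
        or      [mxcsr],0x8000  ; bit 15 = FTZ: flush denormal results to zero
        or      [mxcsr],0x0040  ; bit 6 = DAZ: treat denormal inputs as zero
        ldmxcsr [mxcsr]         ; write it back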
tthsqe 17 Sep 2023, 23:35
Ha, it wasn't a use case on my part. It was a case of accidentally putting junk into avx lanes I wasn't considering as part of the calculation.
Code:
|-- double --|-- double --|-- double --|-- double --|-- double --|-- double --| ...
|-- long double --| 0 |-- long double --| 0 |-- long double --|
I think flush to zero is the default, but that doesn't mean that the floating point units enjoy being fed denormal inputs.
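For the record, loading only the valid lane would have kept the junk out of the register in the first place (a sketch; rsi standing in for a pointer to one of those 16-byte slots):
Code:
        vmovsd  xmm0,[rsi]      ; loads only the low 8 bytes; the upper lane is zeroed
        ; rather than
        vmovapd xmm0,[rsi]      ; loads all 16 bytes, junk padding included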
revolution 17 Sep 2023, 23:49
That alignment is the recommended method for long doubles. So the compiler was "doing-it-right"™
And it shows the importance of actually testing and not assuming.
Furs 18 Sep 2023, 16:55
tthsqe wrote:
Ha, it wasn't a use case on my part. It was a case of accidentally putting junk into avx lanes I wasn't considering as part of the calculation.
tthsqe 18 Sep 2023, 23:58
Yes, this was a definite oops on my part. But I did learn that, at least for Rocket Lake:
1. avx speed seems to be independent of whatever (possibly trashed) x87 state you are in.
2. putting junk into ignored avx lanes can cause a big slowdown when double precision operations hit denormal input.
I am aware that there are write masks now so that I can truly ignore certain lanes, but I wasn't doing that.
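For reference, the write-mask version would look something like this (a sketch with arbitrary register choices; with a zeroing mask the masked-off lanes never see the junk):
Code:
        mov     eax,0101b               ; lanes 0 and 2 active
        kmovw   k1,eax
        vmulpd  zmm0{k1}{z},zmm1,zmm2   ; masked-off lanes are zeroed and never computed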
revolution 19 Sep 2023, 04:20
I wrote a useless fake timing test to see what effect FTZ and DAZ flags might have. It appears that the DAZ flag is the better choice (at least on my system and in this particular useless "application").
Code:
format elf64 executable

MEM_SIZE        = 1 shl 22
STEPS           = 1 shl 9

MXCSR_FZ        = 0x8000        ; Flush to zero
MXCSR_RN        = 0x0000        ; To nearest rounding mode
MXCSR_RM        = 0x2000        ; Toward negative infinity rounding mode
MXCSR_RP        = 0x4000        ; Toward positive infinity rounding mode
MXCSR_RZ        = 0x6000        ; Toward zero rounding mode
MXCSR_PM        = 0x1000        ; Precision mask
MXCSR_UM        = 0x0800        ; Underflow mask
MXCSR_OM        = 0x0400        ; Overflow mask
MXCSR_ZM        = 0x0200        ; Divide-by-zero mask
MXCSR_DM        = 0x0100        ; Denormal mask
MXCSR_IM        = 0x0080        ; Invalid operation mask
MXCSR_DZ        = 0x0040        ; Denormals are zeros
MXCSR_PE        = 0x0020        ; Precision flag
MXCSR_UE        = 0x0010        ; Underflow flag
MXCSR_OE        = 0x0008        ; Overflow flag
MXCSR_ZE        = 0x0004        ; Divide-by-zero flag
MXCSR_DE        = 0x0002        ; Denormal flag
MXCSR_IE        = 0x0001        ; Invalid operation flag

STDOUT_FILENO   = 1

SYS64_write     = 1
SYS64_brk       = 12
SYS64_exit_group= 231

segment writeable readable

mem_base        rq 1
hex_table:      db '0123456789abcdef'
mxcsr           rd 1

segment executable readable

entry $
        xor     edi,edi
        mov     eax,SYS64_brk
        syscall
        mov     [mem_base],rax
        lea     rdi,[rax + MEM_SIZE]
        mov     eax,SYS64_brk
        syscall
        cmp     rax,rdi
        mov     edi,1
        jnz     .quit
        stmxcsr [mxcsr]
        ; none
        mov     dl,'-'
        call    print_char
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not (MXCSR_DZ or MXCSR_FZ)
        ldmxcsr [mxcsr]
        call    fake_time
        ; DAZ
        mov     dl,'D'
        call    print_char
        mov     dl,'-'
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not MXCSR_FZ
        or      [mxcsr],MXCSR_DZ
        ldmxcsr [mxcsr]
        call    fake_time
        ; FTZ
        mov     dl,'-'
        call    print_char
        mov     dl,'F'
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not MXCSR_DZ
        or      [mxcsr],MXCSR_FZ
        ldmxcsr [mxcsr]
        call    fake_time
        ; FTZ + DAZ
        mov     dl,'D'
        call    print_char
        mov     dl,'F'
        call    print_char
        mov     dl,10
        call    print_char
        or      [mxcsr],MXCSR_FZ or MXCSR_DZ
        ldmxcsr [mxcsr]
        call    fake_time
        xor     edi,edi
  .quit:
        mov     eax,SYS64_exit_group
        syscall

macro time reg {
        rdtsc
        mov     e#reg,eax
        shl     rdx,32
        or      r#reg,rdx
}

fake_time:
        push    rdi rdx rcx rbx rax
        xor     rax,rax
        call    fill_mem
        time    bx
        mov     rax,STEPS
        call    fake_application
        time    cx
        sub     rcx,rbx
        mov     rax,1
        call    fill_mem
        time    bx
        mov     rax,STEPS
        call    fake_application
        time    di
        sub     rdi,rbx
        mov     rax,rcx
        call    print_hex16
        mov     dl,10
        call    print_char
        mov     rax,rdi
        call    print_hex16
        mov     dl,10
        call    print_char
        pop     rax rbx rcx rdx rdi
        ret

fill_mem:
        ;rax = value
        push    rdi rcx
        mov     rdi,[mem_base]
        mov     ecx,MEM_SIZE shr 3
        rep     stosq
        pop     rcx rdi
        ret

fake_application:
        ;rax = loops
        push    rdi rcx rax
  .outer_loop:
        mov     rdi,[mem_base]
        mov     ecx,MEM_SIZE shr 8
  .inner_loop:
        rept 16 r:0 { movapd xmm#r,[rdi + 16 * r] }
        rept 16 r:0 { mulpd xmm#r,xmm#r }
        add     rdi,1 shl 8
        dec     rcx
        jnz     .inner_loop
        dec     rax
        jnz     .outer_loop
        pop     rax rcx rdi
        ret

print_hex16:
        ;rax = value
        push    rbx rax rcx rdx
        mov     ecx,16
        lea     rbx,[hex_table]
  .next_nibble:
        rol     rax,4
        mov     edx,eax
        and     edx,0xf
        mov     dl,[rbx + rdx]
        call    print_char
        dec     ecx
        jnz     .next_nibble
        pop     rdx rcx rax rbx
        ret

print_char:
        ;dl = character
        push    rsi rdi rax rcx r11 rdx
        mov     eax,SYS64_write
        mov     edi,STDOUT_FILENO
        mov     rsi,rsp
        mov     edx,1
        syscall
        pop     rdx r11 rcx rax rdi rsi
        ret
Code:
--
000000001cdc3215
00000001d07753de ; <--- 16x slower!
D-
000000001ccd3fdd
000000001d7717eb
-F
000000001da1ea45
000000006b1e2ec6 ; <--- 3x slower!
DF
000000001de9e62c
000000001c89f473
Setting the DAZ flag appears to make the denormal timing the same as a normal value of zero. Your results may vary depending upon the CPU/OS and about a million other factors. By default Linux sets the mask bits so this test doesn't trigger any exceptions; it is all handled by the CPU internally.
FTZ appears to apply only during calculations.
DAZ appears to apply both when loading a value and during calculations.
revolution 24 Sep 2023, 20:53
I tested with normal, denormal, infinity, qnan and snan. For this test on my system only the denormal shows any timing variation.
Code:
format elf64 executable

NORM    = 0
DENORM  = 1
INF     = (1 shl 11 - 1) shl 52
QNAN    = INF + 1 + 1 shl 51
SNAN    = INF + 1

MEM_SIZE        = 1 shl 22
STEPS           = 1 shl 7

MXCSR_FZ        = 0x8000        ; Flush to zero
MXCSR_RN        = 0x0000        ; To nearest rounding mode
MXCSR_RM        = 0x2000        ; Toward negative infinity rounding mode
MXCSR_RP        = 0x4000        ; Toward positive infinity rounding mode
MXCSR_RZ        = 0x6000        ; Toward zero rounding mode
MXCSR_PM        = 0x1000        ; Precision mask
MXCSR_UM        = 0x0800        ; Underflow mask
MXCSR_OM        = 0x0400        ; Overflow mask
MXCSR_ZM        = 0x0200        ; Divide-by-zero mask
MXCSR_DM        = 0x0100        ; Denormal mask
MXCSR_IM        = 0x0080        ; Invalid operation mask
MXCSR_DZ        = 0x0040        ; Denormals are zeros
MXCSR_PE        = 0x0020        ; Precision flag
MXCSR_UE        = 0x0010        ; Underflow flag
MXCSR_OE        = 0x0008        ; Overflow flag
MXCSR_ZE        = 0x0004        ; Divide-by-zero flag
MXCSR_DE        = 0x0002        ; Denormal flag
MXCSR_IE        = 0x0001        ; Invalid operation flag

STDOUT_FILENO   = 1

SYS64_write     = 1
SYS64_brk       = 12
SYS64_exit_group= 231

segment writeable readable

mem_base        rq 1
hex_table:      db '0123456789abcdef'
mxcsr           rd 1

segment executable readable

entry $
        xor     edi,edi
        mov     eax,SYS64_brk
        syscall
        mov     [mem_base],rax
        lea     rdi,[rax + MEM_SIZE]
        mov     eax,SYS64_brk
        syscall
        cmp     rax,rdi
        mov     edi,1
        jnz     .quit
        stmxcsr [mxcsr]
        ; none
        mov     dl,'-'
        call    print_char
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not (MXCSR_DZ or MXCSR_FZ)
        ldmxcsr [mxcsr]
        call    fake_time
        ; DAZ
        mov     dl,'D'
        call    print_char
        mov     dl,'-'
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not MXCSR_FZ
        or      [mxcsr],MXCSR_DZ
        ldmxcsr [mxcsr]
        call    fake_time
        ; FTZ
        mov     dl,'-'
        call    print_char
        mov     dl,'F'
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not MXCSR_DZ
        or      [mxcsr],MXCSR_FZ
        ldmxcsr [mxcsr]
        call    fake_time
        ; FTZ + DAZ
        mov     dl,'D'
        call    print_char
        mov     dl,'F'
        call    print_char
        mov     dl,10
        call    print_char
        or      [mxcsr],MXCSR_FZ or MXCSR_DZ
        ldmxcsr [mxcsr]
        call    fake_time
        xor     edi,edi
  .quit:
        mov     eax,SYS64_exit_group
        syscall

macro time reg {
        rdtsc
        mov     e#reg,eax
        shl     rdx,32
        or      r#reg,rdx
}

fake_time:
        push    rdx rbx rax
        irp tst, <"N", NORM>, <"D", DENORM>, <"I", INF>, <"Q", QNAN>, <"S", SNAN> {
                match text =, number, tst \{
                        mov     dl,text
                        call    print_char
                        mov     dl," "
                        call    print_char
                        mov     rax,number
                        call    fill_mem
                        time    bx
                        mov     rax,STEPS
                        call    fake_application
                        time    ax
                        sub     rax,rbx
                        call    print_hex16
                        mov     dl,10
                        call    print_char
                \}
        }
        pop     rax rbx rdx
        ret

fill_mem:
        ;rax = value
        push    rdi rcx
        mov     rdi,[mem_base]
        mov     ecx,MEM_SIZE shr 3
        rep     stosq
        pop     rcx rdi
        ret

fake_application:
        ;rax = loops
        push    rdi rcx rax
  .outer_loop:
        mov     rdi,[mem_base]
        mov     ecx,MEM_SIZE shr 8
  .inner_loop:
        rept 16 r:0 { movapd xmm#r,[rdi + 16 * r] }
        rept 16 r:0 { mulpd xmm#r,xmm#r }
        add     rdi,1 shl 8
        dec     rcx
        jnz     .inner_loop
        dec     rax
        jnz     .outer_loop
        pop     rax rcx rdi
        ret

print_hex16:
        ;rax = value
        push    rbx rax rcx rdx
        mov     ecx,16
        lea     rbx,[hex_table]
  .next_nibble:
        rol     rax,4
        mov     edx,eax
        and     edx,0xf
        mov     dl,[rbx + rdx]
        call    print_char
        dec     ecx
        jnz     .next_nibble
        pop     rdx rcx rax rbx
        ret

print_char:
        ;dl = character
        push    rsi rdi rax rcx r11 rdx
        mov     eax,SYS64_write
        mov     edi,STDOUT_FILENO
        mov     rsi,rsp
        mov     edx,1
        syscall
        pop     rdx r11 rcx rax rdi rsi
        ret
Code:
--                 ; no flags
N 000000001d6a0b04 ; normal
D 00000001c8ac2677 ; denormal
I 000000001b782a1e ; infinity
Q 000000001d6c3e60 ; qnan
S 000000001cb3a943 ; snan
D-                 ; DAZ
N 000000001cf4597c
D 000000001d5ef212
I 000000001c886b2a
Q 000000001c46ce59
S 000000001c0e2237
-F                 ; FTZ
N 000000001dd31803
D 000000006617b934
I 000000001bda8deb
Q 000000001cb8e0fe
S 000000001ccbdedc
DF                 ; DAZ + FTZ
N 000000001bdf4f7f
D 000000001b9ffbf8
I 000000001c685b4b
Q 000000001bff47bb
S 000000001be52b7a
revolution 25 Sep 2023, 08:29
Curiously, when comparing 4 singles vs 2 doubles per register, the doubles took longer to process even though there are half as many of them. Below shows the shorter time taken to load and operate on 4 single floats vs 2 double floats.
Code:
; Neither DAZ or FTZ flags
0000000199e3998c ; 4 * single denormals
00000001cefecf08 ; 2 * double denormals

; Use FTZ flag only
000000005c18a244 ; 4 * single denormals
0000000068d47d67 ; 2 * double denormals
Code:
; for SIZE in s d ; do fasm -d OP_SIZE=$SIZE mxcsr_sse_timing.asm && ./mxcsr_sse_timing ; done

format elf64 executable

if OP_SIZE eq d
        ; double
        MANTISSA_WIDTH  = 52
        EXPONENT_WIDTH  = 11
        MULTIPLIER      = 1
else if OP_SIZE eq s
        ; single
        MANTISSA_WIDTH  = 23
        EXPONENT_WIDTH  = 8
        MULTIPLIER      = 1 shl 32 + 1
else
        err
end if

NORM    = 0
DENORM  = 1
INF     = (1 shl EXPONENT_WIDTH - 1) shl MANTISSA_WIDTH
QNAN    = INF + 1 + 1 shl (MANTISSA_WIDTH - 1)
SNAN    = INF + 1

MEM_SIZE        = 1 shl 22
STEPS           = 1 shl 7

MXCSR_FZ        = 0x8000        ; Flush to zero
MXCSR_RN        = 0x0000        ; To nearest rounding mode
MXCSR_RM        = 0x2000        ; Toward negative infinity rounding mode
MXCSR_RP        = 0x4000        ; Toward positive infinity rounding mode
MXCSR_RZ        = 0x6000        ; Toward zero rounding mode
MXCSR_PM        = 0x1000        ; Precision mask
MXCSR_UM        = 0x0800        ; Underflow mask
MXCSR_OM        = 0x0400        ; Overflow mask
MXCSR_ZM        = 0x0200        ; Divide-by-zero mask
MXCSR_DM        = 0x0100        ; Denormal mask
MXCSR_IM        = 0x0080        ; Invalid operation mask
MXCSR_DZ        = 0x0040        ; Denormals are zeros
MXCSR_PE        = 0x0020        ; Precision flag
MXCSR_UE        = 0x0010        ; Underflow flag
MXCSR_OE        = 0x0008        ; Overflow flag
MXCSR_ZE        = 0x0004        ; Divide-by-zero flag
MXCSR_DE        = 0x0002        ; Denormal flag
MXCSR_IE        = 0x0001        ; Invalid operation flag

STDOUT_FILENO   = 1

SYS64_write     = 1
SYS64_brk       = 12
SYS64_exit_group= 231

segment writeable readable

mem_base        rq 1
hex_table:      db '0123456789abcdef'
mxcsr           rd 1

segment executable readable

entry $
        xor     edi,edi
        mov     eax,SYS64_brk
        syscall
        mov     [mem_base],rax
        lea     rdi,[rax + MEM_SIZE]
        mov     eax,SYS64_brk
        syscall
        cmp     rax,rdi
        mov     edi,1
        jnz     .quit
        stmxcsr [mxcsr]
        ; none
        mov     dl,'-'
        call    print_char
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not (MXCSR_DZ or MXCSR_FZ)
        ldmxcsr [mxcsr]
        call    fake_time
        ; DAZ
        mov     dl,'D'
        call    print_char
        mov     dl,'-'
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not MXCSR_FZ
        or      [mxcsr],MXCSR_DZ
        ldmxcsr [mxcsr]
        call    fake_time
        ; FTZ
        mov     dl,'-'
        call    print_char
        mov     dl,'F'
        call    print_char
        mov     dl,10
        call    print_char
        and     [mxcsr],not MXCSR_DZ
        or      [mxcsr],MXCSR_FZ
        ldmxcsr [mxcsr]
        call    fake_time
        ; FTZ + DAZ
        mov     dl,'D'
        call    print_char
        mov     dl,'F'
        call    print_char
        mov     dl,10
        call    print_char
        or      [mxcsr],MXCSR_FZ or MXCSR_DZ
        ldmxcsr [mxcsr]
        call    fake_time
        xor     edi,edi
  .quit:
        mov     eax,SYS64_exit_group
        syscall

macro time reg {
        rdtsc
        mov     e#reg,eax
        shl     rdx,32
        or      r#reg,rdx
}

fake_time:
        push    rdx rbx rax
        irp tst, <"N", NORM>, <"D", DENORM>, <"I", INF>, <"Q", QNAN>, <"S", SNAN> {
                match text =, number, tst \{
                        mov     dl,text
                        call    print_char
                        mov     dl," "
                        call    print_char
                        mov     rax,number * MULTIPLIER
                        call    fill_mem
                        time    bx
                        mov     rax,STEPS
                        call    fake_application
                        time    ax
                        sub     rax,rbx
                        call    print_hex16
                        mov     dl,10
                        call    print_char
                \}
        }
        pop     rax rbx rdx
        ret

fill_mem:
        ;rax = value
        push    rdi rcx
        mov     rdi,[mem_base]
        mov     ecx,MEM_SIZE shr 3
        rep     stosq
        pop     rcx rdi
        ret

fake_application:
        ;rax = loops
        push    rdi rcx rax
  .outer_loop:
        mov     rdi,[mem_base]
        mov     ecx,MEM_SIZE shr 8
  .inner_loop:
        match opsize, OP_SIZE {
                rept 16 r:0 \{ movap#opsize xmm\#r,[rdi + 16 * r] \}
                rept 16 r:0 \{ mulp#opsize xmm\#r,xmm\#r \}
        }
        add     rdi,1 shl 8
        dec     rcx
        jnz     .inner_loop
        dec     rax
        jnz     .outer_loop
        pop     rax rcx rdi
        ret

print_hex16:
        ;rax = value
        push    rbx rax rcx rdx
        mov     ecx,16
        lea     rbx,[hex_table]
  .next_nibble:
        rol     rax,4
        mov     edx,eax
        and     edx,0xf
        mov     dl,[rbx + rdx]
        call    print_char
        dec     ecx
        jnz     .next_nibble
        pop     rdx rcx rax rbx
        ret

print_char:
        ;dl = character
        push    rsi rdi rax rcx r11 rdx
        mov     eax,SYS64_write
        mov     edi,STDOUT_FILENO
        mov     rsi,rsp
        mov     edx,1
        syscall
        pop     rdx r11 rcx rax rdi rsi
        ret
macgub 28 Oct 2023, 11:51
Hi!
tthsqe wrote:
I guess this topic is about 80-bit floats? Are packed vector operations on that data type part of the standard AVX2/AVX-512 instruction sets? What are the mnemonics of such instructions (just a few examples)? As I read the Intel manuals, I don't see any information about 80-bit (or 128-bit?) floats in AVX registers. Did I miss something? Thanks for the answer!
revolution 28 Oct 2023, 11:55
SSE and AVX don't support 80-bit floats, or anything longer than 64 bits, including integers; the max size is 64 bits per element.
macgub 28 Oct 2023, 18:13
@revolution - thanks for the reply!
OK - so what about AVX-512? AFAIK the FPU supports 80-bit floats as internal temporary values. (I don't know if the original specification says these values can be popped from the FPU stack - I need to reread it.) But tthsqe writes above code that shows something like a vector operation. Is this implemented in real hardware, or emulated manually? 'Rocket Lake', the name used in this topic, seems to me to be the 11th generation of the Intel x64 series. Or is it something like PowerPC? Thanks for the answer.
macgub 28 Oct 2023, 18:19
tthsqe wrote:
Seems I misinterpreted tthsqe's notation...
revolution 28 Oct 2023, 18:34
The FPU has supported 80-bit floats since day one. You can freely load and store them from/to memory. There are no vector operations in the FPU. Only SSE/AVX have vector ops.
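For example (a sketch; x80 standing in for a 10-byte variable defined with dt):
Code:
        fld     tword [x80]     ; load all 80 bits onto the x87 stack
        fmul    st0,st0         ; all x87 arithmetic is scalar, one value at a time
        fstp    tword [x80]     ; pop and store all 80 bits back to memory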
macgub 29 Oct 2023, 09:23
Thanks for the reply. As I wrote, I had not correctly understood the meaning of the ASCII diagram quoted below:
tthsqe wrote: