flat assembler
Message board for the users of flat assembler.
revolution 13 Jan 2022, 13:31
I suppose, in theory, if SP is enough for the application then it might be more efficient with regard to memory accesses, since it only uses 4 bytes per value.
But it might also make no difference. It depends upon what the CPU is doing, how it does it, and when it does it. Plus each CPU/system has different performance anyway. The only way to really know is to test the code in both configurations to see if there is a noticeable difference.
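To put a rough number on the footprint point, a minimal sketch in fasm data directives (the labels and values are just made up for the illustration):

Code:
; eight values in each case, only the element size differs
align 16
sp_values dd 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5   ; 8 * 4 bytes = 32 bytes
dp_values dq 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5   ; 8 * 8 bytes = 64 bytes

Twice as many SP values fit in the same cache line, which is where any memory-access win would have to come from.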
sts-q 13 Jan 2022, 14:21
I *believe* mixing 32- and 64-bit values slows things down when accessing RAM.

(I *know* this needs a lot more testing...) (on an i3-550)

Agner wrote about that, but where?
revolution 13 Jan 2022, 14:35
sts-q wrote: I *believe* mixing 32- and 64-bit values slows things down when accessing RAM.

There is no single answer that applies to all systems, or to the code they run.
Hrstka 13 Jan 2022, 15:30
If you are using x87 instructions, the numbers are always converted by the CPU to "long double" (80-bit) floating point format. So I think it makes no difference. When using SSE/AVX instructions, you can do more calculations in parallel with single precision.
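A small sketch of that difference in fasm syntax (the data labels are invented for the example and assumed to be defined and suitably aligned elsewhere):

Code:
; x87: the operand size only affects the load/store conversion;
; the arithmetic itself works on the 80-bit register format either way
fld    dword [a]         ; load a single, converted to 80-bit
fmul   qword [b]         ; multiply by a double, also converted to 80-bit
fstp   dword [c]         ; round back down to single on store

; SSE: the instruction itself decides how many lanes are processed
movaps xmm0, [sp_vec1]
mulps  xmm0, [sp_vec2]   ; four single-precision multiplies in one go
movapd xmm1, [dp_vec1]
mulpd  xmm1, [dp_vec2]   ; only two double-precision multiplies in one go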
revolution 13 Jan 2022, 16:08
Hrstka wrote: If you are using x87 instructions, the numbers are always converted by the CPU to "long double" (80-bit) floating point format.

For modern CPUs the latency is probably only different for a few of the instructions, like fdiv or fsqrt. But check your particular system to see what effect it actually has.
donn 20 Jan 2022, 05:44
In terms of raw calculations, I think Single Precision SSE is considered twice as fast.
If you look at GFLOPS metrics on GPUs, Single Precision peak performance is higher than Double Precision from what I've seen, and I suspect it's the same on CPUs.

So, if you look at instruction latencies, such as in AMD's developer/optimization guides, a floating point instruction has:

Code:
Instruction  Op1   Op2   Op3   Op4  "APM Vol"  Cpuid flag  Ops  Unit   Latency  Throughput
VMULPD       ymm1  ymm2  ymm3       4          AVX         1    FP0/1  3        2
VMULPS       xmm1  xmm2  xmm3       4          AVX         1    FP0/1  3        2

So they execute at the same speed, but you get twice as many calculations in there per computation.

An Intel employee seems to confirm this:
https://community.intel.com/t5/Intel-ISA-Extensions/GFLOPS-numbers-advertised-by-Intel/m-p/905807

Quote:
So, this is just considering the speed of a computation and whether you take advantage of it. Sometimes passing in twice as many floating point values is not possible, or not what the calculation calls for, so this performance gain may not be achieved. Single precision numbers also have a smaller memory footprint, which is a plus.

Double precision is handy when a larger range is required in a number; single precision provides less in a single number, sort of like a high-quality HD image versus a low-quality one. They're encoded as scientific notation, so it's not necessarily just that bigger or smaller numbers become possible, but also how finely values within that range can be represented, the fidelity.

When dealing with performance there are definitely unknowns, nothing is clear cut, but there are risks and opportunities you can weigh up, take advantage of, and then test.
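To make the "twice as many calculations" point concrete, a rough sketch of the same multiply at 256-bit width in fasm syntax (the data labels are invented and assumed to be 32-byte aligned). Going by the latencies quoted above, both forms cost about the same per instruction, but the PS form produces eight results to the PD form's four:

Code:
vmovaps ymm0, [ps_a]
vmulps  ymm0, ymm0, [ps_b]   ; 8 single-precision products per instruction
vmovapd ymm1, [pd_a]
vmulpd  ymm1, ymm1, [pd_b]   ; 4 double-precision products per instruction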
revolution 20 Jan 2022, 06:13
In a purely theoretical scenario SP >= DP, and it would never show lower performance.

But CPUs don't follow theoretical scenarios. They are real-world practical devices, with many independent moving parts, each doing its own thing in its own time, and competing with the others for limited resources. There can even be unintuitive pathological cases where SP < DP.

However, I think in general it would be quite a safe assumption to say that SP >= DP for all but the weirdest cases. As to whether SP > DP, though, that requires testing to confirm.

There are many cases where SP = DP. Your particular application might fall into that category. It all depends upon what you are doing.
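For the testing part, a minimal sketch of the kind of harness I mean, for fasm on 64-bit Linux (the loop count, the data and the crude rdtsc timing are all just picked for the illustration, and a dependent loop like this measures latency rather than throughput):

Code:
format ELF64 executable 3
entry start

segment readable executable
start:
        ; time 1,000,000 dependent packed single-precision multiplies
        mov     ecx, 1000000
        movaps  xmm0, [ps_vals]
        rdtsc
        shl     rdx, 32
        or      rax, rdx
        mov     r8, rax              ; start count
.sp:    mulps   xmm0, [ps_vals]
        dec     ecx
        jnz     .sp
        rdtsc
        shl     rdx, 32
        or      rax, rdx
        sub     rax, r8
        mov     r9, rax              ; r9 = cycles for the SP loop

        ; same again with packed double precision
        mov     ecx, 1000000
        movapd  xmm1, [pd_vals]
        rdtsc
        shl     rdx, 32
        or      rax, rdx
        mov     r8, rax
.dp:    mulpd   xmm1, [pd_vals]
        dec     ecx
        jnz     .dp
        rdtsc
        shl     rdx, 32
        or      rax, rdx
        sub     rax, r8              ; rax = cycles for the DP loop

        ; exit code 1 if the SP loop took fewer cycles, 0 otherwise (check with echo $?)
        xor     edi, edi
        cmp     r9, rax
        setb    dil
        mov     eax, 60              ; sys_exit
        syscall

segment readable writeable
align 16
ps_vals dd 1.0, 1.0, 1.0, 1.0
pd_vals dq 1.0, 1.0

A real measurement would pin the thread, serialise around rdtsc and average several runs, but even something this crude usually shows whether the two configurations differ at all on a given machine.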
donn 20 Jan 2022, 06:59
Sure, yep, theories. There are probably theories behind the scenarios that yield unexpected results too, and probabilities associated with their causes.

So it definitely helps to learn the context around the theories, so they can be used or not. In this example SP can be faster than DP, but if you don't pack more SP values into the equivalent instruction, you may not gain any benefit. Maybe you need more SP values to model your data, so you run more computations and perform more memory moves without proper batching, and you end up with slower results.

Agreed also, real-world testing does not necessarily line up with theory, due to so many moving parts, like background processes running that are totally unrelated.
revolution 20 Jan 2022, 07:19
Perhaps to answer the OP's questions (all IMO):

1. If SP precision is sufficient for your use, then use SP.
2. If SP isn't enough bits, then you have no choice: use DP.

Using that, you will probably be just fine with regard to performance. Almost never slower, sometimes equal, and sometimes a gain.