flat assembler
Message board for the users of flat assembler.
randall 04 Jun 2013, 10:04
Peak single precision floating point throughput (in Turbo Mode, 3.9 GHz):
32 FLOPs/clock * 4 cores * 3.9 GHz = 499.2 GFLOPS
New AVX2 and FMA extensions. A really good CPU for graphics (fractal) programming.
"The mighty AVX2 command for multiply-add (FMA) can be performed in parallel in two units via ports 0 and 1 with a latency of 5 cycles. In throughput, Haswell comes with FMA at 32 flops per clock and core in single precision and 16 flops per clock and core in double precision."
http://www.realworldtech.com/haswell-cpu/
http://www.anandtech.com/show/6355/intels-haswell-architecture
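Where the 32 FLOPs per clock per core come from: 2 FMA units * 8 single-precision lanes * 2 operations (mul + add) per FMA. A minimal sketch in fasm syntax (register use and the loop body are purely illustrative) of the kind of kernel that exercises both FMA ports:
Code:
        ; rcx = iteration count; ymm2/ymm3 hold inputs, ymm4 a coefficient
        vxorps      ymm0, ymm0, ymm0    ; accumulator, chain 1
        vxorps      ymm1, ymm1, ymm1    ; accumulator, chain 2 (independent)
fma_loop:
        vfmadd231ps ymm0, ymm2, ymm4    ; ymm0 += ymm2 * ymm4  (one FMA port)
        vfmadd231ps ymm1, ymm3, ymm4    ; ymm1 += ymm3 * ymm4  (the other port)
        sub         rcx, 1
        jnz         fma_loop
Note that with a 5-cycle FMA latency you need about latency * ports = 10 independent accumulator chains to actually sustain two FMAs per clock; the two chains above only show the principle.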
randall 04 Jun 2013, 14:14
FMA is very popular in graphics programming. Almost all linear algebra computations can benefit from it (matrix multiplication, dot product, distance estimation, etc.).
Shading equations can be expressed using FMAs (they are just linear combinations, plus maybe pow() for the specular term). Most graphics workloads are just sequences of mul and add operations. By merging them into sequences of FMAs we can save a lot of time. FMA is one of the reasons why GPUs are so fast at graphics - they have had FMA (MAD) from the beginning.
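For example, the hot loop of a dot product collapses to one FMA per eight floats. A minimal sketch in fasm syntax (the pointer registers, counter, and unaligned loads are illustrative assumptions):
Code:
        ; rsi -> a, rdi -> b, rcx = element count / 8
        vxorps      ymm0, ymm0, ymm0        ; accumulator
dot_loop:
        vmovups     ymm1, yword [rsi]
        vfmadd231ps ymm0, ymm1, yword [rdi] ; acc += a[i..i+7] * b[i..i+7]
        add         rsi, 32
        add         rdi, 32
        sub         rcx, 1
        jnz         dot_loop
        vextractf128 xmm1, ymm0, 1          ; horizontal sum of the 8 lanes
        vaddps      xmm0, xmm0, xmm1
        vhaddps     xmm0, xmm0, xmm0
        vhaddps     xmm0, xmm0, xmm0        ; scalar result in xmm0[0]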
tthsqe 04 Jun 2013, 16:47
Ok, so what is the bottom line for add, mul, and fmadd?
Each cycle, the processor can dispatch one of the following:
1) an add and a mul
2) an add and a fmadd
3) a mul and a fmadd
4) two fmadds?
I recall that Sandy Bridge could dispatch at most one add and one multiply per clock. It seems weird if Haswell can dispatch two fmadds per clock but not two muls or two adds per clock.
randall 04 Jun 2013, 17:29
5) Two FP MUL (but not 2 FP ADD)
"A side effect of the FMA units is that you now get two ports worth of FP multiply units, which can be a big boon to legacy FP code." Here are some nice pictures for Haswell and Sandy Bridge: http://www.anandtech.com/show/6355/intels-haswell-architecture/8 |
tthsqe 04 Jun 2013, 18:12
Ok, so we get every combo except 2x add. That should be a monster compared to my 2600K.
randall 04 Jun 2013, 20:09
Yes, this is quite an improvement from Intel. I will test it next week. Currently I have an old Core 2 Duo at 1.86 GHz, so there will be quite a difference.
bitRAKE 05 Jun 2013, 03:36
Wonder if they've actually covered the increased latency with the increased cache bandwidth? This always has the effect of increasing the optimal code granularity.
randall 05 Jun 2013, 07:54
bitRAKE wrote: Wonder if they've actually covered the increased latency with the increased cache bandwidth? This always has the effect of increasing the optimal code granularity.
"Feeding the Beast: 2x Cache Bandwidth in Haswell"
"With an outright doubling of peak FP throughput in Haswell, Intel had to ensure that the execution units had ample bandwidth to the caches to sustain performance. As a result L1 bandwidth is doubled, as is the interface between the L1 and L2 caches."
http://www.anandtech.com/show/6355/intels-haswell-architecture/9
But what do you mean by "increased latency"? Which latency?
hopcode 05 Jun 2013, 10:21
Hello people, thanks for sharing the docs. This Haswell is awesome.
Quote: what do you mean by "increased latency"? Which latency?
The 8 physical ports in the exec engine are 2 more than the old 6. Anyway, instruction fetching from the cache still happens 16 bytes at a time, as on the old cores. The difference is in the 56-entry fused-uop buffer and in the 4 uops delivered on a hit (corresponding to a full 32B instruction window). Read it at http://www.realworldtech.com/haswell-cpu/2/ past the picture.
Also, I think the answer to bitRAKE should be no, not directly. But the advantage of a single 56-entry uop buffer is huge, especially for normal and mixed (FP/integer) applications on a single thread.
_________________ ⠓⠕⠏⠉⠕⠙⠑
bitRAKE 05 Jun 2013, 10:44
I have no doubt there is greater throughput, but how many resources need to be in flight before the break-even point? This is the code granularity. It will make previously optimized code perform poorly, or vice versa - Haswell-optimized code perform poorly on previous models.
There is greater decode latency, and cache latency. Bandwidth helps throughput, not latency. There will always be operations which do not happen massively in parallel. What is the increase in cost for these types of operations? To some extent it is always a trade-off between the two aspects.
The latter part of http://www.realworldtech.com/haswell-cpu/5/ suggests that there are no increases in latency. But my L1 cache latency is only 3 cycles, not 4, on a processor a few generations old. My L2 cache latency is 15 cycles though. Trade-offs to increase throughput.
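For reference, the usual way to see the latency (as opposed to the bandwidth) is a pointer-chasing loop in which every load depends on the previous one. A minimal sketch in fasm syntax; the buffer 'chain', its initialisation into a cycle of pointers sized to the cache level under test, and the iteration count are all assumptions, and rdtsc ticks still have to be converted to core cycles:
Code:
        ; 'chain' is assumed to be pre-linked: each qword holds the address
        ; of the next element, forming a cycle that fits in the cache level
        ; being measured (L1, L2, ...)
        rdtsc
        shl     rdx, 32
        or      rax, rdx
        mov     r8, rax                 ; start timestamp
        mov     rsi, chain
        mov     rcx, 1000000
chase:
        mov     rsi, [rsi]              ; serialised load: latency-bound
        sub     rcx, 1
        jnz     chase
        rdtsc
        shl     rdx, 32
        or      rax, rdx
        sub     rax, r8                 ; total ticks; divide by the iteration count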
hopcode 05 Jun 2013, 11:27
Quote: Haswell-optimized code perform poorly on previous models.
For the rest, since the bandwidth is doubled at the same minimum latency, I think older code performs better even when it is not massively parallel. Any docs & general benchmarks to share?
_________________ ⠓⠕⠏⠉⠕⠙⠑
Feryno 05 Jun 2013, 11:57
Cool, virtualization is improved as well, e.g. VMCS shadowing (which makes it easier to implement nested hypervisors).
There is also the Xeon E3-1200 V3 family (e.g. the E3-1230V3 for $250), so it will be possible to buy the CPU without a GPU, but you have to find a compatible motherboard (perhaps not available yet, maybe an ASUS board soon?).
randall 05 Jun 2013, 14:31
hopcode wrote: Any docs & general benchmarks to share?
http://www.anandtech.com/show/7003/the-haswell-review-intel-core-i74770k-i54560k-tested/6
http://www.phoronix.com/scan.php?page=article&item=intel_4770k_linux&num=1
hopcode 05 Jun 2013, 16:21
Thank you randall. I had imagined something more and, alas, it is practically disappointing overall after considering the first docs:
http://www.anandtech.com/show/6355/intels-haswell-architecture/9
I know only that polynomial evaluation matters, even when bignum is not involved (i.e. a normal application using NO-CACHE and the precision of the available vector size, 256 bit). Those benchmarkers might not be able to use all of its power in normal cases, because they may "suffer" the pain inherited from the same Sandy Bridge frameworks. Or I may be wrong. But the 7-zip results are really frustrating.
My conclusion, after some study of tthsqe's code + maths: an assembly Mandelbrot is required, where I would anyway
- avoid the new TSX instructions
- use only bignum for 10^-1006 precision and take advantage of the improved L2 cache.
This may show spectacular results. What I like is the new exec engine; it has something very inspiring.
Cheers,
_________________ ⠓⠕⠏⠉⠕⠙⠑
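On the polynomial evaluation point: Horner's scheme maps directly onto FMA, one vfmadd per coefficient. A minimal sketch in fasm syntax for a degree-3 polynomial in single precision (the coefficient table 'coef', its layout as 8 packed floats per coefficient, and the degree are illustrative assumptions):
Code:
        ; ymm0 = x (8 packed floats), result in ymm1
        ; p(x) = ((c3*x + c2)*x + c1)*x + c0
        vmovups     ymm1, yword [coef+3*32]       ; start with c3
        vfmadd213ps ymm1, ymm0, yword [coef+2*32] ; ymm1 = ymm1*x + c2
        vfmadd213ps ymm1, ymm0, yword [coef+1*32] ; ymm1 = ymm1*x + c1
        vfmadd213ps ymm1, ymm0, yword [coef+0*32] ; ymm1 = ymm1*x + c0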
randall 05 Jun 2013, 17:02
I think that these results are as expected (I don't count regressions, which may be caused by a not yet fully prepared Linux kernel). The speed improvement in current programs (not compiled for the new architecture) is from 5% to 20%.
I think that the biggest problem with current floating-point workloads (programs) is that they do not use the FMA extension. Using FMA makes a program shorter and exploits Haswell's biggest advantage: two FMAs per clock. In carefully optimized programs using AVX2 and FMA we can expect much higher speed improvements.
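To make the "shorter" point concrete, here is the same linear step written the legacy way and as one fused instruction (a minimal sketch; registers are illustrative):
Code:
        ; legacy AVX: y = y + a*x, two dependent instructions, two roundings
        vmulps      ymm2, ymm0, ymm1        ; t = a * x
        vaddps      ymm3, ymm3, ymm2        ; y = y + t

        ; Haswell FMA3: one instruction, single rounding
        vfmadd231ps ymm3, ymm0, ymm1        ; y = y + a * x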
Melissa 05 Jun 2013, 21:01
The transition to Penryn was a great improvement in SSE performance over previous generations (more than double, according to some benchmarks I have tried on a Q6600 and an E8400). AMD fares pretty well on Linux with gcc. All in all, they didn't try the -mfma compiler option to test that. On older code Haswell is not that impressive. We need pure SSE benchmarks in order to compare performance with previous generations. The n-body benchmark I have posted in the Linux section is good at differentiating between generations in SSE performance, e.g. Q6600/E8400. I expect it to be much faster on Haswell than e.g. Ivy Bridge. Intel says vector performance should be *double*.
tthsqe 06 Jun 2013, 19:17
Does anybody know if vdivpd/vsqrtpd is still split internally (thus doubling the latency of divpd/sqrtpd) on Haswell? I noticed this when there was not a 2:1 performance ratio of AVX to SSE code for the mandelbox on Sandy/Ivy Bridge. Unfortunately, the mandelbox uses a divide, and this ratio is only 1.3:1.
randall 07 Jun 2013, 08:15
tthsqe wrote: Does anybody know if vdivpd/vsqrtpd is still split internally (thus doubling the latency of divpd/sqrtpd) on Haswell? I noticed this when there was not a 2:1 performance ratio of AVX to SSE code for the mandelbox on Sandy/Ivy Bridge. Unfortunately, the mandelbox uses a divide, and this ratio is only 1.3:1.
Maybe rcp + mul instead of div would help?
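rcp only exists for packed single precision (vrcpps; there is no packed double-precision reciprocal approximation), and its result is only about 12 bits accurate, so it usually needs one Newton-Raphson refinement step. A minimal sketch in fasm syntax, assuming a constant 'two' holding eight packed 2.0 values:
Code:
        ; ymm0 = a (dividends), ymm1 = b (divisors); computes a/b approximately
        vrcpps       ymm2, ymm1              ; x0 ~ 1/b, ~12-bit accurate
        vmovups      ymm3, yword [two]       ; 8 x 2.0f
        vfnmadd231ps ymm3, ymm1, ymm2        ; t = 2.0 - b*x0
        vmulps       ymm2, ymm2, ymm3        ; x1 = x0*(2.0 - b*x0), ~23-bit
        vmulps       ymm0, ymm0, ymm2        ; a/b ~ a * x1
Whether this beats vdivpd for the mandelbox depends on whether single precision after one refinement is precise enough; for full double precision the real divide may still be necessary.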
Melissa 07 Jun 2013, 23:00
Looking at the Himeno benchmark and kernel compilations, it seems that Linux has problems with frequency scaling on Haswell. I cannot explain it any other way. For example, with an i5 3570K, if I use offset voltage, compiling gcc/clang/the kernel is slower than with a fixed voltage. This seems to be a common Linux problem, and for benchmarks it is better to use the performance governor, but they (Phoronix) didn't state which scaling governor was used. All in all, the 3770K does not seem to have the problem the 4770K has.