flat assembler
Message board for the users of flat assembler.
Great read about Haswell CPU Microarchitecture
randall 04 Jun 2013, 10:04
Peak single-precision floating-point throughput (in Turbo Mode, 3.9 GHz):
32 FLOPs/clock * 4 cores * 3.9 GHz = 499.2 GFLOPS

New AVX2 and FMA extensions. Really good CPU for graphics (fractal) programming.

"The mighty AVX2 command for multiply-add (FMA) can be performed in parallel in two units via Port 0 and 1 with a latency of 5 cycles. In throughput Haswell comes with FMA at 32 flops per clock and core in single-precision and 16 flops / clock / core in the double precision."

http://www.realworldtech.com/haswell-cpu/
http://www.anandtech.com/show/6355/intels-haswell-architecture
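For reference, the per-core figure comes from the two FMA units: 2 FMA units * 8 single-precision lanes per YMM register * 2 FLOPs per fused multiply-add = 32 FLOPs per clock per core.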
randall 04 Jun 2013, 14:14
FMA is very popular in graphics programming. Almost all linear algebra computations can benefit from it (matrix multiplication, dot product, distance estimation, etc.).
Shading equations can be expressed using FMAs (it's just linear combinations + maybe pow() for the specular term). Most graphics workloads are just sequences of mul and add operations. By merging them into sequences of FMAs we can save a lot of time. FMA is one of the reasons why GPUs are so fast at graphics. They have had FMA (MAD) from the beginning.
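As a rough illustration (a sketch, not from any particular program; FASM-style syntax, with rsi/rdi as the array pointers and rdx as the element count), a dot-product inner loop collapses into one FMA per eight floats:

Code:
        ; dot product of two float arrays, 8 elements per iteration
        ; rsi = a, rdi = b, rdx = count (assumed a multiple of 8)
        vxorps  ymm0, ymm0, ymm0              ; running sums
        xor     rcx, rcx
dot_loop:
        vmovups ymm1, [rsi+rcx*4]             ; a[i..i+7]
        vfmadd231ps ymm0, ymm1, [rdi+rcx*4]   ; ymm0 += a[i..i+7] * b[i..i+7]
        add     rcx, 8
        cmp     rcx, rdx
        jb      dot_loop
        ; a horizontal add of ymm0 then gives the scalar dot product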
tthsqe 04 Jun 2013, 16:47
Ok, so what is the bottom line for add, mul, and fmadd?
Each cycle, the processor can dispatch one of the following:
1) an add and a mul
2) an add and a fmadd
3) a mul and a fmadd
4) two fmadd's
I recall that Sandy Bridge could dispatch at most one add and one multiply per clock. It seems weird if Haswell can dispatch two fmadd's per clock but not two mul's or two add's per clock.
randall 04 Jun 2013, 17:29
5) Two FP MUL (but not 2 FP ADD)

"A side effect of the FMA units is that you now get two ports worth of FP multiply units, which can be a big boon to legacy FP code."

Here are some nice pictures for Haswell and Sandy Bridge:
http://www.anandtech.com/show/6355/intels-haswell-architecture/8
tthsqe 04 Jun 2013, 18:12
Ok, so we get every combo except 2x add. That should be a monster compared to my 2600K.
randall 04 Jun 2013, 20:09
Yes, this is quite an improvement from Intel. I will test it next week. Currently, I have an old Core2 Duo 1.86 GHz so there will be a difference.
bitRAKE 05 Jun 2013, 03:36
Wonder if they've actually covered the increased latency with the increased cache bandwidth? This always has the effect of increasing the optimal code granularity.
randall 05 Jun 2013, 07:54
bitRAKE wrote: Wonder if they've actually covered the increased latency with the increased cache bandwidth? This always has the effect of increasing the optimal code granularity.

"Feeding the Beast: 2x Cache Bandwidth in Haswell"
"With an outright doubling of peak FP throughput in Haswell, Intel had to ensure that the execution units had ample bandwidth to the caches to sustain performance. As a result L1 bandwidth is doubled, as is the interface between the L1 and L2 caches."
http://www.anandtech.com/show/6355/intels-haswell-architecture/9

But what do you mean by "increased latency"? Which latency?
hopcode 05 Jun 2013, 10:21
Hello people, thanks for sharing the docs. This Haswell is awesome.

Quote: what do you mean by "increased latency"? Which latency?

The 8 physical ports in the exec engine are 2 more than the old 6. Anyway, instruction fetching from the cache still happens at 16B, like on the old ones. The difference is in the 56-entry fused-uop buffer and in the 4-uops-on-hit (corresponding to a full 32B instruction); read about it at http://www.realworldtech.com/haswell-cpu/2/ past the picture.

Also, I think the answer to bitRAKE should be no, not directly. But the advantage of a single 56-entry uop buffer is huge, especially for normal and mixed (FP/integer) applications on a single thread.
bitRAKE 05 Jun 2013, 10:44
I have no doubt there is greater throughput, but how many resources need to be in flight before the break-even point? This is the code granularity. It will make previously optimized code perform poorly, or vice versa - Haswell-optimized code will perform poorly on previous models.

There is greater decode latency and cache latency. Bandwidth helps throughput, not latency. There will always be operations which do not happen massively in parallel. What is the increase in cost for these types of operations? To some extent it is always a trade-off between the two aspects.

The latter part of http://www.realworldtech.com/haswell-cpu/5/ suggests that there are no increases in latency. But my L1 cache latency is only 3 cycles, not 4, on a processor a few generations old. My L2 cache latency is 15 cycles, though. Trade-offs to increase throughput.
hopcode 05 Jun 2013, 11:27
Quote: Haswell-optimized code will perform poorly on previous models.

For the rest, with the bandwidth doubled at the same minimum latency, I think older code performs better even when not massively parallel. Any docs & general benchmarks to share?
Feryno 05 Jun 2013, 11:57
Cool, virtualization is also improved, e.g. VMCS shadowing (easier to implement nested hypervisors then).
There is also the Xeon E3-1200 V3 family (e.g. E3-1230V3 for $250), so it will be possible to buy the CPU without a GPU, but that requires finding a compatible MB (perhaps not yet available, maybe an ASUS MB soon?).
randall 05 Jun 2013, 14:31
hopcode wrote: docs & general benchmarks to share?

http://www.anandtech.com/show/7003/the-haswell-review-intel-core-i74770k-i54560k-tested/6
http://www.phoronix.com/scan.php?page=article&item=intel_4770k_linux&num=1
hopcode 05 Jun 2013, 16:21
Thank you randall. I imagined something more and, alas, it is practically disappointing overall after considering the first docs: http://www.anandtech.com/show/6355/intels-haswell-architecture/9

I know only that polynomial evaluation matters, even when bignum is not in hand (i.e. a normal application using NO-CACHE and the precision of the available vector size, 256 bit). Those benchmarkers might not be able to use all the power of it in normal cases, because they may "suffer" the pain inherited from the same Sandy Bridge frameworks. Or I may be wrong. But the 7-zip results are really frustrating.

My conclusion, after some learning from tthsqe's code + maths: an assembly Mandelbrot is required, where I would anyway
- avoid the new TSX instructions
- use only bignum for 10^-1006 precision and take advantage of the improved L2 cache.
This may show spectacular results. What I like is the new exec engine; it has something very inspiring.

Cheers,
randall 05 Jun 2013, 17:02
I think that these results are as expected (I don't count regressions, which can be caused by a not fully prepared Linux kernel). The speed improvement in current programs (not compiled for the new architecture) is from 5% to 20%.

I think that the biggest problem with current floating-point workloads is that they do not use the FMA extension. Using FMA makes a program shorter and uses the biggest Haswell advantage - two FMAs per clock. I think that in carefully optimized programs using AVX2 and FMA we can expect much higher speed improvements.
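For example (a sketch under assumed register roles, not taken from any real program), a separate multiply and add folds into a single FMA instruction - with the caveat that the fused form rounds only once, so the result can differ in the last bit:

Code:
        ; before: two instructions, two rounding steps
        vmulps  ymm0, ymm1, ymm2        ; t = a*b
        vaddps  ymm0, ymm0, ymm3        ; r = t + c
        ; after: one fused multiply-add (clobbers ymm1, single rounding)
        vfmadd213ps ymm1, ymm2, ymm3    ; ymm1 = ymm1*ymm2 + ymm3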
Melissa 05 Jun 2013, 21:01
The transition to Penryn was a great improvement in SSE performance over previous generations (more than double according to some benchmarks I have tried on a Q6600 and an E8400). AMD fares pretty well on Linux with gcc. All in all, they didn't try the -mfma compiler option to test that. On older code Haswell is not that impressive.

We need pure SSE benchmarks in order to compare performance with previous generations. The n-body benchmark I have posted in the Linux section is good at differentiating between generations in SSE performance, e.g. Q6600/E8400. I expect it to be much faster on Haswell than, e.g., Ivy Bridge. Intel says vector performance should be *double*.
tthsqe 06 Jun 2013, 19:17
Does anybody know if vdivpd/vsqrtpd is still split internally (thus doubling the latency of divpd/sqrtpd) on Haswell? I noticed this when there was not a 2:1 performance ratio of AVX to SSE code for the mandelbox on Sandy/Ivy Bridge. Unfortunately, the mandelbox uses a divide, and this ratio is only 1.3:1.
randall 07 Jun 2013, 08:15
tthsqe wrote: Does anybody know if vdivpd/vsqrtpd is still split internally (thus doubling the latency of divpd/sqrtpd) on Haswell? I noticed this when there was not a 2:1 performance ratio of AVX to SSE code for the mandelbox on Sandy/Ivy Bridge. Unfortunately, the mandelbox uses a divide, and this ratio is only 1.3:1.

Maybe rcp + mul instead of div would help?
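For single precision that can work: vrcpps gives a ~12-bit reciprocal estimate and one Newton-Raphson step refines it to nearly full float precision (a sketch; "ones" is a hypothetical 32-byte constant holding eight 1.0 values). For packed doubles, though, Haswell has no vrcppd, so vdivpd itself cannot be replaced this way:

Code:
        ; approximate r = 1/a (a in ymm0), then refine with one N-R step
        vrcpps       ymm1, ymm0           ; x0 ~= 1/a, about 12 bits correct
        vmovaps      ymm2, [ones]         ; eight packed 1.0 constants
        vfnmadd231ps ymm2, ymm0, ymm1     ; e = 1.0 - a*x0
        vfmadd231ps  ymm1, ymm1, ymm2     ; x1 = x0 + x0*e ~= 1/a
        ; b/a then becomes a vmulps by x1 instead of a vdivps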
Melissa 07 Jun 2013, 23:00
Looking at the Himeno bench and kernel compilations, it seems that Linux has problems with frequency scaling on Haswell. I cannot explain it any other way. For example, with an i5 3570K, if I use offset voltage, compilation of gcc/clang/the kernel is slower than with a fixed voltage. It seems this is a common Linux problem and for benchmarks it is better to use the performance governor, but they (Phoronix) didn't state what scaling governor was used. All in all, the 3770K doesn't seem to have the problem the 4770K has.