flat assembler
Message board for the users of flat assembler.

Great read about Haswell CPU Microarchitecture

randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall 04 Jun 2013, 10:04
Peak single-precision floating-point throughput (in Turbo mode, 3.9 GHz):
32 FLOPs/clock/core * 4 cores * 3.9 GHz = 499.2 GFLOPS

New AVX2 and FMA extensions. Really good CPU for graphics (fractal) programming.

"The mighty AVX2 command for multiply-add (FMA) can be performed in parallel in two units via Port 0 and 1 with a latency of 5 cycles. In throughput Haswell comes with FMA at 32 flops per clock and core in single-precision and 16 flops / clock / core in the double precision."

http://www.realworldtech.com/haswell-cpu/
http://www.anandtech.com/show/6355/intels-haswell-architecture


tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 04 Jun 2013, 13:48
The only thing that I know of that uses pure FMA all the way through is matrix multiplication. Maybe we'll have to run some tests comparing sandybridge/ivybridge/haswell on our fractal drawing programs.
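For reference, the inner step of such a kernel really is FMA all the way down. A minimal fasm-syntax sketch of one k-step of a C += A*B micro-kernel, assuming single precision, a packed panel of A at rsi, and a row of B already loaded into ymm3 (registers and layout are illustrative, not tuned code):

Code:
vbroadcastss ymm4, dword [rsi]   ; splat a(0,k) across 8 lanes
vfmadd231ps  ymm0, ymm4, ymm3    ; C row 0 accumulator += a(0,k)*B
vbroadcastss ymm5, dword [rsi+4] ; splat a(1,k)
vfmadd231ps  ymm1, ymm5, ymm3    ; C row 1 accumulator += a(1,k)*B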
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall 04 Jun 2013, 14:14
FMA is very popular in graphics programming. Almost all linear algebra computations can benefit from it (matrix multiplication, dot product, distance estimation, etc.).

Shading equations can also be expressed with FMAs (they are just linear combinations, plus maybe pow() for the specular term).

Most graphics workloads are just sequences of mul and add operations. By merging them into sequences of FMAs we can save a lot of time.
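The merge is literally one instruction replacing two; a minimal illustration (register names are arbitrary):

Code:
; separate multiply and add: two instructions, two roundings
vmulps      ymm2, ymm0, ymm1 ; t = x*y
vaddps      ymm3, ymm3, ymm2 ; acc = acc + t
; fused multiply-add: one instruction, one rounding
vfmadd231ps ymm3, ymm0, ymm1 ; acc = acc + x*y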

FMA is one of the reasons why GPUs are so fast at graphics. They have had FMA (MAD) from the beginning.
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 04 Jun 2013, 16:47
Ok, so what is the bottom line for add, mul, and fmadd?

Each cycle, the processor can dispatch one of the following:
1) an add and a mul
2) an add and a fmadd
3) a mul and a fmadd
4) two fmadd's
????????

I recall that Sandy Bridge could dispatch at most one add and one multiply per clock. It seems weird if Haswell can dispatch two fmadd's per clock but not two mul's or two add's per clock.
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall 04 Jun 2013, 17:29
5) Two FP MUL (but not 2 FP ADD)

"A side effect of the FMA units is that you now get two ports worth of FP multiply units, which can be a big boon to legacy FP code."

Here are some nice pictures for Haswell and Sandy Bridge:
http://www.anandtech.com/show/6355/intels-haswell-architecture/8
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 04 Jun 2013, 18:12
Ok, so we get every combo except 2x add. That should be a monster compared to my 2600K. Smile
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall 04 Jun 2013, 20:09
Yes, this is quite an improvement from Intel. I will test it next week. Currently I have an old Core 2 Duo 1.86 GHz, so there will be a difference. Smile
bitRAKE



Joined: 21 Jul 2003
Posts: 4050
Location: vpcmpistri
bitRAKE 05 Jun 2013, 03:36
Wonder if they've actually covered the increased latency with the increased cache bandwidth? This always has the effect of increasing the optimal code granularity.

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall 05 Jun 2013, 07:54
bitRAKE wrote:
Wonder if they've actually covered the increased latency with the increased cache bandwidth? This always has the effect of increasing the optimal code granularity.

"Feeding the Beast: 2x Cache Bandwidth in Haswell"

"With an outright doubling of peak FP throughput in Haswell, Intel had to ensure that the execution units had ample bandwidth to the caches to sustain performance. As a result L1 bandwidth is doubled, as is the interface between the L1 and L2 caches."

http://www.anandtech.com/show/6355/intels-haswell-architecture/9

But what do you mean by "increased latency"? Which latency?
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode 05 Jun 2013, 10:21
Hello people, thanks for sharing the docs. This Haswell is awesome.
Quote:
what do you mean by "increased latency"? Which latency?

The 8 physical ports in the execution engine are 2 more than the old 6. Still, instruction fetching from the cache happens at 16 bytes per cycle, like on the old cores.
The difference is in the 56-entry fused-uop buffer, and in the 4 uops delivered on a uop-cache hit (corresponding to a full 32B window).
Read it at http://www.realworldtech.com/haswell-cpu/2/, past the picture. Also, I think the answer to bitRAKE should be no, not directly.
But the advantage of a single 56-entry uop buffer is huge, especially for normal and mixed (FP/integer) applications
on a single thread.

_________________
⠓⠕⠏⠉⠕⠙⠑
bitRAKE



Joined: 21 Jul 2003
Posts: 4050
Location: vpcmpistri
bitRAKE 05 Jun 2013, 10:44
I have no doubt there is greater throughput, but how many resources need to be in flight before the break-even point? This is the code granularity. It will make previously optimized code perform poorly, or vice versa: Haswell-optimized code will perform poorly on previous models.

There is greater decode latency, and cache latency. Bandwidth helps throughput, not latency. There will always be operations which do not happen massively parallel. What is the increase in cost for these types of operations? To some extent it is always a trade-off between the two aspects.

The latter part of...
http://www.realworldtech.com/haswell-cpu/5/
...suggests that there are no increases in latency.

But my L1 cache latency is only 3 cycles, not 4, on a processor a few generations old. My L2 cache latency is 15 cycles, though. Trade-offs to increase throughput.

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode 05 Jun 2013, 11:27
Quote:
Haswell-optimized code will perform poorly on previous models.
I agree 100% on that, considering the new execution engine (I like it). But we now know, in a predictable way, how much slower it performs on older processors like Sandy Bridge and Nehalem.
For the rest, with doubled bandwidth at the same minimum latency, I think older code performs better even when it is not massively parallel.
Any docs & general benchmarks to share?

_________________
⠓⠕⠏⠉⠕⠙⠑
Feryno



Joined: 23 Mar 2005
Posts: 509
Location: Czech republic, Slovak republic
Feryno 05 Jun 2013, 11:57
Cool, virtualization is improved as well, e.g. VMCS shadowing (easier to implement nested hypervisors then).
Also the Xeon E3-1200 V3 family (e.g. the E3-1230V3 for $250), so it will be possible to buy the CPU without a GPU, but it requires finding a compatible motherboard (perhaps not yet available, maybe an ASUS board soon?).
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall 05 Jun 2013, 14:31
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode 05 Jun 2013, 16:21
Thank you randall. I imagined something more and, alas, it is practically disappointing overall, after considering the first docs: http://www.anandtech.com/show/6355/intels-haswell-architecture/9

I only know that polynomial evaluation matters, even when bignum is not involved (i.e. a normal application using no cache and the precision of the available vector size, 256 bits).

Those benchmarkers might not be able to use all of its power in normal cases, because they may "suffer" the pain inherited from the same Sandy Bridge frameworks. Or I may be wrong. But the 7-Zip results are really frustrating Smile

My conclusion, after some study of tthsqe's code + math: an assembly Mandelbrot is required, where I would anyway
- avoid the new TSX instructions
- use only bignum for 10^-1006 precision

and take advantage of the improved L2 cache. This may show spectacular results.
What I like is the new execution engine; it has something very inspiring.

Cheers,

_________________
⠓⠕⠏⠉⠕⠙⠑
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall 05 Jun 2013, 17:02
I think these results are as expected (not counting regressions, which can be caused by a Linux kernel not yet fully prepared for the new architecture). The speed improvement in current programs (not compiled for the new architecture) is from 5% to 20%.

I think the biggest problem with current floating-point workloads (programs) is that they do not use the FMA extension. Using FMA makes a program shorter and exploits Haswell's biggest advantage: two FMAs per clock.

In carefully optimized programs using AVX2 and FMA, I think we can expect much higher speed improvements.
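As an example of how FMA shortens code: a Horner evaluation of the cubic p(x) = c3*x^3 + c2*x^2 + c1*x + c0 needs exactly one FMA per coefficient, where without FMA each step would be a vmulps plus a vaddps. A sketch assuming ymm0 holds x (8 packed singles) and ymm2..ymm5 are preloaded with c3..c0:

Code:
vmovaps     ymm1, ymm2       ; acc = c3
vfmadd213ps ymm1, ymm0, ymm3 ; acc = acc*x + c2
vfmadd213ps ymm1, ymm0, ymm4 ; acc = acc*x + c1
vfmadd213ps ymm1, ymm0, ymm5 ; acc = acc*x + c0 = p(x)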
Melissa



Joined: 12 Apr 2012
Posts: 125
Melissa 05 Jun 2013, 21:01
The transition to Penryn was a great improvement in SSE performance over previous generations (more than double, according to some benchmarks I tried on a Q6600 and an E8400). AMD fares pretty well on Linux with gcc. All in all, they didn't try the -mfma compiler option to test that. On older code, Haswell is not that impressive. We need pure SSE benchmarks in order to compare performance with previous generations.

The n-body benchmark I posted in the Linux section is good at differentiating SSE performance between generations, e.g. Q6600/E8400. I expect it to be much faster on Haswell than e.g. Ivy Bridge. Intel says vector performance should be *double*.
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 06 Jun 2013, 19:17
Does anybody know if vdivpd/vsqrtpd is still split internally (thus doubling the latency of divpd/sqrtpd) on Haswell? I noticed this when there was not a 2:1 performance ratio of AVX to SSE code for the mandelbox on Sandy/Ivy Bridge. Unfortunately, the mandelbox uses a divide, and this ratio is only 1.3:1. Sad
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall 07 Jun 2013, 08:15
tthsqe wrote:
Does anybody know if vdivpd/vsqrtpd is still split internally (thus doubling the latency of divpd/sqrtpd) on Haswell? I noticed this when there was not a 2:1 performance ratio of AVX to SSE code for the mandelbox on Sandy/Ivy Bridge. Unfortunately, the mandelbox uses a divide, and this ratio is only 1.3:1. Sad


Maybe rcp, mul instead of div would help?
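Worth noting: Haswell has no packed reciprocal estimate for doubles (vrcpps is single-precision only), and the raw estimate is only good to about 12 bits, so for the single-precision path it needs one Newton-Raphson step, x1 = x0*(2 - a*x0), to get near full precision. A sketch assuming ymm1 holds the denominators a and ymm7 is preloaded with 2.0 in every lane:

Code:
vrcpps       ymm0, ymm1       ; x0 ~ 1/a, ~12-bit estimate
vmovaps      ymm2, ymm0       ; copy x0
vfnmadd213ps ymm2, ymm1, ymm7 ; ymm2 = 2.0 - a*x0
vmulps       ymm0, ymm0, ymm2 ; x1 = x0*(2 - a*x0)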
Melissa



Joined: 12 Apr 2012
Posts: 125
Melissa 07 Jun 2013, 23:00
Looking at the Himeno benchmark and kernel compilations, it seems that Linux has problems with frequency scaling on Haswell. I cannot explain it any other way. For example, with an i5-3570K, if I use offset voltage, compilation of gcc/clang/the kernel is slower than with fixed voltage. This seems to be a common Linux problem, and for benchmarks it is better to use the performance governor, but they (Phoronix) didn't state which scaling governor was used. All in all, the 3770K doesn't seem to have the problem the 4770K has.