Message board for the users of flat assembler.
> Main > Great read about Haswell CPU Microarchitecture
..as expected...Speed improvement in current (not compiled for a new architecture) programs is from 5% to 20%.
OK, "as expected", you say, for a normal app. I read the docs again, and the only parameter tied to this expectation seems to be
the ROB expansion to 192 entries, +15% over Sandy Bridge.
Supposing the tests there are correct, we have to admit that something is wrong with the uop cache, i.e. in its design or
in how that test software uses it.
It should be 2 loads + 1 store and 5 instructions per cycle!
Otherwise the 2 extra ports end up practically unused.
De facto, when using vectors and FMA (where there are practically no i-cache misses), everything seems to run fine, as announced by Intel.
09 Jun 2013, 12:04
Does anybody know if vdivpd/vsqrtpd is still split internally (thus doubling the latency of divpd/sqrtpd) on Haswell? I noticed this when I did not get a 2:1 performance ratio for AVX:SSE code for the mandelbox on Sandy/Ivy Bridge. Unfortunately, the mandelbox uses a divide, and the ratio is only 1.3:1.
I asked in the clax newsgroup and got this list:
It seems the situation is the same as for previous generations.
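For a rough feel of why a split 256-bit divide caps the AVX:SSE ratio, here is a minimal Amdahl-style sketch in C. It assumes (my assumption, not a measurement from this thread) that the divide part gets no faster at all with AVX while everything else speeds up by a fixed factor. The function names `avx_speedup` and `div_fraction_for` are made up for illustration.

```c
#include <assert.h>

/* Amdahl-style estimate (assumption: divides do not speed up at all).
 * div_fraction: share of the SSE run time spent in divides, 0..1.
 * gain:         speedup of the remaining work (2.0 for 128 -> 256 bit). */
static double avx_speedup(double div_fraction, double gain) {
    assert(div_fraction >= 0.0 && div_fraction <= 1.0 && gain > 0.0);
    return 1.0 / (div_fraction + (1.0 - div_fraction) / gain);
}

/* Inverse of the above: which divide share would explain an
 * observed overall speedup, under the same assumption. */
static double div_fraction_for(double speedup, double gain) {
    assert(speedup > 0.0 && gain > 1.0);
    return (1.0 / speedup - 1.0 / gain) / (1.0 - 1.0 / gain);
}
```

Under this model, `div_fraction_for(1.3, 2.0)` comes out at about 0.54, i.e. a 1.3:1 ratio would be consistent with roughly half of the SSE run time being bound by the divide. It is only a model; real code also overlaps the divide with other work.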
11 Jun 2013, 20:23
Thanks for the info. I was previously doing the divide BEFORE clamping to [0.25,1]. I saw a decent improvement by dividing AFTER clamping to [0.25,1], as the divpd instruction is faster on powers of 2. The improvement is good enough that the rcps+Newton solution is no better.
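For anyone who has not seen the reciprocal + Newton alternative mentioned above, here is a minimal scalar sketch in portable C. It does not use the real instruction: `approx_recip` is a hypothetical stand-in that keeps about 12 bits of the result, which is my assumption for what a hardware reciprocal estimate delivers. The refinement step itself is the standard Newton-Raphson iteration x' = x * (2 - d * x).

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a hardware reciprocal estimate:
 * computes 1/d in float and zeroes the low 12 mantissa bits,
 * leaving roughly 12 bits of precision. */
static double approx_recip(double d) {
    float r = 1.0f / (float)d;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFFF000u;          /* keep sign, exponent, top 11 mantissa bits */
    memcpy(&r, &bits, sizeof r);
    return (double)r;
}

/* Refine the estimate with Newton-Raphson steps x = x * (2 - d * x).
 * Each step roughly doubles the number of correct bits. */
static double newton_recip(double d, int steps) {
    assert(steps >= 0);
    double x = approx_recip(d);
    for (int i = 0; i < steps; i++)
        x = x * (2.0 - d * x);
    return x;
}

/* Error of the result, measured as |d * (1/d) - 1|. */
static double recip_rel_error(double d, int steps) {
    return fabs(d * newton_recip(d, steps) - 1.0);
}
```

Starting from about 12 bits, three steps are enough to get close to full double precision in this sketch. That is three dependent multiply/subtract rounds per reciprocal, which is why a plain divide can still come out ahead when the divide itself is cheap.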
11 Jun 2013, 20:32
Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.