flat assembler
Message board for the users of flat assembler.
curious FLOPS issue
revolution 18 Mar 2010, 06:27
Which CPU? Which OS?
baldr 18 Mar 2010, 13:57
tthsqe,
I think performance tests should be done in an OS-independent way, for example from a special boot sector. The TSC is OK, but performance-monitoring counters (PMCs) can give some insight into code execution. The only question is: who will bother to write such a framework?
Isn't rept 15-2+1 i:2 { movaps xmm#i, dqword[r] } more readable than irps i, 2 3 4 5 6 7 8 9 10 11 12 13 14 15 { movaps xmm#i,dqword[r] }?
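For comparison, here is a minimal sketch of the two forms side by side. It assumes a fasm 1 version whose preprocessor evaluates numeric expressions in the rept count, and it uses a hypothetical 16-byte operand `r` that is defined here only so the fragment assembles on its own. Both directives should expand to the same 14 loads into xmm2..xmm15.

```fasm
use64                           ; xmm8..xmm15 are only available in 64-bit mode

; Form 1: rept repeats 15-2+1 = 14 times, with counter i starting at 2
rept 15-2+1 i:2 { movaps xmm#i, dqword [r] }

; Form 2: irps iterates over an explicit list of symbols
irps i, 2 3 4 5 6 7 8 9 10 11 12 13 14 15 { movaps xmm#i, dqword [r] }

align 16                        ; movaps needs a 16-byte aligned operand
r dq 1.0, 1.0                   ; hypothetical data, not from the original code
```

The rept form keeps the register range in one place (first register and count), while the irps form has to be edited item by item if the range changes.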
tthsqe 19 Mar 2010, 05:51
Core 2 Quad Q8200 2.3 GHz, 64bit Vista.
The CPU seems to be choking. Maybe there is such a thing as too much ILP? It would be interesting to see if the i7 has the same issue ...
revolution 19 Mar 2010, 06:08
tthsqe: Speed optimisation is really hard. Seriously, it is no simple matter. The internals of the CPUs are very complex. It is unlikely that the effect you see is simply a matter of "too much ILP". There are many mechanisms inside the CPU that can do things with small loops to make them use less bandwidth in the decoder and the cache, thus allowing more bandwidth for other things. Once your loop becomes larger than a certain size, those mechanisms no longer work, and other factors like cache efficiency and decoder speed become more prominent.
This is why, when optimising any code, you have to say for which CPU, which OS, what data ranges and so on. But other things also matter, like SDRAM timings and number of channels, clock multipliers, heatsink efficiency, other processes/tasks running, etc. My guess would be that you have hit the decoder limit for SSE instructions, probably close to one instruction per cycle for code that is in L1 cache and is larger than the decoder prefetch queue.
tthsqe 19 Mar 2010, 06:37
I don't see how a decoder limit would cause the steady increase until 5 and then the sudden drop-off.
The code is virtually identical. The only thing varying is the number of independent tasks per block.
revolution 19 Mar 2010, 06:40
Well, the decoder is not as simple as it might first appear. It has a lot of internal optimisations that can work with small loops efficiently. Larger loops drop back to the default throughput.
hopcode 19 Mar 2010, 10:56
tthsqe wrote:
Core 2 Quad Q8200 2.3 GHz, 64bit Vista.

Similar to mine:
Code:
Number of cores          4 (max 4)
Number of threads        4 (max 4)
Name                     Intel Core 2 Quad Q8300
Codename                 Yorkfield
Specification            Intel(R) Core(TM)2 Quad CPU Q8300 @ 2.50GHz
Package (platform ID)    Socket 775 LGA (0x4)
CPUID                    6.7.A
Extended CPUID           6.17
Core Stepping            R0
Technology               45 nm
Core Speed               2000.0 MHz
Multiplier x FSB         6.0 x 333.3 MHz
Rated Bus speed          1333.4 MHz
Stock frequency          2500 MHz
Instructions sets        MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, EM64T
L1 Data cache            4 x 32 KBytes, 8-way set associative, 64-byte line size
L1 Instruction cache     4 x 32 KBytes, 8-way set associative, 64-byte line size
L2 cache                 2 x 2048 KBytes, 8-way set associative, 64-byte line size
FID/VID Control          yes
FID range                6.0x - 7.5x
Max VID                  1.250 V

It is important not to have a shared cache. My results are similar to yours, but the rdtsc usage in your code is very messy. For example, to get a delta variance of only ~100% of the value, as in the following results
Code:
3 ---- 0,76
4 ---- 1,03
5 ---- 1,29
6 ---- 1,52
7 ---- 1,78

put this instruction here
Code:
@@:
rept 4 {
emms ;<--------- this instruction
addpd xmm2,xmm0
; .... rest

I have a recipe, but I will not publish it until I have found improvements on it.
Cheers,
hopcode
Madis731 19 Mar 2010, 14:38
What should the bootable framework look like?
For example, you could just run your code and print out the results in Real Mode, could you not? The advanced variant of this bootable framework is, of course, some shell where you can write fancy things like "run", "help" and "bench test_3a -threads=3".
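The simple variant could be sketched roughly as below: a boot sector that reads the TSC around a block of code in Real Mode and prints the low 32 bits of the difference in hex through the BIOS. This is only an illustration under assumptions, not a finished framework. The label names and the placeholder delay loop are hypothetical, interrupts stay enabled during the measurement, and real SSE test code would, as far as I know, first need CR4.OSFXSR set before it can run without faulting.

```fasm
org 0x7C00                      ; BIOS loads the boot sector at 0000:7C00
use16

start:
        cli
        xor     ax, ax
        mov     ds, ax
        mov     ss, ax
        mov     sp, 0x7C00      ; stack grows down from below the boot sector
        sti

        rdtsc                   ; edx:eax = time stamp counter before the test
        mov     esi, eax        ; keep only the low 32 bits

        mov     ecx, 1000000    ; placeholder for the code under test
@@:
        dec     ecx
        jnz     @b

        rdtsc
        sub     eax, esi        ; elapsed ticks (low 32 bits)

        mov     cx, 8           ; print 8 hex digits, highest nibble first
next_digit:
        rol     eax, 4          ; move the next nibble into the low 4 bits
        push    eax
        push    cx
        and     al, 0x0F
        add     al, '0'
        cmp     al, '9'
        jbe     emit
        add     al, 'A'-'9'-1   ; map 10..15 to 'A'..'F'
emit:
        mov     ah, 0x0E        ; BIOS teletype output
        mov     bx, 0x0007      ; page 0, light grey attribute
        int     0x10
        pop     cx
        pop     eax
        loop    next_digit

hang:
        hlt
        jmp     hang

        times 510-($-$$) db 0   ; pad to 510 bytes
        dw 0xAA55               ; boot signature
```

Assembled with fasm as a flat binary, the 512-byte output can be written to the first sector of a boot medium or given to an emulator, although an emulator's TSC is of little use for real measurements.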