flat assembler
Message board for the users of flat assembler.
Java faster than ASM!?
LocoDelAssembly 02 Feb 2008, 20:37
About registers, rugxulo: that excerpt doesn't really back you up; it says that excessive use of registers could lead to suboptimal results. I think that xorpd! made some changes to his Mandelbrot code to make better use of the execution units, and one of the changes was using some instructions with memory operands even though enough registers were available (his code is 64-bit).
[edit] http://board.flatassembler.net/topic.php?p=69250#69250 (Maybe I'm wrong about Mandelbrot)
bitRAKE 02 Feb 2008, 21:16
We need to look at the dependency chain first - if the processor is waiting on a result, it hardly matters where that result is going (because of the cache and delayed write to memory). See how many NOPs can be put in the loop without affecting the timing - the processor is just waiting most of the time.
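As a rough sketch of that kind of test (not bitRAKE's actual code - the loop body and count are assumed from itsnobody's benchmark), one could pad the countdown loop with NOPs and see whether GetTickCount reports any difference:
Code:
        mov     ecx, 1000000000
  place:
        nop                     ; add or remove filler NOPs here and
        nop                     ; compare the GetTickCount results
        dec     ecx
        jnz     place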
itsnobody 03 Feb 2008, 17:14
Hmm... this is interesting.
I added one divide instruction to the Java speed test and it jumped from 1000 milliseconds to over 25000 milliseconds... then I added a divide instruction to the ASM speed test and the same thing happened, around 24000-25000 milliseconds. I knew div was slow, but I didn't know it slows things down by 25 times.
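For reference, a divide dropped into the countdown loop might look like the sketch below (an assumption about what was tested, not the exact code from the post); div's long latency, plus its fixed EDX:EAX operands, easily dominates the simple dec/jnz:
Code:
        mov     ecx, 1000000000
  place:
        mov     eax, ecx        ; dividend low dword
        xor     edx, edx        ; clear the high dword so the quotient fits in EAX
        mov     esi, 7          ; arbitrary divisor (ESI chosen so EBX can keep the tick count)
        div     esi             ; the added divide - this instruction dwarfs the rest of the loop
        dec     ecx
        jnz     place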
sleepsleep 03 Feb 2008, 17:27
does this thread somehow conclude,
java is the next BIG thing to learn?
edfed 03 Feb 2008, 17:31
no
java suxxx. soon dead, cause asm will invade the world.
itsnobody 03 Feb 2008, 17:31
sleepsleep wrote: does this thread somehow conclude, java is the next BIG thing to learn?
I think it concludes that using good fast code is very important...
f0dder 03 Feb 2008, 23:46
edfed wrote: no
Please learn to differentiate between wishful thinking and The Real World(TM).
It surprises me (or perhaps not...) that the Java compiler doesn't optimize out the for loop - any half-assed C++ compiler would drop the for loop and replace it with a constant assignment instead, if that value is used at all.
Theoretically a JIT'ed language could provide much better performance than anything you hand-code, because profiling can be done on actual data loads and the generated code tailored to the needs (of course the profiling itself takes up some execution cycles as well...). In reality, the current JITers generate code that is "okay enough for what they're generally used for", but they aren't offering jaw-dropping performance.
Microsoft's JVM was a lot faster than SUN's, and I remember running filtered + color-lit texture-mapped 3D rendering in Java, loading quake2.bsp. I think it was on a pmmx-200, or perhaps it was an athlon700.
_________________
- carpe noctem
Yardman 04 Feb 2008, 00:50
[ Post removed by author. ]
Last edited by Yardman on 04 Apr 2012, 03:10; edited 2 times in total
Yardman 04 Feb 2008, 00:53
[ Post removed by author. ]
Last edited by Yardman on 04 Apr 2012, 03:11; edited 1 time in total
OzzY 04 Feb 2008, 22:02
The fastest I could get:
Code:
include 'win32ax.inc'

.data
  _output rb 20

.code
  start:
        invoke  MessageBox, 0, "Click Ok to Start", "Speed Test", 0
        invoke  GetTickCount
        xchg    eax, ebx
        mov     ecx, 1000000000
  place:
        dec     ecx
        jnz     place
        invoke  GetTickCount
        sub     eax, ebx
        invoke  wsprintf, _output, "%d milliseconds", eax
        invoke  MessageBox, 0, _output, "Speed Test Finished", 0
        invoke  ExitProcess, 0
.end start
rugxulo 04 Feb 2008, 22:03
Yardman, you could either change the first two printf()s to puts(), which is usually faster, or at least combine the last two. And you should say "int main" with a "return 0;" at least, just to be correct.
In fact, it might be more useful if you actually did something tangible inside the "while" loop. It probably wouldn't avoid the prologue crud or even be able to use ESI then. That would be more accurate, IMO. In short, your example needs more complexity.
itsnobody 05 Feb 2008, 03:51
add nop = dwCount/7; and see how much everything slows down
itsnobody 05 Feb 2008, 04:03
OzzY wrote: The fastest I could get:
Seems faster, but it benchmarks more or less equal to
Code:
        invoke  MessageBox, NULL, "Click Ok to start", "Speed Test", MB_OK
        invoke  GetTickCount
        xchg    eax, ebx
        mov     ecx, 1000000000
  place:
        dec     ecx
        cmp     ecx, 0
        jg      place
        invoke  GetTickCount
        sub     eax, ebx
        invoke  wsprintf, _output, "%d milliseconds", eax
        invoke  MessageBox, NULL, _output, "Speed Test", MB_OK
        invoke  ExitProcess, 0

Also, adding a div instruction slows everything down to slower than Java.
bitRAKE 05 Feb 2008, 06:01
Code:
include 'win32ax.inc'

.data
  _output rb 4096

.code
  start:
        mov     edi, 5
        xor     ebp, ebp
        invoke  MessageBox, ebp, "Click Ok to Start", "Speed Test", ebp
  DoIt:
        mov     ebx, 1000000000
        invoke  GetTickCount
        xchg    eax, ebx
  place:
        dec     eax
        jnz     place
        invoke  GetTickCount
        sub     eax, ebx
        push    eax
        dec     edi
        jne     DoIt
        invoke  wsprintf, _output, <"%d ms",10,13,"%d ms",10,13,"%d ms",10,13,"%d ms",10,13,"%d ms">
        invoke  MessageBox, ebp, _output, "Speed Test Finished", ebp
        invoke  ExitProcess, ebp
.end start
Madis731 05 Feb 2008, 11:51
Use rax as the register and compare it to Java's long, woo-hoo!
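A 64-bit version of the countdown is only sketched here, on the assumption that the benchmark is rebuilt as a Win64 program (e.g. with win64ax.inc); it is not code from the thread:
Code:
        mov     rax, 1000000000         ; same iteration count, now in a 64-bit register
  place:
        dec     rax                     ; decrements the full 64-bit counter,
        jnz     place                   ; comparable to looping on a Java long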
revolution 05 Feb 2008, 12:20
Just FYI using bitRAKE's code immediately above:
Pentium M: 2 clocks per loop. Core Duo: 1 clock per loop. (note: NOT a Core2 Duo)
Madis731 05 Feb 2008, 14:00
Topic1:
Khkm - excuse me, but you are considering a spin-loop a *CORRECT*!??!? test for Intel Architecture (yes, capitalized "A")??? This loop can never run under 1 clock cycle because of the jump! You're not testing memory, nor are you testing registers. You're testing branch prediction, loops and jumping efficiency.
Code:
;init count to 0
inc [count]
cmp [count],const       ; = 3609

;init ecx to 0
inc ecx
inc [count]
cmp ecx,1000000000      ; = 3125

;init ecx and count to 0
mov ecx,[count]
add ecx,1
cmp ecx,1000000000
mov [count],ecx         ; = 3062

;init ecx to 0
inc ecx
cmp ecx,const           ; = 531

;init ecx to 1000000000
sub ecx,1               ; = 547

;init ecx to 1000000000
nop
sub ecx,1               ; = 532

Without the dependency, it's 282ms. Btw, the tests were conducted on a T7200 CPU, which is exactly 2GHz, so 500ms means 1 loop per clock, whereas 250ms (282ms with some overhead) means 2 loops per clock.
Code:
        mov     eax,250000000
        mov     ebx,eax
        mov     ecx,eax
        mov     edx,eax
        align   16
  place:
        sub     eax,1
        sub     ebx,1
        sub     ecx,1
        sub     edx,1
        jne     place

There's one more trick: if you double the 4 SUBs in the loop, you get 207-219ms (200ms in the ideal case), i.e. 2.5 loops per clock. The reason you can't get more juice out of it is that the CPU can't handle more than 4 uops per cycle, and that is how much these instructions generate.
Topic2:
No compiler, assembler or interpreter can disobey you and delete the code that you wrote! If I put in a for(;;) loop to make an infinite delay, it must not remove it. I need it. And if I need a precise delay with for(int a=0;a<1000000000;a++){} then I definitely need it - why else would I write it there - it must absolutely NOT delete it!!! That's not optimizing, that's censoring. BUT... damn - it (the billion one, not the infinite loop) works even on the Intel C++ Compiler. Oh well...
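The "doubled SUBs" version is only described above, not shown; a sketch of what it might look like follows (the halved initial counts, which keep the total number of decrements at one billion, are an assumption):
Code:
        mov     eax,125000000           ; half the count: 8 SUBs per iteration instead of 4
        mov     ebx,eax
        mov     ecx,eax
        mov     edx,eax
        align   16
  place:
        sub     eax,1
        sub     ebx,1
        sub     ecx,1
        sub     edx,1
        sub     eax,1                   ; second round on the same four registers
        sub     ebx,1
        sub     ecx,1
        sub     edx,1
        jne     place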
LocoDelAssembly 05 Feb 2008, 14:40
But Madis, infinite loops are not removed, just the ones that don't affect the output or those that can be executed at compile time and replaced with the final state of the variables. The precise delay you mention is in fact not possible anymore; specialized delay code must be used instead.
As for why it is there if I don't need it: perhaps it was macro-generated and the compiler does me the favor of removing the cases where the macro generates a useless loop. And look at the advantages of removing a loop: I don't need to calculate a Taylor series with a calculator and place the constant, because I can just write the loop and the compiler will place the final result instead of the loop, so I can even change a constant and immediately have more iterations of the calculation without touching a calculator or doing it by hand. HLLs must remove useless code; they are not Assembly, so there is no need to have at least one CPU instruction per code line.
f0dder 05 Feb 2008, 15:09
Madis731 wrote: Topic1:
It's normal optimizing, just like function inlining, expression reduction ("...but I NEEEEED 2*2*2*2 to perform three muls!" (use PowerBASIC then)), et cetera. If you want a dummy loop, use a volatile counter:
Code:
_main   PROC
; 2    : {
        push    ecx
; 3    :     for(unsigned i=0; i<1000; i++) ;
; 4    :     for(volatile unsigned i=0; i<42000; i++) ;
        mov     DWORD PTR _i$2532[esp+4], 0
        mov     eax, DWORD PTR _i$2532[esp+4]
        cmp     eax, 42000                      ; 0000a410H
        jae     SHORT $LN13@main
        mov     eax, 1
        npad    9
$LL3@main:
        add     DWORD PTR _i$2532[esp+4], eax
        mov     ecx, DWORD PTR _i$2532[esp+4]
        cmp     ecx, 42000                      ; 0000a410H
        jb      SHORT $LL3@main
$LN13@main:
; 5    : }
        xor     eax, eax
        pop     ecx
        ret     0
_main   ENDP

_________________
- carpe noctem