flat assembler
Message board for the users of flat assembler.

Java faster than ASM!?

vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid 02 Feb 2008, 19:59
Quote:
I know Java has some type of advanced memory management system, that could be a reason it's faster also

Java has garbage collection, which in some cases can be faster than a normal heap allocator. Also, thanks to the GC, it can perform some optimizations that are impossible for languages without one.
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 02 Feb 2008, 20:37
rugxulo, about registers: that excerpt doesn't really back you up. It says that excessive use of registers can lead to suboptimal results. I think xorpd! made some changes to his Mandelbrot code to make better use of the execution units, and one of the changes was using some instructions with memory operands even though enough registers were available (his code is 64-bit).

[edit] http://board.flatassembler.net/topic.php?p=69250#69250 (Maybe I'm wrong about Mandelbrot)
bitRAKE



Joined: 21 Jul 2003
Posts: 4024
Location: vpcmpistri
bitRAKE 02 Feb 2008, 21:16
We need to look at the dependency chain first: if the processor is waiting on a result, it hardly matters where that result is going (because of the cache and delayed writes to memory). See how many NOPs can be put in the loop without affecting the timing; the processor is just waiting most of the time.
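
For anyone who wants to poke at the dependency-chain idea outside of assembly, here is a rough C sketch. It is not from the thread, and the function names `sum_one_chain` and `sum_four_chains` are made up for illustration. The first function feeds every addition into the next one; the second keeps four independent accumulators so the CPU has a chance to overlap them. Whether you actually measure a difference depends on the compiler, the flags, and the CPU, so treat it as an illustration rather than a benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every add depends on the result of the previous add. */
uint64_t sum_one_chain(const uint32_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += data[i];
    return acc;
}

/* Four accumulators: four independent chains that may execute in parallel. */
uint64_t sum_four_chains(const uint32_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    uint64_t a = 0, b = 0, c = 0, d = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a += data[i];
        b += data[i + 1];
        c += data[i + 2];
        d += data[i + 3];
    }
    for (; i < n; i++)      /* leftover elements when n is not a multiple of 4 */
        a += data[i];
    return a + b + c + d;
}
```

Both functions return the same sum; only the shape of the dependency chain differs. An optimizing compiler may vectorize or unroll either one, so check the generated code before drawing conclusions.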
itsnobody



Joined: 01 Feb 2008
Posts: 93
Location: Silver Spring, MD
itsnobody 03 Feb 2008, 17:14
Hmm....this is interesting

I added one divide instruction to the Java speed test and it jumped from 1000 milliseconds to over 25000 milliseconds....then I added a divide instruction to the ASM Speed Test and the same thing happened, around 24000-25000 milliseconds

I knew div was slow, but I didn't know it slows things down by 25 times
sleepsleep



Joined: 05 Oct 2006
Posts: 12741
sleepsleep 03 Feb 2008, 17:27
does this thread somehow conclude,
java is the next BIG thing to learn ?
edfed



Joined: 20 Feb 2006
Posts: 4330
Location: Now
edfed 03 Feb 2008, 17:31
no
java suxxx.

soon dead, cause asm will invade the world.
itsnobody



Joined: 01 Feb 2008
Posts: 93
Location: Silver Spring, MD
itsnobody 03 Feb 2008, 17:31
sleepsleep wrote:
does this thread somehow conclude,
java is the next BIG thing to learn ?


I think it concludes that using good fast code is very important....
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 03 Feb 2008, 23:46
edfed wrote:
no
java suxxx.

soon dead, cause asm will invade the world.


Please learn to differentiate between wishful thinking and The Real World(TM) :)

Surprises me (or perhaps not...) that the JAVA compiler doesn't optimize-out the for loop - any half-assed C++ compiler would drop the for loop and replace it with a constant assignment instead, if that value is used at all.

Theoretically a JIT'ed language could provide much better performance than anything you hand-code, because profiling can be done on actual data loads, and generated code be tailored to the needs (of course the profiling itself takes up some execution cycles as well...)

In reality, the current JITers generate "okay enough code for what they're generally used for", but aren't offering jaw-dropping performance. Microsoft's JVM was a lot faster than SUN's, and I remember running filtered + color-lit texturemapped quake2.bsp loading 3D rendering in java. I think it was on a pmmx-200, or perhaps it was an athlon700.

Yardman



Joined: 12 Apr 2005
Posts: 244
Location: US
Yardman 04 Feb 2008, 00:50
[ Post removed by author. ]


Last edited by Yardman on 04 Apr 2012, 03:10; edited 2 times in total
Yardman



Joined: 12 Apr 2005
Posts: 244
Location: US
Yardman 04 Feb 2008, 00:53
[ Post removed by author. ]


Last edited by Yardman on 04 Apr 2012, 03:11; edited 1 time in total
OzzY



Joined: 19 Sep 2003
Posts: 1029
Location: Everywhere
OzzY 04 Feb 2008, 22:02
Fastest I could get:

Code:
include 'win32ax.inc'
.data
    _output rb 20                       ; buffer for the formatted result
.code

start:
    invoke MessageBox, 0, "Click Ok to Start", "Speed Test", 0
    invoke GetTickCount
    xchg eax, ebx                       ; ebx = start time in ms
    mov ecx, 1000000000                 ; one billion iterations
place:
    dec ecx
    jnz place                           ; dec sets ZF, so no cmp is needed
    invoke GetTickCount
    sub eax, ebx                        ; eax = elapsed ms
    cinvoke wsprintf, _output, "%d milliseconds", eax  ; wsprintf is cdecl
    invoke MessageBox, 0, _output, "Speed Test Finished", 0
    invoke ExitProcess, 0
.end start
    
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
rugxulo 04 Feb 2008, 22:03
Yardman, you could either change the first two printf()s to puts(), which are usually faster, or at least combine the last two. And you should say "int main" with a "return 0;" at least, just to be correct.

In fact, it might be more useful if you actually did something tangible inside the "while" loop. It probably wouldn't avoid the prologue crud or even be able to use ESI then. That would be more accurate, IMO. In short, your example needs more complexity. ;)
itsnobody



Joined: 01 Feb 2008
Posts: 93
Location: Silver Spring, MD
itsnobody 05 Feb 2008, 03:51
add nop = dwCount/7; and see how much everything slows down
itsnobody



Joined: 01 Feb 2008
Posts: 93
Location: Silver Spring, MD
itsnobody 05 Feb 2008, 04:03
OzzY wrote:
Faster I could get:

Code:
include 'win32ax.inc'
.data  
    _output rb 20  
.code  

start:  
            invoke MessageBox, 0, "Click Ok to Start", "Speed Test", 0 
            invoke GetTickCount  
            xchg eax, ebx
            mov ecx, 1000000000 
            place:  
             dec ecx
             jnz place
            invoke GetTickCount
            sub eax, ebx 
            invoke wsprintf, _output, "%d milliseconds", eax  
            invoke MessageBox, 0, _output, "Speed Test Finished", 0
            invoke ExitProcess, 0  
.end start
    


Seems faster, but it benchmarks more or less equal to
Code:
            invoke MessageBox,NULL,"Click Ok to start","Speed Test",MB_OK
            invoke GetTickCount
            xchg eax, ebx
            mov ecx, 1000000000
            place:  
                dec ecx
                cmp ecx,0
                jg place
            invoke GetTickCount
            sub eax, ebx
            cinvoke wsprintf,_output,"%d milliseconds",eax
            invoke MessageBox,NULL,_output,"Speed Test",MB_OK
            invoke ExitProcess,0   
    


Also adding a div instruction slows everything down to slower than Java
bitRAKE



Joined: 21 Jul 2003
Posts: 4024
Location: vpcmpistri
bitRAKE 05 Feb 2008, 06:01
Code:
include 'win32ax.inc'
.data
    _output rb 4096
.code   

start:
  mov edi, 5
  xor ebp,ebp

  invoke MessageBox, ebp, "Click Ok to Start", "Speed Test", ebp
DoIt:
  mov ebx, 1000000000
  invoke GetTickCount
  xchg eax, ebx

place:
  dec eax
  jnz place

  invoke GetTickCount
  sub eax, ebx
  push eax
  dec edi
  jne DoIt

  invoke wsprintf, _output, <"%d ms",10,13,"%d ms",10,13,"%d ms",10,13,"%d ms",10,13,"%d ms">
  invoke MessageBox, ebp, _output, "Speed Test Finished", ebp
  invoke ExitProcess, ebp
.end start    
Consistently 625ms with DEC/ADD/SUB, which is an impressive cycles-per-loop figure. Adding a NOP almost doubled the time. :? Using memory took 6x as long (3780ms)!
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 05 Feb 2008, 11:51
Use rax as the register and compare it to Java's long :D woo-hoo!
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20302
Location: In your JS exploiting you and your system
revolution 05 Feb 2008, 12:20
Just FYI using bitRAKE's code immediately above:

Pentium M: 2 clocks per loop.
Core Duo: 1 clock per loop. (note: NOT a Core2 Duo)
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 05 Feb 2008, 14:00
Topic1:
Khkm - excuse me, but you are considering a spin-loop a *CORRECT*!??!? test for Intel Architecture (yes, capitalized "A")???

This loop can never run in under 1 clock cycle because of the jump! You're testing neither memory nor registers. You're testing branch prediction, looping, and jump efficiency.

Code:
;init count to 0
inc [count]
cmp [count],const; = 3609

;init ecx to 0
inc ecx
inc [count]
cmp ecx,1000000000; = 3125

;init ecx and count to 0
mov ecx,[count]
add ecx,1
cmp ecx,1000000000
mov [count],ecx; = 3062

;init ecx to 0
inc ecx
cmp ecx,const; = 531

;init ecx to 1000000000
sub ecx,1; = 547

;init ecx to 1000000000
nop
sub ecx,1; = 532
    

Without the dependency, it's 282ms. Btw, the tests were conducted on a T7200 CPU, which runs at exactly 2GHz, so 500ms means 1 loop per clock, whereas 250ms (282ms with some overhead) means 2 loops per clock.
Code:
            mov eax,250000000
            mov ebx,eax
            mov ecx,eax
            mov edx,eax
       align 16
            place:
            sub eax,1
            sub ebx,1
            sub ecx,1
            sub edx,1
            jne place
    

There's one more trick. If you double the 4 SUBs in the loop, you get 207-219ms (200ms ideally), i.e. 2.5 loops per clock. You can't get more juice out of it because the CPU can't handle more than 4 uops per cycle, and that is how many these instructions generate.
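
To make the arithmetic behind those numbers easy to redo for another CPU, here is a small C helper. It is only a sketch: the name `ops_per_clock` is made up, and it assumes the total work stays at one billion decrements (4 registers x 250000000 in the loop above) and that the clock really is a fixed 2GHz during the run.

```c
#include <assert.h>

/* Decrements completed per clock cycle.
   total_ops  : total number of decrements executed
   elapsed_ms : measured wall time in milliseconds
   clock_ghz  : CPU clock in GHz (1 GHz = 1e6 cycles per millisecond) */
double ops_per_clock(double total_ops, double elapsed_ms, double clock_ghz)
{
    assert(elapsed_ms > 0.0 && clock_ghz > 0.0);
    double cycles = elapsed_ms * clock_ghz * 1e6;
    return total_ops / cycles;
}
```

With 1e9 decrements at 2GHz, 500ms works out to 1.0 per clock, 250ms to 2.0, and 200ms to 2.5, which matches the figures quoted above. GetTickCount has coarse resolution, so the last digit of any measured value should not be taken too seriously.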

Topic2:
No compiler, assembler or interpreter can disobey you and delete the code that you wrote! If I put in a for(;;) loop to make an infinite delay, it must not remove it. I need it. And if I need a precise delay with for(int a=0;a<1000000000;a++){} I definitely need it. Why else would I write it there? It must absolutely NOT delete it!!! That's not optimizing, that's censurizing :P

BUT... damn - it (the billion one not the infinite loop) works even on Intel C++ Compiler. Oh well...
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 05 Feb 2008, 14:40
But Madis, infinite loops are not removed, only the ones that don't affect the output, or those that can be executed at compile time and replaced with the final state of the variables. The precise delay you mention is in fact not possible anymore; specialized delay code must be used instead.

As for why it is there if I don't need it: perhaps it was macro-generated, and the compiler does you the favor of removing the cases where the macro generates a useless loop.

And look at the advantages of removing a loop: I don't need to calculate a Taylor series with a calculator and paste in the constant, because I can just write the loop and the compiler will put the final result in place of the loop. I can even change a constant and immediately get more iterations of the calculation without touching a calculator or doing it by hand.

HLLs must remove useless code; they are not assembly, so there is no need to emit at least one CPU instruction per line of code.
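
A small C sketch of the Taylor-series case, for illustration only. The names `TAYLOR_TERMS` and `taylor_e` are made up here, and whether a given compiler actually folds the loop into a constant depends on the compiler and its optimization settings; the standard permits it but does not require it.

```c
/* Number of series terms; change it and rebuild to get more iterations. */
#define TAYLOR_TERMS 12

/* Approximates e as the sum of 1/k! for k = 0 .. TAYLOR_TERMS-1.
   The loop depends only on compile-time constants, so an optimizing
   compiler may replace the whole function body with the final value. */
double taylor_e(void)
{
    double sum = 0.0;
    double term = 1.0;               /* 1/0! */
    for (int k = 0; k < TAYLOR_TERMS; k++) {
        sum += term;
        term /= (double)(k + 1);     /* next term: 1/(k+1)! */
    }
    return sum;
}
```

With 12 terms the result agrees with e to about eight decimal places. Looking at the assembly listing of an optimized build is the way to see whether the loop survived or was reduced to a single constant load.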
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 05 Feb 2008, 15:09
Madis731 wrote:
Topic1:
No compiler, assembler or interpreter can disobey you and delete the code that you wrote! If I put for(;; ) loop to make an infinite delay - it must not remove it. I need it. And if I need a precise delay with for(int a=0;a<1000000000;a++){} I definitely need it - why else would I write it there - it must absolutely NOT delete it!!! That's not optimizing, that's censurizing Razz

It's normal optimizing, just like function inlining, expression reduction ("...but I NEEEEED 2*2*2*2 to perform three muls!" (use PowerBASIC then)), et cetera.

If you want a dummy loop, use a volatile counter:
Quote:

C:\Temp>cl /Ox /FAs /c test.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.21022.08 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.

Code:
_main     PROC

; 2    : {

        push    ecx

; 3    :     for(unsigned i=0; i<1000; i++) ;
; 4    :     for(volatile unsigned i=0; i<42000; i++) ;

        mov     DWORD PTR _i$2532[esp+4], 0
        mov     eax, DWORD PTR _i$2532[esp+4]
        cmp     eax, 42000                      ; 0000a410H
        jae     SHORT $LN13@main
        mov     eax, 1
        npad    9
$LL3@main:
        add     DWORD PTR _i$2532[esp+4], eax
        mov     ecx, DWORD PTR _i$2532[esp+4]
        cmp     ecx, 42000                      ; 0000a410H
        jb      SHORT $LL3@main
$LN13@main:

; 5    : }

        xor     eax, eax
        pop     ecx
        ret     0
_main      ENDP
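
For reference, the same two loops written as standalone C functions. This is a sketch reconstructed from the source comments in the listing; the function names `plain_delay` and `volatile_delay` are made up, and the return value is added only so the result can be checked.

```c
/* Plain counter: the loop has no observable effect, so an optimizing
   compiler is allowed to delete it entirely. */
void plain_delay(void)
{
    for (unsigned i = 0; i < 1000; i++)
        ;
}

/* volatile counter: every read and write of i has to be performed,
   so the loop survives optimization. Returns the final counter value. */
unsigned volatile_delay(unsigned limit)
{
    volatile unsigned i;
    for (i = 0; i < limit; i++)
        ;
    return i;
}
```

Note that in the listing above the first loop produced no instructions at all, while the volatile one kept its memory accesses on every iteration. A volatile loop still gives no timing guarantee, it only guarantees that the iterations happen.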

    


Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.