flat assembler
Message board for the users of flat assembler.

Index > Windows > Java faster than ASM!?

Goto page Previous  1, 2, 3, 4, 5  Next
Author
Thread Post new topic Reply to topic
itsnobody



Joined: 01 Feb 2008
Posts: 93
Location: Silver Spring, MD
itsnobody
LocoDelAssembly wrote:
For certain code both will equally faster, that can happens, what cannot however is that assembly is slower that Java, if such thing happens that means just that Java found better CPU instruction sequence than you which means then that you have lower skills than a HLL compiler.


But how can it even be equal? Doesn't the Java runtimes have to read the byte code then execute the instructions?
Post 06 Feb 2008, 04:52
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17717
Location: In your JS exploiting you and your system
revolution
In ARM code with Jazelle support the CPU can execute Jave byte codes directly, at full clock speed. There are still many codes that need OS support, but the simple and common codes are native to the processor. You have to rewrite your algo to beat the Java in that case.
Post 06 Feb 2008, 04:59
View user's profile Send private message Visit poster's website Reply with quote
itsnobody



Joined: 01 Feb 2008
Posts: 93
Location: Silver Spring, MD
itsnobody
revolution wrote:
In ARM code with Jazelle support the CPU can execute Jave byte codes directly, at full clock speed. There are still many codes that need OS support, but the simple and common codes are native to the processor. You have to rewrite your algo to beat the Java in that case.


Oh, that explains it, for ARM
Post 06 Feb 2008, 05:02
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Quote:

But how can it even be equal? Doesn't the Java runtimes have to read the byte code then execute the instructions?

Only once, the JIT is meant for compile the Java bytecodes to native machine code, not as Visual Basic 6.0 and older with its p-code that it is interpreted all the time and never JIT'ed.

PS: Visual Basic 6.0 allows compilation for both, native and p-code.
Post 06 Feb 2008, 05:11
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 3056
Location: vpcmipstrm
bitRAKE
Maybe we can move on to some actual algorithm optimization. Single instructions are hardly representative of coder or HLL JIT ablities. Basically, enough loops are being executed to hide the overhead of the HLL - JIT or not abstractions cost real time.

Code this algorithm in assembly and compare to Java/C++ speed.
Post 06 Feb 2008, 06:18
View user's profile Send private message Visit poster's website Reply with quote
itsnobody



Joined: 01 Feb 2008
Posts: 93
Location: Silver Spring, MD
itsnobody
LocoDelAssembly wrote:
Quote:

But how can it even be equal? Doesn't the Java runtimes have to read the byte code then execute the instructions?

Only once, the JIT is meant for compile the Java bytecodes to native machine code, not as Visual Basic 6.0 and older with its p-code that it is interpreted all the time and never JIT'ed.

PS: Visual Basic 6.0 allows compilation for both, native and p-code.


Ohhh

So that explains it now, I didn't know VB 6.0 was compiled native code now

But Java is still slower with loading up since it has to convert all the bytecode into machine code and then run
Post 06 Feb 2008, 15:00
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Code:
include 'win32ax.inc'
use_reciprocal = 0 ; Comment this line to use DIV instruction
.data 
    start_time dd 0 
    _output rb 20 
.code 

start: 
            invoke MessageBox,NULL,"Click Ok to start","Speed Test",MB_OK 
            invoke GetTickCount 
            mov [start_time],eax 
            xor ecx, ecx
            mov ebx, 024924925h
            mov edi, 7

            align 16
            place:
            if defined use_reciprocal
                mov eax, ecx
                mul ebx
                mov eax, ecx
                sub eax, edx
                shr eax, 1
                add eax, edx
                shr eax, 2
            else
                xor edx, edx
                mov eax, ecx
                div edi
            end if
                dec ecx
                jnz place

            invoke GetTickCount
            sub eax, [start_time] 
            invoke wsprintf,_output,"%d milliseconds",eax 
            invoke MessageBox,NULL,_output,"Speed Test",MB_OK 
            invoke ExitProcess,0

 .end start    

On Athlon64 2.0 GHz Venice:
With DIV ~89000 ms
With reciprocal ~10000 ms

I have to confess that I cheated Razz The reciprocal code was generated by Visual C++ 2005 Express by using volatile on all variables.

Code:
int _tmain(int argc, _TCHAR* argv[])
{
00401000  sub         esp,8 
       _asm{int 3}
00401003  int         3    
        volatile unsigned int count = 1000000000;
00401004  mov         dword ptr [esp],3B9ACA00h 
       volatile unsigned int nop;
  while (count)
0040100B  mov         eax,dword ptr [esp] 
0040100E  test        eax,eax 
00401010  je          wmain+32h (401032h) 
 {
              nop = count / 7;
00401012  mov         ecx,dword ptr [esp] 
00401015  mov         eax,24924925h 
0040101A  mul         eax,ecx 
0040101C  sub         ecx,edx 
0040101E  shr         ecx,1 
00401020  add         ecx,edx 
00401022  shr         ecx,2 
00401025  mov         dword ptr [esp+4],ecx 
                count--;
00401029  add         dword ptr [esp],0FFFFFFFFh 
0040102D  mov         ecx,dword ptr [esp] 
00401030  jne         wmain+12h (401012h) 
   }
}
00401032  add         esp,8 
00401035  ret     

I tested the reciprocal against all posible dividends and produces correct result for them all, and if the count max value is the same as in the original code you can remove all the SHR-ADD because it will produce correct result without them (but with bigger dividends it works wrong, I not tested from exactly what dividend starts to fails but I tested 100000000 and below and worked)
Post 06 Feb 2008, 16:37
View user's profile Send private message Reply with quote
OzzY



Joined: 19 Sep 2003
Posts: 1029
Location: Everywhere
OzzY
Another thing that might count a lot in optimization of loops is this: http://www.jorgon.freeserve.co.uk/TestbugHelp/Optimisation.htm#rep

Take a look at branch prediction.
Post 06 Feb 2008, 18:09
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17717
Location: In your JS exploiting you and your system
revolution
LocoDelAssembly wrote:
(but with bigger dividends it works wrong, I not tested from exactly what dividend starts to fails but I tested 100000000 and below and worked)
It is possible to compute the first errored value.

The multiplier, 24924925h, is the infinitely precise result plus some error term. In this case, 24924924h+4/7 + e, where e is the error term, 3/7.

We just have to work out when X * e = 2^32/divisor, where X is the lower bound of the first errored result.

Therefore X = (2^32/7) / (3/7) --cancelling 7's--> 2^32/3 = 1431655765.333

This shows that all values below 1431655766 are guaranteed to be correct.

Values from 1-2x 1431655766 may be incorrect by up to +1 extra when the remainder is 6.
Values from 2-3x 1431655766 may be incorrect by up to +1 extra when the remainder is 5 or 6.
Post 06 Feb 2008, 18:51
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
I'll have to believe you Razz
Post 06 Feb 2008, 19:28
View user's profile Send private message Reply with quote
asmhack



Joined: 01 Feb 2008
Posts: 431
asmhack
about the topic.. how java can be faster than asm on x86 machine ?
it uses something better than direct access to the registers ?
it uses something better than for example 'xor eax,eax' to zero a 'variable' ?

Image
Post 06 Feb 2008, 20:01
View user's profile Send private message Reply with quote
itsnobody



Joined: 01 Feb 2008
Posts: 93
Location: Silver Spring, MD
itsnobody
LocoDelAssembly wrote:
Code:
include 'win32ax.inc'
use_reciprocal = 0 ; Comment this line to use DIV instruction
.data 
    start_time dd 0 
    _output rb 20 
.code 

start: 
            invoke MessageBox,NULL,"Click Ok to start","Speed Test",MB_OK 
            invoke GetTickCount 
            mov [start_time],eax 
            xor ecx, ecx
            mov ebx, 024924925h
            mov edi, 7

            align 16
            place:
            if defined use_reciprocal
                mov eax, ecx
                mul ebx
                mov eax, ecx
                sub eax, edx
                shr eax, 1
                add eax, edx
                shr eax, 2
            else
                xor edx, edx
                mov eax, ecx
                div edi
            end if
                dec ecx
                jnz place

            invoke GetTickCount
            sub eax, [start_time] 
            invoke wsprintf,_output,"%d milliseconds",eax 
            invoke MessageBox,NULL,_output,"Speed Test",MB_OK 
            invoke ExitProcess,0

 .end start    

On Athlon64 2.0 GHz Venice:
With DIV ~89000 ms
With reciprocal ~10000 ms

I have to confess that I cheated Razz The reciprocal code was generated by Visual C++ 2005 Express by using volatile on all variables.

Code:
int _tmain(int argc, _TCHAR* argv[])
{
00401000  sub         esp,8 
   _asm{int 3}
00401003  int         3    
        volatile unsigned int count = 1000000000;
00401004  mov         dword ptr [esp],3B9ACA00h 
       volatile unsigned int nop;
  while (count)
0040100B  mov         eax,dword ptr [esp] 
0040100E  test        eax,eax 
00401010  je          wmain+32h (401032h) 
 {
              nop = count / 7;
00401012  mov         ecx,dword ptr [esp] 
00401015  mov         eax,24924925h 
0040101A  mul         eax,ecx 
0040101C  sub         ecx,edx 
0040101E  shr         ecx,1 
00401020  add         ecx,edx 
00401022  shr         ecx,2 
00401025  mov         dword ptr [esp+4],ecx 
                count--;
00401029  add         dword ptr [esp],0FFFFFFFFh 
0040102D  mov         ecx,dword ptr [esp] 
00401030  jne         wmain+12h (401012h) 
   }
}
00401032  add         esp,8 
00401035  ret     

I tested the reciprocal against all posible dividends and produces correct result for them all, and if the count max value is the same as in the original code you can remove all the SHR-ADD because it will produce correct result without them (but with bigger dividends it works wrong, I not tested from exactly what dividend starts to fails but I tested 100000000 and below and worked)


I don't really understand how the hell that works, but it speeds this up a lot and beats the hell out of Java now...375 milliseconds vs. 2400 milliseconds
Post 06 Feb 2008, 20:29
View user's profile Send private message Reply with quote
daniel.lewis



Joined: 28 Jan 2008
Posts: 92
daniel.lewis
Of course it does. Correctly coded assembler will always knock the socks off code written in Java or any other HLL. Incorrectly coded assembler will probably get a worse score than an HLL because the compiler will at least try to optimize some of it for you.

If you program something quickly, do it in HLL. If you do it carefully, taking your time to think through everything and do it right, do it in assembler.

And... as they say, "write one to throw away", so I recommend writing it in HLL the first time, and assembler for 2.0.

Good luck.

_________________
dd 0x90909090 ; problem solved.
Post 07 Feb 2008, 00:04
View user's profile Send private message Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
...but what stops you from doing the reciprocal-mul-instead-of-div in Java? :]
Post 07 Feb 2008, 00:29
View user's profile Send private message Visit poster's website Reply with quote
Goplat



Joined: 15 Sep 2006
Posts: 181
Goplat
It requires the use of a 32x32->64 multiplication. Even though this operation exists as an instruction on x86, most high level languages including Java don't have a way to do it; they only have 32x32->32 (inadequate) and 64x64->64 (slow, because it requires multiple instructions on x86).
Post 07 Feb 2008, 02:22
View user's profile Send private message Reply with quote
Dopefish



Joined: 18 Nov 2007
Posts: 10
Dopefish
Using what LocoDelAssembly posted:

Intel Core 2 Duo E6600 @ 3.0GHz
With DIV 52281 ms
With reciprocal 6531 ms
Post 07 Feb 2008, 03:23
View user's profile Send private message Reply with quote
itsnobody



Joined: 01 Feb 2008
Posts: 93
Location: Silver Spring, MD
itsnobody
On my Intel Pentium dual-core processor @ 1.46GHZ

With DIV - 112492 ms
With Reciprocal - 14009 ms

Man I wonder how much you could speed up things if you replaced all div instructions with this....
Post 07 Feb 2008, 03:42
View user's profile Send private message Reply with quote
daniel.lewis



Joined: 28 Jan 2008
Posts: 92
daniel.lewis
Alot.

You might also notice major improvements here:
- there's a sqrt function built into SSE2 instruction set
- store the length of strings instead of using strlen to repeatedly get it
- movsb/scasw and friends are vastly slower than SSE2 with prefetch. Something about handling 16 bytes in parallel.
- Most commercial programs are obfuscated, and the obfuscation algorithm involves injecting excessive jumps and waste instructions into the source. Kills performance, and actually incites me to reverse the algorithms in order to streamline them. I wrote a bunch of scripts to help me do so.

There are alot more on Agner Fog and Paul Hsieh's websites.
Regards,
Dan
Post 07 Feb 2008, 04:40
View user's profile Send private message Reply with quote
ChrisLeslie



Joined: 04 Jun 2006
Posts: 50
Location: Australia
ChrisLeslie
This thread demonstrates that well coded Java, with JIT, can be at least as fast as ASM, faster than poorly coded ASM, but slower than well coded ASM. I discovered this some time ago because I found that no speed improvements were gained using ASM unless I tried very hard to optimize. Even then I could never get a string compare proc to work as fast as Java string.equals(), even with Agner Fog tricks etc. Often then the highly optimized ASM code can end up very hard to read again after a few weeks away from it unless it is carefully commented.

So, if you want fast ASM code you have to work at it and experiment with ways that may not be intuitive or elegant.

Chris
Post 07 Feb 2008, 10:15
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
On a 64bit system you should be able to calculate the binary reciprocal accurate enough to remove any error for all 32bit numbers using the 64x64->128 MUL.

I wonder if calculating the reciprocal on the fly using SSE2 would be faster or slower than integer DIV.
Code:
;;IN rcx = 32bit num ;;OUT rcx = binary rcp of ecx
MOVQ xmm1, qword[SD_ONE]
CVTDQ2SD xmm0, rcx
DIVSD xmm1, xmm0 ;; Sign(63) Exp(62..52) Mant(51...0)
PSLLQ xmm1, xmm1, 12 ;; Mant(63...12)
MOVQ rcx, xmm1

    

EDIT: I made up the RCPSD instruction (it seemed good) RCPSS is a 1-3 cycle latency instruction but 12 bits of percision isn't enough.
After getting a chance to check out the instruction latencies and thoroughput on Core2 and AMD64, I'm relatively sure without benchmarking that dynamic calculation to optimize division would NOT improve performance.

ON TOPIC:
Thinking you can easily beat the running speed of an optimizing compiler by simply direct porting an algorithm to ASM is asinine. But this irrational thread has been mildly entertaining and thought provoking.


Last edited by r22 on 07 Feb 2008, 19:37; edited 2 times in total
Post 07 Feb 2008, 15:30
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2, 3, 4, 5  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.

Website powered by rwasa.