flat assembler
Message board for the users of flat assembler.
Index > Windows > Java faster than ASM!?
revolution 06 Feb 2008, 04:59
In ARM code with Jazelle support the CPU can execute Java bytecodes directly, at full clock speed. There are still many bytecodes that need OS support, but the simple and common ones are native to the processor. You have to rewrite your algo to beat Java in that case.
itsnobody 06 Feb 2008, 05:02
revolution wrote: In ARM code with Jazelle support the CPU can execute Java bytecodes directly, at full clock speed. There are still many bytecodes that need OS support, but the simple and common ones are native to the processor. You have to rewrite your algo to beat Java in that case.

Oh, that explains it, for ARM
LocoDelAssembly 06 Feb 2008, 05:11
Quote:
Only once; the JIT is meant to compile the Java bytecodes to native machine code, unlike Visual Basic 6.0 and older with its p-code, which is interpreted all the time and never JIT'ed.

PS: Visual Basic 6.0 allows compilation to both native code and p-code.
bitRAKE 06 Feb 2008, 06:18
Maybe we can move on to some actual algorithm optimization. Single instructions are hardly representative of a coder's or an HLL JIT's abilities. Basically, enough loop iterations are being executed to hide the overhead of the HLL. JIT or not, abstractions cost real time.
Code this algorithm in assembly and compare it to Java/C++ speed.
itsnobody 06 Feb 2008, 15:00
LocoDelAssembly wrote:
Ohhh, so that explains it now, I didn't know VB 6.0 could compile to native code. But Java is still slower at loading, since it has to convert all the bytecode into machine code before it runs.
LocoDelAssembly 06 Feb 2008, 16:37
Code:
include 'win32ax.inc'

use_reciprocal = 0 ; Comment this line to use DIV instruction

.data
  start_time dd 0
  _output    rb 20

.code
start:
        invoke  MessageBox,NULL,"Click Ok to start","Speed Test",MB_OK
        invoke  GetTickCount
        mov     [start_time],eax

        xor     ecx, ecx
        mov     ebx, 024924925h
        mov     edi, 7

align 16
place:
if defined use_reciprocal
        mov     eax, ecx
        mul     ebx
        mov     eax, ecx
        sub     eax, edx
        shr     eax, 1
        add     eax, edx
        shr     eax, 2
else
        xor     edx, edx
        mov     eax, ecx
        div     edi
end if
        dec     ecx
        jnz     place

        invoke  GetTickCount
        sub     eax, [start_time]
        invoke  wsprintf,_output,"%d milliseconds",eax
        invoke  MessageBox,NULL,_output,"Speed Test",MB_OK
        invoke  ExitProcess,0
.end start

On Athlon64 2.0 GHz Venice:
With DIV: ~89000 ms
With reciprocal: ~10000 ms

I have to confess that I cheated: the reciprocal code was generated by Visual C++ 2005 Express, by using volatile on all variables.

Code:
int _tmain(int argc, _TCHAR* argv[])
{
00401000  sub         esp,8
    _asm{int 3}
00401003  int         3
    volatile unsigned int count = 1000000000;
00401004  mov         dword ptr [esp],3B9ACA00h
    volatile unsigned int nop;
    while (count)
0040100B  mov         eax,dword ptr [esp]
0040100E  test        eax,eax
00401010  je          wmain+32h (401032h)
    {
        nop = count / 7;
00401012  mov         ecx,dword ptr [esp]
00401015  mov         eax,24924925h
0040101A  mul         eax,ecx
0040101C  sub         ecx,edx
0040101E  shr         ecx,1
00401020  add         ecx,edx
00401022  shr         ecx,2
00401025  mov         dword ptr [esp+4],ecx
        count--;
00401029  add         dword ptr [esp],0FFFFFFFFh
0040102D  mov         ecx,dword ptr [esp]
00401030  jne         wmain+12h (401012h)
    }
}
00401032  add         esp,8
00401035  ret

I tested the reciprocal against all possible dividends and it produces the correct result for them all. Also, if the count max value is the same as in the original code, you can remove all the SHR/ADD instructions because it will still produce correct results without them (with bigger dividends it goes wrong; I haven't tested exactly at what dividend it starts to fail, but I tested 100000000 and below and it worked).
OzzY 06 Feb 2008, 18:09
Another thing that might count a lot in optimization of loops is this: http://www.jorgon.freeserve.co.uk/TestbugHelp/Optimisation.htm#rep
Take a look at branch prediction.
revolution 06 Feb 2008, 18:51
LocoDelAssembly wrote: (with bigger dividends it goes wrong; I haven't tested exactly at what dividend it starts to fail, but I tested 100000000 and below and it worked)

The multiplier, 24924925h, is the infinitely precise result plus some error term; in this case 24924924h + 4/7 + e, where e is the error term, 3/7. We just have to work out when X * e = 2^32/divisor, where X is the lower bound of the first errored result.

Therefore X = (2^32/7) / (3/7), and cancelling the 7's, X = 2^32/3 = 1431655765.333

This shows that all values below 1431655766 are guaranteed to be correct. Values from 1-2x 1431655766 may be incorrect by up to +1 extra when the remainder is 6. Values from 2-3x 1431655766 may be incorrect by up to +1 extra when the remainder is 5 or 6.
LocoDelAssembly 06 Feb 2008, 19:28
I'll have to believe you
asmhack 06 Feb 2008, 20:01
About the topic... how can Java be faster than ASM on an x86 machine?
Does it use something better than direct access to the registers? Does it use something better than, for example, 'xor eax,eax' to zero a 'variable'?
itsnobody 06 Feb 2008, 20:29
LocoDelAssembly wrote:
I don't really understand how the hell that works, but it speeds this up a lot and beats the hell out of Java now... 375 milliseconds vs. 2400 milliseconds
daniel.lewis 07 Feb 2008, 00:04
Of course it does. Correctly coded assembler will always knock the socks off code written in Java or any other HLL. Incorrectly coded assembler will probably get a worse score than an HLL, because the compiler will at least try to optimize some of it for you.
If you program something quickly, do it in an HLL. If you do it carefully, taking your time to think through everything and do it right, do it in assembler. And, as they say, "write one to throw away", so I recommend writing it in an HLL the first time, and in assembler for version 2.0. Good luck.
_________________
dd 0x90909090 ; problem solved.
f0dder 07 Feb 2008, 00:29
...but what stops you from doing the reciprocal-mul-instead-of-div in Java? :]
Goplat 07 Feb 2008, 02:22
It requires the use of a 32x32->64 multiplication. Even though this operation exists as an instruction on x86, most high level languages including Java don't have a way to do it; they only have 32x32->32 (inadequate) and 64x64->64 (slow, because it requires multiple instructions on x86).
Dopefish 07 Feb 2008, 03:23
Using what LocoDelAssembly posted, on an Intel Core 2 Duo E6600 @ 3.0GHz:
With DIV: 52281 ms
With reciprocal: 6531 ms
itsnobody 07 Feb 2008, 03:42
On my Intel Pentium dual-core processor @ 1.46GHz:
With DIV: 112492 ms
With reciprocal: 14009 ms
Man, I wonder how much you could speed things up if you replaced all DIV instructions with this....
daniel.lewis 07 Feb 2008, 04:40
A lot.
You might also notice major improvements here:
- there's a sqrt function built into the SSE2 instruction set
- store the length of strings instead of using strlen to repeatedly get it
- movsb/scasw and friends are vastly slower than SSE2 with prefetch, something about handling 16 bytes in parallel
- most commercial programs are obfuscated, and the obfuscation algorithm involves injecting excessive jumps and waste instructions into the source; this kills performance, and actually incites me to reverse the algorithms in order to streamline them (I wrote a bunch of scripts to help me do so)
There are a lot more on Agner Fog's and Paul Hsieh's websites.
Regards, Dan
ChrisLeslie 07 Feb 2008, 10:15
This thread demonstrates that well coded Java, with JIT, can be at least as fast as ASM: faster than poorly coded ASM, but slower than well coded ASM. I discovered this some time ago when I found that I gained no speed improvements from ASM unless I tried very hard to optimize. Even then I could never get a string compare proc to run as fast as Java's string.equals(), even with Agner Fog's tricks etc. And highly optimized ASM code can end up very hard to read after a few weeks away from it unless it is carefully commented.
So, if you want fast ASM code you have to work at it and experiment with approaches that may not be intuitive or elegant.
Chris
r22 07 Feb 2008, 15:30
On a 64-bit system you should be able to calculate the binary reciprocal accurately enough to remove any error for all 32-bit numbers, using the 64x64->128 MUL.
I wonder if calculating the reciprocal on the fly using SSE2 would be faster or slower than integer DIV.
Code:
;; IN  rcx = 32bit num
;; OUT rcx = binary rcp of ecx
MOVQ     xmm1, qword [SD_ONE]
CVTDQ2SD xmm0, rcx
DIVSD    xmm1, xmm0
;; Sign(63) Exp(62..52) Mant(51...0)
PSLLQ    xmm1, xmm1, 12   ;; Mant(63...12)
MOVQ     rcx, xmm1

EDIT: I made up the RCPSD instruction (it seemed good). RCPSS is a 1-3 cycle latency instruction, but 12 bits of precision isn't enough. After getting a chance to check the instruction latencies and throughput on Core2 and AMD64, I'm relatively sure, without benchmarking, that calculating the reciprocal dynamically to optimize division would NOT improve performance.
ON TOPIC: Thinking you can easily beat the running speed of an optimizing compiler by simply direct-porting an algorithm to ASM is asinine. But this irrational thread has been mildly entertaining and thought provoking.
Last edited by r22 on 07 Feb 2008, 19:37; edited 2 times in total
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.