flat assembler
Message board for the users of flat assembler.
HaHaAnonymous 09 May 2013, 22:28
[ Post removed by author. ]
Last edited by HaHaAnonymous on 28 Feb 2015, 20:51; edited 1 time in total |
|||
![]() |
|
Bob++ 09 May 2013, 23:45
"Well, I think the 13 lines function would be faster than the 22+ lines generated by the compiler. If I am not mistaken."
But usually, it means nothing. |
|||
![]() |
|
HaHaAnonymous 10 May 2013, 01:12
[ Post removed by author. ]
Last edited by HaHaAnonymous on 28 Feb 2015, 20:48; edited 1 time in total |
|||
![]() |
|
HaHaAnonymous 10 May 2013, 03:08
[ Post removed by author. ]
Last edited by HaHaAnonymous on 28 Feb 2015, 20:48; edited 1 time in total |
|||
![]() |
|
HaHaAnonymous 10 May 2013, 14:01
[ Post removed by author. ]
Last edited by HaHaAnonymous on 28 Feb 2015, 20:48; edited 1 time in total |
|||
![]() |
|
tthsqe 21 May 2013, 00:39
I was playing around with a very nice program Mandelbulber, and I came across the following situation.
SOURCE:
Code:
double dMax(double a, double b, double c)
{
    if(a > b) {
        if (a > c) return a;
        return c;
    }
    if (b > c) return b;
    return c;
}

double dMin(double a, double b, double c)
{
    if(a < b) {
        if (a < c) return a;
        return c;
    }
    if (b < c) return b;
    return c;
}
Now, any human would do
Code:
dMax:
    maxsd xmm0,xmm1
    maxsd xmm0,xmm2
    ret
but take a look at what is in the EXE:
Code:
dMax: ;0x042D450
    movapd xmm3,xmm2
    sub rsp,8
    maxsd xmm2,xmm0
    add rsp,8
    maxsd xmm3,xmm1
    cmpltsd xmm1,xmm0
    movapd xmm0,xmm1
    addpd xmm2,xmm1
    andnpd xmm0,xmm3
    orpd xmm0,xmm2
    ret
dMin: ;0x042D480
    movapd xmm3,xmm2
    sub rsp,8
    minsd xmm2,xmm0
    add rsp,8
    minsd xmm3,xmm1
    cmpltsd xmm0,xmm1
    movapd xmm0,xmm1
    addpd xmm2,xmm0
    andnpd xmm0,xmm3
    orpd xmm0,xmm2
    ret
So close but so far. pitiful compiler... I am no longer wondering why the render times are so high. |
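One source-level variation that may be worth trying before dropping to assembly is to write the three-way maximum as two chained pairwise maxima in the `x > y ? x : y` form, which some compilers can map to a single `maxsd` each. The sketch below is only an illustration: the names `dMax_ref`, `max2`, `dMax3` and `dMax3_selfcheck` are made up here, whether a given compiler actually emits `maxsd` depends on the compiler and its flags, and NaN or signed-zero inputs are not examined.

```c
#include <assert.h>

/* The branching version from the post, kept as the reference. */
double dMax_ref(double a, double b, double c)
{
    if (a > b) {
        if (a > c) return a;
        return c;
    }
    if (b > c) return b;
    return c;
}

/* Pairwise maximum written in the "x > y ? x : y" form. */
static inline double max2(double x, double y)
{
    return x > y ? x : y;
}

/* Hypothetical rewrite: chain two pairwise maxima. */
double dMax3(double a, double b, double c)
{
    return max2(max2(a, b), c);
}

/* Spot check on ordinary (non-NaN) inputs only. */
void dMax3_selfcheck(void)
{
    assert(dMax3(1.0, 2.0, 3.0) == dMax_ref(1.0, 2.0, 3.0));
    assert(dMax3(3.0, 2.0, 1.0) == dMax_ref(3.0, 2.0, 1.0));
    assert(dMax3(2.0, 3.0, 1.0) == dMax_ref(2.0, 3.0, 1.0));
    assert(dMax3(-1.5, -2.5, -0.5) == dMax_ref(-1.5, -2.5, -0.5));
}
```

Compiling both functions with optimizations on and comparing the disassembly is the only way to know whether the rewrite helps on a particular toolchain.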
|||
![]() |
|
HaHaAnonymous 21 May 2013, 01:02
[ Post removed by author. ]
Last edited by HaHaAnonymous on 28 Feb 2015, 20:24; edited 1 time in total |
|||
![]() |
|
tthsqe 21 May 2013, 01:06
No, no, I am talking about the program that you can download from www.mandelbulber.com, at least version 1.21 anyway. Even though I have the C++ source, I wouldn't even dare try to get it to compile. I assume that the programmers of Mandelbulber would have turned on optimizations for their release version. I'll probably write a version of the mandelbox formula in fasm and see if I can get it faster.
|
|||
![]() |
|
seapoint 31 May 2013, 15:34
Some things are only really possible in an HLL. Are you really going to manually compute a multiplicative inverse every time you want to divide an integer by a constant? Obviously not; at best you're going to use a macro, which in this case is a local use of an HLL.
Another example that comes to mind is a switch on something more complicated than closely spaced integers. A compiler can calculate a perfect hash and make the switch O(1), although I don't think there's a place in C++ for this to happen. Perhaps the .NET JIT can do it? As for general code... don't compare your handwritten assembly to Visual's or GCC's output; they're good but not great. Compare it to Intel's, obviously with all optimisations and relevant instruction sets enabled. |
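To make the division-by-constant point concrete, here is a minimal C sketch of the usual trick for one fixed divisor: an unsigned 32-bit division by 10 becomes a multiplication by ceil(2^35 / 10) = 0xCCCCCCCD followed by a right shift of 35. The function names `div10` and `div10_selfcheck` are made up for this illustration; a compiler derives the constant and shift for each divisor automatically.

```c
#include <assert.h>
#include <stdint.h>

/* x / 10 without a div instruction:
   multiply by ceil(2^35 / 10) in 64 bits, then shift right by 35. */
uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * UINT64_C(0xCCCCCCCD)) >> 35);
}

/* Spot check against the ordinary division operator. */
void div10_selfcheck(void)
{
    static const uint32_t samples[] = {
        0u, 1u, 9u, 10u, 11u, 99u, 100u, 12345u, 1000000007u,
        UINT32_MAX - 1u, UINT32_MAX
    };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(div10(samples[i]) == samples[i] / 10u);
}
```

Each divisor needs its own constant and shift, which is exactly the bookkeeping that is tedious to do by hand and trivial for a compiler or macro.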
|||
![]() |
|
HaHaAnonymous 31 May 2013, 16:34
[ Post removed by author. ]
Last edited by HaHaAnonymous on 28 Feb 2015, 20:14; edited 1 time in total |
|||
![]() |
|
seapoint 31 May 2013, 17:34
HaHaAnonymous wrote:
That's only true in the general sense of "a human can theoretically execute any algorithm". Practically, it's only true for very small sets and algorithms with low complexity. Quote:
The complexity of the extended Euclidean algorithm is lg^2(maximum int+1), which in this case refers to iterations of a loop with few calculations. You would also need to make the divisor coprime with 2^32 or 2^64, which again requires about lg^2(divisor) computations. Assuming you're using a calculator to compute this, each iteration of both algorithms should take you about 10 seconds, so about 24 hours for a 64-bit multiplicative inverse in the worst case... Wasting two or three days would be really unwise when a compiler or some macro can compute this in a millisecond, wouldn't it? Or you could just use i/div and waste a few precious cycles every time the code runs... Quote:
The topic is about "hll", not about gcc. Download Intel C++ and see its output. I don't think it can be beaten on general code targeting Intel CPUs. Vectorized computations can be written using intrinsics, making it as fast as the best assembly. |
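For a sense of how little work the machine needs, here is a C sketch that computes the multiplicative inverse of an odd number modulo 2^32. It deliberately uses Newton's iteration rather than the extended Euclidean algorithm mentioned above, because it fits in a few lines; the function name `inverse_mod_2_32` is made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplicative inverse of an odd d modulo 2^32 (Newton's iteration).
   Starting from x = d, d*x == 1 holds modulo 8 (3 correct low bits);
   every step doubles the number of correct bits: 3, 6, 12, 24, 48. */
uint32_t inverse_mod_2_32(uint32_t d)
{
    assert(d & 1u);               /* even numbers have no inverse mod 2^32 */
    uint32_t x = d;
    for (int i = 0; i < 4; i++)
        x *= 2u - d * x;          /* all arithmetic wraps modulo 2^32 */
    return x;
}
```

With the inverse in hand, an exact division such as 21 / 3 becomes the wrapping multiplication `21u * inverse_mod_2_32(3u)`, which yields 7. An even divisor first has its factors of two shifted out, which is the "make the divisor coprime" step from the post.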
|||
![]() |
|
HaHaAnonymous 31 May 2013, 18:42
[ Post removed by author. ]
Last edited by HaHaAnonymous on 28 Feb 2015, 20:14; edited 1 time in total |
|||
![]() |
|
matefkr 01 Jun 2013, 14:33
The answer to the question is no, of course. Why is this topic so long?
|
|||
![]() |
|
HaHaAnonymous 01 Jun 2013, 20:13
[ Post removed by author. ]
Last edited by HaHaAnonymous on 28 Feb 2015, 20:14; edited 1 time in total |
|||
![]() |
|
Bargest 01 Jun 2013, 21:27
I've tested the newest MSVC compiler and GCC, both with full optimization (tested on a CRC-32 algorithm). They really generate very good code. But I can't say that it is better than hand-written code (I wrote just the same code several days ago).
Of course, it can't be called a serious benchmark. But HLLs have another bad thing for me. While writing in C++ I relax and become too lazy to think about high-level optimizations. I think "Oh, it is compiling! It means that everything is good, I don't need to change anything to improve the code." As a result I get fully-optimized code that weighs ~400 KB and runs very slowly. While writing in assembly, I always try to use the best solution I can imagine, because if I decide to write huge and monstrous code, I will be writing it for a very, very long time. So the conclusion is: if you write GOOD code in an HLL, the compiler IS ABLE to produce good asm code, almost like hand-written. But in an HLL you do not have to write good code. |
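The CRC-32 routine used for that test is not shown in the thread. For anyone who wants something concrete to feed to a compiler and compare against hand-written assembly, here is a minimal bit-at-a-time sketch, assuming the common reflected CRC-32 (polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF). The function name `crc32_bitwise` is made up here and this is not necessarily the variant that was benchmarked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time reflected CRC-32 (polynomial 0xEDB88320). */
uint32_t crc32_bitwise(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* 0u - (crc & 1u) is an all-ones mask when the low bit is set */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}
```

The standard check value for this variant is 0xCBF43926 for the nine ASCII bytes "123456789", which makes it easy to confirm that a hand-written version computes the same function before timing it.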
|||
![]() |
|
AsmGuru62 02 Jun 2013, 12:19
C++ code can be written without using the standard C run-time library.
In such a case it will not be more than a few KBytes. You can take a look at the C++ compiler options, like disabling RTTI and exception handling. That will make your C++ code much smaller: 400k down to 10-12k or even less. |
|||
![]() |
|
HaHaAnonymous 11 Mar 2015, 03:56
Quote:
That is true, while coding in assembly you often try to optimize every instruction, from the highly executed loops to the least used routines. If you are coding seriously... Quote:
Almost, but very far from "hand written asm". Let's have a look at another comparison...
Compiler generated code for a random task written by a regular human:
Code:
 0: 55                    push ebp
 1: 89 e5                 mov ebp,esp
 3: 83 ec 10              sub esp,0x10
 6: 89 5d f0              mov DWORD PTR [ebp-0x10],ebx
 9: 89 75 f4              mov DWORD PTR [ebp-0xc],esi
 c: 89 7d f8              mov DWORD PTR [ebp-0x8],edi
 f: 8b 7d 08              mov edi,DWORD PTR [ebp+0x8]
12: c7 45 fc ff ff ff ff  mov DWORD PTR [ebp-0x4],0xffffffff
19: bb 00 00 00 00        mov ebx,0x0
1e: 8a 04 1f              mov al,BYTE PTR [edi+ebx*1]
21: 84 c0                 test al,al
23: 75 08                 jne 2d
25: e9 a7 00 00 00        jmp d1
2a: 89 f6                 mov esi,esi
2c: 43                    inc ebx
2d: 8a 44 1f 01           mov al,BYTE PTR [edi+ebx*1+0x1]
31: 84 c0                 test al,al
33: 75 f7                 jne 2c
35: ba 00 00 00 00        mov edx,0x0
3a: eb 01                 jmp 3d
3c: 42                    inc edx
3d: 8a 04 17              mov al,BYTE PTR [edi+edx*1]
40: 3c 30                 cmp al,0x30
42: 74 f8                 je 3c
44: c7 45 fc 00 00 00 00  mov DWORD PTR [ebp-0x4],0x0
4b: b0 00                 mov al,0x0
4d: 8d 76 00              lea esi,[esi+0x0]
50: 0f b6 0c 1f           movzx ecx,BYTE PTR [edi+ebx*1]
54: 83 e9 41              sub ecx,0x41
57: 83 f9 06              cmp ecx,0x6
5a: 73 14                 jae 70
5c: 0f b6 34 1f           movzx esi,BYTE PTR [edi+ebx*1]
60: 83 ee 37              sub esi,0x37
63: 0f b6 c8              movzx ecx,al
66: d3 e6                 shl esi,cl
68: 0b 75 fc              or esi,DWORD PTR [ebp-0x4]
6b: 89 75 fc              mov DWORD PTR [ebp-0x4],esi
6e: eb 49                 jmp b9
70: 0f b6 0c 1f           movzx ecx,BYTE PTR [edi+ebx*1]
74: 83 e9 61              sub ecx,0x61
77: 83 f9 06              cmp ecx,0x6
7a: 73 14                 jae 90
7c: 0f b6 34 1f           movzx esi,BYTE PTR [edi+ebx*1]
80: 83 ee 57              sub esi,0x57
83: 0f b6 c8              movzx ecx,al
86: d3 e6                 shl esi,cl
88: 0b 75 fc              or esi,DWORD PTR [ebp-0x4]
8b: 89 75 fc              mov DWORD PTR [ebp-0x4],esi
8e: eb 29                 jmp b9
90: 0f b6 0c 1f           movzx ecx,BYTE PTR [edi+ebx*1]
94: 83 e9 30              sub ecx,0x30
97: 83 f9 0a              cmp ecx,0xa
9a: 73 14                 jae b0
9c: 0f b6 34 1f           movzx esi,BYTE PTR [edi+ebx*1]
a0: 83 ee 30              sub esi,0x30
a3: 0f b6 c8              movzx ecx,al
a6: d3 e6                 shl esi,cl
a8: 0b 75 fc              or esi,DWORD PTR [ebp-0x4]
ab: 89 75 fc              mov DWORD PTR [ebp-0x4],esi
ae: eb 09                 jmp b9
b0: c7 45 fc ff ff ff ff  mov DWORD PTR [ebp-0x4],0xffffffff
b7: eb 18                 jmp d1
b9: 4b                    dec ebx
ba: 39 d3                 cmp ebx,edx
bc: 7c 13                 jl d1
be: 04 04                 add al,0x4
c0: 0f b6 c8              movzx ecx,al
c3: 83 f9 20              cmp ecx,0x20
c6: 7c 88                 jl 50
c8: c7 45 fc ff ff ff ff  mov DWORD PTR [ebp-0x4],0xffffffff
cf: eb 00                 jmp d1
d1: 8b 45 fc              mov eax,DWORD PTR [ebp-0x4]
d4: 8b 5d f0              mov ebx,DWORD PTR [ebp-0x10]
d7: 8b 75 f4              mov esi,DWORD PTR [ebp-0xc]
da: 8b 7d f8              mov edi,DWORD PTR [ebp-0x8]
dd: c9                    leave
de: c3                    ret
Now a different code with the exactly same purpose "hand written asm" by regular human:
Code:
 0: 55                    push ebp
 1: 56                    push esi
 2: 57                    push edi
 3: 89 e5                 mov ebp,esp
 5: 8b 7d 10              mov edi,DWORD PTR [ebp+0x10]
 8: 80 3f 00              cmp BYTE PTR [edi],0x0
 b: 0f 84 ad 00 00 00     je be
11: 80 3f 30              cmp BYTE PTR [edi],0x30
14: 75 05                 jne 1b
16: 83 c7 01              add edi,0x1
19: eb f6                 jmp 11
1b: 89 fe                 mov esi,edi
1d: 80 7f 01 00           cmp BYTE PTR [edi+0x1],0x0
21: 8d 7f 01              lea edi,[edi+0x1]
24: 75 f7                 jne 1d
26: b8 00 00 00 00        mov eax,0x0
2b: b9 00 00 00 00        mov ecx,0x0
30: 29 fe                 sub esi,edi
32: f7 de                 neg esi
34: 83 ef 01              sub edi,0x1
37: 0f b6 17              movzx edx,BYTE PTR [edi]
3a: 83 fa 61              cmp edx,0x61
3d: 72 25                 jb 64
3f: 83 fa 66              cmp edx,0x66
42: 77 70                 ja b4
44: 83 ea 57              sub edx,0x57
47: d3 e2                 shl edx,cl
49: 83 f9 20              cmp ecx,0x20
4c: 74 66                 je b4
4e: 83 c1 04              add ecx,0x4
51: 01 d0                 add eax,edx
53: 83 ee 01              sub esi,0x1
56: 75 dc                 jne 34
58: ba 00 00 00 00        mov edx,0x0
5d: 5f                    pop edi
5e: 5e                    pop esi
5f: 5d                    pop ebp
60: c3                    ret
61: 90                    nop
62: 90                    nop
63: 90                    nop
64: 83 fa 41              cmp edx,0x41
67: 72 23                 jb 8c
69: 83 fa 46              cmp edx,0x46
6c: 77 46                 ja b4
6e: 83 ea 37              sub edx,0x37
71: d3 e2                 shl edx,cl
73: 83 f9 20              cmp ecx,0x20
76: 74 3c                 je b4
78: 83 c1 04              add ecx,0x4
7b: 01 d0                 add eax,edx
7d: 83 ee 01              sub esi,0x1
80: 75 b2                 jne 34
82: ba 00 00 00 00        mov edx,0x0
87: 5f                    pop edi
88: 5e                    pop esi
89: 5d                    pop ebp
8a: c3                    ret
8b: 90                    nop
8c: 83 fa 30              cmp edx,0x30
8f: 72 23                 jb b4
91: 83 fa 39              cmp edx,0x39
94: 77 1e                 ja b4
96: 83 ea 30              sub edx,0x30
99: d3 e2                 shl edx,cl
9b: 83 f9 20              cmp ecx,0x20
9e: 74 14                 je b4
a0: 83 c1 04              add ecx,0x4
a3: 01 d0                 add eax,edx
a5: 83 ee 01              sub esi,0x1
a8: 75 8a                 jne 34
aa: ba 00 00 00 00        mov edx,0x0
af: 5f                    pop edi
b0: 5e                    pop esi
b1: 5d                    pop ebp
b2: c3                    ret
b3: 90                    nop
b4: b8 ff ff ff ff        mov eax,0xffffffff
b9: ba ff ff ff ff        mov edx,0xffffffff
be: 5f                    pop edi
bf: 5e                    pop esi
c0: 5d                    pop ebp
c1: c3                    ret
Now a simple "benchmark": Quote:
Compilers have many advantages, performance is not one of them. At least in this case. |
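The source for this task is not shown in the thread. Going by the constants in the two listings (0x30, 0x41, 0x61, a shift count that grows by 4 up to 0x20, and 0xffffffff on the failure path), both appear to parse a NUL-terminated hexadecimal string into a 32-bit value. The C sketch below is only an interpretation of that behaviour: the name `parse_hex32`, its signature and its separate error return are assumptions, and it walks the string forwards whereas the listings start from the last digit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value of one hex digit, or -1 if ch is not 0-9, a-f or A-F. */
static int hex_digit(unsigned char ch)
{
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

/* Parse a hex string into *out. Returns 0 on success and -1 on an
   empty string, an invalid character, or more than 8 significant digits. */
int parse_hex32(const char *s, uint32_t *out)
{
    assert(out != NULL);
    if (s == NULL || *s == '\0')
        return -1;
    while (*s == '0')             /* leading zeros carry no value */
        s++;
    uint32_t value = 0;
    int digits = 0;
    for (; *s != '\0'; s++) {
        int d = hex_digit((unsigned char)*s);
        if (d < 0 || ++digits > 8)
            return -1;
        value = (value << 4) | (uint32_t)d;
    }
    *out = value;
    return 0;
}
```

Building a routine like this with and without optimizations and comparing the output to a hand-written version is a fairer test than comparing against a single compiler listing of unknown settings.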
|||
![]() |
|
revolution 11 Mar 2015, 04:08
If you are mostly concerned about performance then the hand written code can be made more efficient by replacing "mov reg,0" with "xor reg,reg" and "mov reg,0xffffffff" with "or reg,-1". This may help with cache usage, although proper testing on the target system will show whether or not there is an actual and significant improvement.
|
|||
![]() |
|
HaHaAnonymous 11 Mar 2015, 04:48
Quote: If you are mostly concerned about performance then the hand written code can be made more efficient by replacing "mov reg,0" with "xor reg,reg" and "mov reg,0xffffffff" with "or reg,-1".
Replacing "mov reg,0x00" with "xor reg,reg" added about 200ms to the final result. As for the "mov reg,0xffffffff", that line is never executed unless an error occurs, so I guess it will not make a large impact (if any, not sure).
Code:
; This result is not comparable with the previous post
2315ms ; xor reg,reg
2153ms ; mov reg,0x00000000
I do not know if that is the expected behavior, but that is how it works here. |
|||
![]() |
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.