flat assembler
Message board for the users of flat assembler.

HLL compilers generate better code than hand written asm?

HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1180
Location: Unknown
HaHaAnonymous
[ Post removed by author. ]


Last edited by HaHaAnonymous on 28 Feb 2015, 20:51; edited 1 time in total
Post 09 May 2013, 22:28
Bob++



Joined: 12 Feb 2013
Posts: 92
Bob++
"Well, I think the 13 lines function would be faster than the 22+ lines generated by the compiler. If I am not mistaken. "

But usually, it means nothing.
Post 09 May 2013, 23:45
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1180
Location: Unknown
HaHaAnonymous
[ Post removed by author. ]


Last edited by HaHaAnonymous on 28 Feb 2015, 20:48; edited 1 time in total
Post 10 May 2013, 01:12
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1180
Location: Unknown
HaHaAnonymous
[ Post removed by author. ]


Last edited by HaHaAnonymous on 28 Feb 2015, 20:48; edited 1 time in total
Post 10 May 2013, 03:08
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17714
Location: In your JS exploiting you and your system
revolution
HaHaAnonymous wrote:
Code:
      movss  xmm4,[0x6312d4]
      movss  xmm3,[0x6312d4]
      movss  xmm2,[0x6312d4]
      movss  xmm1,[0x6312d4]
      movss  xmm0,[0x6312d4]
      mov    rdi,[0x63b0a0]
      mov    edx,$0x100000
      mov    esi,$0xff
      movabs rax,[0x63f7a0]
      call   qword [rax]
    


Why not just:
Code:
      movss  xmm4,[0x6312d4]
      movss  xmm3,xmm4
      movss  xmm2,xmm4
      movss  xmm1,xmm4
      movss  xmm0,xmm4
      mov    rdi,[0x63b0a0]
      mov    edx,$0x100000
      mov    esi,$0xff
      movabs rax,[0x63f7a0]
      call   qword [rax]
    

or...
Code:
      movss  xmm4,[0x6312d4]
      movss  xmm3,xmm4
      movss  xmm2,xmm3
      movss  xmm1,xmm2
      movss  xmm0,xmm1
      mov    rdi,[0x63b0a0]
      mov    edx,$0x100000
      mov    esi,$0xff
      movabs rax,[0x63f7a0]
      call   qword [rax]
    


Just tell me what would be faster in your opinion ...
My guess would be the last sequence will be the slowest due to the longest dependency chain. The first two will probably be the same speed due to the read buffer in contemporary CPUs. Probably a better code sequence would be to put the load of RAX first, that way the called function can start executing before the data in the XMM registers has finished loading. But the difference in speed would be rather tiny and hard to measure. Depending upon the surrounding code and other factors there may be no measurable difference, and just posting a small snippet like this doesn't really prove anything.

Is that the answer you were looking for?
HaHaAnonymous wrote:

My definitive answer to this topic is:
Quote:
HLL compilers generate better code than hand written asm?

No.
My answer would be: it depends.

And I know that saying "it depends" is not a satisfying answer but it is the most accurate answer. If the programmer does not know how the CPU really works then there is very little chance to generate well optimised hand written code. Some of the latest compilers generate pretty decent code in most circumstances. And some of the latest programmers generate some pretty horrendous asm code in certain circumstances. There is no one-size-fits-all blanket answer here.
Post 10 May 2013, 04:21
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1180
Location: Unknown
HaHaAnonymous
[ Post removed by author. ]


Last edited by HaHaAnonymous on 28 Feb 2015, 20:48; edited 1 time in total
Post 10 May 2013, 14:01
tthsqe



Joined: 20 May 2009
Posts: 730
tthsqe
I was playing around with a very nice program Mandelbulber, and I came across the following situation.
SOURCE:
Code:
double dMax(double a, double b, double c)
{
        if(a > b) {
                if (a > c)
                        return a;
                return c;
        }
        if (b > c)
                return b;
        return c;
}

double dMin(double a, double b, double c)
{
        if(a < b) {
                if (a < c)
                        return a;
                return c;
        }
        if (b < c)
                return b;
        return c;
}    

Now, any human would do
Code:
dMax:
maxsd   xmm0,xmm1
maxsd   xmm0,xmm2
ret    

but take a look at what is in the EXE:
Code:
dMax:      ;0x042D450
movapd  xmm3,xmm2
sub     rsp,8
maxsd   xmm2,xmm0
add     rsp,8
maxsd   xmm3,xmm1
cmpltsd xmm1,xmm0
movapd  xmm0,xmm1
addpd   xmm2,xmm1
andnpd  xmm0,xmm3
orpd    xmm0,xmm2
ret

dMin:      ;0x042D480
movapd  xmm3,xmm2
sub     rsp,8
minsd   xmm2,xmm0
add     rsp,8
minsd   xmm3,xmm1
cmpltsd xmm0,xmm1
movapd  xmm0,xmm1
addpd   xmm2,xmm0
andnpd  xmm0,xmm3
orpd    xmm0,xmm2
ret            

So close but so far. Pitiful compiler...
I am no longer wondering why the render times are so high.
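
For comparison, the two-instruction version corresponds to writing the source as two conditional expressions. The sketch below keeps the same comparisons as the posted dMax/dMin, so it should return the same values. dMax3 and dMin3 are made-up names, and whether a particular compiler turns them into a pair of maxsd/minsd is something to check in its output, not a guarantee.

```c
/* Same comparisons as the posted dMax/dMin, written as conditional
   expressions. dMax3/dMin3 are hypothetical names for illustration. */
double dMax3(double a, double b, double c)
{
    double m = (a > b) ? a : b;   /* picks b whenever a > b is false */
    return (m > c) ? m : c;       /* picks c whenever m > c is false */
}

double dMin3(double a, double b, double c)
{
    double m = (a < b) ? a : b;   /* picks b whenever a < b is false */
    return (m < c) ? m : c;       /* picks c whenever m < c is false */
}
```

With m = (a > b) ? a : b, the second comparison sees exactly the value the nested ifs would have compared against c, which is why the results agree with the original functions.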
Post 21 May 2013, 00:39
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1180
Location: Unknown
HaHaAnonymous
[ Post removed by author. ]


Last edited by HaHaAnonymous on 28 Feb 2015, 20:24; edited 1 time in total
Post 21 May 2013, 01:02
tthsqe



Joined: 20 May 2009
Posts: 730
tthsqe
No, no, I am talking about the program that you can download from www.mandelbulber.com, at least version 1.21 anyway. Even though I have the C++ source, I wouldn't even dare try to get it to compile. I assume that the programmers of Mandelbulber would have turned on optimizations for their release version. I'll probably write a version of the mandelbox formula in fasm and see if I can get it faster.
Post 21 May 2013, 01:06
seapoint



Joined: 25 May 2013
Posts: 8
seapoint
Some things are only really possible in an HLL. Are you really going to manually compute multiplicative inverse every time you want to divide an integer by a constant? Obviously not; at best you're going to use a macro, which in this case is a local use of an HLL.
Another example that comes to mind is a switch over something more complicated than closely spaced integers. A compiler can calculate a perfect hash and make the switch O(1), although I don't think there's a place in C++ for this to happen. Perhaps the .NET JIT can do it?
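
As an illustration of the division case: for unsigned 32-bit division by 10, the constant that compilers commonly emit is a fixed-point reciprocal, not something you would want to derive by hand each time. A minimal sketch, where udiv10 is a hypothetical name:

```c
#include <stdint.h>

/* Unsigned 32-bit division by the constant 10 without a div instruction.
   0xCCCCCCCD is ceil(2^35 / 10); the 64-bit product shifted right by 35
   equals x / 10 for every 32-bit x. udiv10 is a hypothetical name. */
uint32_t udiv10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

Each divisor needs its own constant and shift count, which is the part a compiler or a macro works out for you.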

As for general code... don't compare your handwritten assembly to Visual C++'s or GCC's output; they're good but not great. Compare it to Intel's, obviously with all optimisations and relevant instruction sets enabled.
Post 31 May 2013, 15:34
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1180
Location: Unknown
HaHaAnonymous
[ Post removed by author. ]


Last edited by HaHaAnonymous on 28 Feb 2015, 20:14; edited 1 time in total
Post 31 May 2013, 16:34
seapoint



Joined: 25 May 2013
Posts: 8
seapoint
HaHaAnonymous wrote:

I disagree 100% with this. If it can be done in HLL it can be done in assembly in one way or another. And that's more than a fact as HLL compilers translate the high level code to assembly source. And assembly language has a "1:1" correspondence to the generated bits of the executable file as well (if not, please let me know).

That's only true in the general sense of "a human can theoretically execute any algorithm". Practically, it's only true for very small sets and algorithms with low complexity.

Quote:
Quote:

Are you really going to manually compute multiplicative inverse every time you want to divide an integer by a constant?

Obviously yes.

The complexity of the extended Euclidean algorithm is lg^2(maximum int+1), which in this case refers to iterations of a loop with few calculations.
You would also need to make the divisor coprime with 2^32 or 2^64, which again requires about lg^2(divisor) computations.
Assuming you're using a calculator to compute this, each iteration of both algorithms should take you about 10 seconds, so about 24 hours for a 64-bit multiplicative inverse in the worst case...

Wasting two/three days would be really unwise when a compiler or some macro can compute this in a millisecond, wouldn't it?

Or you could just use i/div and waste a few precious cycles every time the code runs...
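
For reference, the inverse modulo 2^32 of an odd divisor can also be computed with a few multiplications using Newton's iteration, which is a different technique from the extended Euclidean algorithm discussed above. A sketch with the hypothetical names modinv32 and exact_div; note that multiplying by this inverse only gives the right quotient when the dividend is an exact multiple of the divisor:

```c
#include <assert.h>
#include <stdint.h>

/* Inverse of an odd d modulo 2^32 by Newton's iteration (not the extended
   Euclidean algorithm). Each step doubles the number of correct low bits:
   3 -> 6 -> 12 -> 24 -> 48, so four steps cover 32 bits. */
uint32_t modinv32(uint32_t d)
{
    assert(d & 1u);               /* only odd numbers are invertible mod 2^32 */
    uint32_t x = d;               /* d*d == 1 (mod 8), so 3 bits are correct */
    for (int i = 0; i < 4; i++)
        x *= 2u - d * x;          /* unsigned arithmetic wraps modulo 2^32 */
    return x;
}

/* Exact division: valid only when n is a multiple of the odd divisor d. */
uint32_t exact_div(uint32_t n, uint32_t d)
{
    return n * modinv32(d);
}
```

For an even divisor, the factors of two would be shifted out first, which is the "make the divisor coprime" step mentioned above.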

Quote:

Impossible to avoid a comparison when opposite part is doing the same. And it is not my fault most people prefer GCC and put it on top of all "benchmarks".

The topic is about HLLs, not about GCC. Download Intel C++ and look at its output. I don't think it can be beaten on general code targeting Intel CPUs. Vectorized computations can be written using intrinsics, making them as fast as the best assembly.
Post 31 May 2013, 17:34
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1180
Location: Unknown
HaHaAnonymous
[ Post removed by author. ]


Last edited by HaHaAnonymous on 28 Feb 2015, 20:14; edited 1 time in total
Post 31 May 2013, 18:42
matefkr



Joined: 02 Sep 2007
Posts: 1291
Location: Ukraine, Beregovo
matefkr
The answer to the question is no, of course. Why is this topic so long?
Post 01 Jun 2013, 14:33
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1180
Location: Unknown
HaHaAnonymous
[ Post removed by author. ]


Last edited by HaHaAnonymous on 28 Feb 2015, 20:14; edited 1 time in total
Post 01 Jun 2013, 20:13
Bargest



Joined: 09 Feb 2012
Posts: 79
Location: Russia
Bargest
I've tested the newest MSVC compiler and GCC, both with full optimization (tested on a CRC-32 algorithm). They really generate very good code. But I can't say that it is better than hand-written code (I wrote just the same code several days ago).
Of course, it can't be called a serious benchmark.
But HLLs have another drawback for me. While writing in C++ I relax and become too lazy to think about high-level optimizations. I think, "Oh, it is compiling! It means that everything is good, I don't need to change anything to improve the code." As a result I get fully-optimized code that weighs ~400 KB and runs very slowly.
While writing in assembly, I always try to use best solution I can imagine, because if I decide to write huge and monstrous code, I will write it for a very, very long time.

So the conclusion is: if you write GOOD code in HLL, compiler IS ABLE to produce good asm code, almost like hand-written. But in HLL you do not have to write good code.
Post 01 Jun 2013, 21:27
AsmGuru62



Joined: 28 Jan 2004
Posts: 1419
Location: Toronto, Canada
AsmGuru62
C++ code can be written without using the standard C run-time library.
In such a case it will not be more than a few KBytes.
You can take a look at C++ compiler options, like disabling RTTI and exception handling.
That will make your C++ code much smaller:
from 400 KB down to 10-12 KB or even less.
Post 02 Jun 2013, 12:19
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1180
Location: Unknown
HaHaAnonymous
Quote:

While writing in assembly, I always try to use best solution I can imagine, because if I decide to write huge and monstrous code, I will write it for a very, very long time.

That is true, while coding in assembly you often try to optimize every instruction, from the highly executed loops to the least used routines. If you are coding seriously...

Quote:

if you write GOOD code in HLL, compiler IS ABLE to produce good asm code, almost like hand-written.

Almost, but very far from "hand written asm".

Let's have a look at another comparison...

Compiler generated code for a random task written by a regular human:
Code:
   0:   55                      push   ebp
   1:   89 e5                   mov    ebp,esp
   3:   83 ec 10                sub    esp,0x10
   6:   89 5d f0                mov    DWORD PTR [ebp-0x10],ebx
   9:   89 75 f4                mov    DWORD PTR [ebp-0xc],esi
   c:   89 7d f8                mov    DWORD PTR [ebp-0x8],edi
   f:   8b 7d 08                mov    edi,DWORD PTR [ebp+0x8]
  12:   c7 45 fc ff ff ff ff    mov    DWORD PTR [ebp-0x4],0xffffffff
  19:   bb 00 00 00 00          mov    ebx,0x0
  1e:   8a 04 1f                mov    al,BYTE PTR [edi+ebx*1]
  21:   84 c0                   test   al,al
  23:   75 08                   jne    2d
  25:   e9 a7 00 00 00          jmp    d1
  2a:   89 f6                   mov    esi,esi
  2c:   43                      inc    ebx
  2d:   8a 44 1f 01             mov    al,BYTE PTR [edi+ebx*1+0x1]
  31:   84 c0                   test   al,al
  33:   75 f7                   jne    2c
  35:   ba 00 00 00 00          mov    edx,0x0
  3a:   eb 01                   jmp    3d
  3c:   42                      inc    edx
  3d:   8a 04 17                mov    al,BYTE PTR [edi+edx*1]
  40:   3c 30                   cmp    al,0x30
  42:   74 f8                   je     3c
  44:   c7 45 fc 00 00 00 00    mov    DWORD PTR [ebp-0x4],0x0
  4b:   b0 00                   mov    al,0x0
  4d:   8d 76 00                lea    esi,[esi+0x0]
  50:   0f b6 0c 1f             movzx  ecx,BYTE PTR [edi+ebx*1]
  54:   83 e9 41                sub    ecx,0x41
  57:   83 f9 06                cmp    ecx,0x6
  5a:   73 14                   jae    70
  5c:   0f b6 34 1f             movzx  esi,BYTE PTR [edi+ebx*1]
  60:   83 ee 37                sub    esi,0x37
  63:   0f b6 c8                movzx  ecx,al
  66:   d3 e6                   shl    esi,cl
  68:   0b 75 fc                or     esi,DWORD PTR [ebp-0x4]
  6b:   89 75 fc                mov    DWORD PTR [ebp-0x4],esi
  6e:   eb 49                   jmp    b9
  70:   0f b6 0c 1f             movzx  ecx,BYTE PTR [edi+ebx*1]
  74:   83 e9 61                sub    ecx,0x61
  77:   83 f9 06                cmp    ecx,0x6
  7a:   73 14                   jae    90
  7c:   0f b6 34 1f             movzx  esi,BYTE PTR [edi+ebx*1]
  80:   83 ee 57                sub    esi,0x57
  83:   0f b6 c8                movzx  ecx,al
  86:   d3 e6                   shl    esi,cl
  88:   0b 75 fc                or     esi,DWORD PTR [ebp-0x4]
  8b:   89 75 fc                mov    DWORD PTR [ebp-0x4],esi
  8e:   eb 29                   jmp    b9
  90:   0f b6 0c 1f             movzx  ecx,BYTE PTR [edi+ebx*1]
  94:   83 e9 30                sub    ecx,0x30
  97:   83 f9 0a                cmp    ecx,0xa
  9a:   73 14                   jae    b0
  9c:   0f b6 34 1f             movzx  esi,BYTE PTR [edi+ebx*1]
  a0:   83 ee 30                sub    esi,0x30
  a3:   0f b6 c8                movzx  ecx,al
  a6:   d3 e6                   shl    esi,cl
  a8:   0b 75 fc                or     esi,DWORD PTR [ebp-0x4]
  ab:   89 75 fc                mov    DWORD PTR [ebp-0x4],esi
  ae:   eb 09                   jmp    b9
  b0:   c7 45 fc ff ff ff ff    mov    DWORD PTR [ebp-0x4],0xffffffff
  b7:   eb 18                   jmp    d1
  b9:   4b                      dec    ebx
  ba:   39 d3                   cmp    ebx,edx
  bc:   7c 13                   jl     d1
  be:   04 04                   add    al,0x4
  c0:   0f b6 c8                movzx  ecx,al
  c3:   83 f9 20                cmp    ecx,0x20
  c6:   7c 88                   jl     50
  c8:   c7 45 fc ff ff ff ff    mov    DWORD PTR [ebp-0x4],0xffffffff
  cf:   eb 00                   jmp    d1
  d1:   8b 45 fc                mov    eax,DWORD PTR [ebp-0x4]
  d4:   8b 5d f0                mov    ebx,DWORD PTR [ebp-0x10]
  d7:   8b 75 f4                mov    esi,DWORD PTR [ebp-0xc]
  da:   8b 7d f8                mov    edi,DWORD PTR [ebp-0x8]
  dd:   c9                      leave
  de:   c3                      ret
    


Now different code with exactly the same purpose, "hand written asm" by a regular human:
Code:
   0:   55                      push   ebp
   1:   56                      push   esi
   2:   57                      push   edi
   3:   89 e5                   mov    ebp,esp
   5:   8b 7d 10                mov    edi,DWORD PTR [ebp+0x10]
   8:   80 3f 00                cmp    BYTE PTR [edi],0x0
   b:   0f 84 ad 00 00 00       je     be
  11:   80 3f 30                cmp    BYTE PTR [edi],0x30
  14:   75 05                   jne    1b
  16:   83 c7 01                add    edi,0x1
  19:   eb f6                   jmp    11
  1b:   89 fe                   mov    esi,edi
  1d:   80 7f 01 00             cmp    BYTE PTR [edi+0x1],0x0
  21:   8d 7f 01                lea    edi,[edi+0x1]
  24:   75 f7                   jne    1d
  26:   b8 00 00 00 00          mov    eax,0x0
  2b:   b9 00 00 00 00          mov    ecx,0x0
  30:   29 fe                   sub    esi,edi
  32:   f7 de                   neg    esi
  34:   83 ef 01                sub    edi,0x1
  37:   0f b6 17                movzx  edx,BYTE PTR [edi]
  3a:   83 fa 61                cmp    edx,0x61
  3d:   72 25                   jb     64
  3f:   83 fa 66                cmp    edx,0x66
  42:   77 70                   ja     b4
  44:   83 ea 57                sub    edx,0x57
  47:   d3 e2                   shl    edx,cl
  49:   83 f9 20                cmp    ecx,0x20
  4c:   74 66                   je     b4
  4e:   83 c1 04                add    ecx,0x4
  51:   01 d0                   add    eax,edx
  53:   83 ee 01                sub    esi,0x1
  56:   75 dc                   jne    34
  58:   ba 00 00 00 00          mov    edx,0x0
  5d:   5f                      pop    edi
  5e:   5e                      pop    esi
  5f:   5d                      pop    ebp
  60:   c3                      ret
  61:   90                      nop
  62:   90                      nop
  63:   90                      nop
  64:   83 fa 41                cmp    edx,0x41
  67:   72 23                   jb     8c
  69:   83 fa 46                cmp    edx,0x46
  6c:   77 46                   ja     b4
  6e:   83 ea 37                sub    edx,0x37
  71:   d3 e2                   shl    edx,cl
  73:   83 f9 20                cmp    ecx,0x20
  76:   74 3c                   je     b4
  78:   83 c1 04                add    ecx,0x4
  7b:   01 d0                   add    eax,edx
  7d:   83 ee 01                sub    esi,0x1
  80:   75 b2                   jne    34
  82:   ba 00 00 00 00          mov    edx,0x0
  87:   5f                      pop    edi
  88:   5e                      pop    esi
  89:   5d                      pop    ebp
  8a:   c3                      ret
  8b:   90                      nop
  8c:   83 fa 30                cmp    edx,0x30
  8f:   72 23                   jb     b4
  91:   83 fa 39                cmp    edx,0x39
  94:   77 1e                   ja     b4
  96:   83 ea 30                sub    edx,0x30
  99:   d3 e2                   shl    edx,cl
  9b:   83 f9 20                cmp    ecx,0x20
  9e:   74 14                   je     b4
  a0:   83 c1 04                add    ecx,0x4
  a3:   01 d0                   add    eax,edx
  a5:   83 ee 01                sub    esi,0x1
  a8:   75 8a                   jne    34
  aa:   ba 00 00 00 00          mov    edx,0x0
  af:   5f                      pop    edi
  b0:   5e                      pop    esi
  b1:   5d                      pop    ebp
  b2:   c3                      ret
  b3:   90                      nop
  b4:   b8 ff ff ff ff          mov    eax,0xffffffff
  b9:   ba ff ff ff ff          mov    edx,0xffffffff
  be:   5f                      pop    edi
  bf:   5e                      pop    esi
  c0:   5d                      pop    ebp
  c1:   c3                      ret
    

Now a simple "benchmark":
Quote:

3611ms ; compiler generated code with highest possible optimization
2327ms ; hand written asm code

Compilers have many advantages, performance is not one of them. At least in this case.
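
For readers who do not want to trace the listings: both appear to convert a NUL-terminated hexadecimal string to a 32-bit value, skipping leading zeros and reporting an error on an empty string, a non-hex character, or more than eight significant digits. The C sketch below is a reconstruction from reading the disassembly, not the original source. parse_hex32 and its bool/out-parameter interface are assumptions, and it walks the string forwards whereas the listings work from the last digit backwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reconstruction: parses a hex string (no "0x" prefix) into
   *out. Returns false on empty input, a non-hex character, or more than
   eight significant digits. */
bool parse_hex32(const char *s, uint32_t *out)
{
    assert(s != NULL && out != NULL);
    if (*s == '\0')
        return false;                 /* empty string is an error */
    while (*s == '0')                 /* skip leading zeros */
        s++;
    size_t len = strlen(s);
    if (len > 8)
        return false;                 /* would not fit in 32 bits */
    uint32_t value = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        uint32_t digit;
        if (c >= '0' && c <= '9')
            digit = (uint32_t)(c - '0');
        else if (c >= 'a' && c <= 'f')
            digit = (uint32_t)(c - 'a') + 10u;
        else if (c >= 'A' && c <= 'F')
            digit = (uint32_t)(c - 'A') + 10u;
        else
            return false;             /* not a hex digit */
        value = (value << 4) | digit; /* append four bits per digit */
    }
    *out = value;
    return true;
}
```

Returning the status separately avoids the ambiguity of using 0xffffffff both as an error code and as a valid result, which the hand written version seems to handle by also setting edx.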
Post 11 Mar 2015, 03:56
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17714
Location: In your JS exploiting you and your system
revolution
If you are mostly concerned about performance then the hand written code can be made more efficient by replacing "mov reg,0" with "xor reg,reg" and "mov reg,0xffffffff" with "or reg,-1". This may help with cache usage, although proper testing on the target system will show whether or not there is an actual and significant improvement.
Post 11 Mar 2015, 04:08
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1180
Location: Unknown
HaHaAnonymous
Quote:
If you are mostly concerned about performance then the hand written code can be made more efficient by replacing "mov reg,0" with "xor reg,reg" and "mov reg,0xffffffff" with "or reg,-1".

Replacing "mov reg,0x00" with "xor reg,reg" added about 200ms to the final result. As for the "mov reg,0xffffffff", that line is never executed unless an error occurs, so I guess it will not make a large impact (if any, not sure).
Code:
; This result is not comparable with the previous post
2315ms ; xor reg,reg
2153ms ; mov reg,0x00000000
    

I do not know if that is the expected behavior, but that is how it works here.
Post 11 Mar 2015, 04:48
Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.
