flat assembler
Message board for the users of flat assembler.

Index > High Level Languages > Human vs GCC -O3

sakeniwefu



Joined: 23 Mar 2008
Posts: 29
Today I wrote a program in C and its exact equivalent in FASM using libc. The program copies file A to file B bit by bit.

With code hints (inline, register, etc.), -O2 and -O3 beat me at code generation (10% faster, though 200 bytes larger). Without code hints only -O3 beat me, but the resulting file was the same, meaning -O3 is pretty good at optimization.

How many of you fasmers have ever beaten -O3 in a similar test?

Any link on the kind of optimizations -O3 does to humiliate me?

_________________
Warning: C code found, deleting.
Post 31 Mar 2008, 01:31
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
You should generate the compiler's assembly output (the -S switch, I think) and take a look at it.
Post 31 Mar 2008, 01:45
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17671
Location: In your JS exploiting you and your system
The last time I tried to optimise some code I found it pretty easy to "beat" the compiler by more than double. But it all depends upon what you are coding. Often cache management makes the largest impact, second would usually be instruction selection and timing, and third would be register handling.

In the non-x86 environment, with ARM, I coded a CRC32 that was 13 times faster than standard compiled code, mainly due to prefetching and cache awareness that the HLL did not know about. Indeed I proved it was the fastest possible way to do CRC because it was completely memory bound and left the memory bus with zero idle cycles.

Of course this all assumes that you already have the right algorithm, which is always the most important thing to improve.
Post 31 Mar 2008, 01:47
sakeniwefu



Joined: 23 Mar 2008
Posts: 29
Thank you for your replies! I was aware of the -S switch, but the generated code isn't commented and is very different from my implementation, which makes it difficult to follow.

revolution wrote:
Of course this all assumes that you already have the right algorithm, which is always the most important thing to improve.

Indeed, I could certainly optimize the algorithm, but then it wouldn't be an exact equivalent any more, as I might as well rewrite the C code too.

I am more interested in non-obvious optimizations of a given algorithm (read: compiler writing, and inline assembly techniques).

Okay, I think I found exactly what I was asking for.

_________________
Warning: C code found, deleting.
Post 31 Mar 2008, 11:25
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17671
Location: In your JS exploiting you and your system
sakeniwefu wrote:
Okay, I think I found exactly what I was asking for.
Oh, you hadn't seen that; yeah, it is quite good. But don't forget the Intel and AMD optimisation manuals, which also have a few nice little tips and explanations of techniques.

And in case you don't know where to get the optimisation manuals, check out my website; I have all the information there.
Post 31 Mar 2008, 11:34
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17671
Location: In your JS exploiting you and your system
sakeniwefu wrote:
Indeed, I could certainly optimize the algorithm, but then it wouldn't be an exact equivalent any more, as I might as well rewrite the C code too.
Sometimes algo1 fits nicely into assembly while algo2 matches the HLL. Try not to be too fixated on keeping exactly the same algorithm; languages can't always express ideas precisely, and an idea has to be presented to the target language in a best-fit manner.
Post 31 Mar 2008, 11:39
sakeniwefu



Joined: 23 Mar 2008
Posts: 29
revolution wrote:
And in case you don't know where to get the optimisation manuals, check out my website; I have all the information there.
LOL, I had AMD's lying around, but Google didn't help with Intel's (I found it on Intel's page in the end), and while looking for it I stumbled onto that link.

_________________
Warning: C code found, deleting.
Post 31 Mar 2008, 12:45
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17671
Location: In your JS exploiting you and your system
sakeniwefu wrote:
...but Google didn't help with Intel's ...
I thought Google was God and knew everything. Oh, you have just broken all my beliefs. I need to have a lie down now.

Actually, if you know the right search string then of course Google does know everything. But then, if you already knew EXACTLY what to search for, why did you need Google? Hmm... I'll have to think about that for a while.
Post 31 Mar 2008, 12:56
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
Umm, you worry about optimization for a file copying program? Code generation shouldn't really matter for such an app, since you're supposed to be 100% I/O bound.
Post 31 Mar 2008, 14:15
sakeniwefu



Joined: 23 Mar 2008
Posts: 29
f0dder wrote:
Umm, you worry about optimization for a file copying program? Code generation shouldn't really matter for such an app, since you're supposed to be 100% I/O bound.


Did you read the bit-by-bit part? The idea is to read and write arbitrary-length strings of bits for use in some encryption and compression algorithms. Also, as I said, I wasn't optimizing the algorithm but the code, and bad code shows all the same (unoptimized C is 4 times slower than optimized C or than my assembly).

Both of my implementations ARE slow for this test scenario, but they work just like the C stdio functions (getb, putb, getbw, putbw, fsetposb), so they let me rapidly prototype any new algorithm that crosses my mind, with random access to an arbitrary amount of data and using 0 bytes of memory. The algorithms (at least my asm version) can easily be adapted to use memory buffers by just stripping out the file I/O part, when and if I feel inclined to do it. Then I will be battling GCC again to get the most optimal memory management routines for the task.

I would have liked my fasm version to win, as I feel more confident doing data manipulation in assembly than in crippled C syntax (you can write a terrible security hole in less than one statement, but you need three to rotate a dword). That is one of the reasons I am willing to learn advanced optimization. The other is that I have my own ideas about compiler design.

So, to sum it up: the small, important part is 100% not I/O bound and is independent of the file I/O.

_________________
Warning: C code found, deleting.
Post 31 Mar 2008, 21:11
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
Ah, I didn't think you meant the "bit by bit" part literally... if you had written "arbitrary bit lengths" from the start it would have been clearer. Sorry!
Post 31 Mar 2008, 23:01
DOS386



Joined: 08 Dec 2006
Posts: 1901
Quote:
With code hints (inline, register, etc.), -O2 and -O3 beat me at code generation (10% faster, though 200 bytes larger). Without code hints only -O3 beat me, but the resulting file was the same, meaning -O3 is pretty good at optimization.


Post the "offending" code, otherwise I will hardly believe this.
Post 03 Apr 2008, 00:31
sakeniwefu



Joined: 23 Mar 2008
Posts: 29
This is the compiler-generated code for the functions. I am sure you fasmers can get something out of it, but I couldn't figure out why it was faster, and I eventually deleted my own code.
Sorry for the nonstandard code; it is supposed to be Intel syntax.
Code:
putb:
        push    %ebp
        xor     %edx, %edx
        mov     %ebp, %esp
        sub     %esp, 8
        mov     %eax, DWORD PTR byte.2072
        add     %eax, %eax                      ;
        or      %eax, DWORD PTR [%ebp+12]       ;
        movzx   %ecx, %al                       ; What? why not mov ecx,eax?
        mov     %eax, DWORD PTR curb.2071       ; is movzx faster?
        mov     DWORD PTR byte.2072, %ecx
        sub     %eax, 1
        mov     DWORD PTR curb.2071, %eax
        add     %eax, 1
        je      .L9
.L4:
        leave
        mov     %eax, %edx
        ret
.L9:
        mov     %eax, DWORD PTR [%ebp+8]
        mov     DWORD PTR curb.2071, 7
        mov     DWORD PTR [%esp], %ecx
        mov     DWORD PTR [%esp+4], %eax
        call    fputc
        mov     %edx, -1
        add     %eax, 1
        je      .L4
        xor     %edx, %edx
        mov     DWORD PTR byte.2072, 0
        jmp     .L4

getb:
        push    %ebp
        mov     %ebp, %esp
        sub     %esp, 8
        cmp     DWORD PTR curb.2055, -1
        je      .L16
.L11:
        mov     %eax, DWORD PTR byte.2056
        sub     DWORD PTR curb.2055, 1
        lea     %edx, [%eax+%eax]               ; WTF is this?
        sar     %eax, 7                         ; is this supposed to do a
        or      %edx, %eax                      ; rol eax,32??
        movzx   %eax, %dl                       ; and eax,1
        and     %edx, 1                         ;
        mov     DWORD PTR byte.2056, %eax
.L14:
        leave
        mov     %eax, %edx
        ret
.L16:
        mov     %eax, DWORD PTR [%ebp+8]
        mov     DWORD PTR curb.2055, 7
        mov     DWORD PTR [%esp], %eax
        call    _IO_getc
        mov     %edx, -1
        mov     DWORD PTR byte.2056, %eax
        add     %eax, 1
        jne     .L11
        jmp     .L14

_________________
Warning: C code found, deleting.
Post 03 Apr 2008, 10:46
bitRAKE



Joined: 21 Jul 2003
Posts: 3050
Location: vpcmipstrm
Bit strings are a native type on x86, but when working with strings longer than a few bits, shifting and masking is faster. The code you posted looks like some kind of adaptor between bit strings and character-based I/O? If that is the case, it's massive overkill as abstractions go. Also, it seems odd that you deleted your code, which would have been very helpful. Maybe you can post the C code, or explain the interface? What are the parameters passed to the routines? What are the definitions of the data they access?

_________________
¯\(°_o)/¯ unlicense.org
Post 03 Apr 2008, 16:36
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
sakeniwefu, what CPU family? What version of GCC? What OS version? (All of that makes a big difference.)

From the code, I'd weakly bet that (data fetch??) alignment and file buffering are what really give the speed, not anything inherently better instruction-wise (although it's pretty hard to really know what you're doing regarding caches, pipelines, etc.). GCC 4.x is better at -O2 than 3.x anyway (and the latest version's -march=native gives the best output). Face it, if GCC were completely dumb, nobody would use it. It certainly has its uses.

http://en.wikipedia.org/wiki/Compiler_optimization

You can try "gcc -Q -v -O3". Or see the "Optimization Options" section of your GCC info file for a complete list.
Post 03 Apr 2008, 20:09
sakeniwefu



Joined: 23 Mar 2008
Posts: 29
getb(FILE*), putb(FILE*,bit). (I realised afterwards that C's putc is (char, FILE*) when called from C.)

Basically each function has two variables, byte and curb (current byte), and gets or puts a char through stdio.h's getc and putc when the bit counter reaches 0. The functions return 0, 1, x, or -1 if there was an error.

In the actual implementation I have for WinAPI asm, fputb(*,MAGIC_NUMBER) "burns" the last byte. However, I left it out of this test because it wasn't needed to copy bytes. As I said, I was learning optimization techniques.

This is it. I knew in advance it's not an optimal solution. I suppose that in the same space I could do an fread to a buffer and work from there, but I thought: what if I access a big file randomly? The bigger the buffer, the worse. Which buffer size would be optimal? malloc or db? So I decided I would care about that once I had learned to optimize the opcodes.

I deleted the code because I was very frustrated. I spent a day changing instructions, and the changes in performance were only for the worse. Maybe I should have been working from the compiler-generated output, but I wanted it to be a fair competition.

My code was doing more or less the same as this code (of course, I designed both to be equal), but passing the arguments through registers (I would use it from fasm, after all). Eventually I also inlined the functions through a macro and unrolled the loop four times (empirically this was the fastest). I even tried passing a reference to the variable space in edi so I could mov [edi+x] the variables, but there was no noticeable improvement. I stalled at 15% slower.

Also, I don't believe the disassembly output. It says ret at the end, but the real binary must be inlining the functions, because the call overhead alone doubles the execution time.
Post 03 Apr 2008, 21:37
sakeniwefu



Joined: 23 Mar 2008
Posts: 29
Thanks, rugxulo!
That's my conclusion as well, but I don't know enough to implement it. I'll keep studying and trying.

Oh, and it is GCC 4.3 at -O3, running under Debian Unstable on a Pentium 4, with no processor-specific flags.
Post 03 Apr 2008, 21:53
bitRAKE



Joined: 21 Jul 2003
Posts: 3050
Location: vpcmipstrm
sakeniwefu wrote:
This is it. I knew in advance it's not an optimal solution. I suppose that in the same space I could do an fread to a buffer and work from there, but I thought: what if I access a big file randomly? The bigger the buffer, the worse. Which buffer size would be optimal? malloc or db? So I decided I would care about that once I had learned to optimize the opcodes.
That's the crux of the problem. You've approached the code at such a small granularity that any 'comparison' is doomed. Random bit access is the worst case; avoid it like the plague. Refactor the algorithm, compress the data and work on larger chunks, or, if you absolutely must, load the whole file and use the bit-string instructions; nothing is faster.

Okay, so you want to do it anyway. All file I/O reads or writes a sector, 512 bytes or more, like it or not. The OS will cache, but how many layers do you want to go through? The compiler is caching 8 bits and only calling the I/O routines when absolutely required. Did you do that in your code?

_________________
¯\(°_o)/¯ unlicense.org
Post 04 Apr 2008, 06:09


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.
