flat assembler
Message board for the users of flat assembler.
Index > High Level Languages > Human vs GCC -O3
vid 31 Mar 2008, 01:45
You should generate the compiler's assembly output (the -S switch, I think - e.g. "gcc -S -O3 file.c") and take a look at it.
revolution 31 Mar 2008, 01:47
The last time I tried to optimise some code I found it pretty easy to "beat" the compiler by more than double. But it all depends upon what you are coding. Often cache management makes the largest impact, second would usually be instruction selection and timing, and third would be register handling.
In the non-x86 environment, with ARM, I coded a CRC32 that was 13 times faster than standard compiled code, mainly due to prefetching and cache awareness that the HLL did not know about. Indeed, I proved it was the fastest possible way to do CRC there, because it was completely memory bound and left the memory bus with zero idle cycles. Of course, this all assumes that you already have the right algorithm, which is always the most important thing to improve.
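For readers who want to experiment with this kind of cache awareness from C, GCC exposes software prefetching as a built-in. A minimal sketch - the function and the prefetch distance are illustrative, not revolution's actual ARM code:
Code:
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_DIST 256  /* bytes ahead; a tuning knob, not a recipe */

/* Illustrative only: sum a buffer while hinting the cache to fetch
   data PREFETCH_DIST bytes ahead.  Real code would tune the distance
   per CPU and memory subsystem. */
uint32_t sum_words(const uint32_t *buf, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if ((i & 15) == 0)  /* one hint per 64-byte cache line */
            __builtin_prefetch((const char *)(buf + i) + PREFETCH_DIST);
        sum += buf[i];
    }
    return sum;
}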
sakeniwefu 31 Mar 2008, 11:25
Thank you for your replies! I was aware of the -S switch, but the code isn't commented and is very different from my implementation, which makes it difficult to follow.
revolution wrote: Of course this all assumes that you already have the right algorithm, which is always the most important thing to improve.
Indeed, I could certainly optimize the algorithm, but then it wouldn't be an exact equivalent any more; I might as well rewrite the C code. I am more interested in non-obvious optimizations of a given algorithm (read: compiler writing, and inline assembly algorithms). Okay, I think I found exactly what I was asking for.
_________________
Warning: C code found, deleting.
revolution 31 Mar 2008, 11:34
sakeniwefu wrote: Okay, I think I found exactly what I was asking for.
And in case you don't know where to get the optimisation manuals, check out my website; I have all the information there.
revolution 31 Mar 2008, 11:39
sakeniwefu wrote: Indeed, I could certainly optimize the algorithm, but then it wouldn't be an exact equivalent any more; I might as well rewrite the C code.
sakeniwefu 31 Mar 2008, 12:45
revolution wrote: And in case you don't know where to get the optimisation manuals, check out my website; I have all the information there.
_________________
Warning: C code found, deleting.
revolution 31 Mar 2008, 12:56
sakeniwefu wrote: ...but Google didn't help with Intel's ...
Actually, if you know the right search string then of course Google knows everything. But if you already knew EXACTLY what to search for, why did you need Google??? Hmm ... I'll have to think about that for a while.
f0dder 31 Mar 2008, 14:15
Umm, you worry about optimization for a file copying program? Code generation shouldn't really matter for such an app, since you're supposed to be 100% I/O bound.
sakeniwefu 31 Mar 2008, 21:11
f0dder wrote: Umm, you worry about optimization for a file copying program? Code generation shouldn't really matter for such an app, since you're supposed to be 100% I/O bound.
Did you read the "bit by bit" part? The idea is to read and write arbitrary-length strings of bits for use in some encryption and compression algorithms. Also, as I said, I wasn't optimizing the algorithm but the code, and bad code is noticed all the same (unoptimized C is 4 times slower than optimized C or my assembly).
Both of my implementations ARE slow for the test scenario, but they work just like the C stdio functions (getb, putb, getbw, putbw, fsetposb), so they allow me to RAD any new algorithm that crosses my mind, with random access to an arbitrary amount of data and using 0 bytes of memory. The algorithms (at least my asm version) can easily be adapted to use memory buffers by just stripping out the file I/O part, when and if I feel inclined to do it. Then I will be battling with GCC again to get the most optimal memory management routines for the task.
I would have liked my fasm algorithm to win, as I feel more confident doing data manipulation in assembly than in crippled C syntax (you can write a terrible security hole in less than a command, but you need three commands to roll a dword). This is one of the reasons why I am willing to learn advanced optimization; the other is that I have my own ideas in compiler design.
So, to sum it up: the small important part is 100% not I/O bound and is independent of the file I/O.
_________________
Warning: C code found, deleting.
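The test scenario, then, sounds like a bit-by-bit file copy. For context, a minimal sketch of what such a driver might look like on top of the interface described here (signatures as given later in the thread; everything else is illustrative):
Code:
#include <stdio.h>

/* The thread's interface: getb reads one bit, putb writes one bit;
   -1 signals EOF or an error. */
int getb(FILE *f);
int putb(FILE *f, int bit);

/* Hypothetical driver: copy a file one bit at a time. */
int copy_bits(FILE *in, FILE *out)
{
    int bit;
    while ((bit = getb(in)) >= 0)
        if (putb(out, bit) < 0)
            return -1;
    return 0;
}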
f0dder 31 Mar 2008, 23:01
Ah, I didn't think you meant the "bit by bit" part literally... if you had written "arbitrary bit-lengths" from the start, it would have been clearer - sorry.
DOS386 03 Apr 2008, 00:31
Quote: With code hints (inline, register, etc.), -O2 and -O3 beat me at code generation (10% faster, though 200 bytes larger). Without code hints only -O3 beat me, but the resulting file was the same, meaning -O3 is pretty good at optimization.
Post the "offending" code, otherwise I will hardly believe this.
sakeniwefu 03 Apr 2008, 10:46
This is the compiler-generated code for the functions. I am sure you fasmers can get something out of it, but I couldn't figure out why it was faster, and eventually deleted my code.
Sorry for the nonstandard code; it is supposed to be Intel syntax.
Code:
putb:
    push    %ebp
    xor     %edx, %edx
    mov     %ebp, %esp
    sub     %esp, 8
    mov     %eax, DWORD PTR byte.2072
    add     %eax, %eax
    or      %eax, DWORD PTR [%ebp+12]
    movzx   %ecx, %al                   ; What? why not mov ecx,eax?
    mov     %eax, DWORD PTR curb.2071   ; is movzx faster?
    mov     DWORD PTR byte.2072, %ecx
    sub     %eax, 1
    mov     DWORD PTR curb.2071, %eax
    add     %eax, 1
    je      .L9
.L4:
    leave
    mov     %eax, %edx
    ret
.L9:
    mov     %eax, DWORD PTR [%ebp+8]
    mov     DWORD PTR curb.2071, 7
    mov     DWORD PTR [%esp], %ecx
    mov     DWORD PTR [%esp+4], %eax
    call    fputc
    mov     %edx, -1
    add     %eax, 1
    je      .L4
    xor     %edx, %edx
    mov     DWORD PTR byte.2072, 0
    jmp     .L4

getb:
    push    %ebp
    mov     %ebp, %esp
    sub     %esp, 8
    cmp     DWORD PTR curb.2055, -1
    je      .L16
.L11:
    mov     %eax, DWORD PTR byte.2056
    sub     DWORD PTR curb.2055, 1
    lea     %edx, [%eax+%eax]           ; WTF is this?
    sar     %eax, 7                     ; is this supposed to do a
    or      %edx, %eax                  ; rol eax,32??
    movzx   %eax, %dl                   ; and eax,1
    and     %edx, 1
    mov     DWORD PTR byte.2056, %eax
.L14:
    leave
    mov     %eax, %edx
    ret
.L16:
    mov     %eax, DWORD PTR [%ebp+8]
    mov     DWORD PTR curb.2055, 7
    mov     DWORD PTR [%esp], %eax
    call    _IO_getc
    mov     %edx, -1
    mov     DWORD PTR byte.2056, %eax
    add     %eax, 1
    jne     .L11
    jmp     .L14
_________________
Warning: C code found, deleting.
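One possible reading of the lines marked "WTF" (an interpretation, not something confirmed in the thread): the lea/sar/or/movzx sequence is an 8-bit rotate-left of the cached byte, with the bit that wraps around becoming the return value. In C terms, roughly:
Code:
/* Sketch of what getb's inner sequence appears to compute; "byte"
   and "curb" stand in for GCC's byte.2056 and curb.2055 statics. */
static int byte, curb;

int getb_core(void)
{
    curb--;                                  /* sub curb.2055, 1 */
    int rotated = (byte << 1) | (byte >> 7); /* lea + sar + or   */
    int bit     = rotated & 1;               /* and %edx, 1      */
    byte        = rotated & 0xFF;            /* movzx %eax, %dl  */
    return bit;                              /* result in %edx   */
}
Expanding the rotate this way instead of using rol dl, 1 likely sidesteps a partial-register write, which P6/NetBurst-family cores penalize.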
bitRAKE 03 Apr 2008, 16:36
Bit strings are a native type on x86, but when working with strings longer than a few bits, shifting/masking is faster (see the sketch below). The code you posted seems like some kind of adaptor between the bit strings and character-based I/O? If that is the case, it's massive overkill as abstractions go. Also, it seems odd that you deleted your code - it would have been very helpful. Maybe you can post the C code, or explain the interface? What are the params passed to the routines? What are the definitions for the data accessed?
_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
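To illustrate the shift/mask approach for runs longer than a bit or two, a hedged sketch (hypothetical helper, not from the thread: LSB-first bit numbering, len capped at 25, and the caller pads the buffer so the 4-byte window never overruns):
Code:
#include <stddef.h>
#include <stdint.h>

/* Extract len bits (1..25) starting at bit bitpos, using only
   shifts and masks. */
uint32_t get_bits(const uint8_t *buf, size_t bitpos, unsigned len)
{
    size_t   first = bitpos >> 3;  /* containing byte        */
    unsigned skip  = bitpos & 7;   /* bits to skip inside it */

    /* Assemble a 32-bit little-endian window, then shift and mask. */
    uint32_t window = (uint32_t)buf[first]
                    | (uint32_t)buf[first + 1] << 8
                    | (uint32_t)buf[first + 2] << 16
                    | (uint32_t)buf[first + 3] << 24;

    return (window >> skip) & ((1u << len) - 1);
}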
rugxulo 03 Apr 2008, 20:09
sakeniwefu, what CPU family? What version of GCC? What OS version? (All of that makes a big difference.)
From the code, I'd weakly bet that (data fetch??) alignment and file buffering are what really give the speed, not anything inherently better instruction-wise (although it's pretty hard to really know what you're doing regarding caches, pipelines, etc.). GCC 4.x is better at -O2 than 3.x anyway (and the latest version's -march=native gives the best output). Face it, if GCC were completely dumb, nobody would use it. It definitely has its uses. http://en.wikipedia.org/wiki/Compiler_optimization
You can try "gcc -Q -v -O3", or see "_Optimization Options_" in your GCC.INFO file for a complete list.
sakeniwefu 03 Apr 2008, 21:37
getb(FILE*), putb(FILE*, bit). (I realised afterwards that C's putc is (char, FILE*) when called from C.)
Basically each function has two variables, byte and curb (current byte), and gets or puts a char through stdio.h's getc and putc when the bit counter reaches 0. The functions return 0, 1, x, or -1 if there was an error. In the actual implementation I have for WinAPI asm, fputb(*, MAGIC_NUMBER) "burns" the last byte. However, I left it out of this test because it wasn't needed to copy bytes.
As I said, I was learning optimization techniques. This is it. I knew in advance it's not an optimal solution. I suppose that in the same space I could do an fread to a buffer and work from there, but I thought: what if I access a big file randomly? The bigger the buffer the worse; which buffer size would be optimal? malloc or db? So I decided I would care about that once I learned to optimize the opcodes.
I deleted the code because I was very frustrated. I spent a day changing instructions, and the changes in performance were only for the worse. Maybe I should have been working on the compiler-generated output, but I wanted it to be a fair competition. My code was doing more or less the same as this code (of course, I designed both to be equal), but passing the arguments through registers (I would use it in fasm, after all). Eventually I also inlined the functions through a macro and unrolled the loop four times (empirically this was the fastest). I even tried passing a reference to the variable space in edi so I could mov [edi+x] the variables, but there was no noticeable improvement. I stalled at 15% slower.
Also, I don't believe the disassembly output. It says ret at the end, but the real binary must be inlining the functions, because just the call overhead doubles the execution time.
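From that description and the posted assembly, the deleted C source plausibly looked something like the following. This is a reconstruction, so details such as the exact EOF handling are guesses:
Code:
#include <stdio.h>

int getb(FILE *f)
{
    static int byte = 0, curb = -1;  /* accumulator and bit counter */
    if (curb == -1) {                /* accumulator empty: refill   */
        curb = 7;
        byte = getc(f);
        if (byte == EOF)
            return -1;
    }
    curb--;
    int bit = (byte >> 7) & 1;           /* old top bit       */
    byte = ((byte << 1) | bit) & 0xFF;   /* 8-bit rotate left */
    return bit;
}

int putb(FILE *f, int bit)               /* bit assumed 0 or 1 */
{
    static int byte = 0, curb = 7;
    byte = ((byte << 1) | bit) & 0xFF;   /* shift the bit in   */
    if (--curb == -1) {                  /* byte full: flush   */
        curb = 7;
        if (fputc(byte, f) == EOF)
            return -1;
        byte = 0;
    }
    return 0;
}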
sakeniwefu 03 Apr 2008, 21:53
Thanks rugxulo!
That's my conclusion as well, but I don't know enough to implement it. I'll keep studying and trying. Oh, and it is GCC 4.3 with -O3, running under Debian Unstable on a Pentium 4, with no processor-specific flags.
bitRAKE 04 Apr 2008, 06:09
sakeniwefu wrote: This is it. I knew in advance it's not an optimal solution. I suppose that in the same space I could do an fread to a buffer and work from there, but I thought: what if I access a big file randomly? The bigger the buffer the worse; which buffer size would be optimal? malloc or db? So I decided I would care about that once I learned to optimize the opcodes.
Okay, so you want to do it anyway. All file I/O reads/writes a sector, 512 bytes or more - like it or not. The OS will cache, but how many layers do you want to go through? The compiled code is caching 8 bits and only calling the I/O routines when absolutely required - did you do that in your code?
_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
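To make the layering concrete, a sketch of cutting one layer out: cache a whole block in user space so stdio is touched once per buffer instead of once per byte (buffer size and names are illustrative, not from the thread):
Code:
#include <stdio.h>

#define BITBUF_SIZE 4096  /* illustrative; a multiple of the sector size */

static unsigned char bitbuf[BITBUF_SIZE];
static size_t buf_len = 0, buf_pos = 0;
static int bit_cnt = 0, cur = 0;

/* One fread per block; the common case per bit is a shift and a mask. */
int getb_buffered(FILE *f)
{
    if (bit_cnt == 0) {
        if (buf_pos == buf_len) {                    /* refill block */
            buf_len = fread(bitbuf, 1, sizeof bitbuf, f);
            buf_pos = 0;
            if (buf_len == 0)
                return -1;                           /* EOF or error */
        }
        cur = bitbuf[buf_pos++];
        bit_cnt = 8;
    }
    bit_cnt--;
    return (cur >> bit_cnt) & 1;                     /* MSB first */
}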