flat assembler
Message board for the users of flat assembler.
High Level Languages > Human vs. compiler
LocoDelAssembly 18 Dec 2006, 01:00
Quote:
It's there to break a dependency chain on that shitty processor called Pentium IV. vid, don't you remember that? It was discussed a lot of times on this forum, in threads that you participated in. [edit]Though in this particular case I'm not sure whether it is really needed or it's just a static optimization.[/edit]
bubach 18 Dec 2006, 02:43
eh.. i have no idea why this is needed:
Code:
    lea esp,[esp] ;align
align the inner loop, whats that supposed to mean..?
vid 18 Dec 2006, 06:20
bubach: it adds the 2 extra bytes so the inner loop starts at an address aligned to 16 (or 8, not sure). Something to do with the processor's prefetching.
Loco: hm, I don't remember seeing this one exactly, got a link? I often miss a few things, I am really not a very "focused" person. And I am also not some low-level optimization (pairing, timing) guy, and haven't even read the optimization manuals for a few years.
Raedwulf 18 Dec 2006, 06:37
Codeplay VectorC compiler (Athlon, Vectorize, Microsoft Calling Convention, Optimisation 10, CacheSize 256kb, Auto-Inline 15):
Code:
    align 32
; File "Source Window"
; Line 1
_sort:
    push ebx
; Line 4
    femms
; Line 1
    push ebp
    push esi
    push edi
    sub esp,4
; Line 4
    mov edx,dword ptr 28[esp]
    mov ebx,dword ptr 24[esp]
    cmp edx,2
    jle short L0_RETURN
    mov esi,2
; Line 5
    align 32
__Gen1:
    test esi,esi
    jle short L2_FOR_INC
    xor ebp,ebp
; Line 6
    cmp esi,1
    jg __UnrollTestBefore3839
__UnrollLast5253:
    mov al,byte ptr 0[ebx+esi]
    mov cl,byte ptr 0[ebx+ebp]
    cmp al,cl
    jge short L2_FOR_INC
; Line 7
    mov al,byte ptr 0[ebx+ebp]
; Line 8
    mov cl,byte ptr 0[ebx+esi]
    mov byte ptr 0[ebx+ebp],cl
; Line 9
    mov byte ptr 0[ebx+esi],al
; Line 5
L2_FOR_INC:
; Line 4
    inc esi
    cmp esi,edx
    jl __Gen1
L0_RETURN:
; Line 13
    add esp,4
    pop edi
    femms
    pop esi
    pop ebp
    pop ebx
    ret
L6_FOR_INC:
; Line 6
    sub esp,4
    movq mm0,mm4
    movq mm4,mm5
    mov dword ptr [esp],edx
    mov al,byte ptr 1[ebx+ebp]
    movq mm5,mm0
; Line 5
    inc ebp
; Line 6
    movd edx,mm5
    movd mm5,dword ptr [esp]
    add esp,4
    cmp byte ptr 0[ebx+esi],al
    movd ecx,mm5
    jge short __Unrolled6263
    movq mm5,mm4
    movd mm4,edx
    mov edx,ecx
; Line 8
    mov cl,byte ptr 0[ebx+esi]
; Line 10
    sub esp,4
; Line 7
    mov al,byte ptr 0[ebx+ebp]
; Line 8
    mov byte ptr 0[ebx+ebp],cl
; Line 9
    mov byte ptr 0[ebx+esi],al
; Line 10
    mov dword ptr [esp],edx
    movd edx,mm4
    movd mm4,dword ptr [esp]
    add esp,4
    movq mm0,mm4
    movq mm4,mm5
    movq mm5,mm0
    movd ecx,mm5
__Unrolled6263:
; Line 6
    sub esp,4
    add edi,2
    mov dword ptr [esp],ecx
    movd ecx,mm4
    movd mm4,dword ptr [esp]
    add esp,4
; Line 5
    inc ebp
; Line 6
    movd eax,mm4
    cmp eax,edi
    jne short __Gen3
__UnrollLastTest4445:
    test ecx,ecx
    je L2_FOR_INC
    jmp __UnrollLast5253
    align 32
__Gen3:
    movd mm5,ecx
    mov cl,byte ptr 0[ebx+ebp]
    movd mm4,edx
    mov edx,eax
    mov al,byte ptr 0[ebx+esi]
    cmp al,cl
    jge L6_FOR_INC
; Line 7
    mov al,byte ptr 0[ebx+ebp]
; Line 8
    mov cl,byte ptr 0[ebx+esi]
    mov byte ptr 0[ebx+ebp],cl
; Line 9
    mov byte ptr 0[ebx+esi],al
; Line 10
    jmp L6_FOR_INC
__UnrollTestBefore3839:
; Line 6
    mov ecx,esi
    and ecx,1
    mov eax,esi
    and eax,-2
    je __UnrollLastTest4445
    xor edi,edi
    jmp __Gen3
__EndProcedure117:
It seems that VectorC likes unrolling a lot, no idea how good this code is though.... we need a benchmark someone!
_________________
Raedwulf
Last edited by Raedwulf on 18 Dec 2006, 13:05; edited 1 time in total
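For the benchmark asked for here, a minimal timing harness in C might look like the sketch below. It is only an illustration under stated assumptions: the names `sort_fn`, `ref_sort` and `time_sort`, the LCG constants and the use of `clock()` are not from the thread, and `ref_sort` is the C loop as it appears in the comments of the Small C listing further down (including its start at j = 2). Each compiler's or hand-written routine would be plugged in through the same function pointer.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Signature shared by every candidate routine (hypothetical typedef). */
typedef void (*sort_fn)(char *a, int n);

/* Stand-in candidate: the loop from the thread's C source as quoted in the
   Small C listing. It starts at j = 2, exactly as posted. */
static void ref_sort(char *a, int n)
{
    int j, k, temp;
    for (j = 2; j < n; j++)
        for (k = 0; k < j; k++)
            if (a[j] < a[k]) {
                temp = a[k];
                a[k] = a[j];
                a[j] = (char)temp;
            }
}

/* Runs fn `reps` times on identical pseudo-random input and returns the
   accumulated CPU time in seconds, or -1.0 if allocation fails. */
static double time_sort(sort_fn fn, int n, int reps)
{
    char *master = malloc((size_t)n);
    char *work = malloc((size_t)n);
    unsigned seed = 12345u;          /* fixed seed: every run sees the same data */
    clock_t total = 0;
    int i, r;

    if (!master || !work) {
        free(master);
        free(work);
        return -1.0;
    }
    for (i = 0; i < n; i++) {
        seed = seed * 1103515245u + 12345u;   /* simple LCG, illustration only */
        master[i] = (char)(seed >> 16);
    }
    for (r = 0; r < reps; r++) {
        clock_t t0;
        memcpy(work, master, (size_t)n);      /* restore unsorted input */
        t0 = clock();
        fn(work, n);
        total += clock() - t0;
    }
    free(master);
    free(work);
    return (double)total / CLOCKS_PER_SEC;
}
```

Calling `time_sort(ref_sort, 4096, 100)` and then the same with an assembly routine declared as `extern void sort(char *a, int n);` would give comparable numbers, though `clock()` is coarse and a cycle counter would likely be needed for small inputs.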
Maverick 18 Dec 2006, 08:49
bubach wrote: eh.. i have no idea why this is needed:
It's just a two bytes long NOP that executes two times faster than two regular NOPs in sequence.
_________________
Greets, Fabio
crc 18 Dec 2006, 11:51
GCC 4.1.0 generates this code with no options:
Code:
    .file "a.c"
    .text
.globl sort
    .type sort, @function
sort:
    pushl %ebp
    movl %esp, %ebp
    subl $16, %esp
    movl $2, -12(%ebp)
    jmp .L2
.L3:
    movl $0, -8(%ebp)
    jmp .L4
.L5:
    movl -12(%ebp), %eax
    addl 8(%ebp), %eax
    movzbl (%eax), %edx
    movl -8(%ebp), %eax
    addl 8(%ebp), %eax
    movzbl (%eax), %eax
    cmpb %al, %dl
    jge .L6
    movl -8(%ebp), %eax
    addl 8(%ebp), %eax
    movzbl (%eax), %eax
    movsbl %al,%eax
    movl %eax, -4(%ebp)
    movl -8(%ebp), %eax
    movl %eax, %edx
    addl 8(%ebp), %edx
    movl -12(%ebp), %eax
    addl 8(%ebp), %eax
    movzbl (%eax), %eax
    movb %al, (%edx)
    movl -12(%ebp), %eax
    movl %eax, %edx
    addl 8(%ebp), %edx
    movl -4(%ebp), %eax
    movb %al, (%edx)
.L6:
    addl $1, -8(%ebp)
.L4:
    movl -8(%ebp), %eax
    cmpl -12(%ebp), %eax
    jl .L5
    addl $1, -12(%ebp)
.L2:
    movl -12(%ebp), %eax
    cmpl 12(%ebp), %eax
    jl .L3
    leave
    ret
    .size sort, .-sort
    .ident "GCC: (GNU) 4.1.0 (SUSE Linux)"
    .section .note.GNU-stack,"",@progbits
With -O3, it generates:
Code:
    .file "a.c"
    .text
    .p2align 4,,15
.globl sort
    .type sort, @function
sort:
    pushl %ebp
    movl %esp, %ebp
    cmpl $2, 12(%ebp)
    pushl %edi
    pushl %esi
    pushl %ebx
    jle .L11
    movl $2, %edi
.L4:
    movl 8(%ebp), %eax
    xorl %ebx, %ebx
    leal (%edi,%eax), %esi
    .p2align 4,,7
.L5:
    movzbl (%esi), %ecx
    movzbl (%eax), %edx
    cmpb %dl, %cl
    jge .L6
    movb %cl, (%eax)
    movb %dl, (%esi)
.L6:
    addl $1, %ebx
    addl $1, %eax
    cmpl %ebx, %edi
    jg .L5
    addl $1, %edi
    cmpl %edi, 12(%ebp)
    jle .L11
.L17:
    testl %edi, %edi
    jg .L4
    addl $1, %edi
    cmpl %edi, 12(%ebp)
    jg .L17
.L11:
    popl %ebx
    popl %esi
    popl %edi
    popl %ebp
    .p2align 4,,1
    ret
    .size sort, .-sort
    .ident "GCC: (GNU) 4.1.0 (SUSE Linux)"
    .section .note.GNU-stack,"",@progbits
Raedwulf 18 Dec 2006, 12:59
GAS gives me a headache, so hopefully I've ported the GCC -O3 output correctly... correct me if I'm wrong, I only have basic knowledge of GAS.
Code:
    align 4
sort:
    push ebp
    mov ebp, esp
    cmp dword [ebp+12], 2       ; compare N with 2 (GAS: cmpl $2, 12(%ebp))
    push edi
    push esi
    push ebx
    jle .L11                    ; N <= 2: nothing to do
    mov edi, 2h                 ; edi = j
.L4:
    mov eax, [ebp+8]            ; eax = &a[k], starting at k = 0
    xor ebx, ebx                ; ebx = k
    lea esi, [edi+eax]          ; esi = &a[j]
    align 4
.L5:
    movzx ecx, byte [esi]       ; cl = a[j]
    movzx edx, byte [eax]       ; dl = a[k]
    cmp cl, dl
    jge .L6
    mov [eax], cl               ; swap a[k] and a[j]
    mov [esi], dl
.L6:
    add ebx, 1
    add eax, 1
    cmp edi, ebx
    jg .L5
    add edi, 1
    cmp [ebp+12], edi
    jle .L11
.L17:
    test edi, edi
    jg .L4
    add edi, 01h
    cmp [ebp+12], edi
    jg .L17
.L11:
    pop ebx
    pop esi
    pop edi
    pop ebp
    align 4
    ret
I don't quite understand what .p2align is, I'm just putting align 4.....
--- edited by vid - corrected your movzb a,[b] -> movzx a,byte [b]
_________________
Raedwulf
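On the `.p2align` question: as the GNU assembler documents it, the first operand is a power of two rather than a byte count, and the optional third operand caps how many padding bytes may be emitted; if more would be needed, no padding is emitted at all. So `.p2align 4,,15` asks for 16-byte alignment, which would correspond to `align 16` in FASM rather than `align 4`. The arithmetic can be sketched in C as follows; the function name and the "negative means no limit" convention are made up for illustration.

```c
#include <stdint.h>

/* Number of padding bytes `.p2align log2,,max_skip` would emit at `addr`.
   A negative max_skip stands for "no limit" (hypothetical convention). */
static uint32_t p2align_padding(uint32_t addr, unsigned log2, int max_skip)
{
    uint32_t align = (uint32_t)1 << log2;        /* 4 -> 16 bytes */
    uint32_t pad = (0u - addr) & (align - 1u);   /* distance to next boundary */

    if (max_skip >= 0 && pad > (uint32_t)max_skip)
        return 0;                                /* too far: skip the alignment */
    return pad;
}
```

For a loop label that would otherwise land at 0x1006, `.p2align 4,,15` pads 10 bytes up to 0x1010, while `.p2align 4,,7` (the form GCC put before `.L5`) pads nothing because 10 exceeds the limit of 7.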
LocoDelAssembly 18 Dec 2006, 13:43
vid, http://board.flatassembler.net/topic.php?t=2561 . There are some more posts about this issue; most of the time revolution has something to do with them.
Raedwulf 18 Dec 2006, 14:45
Ah, thanks vid. movzb was MASM, that knowledge comes back to haunt me!
Cheers.
vid 18 Dec 2006, 16:02
loco: alright
f0dder 18 Dec 2006, 23:39
A few points about the C source:
- use a "void" return type when you're not returning anything
- use a "temp" of the same datatype as your dataset
- prefer "unsigned" to "signed" when you don't need the sign.
vid wrote:
What do you mean here? Keep in mind that compilers aren't 100% free to do what they want, otherwise you might've wound up with a radix sort, quicksort, whatever.
PS: which version is "MSVC 8.0"? The one from vs.net2005 is Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.42 for 80x86, and generally /Ox or /O1 would be the flags to use (max opts or opt for size).
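Applied to the routine under discussion, the three points about the C source could look like the sketch below. The loop body follows the C comments in the Small C listing in this thread; the name `sort_bytes`, the `size_t` index type and the start at j = 1 are assumptions of this sketch. The posted code starts at j = 2, which never compares the first two elements with each other, so starting at 1 is a deliberate deviation here.

```c
#include <stddef.h>

/* Swap-based insertion into a sorted prefix, as in the thread's C source,
   with a void return type, an unsigned element type and a temp of the
   same type as the data. */
void sort_bytes(unsigned char *a, size_t n)
{
    size_t j, k;

    for (j = 1; j < n; j++) {            /* posted code starts at 2 */
        for (k = 0; k < j; k++) {
            if (a[j] < a[k]) {
                unsigned char temp = a[k];
                a[k] = a[j];
                a[j] = temp;
            }
        }
    }
}
```

With `unsigned char` the compiler can use zero-extending loads and unsigned compares throughout, and a byte-sized `temp` removes the widening and narrowing conversions that an `int` temp forces in the unoptimized listings.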
crc 19 Dec 2006, 00:21
From Small C/386 (originally for Windows NT; I ported it to Linux a few years ago; it targets NASM):
Code:
;sort(a, N)
    SECTION .text
    SECTION .data
    SECTION .text
    GLOBAL $sort
$sort:
;char *a; int N;
    PUSH EBP
    MOV EBP,ESP
;{
; int j, k, temp;
; for (j = 2; j < N; j++) {
    ADD ESP,-12
    LEA EAX,[EBP-4]
    MOV EBX,EAX
    MOV EAX,2
    MOV [EBX],EAX
_4:
    LEA EAX,[EBP-4]
    MOV EBX,EAX
    MOV EAX, [EBX]
    PUSH EAX
    LEA EAX,[EBP+8]
    MOV EBX,EAX
    MOV EAX, [EBX]
    POP EBX
    OR EAX,EAX
    JNE _6
    JMP _3
_6:
    JMP _5
_2:
    LEA EAX,[EBP-4]
    MOV EBX,EAX
    MOV EAX, [EBX]
    INC EAX
    MOV [EBX],EAX
    DEC EAX
    JMP _4
_5:
; for (k = 0; k < j; k++) {
    LEA EAX,[EBP-8]
    MOV EBX,EAX
    XOR EAX,EAX
    MOV [EBX],EAX
_9:
    LEA EAX,[EBP-8]
    MOV EBX,EAX
    MOV EAX, [EBX]
    PUSH EAX
    LEA EAX,[EBP-4]
    MOV EBX,EAX
    MOV EAX, [EBX]
    POP EBX
    OR EAX,EAX
    JNE _11
    JMP _8
_11:
    JMP _10
_7:
    LEA EAX,[EBP-8]
    MOV EBX,EAX
    MOV EAX, [EBX]
    INC EAX
    MOV [EBX],EAX
    DEC EAX
    JMP _9
_10:
; if (a[j] < a[k]) {
    LEA EAX,[EBP+12]
    MOV EBX,EAX
    MOV EAX, [EBX]
    PUSH EAX
    LEA EAX,[EBP-4]
    MOV EBX,EAX
    MOV EAX, [EBX]
    POP EBX
    ADD EAX,EBX
    MOV EBX,EAX
    MOVSX EAX,BYTE [EBX]
    PUSH EAX
    LEA EAX,[EBP+12]
    MOV EBX,EAX
    MOV EAX, [EBX]
    PUSH EAX
    LEA EAX,[EBP-8]
    MOV EBX,EAX
    MOV EAX, [EBX]
    POP EBX
    ADD EAX,EBX
    MOV EBX,EAX
    MOVSX EAX,BYTE [EBX]
    POP EBX
    OR EAX,EAX
    JNE _13
    JMP _12
_13:
; temp = a[k];
    LEA EAX,[EBP-12]
    PUSH EAX
    LEA EAX,[EBP+12]
    MOV EBX,EAX
    MOV EAX, [EBX]
    PUSH EAX
    LEA EAX,[EBP-8]
    MOV EBX,EAX
    MOV EAX, [EBX]
    POP EBX
    ADD EAX,EBX
    MOV EBX,EAX
    MOVSX EAX,BYTE [EBX]
    POP EBX
    MOV [EBX],EAX
; a[k] = a[j];
    LEA EAX,[EBP+12]
    MOV EBX,EAX
    MOV EAX, [EBX]
    PUSH EAX
    LEA EAX,[EBP-8]
    MOV EBX,EAX
    MOV EAX, [EBX]
    POP EBX
    ADD EAX,EBX
    PUSH EAX
    LEA EAX,[EBP+12]
    MOV EBX,EAX
    MOV EAX, [EBX]
    PUSH EAX
    LEA EAX,[EBP-4]
    MOV EBX,EAX
    MOV EAX, [EBX]
    POP EBX
    ADD EAX,EBX
    MOV EBX,EAX
    MOVSX EAX,BYTE [EBX]
    POP EBX
    MOV [EBX],AL
; a[j] = temp;
    LEA EAX,[EBP+12]
    MOV EBX,EAX
    MOV EAX, [EBX]
    PUSH EAX
    LEA EAX,[EBP-4]
    MOV EBX,EAX
    MOV EAX, [EBX]
    POP EBX
    ADD EAX,EBX
    PUSH EAX
    LEA EAX,[EBP-12]
    MOV EBX,EAX
    MOV EAX, [EBX]
    POP EBX
    MOV [EBX],AL
; }
; }
_12:
    JMP _7
_8:
; }
    JMP _2
_3:
;}
    MOV ESP,EBP
    POP EBP
    RET
; END
crc 19 Dec 2006, 00:25
One last one from me, this is from TCC. I had to disassemble the object file; sorry for the GAS syntax.
Code:
00000000 <sort>:
   0: 55                    push %ebp
   1: 89 e5                 mov %esp,%ebp
   3: 81 ec 0c 00 00 00     sub $0xc,%esp
   9: b8 02 00 00 00        mov $0x2,%eax
   e: 89 45 fc              mov %eax,0xfffffffc(%ebp)
  11: 8b 45 fc              mov 0xfffffffc(%ebp),%eax
  14: 8b 4d 0c              mov 0xc(%ebp),%ecx
  17: 39 c8                 cmp %ecx,%eax
  19: 0f 8d 90 00 00 00     jge af <sort+0xaf>
  1f: e9 0d 00 00 00        jmp 31 <sort+0x31>
  24: 8b 45 fc              mov 0xfffffffc(%ebp),%eax
  27: 89 c1                 mov %eax,%ecx
  29: 83 c0 01              add $0x1,%eax
  2c: 89 45 fc              mov %eax,0xfffffffc(%ebp)
  2f: eb e0                 jmp 11 <sort+0x11>
  31: b8 00 00 00 00        mov $0x0,%eax
  36: 89 45 f8              mov %eax,0xfffffff8(%ebp)
  39: 8b 45 f8              mov 0xfffffff8(%ebp),%eax
  3c: 8b 4d fc              mov 0xfffffffc(%ebp),%ecx
  3f: 39 c8                 cmp %ecx,%eax
  41: 0f 8d 63 00 00 00     jge aa <sort+0xaa>
  47: e9 0d 00 00 00        jmp 59 <sort+0x59>
  4c: 8b 45 f8              mov 0xfffffff8(%ebp),%eax
  4f: 89 c1                 mov %eax,%ecx
  51: 83 c0 01              add $0x1,%eax
  54: 89 45 f8              mov %eax,0xfffffff8(%ebp)
  57: eb e0                 jmp 39 <sort+0x39>
  59: 8b 45 08              mov 0x8(%ebp),%eax
  5c: 8b 4d fc              mov 0xfffffffc(%ebp),%ecx
  5f: 01 c8                 add %ecx,%eax
  61: 8b 4d 08              mov 0x8(%ebp),%ecx
  64: 8b 55 f8              mov 0xfffffff8(%ebp),%edx
  67: 01 d1                 add %edx,%ecx
  69: 0f be 10              movsbl (%eax),%edx
  6c: 0f be 01              movsbl (%ecx),%eax
  6f: 39 c2                 cmp %eax,%edx
  71: 0f 8d 31 00 00 00     jge a8 <sort+0xa8>
  77: 8b 45 08              mov 0x8(%ebp),%eax
  7a: 8b 4d f8              mov 0xfffffff8(%ebp),%ecx
  7d: 01 c8                 add %ecx,%eax
  7f: 0f be 08              movsbl (%eax),%ecx
  82: 89 4d f4              mov %ecx,0xfffffff4(%ebp)
  85: 8b 45 08              mov 0x8(%ebp),%eax
  88: 8b 4d f8              mov 0xfffffff8(%ebp),%ecx
  8b: 01 c8                 add %ecx,%eax
  8d: 8b 4d 08              mov 0x8(%ebp),%ecx
  90: 8b 55 fc              mov 0xfffffffc(%ebp),%edx
  93: 01 d1                 add %edx,%ecx
  95: 0f be 11              movsbl (%ecx),%edx
  98: 88 10                 mov %dl,(%eax)
  9a: 8b 45 08              mov 0x8(%ebp),%eax
  9d: 8b 4d fc              mov 0xfffffffc(%ebp),%ecx
  a0: 01 c8                 add %ecx,%eax
  a2: 0f be 4d f4           movsbl 0xfffffff4(%ebp),%ecx
  a6: 88 08                 mov %cl,(%eax)
  a8: eb a2                 jmp 4c <sort+0x4c>
  aa: e9 75 ff ff ff        jmp 24 <sort+0x24>
  af: c9                    leave
  b0: c3                    ret
vid 19 Dec 2006, 12:44
f0dder: that was example code from the internet, linked from Wikipedia. I consider it the "standard way C coders write", e.g. declaring "temp" as "int" instead of char etc.
Sorry, it was MSVS 8.0; the MSVC version is 14.00.50727.42. You are right about the "moving" of the value - it would be a different algo. I wanted max speed on purpose.
f0dder 19 Dec 2006, 14:04
Quote:
It's certainly not the way I'd do it - nor most others, I'd hope. That code was written by a crummy programmer. I'd just use std::sort anyway, until it turned out to be a bottleneck - then I'd pick a properly specialized routine, and perhaps even do an assembly version of it, if it still wasn't fast enough.
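`std::sort` is C++; for the plain C routine discussed in this thread, the nearest standard-library counterpart is `qsort`. A minimal sketch of the "use the library first" approach, with hypothetical helper names `cmp_uchar` and `sort_bytes_qsort`:

```c
#include <stdlib.h>

/* Three-way comparison for qsort: negative, zero or positive. */
static int cmp_uchar(const void *pa, const void *pb)
{
    unsigned char a = *(const unsigned char *)pa;
    unsigned char b = *(const unsigned char *)pb;
    return (a > b) - (a < b);
}

/* Sorts n bytes in ascending order using the C library. */
static void sort_bytes_qsort(unsigned char *a, size_t n)
{
    qsort(a, n, sizeof a[0], cmp_uchar);
}
```

Unlike `std::sort`, `qsort` calls the comparison through a function pointer, so for single bytes the call overhead is likely to dominate; that is one situation where a specialized routine, as suggested above, could pay off.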
vid 19 Dec 2006, 14:10
My point in this thread was just to contradict the theory that a human cannot beat a compiler in low-level optimizations. I didn't analyze other compilers, but at least I have proof against MSVC that its code isn't THAT great.
f0dder 19 Dec 2006, 14:20
Well, you've chosen a single, simple piece of code, written by somebody who obviously isn't very good at writing C code - what did you expect?
Nobody's saying that humans can't beat compilers anyway, at least not anybody with a clue. What's more reasonable to say is: "for most code, it's not worth the bloody effort".
kohlrak 19 Dec 2006, 21:35
The computer can't always beat the human, but the human can always beat the computer. Sadly, most humans don't beat the computer. That's my little point on this one.
Maverick 20 Dec 2006, 12:25
kohlrak wrote: The computer can't always beat the human, but the human can always beat the computer.
Sure, especially MIPS-wise. In 50 years we'll be useless, and assimilated or replaced.
_________________
Greets, Fabio
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.