flat assembler
Message board for the users of flat assembler.
flat assembler > Heap > HLL compilers generate better code than hand written asm?

Author
AsmGuru62



Joined: 28 Jan 2004
Posts: 1386
Location: Toronto, Canada
The algorithm change is the only optimization which gives real results.
Post 13 Mar 2015, 12:47
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15241
Location: 1I/ʻOumuamua

AsmGuru62 wrote:
The algorithm change is the only optimization which gives real results.

Not at all. One can vectorise their code and get a measurable "real" result.
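
For instance (a sketch, not from this thread; the function names are made up), the same loop written scalar and with SSE intrinsics, where the vectorised version processes four floats per iteration:

Code:

#include <immintrin.h>
#include <cstddef>

// Scalar version: one float per iteration.
void add_scalar(float *dst, const float *src, std::size_t n){
    for(std::size_t i = 0; i < n; ++i)
        dst[i] += src[i];
}

// Vectorised version: four floats per iteration using SSE.
void add_sse(float *dst, const float *src, std::size_t n){
    std::size_t i = 0;
    for(; i + 4 <= n; i += 4){
        __m128 d = _mm_loadu_ps(dst + i);            // load 4 floats (unaligned)
        __m128 s = _mm_loadu_ps(src + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(d, s));    // add and store 4 at once
    }
    for(; i < n; ++i)                                // scalar tail
        dst[i] += src[i];
}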
Post 13 Mar 2015, 14:00
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15241
Location: 1I/ʻOumuamua

JohnFound wrote:
The code B might be optimized for size, instead of for speed. Sometimes the size optimizations are also faster, but not always.

I don't think it is fair to compare different optimisation goals. Apples to oranges.
Post 13 Mar 2015, 14:02
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1171
Location: Unknown

Quote:

It is much more complex than chess.


Calm down, it was just a metaphor. D:


Quote:

And that is why, the compilers can't optimize the whole program - it needs understanding.


FPC has the so-called "whole program optimization"... Not that it is worth much, but it is there.


Quote:

And here comes the question: can changing the algorithm be considered optimization?


I think not, because optimization (in my opinion) means making the same algorithm faster. Rewriting it as a completely different one will require further optimization as well (and the chances of getting the maximum possible performance out of it are minimal).


Quote:

The algorithm change is the only optimization which gives real results.


That is, in my opinion, the #1 piece of bullshit I hear from many people. Why is it bullshit? My reasoning is above and below.


Quote:

Not at all. One can vectorise their code and get a measurable "real" result.


Now that makes sense! Because an algorithm can be implemented in an infinite number of ways, you can gain a lot of performance just by implementing it another way.

I would try optimizing your existing algorithm implementation first (unless you find a clearly more efficient one), and if that is not enough... try creating a better algorithm (if that is possible at all; it is harder than what I described above).


Quote:

The algorithm change is the only optimization which gives real results.


It is not simply a change; what if you change to a less efficient algorithm!? It has to be a change for the better (which is not always possible). Easier said than done.

I guess I have reached my bullshit (said by me) quota for today. D: I apologize for any inconveniences I may have caused.
Post 13 Mar 2015, 14:49
Tyler



Joined: 19 Nov 2009
Posts: 1216
Location: NC, USA
I think it really boils down to the fact that asm isn't necessary for the bulk of code that's written. Almost nothing is "hot" anymore; you just churn out thousands of lines of "cold" code that only gets executed once per 100+ ms. At that point, it makes more sense to use a descriptive language, or a language that allows code reuse (generic programming, polymorphism, etc.), to cut down on the amount of code written and the time spent. Of course you can do polymorphism in assembly. I've done it (in C, and I can translate anything in C to asm given some time and a reference book). The problem is that it's so trivial that there's no making it any faster. Either way you go, you're stuck having a vtable, and the vtable is the most expensive part of polymorphism (one or two memory reads with every method call). What you can't have in assembly is generic programming. (Maybe you can with some really awesome macros???) And generic programming is literally the coolest thing ever.
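
To make the vtable cost concrete, here is a hand-rolled sketch (an illustration only, not code from this thread; all names are made up): the call needs one load to reach the vtable and one load to fetch the function pointer before the indirect call.

Code:

#include <cstdio>

struct Shape;
struct ShapeVTable { double (*area)(const Shape*); };   // table of function pointers

struct Shape {
    const ShapeVTable *vtbl;   // load #1: object -> vtable
    double w, h;
};

double rect_area(const Shape *s){ return s->w * s->h; }
const ShapeVTable rect_vtbl = { rect_area };

int main(){
    Shape r{ &rect_vtbl, 3.0, 4.0 };
    double a = r.vtbl->area(&r);   // load #2: vtable -> function pointer, then indirect call
    std::printf("%f\n", a);
    return 0;
}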

More on topic though, if someone wants to pick a sorting algorithm to test, I can probably write it in C/C++. It seems unfair, IMO, to compare assembly with Pascal. Not that I know anything about Pascal (too young for that). It just doesn't have a reputation for being fast, AFAIK.
Post 16 Mar 2015, 06:09
Tyler



Joined: 19 Nov 2009
Posts: 1216
Location: NC, USA

revolution wrote:

AsmGuru62 wrote:
The algorithm change is the only optimization which gives real results.

Not at all. One can vectorise their code and get a measurable "real" result.

To add another example, you can flip loop indexes to get better cache performance. And this change doesn't necessarily have to change the algorithm at all, assuming it's a purely double nested loop (nothing in the outer loop).


Code:

for(int i = 0;i < SIZE;i++)
  for(int j = 0;j < SIZE;j++)
    array[j * SIZE + i] += 1; // Should swap the i and j, to access memory linearly.




GCC has an optimizer that does some pretty cool math to do this sort of optimization. https://gcc.gnu.org/wiki/Graphite General info: http://en.wikipedia.org/wiki/Polytope_model
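
For reference, the swapped ordering that the comment in the code above suggests would look like this (a sketch; the wrapper function is only for illustration):

Code:

// The inner index now walks memory linearly, so each cache line is used
// fully before it gets evicted.
void increment_rowmajor(int *array, int SIZE){
    for(int i = 0; i < SIZE; i++)
        for(int j = 0; j < SIZE; j++)
            array[i * SIZE + j] += 1;
}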
Post 16 Mar 2015, 06:15
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1171
Location: Unknown

Quote:

It just doesn't have a reputation for being fast, AFAIK.


That is not true at all (or not in most cases). As a test case, I implemented an algorithm in C (I am not a C* user), Pascal and "hand written asm". The C variant performed worst of them all and I couldn't make it any faster; I do not know if that was because of my lack of experience, but that is what happened. The "hand written asm" was a bit faster than Pascal (~1.6 seconds difference for an input of 250 million).

I guess it all depends on your experience, as the Pascal compiler (FPC) will do exactly what you wrote, while GCC will often catch your stupid code and skip the unnecessary steps (no inlining of functions unless explicitly declared and enabled on the command line, and values known after a loop, are great examples: FPC will generate the loop while GCC will just "hard code" the final result, since it is already known).
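
A small illustration of the "known values after the loop" point (a sketch, not taken from the post; the function name is made up): GCC at -O2 typically folds the whole loop into the constant result, while a more literal compiler emits the loop as written.

Code:

// GCC at -O2 typically compiles this to "return 499500" with no loop.
int sum_to_1000(void){
    int sum = 0;
    for(int i = 0; i < 1000; ++i)
        sum += i;
    return sum;
}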

Because of this, I think it is harder to write "fast" Pascal code than C* (GCC) code, as the Pascal compiler is more "obedient" (or less smart, if you wish).

But FPC compiles much faster... A giant project written in Pascal will compile from scratch in under 5 minutes while GCC will take tens of minutes to hours...

+The performance difference (if any) is not as catastrophic as many C* users claim (I guess that is because many of them have never touched Pascal code in their lives). There are other areas where FPC is better than C*/GCC as well, so performance is just 1/10 of the story.

+It doesn't really matter, as performance-critical routines are VERY OFTEN coded in hand written assembly (video/image/audio encoders and general file compression are great examples) because their high level variants are brutally slower. FPC shines here as it provides much better ways of incorporating assembly code in your program (without including object files), while in GCC you have to type every line in a very uncomfortable and unproductive way.

More info at: http://www.freepascal.org/faq.var

I think I reached my bullshit quota for today. Sorry for wasting your time. D:


Last edited by HaHaAnonymous on 16 Mar 2015, 21:06; edited 4 times in total
Post 16 Mar 2015, 16:47
Tyler



Joined: 19 Nov 2009
Posts: 1216
Location: NC, USA
GCC is largely a kludge (and is so on purpose, even), so compile times aren't that fair to compare. Clang is a lot better in terms of proper design, but slightly less powerful in terms of optimizations. It still may be slower than FPC, idk. I wrote (aka translated from Wikipedia) the heapsort algorithm in C++. It sorted 1,000,000 ints in .15s. I can give you a .inc file (in FASM syntax) with the same data, if you're interested. (I understand if you're not. I wouldn't want to write this in assembly, at all.) Here's the code:

To compile: clang++ -std=c++11 -march=native -O3 code.cpp
Add "-masm=intel -S" to get assembly.
g++ has similar performance, if you have it more readily available.


Code:

#include <iostream>

#include "data.h"

using namespace std;

void siftDown(int *a, int start, int end){
        int root = start;

        while(root * 2 + 1 <= end){
                int child = root * 2 + 1;
                int swap = root;

                if(a[swap] < a[child])
                        swap = child;
                if(child+1 <= end && a[swap] < a[child+1])
                        swap = child + 1;
                if(swap == root)
                        return;
                else{
                        ::swap(a[root], a[swap]);
                        root = swap;
                }
        }
}

void heapify(int *a, int count){
        int start = (count - 2) / 2;

        while(start >= 0){
                siftDown(a, start, count - 1);
                start--;
        }
}

template<size_t N>
void heapsort(int data[N]){
        int *a = data;
        int count = N;

        heapify(a, count);

        int end = count - 1;
        while(end > 0){
                swap(a[end], a[0]);
                end--;
                siftDown(a, 0, end);
        }
}

int main(){
        heapsort<sizeof(data) / sizeof(int)>(data);
        /*for(auto i : data)
                cout << i << endl;*/
        return 0;
}



ASM:

Code:
        .text
        .file   "heapsort.cpp"
        .globl  _Z8siftDownPiii
        .align  16, 0x90
        .type   _Z8siftDownPiii,@function
_Z8siftDownPiii:                        # @_Z8siftDownPiii
        .cfi_startproc
# BB#0:
                                        # kill: ESI<def> ESI<kill> RSI<def>
        lea     ecx, dword ptr [rsi + rsi + 1]
        cmp     ecx, edx
        jg      .LBB0_10
# BB#1:                                 # %.lr.ph
        movsxd  rax, esi
        mov     r8d, dword ptr [rdi + 4*rax]
        .align  16, 0x90
.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        movsxd  r9, ecx
        mov     r10d, ecx
        cmp     r8d, dword ptr [rdi + 4*r9]
        jl      .LBB0_4
# BB#3:                                 # %select.mid
                                        #   in Loop: Header=BB0_2 Depth=1
        mov     r10d, esi
.LBB0_4:                                # %select.end
                                        #   in Loop: Header=BB0_2 Depth=1
        cmp     ecx, edx
        jge     .LBB0_8
# BB#5:                                 #   in Loop: Header=BB0_2 Depth=1
        inc     ecx
        movsxd  rax, r10d
        mov     r9d, dword ptr [rdi + 4*rax]
        movsxd  rax, ecx
        cmp     r9d, dword ptr [rdi + 4*rax]
        jl      .LBB0_7
# BB#6:                                 # %select.mid5
                                        #   in Loop: Header=BB0_2 Depth=1
        mov     ecx, r10d
.LBB0_7:                                # %select.end4
                                        #   in Loop: Header=BB0_2 Depth=1
        mov     r10d, ecx
.LBB0_8:                                #   in Loop: Header=BB0_2 Depth=1
        cmp     r10d, esi
        je      .LBB0_10
# BB#9:                                 #   in Loop: Header=BB0_2 Depth=1
        movsxd  rax, esi
        movsxd  rcx, r10d
        mov     esi, dword ptr [rdi + 4*rcx]
        mov     dword ptr [rdi + 4*rax], esi
        mov     dword ptr [rdi + 4*rcx], r8d
        lea     ecx, dword ptr [r10 + r10 + 1]
        mov     esi, r10d
        cmp     ecx, edx
        jle     .LBB0_2
.LBB0_10:                               # %._crit_edge
        ret
.Ltmp0:
        .size   _Z8siftDownPiii, .Ltmp0-_Z8siftDownPiii
        .cfi_endproc

        .globl  _Z7heapifyPii
        .align  16, 0x90
        .type   _Z7heapifyPii,@function
_Z7heapifyPii:                          # @_Z7heapifyPii
        .cfi_startproc
# BB#0:
                                        # kill: ESI<def> ESI<kill> RSI<def>
        test    esi, esi
        jle     .LBB1_13
# BB#1:                                 # %.lr.ph
        lea     eax, dword ptr [rsi - 2]
        shr     eax, 31
        lea     eax, dword ptr [rsi + rax - 2]
        sar     eax
        dec     esi
        movsxd  r8, eax
        .align  16, 0x90
.LBB1_2:                                # =>This Loop Header: Depth=1
                                        #     Child Loop BB1_4 Depth 2
        lea     eax, dword ptr [r8 + r8 + 1]
        cmp     eax, esi
        jg      .LBB1_12
# BB#3:                                 # %.lr.ph.i
                                        #   in Loop: Header=BB1_2 Depth=1
        mov     r9d, dword ptr [rdi + 4*r8]
        mov     r11d, r8d
        .align  16, 0x90
.LBB1_4:                                #   Parent Loop BB1_2 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        movsxd  r10, eax
        mov     ecx, eax
        cmp     r9d, dword ptr [rdi + 4*r10]
        jl      .LBB1_6
# BB#5:                                 # %select.mid
                                        #   in Loop: Header=BB1_4 Depth=2
        mov     ecx, r11d
.LBB1_6:                                # %select.end
                                        #   in Loop: Header=BB1_4 Depth=2
        cmp     eax, esi
        jge     .LBB1_10
# BB#7:                                 #   in Loop: Header=BB1_4 Depth=2
        inc     eax
        movsxd  rdx, ecx
        mov     r10d, dword ptr [rdi + 4*rdx]
        movsxd  rdx, eax
        cmp     r10d, dword ptr [rdi + 4*rdx]
        jl      .LBB1_9
# BB#8:                                 # %select.mid3
                                        #   in Loop: Header=BB1_4 Depth=2
        mov     eax, ecx
.LBB1_9:                                # %select.end2
                                        #   in Loop: Header=BB1_4 Depth=2
        mov     ecx, eax
.LBB1_10:                               #   in Loop: Header=BB1_4 Depth=2
        cmp     ecx, r11d
        je      .LBB1_12
# BB#11:                                #   in Loop: Header=BB1_4 Depth=2
        movsxd  r10, r11d
        movsxd  rdx, ecx
        mov     eax, dword ptr [rdi + 4*rdx]
        mov     dword ptr [rdi + 4*r10], eax
        mov     dword ptr [rdi + 4*rdx], r9d
        lea     eax, dword ptr [rcx + rcx + 1]
        mov     r11d, ecx
        cmp     eax, esi
        jle     .LBB1_4
.LBB1_12:                               # %_Z8siftDownPiii.exit
                                        #   in Loop: Header=BB1_2 Depth=1
        mov     rax, r8
        dec     rax
        test    r8d, r8d
        mov     r8, rax
        jg      .LBB1_2
.LBB1_13:                               # %._crit_edge
        ret
.Ltmp1:
        .size   _Z7heapifyPii, .Ltmp1-_Z7heapifyPii
        .cfi_endproc

        .globl  main
        .align  16, 0x90
        .type   main,@function
main:                                   # @main
        .cfi_startproc
# BB#0:
        push    rax
.Ltmp2:
        .cfi_def_cfa_offset 16
        mov     edi, data
        call    _Z8heapsortILm1000000EEvPi
        xor     eax, eax
        pop     rdx
        ret
.Ltmp3:
        .size   main, .Ltmp3-main
        .cfi_endproc

        .section        .text._Z8heapsortILm1000000EEvPi,"axG",@progbits,_Z8heapsortILm1000000EEvPi,comdat
        .weak   _Z8heapsortILm1000000EEvPi
        .align  16, 0x90
        .type   _Z8heapsortILm1000000EEvPi,@function
_Z8heapsortILm1000000EEvPi:             # @_Z8heapsortILm1000000EEvPi
        .cfi_startproc
# BB#0:
        mov     r8d, 499999
        mov     r9d, 999999
        .align  16, 0x90
.LBB3_1:                                # =>This Loop Header: Depth=1
                                        #     Child Loop BB3_3 Depth 2
        lea     ecx, dword ptr [r8 + r8 + 1]
        cmp     ecx, 999999
        jg      .LBB3_11
# BB#2:                                 # %.lr.ph.i.i
                                        #   in Loop: Header=BB3_1 Depth=1
        mov     r10d, dword ptr [rdi + 4*r8]
        mov     esi, r8d
        .align  16, 0x90
.LBB3_3:                                #   Parent Loop BB3_1 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        movsxd  rdx, ecx
        mov     eax, ecx
        cmp     r10d, dword ptr [rdi + 4*rdx]
        jl      .LBB3_5
# BB#4:                                 # %select.mid
                                        #   in Loop: Header=BB3_3 Depth=2
        mov     eax, esi
.LBB3_5:                                # %select.end
                                        #   in Loop: Header=BB3_3 Depth=2
        cmp     ecx, 999998
        jg      .LBB3_9
# BB#6:                                 #   in Loop: Header=BB3_3 Depth=2
        inc     ecx
        movsxd  rdx, eax
        mov     r11d, dword ptr [rdi + 4*rdx]
        movsxd  rdx, ecx
        cmp     r11d, dword ptr [rdi + 4*rdx]
        jl      .LBB3_8
# BB#7:                                 # %select.mid4
                                        #   in Loop: Header=BB3_3 Depth=2
        mov     ecx, eax
.LBB3_8:                                # %select.end3
                                        #   in Loop: Header=BB3_3 Depth=2
        mov     eax, ecx
.LBB3_9:                                #   in Loop: Header=BB3_3 Depth=2
        cmp     eax, esi
        je      .LBB3_11
# BB#10:                                #   in Loop: Header=BB3_3 Depth=2
        movsxd  rcx, esi
        movsxd  rdx, eax
        mov     esi, dword ptr [rdi + 4*rdx]
        mov     dword ptr [rdi + 4*rcx], esi
        mov     dword ptr [rdi + 4*rdx], r10d
        lea     ecx, dword ptr [rax + rax + 1]
        mov     esi, eax
        cmp     ecx, 1000000
        jl      .LBB3_3
.LBB3_11:                               # %_Z8siftDownPiii.exit.i
                                        #   in Loop: Header=BB3_1 Depth=1
        mov     rax, r8
        dec     rax
        test    r8d, r8d
        mov     r8, rax
        jg      .LBB3_1
        .align  16, 0x90
.LBB3_12:                               # %_Z7heapifyPii.exit.preheader
                                        # =>This Loop Header: Depth=1
                                        #     Child Loop BB3_14 Depth 2
        mov     r8d, dword ptr [rdi + 4*r9]
        mov     eax, dword ptr [rdi]
        mov     dword ptr [rdi + 4*r9], eax
        mov     dword ptr [rdi], r8d
        mov     eax, r9d
        lea     r9, qword ptr [r9 - 1]
        cmp     eax, 1
        jle     .LBB3_22
# BB#13:                                #   in Loop: Header=BB3_12 Depth=1
        mov     esi, 1
        xor     edx, edx
        .align  16, 0x90
.LBB3_14:                               # %.lr.ph.i
                                        #   Parent Loop BB3_12 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        movsxd  rcx, esi
        mov     eax, esi
        cmp     r8d, dword ptr [rdi + 4*rcx]
        jl      .LBB3_16
# BB#15:                                # %select.mid9
                                        #   in Loop: Header=BB3_14 Depth=2
        mov     eax, edx
.LBB3_16:                               # %select.end8
                                        #   in Loop: Header=BB3_14 Depth=2
        cmp     esi, r9d
        jge     .LBB3_20
# BB#17:                                #   in Loop: Header=BB3_14 Depth=2
        inc     esi
        movsxd  rcx, eax
        mov     r10d, dword ptr [rdi + 4*rcx]
        movsxd  rcx, esi
        cmp     r10d, dword ptr [rdi + 4*rcx]
        jl      .LBB3_19
# BB#18:                                # %select.mid11
                                        #   in Loop: Header=BB3_14 Depth=2
        mov     esi, eax
.LBB3_19:                               # %select.end10
                                        #   in Loop: Header=BB3_14 Depth=2
        mov     eax, esi
.LBB3_20:                               #   in Loop: Header=BB3_14 Depth=2
        cmp     eax, edx
        je      .LBB3_22
# BB#21:                                #   in Loop: Header=BB3_14 Depth=2
        movsxd  rcx, edx
        movsxd  rdx, eax
        mov     esi, dword ptr [rdi + 4*rdx]
        mov     dword ptr [rdi + 4*rcx], esi
        mov     dword ptr [rdi + 4*rdx], r8d
        lea     esi, dword ptr [rax + rax + 1]
        mov     edx, eax
        cmp     esi, r9d
        jle     .LBB3_14
.LBB3_22:                               # %_Z7heapifyPii.exit.backedge
                                        #   in Loop: Header=BB3_12 Depth=1
        test    r9d, r9d
        jg      .LBB3_12
# BB#23:
        ret
.Ltmp4:
        .size   _Z8heapsortILm1000000EEvPi, .Ltmp4-_Z8heapsortILm1000000EEvPi
        .cfi_endproc

        .section        .text.startup,"ax",@progbits
        .align  16, 0x90
        .type   _GLOBAL__sub_I_heapsort.cpp,@function
_GLOBAL__sub_I_heapsort.cpp:            # @_GLOBAL__sub_I_heapsort.cpp
        .cfi_startproc
# BB#0:
        push    rax
.Ltmp5:
        .cfi_def_cfa_offset 16
        mov     edi, _ZStL8__ioinit
        call    _ZNSt8ios_base4InitC1Ev
        mov     edi, _ZNSt8ios_base4InitD1Ev
        mov     esi, _ZStL8__ioinit
        mov     edx, __dso_handle
        pop     rax
        jmp     __cxa_atexit            # TAILCALL
.Ltmp6:
        .size   _GLOBAL__sub_I_heapsort.cpp, .Ltmp6-_GLOBAL__sub_I_heapsort.cpp
        .cfi_endproc

        .type   _ZStL8__ioinit,@object  # @_ZStL8__ioinit
        .local  _ZStL8__ioinit
        .comm   _ZStL8__ioinit,1,1
        .type   data,@object            # @data
        .data
        .globl  data
        .align  16
data:
; constants

Post 16 Mar 2015, 20:51
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1171
Location: Unknown

Quote:

I can give you a .inc file (in FASM syntax) with the same data, if you're interested.


I am interested. D:
Post 16 Mar 2015, 21:15
Tyler



Joined: 19 Nov 2009
Posts: 1216
Location: NC, USA
It was too big to upload. Here's a link: http://thardin.name/data.inc.
Post 17 Mar 2015, 14:36
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1171
Location: Unknown

Tyler wrote:
It was too big to upload. Here's a link: http://thardin.name/data.inc.



Quote:
error: illegal instruction.


Renaming "data" to "data0" worked... If it worked for you then there's a bug somewhere (in FASM). D:

+a few tips:
1. In binary the result was just 4 MB (compressed: 3.1 MB), or 3.6 MB as compressed plain text.

That could save tens of minutes for dial-up users. D:

Thanks for the input data! :D

I apologize for any inconvenience, if any.
Post 17 Mar 2015, 16:09
l_inc



Joined: 23 Oct 2009
Posts: 881
Tyler

Quote:
I wrote (aka translated from wikipedia) the algo for heapsort in C++


After you mentioned how cool generic programming is I must say, I rarely see such a perverted way of applying it. Aside from making the sorting even less generic than it ever appeared (meaning it's only usable for arrays with the size known at compile time) you'll end up with a replicated copy of the binary code for each individual array you sort provided they have different lengths.

As a side note by cleverly inlining the helper functions (heapify and sift) you could achieve about 20% speedup on average.

_________________
Faith is a superposition of knowledge and fallacy
Post 17 Mar 2015, 23:57
Tyler



Joined: 19 Nov 2009
Posts: 1216
Location: NC, USA

l_inc wrote:
Tyler

Quote:
I wrote (aka translated from wikipedia) the algo for heapsort in C++


After you mentioned how cool generic programming is I must say, I rarely see such a perverted way of applying it. Aside from making the sorting even less generic than it ever appeared (meaning it's only usable for arrays with the size known at compile time) you'll end up with a replicated copy of the binary code for each individual array you sort provided they have different lengths.

As a side note by cleverly inlining the helper functions (heapify and sift) you could achieve about 20% speedup on average.

Templates aren't only for generic programming. They're also used for metaprogramming and compile-time constant arguments. My intention was to explicitly make the size known at compile time (I guess I could have used constexpr), hoping maybe the compiler would do some constant propagation. I would guess that it did, and also inlined the functions and propagated the length to them too.

Now that I look at it, the asm for the heapsort instantiation doesn't contain any calls to the helper functions, so I guess it did do what I was hoping. (Though I didn't check if it propagated the size to them too. Nor did I check if it actually used the known size to do any useful loop transformations/unrolling. That would take too much effort.)

Is it dirty, premature optimization? Yes. But I had an intention in doing it.

One could change it to:

Code:

void heapsort(int *a, size_t count){
        heapify(a, count);

        int end = count - 1;
        while(end > 0){
                swap(a[end], a[0]);
                end--;
                siftDown(a, 0, end);
        }
}

template<size_t N>
void heapsort(int data[N]){
        heapsort(data, N);
}


and probably still get the same benefits (if there are any). Exe size may be bigger if a lot of instantiations are made, but I don't care so much about that. My guess is that the constant prop, if it leads to inlining and loop unrolling, will more than cancel out the larger code size (considering the whole program could fit in L2 anyway).

But now that I think about it, I suppose it was a waste of time anyway. At O3, the compiler probably would have inlined the whole heapsort function into main and done all the same constant propagation anyway.
Post 18 Mar 2015, 04:38
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15241
Location: 1I/ʻOumuamua

Tyler wrote:
... hoping maybe the compiler would ... I would guess that it did ... I guess it did do ... and probably still get the same ... My guess is that the ... At O3, the compiler probably would have ...

This is the problem with most compilers. No one ever knows what they will do, or even if they will do the same thing after an update, or if a different compiler will do something different. So much uncertainty, so much doubt, so much guessing. And if the compiler has been poisoned to introduce malware (yes, it is a real thing) then no one can really know without a significant amount of effort spent examining the output.
Post 18 Mar 2015, 06:30
Tyler



Joined: 19 Nov 2009
Posts: 1216
Location: NC, USA

revolution wrote:

Tyler wrote:
... hoping maybe the compiler would ... I would guess that it did ... I guess it did do ... and probably still get the same ... My guess is that the ... At O3, the compiler probably would have ...

This is the problem with most compilers. No one ever knows what they will do, or even if they will do the same thing after an update, or if a different compiler will do something different. So much uncertainty, so much doubt, so much guessing. And if the compiler has been poisoned to introduce malware (yes, it is a real thing) then no one can really know without a significant amount of effort spent examining the output.

Given that there are infinitely many ways to write something and that finding the most efficient is an undecidable problem (even for assembly programmers), I resign myself to the fact that the compiler is aware of certain transformations it can perform on certain forms, and that there are other forms on which potential transformations are missed because of the compiler's ignorance. If something is missed, it's not that big of a problem unless it is actually a problem (as in, it is measurably slower than my goal), in which case I can drag out the reference book and write a small amount of assembly, or rework the HLL into a form the compiler can work with.

The infected compiler thing is interesting, but I already have my fair share of tin foil hat theories, so I think I'll leave that one for you. I'll trust Apple and GCC to do reasonable code review just like I trust every other dev team of software I use. It's a calculated risk, of course, but I think I'll take my chances instead of writing my own OS in assembly. Though I don't think it would be that hard to check for, if you assume that the malicious stub calls at least one syscall, since those are traceable.
Post 18 Mar 2015, 10:48
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1171
Location: Unknown

Quote:

Yes, that is also my opinion and observation, contrary to public opinion. But maybe we are an exception to the rule. IMO, some objective research is needed.


I guess I found out why... D: At least in my case as described below:

I noticed that when I am coding programs in a "High Level Language" I usually code tens to thousands of lines without testing... The result!? A heap of bugs. D:

And when I code in assembly, I code in "checkpoints", and the code is thoroughly tested at every checkpoint to ensure everything is correct.
Sometimes bugs still emerge, of course. But when they do, it is usually something that was not covered by the "checkpoint" test. D:

And I disagree with the following statement (found on the web):

Quote:

there is a common myth among programmers that...takes much longer to code the same program in assembly language than in a HLL. This, indeed, is just a myth...


o_O'

I know from experience this is not a myth at all. It really takes much longer... Unless you are using macros... But then that's not "pure" assembly anymore, as you are using macros to write often-repetitive code and, sometimes, complete "statements" that resemble a sophisticated high level language more than anything else. D:

Again, sorry for any inconvenience (if any).
Post 19 Mar 2015, 00:31
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15241
Location: 1I/ʻOumuamua

Tyler wrote:
... rework the HLL into a form the compiler can work with.

Sure, that is a potential "solution", but it is only a temporary measure since the next iteration of the compiler might mess it all up again.

Writing-HLL-code-to-match-the-compiler-in-use strikes me as being very similar to writing-assembly-code-to-match-the-CPU-in-use. The concept is the same, only the finer details and execution are changed.
Post 19 Mar 2015, 00:44
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15241
Location: 1I/ʻOumuamua

HaHaAnonymous wrote:
And I disagree with the following statement (found on the web):

Quote:

there is a common myth among programmers that...takes much longer to code the same program in assembly language than in a HLL. This, indeed, is just a myth...


o_O'

I know from experience this is not a myth at all. It really takes much longer...

I think it depends upon the familiarity of the programmer. Some people will be more comfortable with assembly and others with HLLs.
Post 19 Mar 2015, 01:31
l_inc



Joined: 23 Oct 2009
Posts: 881
Tyler

Quote:
hoping maybe the compiler would do some constant propagation


Well, you know this is normally not what one would want from a sorting algorithm, because the array size is in most cases retrieved dynamically. You are obviously not prohibited from doing useless things on purpose, but that doesn't automatically make those things more useful.


Quote:
Now that I look at it, the asm for the heapsort instantiation doesn't contain any calls to the helper functions, so I guess it did do what I was hoping.


I must admit I did the measurements the first time using x86 and VS2010. To make a clean experiment I allowed myself to spend some time configuring clang, and the results at O3 are as follows (for 0x800000 elements):

Code:
HeapSortBad:         2006029 us
HeapSort:            1848703 us


The second result is my inlined implementation. I passed pointers to HeapSortBad and HeapSort to a measuring function with a noinline attribute, so that there's guaranteed no inlining into the main. I left the array size propagation for your implementation only.
So you can see the keyword really was "cleverly".
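
For reference, a minimal sketch of such a setup (the names, the signatures and the use of std::sort as a stand-in are assumptions, not l_inc's actual code): the sort goes through a function pointer and the wrapper is marked noinline, so neither implementation can be inlined into main.

Code:

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

__attribute__((noinline))
static long long measure_us(void (*sort)(int*, std::size_t), int *a, std::size_t n){
    auto t0 = std::chrono::steady_clock::now();
    sort(a, n);                       // indirect call: the sort cannot be inlined here
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

static void std_sort(int *a, std::size_t n){ std::sort(a, a + n); }   // stand-in for HeapSort

int main(){
    std::vector<int> v(0x800000, 1);
    std::printf("%lld us\n", measure_us(std_sort, v.data(), v.size()));
    return 0;
}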


Quote:
My guess is that the constant prop, if it leads to inlining and loop unrolling, will more than cancel out the larger code size (considering the whole program could fit in L2 anyway).


What do you mean by "canceling out"? Inlining and loop unrolling, which luckily is not done here, would make it even worse. Increased program size, whether it fits into L2 or not, correlates significantly with performance degradation because of the increased CPU resource footprint. Even if we forget about L1 (though it matters quite a lot), for such long-running code there will definitely be context switches resulting in cache line evictions: the larger the cache footprint, the more cache misses will follow from both scheduling out and scheduling back in. Additionally, other resources are wasted, such as TLB and branch predictor entries: the larger your code, the more CPU resources you waste and the more significant the performance degradation you will experience.

For that reason loop unrolling and inlining increase performance to a certain extent when done carefully, but careless code replication is a sign of sloppy coding.

_________________
Faith is a superposition of knowledge and fallacy
Post 19 Mar 2015, 02:26
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1171
Location: Unknown
Considering compilers are more efficient at code generation than humans, here is a question:

Just because compilers are more efficient at code generation does not mean it is easy, or that everyone writes more efficient code with a compiler than in assembly language.

As an example... Me.

I recently came to the conclusion that I fail to write high level code that is "faster" than my assembly coded variant. And unfortunately, it is not because I am an amazing assembly "coder", it is because I am a (much) worse high level coder. D:

Given that fact, is it better to write my programs in assembly or high level language!? I guess I will leave my HLL coded programs for secondary use (i.e. when not running on x86+Linux)... D:

Defense against possible "bullshits":
1.But it is the whole program performance that matters.

Answer: No, it is not. It is the algorithm that finishes first that opens the door faster. And having a slower one that opens it more "smoothly" will not help much! D:

Note, the comparison above may be a bit misleading depending on how you interpret it.

That's my thought of today!

I apologize for any inconveniences, if any.
Post 19 Mar 2015, 21:49