flat assembler
Message board for the users of flat assembler.

Index > Main > HLL compilers generate better code than hand written asm?

AsmGuru62



Joined: 28 Jan 2004
Posts: 1671
Location: Toronto, Canada
AsmGuru62 13 Mar 2015, 12:47
The algorithm change is the only optimization which gives real results.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20451
Location: In your JS exploiting you and your system
revolution 13 Mar 2015, 14:00
AsmGuru62 wrote:
The algorithm change is the only optimization which gives real results.
Not at all. One can vectorise their code and get a measurable "real" result.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20451
Location: In your JS exploiting you and your system
revolution 13 Mar 2015, 14:02
JohnFound wrote:
The code B might be optimized for size, instead of for speed. Sometimes the size optimizations are also faster, but not always.
I don't think it is fair to compare different optimisation goals. Apples to oranges.
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1178
Location: Unknown
HaHaAnonymous 13 Mar 2015, 14:49
Quote:

It is much more complex than chess.

Calm down, it was just a metaphor. D:

Quote:

And that is why, the compilers can't optimize the whole program - it needs understanding.

FPC has the so-called "whole program optimization"... Not that it is worth much, but it is there.

Quote:

And here comes the question - is changing the algorithm can be considered optimization?

I think not. Because optimization (in my opinion) means making the same algorithm faster. Rewriting it as a completely different one will require further optimization as well (and the chances of getting the maximum possible performance out of it are minimal).

Quote:

The algorithm change is the only optimization which gives real results.

That is (in my opinion) the #1 piece of bullshit I hear from many people. Why is it bullshit? My reasoning is above and below.

Quote:

Not at all. One can vectorise their code and get a measurable "real" result.

Now that makes sense! Because an algorithm can be implemented in an infinite number of ways, you can gain a lot of performance just by implementing it another way.

I would try optimizing your existing algorithm implementations first (unless you find a clearly more efficient algorithm), and if that is not enough... try creating a better one (if that is possible at all; it is harder than what I described previously).

Quote:

The algorithm change is the only optimization which gives real results.

It is not simply a change - what if you switch to a less efficient algorithm!? It has to be a "better" change (which is not always possible). Easier said than done.

I guess I have reached my bullshit (said by me) quota for today. D: I apologize for any inconveniences I may have caused.
Tyler



Joined: 19 Nov 2009
Posts: 1216
Location: NC, USA
Tyler 16 Mar 2015, 06:09
I think it really boils down to the fact that asm isn't necessary for the bulk of code that's written. Almost nothing is "hot" anymore; you just churn out thousands of lines of "cold" code that only gets executed once per 100+ ms. At that point, it makes more sense to use a descriptive language, or a language that allows code reuse (generic programming, polymorphism, etc.), to cut down on the amount of code written and time spent.

Of course you can do polymorphism in assembly. I've done it (in C, and I can translate anything in C to asm given some time and a reference book). The problem is that it's so trivial that there's no making it any faster. Either way you go, you're stuck having a vtable, and the vtable is the most expensive part of polymorphism (one or two memory reads with every method call). What you can't have in assembly is generic programming. (Maybe you can with some really awesome macros???) And generic programming is literally the coolest thing ever.

More on topic though, if someone wants to pick a sorting algorithm to test, I can probably write it in C/C++. It seems unfair, IMO, to compare assembly with Pascal. Not that I know anything about Pascal (too young for that Razz). It just doesn't have a reputation for being fast, AFAIK.
Tyler



Joined: 19 Nov 2009
Posts: 1216
Location: NC, USA
Tyler 16 Mar 2015, 06:15
revolution wrote:
AsmGuru62 wrote:
The algorithm change is the only optimization which gives real results.
Not at all. One can vectorise their code and get a measurable "real" result.
To add another example, you can flip loop indexes to get better cache performance. And this change doesn't necessarily have to change the algorithm at all, assuming it's a purely double nested loop (nothing in the outer loop).

Code:
for(int i = 0;i < SIZE;i++)
  for(int j = 0;j < SIZE;j++)
    array[j * SIZE + i] += 1; // Should swap the i and j, to access memory linearly.
    


GCC has an optimizer that does some pretty cool math to do this sort of optimization. https://gcc.gnu.org/wiki/Graphite General info: http://en.wikipedia.org/wiki/Polytope_model
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1178
Location: Unknown
HaHaAnonymous 16 Mar 2015, 16:47
Quote:

It just doesn't have a reputation for being fast, AFAIK.

That is not true at all (or not in most cases). As a test case, I implemented an algorithm in C (I am not a C* user), Pascal, and "hand written asm". The C variant performed the worst of them all and I couldn't make it any faster; I don't know whether that was because of my lack of experience or something else, but that is what happened. The "hand written asm" was a bit faster than Pascal (~1.6 seconds difference for an input of 250 million).

I guess it all depends on your experience, as the Pascal compiler (FPC) will do exactly what you wrote, while GCC will often catch your stupid code and avoid the unnecessary steps (two great examples: functions are not inlined unless explicitly declared so and enabled on the command line, and values known after a loop - FPC will generate the loop while GCC will just "hard code" the final result, as it is known at compile time).

Because of this, I think it is harder to write "fast" Pascal code than C* (GCC) code, as the Pascal compiler is more "obedient" (or less smart, as you wish).

But FPC compiles much faster... A giant project written in Pascal will compile from scratch in under 5 minutes while GCC will take tens of minutes to hours...

+The performance difference (if any) is not as catastrophic as many C* users claim (I guess that is because many of them have never touched Pascal code in their lives). There are other areas where FPC is better than GCC as well, so performance is just 1/10 of the story.

+It doesn't really matter, as performance critical routines are VERY OFTEN coded in hand written assembly (video/image/audio encoders and general file compression are great examples) because their high level variants are brutally slower. FPC shines here, as it provides much better ways of incorporating assembly code into your program (without including object files), while in GCC you have to type every line in a very uncomfortable and unproductive way.

More info at: http://www.freepascal.org/faq.var

I think I reached my bullshit quota for today. Sorry for wasting your time. D:


Last edited by HaHaAnonymous on 16 Mar 2015, 21:06; edited 4 times in total
Tyler



Joined: 19 Nov 2009
Posts: 1216
Location: NC, USA
Tyler 16 Mar 2015, 20:51
GCC is largely a kludge (and deliberately so), so compile times aren't that fair to compare. Clang is a lot better in terms of proper design, but slightly less powerful in terms of optimizations. It still may be slower than FPC, idk. I wrote (aka translated from Wikipedia) the heapsort algorithm in C++. It sorted 1,000,000 ints in 0.15 s. I can give you a .inc file (in FASM syntax) with the same data, if you're interested. (I understand if you're not; I wouldn't want to write this in assembly at all.) Here's the code:

To compile: clang++ -std=c++11 -march=native -O3 code.cpp
Add "-masm=intel -S" to get assembly.
g++ has similar performance, if you have it more readily available.

Code:
#include <iostream>

#include "data.h"

using namespace std;

void siftDown(int* a, int start, int end){
        int root = start;

        while(root * 2 + 1 <= end){
                int child = root * 2 + 1;
                int swap = root;

                if(a[swap] < a[child])
                        swap = child;
                if(child+1 <= end && a[swap] < a[child+1])
                        swap = child + 1;
                if(swap == root)
                        return;
                ::swap(a[root], a[swap]);
                root = swap;
        }
}

void heapify(int* a, int count){
        int start = (count - 2) / 2;

        while(start >= 0){
                siftDown(a, start, count - 1);
                start--;
        }
}

template<size_t N>
void heapsort(int data[N]){
        int* a = data;
        int count = N;
        
    heapify(a, count);

    int end = count - 1;
    while(end > 0){
        swap(a[end], a[0]);
        end--;
        siftDown(a, 0, end);
    }
}

int main(){
        heapsort<sizeof(data) / sizeof(int)>(data);
        /*for(auto i : data)
                cout << i << endl;*/
        return 0;
}    


ASM:
Code:
        .text
        .file   "heapsort.cpp"
        .globl  _Z8siftDownPiii
        .align  16, 0x90
        .type   _Z8siftDownPiii,@function
_Z8siftDownPiii:                        # @_Z8siftDownPiii
        .cfi_startproc
# BB#0:
                                        # kill: ESI<def> ESI<kill> RSI<def>
        lea     ecx, dword ptr [rsi + rsi + 1]
        cmp     ecx, edx
        jg      .LBB0_10
# BB#1:                                 # %.lr.ph
        movsxd  rax, esi
        mov     r8d, dword ptr [rdi + 4*rax]
        .align  16, 0x90
.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        movsxd  r9, ecx
        mov     r10d, ecx
        cmp     r8d, dword ptr [rdi + 4*r9]
        jl      .LBB0_4
# BB#3:                                 # %select.mid
                                        #   in Loop: Header=BB0_2 Depth=1
        mov     r10d, esi
.LBB0_4:                                # %select.end
                                        #   in Loop: Header=BB0_2 Depth=1
        cmp     ecx, edx
        jge     .LBB0_8
# BB#5:                                 #   in Loop: Header=BB0_2 Depth=1
        inc     ecx
        movsxd  rax, r10d
        mov     r9d, dword ptr [rdi + 4*rax]
        movsxd  rax, ecx
        cmp     r9d, dword ptr [rdi + 4*rax]
        jl      .LBB0_7
# BB#6:                                 # %select.mid5
                                        #   in Loop: Header=BB0_2 Depth=1
        mov     ecx, r10d
.LBB0_7:                                # %select.end4
                                        #   in Loop: Header=BB0_2 Depth=1
        mov     r10d, ecx
.LBB0_8:                                #   in Loop: Header=BB0_2 Depth=1
        cmp     r10d, esi
        je      .LBB0_10
# BB#9:                                 #   in Loop: Header=BB0_2 Depth=1
        movsxd  rax, esi
        movsxd  rcx, r10d
        mov     esi, dword ptr [rdi + 4*rcx]
        mov     dword ptr [rdi + 4*rax], esi
        mov     dword ptr [rdi + 4*rcx], r8d
        lea     ecx, dword ptr [r10 + r10 + 1]
        mov     esi, r10d
        cmp     ecx, edx
        jle     .LBB0_2
.LBB0_10:                               # %._crit_edge
        ret
.Ltmp0:
        .size   _Z8siftDownPiii, .Ltmp0-_Z8siftDownPiii
        .cfi_endproc

        .globl  _Z7heapifyPii
        .align  16, 0x90
        .type   _Z7heapifyPii,@function
_Z7heapifyPii:                          # @_Z7heapifyPii
        .cfi_startproc
# BB#0:
                                        # kill: ESI<def> ESI<kill> RSI<def>
        test    esi, esi
        jle     .LBB1_13
# BB#1:                                 # %.lr.ph
        lea     eax, dword ptr [rsi - 2]
        shr     eax, 31
        lea     eax, dword ptr [rsi + rax - 2]
        sar     eax
        dec     esi
        movsxd  r8, eax
        .align  16, 0x90
.LBB1_2:                                # =>This Loop Header: Depth=1
                                        #     Child Loop BB1_4 Depth 2
        lea     eax, dword ptr [r8 + r8 + 1]
        cmp     eax, esi
        jg      .LBB1_12
# BB#3:                                 # %.lr.ph.i
                                        #   in Loop: Header=BB1_2 Depth=1
        mov     r9d, dword ptr [rdi + 4*r8]
        mov     r11d, r8d
        .align  16, 0x90
.LBB1_4:                                #   Parent Loop BB1_2 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        movsxd  r10, eax
        mov     ecx, eax
        cmp     r9d, dword ptr [rdi + 4*r10]
        jl      .LBB1_6
# BB#5:                                 # %select.mid
                                        #   in Loop: Header=BB1_4 Depth=2
        mov     ecx, r11d
.LBB1_6:                                # %select.end
                                        #   in Loop: Header=BB1_4 Depth=2
        cmp     eax, esi
        jge     .LBB1_10
# BB#7:                                 #   in Loop: Header=BB1_4 Depth=2
        inc     eax
        movsxd  rdx, ecx
        mov     r10d, dword ptr [rdi + 4*rdx]
        movsxd  rdx, eax
        cmp     r10d, dword ptr [rdi + 4*rdx]
        jl      .LBB1_9
# BB#8:                                 # %select.mid3
                                        #   in Loop: Header=BB1_4 Depth=2
        mov     eax, ecx
.LBB1_9:                                # %select.end2
                                        #   in Loop: Header=BB1_4 Depth=2
        mov     ecx, eax
.LBB1_10:                               #   in Loop: Header=BB1_4 Depth=2
        cmp     ecx, r11d
        je      .LBB1_12
# BB#11:                                #   in Loop: Header=BB1_4 Depth=2
        movsxd  r10, r11d
        movsxd  rdx, ecx
        mov     eax, dword ptr [rdi + 4*rdx]
        mov     dword ptr [rdi + 4*r10], eax
        mov     dword ptr [rdi + 4*rdx], r9d
        lea     eax, dword ptr [rcx + rcx + 1]
        mov     r11d, ecx
        cmp     eax, esi
        jle     .LBB1_4
.LBB1_12:                               # %_Z8siftDownPiii.exit
                                        #   in Loop: Header=BB1_2 Depth=1
        mov     rax, r8
        dec     rax
        test    r8d, r8d
        mov     r8, rax
        jg      .LBB1_2
.LBB1_13:                               # %._crit_edge
        ret
.Ltmp1:
        .size   _Z7heapifyPii, .Ltmp1-_Z7heapifyPii
        .cfi_endproc

        .globl  main
        .align  16, 0x90
        .type   main,@function
main:                                   # @main
        .cfi_startproc
# BB#0:
        push    rax
.Ltmp2:
        .cfi_def_cfa_offset 16
        mov     edi, data
        call    _Z8heapsortILm1000000EEvPi
        xor     eax, eax
        pop     rdx
        ret
.Ltmp3:
        .size   main, .Ltmp3-main
        .cfi_endproc

        .section        .text._Z8heapsortILm1000000EEvPi,"axG",@progbits,_Z8heapsortILm1000000EEvPi,comdat
        .weak   _Z8heapsortILm1000000EEvPi
        .align  16, 0x90
        .type   _Z8heapsortILm1000000EEvPi,@function
_Z8heapsortILm1000000EEvPi:             # @_Z8heapsortILm1000000EEvPi
        .cfi_startproc
# BB#0:
        mov     r8d, 499999
        mov     r9d, 999999
        .align  16, 0x90
.LBB3_1:                                # =>This Loop Header: Depth=1
                                        #     Child Loop BB3_3 Depth 2
        lea     ecx, dword ptr [r8 + r8 + 1]
        cmp     ecx, 999999
        jg      .LBB3_11
# BB#2:                                 # %.lr.ph.i.i
                                        #   in Loop: Header=BB3_1 Depth=1
        mov     r10d, dword ptr [rdi + 4*r8]
        mov     esi, r8d
        .align  16, 0x90
.LBB3_3:                                #   Parent Loop BB3_1 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        movsxd  rdx, ecx
        mov     eax, ecx
        cmp     r10d, dword ptr [rdi + 4*rdx]
        jl      .LBB3_5
# BB#4:                                 # %select.mid
                                        #   in Loop: Header=BB3_3 Depth=2
        mov     eax, esi
.LBB3_5:                                # %select.end
                                        #   in Loop: Header=BB3_3 Depth=2
        cmp     ecx, 999998
        jg      .LBB3_9
# BB#6:                                 #   in Loop: Header=BB3_3 Depth=2
        inc     ecx
        movsxd  rdx, eax
        mov     r11d, dword ptr [rdi + 4*rdx]
        movsxd  rdx, ecx
        cmp     r11d, dword ptr [rdi + 4*rdx]
        jl      .LBB3_8
# BB#7:                                 # %select.mid4
                                        #   in Loop: Header=BB3_3 Depth=2
        mov     ecx, eax
.LBB3_8:                                # %select.end3
                                        #   in Loop: Header=BB3_3 Depth=2
        mov     eax, ecx
.LBB3_9:                                #   in Loop: Header=BB3_3 Depth=2
        cmp     eax, esi
        je      .LBB3_11
# BB#10:                                #   in Loop: Header=BB3_3 Depth=2
        movsxd  rcx, esi
        movsxd  rdx, eax
        mov     esi, dword ptr [rdi + 4*rdx]
        mov     dword ptr [rdi + 4*rcx], esi
        mov     dword ptr [rdi + 4*rdx], r10d
        lea     ecx, dword ptr [rax + rax + 1]
        mov     esi, eax
        cmp     ecx, 1000000
        jl      .LBB3_3
.LBB3_11:                               # %_Z8siftDownPiii.exit.i
                                        #   in Loop: Header=BB3_1 Depth=1
        mov     rax, r8
        dec     rax
        test    r8d, r8d
        mov     r8, rax
        jg      .LBB3_1
        .align  16, 0x90
.LBB3_12:                               # %_Z7heapifyPii.exit.preheader
                                        # =>This Loop Header: Depth=1
                                        #     Child Loop BB3_14 Depth 2
        mov     r8d, dword ptr [rdi + 4*r9]
        mov     eax, dword ptr [rdi]
        mov     dword ptr [rdi + 4*r9], eax
        mov     dword ptr [rdi], r8d
        mov     eax, r9d
        lea     r9, qword ptr [r9 - 1]
        cmp     eax, 1
        jle     .LBB3_22
# BB#13:                                #   in Loop: Header=BB3_12 Depth=1
        mov     esi, 1
        xor     edx, edx
        .align  16, 0x90
.LBB3_14:                               # %.lr.ph.i
                                        #   Parent Loop BB3_12 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        movsxd  rcx, esi
        mov     eax, esi
        cmp     r8d, dword ptr [rdi + 4*rcx]
        jl      .LBB3_16
# BB#15:                                # %select.mid9
                                        #   in Loop: Header=BB3_14 Depth=2
        mov     eax, edx
.LBB3_16:                               # %select.end8
                                        #   in Loop: Header=BB3_14 Depth=2
        cmp     esi, r9d
        jge     .LBB3_20
# BB#17:                                #   in Loop: Header=BB3_14 Depth=2
        inc     esi
        movsxd  rcx, eax
        mov     r10d, dword ptr [rdi + 4*rcx]
        movsxd  rcx, esi
        cmp     r10d, dword ptr [rdi + 4*rcx]
        jl      .LBB3_19
# BB#18:                                # %select.mid11
                                        #   in Loop: Header=BB3_14 Depth=2
        mov     esi, eax
.LBB3_19:                               # %select.end10
                                        #   in Loop: Header=BB3_14 Depth=2
        mov     eax, esi
.LBB3_20:                               #   in Loop: Header=BB3_14 Depth=2
        cmp     eax, edx
        je      .LBB3_22
# BB#21:                                #   in Loop: Header=BB3_14 Depth=2
        movsxd  rcx, edx
        movsxd  rdx, eax
        mov     esi, dword ptr [rdi + 4*rdx]
        mov     dword ptr [rdi + 4*rcx], esi
        mov     dword ptr [rdi + 4*rdx], r8d
        lea     esi, dword ptr [rax + rax + 1]
        mov     edx, eax
        cmp     esi, r9d
        jle     .LBB3_14
.LBB3_22:                               # %_Z7heapifyPii.exit.backedge
                                        #   in Loop: Header=BB3_12 Depth=1
        test    r9d, r9d
        jg      .LBB3_12
# BB#23:
        ret
.Ltmp4:
        .size   _Z8heapsortILm1000000EEvPi, .Ltmp4-_Z8heapsortILm1000000EEvPi
        .cfi_endproc

        .section        .text.startup,"ax",@progbits
        .align  16, 0x90
        .type   _GLOBAL__sub_I_heapsort.cpp,@function
_GLOBAL__sub_I_heapsort.cpp:            # @_GLOBAL__sub_I_heapsort.cpp
        .cfi_startproc
# BB#0:
        push    rax
.Ltmp5:
        .cfi_def_cfa_offset 16
        mov     edi, _ZStL8__ioinit
        call    _ZNSt8ios_base4InitC1Ev
        mov     edi, _ZNSt8ios_base4InitD1Ev
        mov     esi, _ZStL8__ioinit
        mov     edx, __dso_handle
        pop     rax
        jmp     __cxa_atexit            # TAILCALL
.Ltmp6:
        .size   _GLOBAL__sub_I_heapsort.cpp, .Ltmp6-_GLOBAL__sub_I_heapsort.cpp
        .cfi_endproc

        .type   _ZStL8__ioinit,@object  # @_ZStL8__ioinit
        .local  _ZStL8__ioinit
        .comm   _ZStL8__ioinit,1,1
        .type   data,@object            # @data
        .data
        .globl  data
        .align  16
data:
; constants    
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1178
Location: Unknown
HaHaAnonymous 16 Mar 2015, 21:15
Quote:

I can give you a .inc file (in FASM syntax) with the same data, if you're interested.

I am interested. D:
Tyler



Joined: 19 Nov 2009
Posts: 1216
Location: NC, USA
Tyler 17 Mar 2015, 14:36
It was too big to upload. Here's a link: http://thardin.name/data.inc.
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1178
Location: Unknown
HaHaAnonymous 17 Mar 2015, 16:09
Tyler wrote:
It was too big to upload. Here's a link: http://thardin.name/data.inc.

Quote:
error: illegal instruction.

Renaming "data" to "data0" worked... If it worked for you then there's a bug somewhere (in FASM). D:

+A few tips:
1. In binary the result would be just 4 MB; compressed, 3.1 MB. Or plain text compressed: 3.6 MB.

That could save tens of minutes for dial-up users. D:

Thanks for the input data! :D

I apologize for any inconvenience, if any.
l_inc



Joined: 23 Oct 2009
Posts: 881
l_inc 17 Mar 2015, 23:57
Tyler
Quote:
I wrote (aka translated from wikipedia) the algo for heapsort in C++

After you mentioned how cool generic programming is, I must say I have rarely seen such a perverted way of applying it. Aside from making the sorting even less generic than it ever was (meaning it's only usable for arrays whose size is known at compile time), you'll end up with a replicated copy of the binary code for each individual array you sort, provided they have different lengths.

As a side note, by cleverly inlining the helper functions (heapify and siftDown) you could achieve about a 20% speedup on average.

_________________
Faith is a superposition of knowledge and fallacy
Tyler



Joined: 19 Nov 2009
Posts: 1216
Location: NC, USA
Tyler 18 Mar 2015, 04:38
l_inc wrote:
Tyler
Quote:
I wrote (aka translated from wikipedia) the algo for heapsort in C++

After you mentioned how cool generic programming is, I must say I have rarely seen such a perverted way of applying it. Aside from making the sorting even less generic than it ever was (meaning it's only usable for arrays whose size is known at compile time), you'll end up with a replicated copy of the binary code for each individual array you sort, provided they have different lengths.

As a side note, by cleverly inlining the helper functions (heapify and siftDown) you could achieve about a 20% speedup on average.
Templates aren't only for generic programming. They're also used for metaprogramming and compile-time constant arguments. My intention was to explicitly make the size known at compile time (I guess I could have used constexpr), hoping the compiler would do some constant propagation. I would guess that it did, and also inlined the functions and propagated the length to them too.

Now that I look at it, the asm for the heapsort instantiation doesn't contain any calls to the helper functions, so I guess it did do what I was hoping. (Though I didn't check whether it propagated the size to them too, nor whether it actually used the known size to do any useful loop transformations/unrolling. That would take too much effort.)

Is it dirty, premature optimization? Yes. But I had an intention in doing it.

One could change it to:
Code:
void heapsort(int* data, size_t count){
        heapify(data, count);

        int end = count - 1;
        while(end > 0){
            swap(data[end], data[0]);
            end--;
            siftDown(data, 0, end);
        }
}

template<size_t N>
void heapsort(int data[N]){
        heapsort(data, N);
}
and probably still get the same benefits (if there are any). The exe size may be bigger if a lot of instantiations are made, but I don't care much about that. My guess is that the constant propagation, if it leads to inlining and loop unrolling, will more than cancel out the larger code size (considering the whole program could fit in L2 anyway).

But now that I think about it, I suppose it was a waste of time anyway. At O3, the compiler probably would have inlined the whole heapsort function into main and done all the same constant propagation anyway.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20451
Location: In your JS exploiting you and your system
revolution 18 Mar 2015, 06:30
Tyler wrote:
... hoping maybe the compiler would ... I would guess that it did ... I guess it did do ... and probably still get the same ... My guess is that the ... At O3, the compiler probably would have ...
This is the problem with most compilers. No one ever knows what they will do, whether they will do the same thing after being updated, or whether a different compiler will do something different. So much uncertainty, so much doubt, so much guessing. And if the compiler has been poisoned to introduce malware (yes, it is a real thing), then no one can really know without a significant amount of effort spent examining the output.
Tyler



Joined: 19 Nov 2009
Posts: 1216
Location: NC, USA
Tyler 18 Mar 2015, 10:48
revolution wrote:
Tyler wrote:
... hoping maybe the compiler would ... I would guess that it did ... I guess it did do ... and probably still get the same ... My guess is that the ... At O3, the compiler probably would have ...
This is the problem with most compilers. No one ever knows what they will do, whether they will do the same thing after being updated, or whether a different compiler will do something different. So much uncertainty, so much doubt, so much guessing. And if the compiler has been poisoned to introduce malware (yes, it is a real thing), then no one can really know without a significant amount of effort spent examining the output.
Given that there are infinitely many ways to write something, and that finding the most efficient is an undecidable problem (even for assembly programmers), I resign myself to the fact that the compiler is aware of certain transformations it can perform on certain forms, and that there are other forms on which potential transformations are missed because of the compiler's ignorance. If something is missed, it's not that big of a problem unless it is actually a problem (as in, it is measurably slower than my goal), in which case I can drag out the reference book and write a small amount of assembly, or rework the HLL code into a form the compiler can work with.

The infected compiler thing is interesting, but I already have my fair share of tin foil hat theories, so I think I'll leave that one for you. Smile I'll trust Apple and GCC to do reasonable code review just like I trust every other dev team of software I use. It's a calculated risk, of course, but I think I'll take my chances instead of writing my own OS in assembly. Though I don't think it would be that hard to check for, if you assume that the malicious stub calls at least one syscall, since those are traceable.
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1178
Location: Unknown
HaHaAnonymous 19 Mar 2015, 00:31
Quote:

Yes, the same is my opinion and observations. Contrary to the public opinion. But maybe we are some exception from the rule. IMO, some objective research is needed.

I guess I found out why... D: At least in my case, as described below:

I noticed that when I am coding programs in a "High Level Language" I usually write tens to thousands of lines without testing... The result!? A heap of bugs. D:

And when I code in assembly, I code in "checkpoints" and the code is thoroughly tested at every checkpoint to ensure everything is correct.
Sometimes bugs still emerge, of course. But when they do, it is (most of the time) something that was not considered in the "checkpoint" test. D:

And I disagree with the following statement (found on the web):
Quote:

there is a common myth among programmers that...takes much longer to code the same program in assembly language than in a HLL. This, indeed, is just a myth...

o_O'

I know from experience this is not a myth at all. It really takes much longer... unless you are using macros. But then it's not "pure" assembly anymore, as you are using macros to write the often-repetitive code and, sometimes, complete "statements" that resemble a sophisticated high level language more than anything else. D:

Again, sorry for any inconvenience (if any).
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20451
Location: In your JS exploiting you and your system
revolution 19 Mar 2015, 00:44
Tyler wrote:
... rework the HLL into a form the compiler can work with.
Sure, that is a potential "solution", but it is only a temporary measure since the next iteration of the compiler might mess it all up again.

Writing-HLL-code-to-match-the-compiler-in-use strikes me as being very similar to writing-assembly-code-to-match-the-CPU-in-use. The concept is the same, only the finer details and execution are changed.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20451
Location: In your JS exploiting you and your system
revolution 19 Mar 2015, 01:31
HaHaAnonymous wrote:
And I disagree with the following statement (found on the web):
Quote:

there is a common myth among programmers that...takes much longer to code the same program in assembly language than in a HLL. This, indeed, is just a myth...

o_O'

I know from experience this is not a myth at all. It really takes much longer...
I think it depends upon the familiarity of the programmer. Some people will be more comfortable with assembly and others with HLLs.
l_inc



Joined: 23 Oct 2009
Posts: 881
l_inc 19 Mar 2015, 02:26
Tyler
Quote:
hoping maybe the compiler would do some constant propagation

Well, you know, this is normally not what one would want from a sorting algorithm, because the array size is in most cases only known at run time. You are obviously not prohibited from doing useless things on purpose, but that doesn't automatically make those things more useful.
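To make the difference concrete, here is a minimal heapsort sketch of my own (an illustration under assumed names, not the code measured in this thread). A sorting routine normally receives its size at run time; only when the size is fixed at compile time, e.g. as a template parameter, can the compiler propagate it as a constant:
Code:
```cpp
#include <cstddef>
#include <utility>

// Plain heapsort taking a run-time size n (the usual situation).
static void sift_down(int *a, std::size_t root, std::size_t n) {
    while (2 * root + 1 < n) {
        std::size_t child = 2 * root + 1;      // left child
        if (child + 1 < n && a[child] < a[child + 1])
            ++child;                           // pick the larger child
        if (a[root] >= a[child])
            return;                            // heap property holds
        std::swap(a[root], a[child]);
        root = child;
    }
}

void heap_sort(int *a, std::size_t n) {        // size known at run time
    for (std::size_t i = n / 2; i-- > 0; )     // build max-heap
        sift_down(a, i, n);
    for (std::size_t end = n; end-- > 1; ) {   // repeatedly extract max
        std::swap(a[0], a[end]);
        sift_down(a, 0, end);
    }
}

template <std::size_t N>
void heap_sort_fixed(int (&a)[N]) {            // size fixed at compile time
    heap_sort(a, N);                           // N is a constant here
}
```
With `heap_sort_fixed` the compiler is free to specialize the code per N; whether that specialization actually pays off is exactly what the measurements in this thread dispute.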

Quote:
Now that I look at it, the asm for the heapsort instantiation doesn't contain any calls to the helper functions, so I guess it did do what I was hoping.

I must admit I did the measurements the first time using x86 and VS2010. To make a clean experiment I allowed myself to spend some time configuring clang, and the results at -O3 are as follows (for 0x800000 elements):
Code:
HeapSortBad:         2006029 us
HeapSort:            1848703 us    

The second result is my inlined implementation. I passed pointers to HeapSortBad and HeapSort to a measuring function with a noinline attribute, so that inlining into main is guaranteed not to happen. I left the array size propagation for your implementation only.
So you can see the keyword really was "cleverly".

Quote:
My guess is that the constant prop, if it leads to inlining and loop unrolling, will more than cancel out the larger code size (considering the whole program could fit in L2 anyway).

What do you mean by "canceling out"? Inlining and loop unrolling, which luckily is not done here, would make it even worse. Increased program size, whether or not it fits into L2, correlates significantly with performance degradation because of the increased CPU resource footprint. Even if we forget about L1 (though it matters quite a lot), for such long-running code there will inevitably be context switches that evict cache lines: the larger the cache footprint, the more cache misses follow both on scheduling out and on scheduling back in. Other resources are wasted as well, such as TLB and branch-predictor entries: the larger your code, the more CPU resources you waste and the more significant the performance degradation you'll experience.

For that reason loop unrolling and inlining increase performance only to a certain extent, and only when done carefully; careless code replication is a sign of sloppy coding.

_________________
Faith is a superposition of knowledge and fallacy
HaHaAnonymous 19 Mar 2015, 21:49
Considering that compilers are said to be more efficient at code generation than humans, here is a thought:

Just because compilers are more efficient at code generation does not mean it is easy, or that everyone writes more efficient code with a compiler than in assembly language.

As an example... Me.

I recently came to the conclusion that I fail to write high level code that is "faster" than my assembly-coded variant. And unfortunately, it is not because I am an amazing assembly "coder"; it is because I am a (much) worse high level coder. D:

Given that fact, is it better to write my programs in assembly or in a high level language!? I guess I will keep my HLL-coded programs for secondary use (i.e. when not running on x86+Linux)... D:

Defense against possible "bullshit":
1. But it is the whole program performance that matters.

Answer: No, it is not. It is the algorithm that finishes first that opens the door faster. Having a slower one that opens it "smoother" will not help much! D:

Note, the comparison above may be a bit misleading depending on how you interpret it.

That's my thought of today!

I apologize for any inconveniences, if any.
Copyright © 1999-2025, Tomasz Grysztar.