flat assembler
Message board for the users of flat assembler.

Index > Heap > Intel C/C++ compiler discussion

Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd!
tom tobias wrote:

Frankly, I don't care if you especially, or anyone else for that matter, agrees with me or not.

Tom, you should be ashamed of yourself for not having been able to shred Igor any better than this. You can tell from his posts, both here and on the Intel forum, that he isn't very capable of optimizing x86 assembly. Just take as an example the thread he referred to, let's see... it was http://www.intel.com/cd/ids/developer/asmo-na/eng/dc/code/languages/194751.htm where he takes 6 pages to brag about 3 lines of code which were of course rather obvious and not optimal.

Anyone who tells you that today's compilers are so good that you can't normally beat them with handwritten assembly just isn't any good at handwritten assembly. Code generation is not like chess where there is a stationary target for programmers to aim at for decades. Processors change over time and new code sequences become optimal. It takes years, more than a single processor generation, for compiler writers to catch up. I can think of one optimization that was good for P4 and even better for Core 2 Duo, yet still isn't implemented by Intel's compiler. When the compiler gets within a factor of 2 in performance and it isn't because something obvious like throughput to memory is the unworkaroundable limiting factor, that's a miracle. Factors of 3 or even 5 or more are more typical.

All that said, xor eax, eax is something you get attached to, just like your grandmother got attached to all the knickknacks she kept on her shelf: it's not a question of whether they offered her any discernible utility, but they made her feel at home. In much the same way xor eax, eax looks more like home cooking than what an HLL compiler would spit out.

A sequence of XORs seems to be the most efficient way to copy eax into edx without disturbing the high bits of rdx. Also for SSE there is no movapd xmm0, immediate 0, so we have to fall back on xorpd xmm0, xmm0 (or xorps if we want one byte less.) I recall when I was learning machine language that it was a rewarding learning experience when I hit the chapter on logical and shift instructions. I reached the conclusion that it would be necessary to become fluent in these instructions which treat registers as buckets of bits rather than values to understand machine- or assembly-level code.
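
For what it's worth, here are the usual XMM-zeroing idioms side by side with their encoded lengths as I remember them (easy to verify by assembling with fasm and looking at the output):

Code:
    xorps   xmm0, xmm0      ; 3 bytes, zeroes the register
    xorpd   xmm0, xmm0      ; 4 bytes, same effect, one byte longer
    pxor    xmm0, xmm0      ; 4 bytes, integer-domain flavour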

Oh yes, why were people recommending cmp eax, 0 or or eax, eax in this thread when test eax, eax seems more natural than either? Maybe that was covered in one of the similar threads referenced upthread, but it didn't seem to get touched on here.
Post 07 Aug 2007, 07:06
FrozenKnight



Joined: 24 Jun 2005
Posts: 128
FrozenKnight
If I were looking at it from Tom's point of view, I'd have to accept that readability is paramount in what some consider the hardest language to program in, and that in order to make things easier for those few who don't want to take 5 minutes to look up or calculate what a command does, we need to sacrifice performance.

If I were to accept this way of thinking, then I would use bubble sort for all my sorts, because it's the most readable sort. Or how about writing bloated API wrappers with long descriptive names, just to make things easier for others? Who cares if the code won't run on my friend's 500 MHz server; it's more readable and thus much better for development.

My view is that I want my code to leave 99% or more of the CPU idle so that everyone can enjoy using as many apps as they like. I don't care if I create a mess of spaghetti for other coders; I'll leave at least a few comments, and anyone who really needs to understand my code can spend a few hours reading it. This is hobby code -- unless a job lets me program that way (for performance's sake, of course), in which case I'm just building job security (as it will cost them more to get rid of me).
Post 07 Aug 2007, 10:04
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Madis731
Smile A lot of knocked out teeth at the conference Wink

I wonder what other people (I mean the un-ASM-aware ones) might say when they hear us screaming "XOR is better!!!", "MOV is better!!!"
Post 07 Aug 2007, 10:29
levicki



Joined: 29 Jul 2007
Posts: 26
Location: Belgrade, Serbia
levicki
LocoDelAssembly wrote:
Note that the preprocessor layer and interpreter layer are so powerful that you could add SSE4 ISA via macros if you want.


I know that is possible. I already have such macros for ml.exe, because the latest version, which comes with VS2005, has partially broken support for some instructions.

Xorpd! wrote:
Just take as an example the thread he referred to, let's see... it was http://www.intel.com/cd/ids/developer/asmo-na/eng/dc/code/languages/194751.htm where he takes 6 pages to brag about 3 lines of code which were of course rather obvious and not optimal.


If you had ever written an article for a wide audience, you would know that such an article has to have a certain form and length, and that it has to explain the subject without presuming that everybody knows what you are talking about.

By the way, if you are so good at code optimization, you should consider bragging about it on their site -- they pay very well for those articles if you have the nerves to adhere to Intel Developer Services Writer's Guide and if they accept your article for publication.

Xorpd! wrote:
You can tell from his posts, both here and on the Intel forum, that he isn't very capable of optimizing x86 assembly.


If you always jump to conclusions based on a few posts, then you aren't as smart and capable as you want other people to believe.

Xorpd! wrote:
Anyone who tells you that today's compilers are so good that you can't normally beat them with handwritten assembly just isn't any good at handwritten assembly.


If you are the same Xorpd who wrote memcpy.asm, then you should compare the speed of your memcpy() with Intel's memcpy() from their compiler's runtime library.

Let's just say that I did. I converted that code to 32-bit in order to do so, because I didn't have an x64 OS installed at the moment, and it wasn't any faster for me. That about sums up what I am trying to say.

Xorpd! wrote:
Code generation is not like chess where there is a stationary target for programmers to aim at for decades. Processors change over time and new code sequences become optimal. It takes years, more than a single processor generation, for compiler writers to catch up.


Code generation is like chess because algorithms are more or less stationary. Compilers learn to optimize common HLL code sequences, and not only to generate code suited for different CPU architectures.

Sure, processors change, but you are wrong about catching up, because things are changing for the better there -- for example, the latest Intel compiler can already generate SSE4.1 code before Penryn/Yorkfield/Wolfdale have hit the market.

Xorpd! wrote:
When the compiler gets within a factor of 2 in performance and it isn't because something obvious like throughput to memory is the unworkaroundable limiting factor, that's a miracle.


Factor of 2 compared to what? To another compiler or to handwritten assembler? Because writing assembler which is 2-5 times faster than the MSVC or GCC generated code is not exactly a miracle, it is pretty common. If you do compare yourself, compare with the best optimizing compiler.

Xorpd! wrote:
I recall when I was learning machine language that it was a rewarding learning experience when I hit the chapter on logical and shift instructions. I reached the conclusion that it would be necessary to become fluent in these instructions which treat registers as buckets of bits rather than values to understand machine- or assembly-level code.


If you like shifting, then you should know that any instruction reading the flags after a shift with an immediate or CL count > 1 gives you a partial flags stall. The penalty is approx. 4 cycles. For example:

Code:
    shr     eax, 2
    jz      somewhere


This creates a partial flags stall. Another example of a partial flags stall, but for a different reason, is:

Code:
    cmp     eax, ebx
    inc     ecx
    jbe     somewhere


That is one of the reasons why INC and DEC are deprecated and ADD reg, 1 and SUB reg, 1 are recommended instead.
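
One way out of the second example above (just a sketch, assuming ecx does not feed into the compare) is to hoist the counter update so that JBE reads flags written entirely by CMP:

Code:
    add     ecx, 1          ; writes the complete flags set, no INC quirk
    cmp     eax, ebx        ; JBE now reads flags produced by CMP alone
    jbe     somewhere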

Xorpd! wrote:
Oh yes, why were people recommending cmp eax, 0 or or eax, eax in this thread when test eax, eax seems more natural than either? Maybe that was covered in one of the similar threads referenced upthread, but it didn't seem to get touched on here.


Perhaps for readability? Very Happy

Just kidding. But everyone should know that AND reg, reg is sometimes preferred over TEST reg, reg. Although they should behave exactly the same, they sometimes don't:

Code:
    test    ebx, ebx
    lahf


Gives you a partial flags stall, while:

Code:
    and     ebx, ebx
    lahf


Does not.

Let's get back on topic -- I forgot to say that XOR does not break dependence on earlier architectures such as Pentium Pro, Pentium II and Pentium III, only on NetBurst (Pentium 4) and later.

Take a look at this example I found in a manual written by Agner Fog:
Code:
    div     ebx
    mov     [mem], eax
    mov     eax, 0
    xor     eax, eax
    mov     al, cl
    add     ebx, eax

It seems redundant to clear eax twice, doesn't it? According to him, MOV eax, 0 breaks the dependence on PPro, P2 and P3, while XOR eax, eax does it on P4 and later. So the MOV allows the CPU to execute the last two instructions out of order without waiting for the slow DIV to finish, and the XOR prevents a partial register stall in the last instruction (reading the whole register after writing only part of it).

By the way, do you guys know that there is another non-obvious way to clear a register -- SUB reg, reg Wink
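
For completeness, here are the common zeroing idioms side by side, with their sizes as I remember the encodings (easy to check with fasm). Note that MOV is the only one that leaves the flags untouched, and whether SUB reg, reg is treated as dependency-breaking varies by CPU, so check Agner's tables for your target:

Code:
    mov     eax, 0          ; 5 bytes, does not modify flags
    xor     eax, eax        ; 2 bytes, writes flags
    sub     eax, eax        ; 2 bytes, writes flags
    and     eax, 0          ; 3 bytes, writes flags, rarely useful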
Post 07 Aug 2007, 12:19
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd!
levicki wrote:

If you are the same Xorpd who wrote memcpy.asm, then you should compare the speed of your memcpy() with Intel's memcpy() from their compiler's runtime library.

Let's just say that I did. I converted that code to 32-bit in order to do so, because I didn't have an x64 OS installed at the moment, and it wasn't any faster for me. That about sums up what I am trying to say.

It's nice to see that someone has dared to run my code for once. However my browser seems to have some difficulty with reading attachments in this forum so I can't see your test code. Perhaps you can just tell me about it instead: what processor is it running on? I presume Core 2 Duo because it's the only one available that can handle SSSE3. Does your test code compare Intel's code with mine for what it is trying to do, which is to be fast for all alignments, where speed is measured for in-cache datasets? If the dataset is out of cache then it's too hard to see any improvement for CPU-directed code. What does your table of clock tick counts for all alignments look like on your processor for my code and for Intel's? My code is not optimal and there are a couple of improvements that could bump up the speed perhaps 10% or so but I never got around to implementing them because I didn't think there was any interest out there in this exercise.
levicki wrote:

Code generation is like chess because algorithms are more or less stationary. Compilers learn to optimize common HLL code sequences, and not only to generate code suited for different CPU architectures.

Sure, processors change, but you are wrong about catching up, because things are changing for the better there -- for example, the latest Intel compiler can already generate SSE4.1 code before Penryn/Yorkfield/Wolfdale have hit the market.

If code generation is like chess, then it is also like go (where, as far as I am aware, no impressive program has yet been written): when the rules are changed, the compiler gets much weaker. Intel may be capable of generating SSE4.1 code, but generating good SSE2 code is quite beyond their compiler's abilities. SSE2 has been around for some time, too. The snippet you beat up on in your article was a relatively good one for Intel; the bad ones are total clunkers.
levicki wrote:


Xorpd! wrote:
When the compiler gets within a factor of 2 in performance and it isn't because something obvious like throughput to memory is the unworkaroundable limiting factor, that's a miracle.




Factor of 2 compared to what? To another compiler or to handwritten assembler? Because writing assembler which is 2-5 times faster than the MSVC or GCC generated code is not exactly a miracle, it is pretty common. If you do compare yourself, compare with the best optimizing compiler.

You seem to have taken the meaning of my statement backwards. I was saying that for the compiler to come within a factor of 2 of handwritten code is a miracle for the compiler!
Post 07 Aug 2007, 17:27
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
rugxulo
Hmmm, the tone of some of these messages is a bit too serious, even confrontational. So, let's avoid that, shall we?

Anyways, first of all, levicki, welcome to the forum! We hope you take a deeper look at FASM and that it serves you well in your endeavors!

N.B. I am only a hobbyist, not a pro. I like trying new compilers, successfully compiling stuff (which is rare, thanks to stupid makefiles, shell scripts, and broken messes needing billions of weird, specific bits of crud), etc. C is still the most popular language but not the "best" by any means. Supporting C (and all its variants) well is very hard, so optimizing is probably not the first priority for most compiler writers.

levicki wrote:

Like MSVC 2005, GCC4 and especially Intel C/C++ 10.0. And while MSVC and GCC produce reasonable code by today's standards, the Intel compiler is ahead of them because it has an auto-vectorizer and an auto-parallelizer, plus it can perform numerous loop transformations in order to squeeze out maximum efficiency and keep the pipeline busy. Happy now?


I get the impression that all your talk is about speed optimization, not size or compatibility or whatever. That's fine, but that's not exclusively what I meant. And these three compilers are all completely different animals, so it's pretty much impossible to compare them anyway.

GCC4 -- How is it better than before? Faster C++ (but C++ object incompatible)?? Tell it to the MinGW peeps!! They still use GCC3, as do others (and GCC4 keeps breaking older code, postponing FreeBSD 7 a bit). But when you say "vectorisation" (which I am not familiar with at all), I assume you mean SSE, etc. Does GCC4 do well for it and GCC3 suck? (BTW, GCC is the only one of these three that runs on DOS, currently at 4.2.1.) This compiler is "free" in every sense (open source, no cost, etc.) unless you hate GPLv2 (soon to be exclusively GPLv3).

MSVC -- This takes 500 MB of space for Express 2k5, last I checked. And be sure not to use a third-party hack for that plugin architecture (or whatever) or they'll sue you. And don't use it for any other OS besides Windows. Ignore the fact that it needs a fairly new cpu just to run. But hey, I'm sure it's a good compiler. It's no cost, for hobbyists at least.

Intel -- Mea culpa, it's not $600. It's only $599 (professional) and $449 (standard). And you need a lot of cpu resources, XP or better, and MSVC SDK (or whatever). But surprise surprise, they can optimize for their own chips! (Too bad Linux only had 9.0 for non-commercial use last I checked: supposedly pretty broken AMD / SSE support for that without third-party hacks.) But yeah, you can try an evaluation version ONE TIME (non-renewable) if you think $449 makes it a thousand times less braindead than GCC4.

And even that's ignoring all of the following (Win32 is your forte, no?): Digital Mars, LCC-Win32, Pelles C, OpenWatcom, Cygwin, CC386, BCC5, Turbo C++ Explorer, TinyC, etc. (no evaluation required, heh).

Compilers are a means to an end, a necessary evil if you will. But if they are super slow and require super duper resources just to run, maybe they need to recompile themselves!! Laughing (so true!)

levicki wrote:

Try to compile some simple code and take a look at the assembler listing; you may end up surprised and may even learn some neat tricks.


I don't think I'll be learning more from a compiler than a forum full of skilled assembly programmers (Dex4u, Tomasz, vid, revolution, crc, ronware, Madis, MCD, LocoDelAssembly, DJ Mauretto, asmfan, ssp, ATV, f0dder, etc).

levicki wrote:

Compilers have come a long way and they have really improved considerably


In what way? Which ones are "better" now that sucked before? Which ones were SO bad that they were useless for any "real" work? (BTW, Linus still uses GCC 2.95 for the kernel, last I heard.)
Post 07 Aug 2007, 19:12
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid
levicki: first of all, thanks for a lot of insight into things rarely mentioned on this board.

Quote:
By the way, if you are so good at code optimization, you should consider bragging about it on their site -- they pay very well for those articles if you have the nerves to adhere to Intel Developer Services Writer's Guide and if they accept your article for publication.
Could you provide me a link to the "Intel Developer Services Writer's Guide"? I failed to find it anywhere.

Not that I feel I know enough about optimizing; I am just curious.
Post 07 Aug 2007, 19:23
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
rugxulo wrote:

skilled assembly programmers (..., LocoDelAssembly, ...)


WTF???
Post 07 Aug 2007, 19:42
levicki



Joined: 29 Jul 2007
Posts: 26
Location: Belgrade, Serbia
levicki
Xorpd! wrote:
However my browser seems to have some difficulty with reading attachments in this forum so I can't see your test code.


That is because I haven't posted it. I tested that, and since it wasn't useful I deleted it.

Xorpd! wrote:
Perhaps you can just tell me about it instead:


Sure.

- CPU is a Core 2 Duo E6300.
- It compares Intel's memcpy() with yours in a real application, inside a 3D Gaussian blur function. memcpy() was used for copying processed rows back into the source, because the operation has to be in-place and you can't write the previous two rows back until you finish the third one or you will get wrong results for the third row.
- That means I haven't used your test driver, so I do not have a clock table, but I measured the time spent in your memcpy() and in Intel's, and it was about the same or a bit worse.

Xorpd! wrote:
My code is not optimal and there are a couple of improvements that could bump up the speed perhaps 10% or so but I never got around to implementing them because I didn't think there was any interest out there in this exercise.


I understand, but I did not mean to judge your abilities only by looking at that piece of code you wrote. I just wanted you to take a look at their memcpy() and the other code the compiler is able to generate and to compare your code with that before concluding that humans are always better and that writing in assembly always pays off.

Xorpd! wrote:
Intel may be capable of generating SSE4.1 code, but generating good SSE2 code is quite beyond their compiler's abilities. SSE2 has been around for some time, too.


For me it produces quite decent code. Are you saying you tried and compared your own results? Can we see some then?

Xorpd! wrote:
The snippet you beat up on in your article was a relatively good one for Intel; the bad ones are total clunkers.


The snippet was inspired by real-life code. If you had read the article carefully you would have noticed that the compiler was able to vectorize the code when the data was signed short. It wasn't the fault of the compiler that the x86 SIMD instruction set lacked proper instructions to support an unsigned transformation of the same code.

So, what I have done was just a work-around for that limitation, nothing more. They have probably included that trick in the next compiler version given that the code I submitted with the article is now their property.

Xorpd! wrote:
I was saying that for the compiler to come within a factor of 2 of handwritten code is a miracle for the compiler!


Again I am not sure if you mean 2x slower than handwritten or 2x faster?

In my opinion, it is a miracle for any compiler to produce code that runs at the same speed as your (or my) handwritten code.

That is a miracle because it allows us to concentrate on the most critical parts of the code and let the compiler deal with the rest, the less important code paths.

rugxulo wrote:
Hmmm, the tone of some of these messages is a bit too serious, even confrontational. So, let's avoid that, shall we?


I agree.

rugxulo wrote:
Anyways, first of all, levicki, welcome to the forum! We hope you take a deeper look at FASM and that it serves you well in your endeavors!


Thanks, I will check it out.

rugxulo wrote:
GCC4 -- How is it better than before? Faster C++ (but C++ object incompatible)?? Tell it to the MinGW peeps!! They still use GCC3, as do others (and GCC4 keeps breaking older code, postponing FreeBSD 7 a bit).


It generates faster code. That is all I meant to say.

rugxulo wrote:
But when you say "vectorisation" (which I am not familiar with at all), I assume you mean SSE, etc. Does GCC4 do well for it and GCC3 suck?


I am not aware of any other vectorizing compiler save Intel's C/C++ and Codeplay's VectorC. By "vectorization" I mean automatically exploiting SIMD (SSE, SSE2, etc) capabilities of modern CPUs for common loop structures used in C/C++ code.
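
To illustrate: for a trivial loop such as for (i = 0; i < n; i++) c[i] = a[i] + b[i] over 32-bit integers, a vectorizer replaces the scalar body with packed SSE2 operations that handle four elements per iteration. A hand-written sketch of the kind of inner loop it emits (ignoring the alignment checks and the remainder loop a real compiler also generates) looks like this:

Code:
    ; esi -> a, edi -> b, edx -> c, ecx = n/4, all buffers 16-byte aligned
  vector_loop:
    movdqa  xmm0, [esi]     ; load four ints from a
    paddd   xmm0, [edi]     ; add four ints from b in a single instruction
    movdqa  [edx], xmm0     ; store four results into c
    add     esi, 16
    add     edi, 16
    add     edx, 16
    sub     ecx, 1
    jnz     vector_loop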

rugxulo wrote:
MSVC -- This takes 500 MB of space for Express 2k5, last I checked. And be sure not to use a third-party hack for that plugin architecture (or whatever) or they'll sue you.


The compiler itself doesn't take that much space. You probably meant Platform SDK (include files and DLL import libraries for Windows).

You can't write Windows GUI programs without those, just like you can't write decent Linux GUI applications without having Gnome or KDE and probably X server libraries installed, not to mention g++ and its own libraries plus perhaps some high-level GUI toolkit, so let's leave size comparisons out of this discussion.

Plugins are something you don't need for simpler projects, and you can always use the command line (cl.exe, link.exe) and makefiles. If you don't need the Visual Studio IDE, then just get the Platform SDK, which has the compiler, libraries and includes you need.

rugxulo wrote:
Intel -- Mea culpa, it's not $600. It's only $599 (professional) and $449 (standard).


Professional costs that much because it includes the Math Kernel Library (BLAS, etc), Threading Building Blocks (a runtime library with support for easy code threading), and the Intel Integrated Performance Primitives library, which has optimized code for image, sound and video processing, including encoding/decoding support for MPEG, MPEG-2, MP3, H.264, and as of the latest version HD profiles for those. Oh yeah, JPEG baseline and extended for 12-bit images (DICOM, anyone?), JPEG 2000, wavelet transforms, FFT (faster than FFTW), FIR, IIR, Huffman, VLC, RLE, gzip, bzip2, etc, etc in the same package, all heavily optimized for various CPUs from the plain Pentium to the latest Core 2 and threaded on top of that. But some people like to reinvent the wheel, I guess.

Anyway, just the compiler costs $449 which is not all that much if you own a software development company or even if you work alone and plan to sell your programs.

Moreover, there is reduced pricing for students and academic institutions and a free non-commercial license for Linux which is IMO pretty cool.

rugxulo wrote:
(Too bad Linux only had 9.0 for non-commercial use last I checked: supposedly pretty broken AMD / SSE support for that without third-party hacks.) But yeah, you can try an evaluation version ONE TIME (non-renewable) if you think $449 makes it a thousand times less braindead than GCC4.


Linux should have 10.x available too, check it out. And of course you can try the evaluation version as many times as you want. Just use a different email address when you request an evaluation. Wink

rugxulo wrote:
And even that's ignoring all of the following (Win32 is your forte, no?): Digital Mars, LCC-Win32, Pelles C, OpenWatcom, Cygwin, CC386, BCC5, Turbo C++ Explorer, TinyC, etc. (no evaluation required, heh).


Yes, I mostly work in Windows, and all those compilers most likely can't even match MSVC 2005's code performance.

rugxulo wrote:
But if they are super slow and require super duper resources just to run, maybe they need to recompile themselves!! Laughing (so true!)


The Intel C/C++ compiler is compiled with (a slightly older version of) itself. It is slower, but that is because it performs much better code analysis, resulting in better code performance.

rugxulo wrote:
I don't think I'll be learning more from a compiler than a forum full of skilled assembly programmers (Dex4u, Tomasz, vid, revolution, crc, ronware, Madis, MCD, LocoDelAssembly, DJ Mauretto, asmfan, ssp, ATV, f0dder, etc).


Suit yourself. But I think the compiler writers might have some tricks up their sleeves too. It doesn't hurt to check.

rugxulo wrote:
In what way? Which ones are "better" now that sucked before?


In terms of code performance (not size or compatibility): MSVC 2005 is much better than MSVC 2003, and especially better than MSVC 6.0. GCC4 is better than GCC3, which is better than GCC 2.95. Codeplay VectorC is probably better than all of them (I haven't tested it recently; it was a bit unstable), and finally there is Intel C/C++, which is IMO the best at the moment when it comes to x86 and x64.

Feel free to post any short piece of C/C++ code and we can compare the output of various compilers for that code. I can chip in with Intel 10.0.026 and MSVC 2005 output if you want, and others are free to use whatever they want. Finally, we can write it by hand in assembler to compare with machine-generated code, but it would be necessary to state the time taken for such work so we can judge by comparing the performance/effort ratio.
Post 07 Aug 2007, 20:36
levicki



Joined: 29 Jul 2007
Posts: 26
Location: Belgrade, Serbia
levicki
vid wrote:
Could you provide me a link to the "Intel Developer Services Writer's Guide"? I failed to find it anywhere.


I can't find it either, but I have a (probably a bit outdated) copy. Here you go.


Description: Intel Developer Services Writer's Guide 2.0
Download
Filename: IDS Writer's Guide 2.0.zip
Filesize: 238.78 KB
Downloaded: 490 Time(s)

Post 07 Aug 2007, 21:43
MichaelH



Joined: 03 May 2005
Posts: 402
MichaelH
Vid, can you split off this debate about the value of the Intel C/C++ compiler? I'm very interested in levicki and Xorpd! continuing any discussion about memory optimisation, but I also would like to get to the bottom of why mov eax, 0 beat xor eax, eax on newer processors, as shown by the tests you posted earlier.
Post 07 Aug 2007, 22:48
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd!
levicki wrote:

- CPU is a Core 2 Duo E6300.
- It compares Intel's memcpy() with yours in a real application, inside a 3D Gaussian blur function. memcpy() was used for copying processed rows back into the source, because the operation has to be in-place and you can't write the previous two rows back until you finish the third one or you will get wrong results for the third row.
- That means I haven't used your test driver, so I do not have a clock table, but I measured the time spent in your memcpy() and in Intel's, and it was about the same or a bit worse.

Ummm, where do I start with this? So instead of testing my code, you wrote code that you said was equivalent to my code when in reality you aren't competent at x86 optimization as seen by your failure to distinguish between the garbage generated by Intel's compiler and good code. OK, then you tested it in a context where the alignment isn't obvious to the readers of your post, and alignment is critical. If you look at the memcpy.txt file on my website, you can see that even for the worst cases of alignment it takes under 1200 clocks to copy 8000 bytes. Using string moves, the best cases of alignment took 1100 clocks and the worst cases took about 13000, an order of magnitude performance ratio even without subtracting the 64 clocks of rdtsc latency or applying the further code improvements alluded to in my previous post.

Then the data size... for small data sizes, the improvements envisioned might be more significant because the lion's share of the improvement lies in loop setup overhead. For larger datasets, the performance would be limited by throughput to L2 cache and beyond, diluting the effect of all the fancy optimizations.

And of course we don't have any quantitative data, we just have to take your word for it that my code 'was a bit worse' than Intel's. Permit me to take your public assessment of my work as an insult, rather than an objective and quantitative criticism. Why, if you had even provided some cycle counts, at least the reader could have made some kind of guess as to the alignment and size of your data sets, but we don't even get that.
levicki wrote:

For me it produces quite decent code. Are you saying you tried and compared your own results? Can we see some then?

Ah, how about my memcpy code, with a test open to public scrutiny, instead of just giving us a rumor about it? And what about my oft maligned Mandelbrot code? Nobody seems to have run it even though it's been around all year. It does a little under 2.0e9 iterations per second which I think compares quite favorably with the slightly over 7.0e8 attained by the Kummer et al code or quickman, which both, BTW, use assembly language optimizations. You are free to insert any unwholesome C code you desire into one of those benchmarks and let us know how much improvement you get over my results.
levicki wrote:

The snippet was inspired by real-life code. If you had read the article carefully you would have noticed that the compiler was able to vectorize the code when the data was signed short. It wasn't the fault of the compiler that the x86 SIMD instruction set lacked proper instructions to support an unsigned transformation of the same code.

So, what I have done was just a work-around for that limitation, nothing more. They have probably included that trick in the next compiler version given that the code I submitted with the article is now their property.

I must not have read the article very carefully because I thought the abortion listed on the bottom of the first page was Intel's pitiful attempt at vectorization of the signed short version. Just load/pminsw/pmaxsw/store/update pointer/jcc = 6 instructions would be good code. I guess I must have misinterpreted what that code did, then. Wish you could have written the article more clearly in this regard.
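
To spell out what I mean by six instructions, here is a sketch of the kind of inner loop I have in mind for the signed short case (an in-place clamp, using the usual negative-index trick; the register assignments are only for illustration):

Code:
    ; esi -> end of buffer, ecx = -(byte count)
    ; xmm6 = lower bound, xmm7 = upper bound, each broadcast to 8 words
  clamp_loop:
    movdqa  xmm0, [esi+ecx] ; load eight signed shorts
    pmaxsw  xmm0, xmm6      ; clamp from below
    pminsw  xmm0, xmm7      ; clamp from above
    movdqa  [esi+ecx], xmm0 ; store in place
    add     ecx, 16         ; single index update
    jnz     clamp_loop      ; conditional jump closes the loop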

Some of the more horrible examples of Intel's poor SSE code generation I have are proprietary, but if you take any even slightly complex example of optimized SSE3 code and try to get Intel's compiler to spit it out, you will see what I mean.
Post 08 Aug 2007, 00:54
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Xorpd!, why do you create objects instead of PE64 executables directly?
Post 08 Aug 2007, 01:23
levicki



Joined: 29 Jul 2007
Posts: 26
Location: Belgrade, Serbia
levicki
Xorpd! wrote:
So instead of testing my code, you wrote code that you said was equivalent to my code...


Where have I said that I wrote code equivalent to yours?!?

I just said that I changed your memcpy() code to 32-bit and called it from within a real-world application.

Xorpd! wrote:
OK, then you tested it in a context where the alignment isn't obvious to the readers of your post, and alignment is critical.


If I said that the function which calls your memcpy() performs a 3D Gaussian blur, isn't it obvious that all voxels in a cube cannot be aligned?

Xorpd! wrote:
If you look at the memcpy.txt file on my website, you can see that even for the worst cases of alignment it takes under 1200 clocks to copy 8000 bytes.


I suppose that means it should be faster if the data is not aligned? That condition was fulfilled most of the time, since the data copied was a variable number of bytes starting at various offsets within said cube.

Xorpd! wrote:
And of course we don't have any quantitative data, we just have to take your word for it that my code 'was a bit worse' than Intel's.


On the contrary. You are completely free to download the trial Intel compiler and write a test to your liking where you compare your own and Intel's memcpy() code, so you can see the quantitative data for yourself, because I am too lazy to write that test for you and to convert your code to 32-bit again, since I haven't kept it around.

Xorpd! wrote:
Permit me to take your public assessment of my work as an insult, rather than an objective and quantitative criticism.


I already said that I am not trying to judge either you or your work. On the other hand you keep saying that I am not competent at x86 optimization without proving it in any way. I would rather let others judge both of us.

Xorpd! wrote:
And what about my oft maligned Mandelbrot code? Nobody seems to have run it even though it's been around all year.


Has it perhaps crossed your superior mind that people do not run the x64 version of Windows that much? I would be the first one to try it if you provided a 32-bit version. One adaptation of your 64-bit code was enough unpaid work for me.

Xorpd! wrote:
I must not have read the article very carefully because I thought the abortion listed on the bottom of the first page was Intel's pitiful attempt at vectorization of the signed short version.


That was generated code. Is calling it "abortion" an insult meant for me or for the compiler?

Frankly, I do not understand why you are so aggressively offensive towards compiler-generated code, and towards me as well.

The amount of hate you are showing certainly isn't good for your health and, what is worse, it shows how immature you really are. You must think of yourself as some kind of assembler god or something.

Xorpd! wrote:
Just load/pminsw/pmaxsw/store/update pointer/jcc = 6 instructions would be good code.


How can you claim that some other code would be better if you haven't measured the performance of this one and compared?

Do you take into account how various CPUs execute the instructions you would use instead of those the compiler used?

Have you perhaps taken into account instruction scheduling in the whole executable (which you haven't seen) and deemed that the compiler has chosen the wrong instructions?

Sorry, but that is simply too many assumptions for me to take any further claims from you seriously.

Xorpd! wrote:
but if you take any even slightly complex example of optimized SSE3 code and try to get Intel's compiler to spit it out, you will see what I mean.


I cannot take SSE3 code and make any compiler spit out better SSE3 code because compilers do not optimize assembler listings. That is what Dalsoft x86 Optimizer does. On the other hand if you give me some C/C++ code then I can try.
Post 08 Aug 2007, 02:18
Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Kuemmel
Xorpd! wrote:
Ah, how about my memcpy code, with a test open to public scrutiny, instead of just giving us a rumor about it? And what about my oft maligned Mandelbrot code? Nobody seems to have run it even though it's been around all year. It does a little under 2.0e9 iterations per second which I think compares quite favorably with the slightly over 7.0e8 attained by the Kummer et al code or quickman, which both, BTW, use assembly language optimizations. You are free to insert any unwholesome C code you desire into one of those benchmarks and let us know how much improvement you get over my results.

As Levicki said, your adaptation of the Mandelbrot benchmark is 64-bit code and hard to compare to the 32-bit one, as the additional available registers may give you almost double the speed. I had a talk with the author of Quickman (which is also about twice as fast per core as KM 0.53, but lacks multi-core support). He said that when he includes multi-core support, his code will slow down. And the problem for me was also the lack of a 64-bit OS to test your code.

Anyway, your optimizations were very impressive, and for me that just shows the power of assembly language and a very good coder compared to a compiler. I doubt that even Intel's compiler would think of nicely implementing multi-core support and good SSE2 code, and these two things are essential for fast Mandelbrot code. I think I remember a website where someone tried to implement a similar approach in C and got less than half the iterations of the original KM 0.53 version.

So in my opinion any kind of code (readable or not) is justified within such time-critical algorithms, but of course commenting it, even just for yourself, makes a *lot* of sense. I remember looking at my old code and asking myself WTF I had done there... Wink
Post 08 Aug 2007, 18:06
mandeep



Joined: 11 Aug 2007
Posts: 6
mandeep
Hello Mr. levicki!

I read a few of your posts on this board, and they were an eye-opener for me (to say the least)! I am amazed at the compiler's capabilities and excited to quit assembly altogether! Now only one doubt remains. I agree with, and at the same time wonder at, the improved compiler technologies. I came to understand that they can be considered best (or thereabouts) for 90% (maybe more) of programming tasks. But is it really justified to dig into assembly language just for those rare cases where a compiler-generated solution is not good enough (maybe it is not bad either)? I want your advice on this: can we just rely on compilers and stop using assembly for good, and leave the task of learning assembly to the compiler writers and OS writers (who cannot avoid it) only?

Your advice and comments will be highly appreciated.
Thanks for your earlier posts.

_________________
Mandeep Singh
Post 11 Aug 2007, 09:42
sleepsleep



Joined: 05 Oct 2006
Posts: 8897
sleepsleep
/me hobby programmer.

Just my opinion: until the computer/system/PC is "creative" enough to draw a picture when given only a title, imho, a compiler will never beat a human, because the best/main part of us is "creativity".
Post 11 Aug 2007, 12:53
mandeep



Joined: 11 Aug 2007
Posts: 6
mandeep
Look, nobody can deny the marvels of the human mind, and no matter how elegant the compiler is, it is written by humans. So, guys, do not flatter yourselves just by proving that you are more intelligent than a compiler written by some fellow humans. It will take you nowhere; it is just personal satisfaction. OK, that may work as an art or a hobby, but for real practical things, ummm...

PS: I do not want to start a debate here. I just wanted friendly advice from an expert (a been-there, done-it-all guy).

_________________
Mandeep Singh
Post 11 Aug 2007, 13:01
FrozenKnight



Joined: 24 Jun 2005
Posts: 128
FrozenKnight
mandeep wrote:
can we just rely on compilers and stop using assembly for good, and leave the task of learning assembly to the compiler writers and OS writers (who cannot avoid it) only?

Your advice and comments will be highly appreciated.
Thanks for your earlier posts.


Well, I have a couple of things to say regarding this. First, compiler writers have to know this stuff well, as do OS writers; for that reason we need a good base of ASM coders. Second, ASM is used in more than OSes and compilers; it is also used in ICs. Third, some programmers actually find it easier to program in ASM than in C, C++, VB, Basic, Cobol, etc. -- like myself. Yes, I know ASM is a harder language, but I don't like not knowing what my computer is doing, and when I write C and C++ code I feel as though I have passed part of my control over the computer to the compiler. When I need really big speed improvements I compile small code segments in C, examine them with a debugger, then optimize them further (a tedious task, but often worth it, especially when your program executes tasks faster than a human can click, so everything happens instantly; it makes me feel as though my apps are more professional than the real professional apps).

Besides, I have come to like the ease fasm gives when compiling. I hate having to make a project in VC++ just to compile a simple source file, and I'm not entirely fond of the command-line options in G++ either.
Post 11 Aug 2007, 20:32
mandeep



Joined: 11 Aug 2007
Posts: 6
mandeep
Hello FrozenKnight!
Look, liking it is okay. It is not a problem if a person prefers to do things this way or that way. You want to program in asm because you like it; that is okay and respectable in its own way. But I am arguing about the practical aspects of not knowing or using asm for programming at all, and of just leaving it to those who have to use it for practical reasons.

PS: I am also waiting for levicki's comments and advice.

_________________
Mandeep Singh
Post 12 Aug 2007, 08:48



Copyright © 1999-2020, Tomasz Grysztar.

Powered by rwasa.