flat assembler
Message board for the users of flat assembler.

Clear top byte of 64bit GPR?

Azu



Joined: 16 Dec 2008
Posts: 1159
"2 loops"


It's more accurate in the tighter loops (e.g. 1000000).


Look, just benchmark them yourself if you want; I'm getting tired of explaining my results to you. I posted the 3 versions I tested, so just plug them into your own test. And please say which CPU you used.
Post 30 Aug 2009, 15:02
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Code:
1000000 loops
181ms empty32bitFunc
222ms empty64bitFunc
305ms cmps way (64bit)
226ms cmp way (64bit)
371ms Borsuc's way (64bit)    
These are from yours, of course -- it's not even 2 times faster in tighter loops. Can you give me the code to test it all in one piece?

And sorry, I didn't intend to annoy you.

_________________
Previously known as The_Grey_Beast
Post 30 Aug 2009, 15:05
Azu



Joined: 16 Dec 2008
Posts: 1159
Borsuc wrote:
Code:
1000000 loops
181ms empty32bitFunc
222ms empty64bitFunc
305ms cmps way (64bit)
226ms cmp way (64bit)
371ms Borsuc's way (64bit)    
These are from yours, of course -- it's not even 2 times faster in tighter loops


Subtract empty64bitFunc from all the 64bit results. :P
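
For reference, a minimal Python sketch of that subtraction, using only the numbers from the quoted table. The helper name `subtract_baseline` is made up for illustration. The sketch also assumes that empty64bitFunc captures exactly the overhead shared by all three versions, which is the very point under dispute in this thread.

```python
# Timings copied from the quoted table (1000000 loops, milliseconds).
timings_ms = {
    "cmps way (64bit)": 305,
    "cmp way (64bit)": 226,
    "Borsuc's way (64bit)": 371,
}
baseline_ms = 222  # empty64bitFunc


def subtract_baseline(timings, baseline):
    """Return each timing minus the empty-function baseline (hypothetical helper)."""
    return {name: ms - baseline for name, ms in timings.items()}


net_ms = subtract_baseline(timings_ms, baseline_ms)
for name, ms in net_ms.items():
    print(f"{ms}ms {name}")
```

With the baseline removed, the three versions come out at 83ms, 4ms and 149ms respectively, which is a far wider spread than the raw totals suggest.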

Borsuc wrote:
Can you give me the code to test it all in one piece?

And sorry, I didn't intend to annoy you.

Here you go. Be warned, though: it's an ugly, hacked-up piece of confusion. Also, I didn't really make it, just heavily edited it, and I don't remember who I originally got it from, sorry.


P.S. It only works on 64-bit Windows.


Attachment: num2dec.asm (39.03 KB, downloaded 58 times)

Post 30 Aug 2009, 15:15
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Oh damn, I don't have 64-bit.
Anyway, thanks for telling me I have to subtract the empty func.
I don't think it's accurate to do that, though, because the processor, like I said, unrolls loops and optimizes them. Of course, by subtracting that, your method would be at an advantage, because the majority of the time is spent in the "measuring loop" (in that case).

But in real-life scenarios you will have a loop anyway, and it's not that much faster with a loop. (If you don't have a loop, why do you care about speed? I would choose size any day if it's not in a tight loop :P)

Measuring clocks is harder these days due to processor micro-ops, unrolling, and other advanced capabilities :(

_________________
Previously known as The_Grey_Beast
Post 30 Aug 2009, 15:23
Azu



Joined: 16 Dec 2008
Posts: 1159
Well, normally it would be inline, not a function, so what is the advantage? ;/
I think removing function call overhead is completely fair in this test.
Post 30 Aug 2009, 15:25
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
No, I mean the loop itself; the 222ms is because of the loop (a call doesn't have such an overhead :D).

IMO the speed should be tested in real conditions -- in this case, WITH the loop, because after all, you would use a loop in real conditions. (Again, if there's no loop, why care about a minuscule speed increase compared to size, which can be measured much more easily without a loop? :P) If you subtract the EmptyFunc, you are (wrongly, because CPUs are complex these days) subtracting the loop overhead itself, but that overhead is present in real-life situations!

But anyway, I'm just giving my 2 cents, of course; I'm not forcing you to use my style. Have fun :)

_________________
Previously known as The_Grey_Beast
Post 30 Aug 2009, 15:32
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17662
Location: In your JS exploiting you and your system
Borsuc: I am glad to see that you are "seeing the light" with regard to these benchmark tests.

Yes, folks, all these synthetic benchmarks are not so useful (that means useless), because real programs have all sorts of other shit happening that cannot possibly be accounted for by a limited synthetic test on any modern x86 CPU.

The same old adage applies as usual, get it working then get it fast. Benchmark with your real program handling real data in real time, not some synthetic mock-up. The results will be much more rewarding.
Post 30 Aug 2009, 15:33
Azu



Joined: 16 Dec 2008
Posts: 1159
Borsuc wrote:
No, I mean the loop itself; the 222ms is because of the loop (a call doesn't have such an overhead :D).

IMO the speed should be tested in real conditions -- in this case, WITH the loop, because after all, you would use a loop in real conditions. (Again, if there's no loop, why care about a minuscule speed increase compared to size, which can be measured much more easily without a loop? :P) If you subtract the EmptyFunc, you are (wrongly, because CPUs are complex these days) subtracting the loop overhead itself, but that overhead is present in real-life situations!

But anyway, I'm just giving my 2 cents, of course; I'm not forcing you to use my style. Have fun :)
??
If you aren't even going to look at the code, and don't even have a 64-bit CPU to test it on, why do you make assumptions about how it works? :?

The overhead is from the intersegment call to, and return from, 64-bit mode. The loop takes place outside of the clock counting. Only the milliseconds test at the end is outside the loop.



revolution wrote:
Borsuc: I am glad to see that you are "seeing the light" with regard to these benchmark tests.

Yes, folks, all these synthetic benchmarks are not so useful (that means useless), because real programs have all sorts of other shit happening that cannot possibly be accounted for by a limited synthetic test on any modern x86 CPU.

The same old adage applies as usual, get it working then get it fast. Benchmark with your real program handling real data in real time, not some synthetic mock-up. The results will be much more rewarding.
Yes, and if you take all of that other shit into account in your testing, you aren't seeing which version of something is faster than the other, since unless it makes up most of the program, it is going to be outweighed by the rest of the program. =/
Post 30 Aug 2009, 15:38
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17662
Location: In your JS exploiting you and your system
Azu wrote:
Yes, and if you take all of that other shit into account in your testing, you aren't seeing which version of something is faster than the other, since unless it makes up most of the program, it is going to be outweighed by the rest of the program. =/
I fully agree, and this strengthens my point very well. If you can't see any significant benefit from optimising with the real program, then don't waste your time doing it. Borsuc's note about loops is 100% spot on: only frequently executed loops need to be optimised. If in a real program you don't have anything that is done at least a few trillion times, then don't bother to optimise it.
Post 30 Aug 2009, 15:51
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
No, I'm talking about "profile_loop". The loop itself has overhead, not just the call.

But you see, the issue here is that in real code, the code is gonna look like that -- if the loop + call overhead takes 90% of the time, the difference between the two versions would be very small, because it is not the versions themselves that take up that time!

Suppose, for example, that the first version takes 1 clock, the latter 10, but the "common code" takes 500. Then the difference is not 10 to 1, because the CPU isn't "isolated" from the external factors. It would be 510/501 (about 1.8% slower), which is very small indeed.
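
A quick Python sketch of that arithmetic, for anyone who wants to plug in other numbers. The function name `observed_ratio` and its parameters are made up for illustration. The sketch assumes the common cost simply adds to each version's own cost, which real CPUs do not guarantee.

```python
def observed_ratio(fast_clocks, slow_clocks, common_clocks):
    """Ratio of total times when both versions pay the same common cost.

    Hypothetical helper: assumes costs add up linearly.
    """
    return (slow_clocks + common_clocks) / (fast_clocks + common_clocks)


isolated = observed_ratio(1, 10, 0)        # versions measured on their own
in_context = observed_ratio(1, 10, 500)    # 510/501, with the common code
print(f"isolated: {isolated:.2f}x, with common code: {in_context:.3f}x")
```

Shrinking `common_clocks` moves the ratio back toward 10 to 1, so how much the choice matters depends entirely on how heavy the surrounding code is.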


Normally, I am an optimizing freak and do care about speed. That is, if the 501 version were superior in all aspects, then I would choose it, but it isn't, because of what I outlined (i.e. size, expandability, elegance, etc.).

I'm sorry if I wasn't clear previously; I hope you understand what I mean now. :)

_________________
Previously known as The_Grey_Beast
Post 30 Aug 2009, 15:54
Azu



Joined: 16 Dec 2008
Posts: 1159
I think I get what you're saying now.

But that is just because the loop is very unoptimized, no?

The performance difference between these three kinds of coding is still undeniable. If you only make a small amount of code and put it in a very slow loop, then yes, the loop takes up most of the time. But if you make a faster loop, or need a lot more of this code, then it becomes more important. And anyway, it really isn't the loop taking most of the time here; it is the far call and far return. :P
Post 30 Aug 2009, 15:58

Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.
