flat assembler
Message board for the users of flat assembler.
Clear top byte of 64bit GPR?
Borsuc 30 Aug 2009, 15:05
Code:
1000000 loops
181ms empty32bitFunc
222ms empty64bitFunc
305ms cmps way (64bit)
226ms cmp way (64bit)
371ms Borsuc's way (64bit)

And sorry, I didn't intend to annoy you.
_________________
Previously known as The_Grey_Beast
Azu 30 Aug 2009, 15:15
Borsuc wrote:
Subtract empty64bitFunc from all the 64bit results.

Borsuc wrote: Can you give me the code to test it all in one piece?

Here you go. Be warned, though: it's an ugly, hacked-up piece of confusion. Also, I didn't really write it, I just heavily edited it, and I don't remember who I originally got it from, sorry.

P.S. It only works on 64-bit Windows.
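As a rough illustration of the subtraction suggested above, here is a minimal Python sketch applied to the timings posted earlier in the thread. The names `timings_ms`, `baseline_ms` and `subtract_baseline` are made up for this example and are not part of the benchmark code.

```python
# Timings in milliseconds for 1,000,000 loops, copied from the results above.
timings_ms = {
    "cmps way (64bit)": 305,
    "cmp way (64bit)": 226,
    "Borsuc's way (64bit)": 371,
}
baseline_ms = 222  # empty64bitFunc: the cost of calling a function that does nothing


def subtract_baseline(timings, baseline):
    """Return each timing with the empty-function baseline removed."""
    return {name: ms - baseline for name, ms in timings.items()}


net_ms = subtract_baseline(timings_ms, baseline_ms)
for name, ms in net_ms.items():
    print(f"{ms:>4}ms {name}")
```

With the baseline removed, the three variants come out at 83ms, 4ms and 149ms. Whether that subtraction is a fair way to read the numbers is what the rest of the thread argues about.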
Borsuc 30 Aug 2009, 15:23
Oh damn, I don't have 64-bit.

Anyway, thanks for telling me I have to subtract the empty func. I don't think it's accurate to do that, though, because the processor, like I said, unrolls loops and optimizes them. Of course, by subtracting that, your method would be at an advantage, because the majority of the time is spent in the "measuring loop" (in that case). But in real-life scenarios you will have a loop anyway, and it's not that much faster with a loop. (If you don't have a loop, why do you care about speed? I would choose size any day if it's not in a tight loop.)

Measuring clocks is harder these days due to processor micro-ops, unrolling, and other advanced capabilities.
Azu 30 Aug 2009, 15:25
Well, normally it would be inline, not a function, so what is the advantage? ;/
I think removing function call overhead is completely fair in this test.
Borsuc 30 Aug 2009, 15:32
No, I mean the loop itself. The 222ms is because of the loop (the call doesn't have that much overhead).

The speed should, IMO, be tested in real conditions. In this case that means WITH the loop, because after all, you would use a loop in real conditions. (Again, if there is no loop, why care about a minuscule speed increase compared to size, which can be measured much more easily without a loop?) If you subtract the EmptyFunc, you are (wrongly, because CPUs are complex these days) subtracting the loop overhead itself, but that overhead is present in real-life situations!

But anyway, I'm just giving my 2 cents, of course. I'm not forcing you to use my style. Have fun.
revolution 30 Aug 2009, 15:33
Borsuc: I am glad to see that you are "seeing the light" with regard to these benchmark tests.
Yes, folks, all these synthetic benchmarks are not so useful (that means useless), because real programs have all sorts of other shit happening that cannot possibly be accounted for by a limited synthetic test on any modern x86 CPU. The same old adage applies as usual: get it working, then get it fast. Benchmark with your real program handling real data in real time, not some synthetic mock-up. The results will be much more rewarding.
Azu 30 Aug 2009, 15:38
Borsuc wrote: No I mean the loop itself, 222ms is b/c of the loop (call doesn't have such an overhead ).

If you aren't even going to look at the code, and don't even have a 64-bit CPU to test it on, why do you make assumptions about how it works? The overhead is from the intersegment call to, and return from, 64-bit mode. The loop takes place outside of the clock counting. Only the milliseconds test at the end is outside the loop.

revolution wrote: Borsuc: I am glad to see that you are "seeing the light" with regard to these benchmark tests.
revolution 30 Aug 2009, 15:51
Azu wrote: Yes, and if you take all of that other shit into account in your testing, you aren't seeing which version of something is faster than the other, since unless it makes up most of the program, it is going to be outweighed by the rest of the program. =/
Borsuc 30 Aug 2009, 15:54
No, I'm talking about "profile_loop". The loop itself has overhead, not just the call.

But you see, the issue here is that real code is going to look like that. If the loop + call overhead takes 90% of the time, the difference between the two versions will be very small, because they are not what takes up that time. Suppose, for example, that the first version takes 1 clock and the second takes 10, but the "common code" takes 500. Then the difference is not 10 to 1, because the CPU isn't "isolated" from the external factors. It would be 510/501, which is very small indeed.

Normally, I am an optimizing freak and do care about speed. That is, if the 501 version were superior in all aspects, then I would choose it, but it isn't, because of what I outlined (i.e. size, expandability, elegance, etc.). I'm sorry if I didn't make myself clear previously. I hope you understand what I mean now.
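The arithmetic in this post can be checked with a short Python sketch. The clock counts are the hypothetical figures from the post (1, 10 and 500), not measurements, and `observed_ratio` is a name invented for this example.

```python
def observed_ratio(slow_clocks, fast_clocks, common_clocks):
    """Ratio of total run times when both versions pay the same fixed overhead."""
    return (slow_clocks + common_clocks) / (fast_clocks + common_clocks)


# Measured in isolation, with no shared overhead: 10 clocks versus 1 clock.
isolated = observed_ratio(10, 1, 0)

# Measured with 500 clocks of common code around each version: 510 versus 501.
in_context = observed_ratio(10, 1, 500)

print(f"isolated: {isolated:.2f}x, in context: {in_context:.3f}x")
```

Under these assumed numbers the slower version goes from 10 times slower in isolation to under 2% slower once the common code is counted.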
Azu 30 Aug 2009, 15:58
I think I get what you're saying now.

But that is just because the loop is very unoptimized, no? The performance difference between these three kinds of coding is still undeniable. If you only write a small amount of code and put it in a very slow loop, then yes, the loop takes up most of the time. But if you write a faster loop, or need a lot more of this code, then it becomes more important. And anyway, it really isn't the loop taking most of the time here, it is the far call and far return.
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.