flat assembler
Message board for the users of flat assembler.
l_inc 21 Oct 2017, 23:48
Somewhat different results on P4:
Code:
l_inc.0:    19813978 usec
Furs.0:     12301931 usec
SergeASM.0: 12064554 usec
Furs.1:     11127560 usec
_shura.0:   11972406 usec

_________________
Faith is a superposition of knowledge and fallacy
revolution 22 Oct 2017, 01:56
All I see is a bunch of useless simulated tests. These will have almost no correlation with what happens in the final application.
l_inc 22 Oct 2017, 10:41
revolution
Would you give any quantification for "almost no correlation", or is this just a useless blind guess?

_________________
Faith is a superposition of knowledge and fallacy
revolution 22 Oct 2017, 12:38
l_inc wrote:
revolution

Even if we ignore all of the other factors, even just changing the CPU will change the results. And even if we keep the CPU constant, all the other factors related to the surrounding code can make a large difference in the results. There are also times where the same code run in different circumstances can show variances in timings. So any simulated tests will show us nothing of value. We need to run whatever tests in the final code.

As for this code, we have no idea of the system it will run on, so we have no idea how it will perform in the final application. Maybe a lookup table will be best in one circumstance, and a computed result will be best in another; perhaps a branchless code path is best in some cases. Maybe the real application has a limited number of input values and we don't even need code that works for any arbitrary value. Maybe the code only runs at startup and the timing difference is meaningless. Maybe the code only makes a difference of 0.01% and it would be best to put more time into some other bottleneck.

The code sequence is very short and deals with unknown data inputs, running on an unknown system, within an unknown application, in an unknown OS, so any correlation you might get would at best be a lucky coincidence.
l_inc 22 Oct 2017, 13:19
revolution
Quote:
So any simulated tests will show us nothing of value.

The value of initial simulated tests is not to show how the final application would perform. The value is to provide a notion of what should be taken into account while writing the final application, even before the final application is ready for testing. Large variation in simulated tests on different machines is already a valuable result.

I personally, of course, do not have sufficient information in this particular case to estimate whether the optimization attempt here is of any significant value, let alone to simulate representative microarchitectural conditions, but I can give a push for applying educated testing.

_________________
Faith is a superposition of knowledge and fallacy
Furs 22 Oct 2017, 15:38
Sorry, but I agree with revolution. Your test uses a counter that increases sequentially as the value. This means that "two branches" doesn't have much penalty, because for most values the branch outcomes will be the same 15 times in a row.

It's a simulated case where the predictor has a very easy job. That's the problem. Try a random sequence of numbers in al instead (precompute them in memory so the overhead of random generation is not included, but "touch" the memory before the first test to make sure it's in the cache).

BTW, if it comes to that, don't use 'xor ah, ah'; use 'movzx eax, al' instead. It's probably faster, on AMD at least (I don't know whether Intel still splits the register with zero penalties). Or even better, 'movzx eax, bl', since in real code the value will likely come from another register or a memory operand anyway; that makes the test more realistic.
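A rough sketch of that setup, in case it helps; the buffer name, its size and the generator constants are placeholders of mine, not taken from any of the listings in this thread:

Code:
; assumed harness fragment: pregenerate pseudo-random test bytes and warm the cache,
; so neither the generator nor a cold cache ends up inside the timed loop
RANDOM_COUNT = 10000h                   ; placeholder number of pregenerated test bytes

        mov     esi, random_bytes
        mov     ecx, RANDOM_COUNT
        mov     eax, 2545F491h          ; arbitrary seed
fill_random:
        imul    eax, eax, 69069         ; cheap LCG step: x = x*69069 + 1
        inc     eax
        mov     [esi], al
        inc     esi
        dec     ecx
        jnz     fill_random

        ; touch the buffer once before the first measurement,
        ; so no version is penalised for pulling it into the cache
        mov     esi, random_bytes
        mov     ecx, RANDOM_COUNT
        xor     edx, edx
warm_cache:
        movzx   eax, byte [esi]         ; load with movzx, as suggested above
        add     edx, eax
        inc     esi
        dec     ecx
        jnz     warm_cache

        ; the timed loop would then fetch its inputs the same way: movzx eax, byte [esi]

random_bytes rb RANDOM_COUNT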
l_inc 22 Oct 2017, 20:41
Furs
Quote:
This means that "two branches" doesn't have much penalty, because for most values the branch outcomes will be the same 15 times in a row.

You are right. I should have thought about that. But that's not revolution's point to agree with, as he (*) completely discourages any testing outside of the final environment.

Quote:
BTW, if it comes to that, don't use 'xor ah, ah'; use 'movzx eax, al' instead

My preference would be to leave it out completely for the cases that do not depend on it. But you can make your own suggestion in the form of an improved or an alternative listing. My idea was to provide a basis and see whether and how it evolves. movzx eax,al seems to establish a dependency chain for the test with lahf, which makes it significantly slower than with xor ah,ah (or without it).

Quote:
I don't know whether Intel still splits the register with zero penalties

The penalty is supposed to be just an additional micro-op:

Optimization reference manual wrote:
Starting with Intel microarchitecture code name Sandy Bridge and all subsequent generations of Intel Core microarchitecture, partial register access is handled in hardware by inserting a micro-op that merges the partial register with the full register in the following cases...

(*) No, revolution, I don't buy that potential "s" in front.

_________________
Faith is a superposition of knowledge and fallacy
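To picture the case the manual quote is about, here is a two-instruction illustration of my own (not code from any of the listings here): a partial write to ah followed by a read of the full register is exactly where such a merging micro-op gets inserted.

Code:
lahf                    ; writes AH only (a partial write into EAX)
add     edx, eax        ; reads the full EAX, so AH must first be merged back in;
                        ; on Sandy Bridge and later this costs one extra micro-op
                        ; instead of a partial register stall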
_shura 22 Oct 2017, 21:07
1. I do not have Windows, so I cannot check: what about my second approach?
2. Did you check whether the result is correct or not? (One possible exhaustive check is sketched after this list.)
3. Performance tests on algorithms are worthless when they are not tested separately. If you test in the final application, you have too many side effects. What you can do is use the values you have to handle in the final application. But is this really the bottleneck of the application?
4. Setting ah to zero on each run is not fair. When my code does not need it, it is faster than code that does need it. If you cannot assume ah is zero, you have to set it to zero yourself, and then that is part of the code.
5. What about random values and multiple runs?
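For point 2, there are only 256 possible values of al, so every listing can be checked exhaustively against a plain reference before it is timed. A rough sketch; the candidate label and its convention of returning 1 or 0 in dl are assumptions of mine, not taken from any listing in this thread:

Code:
; hypothetical verification harness, not one of the benchmarked listings;
; assumes candidate preserves ebx, ecx and edi
        xor     ebx, ebx                ; bl = current test value (0..255)
        xor     edi, edi                ; edi = number of mismatches found
verify_next:
        ; reference answer into cl: 1 if neither nibble of bl is 1111b, else 0
        mov     al, bl
        mov     ah, bl
        shr     ah, 4                   ; ah = high nibble
        and     al, 0Fh                 ; al = low nibble
        xor     cl, cl
        cmp     al, 0Fh
        je      reference_done          ; low nibble has no zero bit -> answer 0
        cmp     ah, 0Fh
        je      reference_done          ; high nibble has no zero bit -> answer 0
        mov     cl, 1
reference_done:
        mov     al, bl
        call    candidate               ; hypothetical: returns 1 or 0 in dl for the value in al
        cmp     dl, cl
        setne   al
        movzx   eax, al
        add     edi, eax                ; count every disagreement with the reference
        inc     bl
        jnz     verify_next             ; bl wraps to 0 after 255, so all 256 values are covered
        ; here edi = 0 means the candidate matched the reference for every possible input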
revolution 23 Oct 2017, 00:24
_shura wrote:
3. Performance tests on algorithms are worthless when they are not tested separately. If you test in the final application, you have too many side effects. What you can do is use the values you have to handle in the final application. But is this really the bottleneck of the application?

Also, testing in the final code allows one to properly focus upon the real points of performance problems. This can only realistically be done when everything is interacting together and you can measure where the CPU spends most of its time.

Edit: It is easy to trick oneself by testing some small portion of code in isolation and determining that version A is faster than B. Then you insert A into the final code, thinking to yourself that A is the best possible result. But later, when testing the final code, it could turn out that version B is better when all the other surrounding code is also active. So all that time spent testing each part in isolation was wasted.
_shura 23 Oct 2017, 18:06
The question was how to quickly recognize whether AL has a zero bit in each nibble or not. It is like asking which sorting algorithm is the fastest: it is worthless to answer that question by testing the whole game, where you have to wait for user input and so on. If you want to optimise your software, optimising one part of it may be a good idea, but as I asked: is this really the bottleneck of the application?
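For reference, one plain way to express that condition in a handful of instructions; this is only a sketch of the question itself, not a claim that it is faster than any listing already posted. It assumes the byte to test is in al and leaves the answer in cl:

Code:
; sketch only: cl becomes 1 exactly when each nibble of al contains at least one zero bit,
; i.e. when neither nibble equals 1111b
        not     al                      ; a set bit now marks a zero bit of the original value
        test    al, 0Fh
        setnz   cl                      ; 1 if the low nibble of the original is not 1111b
        test    al, 0F0h
        setnz   ch                      ; 1 if the high nibble of the original is not 1111b
        and     cl, ch                  ; cl = 1 only if both nibbles pass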
Copyright © 1999-2023, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.