flat assembler
Message board for the users of flat assembler.

A question on performance

system error



Joined: 01 Sep 2013
Posts: 670
system error 25 Aug 2016, 23:28
OK, I tested this code.

This clocks around 0.05xxx sec:
Code:
mov rcx,10000000  ;10 million
spin:
     mulsd xmm0,xmm1
loop spin    


But when I add one more line, it clocks faster, around 0.04xxx sec
Code:
mov rcx,10000000  ;10 million
spin:
     mulsd xmm0,xmm1
     cvtsd2si rax,xmm0
loop spin    


What's really happening here? AFAIK cvtsd2si is a big instruction, so adding it should make the loop slower than the first version. I am thinking of taking up a new hobby (cooking, painting, ballet classes, piano, mountain climbing) if what I've been reading about performance all this time turns out to be wrong altogether.

I am on Atom.
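
To put those two timings in per-iteration terms, here is a small back-of-the-envelope sketch. The 1.6 GHz clock is purely an assumption for illustration (the actual frequency of this Atom isn't stated), and the truncated "xxx" digits are dropped.

```python
# Convert a wall-clock loop timing into time and cycles per iteration.
# ASSUMPTION: CLOCK_HZ is a placeholder; the post does not state the CPU frequency.
CLOCK_HZ = 1.6e9          # hypothetical 1.6 GHz Atom
ITERATIONS = 10_000_000   # loop count from the snippets above


def per_iteration(seconds, iterations=ITERATIONS, clock_hz=CLOCK_HZ):
    """Return (nanoseconds, cycles) spent per loop iteration."""
    sec_per_iter = seconds / iterations
    return sec_per_iter * 1e9, sec_per_iter * clock_hz


if __name__ == "__main__":
    for label, seconds in (("mulsd only", 0.05), ("mulsd + cvtsd2si", 0.04)):
        ns, cycles = per_iteration(seconds)
        print(f"{label}: {ns:.1f} ns/iter, about {cycles:.1f} cycles/iter")
```

Under that assumed clock, the whole difference between the two loops is under two cycles per iteration.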
system error



Joined: 01 Sep 2013
Posts: 670
system error 25 Aug 2016, 23:43
No, aligning the spin won't help the first version either. You just have to add one more instruction before or after mulsd to make it faster. A simple nop should do, actually. What I don't understand is, WHY?
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
rugxulo 26 Aug 2016, 02:14
(N.B. I know effectively nothing about optimization. It's still interesting but indeed a minefield. Sorry, just giving some general tips. Caveat emptor!)

system error wrote:
I am on Atom.

This tells us nothing. There are too many Atoms. (BTW, I have none.)

Wikipedia mentions "Silverthorne, Diamondville, Pineview, Lincroft, Centerton, Avoton, Silvermont".

While I don't see anything obvious from a quick check of newer GCC docs, IIRC, -march/-mtune "atom" was introduced with GCC 4.5.x (2010). I assume "native" would also work (if self-hosted).

Quote:

this clocks around 0.05xxx sec
...
But when I add one more line, it clocks faster, around 0.04xxx sec


It's very inaccurate to test (or trust) such short-run benchmarks. Anything less than a few seconds is unreliable, IMHO.

Power management can also adversely affect speed.
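
As a rough illustration of that point (a sketch only, not anything measured in this thread): repeat the run several times and look at the minimum, the median and the spread, rather than trusting one short timing. The `workload` function below is a hypothetical stand-in for the code under test, and the helper names are made up for the example.

```python
import statistics
import time


def time_runs(func, repeats=15):
    """Call func() `repeats` times and return each run's wall time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return (min, median, relative spread) for a list of timings."""
    fastest = min(timings)
    median = statistics.median(timings)
    spread = (max(timings) - fastest) / fastest if fastest > 0 else float("inf")
    return fastest, median, spread


def workload():
    # Hypothetical stand-in for the code under test.
    total = 0.0
    for i in range(200_000):
        total += i * 1.0000001
    return total


if __name__ == "__main__":
    fastest, median, spread = summarize(time_runs(workload))
    print(f"min {fastest:.5f} s, median {median:.5f} s, spread {spread:.0%}")
```

If the spread between runs is as large as the difference you are trying to measure, the comparison probably isn't telling you much yet.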

Quote:

What's really happening here? AFAIK cvtsd2si is a big instruction and should be slower than the first version, once added.


fetch, decode, execute ... clocks, hertz ... latency, throughput ... pipelining, pairing, cpu cache(s) size, prefetching, alignment ... multi-core, power management, threads ... FPU + SSE ... removing dead code, unrolling loops, strength reduction ... register spilling, stalls ... static or dynamic branch prediction ... micro ops, macro-op fusion ... WHO THE HECK KNOWS??

(Although I've never bothered, you may need to use Linux "perf stat" or "oprofile" or something commercial like Intel vTune or AMD CodeAnalyst or whatever. I doubt RDTSCP is enough help by itself.)

Quote:

I am thinking of taking up a new hobby (cooking, painting, ballet classes, piano, mountain climbing) if what I've been reading about performance all this time turns out to be wrong altogether.


The only guaranteed optimizations are for size! Smaller is always smaller, no matter which family / model / stepping!

Anyways, it's all a very complex subject. Read Agner Fog's manuals.

To be honest, you should first make sure your program runs correctly (including verifying by test suite), then improve the overall (high-level) algorithm, reduce unnecessary slow disk access, and then worry about micro optimizations at the cpu level. (Often times speed isn't important anyways, it's only executed occasionally. There are other concerns like readability or portability.)

As much as I hate to say it, sometimes hand-written assembly is slower than HLL code. GCC is fairly complicated and smart by now.

(For instance, I still haven't proven it, but I think one silly program of mine is tons slower because it's not disk-buffered behind the scenes automatically. So the C version is always tons faster, even with the oldest and dumbest compilers. But mine is smaller, of course.)

Quote:
No, aligning the spin won't help the first version either. You just have to add one more instruction before or after mulsd to make it faster. A simple nop should do, actually. What I don't understand is, WHY?


AFAIK, "pause" is an alias for "rep nop", meant for spin loops. Though I don't know if that will help much here.
system error



Joined: 01 Sep 2013
Posts: 670
system error 26 Aug 2016, 04:14
Thanks for the great answers rugx, although I don't understand half of what you're saying because I'm a complete noob on optimization.

What puzzles me is that the second snippet runs a lot faster than the first despite the fact that:

1. there's an obvious dependency on XMM0
2. cvtsd2si is a 23-clocker

So just by looking at the code, most people would believe that the first one runs faster than the other (who can blame them?), while it's actually the opposite.

It breaks my heart to know that at some point, all else equal, adding more instructions can actually make your code run faster, contrary to my previous belief. Nobody told me this before -_- Even the honorable Mr Agner Fog never mentions this, or never really puts enough emphasis on this type of optimization technique. I guess he's getting too bald and old to keep up with the latest offerings. hehehe
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20757
Location: In your JS exploiting you and your system
revolution 26 Aug 2016, 07:46
Welcome to the world of optimisation. Optimisation is hard. Really hard. And sometimes counter-intuitive.

But your test above is not representative of real world usage. You can't substitute a contrived benchmark and expect it to relate to actual use in a real app.

My suggestion is to not bother with specialised "performance" loops and similar things, they will tell you nothing useful when you want to apply it to the final product.

Also, testing on only one system tells you nothing about performance on other systems.

I guess all of the above comes across as negative, but really, this type of testing above gives no valuable results in almost all situations.

Your statement above, "adding more instructions can actually make your code run faster", is an example of the result of the internal complexity of contemporary CPUs. But it is not always true, of course; it will only be true in certain specific situations. Once again, contriving a specific example is not too hard, but it won't necessarily still be true when the code is transplanted into another situation. So in conclusion: always do your testing on the real-world application.
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
rugxulo 26 Aug 2016, 19:36
system error wrote:

What puzzles me is that the second snippet runs a lot faster than the first despite the fact that:

1. there's an obvious dependency on XMM0


Who knows? Speculative execution? Register renaming?

Again, pipelining roughly means that it can start one instruction and finish in the background while also starting newer ones.

Or maybe "loop" is flushing the instruction queue prematurely?

Quote:

2. cvtsd2si is a 23-clocker


Which is still blazingly fast on multi-ghz cpus.

Quote:

So just by looking at the code, most people would believe that the first one runs faster than the other (who can blame them?), while it's actually the opposite.


Supposedly the 8086 was more efficient for small and tight code, same as even 386. But for the 486 (pipelining!), it was faster to use simpler RISC-y instructions (mov [di],al // inc di) instead of CISC instructions (stosb). Even the Pentium / 586 (or, should I say 587) was pipelined, allegedly working faster if you interleave FMULs and FADDs (or whatever) to allow them to cooperate, not to mention the U (full) and V (weak) pipes, which was a big deal for compilers back then (e.g. GCC 2.8.x), requiring a recompile to really give significant speedups. 486 itself was allegedly very sensitive to alignment.

(I can also guarantee you that ENTER and LEAVE are much much slower on this [Westmere] Core i5 than simple 8086 instructions. Supposedly they were faster on an actual 186 [clone?], but not anymore.)

Quote:

It breaks my heart to know that at some point, all else equal, adding more instructions can actually make your code run faster, contrary to my previous belief.


Modern cpus are very very sophisticated. They try insanely hard to figure things out on their own.

I found an old (1999) Dr. Dobbs article the other day on MMX. Here's just a small excerpt to prove my point:

Max I. Fomitchev wrote:

Both Intel's Pentium II and AMD's K6-2 are sophisticated CPUs with complex internal structures. Both CPU families employ superscalar pipelining, dynamic execution, and branch prediction -- and both can execute up to 6 µ-operations per cycle.

Of course, there is a difference in the internal architecture. The Pentium II, for instance, has three instruction decoders and the K6-2 has two.


See? Even back then it was complicated. Nowadays it's even MORE complicated! Ugh.

Quote:

Nobody told me this before -_- Even the honorable Mr Agner Fog never mentions this, or never really puts enough emphasis on this type of optimization technique. I guess he's getting too bald and old to keep up with the latest offerings. hehehe


There's a lot of reasons. It just takes further study. Don't stress too hard, most things aren't so extremely timing sensitive.

Treat it as a hobby, learn as you go, and have fun. Just explore and investigate.

At worst, like I said, you learn new instructions or better ways to optimize for size, which (unless it horribly slows down everything, but that's rare) is always good in a pinch.
cod3b453



Joined: 25 Aug 2004
Posts: 618
cod3b453 26 Aug 2016, 22:06
It's very implementation-specific, but most recent CPUs have pipelines around 20~30 stages deep and, depending on the implementation, any branch could incur a flush delay on the order of that many cycles. Hence, you can add an instruction with a similar delay without any negative effect on the overall time.

In this particular case, the initial tight loop on the same instruction could be hiding further delay due to higher-than-usual instruction fetches, causing additional stalls, and saturation of the bus, which is likely to be wide enough for several instructions and so adding more instructions has no negative effect (possibly even a positive effect).
system error



Joined: 01 Sep 2013
Posts: 670
system error 10 Sep 2016, 06:06
Thanks for the valuable input, guys. Now here's another test I ran on standard PUSH-POP operations. And this one isn't even mentioned anywhere (Agner Fog, textbooks, etc.).

Code:
mov rcx,10000000
spin:

;Standard RTFM version
;clocks 0.029xx - 0.039xx, average 0.033xx
;stalled cycles backends
push rax
push rbx
push rcx
push rdx
pop rdx
pop rcx
pop rbx
pop rax

;Not-mentioned-in-any-AgnerFog's version
;clocks 0.023xx - 0.033xx, average 0.027xx
;No stalled cycles on both ends
;sub rsp,8*4
;mov [rsp+24],rdx
;mov [rsp+16],rcx
;mov [rsp+8],rbx
;mov [rsp],rax
;mov [rsp+24],rdx
;mov [rsp+16],rcx
;mov [rsp+8],rbx
;mov [rsp],rax
;add rsp,8*4

sub rcx,1
jnz spin    


I am starting to believe that the "read agner fog's" song is getting pretty obsolete, at least down at the very basic level.
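
For what it's worth, here is the arithmetic on the numbers above as a quick sketch. The truncated "xx" digits are dropped, which is an assumption, and the helper names are made up for the example.

```python
# Timings from the post, with the elided "xx" digits dropped (an assumption).
push_pop = {"low": 0.029, "high": 0.039, "avg": 0.033}
sub_mov = {"low": 0.023, "high": 0.033, "avg": 0.027}


def relative_saving(slow_avg, fast_avg):
    """Fraction of the slower average that the faster version saves."""
    return (slow_avg - fast_avg) / slow_avg


def ranges_overlap(a, b):
    """True if the [low, high] intervals of the two measurements intersect."""
    return a["low"] <= b["high"] and b["low"] <= a["high"]


if __name__ == "__main__":
    saving = relative_saving(push_pop["avg"], sub_mov["avg"])
    print(f"average saving: {saving:.1%}")
    print(f"ranges overlap: {ranges_overlap(push_pop, sub_mov)}")
```

On averages the SUB/MOV version saves roughly 18%, but the two measured ranges overlap, so more runs would help pin the difference down.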
system error



Joined: 01 Sep 2013
Posts: 670
system error 10 Sep 2016, 06:10
Oh btw, do avoid the LOOP instruction like the plague.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20757
Location: In your JS exploiting you and your system
revolution 10 Sep 2016, 10:25
system error wrote:
Oh btw, do avoid the LOOP instruction like the plague.
Why?
AsmGuru62



Joined: 28 Jan 2004
Posts: 1738
Location: Toronto, Canada
AsmGuru62 10 Sep 2016, 13:21
Why, indeed.
I use LOOP all the time and have never seen a slowdown.
DimonSoft



Joined: 03 Mar 2010
Posts: 1228
Location: Belarus
DimonSoft 10 Sep 2016, 14:25
revolution wrote:
Why?

AsmGuru62 wrote:
Why, indeed.
I use LOOP all the time and never seen a slowdown.

Might be something to do with the fact that complex CISC-style instructions generally tend to have a worse impact on modern processors’ pipelining than simpler RISC-style instructions.

On the other hand, smaller code might be better in terms of caches and processors’ internal instruction prefetch buffers. Add the fact that most pieces of code are not bottlenecks, and in most cases we don’t see any visible difference.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20757
Location: In your JS exploiting you and your system
revolution 10 Sep 2016, 14:43
Indeed, it is situation dependent. I don't believe any rules of thumb will be correct in every situation.
fragment



Joined: 11 Jan 2017
Posts: 3
Location: Berlin
fragment 12 Jan 2017, 10:12
good morning, folx ...
yep, I am new!

Revolution/AsmGuru62
'loop' is very handy, but unfortunately it seems to be significantly slower on many, many platforms: http://stackoverflow.com/questions/35742570/why-is-the-loop-instruction-slow-couldnt-intel-have-implemented-it-efficiently
I am writing this on a typical, cheap and relatively recent mass-market laptop (Lenovo/Silvermont/N2830) and see the same thing: the 'loop' instruction on this computer is ~100% slower than some alternative instruction sequences. My performance test code is quasi-compatible with the one from Intel which, unlike Agner Fog's, also uses the newer RDTSCP/CPUID combination. But anyway, it doesn't really matter in this case. Even if I just started/stopped the test loops manually by keypress, watching the kitchen clock, the general pattern would be obvious. However, on my other computer (that's an I7/i860/Lynnfield/Nehalem) the 'loop' instruction behaves similarly to the other ones, so no huge problem there ...

My conclusion so far is that if you want to write more optimized code, you should avoid using 'loop'. It seems never to be the fastest option (at best it is not slower) on the relevant computers in use, and on many it is definitely slower ... I might have overlooked it, but if I recall correctly neither Intel ("64-ia-32-architectures-optimization-manual") nor Agner Fog in his optimization guides ever uses 'loop'. People like Mark Larson explicitly advise against using it: http://www.mark.masmcode.com/ (please look for: "3. complex instructions")
AsmGuru62



Joined: 28 Jan 2004
Posts: 1738
Location: Toronto, Canada
AsmGuru62 12 Jan 2017, 13:47
Welcome to the forum!
When we talk about performance, we should always consider the need for
optimization. I will give you two examples:

#1:
Let's say I am displaying a dialog box with a list inside filled with some items.
Depending on the number of items, I will use LOOP or I will use the more performant
option:
Code:
@@:
        ...
        sub     ecx, 1
        jnz     @r
    


#2:
You are writing a text editor and you need to open hundreds of files in one shot.
Those files must all be parsed line by line for some features, etc.
In this case I will definitely be optimizing right away, without even measuring the code.
I will align all labels and will not use any LOOPs, because the amount of processing is large.

In cases where you suspect that LOOP is slowing you down, you need to measure how
much time your code takes and make a decision based on that, just like revolution pointed out.

Edited:
I must also add that your figure of ~100% is probably not correct.
I once measured my code using LOOP vs SUB/JNZ and came up with a ~15% slowdown.
As for the dialog case I mentioned: this is the time taken for a dialog to be filled with
items and shown to the user. Human perception will fail to see a difference with a
small number of items.
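
A quick sketch of that arithmetic, to show how the absolute cost scales with the item count. The 1 microsecond of work per item is a made-up figure for illustration only; the ~15% is the slowdown mentioned above.

```python
def extra_seconds(items, seconds_per_item, slowdown):
    """Extra wall time caused by a loop that is `slowdown` (e.g. 0.15) slower."""
    return items * seconds_per_item * slowdown


if __name__ == "__main__":
    # ASSUMPTION: 1 microsecond of work per item is a made-up figure.
    per_item = 1e-6
    for items in (100, 10_000, 1_000_000):
        extra = extra_seconds(items, per_item, slowdown=0.15)
        print(f"{items:>9} items: {extra * 1e3:.3f} ms extra")
```

With these assumed numbers, a hundred items cost an extra 15 microseconds, which nobody will notice, while a million items cost an extra 150 ms, which is where measuring starts to pay off.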
system error



Joined: 01 Sep 2013
Posts: 670
system error 12 Jan 2017, 14:46
Here are my conclusions on performance. Not much, but not wrong either.

If you're considering optimizing your code, treat all complex, CISC-style instructions with extreme prejudice and replace them with their plain counterparts.

Instructions like LOOP, STOS, PUSH, POP, and even CALL and RET are generally slower on modern CPUs. Wait, did I mention CALL / RET?

Reasons:

- These high-level instructions are just wrappers around their plain siblings, from a looong time ago when memory was small and scarce. So Intel decided to come up with shorter instructions to save space at the expense of speed.

- These high-level instructions share the same circuitry with their plain RISC-style siblings. Unless AMD/Intel dedicate special circuitry to them, there's no reason to favor them over the others.

- Complex instructions spend more MICROCODE time compared to their plain RISC siblings. Here's the path taken by the RET instruction. So while people talk about the instruction cache, they tend to forget about MICROCODE.

Faster RISC-style instruction for RET
Code:
add esp,4
jmp dword[esp-4]    



Slower CISC-based instruction RET (microcode)
Code:
switch(Instruction) {
        case NearReturn:
                if(OperandSize == 32 && !IsWithinStackLimits(TopStackBytes(12))) Exception(SS(0)); //top 12 bytes of stack not within stack limits
                //OperandSize == 16
                else if(!IsWithinStackLimits(TopStackBytes(6))) Exception(SS(0)); //top 6 bytes of stack not within stack limits
                TemporaryEIP = Pop();
                TemporaryEIP = TemporaryEIP & 0xFFFF;
                if(!IsWithinCodeSegmentLimits(TemporaryEIP)) Exception(GP(0));
                EIP = TemporaryEIP;
                if(HasImmediateOperand()) { //instruction has immediate operand
                        if(StackAddressSize == 32) ESP = ESP + Source; //release parameters from stack
                        //StackAddressSize == 16
                        else SP = SP + Source; //release parameters from stack
                }
                break;
        case FarReturn:
                //Real-address mode or virtual-8086 mode
                if(PE == 0 || (PE == 1 && VM == 1)) {
                        if(OperandSize == 32) {
                                if(!IsWithinStackLimits(TopStackBytes(12))) Exception(SS(0)); //top 12 bytes of stack not within stack limits
                                EIP = Pop();
                                CS = Pop(); //32-bit pop, high-order 16 bits discarded
                        }
                        else { //OperandSize == 16
                                if(!IsWithinStackLimits(TopStackBytes(6))) Exception(SS(0)); //top 6 bytes of stack not within stack limits
                                TemporaryEIP = Pop();
                                TemporaryEIP = TemporaryEIP & 0xFFFF;
                                if(!IsWithinCodeSegmentLimits(TemporaryEIP)) Exception(GP(0));
                                EIP = TemporaryEIP;
                                CS = Pop(); //16-bit pop
                        }
                        if(HasImmediateOperand()) SP = SP + Source; //instruction has immediate operand; release parameters from stack
                }
                //Protected mode, not virtual-8086 mode
                else if(PE == 1 && VM == 0) {
                        if(OperandSize == 32 && !IsWithinStackLimits(OffsetStackBytes(4, 4))) Exception(SS(0)); //second doubleword on stack is not within stack limits
                        //OperandSize == 16
                        else if(!IsWithinStackLimits(OffsetStackBytes(2, 2))) Exception(SS(0)); //second word on stack is not within stack limits
                        if(ReturnCode.SegmentSelector == 0) Exception(GP(Selector));
                        if(!IsWithinDescriptorTableLimits(ReturnCode.SegmentSelector)) Exception(GP(Selector));
                        ReturnCode.SegmentDescriptor = ObtainSegmentDescriptor(); //Obtain descriptor to which return code segment selector points from descriptor table
                        if(!IsCodeSegment(ReturnCode.SegmentDescriptor)) Exception(GP(Selector));
                        if(ReturnCode.SegmentSelector.RPL < CPL) Exception(GP(Selector));
                        if(IsConforming(ReturnCode.SegmentDescriptor) && ReturnCode.Segment.DPL > ReturnCode.SegmentSelector.RPL) Exception(GP(Selector));
                        if(!IsPresent(ReturnCode.SegmentDescriptor)) Exception(NP(Selector));
                        if(ReturnCode.SegmentSelector.RPL > CPL) {
                                //Return outer privilege level
                                if(OperandSize == 32 && !IsWithinStackLimits(TopStackBytes(16 + Source))) Exception(SS(0)); //top 16 + Source bytes of stack not within stack limits
                                //OperandSize == 16
                                else if(!IsWithinStackLimits(TopStackBytes(8 + Source))) Exception(SS(0)); //top 8 + Source bytes of stack not within stack limits
                                ReturnSegmentSelector = ReadReturnSegmentSelector();
                                if(StackSegmentSelector == 0) Exception(GP(0));
                                if(!IsWithinDescriptorTableLimits(ReturnStack.SegmentSelector.Index)) Exception(GP(Selector));
                                if(StackSegmentSelector.RPL != ReturnCode.SegmentSelector.RPL || !IsWritableDataSegment(StackSegment) || StackSegmentDescriptor.DPL != ReturnCode.SegmentSelector.RPL) Exception(GP(Selector));
                                if(!IsPresent(StackSegment)) Exception(SS(StackSegmentSelector));
                                if(!IsWithinLimits(ReturnCode.SegmentLimit, ReturnInstructionPointer)) Exception(GP(0));
                                CPL = ReturnCode.SegmentSelector.RPL;
                                if(OperandSize == 32) {
                                        EIP = Pop();
                                        CS = Pop(); //32-bit pop, high-order 16 bits discarded; segment descriptor information also loaded
                                        CS.RPL = CPL;
                                        ESP = ESP + Source; //release parameters from called procedure's stack
                                        TemporaryESP = Pop();
                                        TemporarySS = Pop(); //32-bit pop, high-order 16 bits discarded; segment descriptor information also loaded
                                        ESP = TemporaryESP;
                                        SS = TemporarySS;
                                }
                                //OperandSize == 16
                                else {
                                        EIP = Pop();
                                        EIP = EIP & 0xFFFF;
                                        CS = Pop(); //16-bit pop; segment descriptor information also loaded
                                        CS.RPL = CPL;
                                        ESP = ESP + Source; //release parameters from called procedure's stack
                                        TemporaryESP = Pop();
                                        TemporarySS = Pop(); //16-bit pop; segment descriptor information also loaded
                                        ESP = TemporaryESP;
                                        SS = TemporarySS;
                                }
                                SegmentRegisters[] = {ES, FS, GS, DS};
                                while(SegmentRegister = NextSegmentRegister(SegmentRegisters)) {
                                        if(IsDataPointer(SegmentRegister)  || !IsConformingCodeSegment(SegmentRegister) && CPL > SegmentDescriptor.DPL /*DPL in hidden part of segment register*/) SegmentSelector = 0; //segment register is invalid, null segment selector
                                        if(!IsWithinDescriptorTableLimits(SegmentSelector.Index) || (!IsData(SegmentDescriptor) && !IsReadableCodeSegment(SegmentDescriptor)) || (IsData(SegmentDescriptor) && !IsConformingCodeSegment(SegmentDescriptor) && SegmentDescriptor.DPL < CPL && SegmentDescriptor.DPL < CodeSegment.SegmentSelector.RPL)) SegmentSelectorRegister = NullSelector;
                                        ESP = ESP + Source; //release parameters from called procedure's stack
                                }
                        }
                        else {
                                //Return to same privilege level
                                if(!IsWithinLimits(ReturnCode.SegmentLimit, ReturnInstructionPointer)) Exception(GP(0));
                                if(OperandSize == 32) {
                                        EIP = Pop();
                                        CS = Pop(); //32-bit pop, high-order 16 bits are discarded
                                        ESP = ESP + Source; //Release parameters from stack
                                }
                                else { //OperandSize == 16
                                        EIP = Pop();
                                        EIP = EIP & 0xFFFF;
                                        ESP = ESP + Source; //Release parameters from stack
                                }
                        }
                }
                break;
}    


So before you worry about cache and pipeline, take a good look at the microcode implementation of such an instruction first.

Let me hear your opinion.
system error



Joined: 01 Sep 2013
Posts: 670
system error 12 Jan 2017, 14:52
This is what I love about this board. While people are busy talking about cache and pipes, we go even deeper, to microcode programming. Why don't you people just die already?
system error



Joined: 01 Sep 2013
Posts: 670
system error 12 Jan 2017, 14:54
fragment wrote:
good morning, folx ...
yep, I am new!

Revolution/AsmGuru62
'loop' is very handy, but unfortunately it seems to be significantly slower on many, many platforms: http://stackoverflow.com/questions/35742570/why-is-the-loop-instruction-slow-couldnt-intel-have-implemented-it-efficiently
I am writing this on a typical, cheap and relatively recent mass-market laptop (Lenovo/Silvermont/N2830) and see the same thing: the 'loop' instruction on this computer is ~100% slower than some alternative instruction sequences. My performance test code is quasi-compatible with the one from Intel which, unlike Agner Fog's, also uses the newer RDTSCP/CPUID combination. But anyway, it doesn't really matter in this case. Even if I just started/stopped the test loops manually by keypress, watching the kitchen clock, the general pattern would be obvious. However, on my other computer (that's an I7/i860/Lynnfield/Nehalem) the 'loop' instruction behaves similarly to the other ones, so no huge problem there ...

My conclusion so far is that if you want to write more optimized code, you should avoid using 'loop'. It seems never to be the fastest option (at best it is not slower) on the relevant computers in use, and on many it is definitely slower ... I might have overlooked it, but if I recall correctly neither Intel ("64-ia-32-architectures-optimization-manual") nor Agner Fog in his optimization guides ever uses 'loop'. People like Mark Larson explicitly advise against using it: http://www.mark.masmcode.com/ (please look for: "3. complex instructions")


It's time for the good guys like us to enlighten our sinful brethren like AsmGuru and revolution.
system error



Joined: 01 Sep 2013
Posts: 670
system error 12 Jan 2017, 15:00
Oh btw, the true philosophy behind the Microsoft fastcall convention and ABI is not to give you a brain tumor, but simply to avoid those slower complex instructions like PUSH and POP.

Take it or leave it.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20757
Location: In your JS exploiting you and your system
revolution 12 Jan 2017, 15:00
system error wrote:
Faster RISC-style instruction for RET
Code:
add esp,4
jmp dword[esp-4]    
Have you tested this? You will mess up the special call/ret stack in the CPU and might make things worse!
Quote:
Slower CISC-based instruction RET (microcode)
Code:
switch(Instruction) {
        case NearReturn:
                if(OperandSize == 32 && !IsWithinStackLimits(TopStackBytes(12))) Exception(SS(0)); //top 12 bytes of stack not within stack limits
                //OperandSize == 16
                else if(!IsWithinStackLimits(TopStackBytes(6))) Exception(SS(0)); //top 6 bytes of stack not within stack limits
                TemporaryEIP = Pop();
                TemporaryEIP = TemporaryEIP & 0xFFFF;
                if(!IsWithinCodeSegmentLimits(TemporaryEIP)) Exception(GP(0));
                EIP = TemporaryEIP;
                if(HasImmediateOperand()) { //instruction has immediate operand
                        if(StackAddressSize == 32) ESP = ESP + Source; //release parameters from stack
                        //StackAddressSize == 16
                        else SP = SP + Source; //release parameters from stack
                }
                break;
        case FarReturn:
                //Real-address mode or virtual-8086 mode
                if(PE == 0 || (PE == 1 && VM == 1)) {
                        if(OperandSize == 32) {
                                if(!IsWithinStackLimits(TopStackBytes(12))) Exception(SS(0)); //top 12 bytes of stack not within stack limits
                                EIP = Pop();
                                CS = Pop(); //32-bit pop, high-order 16 bits discarded
                        }
                        else { //OperandSize == 16
                                if(!IsWithinStackLimits(TopStackBytes(6))) Exception(SS(0)); //top 6 bytes of stack not within stack limits
                                TemporaryEIP = Pop();
                                TemporaryEIP = TemporaryEIP & 0xFFFF;
                                if(!IsWithinCodeSegmentLimits(TemporaryEIP)) Exception(GP(0));
                                EIP = TemporaryEIP;
                                CS = Pop(); //16-bit pop
                        }
                        if(HasImmediateOperand()) SP = SP + Source; //instruction has immediate operand; release parameters from stack
                }
                //Protected mode, not virtual-8086 mode
                else if(PE == 1 && VM == 0) {
                        if(OperandSize == 32 && !IsWithinStackLimits(OffsetStackBytes(4, 4))) Exception(SS(0)); //second doubleword on stack is not within stack limits
                        //OperandSize == 16
                        else if(!IsWithinStackLimits(OffsetStackBytes(2, 2))) Exception(SS(0)); //second word on stack is not within stack limits
                        if(ReturnCode.SegmentSelector == 0) Exception(GP(Selector));
                        if(!IsWithinDescriptorTableLimits(ReturnCode.SegmentSelector)) Exception(GP(Selector));
                        ReturnCode.SegmentDescriptor = ObtainSegmentDescriptor(); //Obtain descriptor to which return code segment selector points from descriptor table
                        if(!IsCodeSegment(ReturnCode.SegmentDescriptor)) Exception(GP(Selector));
                        if(ReturnCode.SegmentSelector.RPL < CPL) Exception(GP(Selector));
                        if(IsConforming(ReturnCode.SegmentDescriptor) && ReturnCode.Segment.DPL > ReturnCode.SegmentSelector.RPL) Exception(GP(Selector));
                        if(!IsPresent(ReturnCode.SegmentDescriptor)) Exception(NP(Selector));
                        if(ReturnCode.SegmentSelector.RPL > CPL) {
                                //Return outer privilege level
                                if(OperandSize == 32 && !IsWithinStackLimits(TopStackBytes(16 + Source))) Exception(SS(0)); //top 16 + Source bytes of stack not within stack limits
                                //OperandSize == 16
                                else if(!IsWithinStackLimits(TopStackBytes(8 + Source))) Exception(SS(0)); //top 8 + Source bytes of stack not within stack limits
                                ReturnSegmentSelector = ReadReturnSegmentSelector();
                                if(StackSegmentSelector == 0) Exception(GP(0));
                                if(!IsWithinDescriptorTableLimits(ReturnStack.SegmentSelector.Index)) Exception(GP(Selector));
                                if(StackSegmentSelector.RPL != ReturnCode.SegmentSelector.RPL || !IsWritableDataSegment(StackSegment) || StackSegmentDescriptor.DPL != ReturnCode.SegmentSelector.RPL) Exception(GP(Selector));
                                if(!IsPresent(StackSegment)) Exception(SS(StackSegmentSelector));
                                if(!IsWithinLimits(ReturnCode.SegmentLimit, ReturnInstructionPointer)) Exception(GP(0));
                                CPL = ReturnCode.SegmentSelector.RPL;
                                if(OperandSize == 32) {
                                        EIP = Pop();
                                        CS = Pop(); //32-bit pop, high-order 16 bits discarded; segment descriptor information also loaded
                                        CS.RPL = CPL;
                                        ESP = ESP + Source; //release parameters from called procedure's stack
                                        TemporaryESP = Pop();
                                        TemporarySS = Pop(); //32-bit pop, high-order 16 bits discarded; segment descriptor information also loaded
                                        ESP = TemporaryESP;
                                        SS = TemporarySS;
                                }
                                //OperandSize == 16
                                else {
                                        EIP = Pop();
                                        EIP = EIP & 0xFFFF;
                                        CS = Pop(); //16-bit pop; segment descriptor information also loaded
                                        CS.RPL = CPL;
                                        ESP = ESP + Source; //release parameters from called procedure's stack
                                        TemporaryESP = Pop();
                                        TemporarySS = Pop(); //16-bit pop; segment descriptor information also loaded
                                        ESP = TemporaryESP;
                                        SS = TemporarySS;
                                }
                                SegmentRegisters[] = {ES, FS, GS, DS};
                                while(SegmentRegister = NextSegmentRegister(SegmentRegisters)) {
                                        if(IsDataPointer(SegmentRegister) || (!IsConformingCodeSegment(SegmentRegister) && CPL > SegmentDescriptor.DPL /*DPL in hidden part of segment register*/)) SegmentSelector = 0; //segment register is invalid, null segment selector
                                        if(!IsWithinDescriptorTableLimits(SegmentSelector.Index) || (!IsData(SegmentDescriptor) && !IsReadableCodeSegment(SegmentDescriptor)) || (IsData(SegmentDescriptor) && !IsConformingCodeSegment(SegmentDescriptor) && SegmentDescriptor.DPL < CPL && SegmentDescriptor.DPL < CodeSegment.SegmentSelector.RPL)) SegmentSelectorRegister = NullSelector;
                                        ESP = ESP + Source; //release parameters from called procedure's stack
                                }
                        }
                        else {
                                //Return to same privilege level
                                if(!IsWithinLimits(ReturnCode.SegmentLimit, ReturnInstructionPointer)) Exception(GP(0));
                                if(OperandSize == 32) {
                                        EIP = Pop();
                                        CS = Pop(); //32-bit pop, high-order 16 bits are discarded
                                        ESP = ESP + Source; //Release parameters from stack
                                }
                                else { //OperandSize == 16
                                        EIP = Pop();
                                        EIP = EIP & 0xFFFF;
                                        CS = Pop(); //16-bit pop
                                        ESP = ESP + Source; //Release parameters from stack
                                }
                        }
                }
                break;
}    
That isn't the microcode; it is the logical description from the manual. At the microcode level it will look very different.

Anyhow, as mentioned above, test the code to make sure you get what you expect. There is a lot of old, and just plain wrong, advice on the Internet. Don't blindly trust it. CPU internal designs are constantly changing, so you never know what is now better and what is now worse.
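As a minimal sketch of what "test the code" can look like, the harness below runs a workload several times and keeps the fastest run, which filters out some noise from interrupts and cold caches. Everything here is an assumption for illustration: best_seconds, mul_only, and ITERATIONS are hypothetical names, clock() measures CPU time at coarse resolution, and the C compiler decides which instructions the loop becomes, so for the exact mulsd/loop sequences discussed in this thread you would time the assembled code itself.

```c
#include <assert.h>
#include <time.h>

#define ITERATIONS 10000000L   /* 10 million, as in the loops from the first post */

/* A volatile sink keeps the compiler from discarding the loop result. */
static volatile double sink_f;

/* Hypothetical workload: a dependent chain of double multiplications. */
static void mul_only(void) {
    double x = 1.0;
    const double y = 1.0000001;
    for (long i = 0; i < ITERATIONS; i++) x *= y;
    sink_f = x;
}

/* Run fn `runs` times and return the fastest run in seconds of CPU time.
   Returns -1.0 if runs is not positive. */
static double best_seconds(void (*fn)(void), int runs) {
    assert(fn != NULL);
    double best = -1.0;
    for (int r = 0; r < runs; r++) {
        clock_t t0 = clock();
        fn();
        clock_t t1 = clock();
        double s = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || s < best) best = s;
    }
    return best;
}
```

Call it as best_seconds(mul_only, 5), then do the same for each variant you want to compare. Taking the minimum rather than the average is a design choice: the fastest run is the one least disturbed by the rest of the system.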
Post 12 Jan 2017, 15:00
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.