flat assembler
Message board for the users of flat assembler.

Index > Main > How to recognize whether AL has a zero bit in each tetrad?

Goto page Previous  1, 2
Author
Thread Post new topic Reply to topic
l_inc



Joined: 23 Oct 2009
Posts: 881
l_inc 21 Oct 2017, 23:03
Results I get on my i5-4200M (Haswell) ±2% :
Code:
l_inc.0:    4126985 usec
Furs.0:     12340780 usec
SergeASM.0: 8874583 usec
Furs.1:     9079281 usec
_shura.0:   7636351 usec    

Because some snippets assume ah == 0 I added xor ah,ah to every snippet for the purpose of measuring.

Code used for the measurements:
Code:
format PE GUI 4.0

include 'win32a.inc'
;include 'general.inc'

entry start

section '.text' readable executable

MyGetTickCount:
        invoke QueryPerformanceCounter,perfCntrVal
        xor edx,edx
        mov eax,dword[perfCntrVal]
        mov ecx,dword[perfCntrRes]
        div ecx
ret

macro Measure_ name*
{
    Measure_#name:
        push ebx
        push edi

        call MyGetTickCount
        push eax

        xor ebx,ebx
        xor ecx,ecx
        align $20
        .loop:
            mov al,bl

        ;display `name#":",13,10
        ;ilen_
}
macro _Measure
{
                inc ecx     ;count inputs with no F-nibble
            .f_nibble:
        ;_ilen

        inc ebx
        jnz .loop

        push ecx
        
        call MyGetTickCount
        mov edi,eax
        
        pop ecx

        pop eax
        sub eax,edi
        neg eax

        pop edi
        pop ebx
    ret
}

Measure_ l_inc.0
    xor ah,ah
    add al,$11
    lahf
    test ah,$11
    jnz .f_nibble
_Measure

Measure_ Furs.0
    xor ah,ah
    add eax, 15 shl 8 + 16 + 1
    and al, ah
    test al, 15
    jz .f_nibble
_Measure

Measure_ SergeASM.0
    xor ah,ah
    shl eax,4
    cmp al,$F0
    je .f_nibble
    cmp ah,$0F
    je .f_nibble
_Measure

Measure_ Furs.1
    xor ah,ah
    add al, 16 + 1
    jc .f_nibble
    test al, 15
    jz .f_nibble
_Measure

Measure_ _shura.0
    xor ah,ah
    not al
    test al, 0x0f
    jz .f_nibble
    test al, 0xf0
    jz .f_nibble
_Measure

proc start
        invoke QueryPerformanceFrequency,perfCntrRes
        test eax,eax
        jnz @F
                invoke MessageBox,NULL,msgNS,titleErr,MB_OK or MB_ICONERROR
                ret
        @@:

        xor edx,edx
        mov eax,dword[perfCntrRes]
        mov ecx,1000000                                 ;resolution in usec
        div ecx

        test eax,eax
        jnz @F
                invoke MessageBox,NULL,msgTL,titleErr,MB_OK or MB_ICONERROR
                ret
        @@:

        mov dword[perfCntrRes],eax

        call Measure__shura.0
        push eax
        push ecx

        call Measure_Furs.1
        push eax
        push ecx

        call Measure_SergeASM.0
        push eax
        push ecx

        call Measure_Furs.0
        push eax
        push ecx

        call Measure_l_inc.0
        push eax
        push ecx

        cinvoke wsprintf,strBuf,fmtStr
        invoke MessageBox,NULL,strBuf,titleTiming,MB_OK
    invoke ExitProcess,0
endp

section '.data' data readable writeable

        align 16
        perfCntrRes                             dq ?
        perfCntrVal                             dq ?
        strBuf                                  rb 800h
        titleTiming                             db 'Timing results',0
        titleErr                                db 'Error',0
        msgNS                                   db 'High resolution performance counter is not supported',0
        msgTL                                   db 'High resolution performance counter is of very low resolution',0
        fmtStr                                  db 'l_inc.0: %u (%u usec); Furs.0: %u (%u usec); SergeASM.0: %u (%u usec); Furs.1: %u (%u usec); _shura.0: %u (%u usec)',0

section '.idata' readable
data import
        library kernel32,'kernel32.dll',\
                        user32,'user32.dll'

        include 'api\kernel32.inc'
        include 'api\user32.inc'
end data    

_________________
Faith is a superposition of knowledge and fallacy
Post 21 Oct 2017, 23:03
View user's profile Send private message Reply with quote
l_inc



Joined: 23 Oct 2009
Posts: 881
l_inc 21 Oct 2017, 23:48
Somewhat different results on P4:
Code:
l_inc.0:    19813978 usec
Furs.0:     12301931 usec
SergeASM.0: 12064554 usec
Furs.1:     11127560 usec
_shura.0:   11972406 usec    

_________________
Faith is a superposition of knowledge and fallacy
Post 21 Oct 2017, 23:48
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 19869
Location: In your JS exploiting you and your system
revolution 22 Oct 2017, 01:56
All I see are bunch of useless simulated tests. These will have almost no correlation to what happens in the final application.
Post 22 Oct 2017, 01:56
View user's profile Send private message Visit poster's website Reply with quote
l_inc



Joined: 23 Oct 2009
Posts: 881
l_inc 22 Oct 2017, 10:41
revolution
Would you give any quantification for "almost no correlation" or is this just a useless blind guess? Smile

_________________
Faith is a superposition of knowledge and fallacy
Post 22 Oct 2017, 10:41
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 19869
Location: In your JS exploiting you and your system
revolution 22 Oct 2017, 12:38
l_inc wrote:
revolution
Would you give any quantification for "almost no correlation" or is this just a useless blind guess? Smile
I'm sure you are aware of the difficulties of estimating runtime on a modern CPU.

Even if we ignore all of the other factors even just changing the CPU will change the results. And even if we keep the CPU constant then all the other factors related to the surrounding code can make a large difference in the results. There are also times where the same code run in different circumstances can show variances in timings. So any simulated tests will show us nothing of value. We need to run whatever tests in the final code.

As for this code, we have no idea of the run system it is used in, so we have no idea how it will perform in the final application. Maybe a lookup table will be best in one circumstance, and a computed result will be best in another, perhaps a branchless code path is best in some cases.

Maybe the real application has a limited number of input values and we don't even need to have code that works for any arbitrary value. Maybe the code only runs at startup and the timing difference is meaningless. Maybe the code only makes a difference of 0.01% and it would be best to put more time into some other bottleneck.

The code sequence is very short and deals with unknown data inputs, running on an unknown system, within an unknown application, in an unknown OS, so any correlation you might get could be at best a coincidence if you are lucky.
Post 22 Oct 2017, 12:38
View user's profile Send private message Visit poster's website Reply with quote
l_inc



Joined: 23 Oct 2009
Posts: 881
l_inc 22 Oct 2017, 13:19
revolution
Quote:
So any simulated tests will show us nothing of value.

The value of initial simulated tests is not to show how the final application would perform. The value is to provide a notion about what should be taken into account while writing the final application even before the final application is ready for testing. Large variation in simulated tests on different machines is already a valuable result.

I personally, of course, do not have sufficient information in this particular case to estimate whether the optimization attempt here is of any significant value let alone to simulate representative microarchitectural conditions, but I can give a push for applying educated testing.

_________________
Faith is a superposition of knowledge and fallacy
Post 22 Oct 2017, 13:19
View user's profile Send private message Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2340
Furs 22 Oct 2017, 15:38
Sorry but I agree with revolution. Your test uses a counter that increases sequentially as the value. This means that "two branches" don't have much penalty because for most values they'll always be the same 15 times in a row.

It's a simulated case where the predictor has a very easy job. That's the problem. Try with random sequence of numbers in al instead (cache them in memory so the overhead of random generation is probably not included or something like that but "touch" the memory first before the first test to make sure it's in cache).

BTW don't use 'xor ah, ah' if it's come to that, use 'movzx eax, al' instead, it probably is faster on AMD at least (I don't know about Intel if they still split the register with zero penalties).

Or even better, 'movzx eax, bl' since in real code it's likely you'll be able to do that if it comes from another register or memory operand, to make it more real.
Post 22 Oct 2017, 15:38
View user's profile Send private message Reply with quote
l_inc



Joined: 23 Oct 2009
Posts: 881
l_inc 22 Oct 2017, 20:41
Furs
Quote:
This means that "two branches" don't have much penalty because for most values they'll always be the same 15 times in a row.

You are right. I should have thought about that. But that's not the revolution's point to agree with, as he (*) completely discourages any testing outside of the final environment.

Quote:
BTW don't use 'xor ah, ah' if it's come to that, use 'movzx eax, al' instead

My preference would be to leave it out completely for cases that do not depend on it. But you can make your own suggestion in form of an improved or another listing. My idea was to provide a basis and see whether and how it evolves.
movzx eax,al seems to establish a dependency chain for the test with lahf, which makes it significantly slower than with xor ah,ah (or without it). Smile

Quote:
I don't know about Intel if they still split the register with zero penalties

The penalty is supposed to be just an additional microop:
Optimization reference manual wrote:
Starting with Intel microarchitecture code name Sandy Bridge and all subsequent generations of Intel Core microarchitecture, partial register access is handled in hardware by inserting a micro-op that merges the partial register with the full register in the following cases...


(*) No, revolution, I don't buy that potential "s" in front. Razz

_________________
Faith is a superposition of knowledge and fallacy
Post 22 Oct 2017, 20:41
View user's profile Send private message Reply with quote
_shura



Joined: 22 May 2015
Posts: 61
_shura 22 Oct 2017, 21:07
1. do not have windows, so I cannot check: whats about my second approach?
2. you did not check, whether the result is correct or not?
3. Performance-Tests on algorithms are worthless, when they are not testes separately. If you test it in the final application you have to many side effects. what you can do is to use the values you have to handle in the final application. But is this really the bottle-neck of the application?
4. setting ah to zero each run is not fair. when my code does not need it, it is faster than code, that does need it. If you cannot assume ah is null, you had to set it to zero and then it is part of the code.
5. what about random values and multiple runs?
Post 22 Oct 2017, 21:07
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 19869
Location: In your JS exploiting you and your system
revolution 23 Oct 2017, 00:24
_shura wrote:
3. Performance-Tests on algorithms are worthless, when they are not testes separately. If you test it in the final application you have to many side effects. what you can do is to use the values you have to handle in the final application. But is this really the bottle-neck of the application?
The whole point of testing in the final code is to know if such side effects make a difference or not. Simulated tests become worthless when the side effects wash away all your work and render it meaningless making you start again from scratch.

Also, testing in the final code allows one to properly focus upon the real points of performance problems. This can only realistically be done when everything is interacting together and you can measure where the CPU spends most of its time.

Edit: It is easy to trick oneself by testing some small portion of code in isolation and determining that version A is faster than B. Then you insert A into the final code. Thinking to yourself that A is the best possible result. But later when testing the final code it could be that version B is better when all the other surrounding code is also active. So all that time spent testing each part in isolation was wasted.
Post 23 Oct 2017, 00:24
View user's profile Send private message Visit poster's website Reply with quote
_shura



Joined: 22 May 2015
Posts: 61
_shura 23 Oct 2017, 18:06
The questions was how o quickly recognize whether AL has a zero bit in each nibble or not. Like: Which Sorting-Algorithm is the fastest. It is worthless to answer this questions, when you test the whole game, where you have to wait for user-input, etc. If you want to optimise your software optimising one part of the software may a good idea, but as I asked: Is this really the bottle-neck of the application?
Post 23 Oct 2017, 18:06
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2023, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.