flat assembler
Message board for the users of flat assembler.

Index > Main > How to recognize whether AL has a zero bit in each tetrad?

Goto page 1, 2  Next
Author
Thread Post new topic Reply to topic
SergeASM



Joined: 13 Nov 2015
Posts: 21
SergeASM 15 Oct 2017, 08:29
I want to quickly recognize whether AL has a zero bit in each tetrad. I.e.

AL=11110001 not
AL=00000000 yes
AL=01111111 not
AL=00111011 yes
...

Is there a method faster than xlat?

TIA,
Serge
Post 15 Oct 2017, 08:29
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20445
Location: In your JS exploiting you and your system
revolution 15 Oct 2017, 10:59
SergeASM wrote:
... faster ...
Every machine will be different. Yes, really. You will have to test a range of possible solutions on the target machine to determine which is "fastest" for that machine.

Also, it is probably worthless to test only sample code. You should try to test the entire code and see which option(s) work best when the whole application is running.
Post 15 Oct 2017, 10:59
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2559
Furs 15 Oct 2017, 12:32
I had to Google what "tetrad" means (maybe I suck Sad) but this is usually called a "nibble" (4-bits).

The "pattern" here is that either the number mod 16 is 15, or it is higher than 239, it's the only time it will give out the answer "no". I'm not sure if you can use "ah" (i.e. rest of eax) as you didn't mention it, but if you can, let's assume ah is zero, try something like:
Code:
add eax, 16+1
dec ah
and al, ah
test al, 15
jz no    
The first one produces a carry into ah if the number was 239 or greater, which means "no" (since 239 mod 16 is 15, and >= 240 is the other condition for "no"). The addition places the carry into ah. Then we decrease ah, resulting in all 1s if there was no carry, and zero otherwise. We then "mask" the resulting al with this: if carry, set al to zero.

Then we test for low 15 bits to see if they are zeros -- because they are zero only if they were 15 (we added '1' to them when we did first addition, that's why "16+1") or if the carry was produced (it zeroed al). Meaning they should be zero if number was >= 240 or if low 4 bits were all 1s.

I only wanted to use one branch for obvious reasons. I have no idea if it's faster than a lookup table though, but you can't test if there is no alternative Smile
Post 15 Oct 2017, 12:32
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20445
Location: In your JS exploiting you and your system
revolution 15 Oct 2017, 12:39
Furs: That is a great analysis. But is it faster?
Post 15 Oct 2017, 12:39
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2559
Furs 15 Oct 2017, 12:40
I don't know, it probably is in some cases (it depends on surrounding code, since otherwise if branch gets predicted properly all the time it doesn't matter which one you use), probably not in others etc.

But still you need "different" ways to be able to test them, can't test something when you don't even know alternative Razz
Post 15 Oct 2017, 12:40
View user's profile Send private message Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8358
Location: Kraków, Poland
Tomasz Grysztar 15 Oct 2017, 12:41
Furst: I think your first two instructions may be contracted to:
Code:
add eax,(-1) shl 8 + 16 + 1    
Post 15 Oct 2017, 12:41
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2559
Furs 15 Oct 2017, 12:48
I was worried about the carry going into the al but it seems you are right! Wink I just tested it, so now it's:
Code:
add eax,(-1) shl 8 + 16 + 1
and al, ah
test al, 15
jz no    
I think now it's definitely "faster" or at least the "same" speed. Also much more friendly to the cache (no need for a lookup table using 256 bytes).
Post 15 Oct 2017, 12:48
View user's profile Send private message Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8358
Location: Kraków, Poland
Tomasz Grysztar 15 Oct 2017, 12:51
Furs wrote:
I was worried about the carry going into the al
Well, we could just use 0FF00h instead of "(-1) shl 8" and then forget about carry (it would land in the high part of EAX that does not concern us).
Post 15 Oct 2017, 12:51
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20445
Location: In your JS exploiting you and your system
revolution 15 Oct 2017, 13:35
Furs wrote:
I think now it's definitely "faster" or at least the "same" speed. Also much more friendly to the cache (no need for a lookup table using 256 bytes).
It probably uses more bytes in the instruction cache, but fewer bytes in the data cache. Maybe it is "faster", maybe not. We would need the OP to test it.
Post 15 Oct 2017, 13:35
View user's profile Send private message Visit poster's website Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8358
Location: Kraków, Poland
Tomasz Grysztar 15 Oct 2017, 13:49
Because this is just a yes/no test, you could even try BT instead of XLAT. Lookup table would then be 32 bytes instead of 256. Wink
Post 15 Oct 2017, 13:49
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2559
Furs 15 Oct 2017, 14:01
Yeah but AFAIK bt with "mem, reg" operands is super slow. Sad (only that combination)
Post 15 Oct 2017, 14:01
View user's profile Send private message Reply with quote
l_inc



Joined: 23 Oct 2009
Posts: 881
l_inc 15 Oct 2017, 16:14
This one does not need to assume an initial value of ah and seems almost 2 times faster on my machine (not taking the branch into account):
Code:
add al,$11
lahf
test ah,$11
jnz .no    

lahf may be unavailable in 64-bit mode on some machines and may also be slower on some other machines.

_________________
Faith is a superposition of knowledge and fallacy
Post 15 Oct 2017, 16:14
View user's profile Send private message Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2559
Furs 15 Oct 2017, 16:31
Nice, I completely forgot that the Auxiliary Flag even exists. Smile
Post 15 Oct 2017, 16:31
View user's profile Send private message Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2559
Furs 16 Oct 2017, 12:00
BTW without using the Auxiliary Flag trick, I realized you don't have to assume or touch the upper parts of eax, as long as you know 'ah' is zero. i.e.
Code:
add eax, 15 shl 8 + 16 + 1
and al, ah
test al, 15
jz no    
My brain wasn't functioning correctly yesterday here. Razz I mean, we're only interested in zeroing the lower 4 bits, don't care of the others.
Post 16 Oct 2017, 12:00
View user's profile Send private message Reply with quote
SergeASM



Joined: 13 Nov 2015
Posts: 21
SergeASM 21 Oct 2017, 11:57
Furs wrote:
I had to Google what "tetrad" means (maybe I suck Sad) but this is usually called a "nibble" (4-bits).

Sorry, I'm not an English speaker. Google Translate gave me two results: tetrad and quaternion. Smile

Thanks, but I wrote my naive test
Code:
xor ah,ah
shl eax,4
cmp al,$F0
je .no
cmp ah,$0F
je .no    
and this test works faster than both above examples (on my Athlon II X2 under Win32). Smile

_________________
Serge
Post 21 Oct 2017, 11:57
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2559
Furs 21 Oct 2017, 12:42
With random values? Two branches would be much worse in a non-simulated test, at least on newer CPUs, but I have no idea about Athlon II. I mean, make sure you test it with "reasonable" values you can expect in the program etc.

With two branches tell me if the following is faster on your CPU? Confused
Code:
add al, 16 + 1
jc .no
test al, 15
jz .no    
(no need for ah at all)
Post 21 Oct 2017, 12:42
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20445
Location: In your JS exploiting you and your system
revolution 21 Oct 2017, 13:30
Make your tests in the real code. No amount of simulation will help IMO.
Post 21 Oct 2017, 13:30
View user's profile Send private message Visit poster's website Reply with quote
_shura



Joined: 22 May 2015
Posts: 61
_shura 21 Oct 2017, 18:19
Code:
not al
and al, 0x0f
jz yes
and al, 0xf0
jz yes
    
Post 21 Oct 2017, 18:19
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2559
Furs 21 Oct 2017, 20:28
@_shura You meant to use "test" right? Now the second one will always be 0. test is also better as it can micro-fuse on newer CPUs with the branch (speed may be the same, but it's one less micro-op, which can help in other situations).

I doubt it's faster than the add+jc I posted (also two branches) even on Athlon which probably doesn't even do micro-fusing (because add->jc is probably not micro-fused since it's not very "common" like test/cmp + branch) since it has 1 more insn so (at least) 1 more micro op.
Post 21 Oct 2017, 20:28
View user's profile Send private message Reply with quote
_shura



Joined: 22 May 2015
Posts: 61
_shura 21 Oct 2017, 20:35
nobody said al had to contain the initial value Very Happy
what about
Code:
not al
test al, 0x0f
jz @f ;only one jump
test al, 0xf0
@@:
;ZF set: at least one nibble is 1111.
;ZF cleared: each nibble contains at least one null.
ret
    
Post 21 Oct 2017, 20:35
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.