flat assembler
Message board for the users of flat assembler.
 Home   FAQ   Search   Register 
 Profile   Log in to check your private messages   Log in 
flat assembler > Main > How to recognize whether AL has a zero bit in each tetrad?

Goto page 1, 2  Next
Author
Thread Post new topic Reply to topic
SergeASM



Joined: 13 Nov 2015
Posts: 21
How to recognize whether AL has a zero bit in each tetrad?
I want to quickly recognize whether AL has a zero bit in each tetrad. I.e.

AL=11110001 not
AL=00000000 yes
AL=01111111 not
AL=00111011 yes
...

Is there a method faster than xlat?

TIA,
Serge
Post 15 Oct 2017, 08:29
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15241
Location: 1I/ʻOumuamua
Re: How to recognize whether AL has a zero bit in each tetrad?

SergeASM wrote:
... faster ...

Every machine will be different. Yes, really. You will have to test a range of possible solutions on the target machine to determine which is "fastest" for that machine.

Also, it is probably worthless to test only sample code. You should try to test the entire code and see which option(s) work best when the whole application is running.
Post 15 Oct 2017, 10:59
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 868
I had to Google what "tetrad" means (maybe I suck Sad) but this is usually called a "nibble" (4-bits).

The "pattern" here is that either the number mod 16 is 15, or it is higher than 239, it's the only time it will give out the answer "no". I'm not sure if you can use "ah" (i.e. rest of eax) as you didn't mention it, but if you can, let's assume ah is zero, try something like:

Code:
add eax16+1
dec ah
and alah
test al15
jz no

The first one produces a carry into ah if the number was 239 or greater, which means "no" (since 239 mod 16 is 15, and >= 240 is the other condition for "no"). The addition places the carry into ah. Then we decrease ah, resulting in all 1s if there was no carry, and zero otherwise. We then "mask" the resulting al with this: if carry, set al to zero.

Then we test for low 15 bits to see if they are zeros -- because they are zero only if they were 15 (we added '1' to them when we did first addition, that's why "16+1") or if the carry was produced (it zeroed al). Meaning they should be zero if number was >= 240 or if low 4 bits were all 1s.

I only wanted to use one branch for obvious reasons. I have no idea if it's faster than a lookup table though, but you can't test if there is no alternative Smile
Post 15 Oct 2017, 12:32
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15241
Location: 1I/ʻOumuamua
Furs: That is a great analysis. But is it faster?
Post 15 Oct 2017, 12:39
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 868
I don't know, it probably is in some cases (it depends on surrounding code, since otherwise if branch gets predicted properly all the time it doesn't matter which one you use), probably not in others etc.

But still you need "different" ways to be able to test them, can't test something when you don't even know alternative Razz
Post 15 Oct 2017, 12:40
View user's profile Send private message Reply with quote
Tomasz Grysztar
Assembly Artist


Joined: 16 Jun 2003
Posts: 6633
Location: Kraków, Poland
Furst: I think your first two instructions may be contracted to:

Code:
add eax,(-1shl 8 + 16 + 1

Post 15 Oct 2017, 12:41
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 868
I was worried about the carry going into the al but it seems you are right! Wink I just tested it, so now it's:

Code:
add eax,(-1shl 8 + 16 + 1
and alah
test al15
jz no

I think now it's definitely "faster" or at least the "same" speed. Also much more friendly to the cache (no need for a lookup table using 256 bytes).
Post 15 Oct 2017, 12:48
View user's profile Send private message Reply with quote
Tomasz Grysztar
Assembly Artist


Joined: 16 Jun 2003
Posts: 6633
Location: Kraków, Poland

Furs wrote:
I was worried about the carry going into the al

Well, we could just use 0FF00h instead of "(-1) shl 8" and then forget about carry (it would land in the high part of EAX that does not concern us).
Post 15 Oct 2017, 12:51
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15241
Location: 1I/ʻOumuamua

Furs wrote:
I think now it's definitely "faster" or at least the "same" speed. Also much more friendly to the cache (no need for a lookup table using 256 bytes).

It probably uses more bytes in the instruction cache, but fewer bytes in the data cache. Maybe it is "faster", maybe not. We would need the OP to test it.
Post 15 Oct 2017, 13:35
View user's profile Send private message Visit poster's website Reply with quote
Tomasz Grysztar
Assembly Artist


Joined: 16 Jun 2003
Posts: 6633
Location: Kraków, Poland
Because this is just a yes/no test, you could even try BT instead of XLAT. Lookup table would then be 32 bytes instead of 256. Wink
Post 15 Oct 2017, 13:49
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 868
Yeah but AFAIK bt with "mem, reg" operands is super slow. Sad (only that combination)
Post 15 Oct 2017, 14:01
View user's profile Send private message Reply with quote
l_inc



Joined: 23 Oct 2009
Posts: 881
This one does not need to assume an initial value of ah and seems almost 2 times faster on my machine (not taking the branch into account):

Code:
add al,$11
lahf
test ah,$11
jnz .no


lahf may be unavailable in 64-bit mode on some machines and may also be slower on some other machines.

_________________
Faith is a superposition of knowledge and fallacy
Post 15 Oct 2017, 16:14
View user's profile Send private message Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 868
Nice, I completely forgot that the Auxiliary Flag even exists. Smile
Post 15 Oct 2017, 16:31
View user's profile Send private message Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 868
BTW without using the Auxiliary Flag trick, I realized you don't have to assume or touch the upper parts of eax, as long as you know 'ah' is zero. i.e.

Code:
add eax15 shl 8 + 16 + 1
and alah
test al15
jz no

My brain wasn't functioning correctly yesterday here. Razz I mean, we're only interested in zeroing the lower 4 bits, don't care of the others.
Post 16 Oct 2017, 12:00
View user's profile Send private message Reply with quote
SergeASM



Joined: 13 Nov 2015
Posts: 21

Furs wrote:
I had to Google what "tetrad" means (maybe I suck Sad) but this is usually called a "nibble" (4-bits).


Sorry, I'm not an English speaker. Google Translate gave me two results: tetrad and quaternion. Smile

Thanks, but I wrote my naive test

Code:
xor ah,ah
shl eax,4
cmp al,$F0
je .no
cmp ah,$0F
je .no

and this test works faster than both above examples (on my Athlon II X2 under Win32). Smile

_________________
Serge
Post 21 Oct 2017, 11:57
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 868
With random values? Two branches would be much worse in a non-simulated test, at least on newer CPUs, but I have no idea about Athlon II. I mean, make sure you test it with "reasonable" values you can expect in the program etc.

With two branches tell me if the following is faster on your CPU? Confused

Code:
add al16 + 1
jc .no
test al15
jz .no

(no need for ah at all)
Post 21 Oct 2017, 12:42
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15241
Location: 1I/ʻOumuamua
Make your tests in the real code. No amount of simulation will help IMO.
Post 21 Oct 2017, 13:30
View user's profile Send private message Visit poster's website Reply with quote
_shura



Joined: 22 May 2015
Posts: 60

Code:

not al
and al0x0f
jz yes
and al0xf0
jz yes


Post 21 Oct 2017, 18:19
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 868
@_shura You meant to use "test" right? Now the second one will always be 0. test is also better as it can micro-fuse on newer CPUs with the branch (speed may be the same, but it's one less micro-op, which can help in other situations).

I doubt it's faster than the add+jc I posted (also two branches) even on Athlon which probably doesn't even do micro-fusing (because add->jc is probably not micro-fused since it's not very "common" like test/cmp + branch) since it has 1 more insn so (at least) 1 more micro op.
Post 21 Oct 2017, 20:28
View user's profile Send private message Reply with quote
_shura



Joined: 22 May 2015
Posts: 60
nobody said al had to contain the initial value Very Happy
what about

Code:

not al
test al0x0f
jz @f ;only one jump
test al0xf0
@@:
;ZF set: at least one nibble is 1111.
;ZF cleared: each nibble contains at least one null.
ret


Post 21 Oct 2017, 20:35
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2  Next

< Last Thread | Next Thread >

Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Powered by phpBB © 2001-2005 phpBB Group.

Main index   Download   Documentation   Examples   Message board
Copyright © 2004-2016, Tomasz Grysztar.