flat assembler
Message board for the users of flat assembler.

 Index > Main > How to recognize whether AL has a zero bit in each tetrad? Goto page 1, 2  Next
Author
SergeASM

Joined: 13 Nov 2015
Posts: 21
SergeASM 15 Oct 2017, 08:29
I want to quickly recognize whether AL has a zero bit in each tetrad. I.e.

AL=11110001 not
AL=00000000 yes
AL=01111111 not
AL=00111011 yes
...

Is there a method faster than xlat?

TIA,
Serge
15 Oct 2017, 08:29
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20068
revolution 15 Oct 2017, 10:59
SergeASM wrote:
... faster ...
Every machine will be different. Yes, really. You will have to test a range of possible solutions on the target machine to determine which is "fastest" for that machine.

Also, it is probably worthless to test only sample code. You should try to test the entire code and see which option(s) work best when the whole application is running.
15 Oct 2017, 10:59
Furs

Joined: 04 Mar 2016
Posts: 2422
Furs 15 Oct 2017, 12:32
I had to Google what "tetrad" means (maybe I suck ) but this is usually called a "nibble" (4-bits).

The "pattern" here is that either the number mod 16 is 15, or it is higher than 239, it's the only time it will give out the answer "no". I'm not sure if you can use "ah" (i.e. rest of eax) as you didn't mention it, but if you can, let's assume ah is zero, try something like:
Code:
```add eax, 16+1
dec ah
and al, ah
test al, 15
jz no    ```
The first one produces a carry into ah if the number was 239 or greater, which means "no" (since 239 mod 16 is 15, and >= 240 is the other condition for "no"). The addition places the carry into ah. Then we decrease ah, resulting in all 1s if there was no carry, and zero otherwise. We then "mask" the resulting al with this: if carry, set al to zero.

Then we test for low 15 bits to see if they are zeros -- because they are zero only if they were 15 (we added '1' to them when we did first addition, that's why "16+1") or if the carry was produced (it zeroed al). Meaning they should be zero if number was >= 240 or if low 4 bits were all 1s.

I only wanted to use one branch for obvious reasons. I have no idea if it's faster than a lookup table though, but you can't test if there is no alternative
15 Oct 2017, 12:32
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20068
revolution 15 Oct 2017, 12:39
Furs: That is a great analysis. But is it faster?
15 Oct 2017, 12:39
Furs

Joined: 04 Mar 2016
Posts: 2422
Furs 15 Oct 2017, 12:40
I don't know, it probably is in some cases (it depends on surrounding code, since otherwise if branch gets predicted properly all the time it doesn't matter which one you use), probably not in others etc.

But still you need "different" ways to be able to test them, can't test something when you don't even know alternative
15 Oct 2017, 12:40
Tomasz Grysztar

Joined: 16 Jun 2003
Posts: 8340
Location: Kraków, Poland
Tomasz Grysztar 15 Oct 2017, 12:41
Furst: I think your first two instructions may be contracted to:
Code:
`add eax,(-1) shl 8 + 16 + 1    `
15 Oct 2017, 12:41
Furs

Joined: 04 Mar 2016
Posts: 2422
Furs 15 Oct 2017, 12:48
I was worried about the carry going into the al but it seems you are right! I just tested it, so now it's:
Code:
```add eax,(-1) shl 8 + 16 + 1
and al, ah
test al, 15
jz no    ```
I think now it's definitely "faster" or at least the "same" speed. Also much more friendly to the cache (no need for a lookup table using 256 bytes).
15 Oct 2017, 12:48
Tomasz Grysztar

Joined: 16 Jun 2003
Posts: 8340
Location: Kraków, Poland
Tomasz Grysztar 15 Oct 2017, 12:51
Furs wrote:
I was worried about the carry going into the al
Well, we could just use 0FF00h instead of "(-1) shl 8" and then forget about carry (it would land in the high part of EAX that does not concern us).
15 Oct 2017, 12:51
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20068
revolution 15 Oct 2017, 13:35
Furs wrote:
I think now it's definitely "faster" or at least the "same" speed. Also much more friendly to the cache (no need for a lookup table using 256 bytes).
It probably uses more bytes in the instruction cache, but fewer bytes in the data cache. Maybe it is "faster", maybe not. We would need the OP to test it.
15 Oct 2017, 13:35
Tomasz Grysztar

Joined: 16 Jun 2003
Posts: 8340
Location: Kraków, Poland
Tomasz Grysztar 15 Oct 2017, 13:49
Because this is just a yes/no test, you could even try BT instead of XLAT. Lookup table would then be 32 bytes instead of 256.
15 Oct 2017, 13:49
Furs

Joined: 04 Mar 2016
Posts: 2422
Furs 15 Oct 2017, 14:01
Yeah but AFAIK bt with "mem, reg" operands is super slow. (only that combination)
15 Oct 2017, 14:01
l_inc

Joined: 23 Oct 2009
Posts: 881
l_inc 15 Oct 2017, 16:14
This one does not need to assume an initial value of ah and seems almost 2 times faster on my machine (not taking the branch into account):
Code:
```add al,\$11
lahf
test ah,\$11
jnz .no    ```

lahf may be unavailable in 64-bit mode on some machines and may also be slower on some other machines.

_________________
Faith is a superposition of knowledge and fallacy
15 Oct 2017, 16:14
Furs

Joined: 04 Mar 2016
Posts: 2422
Furs 15 Oct 2017, 16:31
Nice, I completely forgot that the Auxiliary Flag even exists.
15 Oct 2017, 16:31
Furs

Joined: 04 Mar 2016
Posts: 2422
Furs 16 Oct 2017, 12:00
BTW without using the Auxiliary Flag trick, I realized you don't have to assume or touch the upper parts of eax, as long as you know 'ah' is zero. i.e.
Code:
```add eax, 15 shl 8 + 16 + 1
and al, ah
test al, 15
jz no    ```
My brain wasn't functioning correctly yesterday here. I mean, we're only interested in zeroing the lower 4 bits, don't care of the others.
16 Oct 2017, 12:00
SergeASM

Joined: 13 Nov 2015
Posts: 21
SergeASM 21 Oct 2017, 11:57
Furs wrote:
I had to Google what "tetrad" means (maybe I suck ) but this is usually called a "nibble" (4-bits).

Sorry, I'm not an English speaker. Google Translate gave me two results: tetrad and quaternion.

Thanks, but I wrote my naive test
Code:
```xor ah,ah
shl eax,4
cmp al,\$F0
je .no
cmp ah,\$0F
je .no    ```
and this test works faster than both above examples (on my Athlon II X2 under Win32).

_________________
Serge
21 Oct 2017, 11:57
Furs

Joined: 04 Mar 2016
Posts: 2422
Furs 21 Oct 2017, 12:42
With random values? Two branches would be much worse in a non-simulated test, at least on newer CPUs, but I have no idea about Athlon II. I mean, make sure you test it with "reasonable" values you can expect in the program etc.

With two branches tell me if the following is faster on your CPU?
Code:
```add al, 16 + 1
jc .no
test al, 15
jz .no    ```
(no need for ah at all)
21 Oct 2017, 12:42
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20068
revolution 21 Oct 2017, 13:30
Make your tests in the real code. No amount of simulation will help IMO.
21 Oct 2017, 13:30
_shura

Joined: 22 May 2015
Posts: 61
_shura 21 Oct 2017, 18:19
Code:
```not al
and al, 0x0f
jz yes
and al, 0xf0
jz yes
```
21 Oct 2017, 18:19
Furs

Joined: 04 Mar 2016
Posts: 2422
Furs 21 Oct 2017, 20:28
@_shura You meant to use "test" right? Now the second one will always be 0. test is also better as it can micro-fuse on newer CPUs with the branch (speed may be the same, but it's one less micro-op, which can help in other situations).

I doubt it's faster than the add+jc I posted (also two branches) even on Athlon which probably doesn't even do micro-fusing (because add->jc is probably not micro-fused since it's not very "common" like test/cmp + branch) since it has 1 more insn so (at least) 1 more micro op.
21 Oct 2017, 20:28
_shura

Joined: 22 May 2015
Posts: 61
_shura 21 Oct 2017, 20:35
nobody said al had to contain the initial value
Code:
```not al
test al, 0x0f
jz @f ;only one jump
test al, 0xf0
@@:
;ZF set: at least one nibble is 1111.
;ZF cleared: each nibble contains at least one null.
ret
```
21 Oct 2017, 20:35
 Display posts from previous: All Posts1 Day7 Days2 Weeks1 Month3 Months6 Months1 Year Oldest FirstNewest First

 Jump to: Select a forum Official----------------AssemblyPeripheria General----------------MainTutorials and ExamplesDOSWindowsLinuxUnixMenuetOS Specific----------------MacroinstructionsOS ConstructionIDE DevelopmentProjects and IdeasNon-x86 architecturesHigh Level LanguagesProgramming Language DesignCompiler Internals Other----------------FeedbackHeapTest Area
Goto page 1, 2  Next

Forum Rules:
 You cannot post new topics in this forumYou cannot reply to topics in this forumYou cannot edit your posts in this forumYou cannot delete your posts in this forumYou cannot vote in polls in this forumYou cannot attach files in this forumYou can download files in this forum