flat assembler
Message board for the users of flat assembler.

Index > Main > clearing the bit 63 of a register

Goto page 1, 2  Next
Author
Thread Post new topic Reply to topic
sylware



Joined: 23 Oct 2020
Posts: 180
Location: Marseille/France
sylware 24 Sep 2022, 14:44
Yeah, reasonably only btr?
Post 24 Sep 2022, 14:44
View user's profile Send private message Reply with quote
CandyMan



Joined: 04 Sep 2009
Posts: 368
Location: film "CandyMan" directed through Bernard Rose OR Candy Shop
CandyMan 24 Sep 2022, 17:09
Code:
mov rdx,not (1 shl 63)
and rax,rdx
;
mov rdx,1 shl 63
andn rax,rdx,rax
;
shl rax,1
clc
rcr rax,1    

_________________
smaller is better
Post 24 Sep 2022, 17:09
View user's profile Send private message Reply with quote
SeproMan



Joined: 11 Oct 2009
Posts: 62
Location: Belgium
SeproMan 24 Sep 2022, 20:09
A simple, compact, 2 instructions alternativ:

Code:
shl rax, 1
shr rax, 1    


The shr rax, 1 is faster than clc followed by rcr rax, 1.

_________________
Real Address Mode.
Post 24 Sep 2022, 20:09
View user's profile Send private message Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 673
Location: Russia
macomics 24 Sep 2022, 21:18
That's just when using the shr command, the remaining bits (62-0) will change, and the task is only to reset 63 bit. To do this, the shortest and fastest way is to use the btr/bts/btc commands, if we are talking about only one bit in the register.

By the way, when using the bt/btr/bts/btc commands in the form
Code:
bt qword [rax],rcx    
it is possible to work with a bit string longer than 8 bytes. In the 32-bit version, this string could be up to 512Mb. About the limitations for 64-bit variants, I need to look in the manual.
Post 24 Sep 2022, 21:18
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18942
Location: In your JS exploiting you and your system
revolution 24 Sep 2022, 22:32
SeproMan wrote:
The shr rax, 1 is faster than clc followed by rcr rax, 1.
This might correct in some cases. There are many other things that can affect the runtime that can overwhelm the execution time of an extra instruction.

All we can really say by just reading the code is that it uses one fewer instruction. Anything else is speculation.
Post 24 Sep 2022, 22:32
View user's profile Send private message Visit poster's website Reply with quote
sylware



Joined: 23 Oct 2020
Posts: 180
Location: Marseille/France
sylware 25 Sep 2022, 13:52
then if "btr reg, imm" has proper silicium pathing, it'll be more efficient than shl/shr.

I guess I'll go for this, I don't need to preserve the carry flag.

I know bt* instructions have to be handled with care, since their silicium "setup" can be quite expensive. (you know what did happen with short and big accelerated rep sto* and rep mov*)
Post 25 Sep 2022, 13:52
View user's profile Send private message Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 1946
Furs 25 Sep 2022, 13:56
AFAIK, bt* instructions are super fast as long as you don't use memory operands. It's only the memory operand versions that are slow and micro-coded (mostly because of what macomics wrote about them addressing more than 8 bytes). Agner Fog's optimization manuals are the Bible here.

revolution wrote:
SeproMan wrote:
The shr rax, 1 is faster than clc followed by rcr rax, 1.
This might correct in some cases. There are many other things that can affect the runtime that can overwhelm the execution time of an extra instruction.

All we can really say by just reading the code is that it uses one fewer instruction. Anything else is speculation.
Probably because it has more latency and same/less throughput, on most (or all?) CPUs.
Post 25 Sep 2022, 13:56
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18942
Location: In your JS exploiting you and your system
revolution 25 Sep 2022, 14:13
Furs wrote:
Probably because it has more latency and same/less throughput, on most (or all?) CPUs.
And in a lot of cases this has no measurable effect.

Agner Fog's timing data is only relevant in a few niche code situations that most code never encounters in a normal program. Counting cycles by looking at the instructions is pointless. There is probably 100x more benefit to optimising the memory accesses using "slower" instructions, or more instruction. Try it one day, you might be surprised by the results.

Perhaps watching the end of this lecture will shed some light on how systems today are mostly under-utilised due to memory bottlenecks. CPU instruction selection won't help you. Other effects completely dwarf anything you could hope to measure by changing just one instruction.

Note the actual vs theoretical, 0.9% utilisation only in one case. That is from the current best in class software, running on the supercomputers, written by experts.
Post 25 Sep 2022, 14:13
View user's profile Send private message Visit poster's website Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 673
Location: Russia
macomics 25 Sep 2022, 14:17
Furs wrote:
bt* instructions are super fast as long as you don't use memory operands. It's only the memory operand versions that are slow and micro-coded (mostly because of what macomics wrote about them addressing more than 8 bytes).

I am familiar with this disadvantage. That's why instructions are not so popular for working with bit strings. It is much easier to work with memory with ordinary logical commands. But the initial idea of the command with the receiver in memory was good. As always, the implementation failed.
Post 25 Sep 2022, 14:17
View user's profile Send private message Reply with quote
DimonSoft



Joined: 03 Mar 2010
Posts: 1115
Location: Belarus
DimonSoft 25 Sep 2022, 15:24
macomics wrote:
Furs wrote:
bt* instructions are super fast as long as you don't use memory operands. It's only the memory operand versions that are slow and micro-coded (mostly because of what macomics wrote about them addressing more than 8 bytes).

I am familiar with this disadvantage. That's why instructions are not so popular for working with bit strings. It is much easier to work with memory with ordinary logical commands. But the initial idea of the command with the receiver in memory was good. As always, the implementation failed.

Do they really work slower than calculating the byte offset and mask manually taking a few registers for that? What about the impact on the performance of the nearby code that now becomes more limited in terms of available registers? (I obviously understand that sometimes the offset or mask are already there, just asking about the common case.)
Post 25 Sep 2022, 15:24
View user's profile Send private message Visit poster's website Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 673
Location: Russia
macomics 25 Sep 2022, 15:48
DimonSoft wrote:
I obviously understand that sometimes the offset or mask are already there, just asking about the common case.
Perhaps you will do something more optimal, but btr can generally be replaced with this code
Code:
; rdx - bit string pointer, rcx - bit number
mov ah, cl
mov al, 1
and cl, 7
shl al, cl
mov cl, ah
xor al, -1
shr rcx, 3
and [rdx + rcx], al

; or

btr qword [rdx], rcx    
Post 25 Sep 2022, 15:48
View user's profile Send private message Reply with quote
DimonSoft



Joined: 03 Mar 2010
Posts: 1115
Location: Belarus
DimonSoft 25 Sep 2022, 18:15
macomics wrote:
Perhaps you will do something more optimal, but btr can generally be replaced with this code
Code:
; rdx - bit string pointer, rcx - bit number
mov ah, cl
mov al, 1
and cl, 7
shl al, cl
mov cl, ah
xor al, -1
shr rcx, 3
and [rdx + rcx], al

; or

btr qword [rdx], rcx    

Yep. And that’s the point of my question. Is this (or similar) code containing lots of data dependencies (both real and fake) expected to run faster than a single instruction with somewhat worse “cycle count”? Especially if we keep in mind that the outer piece of code is now subject to the fact that ?AX register will be used in addition to those already involved. Probably even with partial register access stalls.

Special disclaimer for revolution. Yes, we should measure, but I ask about expectations and probabilities, and since we all know a billion of instructions runs slower than a single one and the possibility of multiple instructions running faster than a single one is limited to the processor’s internals, especially the number of pipeline stages and pipelines themselves as well as other stuff which can be counted with not-very-large natural numbers, there definitely is some educated guess about the resulting performance.
Post 25 Sep 2022, 18:15
View user's profile Send private message Visit poster's website Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 673
Location: Russia
macomics 25 Sep 2022, 19:27
Only I wrote this code observing the layout of the parameters. The parameters at the input of the btr command and this command block are the same. When implemented in practice, you may not use end-to-end bit indexing, but store the bit index and byte index separately from each other.
Post 25 Sep 2022, 19:27
View user's profile Send private message Reply with quote
DimonSoft



Joined: 03 Mar 2010
Posts: 1115
Location: Belarus
DimonSoft 25 Sep 2022, 20:30
Yep, I’ve mentioned that generally some optimization might be available. Anyway, the characteristics of bt* don’t seem to be too bad to avoid it or performance reasons. Writing an equivalent piece of code that would outperform it significantly is highly unlikely. After all, the memory access still can’t be avoided due to the nature of the instruction (for memory operand). Readability (for asm programming), machine code size (cache-friendliness), number of instructions and registers involved (pipelining friendliness). I’d take bt*, and give me two. Not that I’ve ever really needed it though.
Post 25 Sep 2022, 20:30
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18942
Location: In your JS exploiting you and your system
revolution 25 Sep 2022, 22:36
If you don't test then you don't know.

Science:
1. Make a hypothesis
2. Test the hypothesis
3. Use the results to update your knowledge..

Religion
1. Make a hypothesis.
2. Assume it is true.
3. Fight to the death to defend your beliefs.


Last edited by revolution on 26 Sep 2022, 04:31; edited 1 time in total
Post 25 Sep 2022, 22:36
View user's profile Send private message Visit poster's website Reply with quote
sinsi



Joined: 10 Aug 2007
Posts: 755
Location: Adelaide
sinsi 26 Sep 2022, 04:07
Q. clearing the bit 63 of a register
A. SHR reg,1
Post 26 Sep 2022, 04:07
View user's profile Send private message Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 673
Location: Russia
macomics 26 Sep 2022, 04:38
Quote:
Q. clearing the bit 63 of a register
A. SHR reg,1
#2 SeproMan
Code:
rol reg, 1
shr reg, 1    
Post 26 Sep 2022, 04:38
View user's profile Send private message Reply with quote
DimonSoft



Joined: 03 Mar 2010
Posts: 1115
Location: Belarus
DimonSoft 26 Sep 2022, 09:04
revolution wrote:
If you don't test then you don't know.

Science:
1. Make a hypothesis
2. Test the hypothesis
3. Use the results to update your knowledge..

Religion
1. Make a hypothesis.
2. Assume it is true.
3. Fight to the death to defend your beliefs.

Please, make sure you’ve read the disclaimer specifically targeted at you. Thanks.
Post 26 Sep 2022, 09:04
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18942
Location: In your JS exploiting you and your system
revolution 26 Sep 2022, 09:26
DimonSoft wrote:
Please, make sure you’ve read the disclaimer specifically targeted at you. Thanks.
Okay.
DimonSoft wrote:
... and the possibility of multiple instructions running faster than a single one is limited to the processor’s internals, especially the number of pipeline stages and pipelines
That isn't what I meant.

You can, for example, do stream pre-loads then process data with more instructions (that theoretically take longer) while the data are in the CPU, and stream store results. Bonus points for doing overlapped, loads with processing of the previous data. Compared to using fewer instructions on a single datum point with multiple rounds into memory and back for each layer of computation.

You can also do more with multiple fields in a data structure while you have it in a register. To save storing it as separate fields. Sometimes this needs more instructions to complete the computation, but is an overall time saver if memory accesses are optimised well.

And as with all of these techniques, testing will guide better than guessing or assuming (or counting cycles from Agner Fog's files).
Post 26 Sep 2022, 09:26
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 3511
Location: vpcmipstrm
bitRAKE 26 Sep 2022, 14:06
Code:
value dq 0x...

and byte [value+7], 0x7F
mov reg64, [value]    
... just a perspective not in the thread - to mask the bit prior to load.

_________________
¯\(°_o)/¯ unlicense.org
Post 26 Sep 2022, 14:06
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.

Website powered by rwasa.