flat assembler
Message board for the users of flat assembler.
Index
> Main > clearing the bit 63 of a register Goto page 1, 2 Next |
Author |
|
sylware 24 Sep 2022, 14:44
Yeah, reasonably only btr?
|
|||
24 Sep 2022, 14:44 |
|
CandyMan 24 Sep 2022, 17:09
Code: mov rdx,not (1 shl 63) and rax,rdx ; mov rdx,1 shl 63 andn rax,rdx,rax ; shl rax,1 clc rcr rax,1 _________________ smaller is better |
|||
24 Sep 2022, 17:09 |
|
macomics 24 Sep 2022, 21:18
That's just when using the shr command, the remaining bits (62-0) will change, and the task is only to reset 63 bit. To do this, the shortest and fastest way is to use the btr/bts/btc commands, if we are talking about only one bit in the register.
By the way, when using the bt/btr/bts/btc commands in the form Code: bt qword [rax],rcx |
|||
24 Sep 2022, 21:18 |
|
revolution 24 Sep 2022, 22:32
SeproMan wrote: The shr rax, 1 is faster than clc followed by rcr rax, 1. All we can really say by just reading the code is that it uses one fewer instruction. Anything else is speculation. |
|||
24 Sep 2022, 22:32 |
|
sylware 25 Sep 2022, 13:52
then if "btr reg, imm" has proper silicium pathing, it'll be more efficient than shl/shr.
I guess I'll go for this, I don't need to preserve the carry flag. I know bt* instructions have to be handled with care, since their silicium "setup" can be quite expensive. (you know what did happen with short and big accelerated rep sto* and rep mov*) |
|||
25 Sep 2022, 13:52 |
|
Furs 25 Sep 2022, 13:56
AFAIK, bt* instructions are super fast as long as you don't use memory operands. It's only the memory operand versions that are slow and micro-coded (mostly because of what macomics wrote about them addressing more than 8 bytes). Agner Fog's optimization manuals are the Bible here.
revolution wrote:
|
|||
25 Sep 2022, 13:56 |
|
revolution 25 Sep 2022, 14:13
Furs wrote: Probably because it has more latency and same/less throughput, on most (or all?) CPUs. Agner Fog's timing data is only relevant in a few niche code situations that most code never encounters in a normal program. Counting cycles by looking at the instructions is pointless. There is probably 100x more benefit to optimising the memory accesses using "slower" instructions, or more instruction. Try it one day, you might be surprised by the results. Perhaps watching the end of this lecture will shed some light on how systems today are mostly under-utilised due to memory bottlenecks. CPU instruction selection won't help you. Other effects completely dwarf anything you could hope to measure by changing just one instruction. Note the actual vs theoretical, 0.9% utilisation only in one case. That is from the current best in class software, running on the supercomputers, written by experts. |
|||
25 Sep 2022, 14:13 |
|
macomics 25 Sep 2022, 14:17
Furs wrote: bt* instructions are super fast as long as you don't use memory operands. It's only the memory operand versions that are slow and micro-coded (mostly because of what macomics wrote about them addressing more than 8 bytes). I am familiar with this disadvantage. That's why instructions are not so popular for working with bit strings. It is much easier to work with memory with ordinary logical commands. But the initial idea of the command with the receiver in memory was good. As always, the implementation failed. |
|||
25 Sep 2022, 14:17 |
|
DimonSoft 25 Sep 2022, 15:24
macomics wrote:
Do they really work slower than calculating the byte offset and mask manually taking a few registers for that? What about the impact on the performance of the nearby code that now becomes more limited in terms of available registers? (I obviously understand that sometimes the offset or mask are already there, just asking about the common case.) |
|||
25 Sep 2022, 15:24 |
|
macomics 25 Sep 2022, 15:48
DimonSoft wrote: I obviously understand that sometimes the offset or mask are already there, just asking about the common case. Code: ; rdx - bit string pointer, rcx - bit number mov ah, cl mov al, 1 and cl, 7 shl al, cl mov cl, ah xor al, -1 shr rcx, 3 and [rdx + rcx], al ; or btr qword [rdx], rcx |
|||
25 Sep 2022, 15:48 |
|
DimonSoft 25 Sep 2022, 18:15
macomics wrote: Perhaps you will do something more optimal, but btr can generally be replaced with this code Yep. And that’s the point of my question. Is this (or similar) code containing lots of data dependencies (both real and fake) expected to run faster than a single instruction with somewhat worse “cycle count”? Especially if we keep in mind that the outer piece of code is now subject to the fact that ?AX register will be used in addition to those already involved. Probably even with partial register access stalls. Special disclaimer for revolution. Yes, we should measure, but I ask about expectations and probabilities, and since we all know a billion of instructions runs slower than a single one and the possibility of multiple instructions running faster than a single one is limited to the processor’s internals, especially the number of pipeline stages and pipelines themselves as well as other stuff which can be counted with not-very-large natural numbers, there definitely is some educated guess about the resulting performance. |
|||
25 Sep 2022, 18:15 |
|
macomics 25 Sep 2022, 19:27
Only I wrote this code observing the layout of the parameters. The parameters at the input of the btr command and this command block are the same. When implemented in practice, you may not use end-to-end bit indexing, but store the bit index and byte index separately from each other.
|
|||
25 Sep 2022, 19:27 |
|
DimonSoft 25 Sep 2022, 20:30
Yep, I’ve mentioned that generally some optimization might be available. Anyway, the characteristics of bt* don’t seem to be too bad to avoid it or performance reasons. Writing an equivalent piece of code that would outperform it significantly is highly unlikely. After all, the memory access still can’t be avoided due to the nature of the instruction (for memory operand). Readability (for asm programming), machine code size (cache-friendliness), number of instructions and registers involved (pipelining friendliness). I’d take bt*, and give me two. Not that I’ve ever really needed it though.
|
|||
25 Sep 2022, 20:30 |
|
revolution 25 Sep 2022, 22:36
If you don't test then you don't know.
Science: 1. Make a hypothesis 2. Test the hypothesis 3. Use the results to update your knowledge.. Religion 1. Make a hypothesis. 2. Assume it is true. 3. Fight to the death to defend your beliefs. Last edited by revolution on 26 Sep 2022, 04:31; edited 1 time in total |
|||
25 Sep 2022, 22:36 |
|
sinsi 26 Sep 2022, 04:07
Q. clearing the bit 63 of a register
A. SHR reg,1 |
|||
26 Sep 2022, 04:07 |
|
macomics 26 Sep 2022, 04:38
|
|||
26 Sep 2022, 04:38 |
|
DimonSoft 26 Sep 2022, 09:04
revolution wrote: If you don't test then you don't know. Please, make sure you’ve read the disclaimer specifically targeted at you. Thanks. |
|||
26 Sep 2022, 09:04 |
|
revolution 26 Sep 2022, 09:26
DimonSoft wrote: Please, make sure you’ve read the disclaimer specifically targeted at you. Thanks. DimonSoft wrote: ... and the possibility of multiple instructions running faster than a single one is limited to the processor’s internals, especially the number of pipeline stages and pipelines You can, for example, do stream pre-loads then process data with more instructions (that theoretically take longer) while the data are in the CPU, and stream store results. Bonus points for doing overlapped, loads with processing of the previous data. Compared to using fewer instructions on a single datum point with multiple rounds into memory and back for each layer of computation. You can also do more with multiple fields in a data structure while you have it in a register. To save storing it as separate fields. Sometimes this needs more instructions to complete the computation, but is an overall time saver if memory accesses are optimised well. And as with all of these techniques, testing will guide better than guessing or assuming (or counting cycles from Agner Fog's files). |
|||
26 Sep 2022, 09:26 |
|
bitRAKE 26 Sep 2022, 14:06
Code: value dq 0x... and byte [value+7], 0x7F mov reg64, [value] _________________ ¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup |
|||
26 Sep 2022, 14:06 |
|
Goto page 1, 2 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.