flat assembler
Message board for the users of flat assembler.

How to efficiently exchange register with memory?

MCD



Joined: 21 Aug 2004
Posts: 604
Location: Germany
The x86 has the XCHG instruction. Unfortunately, it is slow on almost every microarchitecture when used with a memory operand.

Say you want to exchange a DWORD at ESI with the contents of register EAX. The obvious way to optimize this is to use the following code:
Code:
;either this way
        mov     edx,eax
        mov     eax,[esi]
        mov     [esi],edx
;or the other way round
        mov     edx,[esi]
        mov     [esi],eax
        mov     eax,edx
    

Funnily, this is a bit faster in most cases, but the problem with this code is that you need an additional register to hold a temporary value. You can avoid this drawback with the following code:
Code:
use32
;either this way
        xor     eax,[esi]
        xor     [esi],eax
        xor     eax,[esi]
;or the other way round
        xor     [esi],eax
        xor     eax,[esi]
        xor     [esi],eax
    

Now you have freed one register, which may speed up some time-sensitive loops. Unfortunately, this code doesn't break the register dependency: each XOR needs the result of the one before it, which is another drawback. The same applies to the versions where the XORs are replaced with ADDs and SUBs.
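
For comparison, the two temp-free exchanges can be modelled in C. This is only a sketch: `reg` stands in for EAX and `*mem` for the DWORD at [ESI], and both function names are made up for illustration. The compound assignments mirror two-operand instructions one for one. In this sketch the ADD/SUB form ends with a negation, because a two-operand subtract leaves the negated value behind; that fourth step is an observation about this particular ordering, not something stated in the post.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: models
       xor eax,[esi] / xor [esi],eax / xor eax,[esi]
   reg plays the role of EAX, *mem the DWORD at [ESI]. */
static void xor_exchange(uint32_t *reg, uint32_t *mem)
{
    assert(reg != mem);   /* the same object would be XORed to zero */
    *reg ^= *mem;         /* reg = a ^ b           */
    *mem ^= *reg;         /* mem = b ^ (a ^ b) = a */
    *reg ^= *mem;         /* reg = (a ^ b) ^ a = b */
}

/* Hypothetical helper: the same idea with ADD and SUB.
   Unsigned arithmetic wraps modulo 2^32, like the CPU does. */
static void addsub_exchange(uint32_t *reg, uint32_t *mem)
{
    assert(reg != mem);
    *reg += *mem;         /* add eax,[esi]   : reg = a + b        */
    *mem -= *reg;         /* sub [esi],eax   : mem = b - (a + b)  */
    *reg += *mem;         /* add eax,[esi]   : reg = (a + b) - a  */
    *mem = 0u - *mem;     /* neg dword [esi] : mem = a            */
}
```

Every step reads what the previous step wrote, which is the dependency chain described above.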

Does anyone know an even better way to exchange a register and memory operand?

_________________
MCD - the inevitable return of the Mad Computer Doggy

-||__/
.|+-~
.|| ||
Post 08 Jan 2008, 08:04
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
XOR doesn't break dependency??? That's new - I've always thought it did - better get back to the manuals.

EDIT: Oh, I got it. XOR breaks the dependency only when src=dest (e.g. xor eax,eax). And in this case we don't even have a false dependence, but a real one.
Code:
;First let's look at this one:
        xor     eax,[esi] ; I think it's better to have 2 reads (2R+1W),
        xor     [esi],eax ; because XOR m32,r32 has 6 clocks of latency;
        xor     eax,[esi] ; otherwise you'd have a theoretical minimum of 12 clocks
;and it would finish no earlier than 6 clocks after the final instruction.
    

This version is better because the 6-clock wait on [esi] is over before the final xor eax,[esi], which then finishes immediately (no additional latency is listed for xor r32,m32). You'll get something like this:
Code:
        xor     eax,[esi] ; Takes port0 as first vacant
        xor     [esi],eax ; Starts in the same clock occupying port1,2,3,4
        xor     eax,[esi] ; on the 7th clock port0 is vacant
        ; Here you can have instructions operating on eax or [esi] without
        ; any latency. port1 (and 5 on Core 2), ports 2,3 and 4 are vacant.
    


Last edited by Madis731 on 08 Jan 2008, 10:24; edited 1 time in total
Post 08 Jan 2008, 09:58
Vov4ik



Joined: 20 May 2007
Posts: 28
Location: USSR, Moscow Region
I think XOR with a memory operand will be split into more than one micro-op: reading, XORing and then writing back. So, in my opinion, MOVing is faster. But it would be better, as Madis731 said, to consult Intel's manuals.
Post 08 Jan 2008, 10:14
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Heh, and btw from Agner's:
Agner Fog wrote:

16.3 XCHG (all processors)
The XCHG register,[memory] instruction is dangerous. This instruction always has an
implicit LOCK prefix which prevents it from using the cache. This instruction is therefore very
time consuming, and should always be avoided.
The XCHG instruction with register operands may be useful when optimizing for size as
explained on page 65.


EDIT: If you are running out of registers and you absolutely need XCHG r32,m32, then there are other ways around it. Make your application even more memory-accessing and keep your extra "registers" in named memory locations, e.g. r8 dd ?, r9 dd ?, etc.

Then, when eax needs to be exchanged with one of them, xchg eax,[r8] is not a good option, but a rather fast alternative exists (provided you have at least MMX or even SSE; note that the PSHUFD and MOVQ forms used below are SSE2 instructions). If you have SSE, prefer it to MMX even if you don't need 128 bits.
Code:
mov     [r9],eax
pshufd  xmm0,[r8],10110001b ; DCBA will be => CDAB (C,D are ignored in this example)
movq    [r8],xmm0 ; Only r8 and r9 are written back (dword+dword=qword)
;movdqa  [r8],xmm0 ; Uncomment this if you want to exchange r8<x>r9 | r10<x>r11
mov     eax,[r9]
;...
align 16
r8  dd ?
r9  dd ?
r10 dd ?
r11 dd ?
;...
    


EDIT:
Did some calculations...
Code:
                                uops  |           uops on each port           |       |
                                fused | 015 |  0  |  1  |  5  |  2  |  3  |  4  |  Lat  | Recip
===============================================================================================
mov     [r_9],eax                 1   |     |     |     |     |     |  1  |  1  |   3   |   1
pshufd  xmm0,[r_8],10110001b      3   |  2  |  x  |  x  |  1  |  1  |     |     |       |   1
movq    [r_8],xmm0                1   |     |     |     |     |     |  1  |  1  |       |   1
(movdqa [r_8],xmm0)              (1)  | ()  | ()  | ()  | ()  | ()  | (1) | (1) |  (3)  |  (1)
mov     eax,[r_9]                 1   |     |     |     |     |  1  |     |     |   2   |   1
TOTAL:                            6   |  2  |  1  |  1  |  1  |  2  |  2  |  2  |  5/8  |   4

6 or 9 clocks (one or two exchanges)
And you are not wasting any registers.
Comparable XCHG sequences would take
7 or 14 clocks minimum.
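
To make the data movement in the four-instruction sequence easier to follow, here is a portable C model of it. It is a sketch only: `pshufd_model` and `exchange_via_shuffle` are made-up names, the array `slots` stands in for the consecutive dwords r8..r11, and the model says nothing about timing.

```c
#include <stdint.h>

/* Portable model of PSHUFD dst,src,imm8: destination dword i receives
   source dword number (imm8 >> (2*i)) & 3. dst and src must not overlap. */
static void pshufd_model(uint32_t dst[4], const uint32_t src[4], uint8_t imm8)
{
    for (int i = 0; i < 4; i++)
        dst[i] = src[(imm8 >> (2 * i)) & 3];
}

/* Hypothetical walk-through of the sequence. slots[0] models r8 (the
   variable being exchanged), slots[1] models r9 (scratch space),
   slots[2] and slots[3] model r10 and r11. Returns the new "eax". */
static uint32_t exchange_via_shuffle(uint32_t eax, uint32_t slots[4])
{
    uint32_t xmm0[4];

    slots[1] = eax;                    /* mov    [r9],eax             */
    pshufd_model(xmm0, slots, 0xB1);   /* pshufd xmm0,[r8],10110001b  */
    slots[0] = xmm0[0];                /* movq   [r8],xmm0 writes the */
    slots[1] = xmm0[1];                /*        low two dwords only  */
    return slots[1];                   /* mov    eax,[r9]             */
}
```

The selector 10110001b (0xB1) swaps dwords 0 and 1 and dwords 2 and 3, so after the MOVQ the r8 slot holds the old eax and the r9 slot holds the old r8 value, which is then loaded back into eax.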
    
Post 08 Jan 2008, 10:28
MCD



Joined: 21 Aug 2004
Posts: 604
Location: Germany
Madis731 wrote:

Code:
mov     [r9],eax
pshufd  xmm0,[r8],10110001b ; DCBA will be => CDAB (C,D are ignored in this example)
movq    [r8],xmm0 ; Only r8 and r9 are written back (dword+dword=qword)
;movdqa  [r8],xmm0 ; Uncomment this if you want to exchange r8<x>r9 | r10<x>r11
mov     eax,[r9]
;...
align 16
r8  dd ?
r9  dd ?
r10 dd ?
r11 dd ?
;...
    


The problem with your code is that you assume that you can write to r9, so you need additional space beyond r8 (the actual variable), which is not always available.
Post 09 Jan 2008, 05:46
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
The very fact that I write eax to [r9] means that it can be written to. The only assumption is with MOVDQA, where the other 8 bytes are not guaranteed. It's up to the coder (or maybe a macro) to guarantee that r8 and r9 are consecutive.

It's true, though, that alignment is needed, because PSHUFD with a memory operand (like MOVDQA) faults on an address that is not 16-byte aligned. There are some other problems too: it's a lot of coding and doesn't have much speed benefit, especially over the MOV sequence with a temp register. But using SIMD can do the trick sometimes.
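
As a rough illustration of what "guaranteeing" the layout could look like, here is a C11 sketch that checks adjacency and 16-byte alignment at compile time. The struct name `spill_slots` is hypothetical; in fasm the same guarantee would come from `align 16` followed by consecutive `dd` definitions, possibly wrapped in a macro.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "register file in memory": four consecutive dwords.
   alignas(16) on the first member aligns the whole block, which is
   the C counterpart of `align 16` before the dd definitions. */
struct spill_slots {
    alignas(16) uint32_t r8;
    uint32_t r9;
    uint32_t r10;
    uint32_t r11;
};

/* Compile-time checks: r9 directly follows r8, and the block is
   exactly one 16-byte, 16-byte-aligned unit. */
static_assert(offsetof(struct spill_slots, r9) ==
              offsetof(struct spill_slots, r8) + sizeof(uint32_t),
              "r8 and r9 must be adjacent");
static_assert(sizeof(struct spill_slots) == 16, "block must be 16 bytes");
static_assert(alignof(struct spill_slots) == 16, "block must be 16-byte aligned");
```

If any of these assumptions is broken by a later change, the build fails instead of the shuffle silently touching the wrong dwords.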
Post 09 Jan 2008, 06:50
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
Keep in mind that XCHG with a memory operand has an implicit bus LOCK... pretty useful with multithreading (keyword: atomic - your solutions aren't).
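
A small C11 sketch of what that atomicity buys: `atomic_exchange` stores a new value and returns the old one as a single indivisible step, which is enough to build a simple try-lock. The names `shared_word`, `lock_word`, `exchange_atomic`, `try_lock` and `unlock` are made up for this example. On x86, compilers commonly lower `atomic_exchange` to an XCHG with a memory operand, but that is an implementation detail and not something the C standard promises.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared word standing in for the DWORD at [esi]. */
static _Atomic uint32_t shared_word = 0;

/* Store new_value and return the previous contents in one atomic step.
   No other thread can observe or modify the word in between. */
static uint32_t exchange_atomic(uint32_t new_value)
{
    return atomic_exchange(&shared_word, new_value);
}

/* Minimal try-lock on the same primitive: 0 = free, 1 = taken. */
static _Atomic uint32_t lock_word = 0;

/* Returns 1 if this caller took the lock, 0 if it was already held. */
static int try_lock(void)
{
    return atomic_exchange(&lock_word, 1u) == 0u;
}

static void unlock(void)
{
    atomic_store(&lock_word, 0u);
}
```

The MOV, XOR and PSHUFD sequences earlier in the thread read and write the memory operand in separate steps, so another thread could modify it in between; they are only safe when no other thread touches that location.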
Post 09 Jan 2008, 12:15
Copyright © 1999-2020, Tomasz Grysztar. Also on YouTube, Twitter.