flat assembler
Message board for the users of flat assembler.

mov eax,ebx or xchg eax,ebx

asmrus



Joined: 31 Dec 2005
Posts: 4
Location: .gr + .ru
asmrus 10 Mar 2006, 22:25
Which is better for speed or size?

_________________
"...they track us, our interests and our hosts, we track them,
their interests and their hosts, it's an interesting match and we'll
always win, coz we do not do it for money... work well, +ORC"
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 10 Mar 2006, 22:33
"xchg" is one byte and "mov" two bytes.

mov is faster

Did you notice that xchg will copy ebx to eax, but will also overwrite ebx with the old value of eax?
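
A minimal sketch of the difference. The values 1 and 2 are arbitrary examples, and `use32` is only there so the fragment assembles on its own as flat 32-bit code:

```asm
use32                   ; assumption: flat binary output, 32-bit code

    mov eax,1
    mov ebx,2
    mov eax,ebx         ; copy: eax = 2, ebx = 2 (the old eax is lost)

    mov eax,1
    mov ebx,2
    xchg eax,ebx        ; swap: eax = 2, ebx = 1
```

So xchg can only stand in for mov when it does not matter that ebx ends up holding the old value of eax.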



asmrus 10 Mar 2006, 22:45
Yes, I knew that...

x - change bytes

But how many times faster than xchg can mov be? :) ...
It's all assembly ;)

penang



Joined: 01 Oct 2004
Posts: 59
penang 11 Mar 2006, 07:03
Test it under the following conditions:

A. Exchange something a hundred million times. Time the result.

B. Move something a hundred million times. Time the result.

C. Compare the results of A and B.

D. Voilà!
r22



Joined: 27 Dec 2004
Posts: 805
r22 11 Mar 2006, 08:34
MOV is about 4x faster than XCHG
on an AMD X2 3800+ (2.0 GHz, 1 GB RAM).
Results from the code snippet below:
XCHG: 10076 milliseconds
MOV: 2357 milliseconds

Code:
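    ; empty message box: waits for a click before the timing starts
    push 0
    push 0
    push 0
    push 0
    call [MessageBox]
;---------------------------------------
; test 1: 07FFFFFFh iterations of 100 x xchg eax,ebx
    call [GetTickCount]
    mov edi,eax             ; edi = start time in milliseconds
    mov esi,07FFFFFFh       ; outer loop counter
tst1:
    repeat 100
    xchg eax,ebx
    end repeat

    dec esi
    jnz tst1
    call [GetTickCount]
    sub eax,edi             ; eax = elapsed milliseconds
    push eax
    push result1            ; presumably a format string defined elsewhere
    call [printf]
    add esp,8               ; assuming the cdecl printf from msvcrt: caller pops the arguments

; test 2: same loop with mov eax,ebx
    call [GetTickCount]
    mov edi,eax
    mov esi,07FFFFFFh
tst2:
    repeat 100
    mov eax,ebx
    end repeat

    dec esi
    jnz tst2
    call [GetTickCount]
    sub eax,edi
    push eax
    push result2            ; presumably a format string defined elsewhere
    call [printf]
    add esp,8               ; same cdecl assumption as above
;+++++++++++++++++++++++++++++++++++++++++++++
    push 0
    push buffer             ; buffer is not defined in this snippet
    push buffer
    push 0
    call [MessageBox]
    push 0
    call [ExitProcess]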
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8351
Location: Kraków, Poland
Tomasz Grysztar 11 Mar 2006, 10:47
This is perhaps because XCHG always locks the bus, as if you had used the LOCK prefix with it.
Borsuc



Joined: 29 Dec 2005
Posts: 2465
Location: Bucharest, Romania
Borsuc 11 Mar 2006, 13:38
I thought XCHG locks the bus only when one of its operands is in memory? Register-to-register is not a memory operation... or am I wrong?
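
For reference, the two forms side by side. As far as the Intel and AMD instruction references describe it, the implicit lock applies only when one operand is in memory. `[esi]` is a hypothetical pointer to a dword, not something from this thread:

```asm
use32                   ; assumption: flat binary output, 32-bit code

    xchg eax,ebx        ; register form: no memory access, so nothing to lock
    xchg eax,[esi]      ; memory form: locked implicitly, with or without a LOCK prefix
```

So the implicit lock would not explain a slow register-to-register xchg.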


LocoDelAssembly 11 Mar 2006, 13:52
Tomasz, I'm not sure you are right; I think the problem on the Athlon 64 is that XCHG is a VectorPath instruction even when both operands are registers.

Software Optimization Guide for AMD64 Processors wrote:

Instruction Type: Description
DirectPath Single: A relatively common instruction that the processor decodes directly into one macro-op in hardware.
DirectPath Double: A relatively common instruction that the processor decodes directly into two macro-ops in hardware.
VectorPath: A sophisticated or less common instruction that the processor decodes into one or more (usually three or more) macro-ops using the on-chip microcode-engine ROM (MROM).


[Attached image: Instruction Latencies.PNG]





Tomasz Grysztar 11 Mar 2006, 13:53
Right, I missed the point that it's only about exchanging registers here.


LocoDelAssembly 11 Mar 2006, 14:24
I tested with code similar to r22's, using 16-byte-aligned loops and realtime priority, but I got the same times.



Borsuc 11 Mar 2006, 14:39
On a Mobile AMD Sempron 2800+ (1.6 GHz), tested with 1000 repeats:

xchg eax, ebx: 1500 cycles
mov eax, ebx: 329 cycles
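
The post does not say how the cycles were counted. One common way, sketched here purely as an assumption, is to read the time-stamp counter with rdtsc around an unrolled block:

```asm
use32                   ; assumption: flat binary output, 32-bit code

    rdtsc               ; edx:eax = time-stamp counter
    mov esi,eax         ; save the low 32 bits of the start value
    repeat 1000
    xchg eax,ebx        ; instruction under test
    end repeat
    rdtsc
    sub eax,esi         ; eax = elapsed cycles (low 32 bits only)
```

rdtsc is not a serializing instruction, so figures from a sketch like this are approximate.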

But that isn't fair, as xchg exchanges the values.

Code:
  mov ecx, eax
  mov eax, ebx
  mov ebx, ecx    
This results in exactly 1500 cycles, just like xchg ;)

xchg is the best way (at least on this processor) to exchange two registers. Even the trick with 3 xors takes about 3000 cycles, not to mention that it's a lot larger in size :D
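
For comparison, a sketch of the three-XOR swap; the comments use a and b for the initial values of eax and ebx:

```asm
use32                   ; assumption: flat binary output, 32-bit code

    xor eax,ebx         ; eax = a xor b
    xor ebx,eax         ; ebx = b xor (a xor b) = a
    xor eax,ebx         ; eax = (a xor b) xor a = b
```

Each instruction needs the result of the one before it, so the three cannot execute in parallel, and the sequence takes six bytes against one for xchg eax,ebx.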
bogdanontanu



Joined: 07 Jan 2004
Posts: 403
Location: Sol. Earth. Europe. Romania. Bucuresti
bogdanontanu 11 Mar 2006, 16:34
As Grey_Beast says, XCHG really DOES something else.
So the comparison is not fair.

XCHG is also useful because it uses an internal CPU temporary register, and so avoids dirtying one of the standard registers. This property can be very useful at times!

I would care first about simple, easy-to-understand code and only later about speed. However, I agree that for speed it is better to avoid using XCHG inside "inner loops" unless you desperately need its extra temporary register.

Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.