flat assembler
Message board for the users of flat assembler.

Index > Main > Integer average algorithm

Author
Thread Post new topic Reply to topic
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 23 May 2005, 21:47
Motivation: (a+b)/2 may get an overflow so alternatives should be thought!
(a+b)/2 = ((a&b)+(a|b))/2 = ((x^y)+2*(x&y))/2 => ((x^y)/2+(x&y))
Code:
mov eax,3141592653
mov ebx,1897932384

;Five instructions (1µop - theory 2.33clocks - even 2.0 serialized)
mov edx,eax ;port0 or 1 - first clock
xor edx,ebx ;port1 or 0
shr edx,1   ;port0 - second clock
and eax,ebx ;port1
add eax,edx ;port0 - third clock
;Result: 11.1clk - reordering didn't help
;I really don't know why it is so cruel on my PIII

;I would use this kind of algo on CPUs supporting
;rotations through carry
;TWO instructions (1+2µops - theory 2clocks - 1.66 serialized)
add eax,ebx ;port0 - first clock
rcr eax,1   ;port0+1 - second clock
;Result: 1.7clk - next add takes advantage of previous rcr
;though delaying on rotate.
    


Would anyone care to explain to me why this first one is recommended on 'gems' sites and not the latter one. Compatibility? Why would my CPU waste
2 clocks instead of 1/3 on every 1µop instruction?

_________________
My updated idol Very Happy http://www.agner.org/optimize/
Post 23 May 2005, 21:47
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
MazeGen



Joined: 06 Oct 2003
Posts: 977
Location: Czechoslovakia
MazeGen 24 May 2005, 07:01
As I see it:

Code:
mov edx,eax
xor edx,ebx ;wait until mov is finished
shr edx,1   ;wait until xor is finished
and eax,ebx
add eax,edx ;wait until and is finished
    


What about:

Code:
mov edx,eax
and eax,ebx
xor edx,ebx
shr edx,1   ;wait until xor is finished
add eax,edx
    
Post 24 May 2005, 07:01
View user's profile Send private message Visit poster's website Reply with quote
Octavio



Joined: 21 Jun 2003
Posts: 366
Location: Spain
Octavio 24 May 2005, 09:54
Madis731 wrote:
Would anyone care to explain to me why this first one is recommended on 'gems' sites and not the latter one. Compatibility? Why would my CPU waste
2 clocks instead of 1/3 on every 1µop instruction?

perhaps because the second one only works on unsigned numbers.
What about using mmx?
Post 24 May 2005, 09:54
View user's profile Send private message Visit poster's website Reply with quote
MCD



Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD 24 May 2005, 10:53
Let me point out a small unprecision in your code suggestion
Code:
mov edx,eax 
and eax,ebx 
xor edx,ebx 
sar edx,1   ;wait until xor is finished 
add eax,edx ;wait until sar is finished; <- forgot this Smile
    


This one is only slightly better
Code:
mov edx,eax 
xor eax,ebx
and edx,ebx 
sar eax,1
add eax,edx ;wait sar is finished
    


Quote:
What about using mmx?


Sure, this goes too:
Code:
;a: mm0  ;<- this should not mean "ammo" Wink
;b: mm1
;tmp: mm2 ;help mmx register value

movq mm2,mm0
pxor mm0,mm1
pand mm2,mm1
psraw/d mm0,mm0,1 ;<= this disallows you from using byte/qword average because psrlb/w are not available in mmx Sad
paddb/w/d/q* mm0,mm2 ;wait psraw/d is finished
    

actually, you can't only use word and dword data sizes, because of the restriction of the paddX and psraX instruction.

* note: paddq was introduced with SSE2.

Actually, if you allow MMX-II instructions introduced with SSE, you can calculate the average in one instruction:
Code:
pavgb/w mm0,mm1;only available for byte/words
;There is also something very similar for xmm FPU registers
    

_________________
MCD - the inevitable return of the Mad Computer Doggy

-||__/
.|+-~
.|| ||


Last edited by MCD on 25 May 2005, 11:07; edited 4 times in total
Post 24 May 2005, 10:53
View user's profile Send private message Reply with quote
MazeGen



Joined: 06 Oct 2003
Posts: 977
Location: Czechoslovakia
MazeGen 24 May 2005, 11:01
Eh, you're right, MCD Wink

I was pointing in the first place to the fact that Madis have leaved out of consideration dependecies between the instructions.
Post 24 May 2005, 11:01
View user's profile Send private message Visit poster's website Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 24 May 2005, 15:31
You are right - there are dependancies but I would've noticed them if I had used Pentium II or earlier one. The code you suggested acts the same way because the "mov eax,edx" can't start before the last "add eax,edx" is finished so one pass is faster, but by running it 10 times in a row (NO jmp - only 10xcode) the dependancy problem occurs in a different place.

I want to argue about signed numbers:
It works with POS+POS, POS+NEG, NEG+POS and NEG+NEG

0FFFFFFFFh+0FFFFFFFDh=Carry+0FFFFFFFCh => -1+(-3)=-4
rcr 0FFFFFFFCh,1 = 0FFFFFFFEh <= -2
What do you mean it works ONLY on unsigned numbers?

BTW the SSE solution is elegant Wink but I was speaking about the gems that were introduced many-many years ago and I just got a confirmation that rcr/rcl did indeed exist in 286 Intel processors so why wasn't it used.

You didn't answer my question but were arguing about the opimization of the first algorithm.

What I need to know is WHY is my version BAD?
I don't want to know how optimized the first version is - thanks!
Post 24 May 2005, 15:31
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 24 May 2005, 22:08
Code:
mov eax,-4
mov edx,2
add eax,edx
rcr eax,1
push eax
push fmt ; '%li',0
push buffer
call [wsprintf]
add esp,0ch
push 0
push buffer
push buffer
push 0
call [MessageBox]
    

When finding the average of a negative and a positive number the
ADD
RCR
algorithm fails.
Post 24 May 2005, 22:08
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 25 May 2005, 10:18
Indeed Neutral I didn't notice it before - but the first algorithm also works only with unsigned. I should use "sar" instead of "shr" then.
Thanks for pointing that out.
Post 25 May 2005, 10:18
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
MCD



Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD 25 May 2005, 11:03
Ups, i missed this issue too. Just corrected my stuff also.
Post 25 May 2005, 11:03
View user's profile Send private message Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 26 May 2005, 16:45
Ok, now that its clear I say that I will stick to my version because I've never had to use signed values. This will do.
Thanks everybody for your input!

Too bad there isn't rotate arithmetic right Smile i.e. RAR
Post 26 May 2005, 16:45
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.