flat assembler
Message board for the users of flat assembler.

 Index > Main > Integer average algorithm
Author

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Motivation: (a+b)/2 may get an overflow so alternatives should be thought!
(a+b)/2 = ((a&b)+(a|b))/2 = ((x^y)+2*(x&y))/2 => ((x^y)/2+(x&y))
Code:
```mov eax,3141592653
mov ebx,1897932384

;Five instructions (1µop - theory 2.33clocks - even 2.0 serialized)
mov edx,eax ;port0 or 1 - first clock
xor edx,ebx ;port1 or 0
shr edx,1   ;port0 - second clock
and eax,ebx ;port1
add eax,edx ;port0 - third clock
;Result: 11.1clk - reordering didn't help
;I really don't know why it is so cruel on my PIII

;I would use this kind of algo on CPUs supporting
;rotations through carry
;TWO instructions (1+2µops - theory 2clocks - 1.66 serialized)
add eax,ebx ;port0 - first clock
rcr eax,1   ;port0+1 - second clock
;though delaying on rotate.
```

Would anyone care to explain to me why this first one is recommended on 'gems' sites and not the latter one. Compatibility? Why would my CPU waste
2 clocks instead of 1/3 on every 1µop instruction?

_________________
My updated idol http://www.agner.org/optimize/
23 May 2005, 21:47
MazeGen

Joined: 06 Oct 2003
Posts: 977
Location: Czechoslovakia
MazeGen
As I see it:

Code:
```mov edx,eax
xor edx,ebx ;wait until mov is finished
shr edx,1   ;wait until xor is finished
and eax,ebx
add eax,edx ;wait until and is finished
```

Code:
```mov edx,eax
and eax,ebx
xor edx,ebx
shr edx,1   ;wait until xor is finished
```
24 May 2005, 07:01
Octavio

Joined: 21 Jun 2003
Posts: 366
Location: Spain
Octavio
Would anyone care to explain to me why this first one is recommended on 'gems' sites and not the latter one. Compatibility? Why would my CPU waste
2 clocks instead of 1/3 on every 1µop instruction?

perhaps because the second one only works on unsigned numbers.
24 May 2005, 09:54
MCD

Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD
Let me point out a small unprecision in your code suggestion
Code:
```mov edx,eax
and eax,ebx
xor edx,ebx
sar edx,1   ;wait until xor is finished
add eax,edx ;wait until sar is finished; <- forgot this
```

This one is only slightly better
Code:
```mov edx,eax
xor eax,ebx
and edx,ebx
sar eax,1
add eax,edx ;wait sar is finished
```

Quote:

Sure, this goes too:
Code:
```;a: mm0  ;<- this should not mean "ammo"
;b: mm1
;tmp: mm2 ;help mmx register value

movq mm2,mm0
pxor mm0,mm1
pand mm2,mm1
psraw/d mm0,mm0,1 ;<= this disallows you from using byte/qword average because psrlb/w are not available in mmx
paddb/w/d/q* mm0,mm2 ;wait psraw/d is finished
```

actually, you can't only use word and dword data sizes, because of the restriction of the paddX and psraX instruction.

* note: paddq was introduced with SSE2.

Actually, if you allow MMX-II instructions introduced with SSE, you can calculate the average in one instruction:
Code:
```pavgb/w mm0,mm1;only available for byte/words
;There is also something very similar for xmm FPU registers
```

_________________
MCD - the inevitable return of the Mad Computer Doggy

-||__/
.|+-~
.|| ||

Last edited by MCD on 25 May 2005, 11:07; edited 4 times in total
24 May 2005, 10:53
MazeGen

Joined: 06 Oct 2003
Posts: 977
Location: Czechoslovakia
MazeGen
Eh, you're right, MCD

I was pointing in the first place to the fact that Madis have leaved out of consideration dependecies between the instructions.
24 May 2005, 11:01

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
You are right - there are dependancies but I would've noticed them if I had used Pentium II or earlier one. The code you suggested acts the same way because the "mov eax,edx" can't start before the last "add eax,edx" is finished so one pass is faster, but by running it 10 times in a row (NO jmp - only 10xcode) the dependancy problem occurs in a different place.

I want to argue about signed numbers:
It works with POS+POS, POS+NEG, NEG+POS and NEG+NEG

0FFFFFFFFh+0FFFFFFFDh=Carry+0FFFFFFFCh => -1+(-3)=-4
rcr 0FFFFFFFCh,1 = 0FFFFFFFEh <= -2
What do you mean it works ONLY on unsigned numbers?

BTW the SSE solution is elegant but I was speaking about the gems that were introduced many-many years ago and I just got a confirmation that rcr/rcl did indeed exist in 286 Intel processors so why wasn't it used.

You didn't answer my question but were arguing about the opimization of the first algorithm.

What I need to know is WHY is my version BAD?
I don't want to know how optimized the first version is - thanks!
24 May 2005, 15:31
r22

Joined: 27 Dec 2004
Posts: 805
r22
Code:
```mov eax,-4
mov edx,2
rcr eax,1
push eax
push fmt ; '%li',0
push buffer
call [wsprintf]
push 0
push buffer
push buffer
push 0
call [MessageBox]
```

When finding the average of a negative and a positive number the
RCR
algorithm fails.
24 May 2005, 22:08

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Indeed I didn't notice it before - but the first algorithm also works only with unsigned. I should use "sar" instead of "shr" then.
Thanks for pointing that out.
25 May 2005, 10:18
MCD

Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD
Ups, i missed this issue too. Just corrected my stuff also.
25 May 2005, 11:03

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Ok, now that its clear I say that I will stick to my version because I've never had to use signed values. This will do.

Too bad there isn't rotate arithmetic right i.e. RAR
26 May 2005, 16:45
 Display posts from previous: All Posts1 Day7 Days2 Weeks1 Month3 Months6 Months1 Year Oldest FirstNewest First

 Jump to: Select a forum Official----------------AssemblyPeripheria General----------------MainTutorials and ExamplesDOSWindowsLinuxUnixMenuetOS Specific----------------MacroinstructionsOS ConstructionIDE DevelopmentProjects and IdeasNon-x86 architecturesHigh Level LanguagesProgramming Language DesignCompiler Internals Other----------------FeedbackHeapTest Area

Forum Rules:
 You cannot post new topics in this forumYou cannot reply to topics in this forumYou cannot edit your posts in this forumYou cannot delete your posts in this forumYou cannot vote in polls in this forumYou cannot attach files in this forumYou can download files in this forum