flat assembler
Message board for the users of flat assembler.

 Index > Main > Calculating multi-dimensional array displacements, how? Goto page Previous  1, 2, 3  Next
Author
LocoDelAssembly

Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 05 Dec 2007, 04:55
Code:
```mov  ecx, eax
mov  edx, ebx

and  eax, not \$01010101
and  ebx, not \$01010101

and  ecx, \$01010101
and  edx, \$01010101

rcr  eax, 1

and  edx, ecx ; Yes, AND

ret    ```

The rationale is that Int(A/2 + B/2) is below by one than Int((A + B)/2) only when A mod 2 = B mod 2 = 1.

Actually I should sleep now so perhaps I said non sense above
05 Dec 2007, 04:55
LocoDelAssembly

Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 05 Dec 2007, 17:23
I have made an extra version that is compatible with MMX PAVGB. The code also verifies the compatibility
Code:
```include 'win32ax.inc'

MMX_OUTPUT_COMPATIBLE = 1

.code

start:
mov     esi, \$03020100

.loop:
xor     ebp, ebp

.innerLoop:
mov     eax, esi
mov     edx, ebp
call    mmx_pavgb

mov     edi, eax

mov     eax, esi
mov     edx, ebp
call    alu_pavgb

cmp     eax, edi
jne     .invalidResult

jnc     .innerLoop

jnc     .loop

invoke  MessageBox, 0, "Fully compatible!! :D", "Packed average test", MB_ICONINFORMATION
invoke  ExitProcess, 0

.invalidResult:
invoke  MessageBox, 0, "There was an incompatible result :(", "Packed average test", MB_ICONERROR
invoke  ExitProcess, 0

mmx_pavgb:
movd    mm0, eax
movd    mm1, edx

pavgb   mm0, mm1

movd    eax, mm0
ret

alu_pavgb:
mov     ebx, eax
mov     ecx, edx

and     eax, not \$01010101
and     edx, not \$01010101

and     ebx, \$01010101
and     ecx, \$01010101

rcr     eax, 1

if defined MMX_OUTPUT_COMPATIBLE
or      ebx, ecx ; Rounds up
else
and     ebx, ecx ; Rounds down
end if

ret

.end start    ```
05 Dec 2007, 17:23
bitRAKE

Joined: 21 Jul 2003
Posts: 3976
Location: vpcmipstrm
bitRAKE 05 Dec 2007, 21:28
Good work, LocoDelAssembly. The rounding value could be masked after they are joined together - saving two instructions, but no improvement in speed, imho.

I just wanted to mention that this particular algorithm doesn't require multi-dimentional arrays - both imagines can be addressed like basic arrays if they are the same size. In the case where they are different sizes, one will advance like a one-dimentional array and the other will need an offset added for each line.

It is a good example of how the data model colapses at the code level.
05 Dec 2007, 21:28
LocoDelAssembly

Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 05 Dec 2007, 21:53
Quote:

The rounding value could be masked after they are joined together - saving two instructions, but no improvement in speed, imho.

I can't see what do you say I can do
05 Dec 2007, 21:53
bitRAKE

Joined: 21 Jul 2003
Posts: 3976
Location: vpcmipstrm
bitRAKE 05 Dec 2007, 23:50
Code:
```mov  edx, eax
and  eax, not \$01010101
and  edx, ebx
and  ebx, not \$01010101
and  edx, \$01010101
rcr  eax, 1
ret    ```
05 Dec 2007, 23:50
xspeed

Joined: 16 Aug 2007
Posts: 22
xspeed 06 Dec 2007, 00:26
why dont you post your source code in any form of HL and we will see what you are talking about.

I found that when you are dealing with matrix/multi-array it is best to use repition fuction.

post the alogarithm in a HL form so we see what you are talking about.
06 Dec 2007, 00:26
LocoDelAssembly

Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 06 Dec 2007, 01:37
bitRAKE, I'm very dissapointed of myself now

Now I modified my code according to your suggestions so we can still choose for MMX compatibility.

Code:
```alu_pavgb:
mov     ecx, edx
and     edx, not \$01010101

if defined MMX_OUTPUT_COMPATIBLE
or      ecx, eax ; Rounds up
else
and     ecx, eax ; Rounds down
end if

and     eax, not \$01010101
and     ecx, \$01010101

rcr     eax, 1

ret    ```

And now it is fastcall compatible (ebx is not destroyed anymore and arguments are passed as fastcall like before). Of course by no means this must be used as a PROC, one have to inline it always.

Thanks bitRAKE!!
06 Dec 2007, 01:37
edfed

Joined: 20 Feb 2006
Posts: 4330
Location: Now
edfed 06 Dec 2007, 11:26
Quote:

post the alogarithm in a HL form so we see what you are talking about.

when coding in asm, algorythms are in asm.
make some pseudo code in HL or C or basic doesn't help to understand.

personally, when i read asm code, i understand the idea there is behind the instructions.
learning asm, understand asm is a long and hard walk.
but after, you can easily understand every languages.
like learning latin helps to learn frensh, spanish, italian etc...

here, the name of the forum is "board.flatassembler", not 'HLA".
06 Dec 2007, 11:26
bitRAKE

Joined: 21 Jul 2003
Posts: 3976
Location: vpcmipstrm
bitRAKE 06 Dec 2007, 14:56
LocoDelAssembly wrote:
bitRAKE, I'm very dissapointed of myself now
Don't let that bother you. I always have this sense of missing something when I code - it never goes away.
06 Dec 2007, 14:56
edfed

Joined: 20 Feb 2006
Posts: 4330
Location: Now
edfed 06 Dec 2007, 16:00
Code:
```;eax=pxl1  32 bit
;ebx=pxl2  32 bit
mov  ecx, eax
mov  edx, ebx
and  eax, not \$01010101
and  ebx, not \$01010101
and  ecx, \$01010101
and  edx, \$01010101
rcr  eax, 1
;add edx,ecx    ;yes, ADD, because of 1+1=2 ; false: 0/20;;;need to shr 1 : 1+1=2
and  edx, ecx ;;;; NO;;;; because 1 AND 1=1       ;solution
;if cached, then use U and V pipelines, only 5 CPU cycles.
;for 24 bit, 16 bit and 15 bit, just change the AND_MASKs
;AND MASK can be in EDI and/or ESI
```

Last edited by edfed on 06 Dec 2007, 20:18; edited 1 time in total
06 Dec 2007, 16:00
LocoDelAssembly

Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 06 Dec 2007, 18:42
edfed, no, ADD no, but if you use it then don't forget to use SHR, otherwise you are over valuating the least significant bit. In fact, the reason of why I used more ANDs than needed before is because I was going to do it like you (plus the SHR), but even before writting SHR I realised that AND can do the task of the ADD/SHR pair so I used an AND but my brain got a hardcoded view about AND acting as ADD so I wasn't able to see the redundant instructions

PS: And yes, I tried your code, it is incompatible with both, round up (MMX) and round down. Also note that you even have to mask the result even after the SHR, otherwise it keeps incompatible
Code:
``` ; Code below is equivalent to "and edx, ecx" (on this particular context)
shr edx, 1
and edx, \$01010101
```
06 Dec 2007, 18:42
edfed

Joined: 20 Feb 2006
Posts: 4330
Location: Now
edfed 06 Dec 2007, 20:16
yes,
sorry,
Code:
```         and edx,ecx
```

not this:
Code:
```         add edx,ecx
shr edx,1
and edx,01010101h
```

sorry! i don't tes the code i post, i don't code for 32 bit graphics now. so i don't test it. sorry!
0/20 peu mieux faire
06 Dec 2007, 20:16
edfed

Joined: 20 Feb 2006
Posts: 4330
Location: Now
edfed 06 Dec 2007, 20:24
and what about the mean of 9 pixels?

Code:
```cell
dd 00,10,20  ;0,1,2
dd 01,11,21  ;3,4,5
dd 02,12,22  ;6,7,8
ret
; cell 11 = sygma cell 00->22 ( 0->8 ) / Ncell
```

the fastest as possible for 386 and MMX.
06 Dec 2007, 20:24
bitRAKE

Joined: 21 Jul 2003
Posts: 3976
Location: vpcmipstrm
bitRAKE 06 Dec 2007, 21:07
PSADBW could be used along with a shift to average many pixels. Shuffling around the bytes will be a little work - faster than 9 loads though. ALU would be 9x(MOVZX/ADD)+SHR for each compenent (haven't put much effort in thinking up a faster routine).
06 Dec 2007, 21:07
bitRAKE

Joined: 21 Jul 2003
Posts: 3976
Location: vpcmipstrm
bitRAKE 06 Dec 2007, 22:29
Just thought up a neat idea for the neighborhood average problem: could keep a running average and just update for the pixels that changed. Moving horizontally/vertically three pixels need to be added and three subtracted to get the sum of the next pixel. Still need to divide by nine - which could be done with a multiply instruction.

Total[n+1] = Total[n] - (Pixel[n-y-1] + Pixel[n-1] + Pixel[n+y-1]) + (Pixel[n-y+2]+Pixel[n+2]+Pixel[n+y+2])

Pixel[n+1] = Total[n+1] / 9

One third less processing right there for the ALU code. (c:

IIRC, the average can be done pairwise with each nieghboring pixel separately for relatively the same effect -- greater rounding error.

Edit: for the range of numbers [0,<1200h] the following will divide EDX by nine (which is sufficient for the adverage of nine pixels):
Code:
```imul edx,0E38E4h
shr edx,23    ```
...much faster than (I)DIV.

Found a better choice:
Code:
```    imul edx,1C72h
shr edx,16    ```
...works for all signed 16-bit values - extended to 32-bit of course. Works for MMX/SSE2 as well - because we have a high word multiply instruction. Now I can combine the PSADBW values multiply and store to memory - should be very fast.
06 Dec 2007, 22:29
edfed

Joined: 20 Feb 2006
Posts: 4330
Location: Now
edfed 07 Dec 2007, 00:31
how many FPS for a 800*600 32 bit?
just to know if it is the best choise for a default system function.

defining test as an antialiasing filter, operating on a decompressed video.
no?
the size of the cell can change, [3*3],[9*9],[10*10], coeficients for each cell can change too. 66%, cos (45° )/2, etc...
by treading one byte per word.
two bytes per dword
four bytes per qword
etc...

the cell can make the circular mean of a pixelized picture.
...

Code:
```; 00,01,02
; 10,11,12
; 20,21,23
cell:
;settings for screen13h
.size dd @f-\$-4
.00 dd -320-1
.01 dd -320
.02 dd -320+1
.10 dd -1
.11 dd 0
.12 dd +1
.20 dd +320-1
.21 dd +320
.22 dd +320+1
cellmean:
mov esi,[cell]
mov edi,[cell+esi]
mov eax,[pixel+buffer0+edi]
mov ecx,eax
shr ecx,8
and eax,00FF00FFh
and ecx,00FF00FFh
@@:
sub esi,4
jl @f
mov edi,[esi]
mov ebx;[pixel+buffer+edi]
mov edx,ebx
shr edx,8
and ebx,00FF00FFh
and edx,00FF00FFh
jmp @b
@@:

and eax,00FF00FFh
and ecx,00FF00FFh
shl ecx,8
mov esi,[cell]
shr esi,2
idiv eax,esi
mov [pixel+buffer1],eax
the_end:
;ret
```

this algorythm is slow !
and the borders need to be 0, black, null
07 Dec 2007, 00:31
bitRAKE

Joined: 21 Jul 2003
Posts: 3976
Location: vpcmipstrm
bitRAKE 07 Dec 2007, 05:34
edfed wrote:
how many FPS for a 800*600 32 bit?
It is important to understand how slow memory is: for example, my laptop has a latency of 3, 11, and 200 cycles for accesses to L1, L2, and main memory, respectively. Reading a single byte (whole cacheline will be loaded) from uncached main memory is 200 cycles!

No doubt this routine would be memory bound for full screen operations. I'm a little deep into another project, atm.
07 Dec 2007, 05:34
edfed

Joined: 20 Feb 2006
Posts: 4330
Location: Now
edfed 07 Dec 2007, 05:51
so, if i correctlly understand, it can be better to do nothing and continue to end use the existing OSes!
07 Dec 2007, 05:51
bitRAKE

Joined: 21 Jul 2003
Posts: 3976
Location: vpcmipstrm
bitRAKE 07 Dec 2007, 17:33
edfed wrote:
so, if i correctlly understand, it can be better to do nothing and continue to end use the existing OSes!
I didn't say that!

We should be mindful of time granularity (packet/quanta) if the algorithm will be used on very large (greater than cache) data sets.

Additionally, if the algorithm will not be executed frequently then use no data - try to encode routine within instructions without memory access (code load is also memory access). This is because a single cache miss can take longer than the algorithm. Stack space is usually in cache, so that isn't so bad.
07 Dec 2007, 17:33
bitRAKE

Joined: 21 Jul 2003
Posts: 3976
Location: vpcmipstrm
bitRAKE 09 Dec 2007, 22:25
Here is a paper that talk a little about the effects of memory on algorithm choices: How Caching Affects Hashing by Gregory L. Heileman and Wenbin Luo 2005.
09 Dec 2007, 22:25
 Display posts from previous: All Posts1 Day7 Days2 Weeks1 Month3 Months6 Months1 Year Oldest FirstNewest First

 Jump to: Select a forum Official----------------AssemblyPeripheria General----------------MainTutorials and ExamplesDOSWindowsLinuxUnixMenuetOS Specific----------------MacroinstructionsOS ConstructionIDE DevelopmentProjects and IdeasNon-x86 architecturesHigh Level LanguagesProgramming Language DesignCompiler Internals Other----------------FeedbackHeapTest Area
Goto page Previous  1, 2, 3  Next

Forum Rules:
 You cannot post new topics in this forumYou cannot reply to topics in this forumYou cannot edit your posts in this forumYou cannot delete your posts in this forumYou cannot vote in polls in this forumYou cannot attach files in this forumYou can download files in this forum