flat assembler
Message board for the users of flat assembler.

Index > Main > Calculating multi-dimensional array displacements, how?

Goto page Previous  1, 2, 3  Next
Author
Thread Post new topic Reply to topic
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Code:
mov  ecx, eax
mov  edx, ebx

and  eax, not $01010101
and  ebx, not $01010101

and  ecx, $01010101
and  edx, $01010101

add  eax, ebx
rcr  eax, 1

and  edx, ecx ; Yes, AND

add  eax, edx
ret    


The rationale is that Int(A/2 + B/2) is below by one than Int((A + B)/2) only when A mod 2 = B mod 2 = 1.

Actually I should sleep now so perhaps I said non sense above Razz
Post 05 Dec 2007, 04:55
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
I have made an extra version that is compatible with MMX PAVGB. The code also verifies the compatibility
Code:
include 'win32ax.inc'

MMX_OUTPUT_COMPATIBLE = 1

.code

start:
  mov     esi, $03020100

.loop:
  xor     ebp, ebp

  .innerLoop:
    mov     eax, esi
    mov     edx, ebp
    call    mmx_pavgb

    mov     edi, eax

    mov     eax, esi
    mov     edx, ebp
    call    alu_pavgb

    cmp     eax, edi
    jne     .invalidResult

    add     ebp, $01010101
    jnc     .innerLoop

  add     esi, $04040404
  jnc     .loop

  invoke  MessageBox, 0, "Fully compatible!! :D", "Packed average test", MB_ICONINFORMATION
  invoke  ExitProcess, 0

.invalidResult:
  invoke  MessageBox, 0, "There was an incompatible result :(", "Packed average test", MB_ICONERROR
  invoke  ExitProcess, 0



mmx_pavgb:
  movd    mm0, eax
  movd    mm1, edx

  pavgb   mm0, mm1

  movd    eax, mm0
  ret

alu_pavgb:
  mov     ebx, eax
  mov     ecx, edx

  and     eax, not $01010101
  and     edx, not $01010101

  and     ebx, $01010101
  and     ecx, $01010101

  add     eax, edx
  rcr     eax, 1

if defined MMX_OUTPUT_COMPATIBLE
  or      ebx, ecx ; Rounds up
else
  and     ebx, ecx ; Rounds down
end if

  add     eax, ebx
  ret

.end start    
Post 05 Dec 2007, 17:23
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2942
Location: vpcmipstrm
bitRAKE
Good work, LocoDelAssembly. The rounding value could be masked after they are joined together - saving two instructions, but no improvement in speed, imho.

I just wanted to mention that this particular algorithm doesn't require multi-dimentional arrays - both imagines can be addressed like basic arrays if they are the same size. In the case where they are different sizes, one will advance like a one-dimentional array and the other will need an offset added for each line.

It is a good example of how the data model colapses at the code level.
Post 05 Dec 2007, 21:28
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Quote:

The rounding value could be masked after they are joined together - saving two instructions, but no improvement in speed, imho.

I can't see what do you say I can do Embarassed
Post 05 Dec 2007, 21:53
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2942
Location: vpcmipstrm
bitRAKE
Code:
mov  edx, eax
and  eax, not $01010101
and  edx, ebx
and  ebx, not $01010101
and  edx, $01010101
add  eax, ebx
rcr  eax, 1
add  eax, edx
ret    
Post 05 Dec 2007, 23:50
View user's profile Send private message Visit poster's website Reply with quote
xspeed



Joined: 16 Aug 2007
Posts: 22
xspeed
why dont you post your source code in any form of HL and we will see what you are talking about.

I found that when you are dealing with matrix/multi-array it is best to use repition fuction.

post the alogarithm in a HL form so we see what you are talking about.
Post 06 Dec 2007, 00:26
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
bitRAKE, I'm very dissapointed of myself now Sad

Now I modified my code according to your suggestions so we can still choose for MMX compatibility.

Code:
alu_pavgb:
  mov     ecx, edx
  and     edx, not $01010101

if defined MMX_OUTPUT_COMPATIBLE
  or      ecx, eax ; Rounds up
else
  and     ecx, eax ; Rounds down
end if

  and     eax, not $01010101
  and     ecx, $01010101

  add     eax, edx
  rcr     eax, 1

  add     eax, ecx
  ret    


And now it is fastcall compatible (ebx is not destroyed anymore and arguments are passed as fastcall like before). Of course by no means this must be used as a PROC, one have to inline it always.

Thanks bitRAKE!!
Post 06 Dec 2007, 01:37
View user's profile Send private message Reply with quote
edfed



Joined: 20 Feb 2006
Posts: 4240
Location: 2018
edfed
Quote:

post the alogarithm in a HL form so we see what you are talking about.


when coding in asm, algorythms are in asm.
make some pseudo code in HL or C or basic doesn't help to understand.

personally, when i read asm code, i understand the idea there is behind the instructions.
learning asm, understand asm is a long and hard walk.
but after, you can easily understand every languages.
like learning latin helps to learn frensh, spanish, italian etc...

here, the name of the forum is "board.flatassembler", not 'HLA".
Wink
Post 06 Dec 2007, 11:26
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2942
Location: vpcmipstrm
bitRAKE
LocoDelAssembly wrote:
bitRAKE, I'm very dissapointed of myself now Sad
Don't let that bother you. I always have this sense of missing something when I code - it never goes away.
Post 06 Dec 2007, 14:56
View user's profile Send private message Visit poster's website Reply with quote
edfed



Joined: 20 Feb 2006
Posts: 4240
Location: 2018
edfed
Code:
;eax=pxl1  32 bit
;ebx=pxl2  32 bit 
mov  ecx, eax 
mov  edx, ebx 
and  eax, not $01010101 
and  ebx, not $01010101 
and  ecx, $01010101 
and  edx, $01010101 
add  eax, ebx 
rcr  eax, 1       
;add edx,ecx    ;yes, ADD, because of 1+1=2 ; false: 0/20;;;need to shr 1 : 1+1=2
and  edx, ecx ;;;; NO;;;; because 1 AND 1=1       ;solution
add  eax, edx 
;if cached, then use U and V pipelines, only 5 CPU cycles.
;for 24 bit, 16 bit and 15 bit, just change the AND_MASKs
;AND MASK can be in EDI and/or ESI
    


Last edited by edfed on 06 Dec 2007, 20:18; edited 1 time in total
Post 06 Dec 2007, 16:00
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
edfed, no, ADD no, but if you use it then don't forget to use SHR, otherwise you are over valuating the least significant bit. In fact, the reason of why I used more ANDs than needed before is because I was going to do it like you (plus the SHR), but even before writting SHR I realised that AND can do the task of the ADD/SHR pair so I used an AND but my brain got a hardcoded view about AND acting as ADD so I wasn't able to see the redundant instructions Razz

PS: And yes, I tried your code, it is incompatible with both, round up (MMX) and round down. Also note that you even have to mask the result even after the SHR, otherwise it keeps incompatible
Code:
 ; Code below is equivalent to "and edx, ecx" (on this particular context)
add edx,ecx    ;yes, ADD, because of 1+1=2
shr edx, 1
and edx, $01010101
    
Post 06 Dec 2007, 18:42
View user's profile Send private message Reply with quote
edfed



Joined: 20 Feb 2006
Posts: 4240
Location: 2018
edfed
yes,
sorry,
Code:
         and edx,ecx
    

not this:
Code:
         add edx,ecx
         shr edx,1
         and edx,01010101h
    


sorry! i don't tes the code i post, i don't code for 32 bit graphics now. so i don't test it. sorry!
0/20 peu mieux faire Sad
Post 06 Dec 2007, 20:16
View user's profile Send private message Visit poster's website Reply with quote
edfed



Joined: 20 Feb 2006
Posts: 4240
Location: 2018
edfed
and what about the mean of 9 pixels?

Code:
cell
dd 00,10,20  ;0,1,2
dd 01,11,21  ;3,4,5 
dd 02,12,22  ;6,7,8 
ret
; cell 11 = sygma cell 00->22 ( 0->8 ) / Ncell
    

the fastest as possible for 386 and MMX.
Post 06 Dec 2007, 20:24
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2942
Location: vpcmipstrm
bitRAKE
PSADBW could be used along with a shift to average many pixels. Shuffling around the bytes will be a little work - faster than 9 loads though. ALU would be 9x(MOVZX/ADD)+SHR for each compenent (haven't put much effort in thinking up a faster routine).
Post 06 Dec 2007, 21:07
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2942
Location: vpcmipstrm
bitRAKE
Just thought up a neat idea for the neighborhood average problem: could keep a running average and just update for the pixels that changed. Moving horizontally/vertically three pixels need to be added and three subtracted to get the sum of the next pixel. Still need to divide by nine - which could be done with a multiply instruction.

Total[n+1] = Total[n] - (Pixel[n-y-1] + Pixel[n-1] + Pixel[n+y-1]) + (Pixel[n-y+2]+Pixel[n+2]+Pixel[n+y+2])

Pixel[n+1] = Total[n+1] / 9

One third less processing right there for the ALU code. (c:

IIRC, the average can be done pairwise with each nieghboring pixel separately for relatively the same effect -- greater rounding error.

Edit: for the range of numbers [0,<1200h] the following will divide EDX by nine (which is sufficient for the adverage of nine pixels):
Code:
imul edx,0E38E4h
shr edx,23    
...much faster than (I)DIV.

Found a better choice:
Code:
    imul edx,1C72h
      shr edx,16    
...works for all signed 16-bit values - extended to 32-bit of course. Works for MMX/SSE2 as well - because we have a high word multiply instruction. Now I can combine the PSADBW values multiply and store to memory - should be very fast.
Post 06 Dec 2007, 22:29
View user's profile Send private message Visit poster's website Reply with quote
edfed



Joined: 20 Feb 2006
Posts: 4240
Location: 2018
edfed
how many FPS for a 800*600 32 bit?
just to know if it is the best choise for a default system function.

defining test as an antialiasing filter, operating on a decompressed video.
no?
the size of the cell can change, [3*3],[9*9],[10*10], coeficients for each cell can change too. 66%, cos (45° )/2, etc...
by treading one byte per word.
two bytes per dword
four bytes per qword
etc...

the cell can make the circular mean of a pixelized picture.
...

Code:
; 00,01,02
; 10,11,12
; 20,21,23
cell:
;settings for screen13h
.size dd @f-$-4
.00 dd -320-1  
.01 dd -320
.02 dd -320+1
.10 dd -1
.11 dd 0
.12 dd +1
.20 dd +320-1
.21 dd +320
.22 dd +320+1
cellmean:
mov esi,[cell]
mov edi,[cell+esi]
mov eax,[pixel+buffer0+edi]
mov ecx,eax
shr ecx,8
and eax,00FF00FFh
and ecx,00FF00FFh
@@:
sub esi,4
jl @f
mov edi,[esi]
mov ebx;[pixel+buffer+edi]
mov edx,ebx
shr edx,8
and ebx,00FF00FFh
and edx,00FF00FFh
add eax,ebx 
add ecx,edx
jmp @b
@@:

and eax,00FF00FFh
and ecx,00FF00FFh
shl ecx,8
add eax,ecx
mov esi,[cell]
shr esi,2
idiv eax,esi
mov [pixel+buffer1],eax
the_end:
;ret
    


this algorythm is slow !
and the borders need to be 0, black, null
Post 07 Dec 2007, 00:31
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2942
Location: vpcmipstrm
bitRAKE
edfed wrote:
how many FPS for a 800*600 32 bit?
It is important to understand how slow memory is: for example, my laptop has a latency of 3, 11, and 200 cycles for accesses to L1, L2, and main memory, respectively. Reading a single byte (whole cacheline will be loaded) from uncached main memory is 200 cycles!

No doubt this routine would be memory bound for full screen operations. I'm a little deep into another project, atm.
Post 07 Dec 2007, 05:34
View user's profile Send private message Visit poster's website Reply with quote
edfed



Joined: 20 Feb 2006
Posts: 4240
Location: 2018
edfed
so, if i correctlly understand, it can be better to do nothing and continue to end use the existing OSes!
Post 07 Dec 2007, 05:51
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2942
Location: vpcmipstrm
bitRAKE
edfed wrote:
so, if i correctlly understand, it can be better to do nothing and continue to end use the existing OSes!
I didn't say that!

We should be mindful of time granularity (packet/quanta) if the algorithm will be used on very large (greater than cache) data sets.

Additionally, if the algorithm will not be executed frequently then use no data - try to encode routine within instructions without memory access (code load is also memory access). This is because a single cache miss can take longer than the algorithm. Stack space is usually in cache, so that isn't so bad.
Post 07 Dec 2007, 17:33
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2942
Location: vpcmipstrm
bitRAKE
Here is a paper that talk a little about the effects of memory on algorithm choices: How Caching Affects Hashing by Gregory L. Heileman and Wenbin Luo 2005.
Post 09 Dec 2007, 22:25
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2, 3  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on YouTube, Twitter.

Website powered by rwasa.