flat assembler
Message board for the users of flat assembler.
Index
> Main > Calculating multi-dimensional array displacements, how? Goto page Previous 1, 2, 3 Next |
Author |
|
LocoDelAssembly 05 Dec 2007, 17:23
I have made an extra version that is compatible with MMX PAVGB. The code also verifies the compatibility
Code: include 'win32ax.inc' MMX_OUTPUT_COMPATIBLE = 1 .code start: mov esi, $03020100 .loop: xor ebp, ebp .innerLoop: mov eax, esi mov edx, ebp call mmx_pavgb mov edi, eax mov eax, esi mov edx, ebp call alu_pavgb cmp eax, edi jne .invalidResult add ebp, $01010101 jnc .innerLoop add esi, $04040404 jnc .loop invoke MessageBox, 0, "Fully compatible!! :D", "Packed average test", MB_ICONINFORMATION invoke ExitProcess, 0 .invalidResult: invoke MessageBox, 0, "There was an incompatible result :(", "Packed average test", MB_ICONERROR invoke ExitProcess, 0 mmx_pavgb: movd mm0, eax movd mm1, edx pavgb mm0, mm1 movd eax, mm0 ret alu_pavgb: mov ebx, eax mov ecx, edx and eax, not $01010101 and edx, not $01010101 and ebx, $01010101 and ecx, $01010101 add eax, edx rcr eax, 1 if defined MMX_OUTPUT_COMPATIBLE or ebx, ecx ; Rounds up else and ebx, ecx ; Rounds down end if add eax, ebx ret .end start |
|||
05 Dec 2007, 17:23 |
|
bitRAKE 05 Dec 2007, 21:28
Good work, LocoDelAssembly. The rounding value could be masked after they are joined together - saving two instructions, but no improvement in speed, imho.
I just wanted to mention that this particular algorithm doesn't require multi-dimentional arrays - both imagines can be addressed like basic arrays if they are the same size. In the case where they are different sizes, one will advance like a one-dimentional array and the other will need an offset added for each line. It is a good example of how the data model colapses at the code level. |
|||
05 Dec 2007, 21:28 |
|
LocoDelAssembly 05 Dec 2007, 21:53
Quote:
I can't see what do you say I can do |
|||
05 Dec 2007, 21:53 |
|
bitRAKE 05 Dec 2007, 23:50
Code: mov edx, eax and eax, not $01010101 and edx, ebx and ebx, not $01010101 and edx, $01010101 add eax, ebx rcr eax, 1 add eax, edx ret |
|||
05 Dec 2007, 23:50 |
|
xspeed 06 Dec 2007, 00:26
why dont you post your source code in any form of HL and we will see what you are talking about.
I found that when you are dealing with matrix/multi-array it is best to use repition fuction. post the alogarithm in a HL form so we see what you are talking about. |
|||
06 Dec 2007, 00:26 |
|
LocoDelAssembly 06 Dec 2007, 01:37
bitRAKE, I'm very dissapointed of myself now
Now I modified my code according to your suggestions so we can still choose for MMX compatibility. Code: alu_pavgb: mov ecx, edx and edx, not $01010101 if defined MMX_OUTPUT_COMPATIBLE or ecx, eax ; Rounds up else and ecx, eax ; Rounds down end if and eax, not $01010101 and ecx, $01010101 add eax, edx rcr eax, 1 add eax, ecx ret And now it is fastcall compatible (ebx is not destroyed anymore and arguments are passed as fastcall like before). Of course by no means this must be used as a PROC, one have to inline it always. Thanks bitRAKE!! |
|||
06 Dec 2007, 01:37 |
|
edfed 06 Dec 2007, 11:26
Quote:
when coding in asm, algorythms are in asm. make some pseudo code in HL or C or basic doesn't help to understand. personally, when i read asm code, i understand the idea there is behind the instructions. learning asm, understand asm is a long and hard walk. but after, you can easily understand every languages. like learning latin helps to learn frensh, spanish, italian etc... here, the name of the forum is "board.flatassembler", not 'HLA". |
|||
06 Dec 2007, 11:26 |
|
bitRAKE 06 Dec 2007, 14:56
LocoDelAssembly wrote: bitRAKE, I'm very dissapointed of myself now |
|||
06 Dec 2007, 14:56 |
|
edfed 06 Dec 2007, 16:00
Code: ;eax=pxl1 32 bit ;ebx=pxl2 32 bit mov ecx, eax mov edx, ebx and eax, not $01010101 and ebx, not $01010101 and ecx, $01010101 and edx, $01010101 add eax, ebx rcr eax, 1 ;add edx,ecx ;yes, ADD, because of 1+1=2 ; false: 0/20;;;need to shr 1 : 1+1=2 and edx, ecx ;;;; NO;;;; because 1 AND 1=1 ;solution add eax, edx ;if cached, then use U and V pipelines, only 5 CPU cycles. ;for 24 bit, 16 bit and 15 bit, just change the AND_MASKs ;AND MASK can be in EDI and/or ESI Last edited by edfed on 06 Dec 2007, 20:18; edited 1 time in total |
|||
06 Dec 2007, 16:00 |
|
LocoDelAssembly 06 Dec 2007, 18:42
edfed, no, ADD no, but if you use it then don't forget to use SHR, otherwise you are over valuating the least significant bit. In fact, the reason of why I used more ANDs than needed before is because I was going to do it like you (plus the SHR), but even before writting SHR I realised that AND can do the task of the ADD/SHR pair so I used an AND but my brain got a hardcoded view about AND acting as ADD so I wasn't able to see the redundant instructions
PS: And yes, I tried your code, it is incompatible with both, round up (MMX) and round down. Also note that you even have to mask the result even after the SHR, otherwise it keeps incompatible Code: ; Code below is equivalent to "and edx, ecx" (on this particular context) add edx,ecx ;yes, ADD, because of 1+1=2 shr edx, 1 and edx, $01010101 |
|||
06 Dec 2007, 18:42 |
|
edfed 06 Dec 2007, 20:16
yes,
sorry, Code:
and edx,ecx
not this: Code: add edx,ecx shr edx,1 and edx,01010101h sorry! i don't tes the code i post, i don't code for 32 bit graphics now. so i don't test it. sorry! 0/20 peu mieux faire |
|||
06 Dec 2007, 20:16 |
|
edfed 06 Dec 2007, 20:24
and what about the mean of 9 pixels?
Code: cell dd 00,10,20 ;0,1,2 dd 01,11,21 ;3,4,5 dd 02,12,22 ;6,7,8 ret ; cell 11 = sygma cell 00->22 ( 0->8 ) / Ncell the fastest as possible for 386 and MMX. |
|||
06 Dec 2007, 20:24 |
|
bitRAKE 06 Dec 2007, 21:07
PSADBW could be used along with a shift to average many pixels. Shuffling around the bytes will be a little work - faster than 9 loads though. ALU would be 9x(MOVZX/ADD)+SHR for each compenent (haven't put much effort in thinking up a faster routine).
|
|||
06 Dec 2007, 21:07 |
|
bitRAKE 06 Dec 2007, 22:29
Just thought up a neat idea for the neighborhood average problem: could keep a running average and just update for the pixels that changed. Moving horizontally/vertically three pixels need to be added and three subtracted to get the sum of the next pixel. Still need to divide by nine - which could be done with a multiply instruction.
Total[n+1] = Total[n] - (Pixel[n-y-1] + Pixel[n-1] + Pixel[n+y-1]) + (Pixel[n-y+2]+Pixel[n+2]+Pixel[n+y+2]) Pixel[n+1] = Total[n+1] / 9 One third less processing right there for the ALU code. (c: IIRC, the average can be done pairwise with each nieghboring pixel separately for relatively the same effect -- greater rounding error. Edit: for the range of numbers [0,<1200h] the following will divide EDX by nine (which is sufficient for the adverage of nine pixels): Code: imul edx,0E38E4h shr edx,23 Found a better choice: Code: imul edx,1C72h shr edx,16 |
|||
06 Dec 2007, 22:29 |
|
edfed 07 Dec 2007, 00:31
how many FPS for a 800*600 32 bit?
just to know if it is the best choise for a default system function. defining test as an antialiasing filter, operating on a decompressed video. no? the size of the cell can change, [3*3],[9*9],[10*10], coeficients for each cell can change too. 66%, cos (45° )/2, etc... by treading one byte per word. two bytes per dword four bytes per qword etc... the cell can make the circular mean of a pixelized picture. ... Code: ; 00,01,02 ; 10,11,12 ; 20,21,23 cell: ;settings for screen13h .size dd @f-$-4 .00 dd -320-1 .01 dd -320 .02 dd -320+1 .10 dd -1 .11 dd 0 .12 dd +1 .20 dd +320-1 .21 dd +320 .22 dd +320+1 cellmean: mov esi,[cell] mov edi,[cell+esi] mov eax,[pixel+buffer0+edi] mov ecx,eax shr ecx,8 and eax,00FF00FFh and ecx,00FF00FFh @@: sub esi,4 jl @f mov edi,[esi] mov ebx;[pixel+buffer+edi] mov edx,ebx shr edx,8 and ebx,00FF00FFh and edx,00FF00FFh add eax,ebx add ecx,edx jmp @b @@: and eax,00FF00FFh and ecx,00FF00FFh shl ecx,8 add eax,ecx mov esi,[cell] shr esi,2 idiv eax,esi mov [pixel+buffer1],eax the_end: ;ret this algorythm is slow ! and the borders need to be 0, black, null |
|||
07 Dec 2007, 00:31 |
|
bitRAKE 07 Dec 2007, 05:34
edfed wrote: how many FPS for a 800*600 32 bit? No doubt this routine would be memory bound for full screen operations. I'm a little deep into another project, atm. |
|||
07 Dec 2007, 05:34 |
|
edfed 07 Dec 2007, 05:51
so, if i correctlly understand, it can be better to do nothing and continue to end use the existing OSes!
|
|||
07 Dec 2007, 05:51 |
|
bitRAKE 07 Dec 2007, 17:33
edfed wrote: so, if i correctlly understand, it can be better to do nothing and continue to end use the existing OSes! We should be mindful of time granularity (packet/quanta) if the algorithm will be used on very large (greater than cache) data sets. Additionally, if the algorithm will not be executed frequently then use no data - try to encode routine within instructions without memory access (code load is also memory access). This is because a single cache miss can take longer than the algorithm. Stack space is usually in cache, so that isn't so bad. |
|||
07 Dec 2007, 17:33 |
|
bitRAKE 09 Dec 2007, 22:25
Here is a paper that talk a little about the effects of memory on algorithm choices: How Caching Affects Hashing by Gregory L. Heileman and Wenbin Luo 2005.
|
|||
09 Dec 2007, 22:25 |
|
Goto page Previous 1, 2, 3 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.