flat assembler
Message board for the users of flat assembler.
idle 12 May 2012, 18:20
Code:
    ...
NextPixel:
    and dword[esi],$00'ff'00'00 ;= 255 shl 16
    add esi,4
    loop NextPixel
shl/r work like a meat chopper - NOT a circle
hi Andy, i recognize myself in you, good luck!
Andy 12 May 2012, 18:54
Thanks guys, I'll try to understand all your code.
Quote: is it zeroing the other channels?
Yes, in this case it was for the blue channel. Actually, this code is just meant to generate raw binary code that is then run from an HLL.
Br, Andy
shutdownall 12 May 2012, 22:10
idle wrote:
This would execute faster, I think, by using one more register (ebx). Note that the data is modified in descending order (not ascending), but that is probably not important.
Code:
    ...
    mov ebx,$00ff0000
    sub esi,4
NextPixel:
    and dword[esi+ecx*4],ebx
    loop NextPixel
By the way, this code does not do exactly the same as the code from the TE, which also sets the alpha channel to 255 (0ffh). That needs a second register (edx) to avoid immediates inside the loop, so the example needs one additional line:
Code:
    ...
    mov ebx,$00ff0000
    mov edx,$ff000000
    sub esi,4
NextPixel:
    and dword[esi+ecx*4],ebx
    or dword[esi+ecx*4],edx
    loop NextPixel
This could be an alternative to the code above, using two more registers (eax is used by lodsd, edi is needed for stosd). The memory area is processed in ascending order (with cld). I'm not sure whether it would be faster than the code above; that may need some experimenting.
Code:
    cld
    mov ebx,$00ff0000
    mov edx,$ff000000
    mov edi,esi
NextPixel:
    lodsd
    and eax,ebx
    or eax,edx
    stosd
    loop NextPixel
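For reference, the per-pixel operation that the and/or loops above perform can be restated in C. This is only a sketch of the intended semantics, not a replacement for the assembly: the function name `mask_pixels` is made up for illustration, and the pixel layout (a 32-bit value whose top byte is alpha) is an assumption read off the two masks.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper (name is an assumption, not from the thread):
   keep only the channel selected by $00ff0000 and force the top
   (alpha) byte to 0ffh, which is what the and/or pair does. */
static void mask_pixels(uint32_t *pixels, size_t count)
{
    const uint32_t keep  = 0x00FF0000u;  /* plays the role of ebx */
    const uint32_t alpha = 0xFF000000u;  /* plays the role of edx */

    for (size_t i = 0; i < count; i++)
        pixels[i] = (pixels[i] & keep) | alpha;
}
```

Whether the buffer is walked upwards (as here and in the lodsd/stosd version) or downwards (as in the `[esi+ecx*4]` version) does not change the result, since every pixel is processed independently.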
idle 13 May 2012, 00:13
shutdownall, great!
The last variant gives ~1/3 better results than the one above it.
Code:
    format pe gui 4.0
    include 'win32ax.inc'
    section '' code import readable writable
      library kernel32,'kernel32.dll',\
              user32,'user32.dll'
      include 'api\kernel32.inc'
      include 'api\user32.inc'
    entry $
      rept 2 ?:2{
        invoke GetTickCount
        push eax              ;time in
        push 10000            ;etc counter
    @@: mov esi,base
        mov ecx,(base.-base)/4
        mov ebx,$00ff0000
        mov edx,$ff000000
        call variant#?        ;variant2/3
        dec dword[esp]
        jnz @b
        pop eax               ;kill counter
        invoke GetTickCount   ;+time out
        neg dword[esp]        ;-time in
        add [esp],eax }       ;=time
      invoke wsprintfA,lpOut,lpFmt ,,
      add esp,8*4
      invoke MessageBoxA,0,lpOut,0,0
      invoke ExitProcess,eax
    variant2:
        sub esi,4
    .NextPixel:
        and [esi+ecx*4],ebx
        or [esi+ecx*4],edx
        loop .NextPixel
        ret
    variant3:
        cld
        mov edi,esi
    .NextPixel:
        lodsd
        and eax,ebx
        or eax,edx
        stosd
        loop .NextPixel
        ret
    base: file '1.asm'
    .:
    lpFmt db 'variant3: %u ms',10,'variant2: %u ms',0
    lpOut rb MAX_PATH
idle 13 May 2012, 00:17
Can we make the counter of rept (and of the other directives) count downwards, e.g. 3,2..?
revolution 13 May 2012, 00:24
idle wrote: can we make rept(and others') counter count downwards: e.g. 3,2..
Code:
    rept ... { reverse ... }
LostCoder 14 May 2012, 13:05
You can try changing
Code:
    loop .NextPixel
to
Code:
    dec ecx
    jnz .NextPixel
and also
Code:
    dec ecx
to
Code:
    sub ecx,1
Madis731 16 May 2012, 10:01
You won't need to change the direction of counting if you change the direction with STD and CLD. At loop entry you should always know how much to count, and you can always count down by setting ecx to this count first.
And yes, the LOOP instruction is usually slow. On the other hand, REPx STOSx/MOVSx/CMPSx are faster when you meet certain criteria. http://www.agner.org/optimize/optimizing_assembly.pdf (16.10 String instructions) "REP MOVSD and REP STOSD are quite fast if the repeat count is not too small."
shutdownall 16 May 2012, 11:51
I tested myself and was surprised.
The difference between loop and jnz is not very big: jnz is about 10% faster. But the difference between dec ecx and sub ecx,1 is very big: sub ecx,1 is about 75% faster than dec ecx (only about 25% of the overall execution time) in this loop:
Code:
NextPixel:
    lodsd
    and eax,ebx
    or eax,edx
    stosd
    ;loop NextPixel
    ;dec ecx
    sub ecx,1
    jnz NextPixel
Madis731 16 May 2012, 12:12
Not the only way to go faster is with the help of SSE/AVX.
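Code:
    ;init xmm1 to 00FF0000h x 4
    ;and xmm2 to FF000000h x 4
    ;rcx = byte count (must be a non-zero multiple of 16 for the loop to end)
    ;rsi must be 16-byte aligned for movdqa
nextPixelBlock:
    movdqa xmm0,[rsi]
    pand xmm0,xmm1
    por xmm0,xmm2
    movdqa [rsi],xmm0
    add rsi,16
    sub rcx,16
    jnz nextPixelBlock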
revolution 16 May 2012, 12:15
Madis731 wrote: Not the only way to go faster is with the help of SSE/AVX.
Madis731 16 May 2012, 12:32
Well, AVX is a relatively new thing, but safe to assume that SSE is supported.
I tested SSE, and some other approaches taken from previous posts.
i5 650 @ 3.2GHz
Variant1 - 921 ms
Variant2 - 2839 ms
Variant3 - 5663 ms
Variant4 - 6567 ms
Code:
    variant1:
        sub esi,16
        shl ecx,2
        movdqa xmm1,dqword[hex00FF0000]
        movdqa xmm2,dqword[hexFF000000]
    .NextPixelBlock:
        movdqa xmm0,[esi+ecx]
        pand xmm0,xmm1
        por xmm0,xmm2
        movdqa [esi+ecx],xmm0
        sub ecx,16
        jnz .NextPixelBlock
        ret
    variant2:
        sub esi,4
    .NextPixel:
        mov eax,[esi+ecx*4]
        and eax,ebx
        or eax,edx
        mov [esi+ecx*4],eax
        sub ecx,1
        jnz .NextPixel
        ret
    variant3:
        sub esi,4
    .NextPixel:
        and [esi+ecx*4],ebx
        or [esi+ecx*4],edx
        loop .NextPixel
        ret
    variant4:
        cld
        mov edi,esi
    .NextPixel:
        lodsd
        and eax,ebx
        or eax,edx
        stosd
        loop .NextPixel
        ret
Last edited by Madis731 on 16 May 2012, 12:51; edited 1 time in total
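The block-wise idea behind variant1 (four pixels per 128-bit register, pand then por) can also be sketched in C with SSE2 intrinsics. This is a rough illustration under stated assumptions, not the code that was timed above: the function name `mask_pixels_blocks` is hypothetical, the sketch uses unaligned loads and stores where variant1 uses movdqa (which needs a 16-byte aligned buffer), and it adds a scalar tail so the pixel count need not be a multiple of 4. When the compiler does not define `__SSE2__`, only the scalar loop is compiled.

```c
#include <stdint.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics (_mm_and_si128, _mm_or_si128) */
#endif

/* Hypothetical helper (name is an assumption): same result as the
   and/or loops, but four pixels per iteration where SSE2 is available. */
static void mask_pixels_blocks(uint32_t *pixels, size_t count)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i keep  = _mm_set1_epi32(0x00FF0000);        /* xmm1 */
    const __m128i alpha = _mm_set1_epi32((int)0xFF000000u);  /* xmm2 */

    for (; i + 4 <= count; i += 4) {
        /* unaligned load/store: no alignment requirement on pixels */
        __m128i v = _mm_loadu_si128((const __m128i *)(pixels + i));
        v = _mm_or_si128(_mm_and_si128(v, keep), alpha);
        _mm_storeu_si128((__m128i *)(pixels + i), v);
    }
#endif
    /* scalar tail for the last 0..3 pixels (or all of them without SSE2) */
    for (; i < count; i++)
        pixels[i] = (pixels[i] & 0x00FF0000u) | 0xFF000000u;
}
```

The scalar tail is a design choice of this sketch; variant1 as posted instead relies on the buffer length being a multiple of 16 bytes.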
revolution 16 May 2012, 12:42
Madis731 wrote: ... but safe to assume that SSE is supported.
Madis731 16 May 2012, 12:52
wiki wrote: ...designed by Intel and introduced in 1999 in their Pentium III series...
EDIT: ...or there's always MMX
LocoDelAssembly 16 May 2012, 13:03
shutdownall, what CPU have you used? Pentium IV?
shutdownall 16 May 2012, 14:10
LocoDelAssembly wrote: shutdownall, what CPU have you used? Pentium IV?
shutdownall 16 May 2012, 14:12
Madis, why didn't you try this one?
It should be somewhere nearer to your SSE timings.
shutdownall wrote:
bzdashek 16 May 2012, 18:26
shutdownall wrote: Madis, why didn't you try this one?
Isn't lodsd slower than mov eax,dword[esi]? I'm not sure, hence the question.
shutdownall 16 May 2012, 19:07
bzdashek wrote:
I tried it on my computer, and lodsd is quite a bit faster.
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.