flat assembler
Message board for the users of flat assembler.
MMX u2h trial proc
hopcode 12 Feb 2010, 01:57
Hello all,
following this: http://board.flatassembler.net/topic.php?p=25330#25330 I have made my first attempt at MMX instructions, as mental training in SIMD. Here is a little u2h proc. I do not know much about MMX dependencies, stalls, etc., nor whether it is useful or how fast it is. I tried to think in parallel as far as I could.
The goal was to convert a stream of aligned bytes to ASCII hex text. The proc reads the whole input and converts it in a single pass: 1 QWORD (8 stream bytes) -> 16 bytes of ASCII text.
Updated: 104 bytes and improved (but not doubled) speed.
Code:
; stream   db 1Ah,2Bh,3Ch,4Dh,0ABh,0CDh,0EFh,22h
;          dd 0
; result   dq 0.0
;          dd 0
; maskF0   dq 0F0F0'F0F0'F0F0'F0F0h
; mask09   dq 0909090909090909h
; mask30   dq 3030303030303030h
; mask07   dq 0707070707070707h
; szFormat db ">> %s",13,10,0
mmx_u2h:
;----- fixed block -------
        movq    mm4,[mask09]
        movq    mm5,[mask07]
        movq    mm6,[mask30]
        movq    mm7,[maskF0]
;----- loader ------------
        movq    mm0,qword[stream]   ; load 8 bytes of stream
        movq    mm1,mm7             ; copy mask F0_
        movq    mm2,mm0             ; copy 8 bytes to mm2
        pand    mm0,mm1             ; AND 8 bytes with F0_ (keep the high nibbles)
        psrlq   mm0,4               ; R SHIFT the high nibbles down to 0..15
        pandn   mm1,mm2             ; NOT(F0_) AND stream = low nibbles in MM1 (this destroys the mask copy)
        movq    mm3,mm0             ; COPY to avoid MM0 destruction
        punpcklbw mm0,mm1           ; interleave bytes in low MM0 with those in low MM1
        punpckhbw mm3,mm1           ; interleave bytes in hi MM3 with those in hi MM1
        movq    mm1,mm0             ;<--- copy
        movq    mm2,mm3             ;<--- copy
        pcmpgtb mm0,mm4             ; set to FF the bytes greater than those in mask 09_
        pcmpgtb mm3,mm4
        pand    mm0,mm5             ; AND the FF bytes with mask 07_
        paddb   mm0,mm6             ; add 30_ to each byte
        paddb   mm0,mm1             ; add the nibble values (0-15)
        movq    [result],mm0        ;<------ 1st DWORD
;       push    result
;       push    szFormat
;       cinvoke printf
;       add     esp,8
        pand    mm3,mm5             ; same as above
        paddb   mm3,mm6
        paddb   mm3,mm2
        movq    [result],mm3        ;<------ 2nd DWORD (reuses the same buffer; each half is printed separately)
;       push    result
;       push    szFormat
;       cinvoke printf
;       add     esp,8
Cheers,
hopcode
Last edited by hopcode on 12 Feb 2010, 22:38; edited 3 times in total
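Editor's note: the compare/mask/add step in the proc above (pcmpgtb against 09, pand with 07, paddb with 30) can be modelled in portable C on a 64-bit integer. This is only a sketch of the idea, not the author's code: the helper names (gt9, nibbles_to_ascii, u2h_qword) are made up for illustration, and the "+6" carry trick stands in for pcmpgtb, which plain C does not have.

```c
#include <stdint.h>

#define M01 0x0101010101010101ULL
#define M06 0x0606060606060606ULL
#define M07 0x0707070707070707ULL
#define M30 0x3030303030303030ULL

/* Stand-in for pcmpgtb against 09h: returns FFh in every byte whose
   value (assumed to be 0..15) is greater than 9, else 00h. */
static uint64_t gt9(uint64_t nibbles)
{
    /* n + 6 sets bit 4 exactly when n is 10..15; no carry leaves the byte */
    uint64_t flags = ((nibbles + M06) >> 4) & M01;
    return flags * 0xFF;            /* 01h -> FFh in each flagged byte */
}

/* Eight values 0..15 (one per byte) -> eight ASCII hex digits. */
static uint64_t nibbles_to_ascii(uint64_t nibbles)
{
    /* '0' + n, plus 7 for n > 9 so that 10..15 land on 'A'..'F' */
    return nibbles + M30 + (gt9(nibbles) & M07);
}

/* Hypothetical helper: 8 source bytes -> 16 hex characters + terminator. */
void u2h_qword(const uint8_t src[8], char out[17])
{
    for (int half = 0; half < 2; half++) {
        uint64_t pairs = 0;
        /* same byte layout as the punpcklbw/punpckhbw interleave */
        for (int i = 0; i < 4; i++) {
            uint64_t b = src[4 * half + i];
            pairs |= (b >> 4)   << (16 * i);        /* high nibble first */
            pairs |= (b & 0x0F) << (16 * i + 8);    /* then low nibble   */
        }
        uint64_t text = nibbles_to_ascii(pairs);
        for (int i = 0; i < 8; i++)
            out[8 * half + i] = (char)((text >> (8 * i)) & 0xFF);
    }
    out[16] = '\0';
}
```

Feeding it the 8 stream bytes from the post should produce the same text the MMX proc prints in two halves.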
revolution 12 Feb 2010, 04:38
Isn't this a binary to ASCII hex converter? So not really an i2a, more of a u2h?
edemko 12 Feb 2010, 05:01
Could anyone comment the code step by step? It looks interesting, but I cannot figure it out.
bitshifter 12 Feb 2010, 06:55
I suggest looking up each instruction in the Intel manuals...
hopcode 12 Feb 2010, 15:38
Updated: added some comments to the source.
baldr wrote:
...why... not... nibbles from entire qword ?
Simply because I am used to thinking in DWORDs. But you are 100% right, because it is a SIMD instruction. So, why not? I have also updated the proc: size reduced from 124 to 104 bytes. Thanks for the tip.
Now, the endianness should be handled by a bswap on each dword at the load stage (or better, by inverting the operands of the 2 punpcklbw instructions), or... suggestions?
revolution wrote:
u2h
u: yes, unsigned
2: yes, conversion
h: yes, more precise than a-scii, without ambiguity, because it is a conversion
btw: IMHO, these SIMD instructions create dependencies in every direction.
Cheers,
hopcode
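Editor's note: the endianness point can be made concrete with a small C sketch. It assumes the dword is stored the x86 way (least significant byte first), so converting the bytes in memory order prints them "reversed", and a bswap before the conversion restores the reading order of the value. All helper names here are hypothetical and only illustrate the effect; they are not taken from the posted code.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain C equivalent of the bswap instruction. */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Hex text of n bytes in memory order; out needs 2*n + 1 chars. */
static void bytes_to_hex(const uint8_t *src, size_t n, char *out)
{
    static const char digits[] = "0123456789ABCDEF";
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = digits[src[i] >> 4];
        out[2 * i + 1] = digits[src[i] & 0x0F];
    }
    out[2 * n] = '\0';
}

/* Store a dword as x86 does: least significant byte first. */
static void store_le32(uint32_t v, uint8_t mem[4])
{
    for (int i = 0; i < 4; i++)
        mem[i] = (uint8_t)(v >> (8 * i));
}

/* Converts the stored bytes as they lie in memory (no bswap). */
void dword_hex_memory_order(uint32_t v, char out[9])
{
    uint8_t mem[4];
    store_le32(v, mem);
    bytes_to_hex(mem, 4, out);
}

/* Same conversion, but with a bswap at the "load stage". */
void dword_hex_value_order(uint32_t v, char out[9])
{
    uint8_t mem[4];
    store_le32(bswap32(v), mem);
    bytes_to_hex(mem, 4, out);
}
```

For a hex viewer that shows raw bytes, the memory-order form is usually what is wanted; the bswap form matters when the text should read like the dword value.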
baldr 12 Feb 2010, 17:26
hopcode wrote:
Now, the endianness should be handled by a bswap on each dword at the load stage (or better, by inverting the operands of the 2 punpcklbw instructions.)
edemko 14 Feb 2010, 16:18
Code:
;value -> eax
;mm0 <- result
;mm1 <- 8x($00) | 8x($07)
;mm2 <- 8x($30)
;mm3 <- 8x($0f)
proc mmx_dword2hex
        push    $09090909 $09090909 ;make mask
        movq    mm2,[esp]           ;load mask
        push    $0f0f0f0f $0f0f0f0f ;make mask
        movq    mm3,[esp]           ;load mask
        lea     esp,[esp+16]        ;restore stack keeping flags
        bswap   eax                 ;$1234abcd -> $cdab3412
        movd    mm0,eax             ;$cdab3412
        bswap   eax                 ;$1234abcd
        movq    mm1,mm0             ;$cdab3412
        psrlq   mm0,4               ;$0cdab341
        punpcklbw mm0,mm1           ;$cd0cabda'34b31241
        pand    mm0,mm3             ;$0d0c0b0a'04030201 = 8x(0..15)
        movq    mm1,mm0             ;'0' + 0..9..15 = '0'..'9'..':'..'?'; provide ':'..'?' -> 'A'..'F' if any
        pcmpgtb mm1,mm2             ;$ffffffff'00000000; no more need in a $09 mask, make it a $30 one
        paddq   mm2,mm3             ;$18181818'18181818 = 8x(00011000b)
        psllq   mm2,1               ;$30303030'30303030 = 8x(00110000b)
        por     mm0,mm2             ;$3d3c3b3a'34333231 = 8x(0011xxxxb)
        pand    mm1,mm3             ;$0f0f0f0f'00000000
        psrlq   mm1,1               ;$07878787'80000000
        pand    mm1,mm3             ;$07070707'00000000
        paddq   mm0,mm1             ;$44434241'34333231 = '1234ABCD'
        ret                         ;good luck
endp
/*
OllyDbg dump
CPU Disasm
Address   Hex dump       Command
00402033  68 09090909    push 9090909
00402038  68 09090909    push 9090909
0040203D  0F6F1424       movq mm2,[qword ss:esp]
00402041  68 0F0F0F0F    push 0F0F0F0F
00402046  68 0F0F0F0F    push 0F0F0F0F
0040204B  0F6F1C24       movq mm3,[qword ss:esp]
0040204F  8D6424 10      lea esp,[esp+10]
00402053  0FC8           bswap eax
00402055  0F6EC0         movd mm0,eax
00402058  0FC8           bswap eax
0040205A  0F6FC8         movq mm1,mm0
0040205D  0F73D0 04      psrlq mm0,4
00402061  0F60C1         punpcklbw mm0,mm1
00402064  0FDBC3         pand mm0,mm3
00402067  0F6FC8         movq mm1,mm0
0040206A  0F64CA         pcmpgtb mm1,mm2
0040206D  0FD4D3         paddq mm2,mm3
00402070  0F73F2 01      psllq mm2,1
00402074  0FEBC2         por mm0,mm2
00402077  0FDBCB         pand mm1,mm3
0040207A  0F73D1 01      psrlq mm1,1
0040207E  0FDBCB         pand mm1,mm3
00402081  0FD4C1         paddq mm0,mm1
00402084  C3             retn
*/
Thanks for your explanations
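Editor's note: the mask arithmetic in this proc (deriving the $30 mask from $09 + $0f shifted left, and the $07 fix-up from the compare result) can be followed in a C model that keeps each MMX register in a 64-bit integer. This is a sketch under assumptions: gt9 replaces pcmpgtb with a carry trick, punpcklbw is emulated with a loop, and dword2hex_model is a made-up name. The intermediate values should match the comments in the listing above.

```c
#include <stdint.h>

#define M01 0x0101010101010101ULL
#define M06 0x0606060606060606ULL
#define M09 0x0909090909090909ULL
#define M0F 0x0F0F0F0F0F0F0F0FULL

/* Plain C equivalent of the bswap instruction. */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Stand-in for pcmpgtb against 09h, valid for byte values 0..15. */
static uint64_t gt9(uint64_t nibbles)
{
    return (((nibbles + M06) >> 4) & M01) * 0xFF;
}

/* Emulates punpcklbw a,b: low four bytes interleaved as a0 b0 a1 b1 ... */
static uint64_t punpcklbw(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        r |= ((a >> (8 * i)) & 0xFF) << (16 * i);
        r |= ((b >> (8 * i)) & 0xFF) << (16 * i + 8);
    }
    return r;
}

/* Hypothetical model of the proc: returns what mm0 would hold at ret. */
uint64_t dword2hex_model(uint32_t eax)
{
    uint64_t mm2 = M09, mm3 = M0F;
    uint64_t mm0 = bswap32(eax);        /* movd zero-extends to 64 bits   */
    uint64_t mm1 = mm0;
    mm0 >>= 4;                          /* psrlq mm0,4                    */
    mm0 = punpcklbw(mm0, mm1);          /* high/low nibble byte pairs     */
    mm0 &= mm3;                         /* 8 x (0..15)                    */
    mm1 = gt9(mm0);                     /* pcmpgtb mm1,mm2                */
    mm2 = (mm2 + mm3) << 1;             /* 09h + 0Fh = 18h, doubled = 30h */
    mm0 |= mm2;                         /* '0'..'?'                       */
    mm1 = ((mm1 & mm3) >> 1) & mm3;     /* FFh -> 0Fh -> 07h where n > 9  */
    mm0 += mm1;                         /* ':'..'?' -> 'A'..'F'           */
    return mm0;
}
```

The shift tricks work because no per-byte sum or shift result here overflows into the neighbouring byte once the final mask is applied.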
hopcode 15 Feb 2010, 00:34
I have got an idea
I think that when one uses SIMD, one should use it 100%, not for one single DWORD. bswap is OK, but I think it should be handled separately, unless one uses conditional compilation or writes 2 different procs. This new version reads 16 bytes (4 dwords per loop iteration) and outputs 32 chars of text. It is useful, for example, in a hex viewer (I am currently rebuilding mine).
Testing it with a personal timer recipe (one I found on the board and slightly readapted), under normal priority it outputs 32 chars in 22.5 cycles, measured over 16 KB of memory containing dwords. I have got better performance (16-18 cycles), but I am currently studying dependencies/stalls.
Usage:
Code:
stream db 1Ah,2Bh,3Ch,4Dh,0ABh,0CDh,0EFh,22h
       db 11h,22h,33h,44h,55h,66h,77h,88h
result db 32 dup (0)
;IN EAX = 8 aligned source
;IN EDX = 8 aligned dest
;IN ECX = source size
        mov     edx,result
        mov     eax,stream
        mov     ecx,16
        call    mmx_u2h
Code:
align 4
mmx_u2h:
;----- fixed block -------
        push    ebp
        push    ebx
        push    edi
        push    esi
        shr     ecx,4               ; number of 16-byte blocks
        xchg    edi,edx             ; EDI = dest
        xchg    esi,eax             ; ESI = source
        xchg    ebp,ecx             ; EBP = block counter
        push    09090909h
        push    09090909h
        mov     eax,esp             ; EAX -> mask 09_
        push    07070707h
        push    07070707h
        mov     ebx,esp             ; EBX -> mask 07_
        push    30303030h
        push    30303030h
        mov     ecx,esp             ; ECX -> mask 30_
        push    0xF0F0F0F0
        push    0xF0F0F0F0
        mov     edx,esp             ; EDX -> mask F0_
        test    ebp,ebp             ; fewer than 16 bytes: nothing to convert
        jz      .mmx_u2hB           ; (otherwise dec/jnz below would wrap around)
;----- loader ------------
.mmx_u2hA:
        movq    mm0,[esi]           ; load 8 bytes of stream
        movq    mm4,[esi+8]         ; load next 8 bytes
        movq    mm7,[edx]           ; MM7 = maskF0_
        movq    mm1,mm0             ; copy MM0
        movq    mm5,mm4             ; copy MM4
        pand    mm0,mm7             ; AND F0_
        pand    mm4,mm7             ; AND F0_
        psrlq   mm0,4               ; R SHIFT
        psrlq   mm7,4               ; R SHIFT maskF0 -> 0F_
        psrlq   mm4,4               ; R SHIFT
        pand    mm1,mm7             ; AND mask0F_
        pand    mm5,mm7             ; AND mask0F_
        movq    mm3,mm0             ; copy MM0
        movq    mm7,mm4             ; copy MM4
        punpcklbw mm0,mm1           ; interleave bytes in low MM0 with those in low MM1
        punpcklbw mm4,mm5           ; interleave bytes in low MM4 with those in low MM5
        punpckhbw mm3,mm1           ; interleave bytes in hi MM3 with those in hi MM1
        punpckhbw mm7,mm5           ; interleave bytes in hi MM7 with those in hi MM5
        movq    mm1,[eax]           ; copy mask09_
        movq    mm2,mm0             ; copy MM0
        movq    mm5,[ecx]           ; copy mask30_
        pcmpgtb mm0,mm1             ; set to FF bytes greater than in mask 09_
        pand    mm0,[ebx]           ; AND FF bytes with mask 07_
        paddb   mm0,mm2             ; ADD the nibble values (0-15)
        paddb   mm0,mm5             ; ADD 30_ to each byte
        movq    mm6,mm4             ; copy MM4
        movq    [edi],mm0           ;<------ 1st DWORD
        pcmpgtb mm4,mm1             ; set to FF bytes greater than in mask 09_
        movq    mm0,[ebx]           ; copy mask 07_
        movq    mm2,mm3             ; copy interleaved bytes
        pand    mm4,mm0             ; AND FF bytes with mask 07_
        paddb   mm4,mm6             ; ADD the nibble values (0-15)
        paddb   mm4,mm5             ; ADD 30_ to each byte
        movq    mm6,mm7             ; copy interleaved bytes
        movq    [edi+16],mm4        ;<------ 3rd DWORD
        pcmpgtb mm3,mm1             ; set to FF bytes greater than in mask 09_
        pcmpgtb mm7,mm1             ; set to FF bytes greater than in mask 09_
        pand    mm3,mm0             ; AND FF bytes with mask 07_
        pand    mm7,mm0             ; AND FF bytes with mask 07_
        paddb   mm3,mm2             ; ADD the nibble values (0-15)
        paddb   mm7,mm6             ; ADD the nibble values (0-15)
        paddb   mm3,mm5             ; ADD 30_ to each byte
        paddb   mm7,mm5             ; ADD 30_ to each byte
        movq    [edi+8],mm3         ;<------ 2nd DWORD
        movq    [edi+24],mm7        ;<------ 4th DWORD
        add     esi,16
        add     edi,32
        dec     ebp
        jnz     .mmx_u2hA
.mmx_u2hB:
        add     esp,32              ; drop the four 8-byte masks
        pop     esi
        pop     edi
        pop     ebx
        pop     ebp
        ret
Cheers,
hopcode
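Editor's note: when experimenting with the scheduling of a routine like this, a scalar reference is handy for checking that the 32 output characters stay correct. The C function below is only a hypothetical test helper (ref_u2h is not part of the posted code). It assumes the same contract as the usage comments: source, destination, and a source size in bytes, of which only whole 16-byte blocks are converted, mirroring the shr ecx,4.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for the 16-bytes-per-iteration loop.
   Converts only whole 16-byte blocks and returns the number of
   characters written (32 per block). No terminator is added. */
size_t ref_u2h(const uint8_t *src, char *dst, size_t size)
{
    static const char digits[] = "0123456789ABCDEF";
    size_t blocks = size >> 4;              /* same as shr ecx,4 */

    for (size_t i = 0; i < blocks * 16; i++) {
        dst[2 * i]     = digits[src[i] >> 4];    /* high nibble first */
        dst[2 * i + 1] = digits[src[i] & 0x0F];  /* then low nibble   */
    }
    return blocks * 32;
}
```

Comparing its output with the MMX result buffer after each change is a cheap way to catch a misplaced store offset or a swapped register.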
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.