flat assembler
Message board for the users of flat assembler.

MMX u2h trial proc

hopcode 12 Feb 2010, 01:57
Hello all,
following this http://board.flatassembler.net/topic.php?p=25330#25330
I have made my first attempt at MMX instructions, as SIMD mental training. Here is a little u2h proc. I don't know much about MMX dependencies, stalls, etc., nor whether it is useful or how fast it is. I was trying to think parallel as far as I could.
The goal was to convert a stream of aligned bytes to ASCII hex text.
The proc reads an entire QWORD and converts it in a single pass:
1 QWORD (8 stream bytes) -> 16 bytes of ASCII text
Updated:
104 bytes and improved (but not doubled) speed
Code:
; stream   db 1Ah,2Bh,3Ch,4Dh,0ABh,0CDh,0EFh,22h
;          dd 0
; result   dq 0,0                     ; room for 16 ASCII characters
;          dd 0                       ; string terminator
; maskF0   dq 0F0F0'F0F0'F0F0'F0F0h
; mask09   dq 0909090909090909h
; mask30   dq 3030303030303030h
; mask07   dq 0707070707070707h
; szFormat db ">> %s",13,10,0

mmx_u2h:
        ;----- fixed block -------
        movq    mm4,[mask09]
        movq    mm5,[mask07]
        movq    mm6,[mask30]
        movq    mm7,[maskF0]

        ;----- loader ------------
        movq    mm0,qword[stream]   ; load 8 bytes of stream

        movq    mm1,mm7             ; working copy of maskF0
        movq    mm2,mm0             ; copy the 8 bytes to mm2
        pand    mm0,mm1             ; keep the high nibble of each byte
        psrlq   mm0,4               ; move each high nibble down to bits 0..3
        pandn   mm1,mm2             ; mm1 = (not maskF0) and mm2 = low nibbles (destroys the mask copy)
        movq    mm3,mm0             ; copy, because punpcklbw overwrites mm0
        punpcklbw mm0,mm1           ; interleave high/low nibbles of bytes 0..3
        punpckhbw mm3,mm1           ; interleave high/low nibbles of bytes 4..7
        movq    mm1,mm0             ; keep the nibble values, pcmpgtb overwrites them
        movq    mm2,mm3
        pcmpgtb mm0,mm4             ; set to FF every byte greater than 09
        pcmpgtb mm3,mm4

        pand    mm0,mm5             ; 07 where the nibble is A..F, else 00
        paddb   mm0,mm6             ; add 30 ('0') to each byte
        paddb   mm0,mm1             ; add the nibble value itself
        movq    [result],mm0        ;<------ text of 1st DWORD (8 chars)

        pand    mm3,mm5             ; same as above
        paddb   mm3,mm6
        paddb   mm3,mm2
        movq    [result+8],mm3      ;<------ text of 2nd DWORD (8 chars), stored after the first 8

        emms                        ; leave MMX state before any FPU code runs

;       push result
;       push szFormat
;       cinvoke printf
;       add esp,8

        ret
    


Cheers,
hopcode


Last edited by hopcode on 12 Feb 2010, 22:38; edited 3 times in total
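
For readers who want to follow the routine step by step, here is a minimal sketch that models the same sequence in Python rather than fasm. Each MMX register is represented as a list of 8 byte values (index 0 is the lowest-addressed byte). The helper names (`punpcklbw`, `punpckhbw`, `nibbles_to_ascii`, `u2h_qword`) are illustrative assumptions, not part of the original post.

```python
# Byte-wise model of the MMX steps; a "register" is a list of 8 byte values.

def punpcklbw(dst, src):
    """Interleave the low 4 bytes: dst0, src0, dst1, src1, ..."""
    return [b for pair in zip(dst[:4], src[:4]) for b in pair]

def punpckhbw(dst, src):
    """Interleave the high 4 bytes: dst4, src4, dst5, src5, ..."""
    return [b for pair in zip(dst[4:], src[4:]) for b in pair]

def nibbles_to_ascii(reg):
    """Model of pcmpgtb (09h), pand (07h), paddb (30h), paddb (nibble)."""
    out = []
    for n in reg:
        mask = 0xFF if n > 9 else 0x00   # pcmpgtb; values are 0..15, so the signed compare is safe
        out.append(((mask & 0x07) + 0x30 + n) & 0xFF)
    return out

def u2h_qword(stream):
    """Convert 8 bytes to 16 ASCII hex characters."""
    assert len(stream) == 8
    hi = [(b & 0xF0) >> 4 for b in stream]   # pand maskF0, then psrlq 4
    lo = [b & 0x0F for b in stream]          # pandn maskF0
    first = nibbles_to_ascii(punpcklbw(hi, lo))
    second = nibbles_to_ascii(punpckhbw(hi, lo))
    return bytes(first + second).decode("ascii")

stream = bytes([0x1A, 0x2B, 0x3C, 0x4D, 0xAB, 0xCD, 0xEF, 0x22])
print(u2h_qword(stream))  # 1A2B3C4DABCDEF22
```

The `+7` step works because 'A' (41h) sits 7 positions after ':' (3Ah), the character that follows '9' in ASCII.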
revolution 12 Feb 2010, 04:38
Isn't this a binary to ASCII hex converter? So not really an i2a, more of a u2h?
edemko 12 Feb 2010, 05:01
Could anyone comment the code step by step? It looks interesting, but I cannot figure it out.
bitshifter 12 Feb 2010, 06:55
I suggest looking up each instruction in the Intel manuals...
baldr 12 Feb 2010, 08:52
revolution,

His code converts 8 bytes to their [concatenated] hexadecimal representation. It can be called u2h if those bytes are a big-endian qword.
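
A small Python illustration of the byte-order point (not from the original post; the sample bytes are the `stream` values from the first post). The concatenated hex text matches the hex form of the number only when the 8 bytes are read as a big-endian qword:

```python
data = bytes([0x1A, 0x2B, 0x3C, 0x4D, 0xAB, 0xCD, 0xEF, 0x22])

text = data.hex().upper()                 # bytes converted in memory order
big = int.from_bytes(data, "big")         # same bytes read as a big-endian qword
little = int.from_bytes(data, "little")   # same bytes read as a little-endian qword (x86 native)

print(text)               # 1A2B3C4DABCDEF22
print(f"{big:016X}")      # 1A2B3C4DABCDEF22
print(f"{little:016X}")   # 22EFCDAB4D3C2B1A
```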

__________
hopcode,

Can you explain why you unpack nibbles from each dword and not from the entire qword? For example:
Code:
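        ; mask0F dq 0F0F0F0F0F0F0F0Fh
        movq    mm0, qword[stream]
        movq    mm2, mm0
        psrlq   mm0, 4          ; high nibbles move down to bits 0..3
        movq    mm1, [mask0F]
        pand    mm0, mm1        ; mm0 = high nibbles
        pand    mm2, mm1        ; mm2 = low nibbles
        movq    mm1, mm0
        punpcklbw mm0, mm2      ; bytes 0..3: high nibble first, then low
        punpckhbw mm1, mm2      ; bytes 4..7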
hopcode 12 Feb 2010, 15:38
Updated: added some comments to the source.
baldr wrote:
...why... not... nibbles from entire qword ?

Simply because I am used to thinking in DWORDs. But you are 100% right, because it is a SIMD instruction. So, why not?
Also, I have updated the proc: size reduced from 124 to 104 bytes. Thanks for the tip. Now, the endianness should be handled by a bswap on each dword at the load stage (or better, by inverting the operands of the 2 punpcklbw instructions).
Or... suggestions?
revolution wrote:
u2h
Yes.

    u: yes, unsigned
    2: yes, conversion
    h: yes, hex; more specific than the "a" of ASCII, and unambiguous because it is a conversion


BTW: IMHO, these SIMD instructions create dependencies in every direction.
Cheers,
hopcode
baldr 12 Feb 2010, 17:26
hopcode wrote:
Now, the endianness should be handled by a bswap on each dword at the load stage (or better, by inverting the operands of the 2 punpcklbw instructions).
Or... suggestions?
Swapping the punpck(l|h)bw operands will swap the nibbles; you definitely wouldn't want that. pshufb can be used to shuffle bytes (SSSE3, Core2/Conroe+).
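
Both points can be checked with a short Python model (a sketch, not fasm; `punpcklbw`, `pshufb64`, `to_text` and the `BSWAP_DWORDS` control mask are illustrative names). The model of the 64-bit `pshufb` form uses the low 3 bits of each control byte as the source index and writes zero when bit 7 is set:

```python
def punpcklbw(dst, src):
    """Interleave the low 4 bytes: dst0, src0, dst1, src1, ..."""
    return [b for pair in zip(dst[:4], src[:4]) for b in pair]

def pshufb64(reg, ctrl):
    """64-bit pshufb: out[i] = 0 if ctrl[i] bit 7 is set, else reg[ctrl[i] & 7]."""
    return [0 if c & 0x80 else reg[c & 0x07] for c in ctrl]

def to_text(nibbles):
    return "".join("0123456789ABCDEF"[n] for n in nibbles)

stream = [0x1A, 0x2B, 0x3C, 0x4D, 0xAB, 0xCD, 0xEF, 0x22]
hi = [b >> 4 for b in stream]
lo = [b & 0x0F for b in stream]

print(to_text(punpcklbw(hi, lo)))   # 1A2B3C4D  (high nibble as destination)
print(to_text(punpcklbw(lo, hi)))   # A1B2C3D4  (operands swapped: nibbles swapped)

# Control mask that reverses the bytes inside each dword, like bswap on both halves.
BSWAP_DWORDS = [3, 2, 1, 0, 7, 6, 5, 4]
swapped = pshufb64(stream, BSWAP_DWORDS)
print(bytes(swapped).hex().upper())  # 4D3C2B1A22EFCDAB
```

Applying the shuffle to the loaded bytes before the nibble split changes the byte order only; the high/low nibble order inside each byte is untouched.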
edemko 14 Feb 2010, 16:18
Code:
;value -> eax
;mm0   <- result
;mm1   <- 8x($00) | 8x($07)
;mm2   <- 8x($30)
;mm3   <- 8x($0f)
proc mmx_dword2hex
        push    $09090909 $09090909 ;make mask
        movq    mm2,[esp]           ;load mask
        push    $0f0f0f0f $0f0f0f0f ;make mask
        movq    mm3,[esp]           ;load mask
        lea     esp,[esp+16]        ;restore stack keeping flags
        bswap   eax                 ;$1234abcd -> $cdab3412
        movd    mm0,eax             ;$cdab3412
        bswap   eax                 ;$1234abcd
        movq    mm1,mm0             ;$cdab3412
        psrlq   mm0,4               ;$0cdab341
        punpcklbw mm0,mm1           ;$cd0cabda'34b31241
        pand    mm0,mm3             ;$0d0c0b0a'04030201 = 8x(0..15)
        movq    mm1,mm0             ;'0' + 0..9..15 = '0'..'9'..':'..'?'; provide ':'..'?' -> 'A'..'F' if any
        pcmpgtb mm1,mm2             ;$ffffffff'00000000; no more need in a $09 mask, make it a $30 one
        paddq   mm2,mm3             ;$18181818'18181818 = 8x(00011000b)
        psllq   mm2,1               ;$30303030'30303030 = 8x(00110000b)
        por     mm0,mm2             ;$3d3c3b3a'34333231 = 8x(0011xxxxb)
        pand    mm1,mm3             ;$0f0f0f0f'00000000
        psrlq   mm1,1               ;$07878787'80000000
        pand    mm1,mm3             ;$07070707'00000000
        paddq   mm0,mm1             ;$44434241'34333231 = '1234ABCD'
        ret                         ;good luck
endp

/* OllyDbg dump
CPU Disasm
Address    Hex dump          Command                                Comments
00402033   /$  68 09090909   push    9090909
00402038   |.  68 09090909   push    9090909
0040203D   |.  0F6F1424      movq    mm2,[qword ss:esp]
00402041   |.  68 0F0F0F0F   push    0F0F0F0F
00402046   |.  68 0F0F0F0F   push    0F0F0F0F
0040204B   |.  0F6F1C24      movq    mm3,[qword ss:esp]
0040204F   |.  8D6424 10     lea     esp,[esp+10]
00402053   |.  0FC8          bswap   eax
00402055   |.  0F6EC0        movd    mm0,eax
00402058   |.  0FC8          bswap   eax
0040205A   |.  0F6FC8        movq    mm1,mm0
0040205D   |.  0F73D0 04     psrlq   mm0,4
00402061   |.  0F60C1        punpcklbw mm0,mm1
00402064   |.  0FDBC3        pand    mm0,mm3
00402067   |.  0F6FC8        movq    mm1,mm0
0040206A   |.  0F64CA        pcmpgtb mm1,mm2
0040206D   |.  0FD4D3        paddq   mm2,mm3
00402070   |.  0F73F2 01     psllq   mm2,1
00402074   |.  0FEBC2        por     mm0,mm2
00402077   |.  0FDBCB        pand    mm1,mm3
0040207A   |.  0F73D1 01     psrlq   mm1,1
0040207E   |.  0FDBCB        pand    mm1,mm3
00402081   |.  0FD4C1        paddq   mm0,mm1
00402084   \.  C3            retn
*/
    

Thanks for your explanations
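
The register comments in the routine above can be replayed with a small Python model (a sketch under the assumption that each MMX register is a 64-bit integer; `bswap32`, `punpcklbw`, `pcmpgtb` and `dword2hex` are illustrative helper names). It also shows how the 30h and 07h masks are derived from the 09h and 0Fh masks instead of being loaded:

```python
M64 = (1 << 64) - 1
MASK09 = 0x0909090909090909
MASK0F = 0x0F0F0F0F0F0F0F0F

def bswap32(x):
    return int.from_bytes(x.to_bytes(4, "little"), "big")

def punpcklbw(dst, src):
    d, s = dst.to_bytes(8, "little"), src.to_bytes(8, "little")
    return int.from_bytes(bytes(b for p in zip(d[:4], s[:4]) for b in p), "little")

def pcmpgtb(a, b):
    """Signed per-byte compare: FF where a > b, else 00."""
    s8 = lambda v: v - 256 if v & 0x80 else v
    pairs = zip(a.to_bytes(8, "little"), b.to_bytes(8, "little"))
    return int.from_bytes(bytes(0xFF if s8(x) > s8(y) else 0 for x, y in pairs), "little")

def dword2hex(eax):
    mm2, mm3 = MASK09, MASK0F
    mm0 = bswap32(eax)             # bswap eax / movd mm0,eax
    mm1 = mm0
    mm0 >>= 4                      # psrlq mm0,4
    mm0 = punpcklbw(mm0, mm1)
    mm0 &= mm3                     # 8 nibbles, one per byte
    mm1 = pcmpgtb(mm0, mm2)        # FF where nibble > 9
    mm2 = (mm2 + mm3) & M64        # 09h + 0Fh = 18h per byte
    mm2 = (mm2 << 1) & M64         # 18h << 1 = 30h per byte
    mm0 |= mm2
    mm1 &= mm3
    mm1 >>= 1
    mm1 &= mm3                     # 07h where nibble > 9
    mm0 = (mm0 + mm1) & M64
    return mm0.to_bytes(8, "little").decode("ascii")

print(dword2hex(0x1234ABCD))  # 1234ABCD
```

The additions never carry between bytes here (09h + 0Fh and 3xh + 07h both stay below 100h), which is why a 64-bit add gives the same result as a per-byte add in this routine.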
hopcode 15 Feb 2010, 00:34
I have got an idea!
I think that when one uses SIMD, one should use it 100%, not for one single DWORD.
bswap is OK, but I think it should be handled separately, unless one uses conditional assembly or writes 2 different procs.

This new version reads 16 bytes (4 dwords) per loop iteration and outputs 32 chars of text. It is useful, for example, in a hex viewer (I am currently rebuilding mine).
Testing it with a personal timer recipe (one I found on the board and slightly adapted), under normal priority it outputs 32 chars in 22.5 cycles for 16 KB of memory containing dwords.
I have got better performance (16-18 cycles), but I am currently studying dependencies/stalls.

Usage:
Code:
stream db 1Ah,2Bh,3Ch,4Dh,0ABh,0CDh,0EFh,22h
           db 11h,22h,33h,44h,55h,66h,77h,88h
result   db 32 dup (0)

  ;IN EAX = source, 8-byte aligned
  ;IN EDX = dest, 8-byte aligned
  ;IN ECX = source size in bytes (only whole 16-byte blocks are converted)

  mov edx,result
  mov eax,stream
  mov ecx,16
  call mmx_u2h
    


Code:
align 4
mmx_u2h:
  ;----- fixed block -------
  push ebp
  push ebx
  push edi
  push esi

  shr ecx,4           ; ECX = number of 16-byte blocks
  xchg edi,edx        ; EDI = dest
  xchg esi,eax        ; ESI = source
  xchg ebp,ecx        ; EBP = block counter

  push 09090909h
  push 09090909h
  mov eax,esp

  push 07070707h
  push 07070707h
  mov ebx,esp

  push 30303030h
  push 30303030h
  mov ecx,esp

  push 0xF0F0F0F0
  push 0xF0F0F0F0
  mov edx,esp

  test ebp,ebp        ; fewer than 16 source bytes?
  jz .mmx_u2hB        ; then there is nothing to convert

  ;----- loader ------------
.mmx_u2hA:
  movq mm0,[esi]      ; load 8 bytes of stream
  movq mm4,[esi+8]    ; load next 8 bytes
  movq mm7,[edx]      ; MM7 = maskF0_
  movq mm1,mm0        ; copy MM0
  movq mm5,mm4        ; copy MM4
  pand mm0,mm7        ; AND F0_ 
  pand mm4,mm7        ; AND F0_
  psrlq mm0,4         ; R SHIFT 
  psrlq mm7,4         ; R SHIFT maskF0 -> 0F_ 
  psrlq mm4,4         ; R SHIFT
  pand mm1,mm7        ; AND mask0F_
  pand mm5,mm7        ; AND mask0F_
  movq mm3,mm0        ; copy MM0
  movq mm7,mm4        ; copy MM4
  punpcklbw mm0,mm1   ; interleave bytes in low MM0 with those in low MM1
  punpcklbw mm4,mm5   ; interleave bytes in low MM4 with those in low MM5
  punpckhbw mm3,mm1   ; interleave bytes in hi  MM3 with those in hi  MM1
  punpckhbw mm7,mm5   ; interleave bytes in hi  MM7 with those in hi MM5
  movq mm1,[eax]      ; copy mask09_
  movq mm2,mm0        ; copy MM0
  movq mm5,[ecx]      ; copy mask30_ 
  pcmpgtb mm0,mm1     ; set to FF bytes greater than in mask 09_
  pand mm0,[ebx]      ; AND FF bytes to mask 07_
  paddb mm0,mm2       ; ADD each byte 0-9 rests
  paddb mm0,mm5       ; ADD each byte 30_
  movq mm6,mm4        ; copy MM4
  movq [edi],mm0          ;<------ 1st DWORD
  pcmpgtb mm4,mm1     ; set to FF bytes greater than in mask 09_
  movq mm0,[ebx]      ; copy mask 07_
  movq mm2,mm3        ; copy interleaved bytes
  pand mm4,mm0        ; AND FF bytes to mask 07_
  paddb mm4,mm6       ; ADD each byte 0-9 rests
  paddb mm4,mm5       ; ADD each byte 30_
  movq mm6,mm7        ; copy interleaved bytes
  movq [edi+16],mm4   ;<------ 3rd DWORD
  pcmpgtb mm3,mm1     ;set to FF bytes greater than in mask 09_
  pcmpgtb mm7,mm1     ;set to FF bytes greater than in mask 09_
  pand mm3,mm0        ; AND FF bytes to mask 07_
  pand mm7,mm0        ; AND FF bytes to mask 07_
  paddb mm3,mm2       ; ADD each byte 0-9 rests
  paddb mm7,mm6       ; ADD each byte 0-9 rests
  paddb mm3,mm5       ; ADD each byte 30_
  paddb mm7,mm5       ; ADD each byte 30_
  movq [edi+8],mm3    ;<------ 2nd DWORD;
  movq [edi+24],mm7   ;<------ 4th DWORD;

  add esi,16
  add edi,32
  dec ebp
  jnz .mmx_u2hA

.mmx_u2hB:
  emms                ; leave MMX state so later FPU code works
  add esp,32          ; drop the four 8-byte masks
  pop esi
  pop edi
  pop ebx
  pop ebp
  ret
    


Cheers,
hopcode
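
As a reference for checking the output of the block version, here is a minimal Python sketch of what the loop is expected to produce. It uses the standard `bytes.hex()` instead of modelling each MMX instruction; `u2h_stream` is an illustrative name, and the input is the `stream` data from the usage example.

```python
def u2h_stream(src: bytes) -> bytes:
    """Model of the block loop: 16 source bytes in, 32 ASCII chars out."""
    blocks = len(src) >> 4                  # shr ecx,4
    out = bytearray()
    for i in range(blocks):                 # dec ebp / jnz
        block = src[16 * i:16 * i + 16]     # the two qwords at [esi] and [esi+8]
        out += block.hex().upper().encode("ascii")  # the four stores at [edi]..[edi+24]
    return bytes(out)

stream = bytes([0x1A, 0x2B, 0x3C, 0x4D, 0xAB, 0xCD, 0xEF, 0x22,
                0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88])

print(u2h_stream(stream).decode())         # 1A2B3C4DABCDEF221122334455667788
print(len(u2h_stream(stream + b"\x99")))   # 32: a trailing partial block is not converted
```

In this model an input shorter than 16 bytes yields empty output, and any remainder after the last whole block is ignored.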