flat assembler
Message board for the users of flat assembler.

Index > Windows > Optimizing pseudo random generator

Author
Thread Post new topic Reply to topic
nmake



Joined: 13 Sep 2012
Posts: 193
nmake
This pseudo algorithm by George Marsaglia is the one I am using. And I am disappointed by the performance, I can create about 1.8 billion pseudo random numbers using 4 cores per second. I've seen people do up to 2 billion numbers on just a single core using c++. That's miles away from the performance of my implementation.

Could anyone take a look and see where it can be improved?
Code:
; random generator for main thread
proc random uses esi,bufptr
        ; initialize sse registers
        mov eax,num_65535
        mov edx,mainseed
        mov esi,[bufptr]

        movaps xmm7,[eax]          ; xmm7=65535   (PERMANENT)
        movaps xmm6,[eax+16]       ; xmm6=16      (PERMANENT)
        movaps xmm5,[eax+32]       ; xmm5=36969   (PERMANENT)
        movaps xmm4,[eax+48]       ; xmm4=18000   (PERMANENT)
        movaps xmm0,[edx]          ; xmm0=V       (Re-Use, Re-Calculate)
        movaps xmm1,[edx+16]       ; xmm1=U       (Re-Use, Re-Calculate)
        movaps xmm2,xmm0           ; xmm2=V       (Re-Use, Re-Calculate)
        movaps xmm3,xmm1           ; xmm3=U       (Re-Use, Re-Calculate)

        mov ecx,[count]
        add ecx,[remainder]

        ; Generate random numbers
        align 16
    .loop:
        pand xmm0,xmm7     ; V AND 65535
        pand xmm1,xmm7     ; U AND 65535

        ; Multiply
        pmulld xmm0,xmm5   ; * 36969
        pmulld xmm1,xmm4   ; * 18000

        ; Shift right V and U by 16
        psrad xmm2,xmm6    ; V SHR 16
        psrad xmm3,xmm6    ; U SHR 16

        ; Add (xmm0 and xmm1 now contains complete new values)
        paddd xmm0,xmm2    ; (V AND 65535) + (V SHR 16)
        paddd xmm1,xmm3    ; (U AND 65535) + (U SHR 16)

        ; Store new V and U in xmm2 and xmm3
        movdqa xmm2,xmm0
        movdqa xmm3,xmm1

        ; SHL V by 16 on the return value registers
        pslld xmm2,xmm6 ; V SHL 16
        paddd xmm2,xmm1;paddq xmm2,xmm1 ; V + U

        ; save random value
        movntdq [esi],xmm2;movdqa [edi],xmm2

        ; move new V and U back to the altered xmm2 and xmm3 from xmm0 and xmm1
        movaps xmm2,xmm0
        movaps xmm3,xmm1

        add esi,16
        sub ecx,1
        jnz .loop

        ret
endp

section '.data' data readable writeable
        align 16
        mainseed     rd 8    ; unique seeds to main thread
        align 16
        num_65535    dd 65535
                     dd 65535
                     dd 65535
                     dd 65535

        num_16       dd 16
                     dd 0
                     dd 0
                     dd 0

        num_36969    dd 36969
                     dd 36969
                     dd 36969
                     dd 36969

        num_18000    dd 18000
                     dd 18000
                     dd 18000
                     dd 18000

    
The algorithm in c++ view:
Code:
private static uint GetUint()
{
    V = 36969 * (V & 65535) + (V >> 16);
    U = 18000 * (U & 65535) + (U >> 16);
    return (V << 16) + U;
}    
Algorithm explained in steps:

1. set V to a non-zero initial value.
2. set U to a non-zero initial value.

3. multiply L.O word of V by 36969
4. add H.O word of V to the result

5. Repeat steps 3-4 for U but multiply by 18000 instead
6. SHIFT left V and add U and return it

V and U are both 32 bit unsigned integers. I hope there are better instructions to use and better way to implement it.
Post 12 Dec 2012, 06:24
View user's profile Send private message Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 730
tthsqe
this seems to be memory bound.

If I take out the writes to memory (the two movntdq), I get ~1.8billion numbers per second on one core of an i5-3320M.

If I do write the results to memory (un-comment the two movntdq), I get ~0.8billion numbers per second.

I would just say that your computer has a low memory bandwidth.

And, do you know if the c++ test you mentioned was writing the results back to memory?

this is what I was testing:

Code:
format PE GUI 5.0
entry Start

include 'win32ax.inc'

section '.text' code readable executable


;private static uint GetUint()
;{
;    V = 36969 * (V & 65535) + (V >> 16);
;    U = 18000 * (U & 65535) + (U >> 16);
;    return (V << 16) + U;
;}

random:      ; ecx: count , esi: buffer

        push  ebp
         mov  ebp,esp
         and  esp,-16
         sub  esp,64

         mov  eax,65535
        movd  xmm0,eax
      pshufd  xmm0,xmm0,0
      movdqa  dqword[esp+16*0],xmm0

         mov  eax,16
        movd  xmm0,eax
      pshufd  xmm0,xmm0,0
      movdqa  dqword[esp+16*1],xmm0

         mov  eax,36969
        movd  xmm0,eax
      pshufd  xmm0,xmm0,0
      movdqa  dqword[esp+16*2],xmm0

         mov  eax,18000
        movd  xmm0,eax
      pshufd  xmm0,xmm0,0
      movdqa  dqword[esp+16*3],xmm0


        shr  ecx,1     ; 2x unroll

        movaps xmm0,dqword[mainseed+16*0]       ; xmm0=V       (Re-Use, Re-Calculate)
        movaps xmm1,dqword[mainseed+16*1]       ; xmm1=U       (Re-Use, Re-Calculate)
        movaps xmm2,xmm0                ; xmm2=V       (Re-Use, Re-Calculate)
        movaps xmm3,xmm1                ; xmm3=U       (Re-Use, Re-Calculate)


        movaps xmm4,dqword[mainseed+16*1]       ; xmm4=V       (Re-Use, Re-Calculate)
        movaps xmm5,dqword[mainseed+16*2]       ; xmm5=U       (Re-Use, Re-Calculate)
        movaps xmm6,xmm4                ; xmm6=V       (Re-Use, Re-Calculate)
        movaps xmm7,xmm5                ; xmm7=U       (Re-Use, Re-Calculate)


        ; Generate random numbers
        align 64
    .l1:
        pand xmm0,dqword[esp+16*0]
        pand xmm1,dqword[esp+16*0]
        pand xmm4,dqword[esp+16*0]
        pand xmm5,dqword[esp+16*0]
        psrad xmm2,dqword[esp+16*1]
        psrad xmm3,dqword[esp+16*1]
        psrad xmm6,dqword[esp+16*1]
        psrad xmm7,dqword[esp+16*1]
        pmulld xmm0,dqword[esp+16*2]
        pmulld xmm1,dqword[esp+16*3]
        pmulld xmm4,dqword[esp+16*2]
        pmulld xmm5,dqword[esp+16*3]
        paddd xmm0,xmm2
        paddd xmm1,xmm3
        paddd xmm4,xmm6
        paddd xmm5,xmm7
        movdqa xmm2,xmm0
        movdqa xmm3,xmm1
        movdqa xmm6,xmm4
        movdqa xmm7,xmm5

        pslld xmm0,dqword[esp+16*1]
        pslld xmm4,dqword[esp+16*1]
        paddd xmm0,xmm1
        paddd xmm4,xmm5

   ;     movntdq[esi+16*0],xmm0
   ;     movntdq[esi+16*1],xmm4

        movdqa xmm0,xmm2
        movdqa xmm4,xmm6

        add esi,32
        sub ecx,1 
        jnz .l1


        mov  esp,ebp
        pop  ebp
        ret



Start:               invoke  QueryPerformanceFrequency,Frequency

                        mov  eax,[Count]
                        shl  eax,4
                     invoke  VirtualAlloc,0,eax,MEM_COMMIT,PAGE_READWRITE
                        mov  [Result],eax

                     invoke  QueryPerformanceCounter,Time

                        mov  ecx,[Count]
                        mov  esi,[Result]
                       call  random

                        mov  esi,dword[Time+4*0]
                        mov  edi,dword[Time+4*1]
                     invoke  QueryPerformanceCounter,Time
                        sub  dword[Time+4*0],esi
                        sbb  dword[Time+4*1],edi

                       fild  qword[Time]
                       fild  qword[Frequency]
                       fild  dword[Count]
                      fmulp  st1,st0
                     fdivrp  st1,st0
                       fadd  st0,st0      ; 4 numbers per count
                       fadd  st0,st0      ;
                       fstp  qword[Rate]
                     invoke  sprintf,Message,MessageFormat,double [Rate]
                     invoke  MessageBoxA,0,Message,Caption,MB_OK

                     invoke  VirtualFree,[Result],0,MEM_RELEASE

                     invoke  ExitProcess,0




section '.data' data readable writeable

align 16
        mainseed   dd 1,2,3,4
                   dd 5,6,7,8
                   dd 1,3,5,6
                   dd 7,5,4,3

align 8
  Time dq 0
  Frequency dq 0
  Rate dq 0

align 4
  Result dd 0
  Count dd 20000000

align 1
  Caption db 'test',0
  MessageFormat db 'numbers per second: %f',0
  Message rb 200



section '.idata' import data readable writeable

  library kernel32,'KERNEL32.DLL',\
          user32,'USER32.DLL',\
          gdi32,'GDI32.DLL',\
          msvcrt,'MSVCRT.DLL'

include 'api\kernel32.inc'
include 'api\user32.inc'
include 'api\gdi32.inc'


 import  msvcrt,\
            sprintf,'sprintf'    
Post 12 Dec 2012, 09:05
View user's profile Send private message Reply with quote
nmake



Joined: 13 Sep 2012
Posts: 193
nmake
I do not know if the c++ program wrote anything back. But I do know it was windows 64 and not 32. You can read about it here: http://www.dimkovic.com/node/22
Post 12 Dec 2012, 09:22
View user's profile Send private message Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 730
tthsqe
well, ok - I think that answers it.
You have every reason to relax, dude. Smile
That code does not write the results back to a huge array
and was run on a 3.8 GHz Core i7 2687W.
My slightly modified version of your code got 1.8 billion numbers per second on my 3.3 GHz i5-3320M.
We have 3.8/3.3 * 1.8 billion = 2 billion expected numbers per second on their machine.

Try to write some 64 bit code and get it up to 3 billion numbers per second. Smile
Post 12 Dec 2012, 10:34
View user's profile Send private message Reply with quote
cod3b453



Joined: 25 Aug 2004
Posts: 619
cod3b453
I think you can reduce the calculation for both U and V to a single instruction:
Code:
; xmm1 = 0001 4650 0001 9069
; xmm0 = UUUU uuuu VVVV vvvv
pmaddwd xmm0,xmm1
; xmm0 = (18000*u)+(U shr 16) (36969*v)+(V shr 16)    
Then something like punpcklwd and paddd to get the (V shl 16) + U step Question
Post 12 Dec 2012, 19:02
View user's profile Send private message Reply with quote
nmake



Joined: 13 Sep 2012
Posts: 193
nmake
tthsqe this is fast for 32 bit. I think, it can probably become faster too. cod3b453, good find. I will try it, I have to admit I was peeking at the mul/add instruction but never thought of doing it your way. Very smart way, I'm sure it will give a good boost.

Can we convert this algo into using a base without using division in the end? A base of 100 outputs a number between 0-99 naturally. But division is costy. Btw. more smart tricks and implementations is wanted. Very Happy
Post 12 Dec 2012, 23:24
View user's profile Send private message Reply with quote
nmake



Joined: 13 Sep 2012
Posts: 193
nmake
I managed to set it all up using the pmaddwd to run 2 billion iterations in 481 milliseconds, without implementing the punpcklwd and paddd yet. But I believe this will become very fast. (on a single core) and this is using only 1 sse register with data, if you add 8 of them together, it will become tremendously fast.
Post 13 Dec 2012, 01:15
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on YouTube, Twitter.

Website powered by rwasa.