Fast and small Multiplicative Random number generator

Index > Main > Fast and small Multiplicative Random number generator

Goto page Previous 1, 2, 3

Author

Thread

hopcode

Joined: 04 Mar 2008
Posts: 563
Location: Germany

hopcode 05 Apr 2012, 07:05

AsmGuru62 wrote:

@hopcode: I looked at 'normal' distribution in Wikipedia and it resembles some kind of symmetrical curve (like a Bell Curve), so I assume that numbers are not generated in equal probability.

Hi all,
that symmetry is a "architectural". it tell us where/when the deviation happens but not directly the value of the outcome at that variance/deviation. rolling a dice, we have 1,2,3,4,5,6 as outcomes. considering 3.5 the mean, 1 and 6 have same variance but different values. it depends on where one chooses the mean. considering bytes, if we place the mean at 127.5 then 0 and 255, 1 and 254,2 and 253 may share a relationship that may lead to biased randomness.

"architectural" there denotes the fact that at the moment it is the basal structure for almost all happening phenomena in the universe ("almost" is after the fact that it is a continous function). they happens naturally "deviated" from an ideal "mean", and "almost" always according the rule 68/95/99, when one localizes them in a determined/closed environment. this supposedly "natural" rule is the warranty of an important aspect (read "phenomena") to be re-built in the algo, in order to get from it a final randomness.

revolution wrote:

Usually we generate a uniform distribution with a PRNG. This means that if we graph the counts of each output number we expect to see a flat line.

Taking a simple example of a PRNG that generates numbers from 0 to 15 inclusive in a uniform dist.
If we run our PRNG 16,000,000 times then we expect to see 1,000,000 0's and 1,000,000
1's ... up to 1,000,000 15's. (In most good quality PRNGs and RNGs we would usually see small deviations from this ideal expectation).

starting with uniformity is simpler bat a very bad approach.
because uniformity tells us essentially about the likelihood of the outcomes in a range.
rolling a dice we may have 1,2,3,4,5,6. likelihood is 1/6
for each k in/of the range. because empirically 1 balances 6, 2 balances 5 from the mean, we say uniformity is ok.

rolling now again we may have 1,2,2,3,4,5 for an unbalanced uniformity. for this reason we must go lokal to the outcomes to consider the deviation because, the random numbers we like are 1) choosen from a finite sample,2) they may be only "infinitively" uniform,3) but they are biased when uniform-ed.

1) are always a finite sample.
uniformity should to be achieved only globally, and after all outcomes.

2) becuse uniformity is only confirmed/checkable after having all outcomes
(consider pi digits found till now and their increasing likelihood)

3) as said above, better starting your algo considering parting the range in 8 and then using a z-score measure for it than, call it the way you like,
a biased algo that generates numbers in the range of 1 standard deviation only because someone told you it must be of uniform likelyhood !!!!

revolution wrote:

...Instead, let's say we want a normal distribution (a bell curve result). Then we need to modify our PRNG to generate...

sorry, definitvely RBH Wink

Cheers,

_________________
⠓⠕⠏⠉⠕⠙⠑

05 Apr 2012, 07:05

revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20689
Location: In your JS exploiting you and your system

revolution 05 Apr 2012, 12:05

hopcode: I think you misunderstand some things. Uniform distribution is by far the easiest to generate and to test for biases. That is why currently all good non-uniform generators start with a uniform output and later alter the output to fit into the desired distribution. Some people have tried to directly generate some non-uniform distributions but ultimately the results were very poor and slower than the more simple method of generating uniform first with known good algorithms and massaging for non-uniformity.

Did you finally understand the proper meanings of PRNG and RNG? This is also an important thing to understand when dealing with these things. Putting the "P" in front of "RNG" does not suggest it is suddenly of low quality, it is supposed to show the fundamental difference in how the numbers are generated and sourced.

05 Apr 2012, 12:05

hopcode

Joined: 04 Mar 2008
Posts: 563
Location: Germany

hopcode 05 Apr 2012, 16:14

hopcode wrote:

...because uniformity tells us essentially about the
likelihood of the outcomes in a range...

just like each other
distribution tells us respectively of some other features of the outcomes in
the range. also one concrete important factor when choosing an algo
is the period it manages.
if i choose for example 2^32 -1 from the Park-Miller because i need 256
words of 16 bit as result, and i generate them singularly, and not serially
as the generator is programmed for,whatever the "linearity" it implements,
it means that locally from my 256 outcomes, i may find strange/biased
frequencies because truncating on 2^31 / 2^16 i may count out a repeated outcome every 2^14 times.
on the contrary, using Marsenne of a huge period, the deviation may be larger.
and this may lead to think that the output has no uniformity.for the same reason you cannot test 2 algos for the best one,
if they work on different periods. Park-Miller and mod-31s work on a 2^31-1 period, Marsenne on a 2^1937, if i am not wrong.
but the worst, they work good only when used as generators. for the same same reason, one should start thinking about

1) the period to manage
2) the mode: singularity/seriality of the algo

because those are mainly the few things that make algos different.

Cheers,

p.s. all the rest is verbosity overbloating over the time

_________________
⠓⠕⠏⠉⠕⠙⠑

05 Apr 2012, 16:14

kalambong

Joined: 08 Nov 2008
Posts: 165

kalambong 06 Apr 2012, 09:11

AsmGuru62 wrote:

Best tests you can run on your generator are DIEHARD tests.

Original Diehard test is no longer maintained

Alternatives can be found at

http://www.phy.duke.edu/~rgb/General/dieharder.php

and

http://csrc.nist.gov/groups/ST/toolkit/rng/documents/sts-2.1.1.zip

06 Apr 2012, 09:11

hopcode

Joined: 04 Mar 2008
Posts: 563
Location: Germany

hopcode 15 Apr 2012, 12:52

amrt, Assembly Math Run Time
now and immediately ?

http://sites.google.com/site/x64lab/home/reloaded-algorithms

Cheers,

_________________
⠓⠕⠏⠉⠕⠙⠑

15 Apr 2012, 12:52

tthsqe

Joined: 20 May 2009
Posts: 767

tthsqe 25 Apr 2012, 08:47

Has anyone tried using rule 30 ( http://en.wikipedia.org/wiki/Rule_30 ) to generate 'random' numbers? A quality SSE implementation would be nice...

25 Apr 2012, 08:47

AsmGuru62

Joined: 28 Jan 2004
Posts: 1717
Location: Toronto, Canada

AsmGuru62 25 Apr 2012, 15:09

I didn't get the algorithm on this Rule 30.
There are just some pictures and no clear explanation.

25 Apr 2012, 15:09

tthsqe

Joined: 20 May 2009
Posts: 767

tthsqe 26 Apr 2012, 00:38

it is a cellular automaton.
there is an initial configuration of 0's and 1's on a doubly infinite sequence of cells.
The next state of a cell is determined by the current state of itself and both of its two neighbors. the actual rule for each the eight possibilities is given on the wikipedia site.

26 Apr 2012, 00:38

bzdashek

Joined: 15 Feb 2012
Posts: 147
Location: Tolstokvashino, Russia

bzdashek 26 Apr 2012, 05:21

tthsqe wrote:

it is a cellular automaton.
there is an initial configuration of 0's and 1's on a doubly infinite sequence of cells.
The next state of a cell is determined by the current state of itself and both of its two neighbors. the actual rule for each the eight possibilities is given on the wikipedia site.

Sorry, I'm technically challenged at maths.
Could you please explain a special case:
1. Sequence initialized by some number, let's say: 20.
2. We generated some members of the sequence, let's say 100

How do we select our next "random" number?

26 Apr 2012, 05:21

tthsqe

Joined: 20 May 2009
Posts: 767

tthsqe 26 Apr 2012, 16:10

since 20 = 10100 in binary, we would initialize it with
...00001010000...
Rule 30 is the following update rule:
SELF = (LEFTNEIGHBOR xor (SELF or RIGHTNEIGHBOR))
the plot is shown using black for 1 and white for 0 and each generation on a new row.
Apparently, we can use the center column (blue) for a 'random' source of bits.

Description:
Filesize:	79.86 KB
Viewed:	14809 Time(s)

26 Apr 2012, 16:10

bzdashek

Joined: 15 Feb 2012
Posts: 147
Location: Tolstokvashino, Russia

bzdashek 26 Apr 2012, 17:35

@tthsqe
Thanks for the explanation, tthsqe!

26 Apr 2012, 17:35

r22

Joined: 27 Dec 2004
Posts: 805

r22 28 Apr 2012, 02:35

@tthsqe - I was skeptical at first about how well Rule30 would translate to SSE, but constraining +/- INFINITY to ~64 bits and keeping a cache of the pseudo random bits produced to re-initialize the seed seems to do the trick.

Here's version 0.1 of Rule 30 ala SSE:

Code:

format PE64 CONSOLE
entry start

include 'win64ax.inc'

section '.text' code readable executable

start:
        SUB             rsp, 8
        ;
        stdcall         R30Init
        ;
.again:
        stdcall         R30Next
        cinvoke         printf, _fmtbits, rax
        cinvoke         printf, _fmtseed, [seed1+4], [seed1], [seed1+12], [seed1+8]
        cinvoke         printf, _fmtcache, [cache+4], [cache], [cache+12], [cache+8]

        SUB             [_cnt], 1
        JNS             .again

exit:
        cinvoke  system, _pause
        invoke  ExitProcess,0

R30Init:
        MOV             rcx, [init_mul1]
        RDTSC
        ROR             rax, 17
        MUL             rcx
        MOV             [seed1], rax
        MOV             rcx, [init_mul2]
        RDTSC
        ROL             rax, 5
        MUL             rcx
        MOV             [seed1 + 8], rax
        RET             0

R30Next:
        MOV             rdx, cache
        MOV             rcx, [cache_ptr]
        MOVDQA          xmm0, dqword[seed1] ; = current
        MOVDQA          xmm1, xmm0
        MOVDQA          xmm2, xmm0
        PSLLQ           xmm1, 1 ; = right
        PSRLQ           xmm2, 1 ; = left
        POR             xmm0, xmm1 ; = current | right
        PXOR            xmm0, xmm2 ; = left ^(current | right)
        MOVDQA          xmm3, xmm0
        PSLLQ           xmm3, 32 ; bit 32 is the pretend center (origin)
        MOVDQA          dqword[seed1], xmm0 ; save new seed
        MOVMSKPD        rax, xmm3 ; = 2 pseudo random bits
        ; save the bits to the cache
        MOV             r9, rax ; = copy for cache
        MOV             r8, rcx ; = copy of bit ptr into cache
        ROL             r9, cl ; position bits to add to cache
        SHR             r8, 6 ; = / 64
        LEA             rdx, [rdx + r8 * 8]
        OR              qword[rdx], r9
        ; if the cache is full then copy cache to seed
        ADD             rcx, 2
        AND             rcx, 127
        JNZ             .skip
        MOVDQA          xmm0, dqword[cache]
        PXOR            xmm1, xmm1
        MOVDQA          dqword[seed1], xmm0
        MOVDQA          dqword[cache], xmm1
.skip:
        MOV             [cache_ptr], rcx
        RET             0

section '.bss' data readable writable
align 16
dd 0

section '.data' data readable writable
align 16
init_mul1               dq 0cdefab1223455673h
init_mul2               dq 0deadb33f1337357ah

align 16
cache                   dq 0, 0
cache_ptr               dq 0

align 16
seed1                   dq 0, 0

_fmtbits                db 'Bits:',9,'%d',13,10,0
_fmtseed                db 'Seed:',9,'%.8X%.8X %.8X%.8X',13,10,0
_fmtcache               db 'Cache:',9,'%.8X%.8X %.8X%.8X',13,10,0
_pause                  db 'pause',0
_cnt                    dd 100

section '.idata' import data readable writeable

  library kernel32,'KERNEL32.DLL',\
          user32,'USER32.DLL',\
          msvcrt,'MSVCRT.DLL'

  include 'API\KERNEL32.INC'
  include 'API\USER32.INC'

  import  msvcrt,\
          system,'system',\
          printf,'printf'

The most obvious optimization from here is quadrupling the output so you produce 8 pseudo random bits at a time. The would allow interleaving of the SIMD instructions avoiding write/read stalls and simplify the caching a bit etc blah blah.

28 Apr 2012, 02:35

tthsqe

Joined: 20 May 2009
Posts: 767

tthsqe 28 Apr 2012, 03:15

@r22, I think I see what you are doing Smile

(two size-64 configurations in parallel and reading from bit 32 in each; the seed is updated when 64 generations have been produced in each).
However, since you are not taking care of the edges, the cellular automaton is not correct past the ~32nd generation, and I would worry about the quality of the later bits. In the plot below I put 32bit integers in seed[0:31] and seed[64:95] and ran

Code:

mixup:               movdqa  xmm0,dqword[seed]
                     movdqa  xmm1,xmm0
                     movdqa  xmm2,xmm0
                      psllq  xmm1,1
                      psrlq  xmm2,1
                        por  xmm0,xmm1
                       pxor  xmm0,xmm2
                     movdqa  dqword[seed],xmm0
                        ret

You can see that it loses quality quickly, but it might not be a problem as the seed should be updated before the 64th generation.

Description:
Filesize:	109.81 KB
Viewed:	14738 Time(s)

28 Apr 2012, 03:15

r22

Joined: 27 Dec 2004
Posts: 805

r22 29 Apr 2012, 03:58

Here's version 0.2 of Rule 30 ala SSE.
This update creates a whole byte of pseudo random goodness per call.
All that's left is to interleave the SSE instructions to get better throughput.
Also, it refreshes the seed every 32 generations instead of 64 generations, but to do so I did some "expanding of the 32byte cache to 64byte seed, which could probably use some work.

Code:

format PE64 CONSOLE
entry start

include 'win64ax.inc'

section '.text' code readable executable

start:
        SUB             rsp, 8
        ;
        stdcall            R30Init
        ;
        MOV             [_cnt], 7FFFFFFh
.again:
        stdcall         R30Next
        ADD             dword[_stats + rax * 4], 1
       ; cinvoke         printf, _fmtbits, rax
        SUB             [_cnt], 1
        JNS             .again
        ;
        MOV             [_cnt], 255
.print:
        MOV             eax, dword[_cnt]
        cinvoke         printf, _fmtstats, [_cnt], [_stats + rax * 4]
        SUB             [_cnt], 1
        JNS             .print
exit:
        cinvoke  system, _pause
        invoke  ExitProcess,0

;;
;; Initialize the seed using the current time and some constants
;;
R30Init:
        RDTSC
        MOV             rdx, seed
        MOV             rcx, init_matrix
        MOV             r8, 7
.fill:
        XOR             rax, [rcx + r8 * 8]
        ROR             rax, 7
        MOV             [rdx + r8 * 8], rax
        SUB             r8, 1
        JNS             .fill
        RET             0

R30Next:
        SUB             rsp, 136
        MOVDQA          dqword[rsp], xmm8
        MOVDQA          dqword[rsp + 16], xmm9
        MOVDQA          dqword[rsp + 32], xmm10
        MOVDQA          dqword[rsp + 48], xmm11
       ; MOVDQA          dqword[rsp + 64], xmm12
       ; MOVDQA          dqword[rsp + 80], xmm13
       ; MOVDQA          dqword[rsp + 96], xmm14
       ; MOVDQA          dqword[rsp + 112], xmm15
        ;
        MOVDQA          xmm0, dqword[seed1] ; = current
        MOVDQA          xmm1, dqword[seed2]
        MOVDQA          xmm2, dqword[seed3]
        MOVDQA          xmm3, dqword[seed4]
        MOVDQA          xmm4, xmm0
        MOVDQA          xmm5, xmm0
        MOVDQA          xmm6, xmm1
        MOVDQA          xmm7, xmm1
        MOVDQA          xmm8, xmm2
        MOVDQA          xmm9, xmm2
        MOVDQA          xmm10, xmm3
        MOVDQA          xmm11, xmm3
        PSLLQ           xmm4, 1 ; = right
        PSLLQ           xmm6, 1
        PSLLQ           xmm8, 1
        PSLLQ           xmm10, 1
        PSRLQ           xmm5, 1 ; = left
        PSRLQ           xmm7, 1
        PSRLQ           xmm9, 1
        PSRLQ           xmm11, 1
        POR             xmm0, xmm4 ; = current OR right
        POR             xmm1, xmm6
        POR             xmm2, xmm8
        POR             xmm3, xmm10
        PXOR            xmm0, xmm5 ; = XOR left
        PXOR            xmm1, xmm7
        PXOR            xmm2, xmm9
        PXOR            xmm3, xmm11
        MOVDQA          dqword[seed1], xmm0
        MOVDQA          dqword[seed2], xmm1
        MOVDQA          dqword[seed3], xmm2
        MOVDQA          dqword[seed4], xmm3
        PSLLQ           xmm0, 32 ; bit 32 of each qword
        PSLLQ           xmm1, 32
        PSLLQ           xmm2, 32
        PSLLQ           xmm3, 32
        MOVMSKPD        rax, xmm0 ; pack pseudo random bits into AL
        MOVMSKPD        rcx, xmm1
        MOVMSKPD        rdx, xmm2
        MOVMSKPD        r8, xmm3
        SHL             rcx, 2
        SHL             rdx, 4
        SHL             r8, 6
        OR              rax, rcx
        OR              rdx, r8
        OR              rax, rdx
        ; use cache to refresh the seed every 32 calls
        MOV             rcx, [cache_ptr]
        MOV             rdx, rax ; ??? copy output
        MOV             byte[cache + rcx], al
        ADD             rcx, 1
        AND             rdx, 11100000b ; ??? zero low 5 bits for use as pointer into init_matrix
        AND             rcx, 32-1
        JNZ             .skip
        ; ??? expand 32 byte cache into 64 byte seed
        MOVDQA          xmm0, dqword[cache1]
        MOVDQA          xmm1, dqword[cache2]
        MOVDQA          xmm2, dqword[init_matrix + rdx]
        MOVDQA          xmm3, dqword[init_matrix + rdx + 16]
        PXOR            xmm2, xmm0
        PADDB           xmm3, xmm1
        MOVDQA          dqword[seed1], xmm1
        MOVDQA          dqword[seed2], xmm2
        MOVDQA          dqword[seed3], xmm3
        MOVDQA          dqword[seed4], xmm0
.skip:
        MOV             [cache_ptr], rcx
        ;
        MOVDQA          xmm8, dqword[rsp]
        MOVDQA          xmm9, dqword[rsp + 16]
        MOVDQA          xmm10, dqword[rsp + 32]
        MOVDQA          xmm11, dqword[rsp + 48]
       ; MOVDQA          xmm12, dqword[rsp + 64]
       ; MOVDQA          xmm13, dqword[rsp + 80]
       ; MOVDQA          xmm14, dqword[rsp + 96]
       ; MOVDQA          xmm15, dqword[rsp + 112]
        ADD             rsp, 136
        RET             0



section '.bss' data readable writable
align 16
dd 0

section '.data' data readable writable
align 16
;; taken from SHA-2 constants
init_matrix             dd 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5
                        dd 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174
                        dd 0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da
                        dd 0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967
                        dd 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85
                        dd 0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070
                        dd 0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3
                        dd 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2

align 16
cache:
cache1                  dq 0, 0
cache2                  dq 0, 0
cache_ptr               dq 0

align 16
seed:
seed1                   dq 0, 0
seed2                   dq 0, 0
seed3                   dq 0, 0
seed4                   dq 0, 0


_fmtbits                db 'Bits:',9,'%d',13,10,0
_fmtseed                db 'Seed:',9,'%.8X%.8X %.8X%.8X',13,10,0
_fmtcache               db 'Cache:',9,'%.8X%.8X %.8X%.8X',13,10,0
_pause                  db 'pause',0
_cnt                    dd 100
_fmtstats               db '%d:', 9, '%d', 13, 10, 0
_stats                  rd 256

section '.idata' import data readable writeable

  library kernel32,'KERNEL32.DLL',\
          user32,'USER32.DLL',\
          msvcrt,'MSVCRT.DLL'

  include 'API\KERNEL32.INC'
  include 'API\USER32.INC'

  import  msvcrt,\
          system,'system',\
          printf,'printf'

29 Apr 2012, 03:58

hopcode

Joined: 04 Mar 2008
Posts: 563
Location: Germany

hopcode 09 Jul 2012, 16:26

more like a diary, Twisted Evil

i am very glad that this thread goes further even without my
tribute. i would thank all people taking part to the discussion.
for my own side, i am totally fascinated from the Marsaglia's methods.
He has reduced me to silence and admiration. that should explain my absence on the thread.
all those methods are a powerful source of inspiration for me.
also, concretely the more is yet to come, hopefully in a better way.
Cheers,

_________________
⠓⠕⠏⠉⠕⠙⠑

09 Jul 2012, 16:26

Goto page Previous 1, 2, 3

< Last Thread | Next Thread >

Forum Rules:

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum