flat assembler
Message board for the users of flat assembler.

 Index > Main > Fast and small Multiplicative Random number generator Goto page Previous  1, 2, 3
Author
hopcode

Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode 05 Apr 2012, 07:05
AsmGuru62 wrote:
@hopcode: I looked at 'normal' distribution in Wikipedia and it resembles some kind of symmetrical curve (like a Bell Curve), so I assume that numbers are not generated in equal probability.
Hi all,
that symmetry is a "architectural". it tell us where/when the deviation happens but not directly the value of the outcome at that variance/deviation. rolling a dice, we have 1,2,3,4,5,6 as outcomes. considering 3.5 the mean, 1 and 6 have same variance but different values. it depends on where one chooses the mean. considering bytes, if we place the mean at 127.5 then 0 and 255, 1 and 254,2 and 253 may share a relationship that may lead to biased randomness.

"architectural" there denotes the fact that at the moment it is the basal structure for almost all happening phenomena in the universe ("almost" is after the fact that it is a continous function). they happens naturally "deviated" from an ideal "mean", and "almost" always according the rule 68/95/99, when one localizes them in a determined/closed environment. this supposedly "natural" rule is the warranty of an important aspect (read "phenomena") to be re-built in the algo, in order to get from it a final randomness.
revolution wrote:
Usually we generate a uniform distribution with a PRNG. This means that if we graph the counts of each output number we expect to see a flat line.

Taking a simple example of a PRNG that generates numbers from 0 to 15 inclusive in a uniform dist.
If we run our PRNG 16,000,000 times then we expect to see 1,000,000 0's and 1,000,000
1's ... up to 1,000,000 15's. (In most good quality PRNGs and RNGs we would usually see small deviations from this ideal expectation).

starting with uniformity is simpler bat a very bad approach.
because uniformity tells us essentially about the likelihood of the outcomes in a range.
rolling a dice we may have 1,2,3,4,5,6. likelihood is 1/6
for each k in/of the range. because empirically 1 balances 6, 2 balances 5 from the mean, we say uniformity is ok.

rolling now again we may have 1,2,2,3,4,5 for an unbalanced uniformity. for this reason we must go lokal to the outcomes to consider the deviation because, the random numbers we like are 1) choosen from a finite sample,2) they may be only "infinitively" uniform,3) but they are biased when uniform-ed.

1) are always a finite sample.
uniformity should to be achieved only globally, and after all outcomes.

2) becuse uniformity is only confirmed/checkable after having all outcomes
(consider pi digits found till now and their increasing likelihood)

3) as said above, better starting your algo considering parting the range in 8 and then using a z-score measure for it than, call it the way you like,
a biased algo that generates numbers in the range of 1 standard deviation only because someone told you it must be of uniform likelyhood !!!!
revolution wrote:
...Instead, let's say we want a normal distribution (a bell curve result). Then we need to modify our PRNG to generate...

sorry, definitvely RBH

Cheers,

_________________
⠓⠕⠏⠉⠕⠙⠑
05 Apr 2012, 07:05
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20002
revolution 05 Apr 2012, 12:05
hopcode: I think you misunderstand some things. Uniform distribution is by far the easiest to generate and to test for biases. That is why currently all good non-uniform generators start with a uniform output and later alter the output to fit into the desired distribution. Some people have tried to directly generate some non-uniform distributions but ultimately the results were very poor and slower than the more simple method of generating uniform first with known good algorithms and massaging for non-uniformity.

Did you finally understand the proper meanings of PRNG and RNG? This is also an important thing to understand when dealing with these things. Putting the "P" in front of "RNG" does not suggest it is suddenly of low quality, it is supposed to show the fundamental difference in how the numbers are generated and sourced.
05 Apr 2012, 12:05
hopcode

Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode 05 Apr 2012, 16:14
hopcode wrote:
...because uniformity tells us essentially about the
likelihood of the outcomes in a range...
just like each other
distribution tells us respectively of some other features of the outcomes in
the range. also one concrete important factor when choosing an algo
is the period it manages.
if i choose for example 2^32 -1 from the Park-Miller because i need 256
words of 16 bit as result, and i generate them singularly, and not serially
as the generator is programmed for,whatever the "linearity" it implements,
it means that locally from my 256 outcomes, i may find strange/biased
frequencies because truncating on 2^31 / 2^16 i may count out a repeated outcome every 2^14 times.
on the contrary, using Marsenne of a huge period, the deviation may be larger.
and this may lead to think that the output has no uniformity.for the same reason you cannot test 2 algos for the best one,
if they work on different periods. Park-Miller and mod-31s work on a 2^31-1 period, Marsenne on a 2^1937, if i am not wrong.
but the worst, they work good only when used as generators. for the same same reason, one should start thinking about

1) the period to manage
2) the mode: singularity/seriality of the algo

because those are mainly the few things that make algos different.

Cheers,

p.s. all the rest is verbosity overbloating over the time

_________________
⠓⠕⠏⠉⠕⠙⠑
05 Apr 2012, 16:14
kalambong

Joined: 08 Nov 2008
Posts: 165
kalambong 06 Apr 2012, 09:11
AsmGuru62 wrote:
Best tests you can run on your generator are DIEHARD tests.

Original Diehard test is no longer maintained

Alternatives can be found at

http://www.phy.duke.edu/~rgb/General/dieharder.php

and

http://csrc.nist.gov/groups/ST/toolkit/rng/documents/sts-2.1.1.zip
06 Apr 2012, 09:11
hopcode

Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode 15 Apr 2012, 12:52
amrt, Assembly Math Run Time
now and immediately ?

Cheers,

_________________
⠓⠕⠏⠉⠕⠙⠑
15 Apr 2012, 12:52
tthsqe

Joined: 20 May 2009
Posts: 767
tthsqe 25 Apr 2012, 08:47
Has anyone tried using rule 30 ( http://en.wikipedia.org/wiki/Rule_30 ) to generate 'random' numbers? A quality SSE implementation would be nice...
25 Apr 2012, 08:47
AsmGuru62

Joined: 28 Jan 2004
Posts: 1582
AsmGuru62 25 Apr 2012, 15:09
I didn't get the algorithm on this Rule 30.
There are just some pictures and no clear explanation.
25 Apr 2012, 15:09
tthsqe

Joined: 20 May 2009
Posts: 767
tthsqe 26 Apr 2012, 00:38
it is a cellular automaton.
there is an initial configuration of 0's and 1's on a doubly infinite sequence of cells.
The next state of a cell is determined by the current state of itself and both of its two neighbors. the actual rule for each the eight possibilities is given on the wikipedia site.
26 Apr 2012, 00:38
bzdashek

Joined: 15 Feb 2012
Posts: 147
Location: Tolstokvashino, Russia
bzdashek 26 Apr 2012, 05:21
tthsqe wrote:
it is a cellular automaton.
there is an initial configuration of 0's and 1's on a doubly infinite sequence of cells.
The next state of a cell is determined by the current state of itself and both of its two neighbors. the actual rule for each the eight possibilities is given on the wikipedia site.

Sorry, I'm technically challenged at maths.
Could you please explain a special case:
1. Sequence initialized by some number, let's say: 20.
2. We generated some members of the sequence, let's say 100

How do we select our next "random" number?
26 Apr 2012, 05:21
tthsqe

Joined: 20 May 2009
Posts: 767
tthsqe 26 Apr 2012, 16:10
since 20 = 10100 in binary, we would initialize it with
...00001010000...
Rule 30 is the following update rule:
SELF = (LEFTNEIGHBOR xor (SELF or RIGHTNEIGHBOR))
the plot is shown using black for 1 and white for 0 and each generation on a new row.
Apparently, we can use the center column (blue) for a 'random' source of bits.

 Description: Filesize: 79.86 KB Viewed: 12286 Time(s)

26 Apr 2012, 16:10
bzdashek

Joined: 15 Feb 2012
Posts: 147
Location: Tolstokvashino, Russia
bzdashek 26 Apr 2012, 17:35
@tthsqe
Thanks for the explanation, tthsqe!
26 Apr 2012, 17:35
r22

Joined: 27 Dec 2004
Posts: 805
r22 28 Apr 2012, 02:35
@tthsqe - I was skeptical at first about how well Rule30 would translate to SSE, but constraining +/- INFINITY to ~64 bits and keeping a cache of the pseudo random bits produced to re-initialize the seed seems to do the trick.

Here's version 0.1 of Rule 30 ala SSE:
Code:
```format PE64 CONSOLE
entry start

include 'win64ax.inc'

start:
SUB             rsp, 8
;
stdcall         R30Init
;
.again:
stdcall         R30Next
cinvoke         printf, _fmtbits, rax
cinvoke         printf, _fmtseed, [seed1+4], [seed1], [seed1+12], [seed1+8]
cinvoke         printf, _fmtcache, [cache+4], [cache], [cache+12], [cache+8]

SUB             [_cnt], 1
JNS             .again

exit:
cinvoke  system, _pause
invoke  ExitProcess,0

R30Init:
MOV             rcx, [init_mul1]
RDTSC
ROR             rax, 17
MUL             rcx
MOV             [seed1], rax
MOV             rcx, [init_mul2]
RDTSC
ROL             rax, 5
MUL             rcx
MOV             [seed1 + 8], rax
RET             0

R30Next:
MOV             rdx, cache
MOV             rcx, [cache_ptr]
MOVDQA          xmm0, dqword[seed1] ; = current
MOVDQA          xmm1, xmm0
MOVDQA          xmm2, xmm0
PSLLQ           xmm1, 1 ; = right
PSRLQ           xmm2, 1 ; = left
POR             xmm0, xmm1 ; = current | right
PXOR            xmm0, xmm2 ; = left ^(current | right)
MOVDQA          xmm3, xmm0
PSLLQ           xmm3, 32 ; bit 32 is the pretend center (origin)
MOVDQA          dqword[seed1], xmm0 ; save new seed
MOVMSKPD        rax, xmm3 ; = 2 pseudo random bits
; save the bits to the cache
MOV             r9, rax ; = copy for cache
MOV             r8, rcx ; = copy of bit ptr into cache
ROL             r9, cl ; position bits to add to cache
SHR             r8, 6 ; = / 64
LEA             rdx, [rdx + r8 * 8]
OR              qword[rdx], r9
; if the cache is full then copy cache to seed
AND             rcx, 127
JNZ             .skip
MOVDQA          xmm0, dqword[cache]
PXOR            xmm1, xmm1
MOVDQA          dqword[seed1], xmm0
MOVDQA          dqword[cache], xmm1
.skip:
MOV             [cache_ptr], rcx
RET             0

align 16
dd 0

align 16
init_mul1               dq 0cdefab1223455673h

align 16
cache                   dq 0, 0
cache_ptr               dq 0

align 16
seed1                   dq 0, 0

_fmtbits                db 'Bits:',9,'%d',13,10,0
_fmtseed                db 'Seed:',9,'%.8X%.8X %.8X%.8X',13,10,0
_fmtcache               db 'Cache:',9,'%.8X%.8X %.8X%.8X',13,10,0
_pause                  db 'pause',0
_cnt                    dd 100

section '.idata' import data readable writeable

library kernel32,'KERNEL32.DLL',\
user32,'USER32.DLL',\
msvcrt,'MSVCRT.DLL'

include 'API\KERNEL32.INC'
include 'API\USER32.INC'

import  msvcrt,\
system,'system',\
printf,'printf'
```

The most obvious optimization from here is quadrupling the output so you produce 8 pseudo random bits at a time. The would allow interleaving of the SIMD instructions avoiding write/read stalls and simplify the caching a bit etc blah blah.
28 Apr 2012, 02:35
tthsqe

Joined: 20 May 2009
Posts: 767
tthsqe 28 Apr 2012, 03:15
@r22, I think I see what you are doing (two size-64 configurations in parallel and reading from bit 32 in each; the seed is updated when 64 generations have been produced in each).
However, since you are not taking care of the edges, the cellular automaton is not correct past the ~32nd generation, and I would worry about the quality of the later bits. In the plot below I put 32bit integers in seed[0:31] and seed[64:95] and ran
Code:
```mixup:               movdqa  xmm0,dqword[seed]
movdqa  xmm1,xmm0
movdqa  xmm2,xmm0
psllq  xmm1,1
psrlq  xmm2,1
por  xmm0,xmm1
pxor  xmm0,xmm2
movdqa  dqword[seed],xmm0
ret    ```

You can see that it loses quality quickly, but it might not be a problem as the seed should be updated before the 64th generation.

 Description: Filesize: 109.81 KB Viewed: 12215 Time(s)

28 Apr 2012, 03:15
r22

Joined: 27 Dec 2004
Posts: 805
r22 29 Apr 2012, 03:58
Here's version 0.2 of Rule 30 ala SSE.
This update creates a whole byte of pseudo random goodness per call.
All that's left is to interleave the SSE instructions to get better throughput.
Also, it refreshes the seed every 32 generations instead of 64 generations, but to do so I did some "expanding of the 32byte cache to 64byte seed, which could probably use some work.

Code:
```format PE64 CONSOLE
entry start

include 'win64ax.inc'

start:
SUB             rsp, 8
;
stdcall            R30Init
;
MOV             [_cnt], 7FFFFFFh
.again:
stdcall         R30Next
ADD             dword[_stats + rax * 4], 1
; cinvoke         printf, _fmtbits, rax
SUB             [_cnt], 1
JNS             .again
;
MOV             [_cnt], 255
.print:
MOV             eax, dword[_cnt]
cinvoke         printf, _fmtstats, [_cnt], [_stats + rax * 4]
SUB             [_cnt], 1
JNS             .print
exit:
cinvoke  system, _pause
invoke  ExitProcess,0

;;
;; Initialize the seed using the current time and some constants
;;
R30Init:
RDTSC
MOV             rdx, seed
MOV             rcx, init_matrix
MOV             r8, 7
.fill:
XOR             rax, [rcx + r8 * 8]
ROR             rax, 7
MOV             [rdx + r8 * 8], rax
SUB             r8, 1
JNS             .fill
RET             0

R30Next:
SUB             rsp, 136
MOVDQA          dqword[rsp], xmm8
MOVDQA          dqword[rsp + 16], xmm9
MOVDQA          dqword[rsp + 32], xmm10
MOVDQA          dqword[rsp + 48], xmm11
; MOVDQA          dqword[rsp + 64], xmm12
; MOVDQA          dqword[rsp + 80], xmm13
; MOVDQA          dqword[rsp + 96], xmm14
; MOVDQA          dqword[rsp + 112], xmm15
;
MOVDQA          xmm0, dqword[seed1] ; = current
MOVDQA          xmm1, dqword[seed2]
MOVDQA          xmm2, dqword[seed3]
MOVDQA          xmm3, dqword[seed4]
MOVDQA          xmm4, xmm0
MOVDQA          xmm5, xmm0
MOVDQA          xmm6, xmm1
MOVDQA          xmm7, xmm1
MOVDQA          xmm8, xmm2
MOVDQA          xmm9, xmm2
MOVDQA          xmm10, xmm3
MOVDQA          xmm11, xmm3
PSLLQ           xmm4, 1 ; = right
PSLLQ           xmm6, 1
PSLLQ           xmm8, 1
PSLLQ           xmm10, 1
PSRLQ           xmm5, 1 ; = left
PSRLQ           xmm7, 1
PSRLQ           xmm9, 1
PSRLQ           xmm11, 1
POR             xmm0, xmm4 ; = current OR right
POR             xmm1, xmm6
POR             xmm2, xmm8
POR             xmm3, xmm10
PXOR            xmm0, xmm5 ; = XOR left
PXOR            xmm1, xmm7
PXOR            xmm2, xmm9
PXOR            xmm3, xmm11
MOVDQA          dqword[seed1], xmm0
MOVDQA          dqword[seed2], xmm1
MOVDQA          dqword[seed3], xmm2
MOVDQA          dqword[seed4], xmm3
PSLLQ           xmm0, 32 ; bit 32 of each qword
PSLLQ           xmm1, 32
PSLLQ           xmm2, 32
PSLLQ           xmm3, 32
MOVMSKPD        rax, xmm0 ; pack pseudo random bits into AL
MOVMSKPD        rcx, xmm1
MOVMSKPD        rdx, xmm2
MOVMSKPD        r8, xmm3
SHL             rcx, 2
SHL             rdx, 4
SHL             r8, 6
OR              rax, rcx
OR              rdx, r8
OR              rax, rdx
; use cache to refresh the seed every 32 calls
MOV             rcx, [cache_ptr]
MOV             rdx, rax ; ??? copy output
MOV             byte[cache + rcx], al
AND             rdx, 11100000b ; ??? zero low 5 bits for use as pointer into init_matrix
AND             rcx, 32-1
JNZ             .skip
; ??? expand 32 byte cache into 64 byte seed
MOVDQA          xmm0, dqword[cache1]
MOVDQA          xmm1, dqword[cache2]
MOVDQA          xmm2, dqword[init_matrix + rdx]
MOVDQA          xmm3, dqword[init_matrix + rdx + 16]
PXOR            xmm2, xmm0
MOVDQA          dqword[seed1], xmm1
MOVDQA          dqword[seed2], xmm2
MOVDQA          dqword[seed3], xmm3
MOVDQA          dqword[seed4], xmm0
.skip:
MOV             [cache_ptr], rcx
;
MOVDQA          xmm8, dqword[rsp]
MOVDQA          xmm9, dqword[rsp + 16]
MOVDQA          xmm10, dqword[rsp + 32]
MOVDQA          xmm11, dqword[rsp + 48]
; MOVDQA          xmm12, dqword[rsp + 64]
; MOVDQA          xmm13, dqword[rsp + 80]
; MOVDQA          xmm14, dqword[rsp + 96]
; MOVDQA          xmm15, dqword[rsp + 112]
RET             0

align 16
dd 0

align 16
;; taken from SHA-2 constants
init_matrix             dd 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5
dd 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174
dd 0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da
dd 0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967
dd 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85
dd 0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070
dd 0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3
dd 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2

align 16
cache:
cache1                  dq 0, 0
cache2                  dq 0, 0
cache_ptr               dq 0

align 16
seed:
seed1                   dq 0, 0
seed2                   dq 0, 0
seed3                   dq 0, 0
seed4                   dq 0, 0

_fmtbits                db 'Bits:',9,'%d',13,10,0
_fmtseed                db 'Seed:',9,'%.8X%.8X %.8X%.8X',13,10,0
_fmtcache               db 'Cache:',9,'%.8X%.8X %.8X%.8X',13,10,0
_pause                  db 'pause',0
_cnt                    dd 100
_fmtstats               db '%d:', 9, '%d', 13, 10, 0
_stats                  rd 256

section '.idata' import data readable writeable

library kernel32,'KERNEL32.DLL',\
user32,'USER32.DLL',\
msvcrt,'MSVCRT.DLL'

include 'API\KERNEL32.INC'
include 'API\USER32.INC'

import  msvcrt,\
system,'system',\
printf,'printf'

```
29 Apr 2012, 03:58
hopcode

Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode 09 Jul 2012, 16:26
more like a diary,
i am very glad that this thread goes further even without my
tribute. i would thank all people taking part to the discussion.
for my own side, i am totally fascinated from the Marsaglia's methods.
He has reduced me to silence and admiration. that should explain my absence on the thread.
all those methods are a powerful source of inspiration for me.
also, concretely the more is yet to come, hopefully in a better way.
Cheers,

_________________
⠓⠕⠏⠉⠕⠙⠑
09 Jul 2012, 16:26
 Display posts from previous: All Posts1 Day7 Days2 Weeks1 Month3 Months6 Months1 Year Oldest FirstNewest First

 Jump to: Select a forum Official----------------AssemblyPeripheria General----------------MainTutorials and ExamplesDOSWindowsLinuxUnixMenuetOS Specific----------------MacroinstructionsOS ConstructionIDE DevelopmentProjects and IdeasNon-x86 architecturesHigh Level LanguagesProgramming Language DesignCompiler Internals Other----------------FeedbackHeapTest Area
Goto page Previous  1, 2, 3

Forum Rules:
 You cannot post new topics in this forumYou cannot reply to topics in this forumYou cannot edit your posts in this forumYou cannot delete your posts in this forumYou cannot vote in polls in this forumYou cannot attach files in this forumYou can download files in this forum