flat assembler
Message board for the users of flat assembler.
Index
> Main > Fast and small Multiplicative Random number generator Goto page Previous 1, 2, 3 |
Author |
|
revolution 05 Apr 2012, 12:05
hopcode: I think you misunderstand some things. Uniform distribution is by far the easiest to generate and to test for biases. That is why currently all good non-uniform generators start with a uniform output and later alter the output to fit into the desired distribution. Some people have tried to directly generate some non-uniform distributions but ultimately the results were very poor and slower than the more simple method of generating uniform first with known good algorithms and massaging for non-uniformity.
Did you finally understand the proper meanings of PRNG and RNG? This is also an important thing to understand when dealing with these things. Putting the "P" in front of "RNG" does not suggest it is suddenly of low quality, it is supposed to show the fundamental difference in how the numbers are generated and sourced. |
|||
05 Apr 2012, 12:05 |
|
hopcode 05 Apr 2012, 16:14
hopcode wrote: ...because uniformity tells us essentially about the distribution tells us respectively of some other features of the outcomes in the range. also one concrete important factor when choosing an algo is the period it manages. if i choose for example 2^32 -1 from the Park-Miller because i need 256 words of 16 bit as result, and i generate them singularly, and not serially as the generator is programmed for,whatever the "linearity" it implements, it means that locally from my 256 outcomes, i may find strange/biased frequencies because truncating on 2^31 / 2^16 i may count out a repeated outcome every 2^14 times. on the contrary, using Marsenne of a huge period, the deviation may be larger. and this may lead to think that the output has no uniformity.for the same reason you cannot test 2 algos for the best one, if they work on different periods. Park-Miller and mod-31s work on a 2^31-1 period, Marsenne on a 2^1937, if i am not wrong. but the worst, they work good only when used as generators. for the same same reason, one should start thinking about 1) the period to manage 2) the mode: singularity/seriality of the algo because those are mainly the few things that make algos different. Cheers, p.s. all the rest is verbosity overbloating over the time _________________ ⠓⠕⠏⠉⠕⠙⠑ |
|||
05 Apr 2012, 16:14 |
|
kalambong 06 Apr 2012, 09:11
AsmGuru62 wrote: Best tests you can run on your generator are DIEHARD tests. Original Diehard test is no longer maintained Alternatives can be found at http://www.phy.duke.edu/~rgb/General/dieharder.php and http://csrc.nist.gov/groups/ST/toolkit/rng/documents/sts-2.1.1.zip |
|||
06 Apr 2012, 09:11 |
|
hopcode 15 Apr 2012, 12:52
amrt, Assembly Math Run Time
now and immediately ? http://sites.google.com/site/x64lab/home/reloaded-algorithms Cheers, _________________ ⠓⠕⠏⠉⠕⠙⠑ |
|||
15 Apr 2012, 12:52 |
|
tthsqe 25 Apr 2012, 08:47
Has anyone tried using rule 30 ( http://en.wikipedia.org/wiki/Rule_30 ) to generate 'random' numbers? A quality SSE implementation would be nice...
|
|||
25 Apr 2012, 08:47 |
|
AsmGuru62 25 Apr 2012, 15:09
I didn't get the algorithm on this Rule 30.
There are just some pictures and no clear explanation. |
|||
25 Apr 2012, 15:09 |
|
tthsqe 26 Apr 2012, 00:38
it is a cellular automaton.
there is an initial configuration of 0's and 1's on a doubly infinite sequence of cells. The next state of a cell is determined by the current state of itself and both of its two neighbors. the actual rule for each the eight possibilities is given on the wikipedia site. |
|||
26 Apr 2012, 00:38 |
|
bzdashek 26 Apr 2012, 05:21
tthsqe wrote: it is a cellular automaton. Sorry, I'm technically challenged at maths. Could you please explain a special case: 1. Sequence initialized by some number, let's say: 20. 2. We generated some members of the sequence, let's say 100 How do we select our next "random" number? |
|||
26 Apr 2012, 05:21 |
|
tthsqe 26 Apr 2012, 16:10
since 20 = 10100 in binary, we would initialize it with
...00001010000... Rule 30 is the following update rule: SELF = (LEFTNEIGHBOR xor (SELF or RIGHTNEIGHBOR)) the plot is shown using black for 1 and white for 0 and each generation on a new row. Apparently, we can use the center column (blue) for a 'random' source of bits.
|
||||||||||
26 Apr 2012, 16:10 |
|
bzdashek 26 Apr 2012, 17:35
@tthsqe
Thanks for the explanation, tthsqe! |
|||
26 Apr 2012, 17:35 |
|
r22 28 Apr 2012, 02:35
@tthsqe - I was skeptical at first about how well Rule30 would translate to SSE, but constraining +/- INFINITY to ~64 bits and keeping a cache of the pseudo random bits produced to re-initialize the seed seems to do the trick.
Here's version 0.1 of Rule 30 ala SSE: Code: format PE64 CONSOLE entry start include 'win64ax.inc' section '.text' code readable executable start: SUB rsp, 8 ; stdcall R30Init ; .again: stdcall R30Next cinvoke printf, _fmtbits, rax cinvoke printf, _fmtseed, [seed1+4], [seed1], [seed1+12], [seed1+8] cinvoke printf, _fmtcache, [cache+4], [cache], [cache+12], [cache+8] SUB [_cnt], 1 JNS .again exit: cinvoke system, _pause invoke ExitProcess,0 R30Init: MOV rcx, [init_mul1] RDTSC ROR rax, 17 MUL rcx MOV [seed1], rax MOV rcx, [init_mul2] RDTSC ROL rax, 5 MUL rcx MOV [seed1 + 8], rax RET 0 R30Next: MOV rdx, cache MOV rcx, [cache_ptr] MOVDQA xmm0, dqword[seed1] ; = current MOVDQA xmm1, xmm0 MOVDQA xmm2, xmm0 PSLLQ xmm1, 1 ; = right PSRLQ xmm2, 1 ; = left POR xmm0, xmm1 ; = current | right PXOR xmm0, xmm2 ; = left ^(current | right) MOVDQA xmm3, xmm0 PSLLQ xmm3, 32 ; bit 32 is the pretend center (origin) MOVDQA dqword[seed1], xmm0 ; save new seed MOVMSKPD rax, xmm3 ; = 2 pseudo random bits ; save the bits to the cache MOV r9, rax ; = copy for cache MOV r8, rcx ; = copy of bit ptr into cache ROL r9, cl ; position bits to add to cache SHR r8, 6 ; = / 64 LEA rdx, [rdx + r8 * 8] OR qword[rdx], r9 ; if the cache is full then copy cache to seed ADD rcx, 2 AND rcx, 127 JNZ .skip MOVDQA xmm0, dqword[cache] PXOR xmm1, xmm1 MOVDQA dqword[seed1], xmm0 MOVDQA dqword[cache], xmm1 .skip: MOV [cache_ptr], rcx RET 0 section '.bss' data readable writable align 16 dd 0 section '.data' data readable writable align 16 init_mul1 dq 0cdefab1223455673h init_mul2 dq 0deadb33f1337357ah align 16 cache dq 0, 0 cache_ptr dq 0 align 16 seed1 dq 0, 0 _fmtbits db 'Bits:',9,'%d',13,10,0 _fmtseed db 'Seed:',9,'%.8X%.8X %.8X%.8X',13,10,0 _fmtcache db 'Cache:',9,'%.8X%.8X %.8X%.8X',13,10,0 _pause db 'pause',0 _cnt dd 100 section '.idata' import data readable writeable library kernel32,'KERNEL32.DLL',\ user32,'USER32.DLL',\ msvcrt,'MSVCRT.DLL' include 'API\KERNEL32.INC' include 'API\USER32.INC' import msvcrt,\ system,'system',\ printf,'printf' The most obvious optimization from here is quadrupling the output so you produce 8 pseudo random bits at a time. The would allow interleaving of the SIMD instructions avoiding write/read stalls and simplify the caching a bit etc blah blah. |
|||
28 Apr 2012, 02:35 |
|
tthsqe 28 Apr 2012, 03:15
@r22, I think I see what you are doing (two size-64 configurations in parallel and reading from bit 32 in each; the seed is updated when 64 generations have been produced in each).
However, since you are not taking care of the edges, the cellular automaton is not correct past the ~32nd generation, and I would worry about the quality of the later bits. In the plot below I put 32bit integers in seed[0:31] and seed[64:95] and ran Code: mixup: movdqa xmm0,dqword[seed] movdqa xmm1,xmm0 movdqa xmm2,xmm0 psllq xmm1,1 psrlq xmm2,1 por xmm0,xmm1 pxor xmm0,xmm2 movdqa dqword[seed],xmm0 ret You can see that it loses quality quickly, but it might not be a problem as the seed should be updated before the 64th generation.
|
||||||||||
28 Apr 2012, 03:15 |
|
r22 29 Apr 2012, 03:58
Here's version 0.2 of Rule 30 ala SSE.
This update creates a whole byte of pseudo random goodness per call. All that's left is to interleave the SSE instructions to get better throughput. Also, it refreshes the seed every 32 generations instead of 64 generations, but to do so I did some "expanding of the 32byte cache to 64byte seed, which could probably use some work. Code: format PE64 CONSOLE entry start include 'win64ax.inc' section '.text' code readable executable start: SUB rsp, 8 ; stdcall R30Init ; MOV [_cnt], 7FFFFFFh .again: stdcall R30Next ADD dword[_stats + rax * 4], 1 ; cinvoke printf, _fmtbits, rax SUB [_cnt], 1 JNS .again ; MOV [_cnt], 255 .print: MOV eax, dword[_cnt] cinvoke printf, _fmtstats, [_cnt], [_stats + rax * 4] SUB [_cnt], 1 JNS .print exit: cinvoke system, _pause invoke ExitProcess,0 ;; ;; Initialize the seed using the current time and some constants ;; R30Init: RDTSC MOV rdx, seed MOV rcx, init_matrix MOV r8, 7 .fill: XOR rax, [rcx + r8 * 8] ROR rax, 7 MOV [rdx + r8 * 8], rax SUB r8, 1 JNS .fill RET 0 R30Next: SUB rsp, 136 MOVDQA dqword[rsp], xmm8 MOVDQA dqword[rsp + 16], xmm9 MOVDQA dqword[rsp + 32], xmm10 MOVDQA dqword[rsp + 48], xmm11 ; MOVDQA dqword[rsp + 64], xmm12 ; MOVDQA dqword[rsp + 80], xmm13 ; MOVDQA dqword[rsp + 96], xmm14 ; MOVDQA dqword[rsp + 112], xmm15 ; MOVDQA xmm0, dqword[seed1] ; = current MOVDQA xmm1, dqword[seed2] MOVDQA xmm2, dqword[seed3] MOVDQA xmm3, dqword[seed4] MOVDQA xmm4, xmm0 MOVDQA xmm5, xmm0 MOVDQA xmm6, xmm1 MOVDQA xmm7, xmm1 MOVDQA xmm8, xmm2 MOVDQA xmm9, xmm2 MOVDQA xmm10, xmm3 MOVDQA xmm11, xmm3 PSLLQ xmm4, 1 ; = right PSLLQ xmm6, 1 PSLLQ xmm8, 1 PSLLQ xmm10, 1 PSRLQ xmm5, 1 ; = left PSRLQ xmm7, 1 PSRLQ xmm9, 1 PSRLQ xmm11, 1 POR xmm0, xmm4 ; = current OR right POR xmm1, xmm6 POR xmm2, xmm8 POR xmm3, xmm10 PXOR xmm0, xmm5 ; = XOR left PXOR xmm1, xmm7 PXOR xmm2, xmm9 PXOR xmm3, xmm11 MOVDQA dqword[seed1], xmm0 MOVDQA dqword[seed2], xmm1 MOVDQA dqword[seed3], xmm2 MOVDQA dqword[seed4], xmm3 PSLLQ xmm0, 32 ; bit 32 of each qword PSLLQ xmm1, 32 PSLLQ xmm2, 32 PSLLQ xmm3, 32 MOVMSKPD rax, xmm0 ; pack pseudo random bits into AL MOVMSKPD rcx, xmm1 MOVMSKPD rdx, xmm2 MOVMSKPD r8, xmm3 SHL rcx, 2 SHL rdx, 4 SHL r8, 6 OR rax, rcx OR rdx, r8 OR rax, rdx ; use cache to refresh the seed every 32 calls MOV rcx, [cache_ptr] MOV rdx, rax ; ??? copy output MOV byte[cache + rcx], al ADD rcx, 1 AND rdx, 11100000b ; ??? zero low 5 bits for use as pointer into init_matrix AND rcx, 32-1 JNZ .skip ; ??? expand 32 byte cache into 64 byte seed MOVDQA xmm0, dqword[cache1] MOVDQA xmm1, dqword[cache2] MOVDQA xmm2, dqword[init_matrix + rdx] MOVDQA xmm3, dqword[init_matrix + rdx + 16] PXOR xmm2, xmm0 PADDB xmm3, xmm1 MOVDQA dqword[seed1], xmm1 MOVDQA dqword[seed2], xmm2 MOVDQA dqword[seed3], xmm3 MOVDQA dqword[seed4], xmm0 .skip: MOV [cache_ptr], rcx ; MOVDQA xmm8, dqword[rsp] MOVDQA xmm9, dqword[rsp + 16] MOVDQA xmm10, dqword[rsp + 32] MOVDQA xmm11, dqword[rsp + 48] ; MOVDQA xmm12, dqword[rsp + 64] ; MOVDQA xmm13, dqword[rsp + 80] ; MOVDQA xmm14, dqword[rsp + 96] ; MOVDQA xmm15, dqword[rsp + 112] ADD rsp, 136 RET 0 section '.bss' data readable writable align 16 dd 0 section '.data' data readable writable align 16 ;; taken from SHA-2 constants init_matrix dd 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5 dd 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174 dd 0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da dd 0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967 dd 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85 dd 0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070 dd 0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3 dd 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2 align 16 cache: cache1 dq 0, 0 cache2 dq 0, 0 cache_ptr dq 0 align 16 seed: seed1 dq 0, 0 seed2 dq 0, 0 seed3 dq 0, 0 seed4 dq 0, 0 _fmtbits db 'Bits:',9,'%d',13,10,0 _fmtseed db 'Seed:',9,'%.8X%.8X %.8X%.8X',13,10,0 _fmtcache db 'Cache:',9,'%.8X%.8X %.8X%.8X',13,10,0 _pause db 'pause',0 _cnt dd 100 _fmtstats db '%d:', 9, '%d', 13, 10, 0 _stats rd 256 section '.idata' import data readable writeable library kernel32,'KERNEL32.DLL',\ user32,'USER32.DLL',\ msvcrt,'MSVCRT.DLL' include 'API\KERNEL32.INC' include 'API\USER32.INC' import msvcrt,\ system,'system',\ printf,'printf' |
|||
29 Apr 2012, 03:58 |
|
hopcode 09 Jul 2012, 16:26
more like a diary,
i am very glad that this thread goes further even without my tribute. i would thank all people taking part to the discussion. for my own side, i am totally fascinated from the Marsaglia's methods. He has reduced me to silence and admiration. that should explain my absence on the thread. all those methods are a powerful source of inspiration for me. also, concretely the more is yet to come, hopefully in a better way. Cheers, _________________ ⠓⠕⠏⠉⠕⠙⠑ |
|||
09 Jul 2012, 16:26 |
|
Goto page Previous 1, 2, 3 < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.