flat assembler
Message board for the users of flat assembler.
Index
> Windows > Optimizing pseudo random generator |
Author |
|
tthsqe 12 Dec 2012, 09:05
this seems to be memory bound.
If I take out the writes to memory (the two movntdq), I get ~1.8billion numbers per second on one core of an i5-3320M. If I do write the results to memory (un-comment the two movntdq), I get ~0.8billion numbers per second. I would just say that your computer has a low memory bandwidth. And, do you know if the c++ test you mentioned was writing the results back to memory? this is what I was testing: Code: format PE GUI 5.0 entry Start include 'win32ax.inc' section '.text' code readable executable ;private static uint GetUint() ;{ ; V = 36969 * (V & 65535) + (V >> 16); ; U = 18000 * (U & 65535) + (U >> 16); ; return (V << 16) + U; ;} random: ; ecx: count , esi: buffer push ebp mov ebp,esp and esp,-16 sub esp,64 mov eax,65535 movd xmm0,eax pshufd xmm0,xmm0,0 movdqa dqword[esp+16*0],xmm0 mov eax,16 movd xmm0,eax pshufd xmm0,xmm0,0 movdqa dqword[esp+16*1],xmm0 mov eax,36969 movd xmm0,eax pshufd xmm0,xmm0,0 movdqa dqword[esp+16*2],xmm0 mov eax,18000 movd xmm0,eax pshufd xmm0,xmm0,0 movdqa dqword[esp+16*3],xmm0 shr ecx,1 ; 2x unroll movaps xmm0,dqword[mainseed+16*0] ; xmm0=V (Re-Use, Re-Calculate) movaps xmm1,dqword[mainseed+16*1] ; xmm1=U (Re-Use, Re-Calculate) movaps xmm2,xmm0 ; xmm2=V (Re-Use, Re-Calculate) movaps xmm3,xmm1 ; xmm3=U (Re-Use, Re-Calculate) movaps xmm4,dqword[mainseed+16*1] ; xmm4=V (Re-Use, Re-Calculate) movaps xmm5,dqword[mainseed+16*2] ; xmm5=U (Re-Use, Re-Calculate) movaps xmm6,xmm4 ; xmm6=V (Re-Use, Re-Calculate) movaps xmm7,xmm5 ; xmm7=U (Re-Use, Re-Calculate) ; Generate random numbers align 64 .l1: pand xmm0,dqword[esp+16*0] pand xmm1,dqword[esp+16*0] pand xmm4,dqword[esp+16*0] pand xmm5,dqword[esp+16*0] psrad xmm2,dqword[esp+16*1] psrad xmm3,dqword[esp+16*1] psrad xmm6,dqword[esp+16*1] psrad xmm7,dqword[esp+16*1] pmulld xmm0,dqword[esp+16*2] pmulld xmm1,dqword[esp+16*3] pmulld xmm4,dqword[esp+16*2] pmulld xmm5,dqword[esp+16*3] paddd xmm0,xmm2 paddd xmm1,xmm3 paddd xmm4,xmm6 paddd xmm5,xmm7 movdqa xmm2,xmm0 movdqa xmm3,xmm1 movdqa xmm6,xmm4 movdqa xmm7,xmm5 pslld xmm0,dqword[esp+16*1] pslld xmm4,dqword[esp+16*1] paddd xmm0,xmm1 paddd xmm4,xmm5 ; movntdq[esi+16*0],xmm0 ; movntdq[esi+16*1],xmm4 movdqa xmm0,xmm2 movdqa xmm4,xmm6 add esi,32 sub ecx,1 jnz .l1 mov esp,ebp pop ebp ret Start: invoke QueryPerformanceFrequency,Frequency mov eax,[Count] shl eax,4 invoke VirtualAlloc,0,eax,MEM_COMMIT,PAGE_READWRITE mov [Result],eax invoke QueryPerformanceCounter,Time mov ecx,[Count] mov esi,[Result] call random mov esi,dword[Time+4*0] mov edi,dword[Time+4*1] invoke QueryPerformanceCounter,Time sub dword[Time+4*0],esi sbb dword[Time+4*1],edi fild qword[Time] fild qword[Frequency] fild dword[Count] fmulp st1,st0 fdivrp st1,st0 fadd st0,st0 ; 4 numbers per count fadd st0,st0 ; fstp qword[Rate] invoke sprintf,Message,MessageFormat,double [Rate] invoke MessageBoxA,0,Message,Caption,MB_OK invoke VirtualFree,[Result],0,MEM_RELEASE invoke ExitProcess,0 section '.data' data readable writeable align 16 mainseed dd 1,2,3,4 dd 5,6,7,8 dd 1,3,5,6 dd 7,5,4,3 align 8 Time dq 0 Frequency dq 0 Rate dq 0 align 4 Result dd 0 Count dd 20000000 align 1 Caption db 'test',0 MessageFormat db 'numbers per second: %f',0 Message rb 200 section '.idata' import data readable writeable library kernel32,'KERNEL32.DLL',\ user32,'USER32.DLL',\ gdi32,'GDI32.DLL',\ msvcrt,'MSVCRT.DLL' include 'api\kernel32.inc' include 'api\user32.inc' include 'api\gdi32.inc' import msvcrt,\ sprintf,'sprintf' |
|||
12 Dec 2012, 09:05 |
|
nmake 12 Dec 2012, 09:22
I do not know if the c++ program wrote anything back. But I do know it was windows 64 and not 32. You can read about it here: http://www.dimkovic.com/node/22
|
|||
12 Dec 2012, 09:22 |
|
tthsqe 12 Dec 2012, 10:34
well, ok - I think that answers it.
You have every reason to relax, dude. That code does not write the results back to a huge array and was run on a 3.8 GHz Core i7 2687W. My slightly modified version of your code got 1.8 billion numbers per second on my 3.3 GHz i5-3320M. We have 3.8/3.3 * 1.8 billion = 2 billion expected numbers per second on their machine. Try to write some 64 bit code and get it up to 3 billion numbers per second. |
|||
12 Dec 2012, 10:34 |
|
cod3b453 12 Dec 2012, 19:02
I think you can reduce the calculation for both U and V to a single instruction:
Code: ; xmm1 = 0001 4650 0001 9069 ; xmm0 = UUUU uuuu VVVV vvvv pmaddwd xmm0,xmm1 ; xmm0 = (18000*u)+(U shr 16) (36969*v)+(V shr 16) |
|||
12 Dec 2012, 19:02 |
|
nmake 12 Dec 2012, 23:24
tthsqe this is fast for 32 bit. I think, it can probably become faster too. cod3b453, good find. I will try it, I have to admit I was peeking at the mul/add instruction but never thought of doing it your way. Very smart way, I'm sure it will give a good boost.
Can we convert this algo into using a base without using division in the end? A base of 100 outputs a number between 0-99 naturally. But division is costy. Btw. more smart tricks and implementations is wanted. |
|||
12 Dec 2012, 23:24 |
|
nmake 13 Dec 2012, 01:15
I managed to set it all up using the pmaddwd to run 2 billion iterations in 481 milliseconds, without implementing the punpcklwd and paddd yet. But I believe this will become very fast. (on a single core) and this is using only 1 sse register with data, if you add 8 of them together, it will become tremendously fast.
|
|||
13 Dec 2012, 01:15 |
|
< Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.