flat assembler
Message board for the users of flat assembler.
3D Simplex Noise
randall 11 May 2012, 11:40
I have implemented 3D Simplex noise, which is an improved version of the Perlin noise algorithm. My implementation is almost twice as fast as the C++ version from the GLM library (http://glm.g-truc.net/). Maybe it will be useful for someone. I will use this function for procedural terrain generation and rendering.
Performance: about 3.2 Msamples/sec on Core2 CPU 6300 @ 1.86GHz
gunblade 12 May 2012, 02:18
Very neat..
I assume you've probably come across this before, but there's a program called Terragen that does just that. It's been a while since I looked at it or used it, but since I saw this post I looked it up again, and it looks like they've gone all commercial (although they have also improved it greatly; it's gone well beyond just a land generator): http://www.planetside.co.uk
However, the "classic" version is still available, and so is a limited free version of version 2 (limited resolution/quality). It's just a shame it's not open source; it would have been a good codebase to compare to. Still, with some kind of raytracer (depending on the speed requirements, I guess) you could shift that from 2D noise to 3D noise/landscapes.
Nice work. My only issue is that white pixel just above the bottom dark blob: what's introducing that? It will cause a weird-looking glitch if/when that is rendered (especially if you use light = higher land).
randall 12 May 2012, 02:39
Yes, I know Terragen. Nice program.
I will be implementing a simple ray marcher to render landscapes. Something like this: http://www.iquilezles.org/www/articles/terrainmarching/terrainmarching.htm
Yes, these white pixels are odd (there is another one in the middle left), but they are also present in the C++ version, so I will leave them for now. Thanks.
Last edited by randall on 12 May 2012, 12:31; edited 1 time in total
Madis731 16 May 2012, 10:41
I just noticed that ABS() macro. Three instructions is a bit much I think.
Code:
macro ABS {
    andps xmm0,dqword[absps] ; clear sign-bit
}
absps dd 7FFFFFFFh,7FFFFFFFh,7FFFFFFFh,7FFFFFFFh

Are you afraid the memory access (even if in cache) is slower than three single-clock instructions (two of which are dependent)? There's a solution for that: cache the absps value in a register (xmm15, or xmm7 in 32-bit land) and use
Code:
    andps xmm0,xmm15 ; xmm7 in 32-bit land

Then I noticed the DOT4() macro. I used to struggle with it myself, wanting to use PSHUFD. At first I thought that it's for integers and would make it slower, but actually you can get rid of some instructions that way:
Code:
    mulps xmm0,xmm1
    pshufd xmm1,xmm0,0x55 ; All dwords are taken from the source register
    pshufd xmm2,xmm0,0xaa
    pshufd xmm3,xmm0,0xff
    pshufd xmm0,xmm0,0x00
    addps xmm0,xmm1
    addps xmm2,xmm3
    addps xmm0,xmm2

Tell me if they made any difference to this number:
randall wrote:
Performance: about 3.2 Msamples/sec on Core2 CPU 6300 @ 1.86GHz
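For anyone who wants to check the ABS suggestion outside the assembler, here is a minimal scalar sketch in C of one lane. It assumes IEEE-754 single precision; the helper names abs_by_mask and abs_by_max are made up for illustration and do not come from the posted source. The first mirrors the single ANDPS with 7FFFFFFFh, the second mirrors the original xorps/subps/maxps sequence.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: clear the sign bit, as ANDPS with 7FFFFFFFh does per lane. */
float abs_by_mask(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the float as raw bits */
    bits &= 0x7FFFFFFFu;              /* same constant as the absps table */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Hypothetical helper: the three-instruction version, max(x, 0 - x). */
float abs_by_max(float x)
{
    float neg = 0.0f - x;             /* xorps + subps */
    return (x > neg) ? x : neg;       /* maxps keeps the larger lane */
}
```

Both helpers should agree on ordinary finite inputs, which is the case the noise code cares about; NaN handling is not considered here.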
r22 16 May 2012, 16:28
Code:
;; DOT4 if you have SSE3 instructions available
    MULPS xmm0, xmm1
    HADDPS xmm0, xmm0
    HADDPS xmm0, xmm0
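To see why two horizontal adds are enough, here is a small scalar emulation in C. It is only a sketch: haddps and dot4_hadd are illustrative names, and the arrays stand in for the four lanes of an XMM register (index 0 is the lowest lane).

```c
/* Emulates HADDPS dst,src on 4-float vectors:
   the low half receives the pair sums of dst, the high half the pair sums of src. */
void haddps(float dst[4], const float src[4])
{
    float r0 = dst[0] + dst[1];
    float r1 = dst[2] + dst[3];
    float r2 = src[0] + src[1];
    float r3 = src[2] + src[3];
    dst[0] = r0; dst[1] = r1; dst[2] = r2; dst[3] = r3;
}

/* MULPS followed by two HADDPS with the register as both operands. */
void dot4_hadd(const float a[4], const float b[4], float out[4])
{
    for (int k = 0; k < 4; k++)
        out[k] = a[k] * b[k];     /* mulps xmm0,xmm1 */
    haddps(out, out);             /* | p2+p3  p0+p1  p2+p3  p0+p1 | */
    haddps(out, out);             /* every lane now holds p0+p1+p2+p3 */
}
```

After the second step all four lanes carry the same sum, which matches the "s s s s" output convention of the DOT4 macro.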
macgub 17 May 2012, 11:35
I ported randall's code to KolibriOS and MenuetOS64.
http://macgub.co.pl/menuet/snoise.zip
Last edited by macgub on 15 Feb 2022, 17:19; edited 2 times in total
CandyMan 17 May 2012, 16:08
Code:
format Flat on "vitamin.exe"
entry Start32
stack 8k
use64
;
Code16 = 90h ; 16-bit compatibility mode code selector
Code32 = 38h ; 32-bit compatibility mode code selector
Data32 = 30h ; 32-bit compatibility mode data selector
Code64 = 28h ; 64-bit code selector
BufferSize = 32k ; transfer buffer
;
macro int No {
    int No+80h
}
;
struc RMCS ;real mode call structure
{
    .rEDI dd ? ;+0
    virtual at .rEDI
    .rDI dw ?
    end virtual
    .rESI dd ? ;+4
    virtual at .rESI
    .rSI dw ?
    end virtual
    .rEBP dd ? ;+8
    virtual at .rEBP
    .rBP dw ?
    end virtual
    .Reserve dd ? ;+12
    .rEBX dd ? ;+16
    virtual at .rEBX
    .rBX dw ?
    end virtual
    virtual at .rBX
    .rBL db ?
    .rBH db ?
    end virtual
    .rEDX dd ? ;+20
    virtual at .rEDX
    .rDX dw ?
    end virtual
    virtual at .rDX
    .rDL db ?
    .rDH db ?
    end virtual
    .rECX dd ? ;+24
    virtual at .rECX
    .rCX dw ?
    end virtual
    virtual at .rCX
    .rCL db ?
    .rCH db ?
    end virtual
    .rEAX dd ? ;+28
    virtual at .rEAX
    .rAX dw ?
    end virtual
    virtual at .rAX
    .rAL db ?
    .rAH db ?
    end virtual
    .rFL dw ? ;+32
    .rES dw ? ;+34
    .rDS dw ? ;+36
    .rFS dw ? ;+38
    .rGS dw ? ;+40
    .rCSIP dd ? ;+42
    virtual at .rCSIP
    .rIP dw ?
    .rCS dw ?
    end virtual
    .rSSSP dd ? ;+46
    virtual at .rSSSP
    .rSP dw ?
    .rSS dw ?
    end virtual
}
;
virtual at 0
    RMCS RMCS
end virtual
;
;I: rcx,rdx
;O: rax
WriteToFile:
    push r8 rcx rdx rsi rdi rbp
    xor r8d,r8d
    jrcxz .End
    mov rbp,rcx
    mov rsi,rdx
.Loop:
    mov ecx,BufferSize
    sub rbp,rcx
    jnc .Write
    add rbp,rcx
    mov ecx,ebp
    xor ebp,ebp
.Write:
    push rcx
    mov edi,[LinBuff]
    shr rcx,3
    rep movsq
    mov cl,[rsp]
    and cl,111b
    rep movsb
    pop rcx
    mov ah,40h
    call DosIntWithBufferZero
    add r8,rax
    or rbp,rbp
    jnz .Loop
.End:
    mov rax,r8
    pop rbp rdi rsi rdx rcx r8
    ret
DosIntWithBufferZero:
    xor edx,edx
DosIntWithBuffer:
    push rbx rcx rdi
    lea edi,[Regs]
    mov [rdi+RMCS.rAH],ah
    mov eax,[SegBuff]
    mov [rdi+RMCS.rDS],ax
    mov [rdi+RMCS.rES],ax
    mov [rdi+RMCS.rEBX],ebx
    mov [rdi+RMCS.rECX],ecx
    mov [rdi+RMCS.rEDX],edx
    call DosInt
    movzx eax,[rdi+RMCS.rAX]
    bt dword [rdi+RMCS.rFL],0
    pop rdi rcx rbx
    ret
DosInt:
    mov bl,21h
    mov ax,0300h
    xor bh,bh
    xor ecx,ecx
    mov [rdi+RMCS.rSSSP],ecx
    mov [rdi+RMCS.rFL],1
    int 31h
    ret
Start32:
    use32
    jmp Code64:Start64
    use64
;-------------------------------------------------------------------------------
; NAME: DOT3
; IN: xmm0 | ? z0 y0 x0 |
; IN: xmm1 | ? z1 y1 x1 |
; OUT: xmm0 | s s s s | s = x0*x1+y0*y1+z0*z1
;-------------------------------------------------------------------------------
macro DOT3 {
    mulps xmm0,xmm1
    movaps xmm1,xmm0
    movaps xmm2,xmm0
    shufps xmm0,xmm0,0x00
    shufps xmm1,xmm1,0x55
    shufps xmm2,xmm2,0xAA
    addps xmm0,xmm1
    addps xmm0,xmm2
}
;-------------------------------------------------------------------------------
; NAME: DOT4
; IN: xmm0 | w0 z0 y0 x0 |
; IN: xmm1 | w1 z1 y1 x1 |
; OUT: xmm0 | s s s s | s = x0*x1+y0*y1+z0*z1+w0*w1
;-------------------------------------------------------------------------------
macro DOT4 {
    mulps xmm0,xmm1
    movaps xmm1,xmm0
    movaps xmm2,xmm0
    movaps xmm3,xmm0
    shufps xmm0,xmm0,0x00
    shufps xmm1,xmm1,0x55
    shufps xmm2,xmm2,0xAA
    shufps xmm3,xmm3,0xFF
    addps xmm0,xmm1
    addps xmm2,xmm3
    addps xmm0,xmm2
}
;-------------------------------------------------------------------------------
; NAME: FLOOR
; IN: xmm0 | w z y x |
; OUT: xmm0 | floor(w) floor(z) floor(y) floor(x) |
;-------------------------------------------------------------------------------
macro FLOOR {
    cvttps2dq xmm1,xmm0
    psrld xmm0,31
    psubd xmm1,xmm0
    cvtdq2ps xmm0,xmm1
}
;-------------------------------------------------------------------------------
; NAME: STEP
; IN: xmm0 | ew ez ey ex | edge vector
; IN: xmm1 | w z y x | value vector
; OUT: xmm0 | step(ew,w) step(ez,z) step(ey,y) step(ex,x) |
;-------------------------------------------------------------------------------
macro STEP {
    cmpltps xmm1,xmm0
    andnps xmm1,dqword [g_1_0]
    movaps xmm0,xmm1
}
;-------------------------------------------------------------------------------
; NAME: MOD289
; IN: xmm0 | w z y x |
; OUT: xmm0 | mod289(w) mod289(z) mod289(y) mod289(x) |
; mod289(s) = s - floor(s * (1.0/289.0)) * 289.0
;-------------------------------------------------------------------------------
macro MOD289 {
    movaps xmm2,xmm0
    mulps xmm0,dqword [g_1_div_289]
    FLOOR
    mulps xmm0,dqword [g_289_0]
    subps xmm2,xmm0
    movaps xmm0,xmm2
}
;-------------------------------------------------------------------------------
; NAME: PERMUTE
; IN: xmm0 | w z y x |
; OUT: xmm0 | perm(w) perm(z) perm(y) perm(x) |
; perm(s) = mod289(((s*34.0)+1.0)*s)
;-------------------------------------------------------------------------------
macro PERMUTE {
    movaps xmm1,xmm0
    mulps xmm0,dqword [g_34_0]
    addps xmm0,dqword [g_1_0]
    mulps xmm0,xmm1
    MOD289
}
;-------------------------------------------------------------------------------
; NAME: ABS
; IN: xmm0 | w z y x |
; OUT: xmm0 | abs(w) abs(z) abs(y) abs(x) |
;-------------------------------------------------------------------------------
macro ABS {
    xorps xmm1,xmm1 ; xmm1 = | 0 0 0 0 |
    subps xmm1,xmm0 ; xmm1 = neg(x)
    maxps xmm0,xmm1 ; xmm0 = abs(x)
}
;-------------------------------------------------------------------------------
; NAME: SNoise3
; DESC: 3D Simplex noise (https://github.com/ashima/webgl-noise)
; IN: xmm0 | ? z y x |
; OUT: xmm0 | s s s s | s is noise value [-1.0,1.0]
;-------------------------------------------------------------------------------
even 16
SNoise3:
v equ rbp-16
i equ rbp-32
x0 equ rbp-48
x1 equ rbp-64
x2 equ rbp-80
x3 equ rbp-96
i1 equ rbp-112
i2 equ rbp-128
    push rbp
    mov rbp,rsp
    sub rsp,256
    movaps [v],xmm0 ; save input on the stack
    ;
    ; Compute corners (x0, x1, x2, x3)
    ;
    ; i = floor(v + dot(v, C.yyy))
    movaps xmm1,dqword [g_snoise_C]
    shufps xmm1,xmm1,0x55 ; xmm1 = C.yyy
    DOT3 ; xmm0 = dot(xmm0,xmm1)
    addps xmm0,[v]
    FLOOR
    movaps [i],xmm0
    ; x0 = v - i + dot(i, C.xxx)
    movaps xmm3,[v]
    subps xmm3,xmm0
    movaps xmm1,dqword [g_snoise_C]
    shufps xmm1,xmm1,0x00 ; xmm1 = C.xxx
    DOT3 ; xmm0 = dot(xmm0,xmm1)
    addps xmm3,xmm0
    movaps [x0],xmm3
    ; compute i1 and i2
    movaps xmm0,xmm3
    shufps xmm0,xmm0,11001001b ; xmm0 = | w x z y |
    movaps xmm1,xmm3
    STEP
    movaps xmm7,xmm0 ; xmm7 = g
    movaps xmm6,dqword [g_1_0]
    subps xmm6,xmm7 ; xmm6 = 1.0 - g = l
    shufps xmm6,xmm6,11010010b ; xmm6 = | w y x z |
    movaps xmm0,xmm7
    movaps xmm1,xmm7
    minps xmm0,xmm6 ; xmm0 = min(g.xyz, l.zxy)
    maxps xmm1,xmm6 ; xmm1 = max(g.xyz, l.zxy)
    movaps [i1],xmm0 ; xmm0 = i1
    movaps [i2],xmm1 ; xmm1 = i2
    ; compute x1, x2 and x3
    movaps xmm7,[x0] ; xmm7 = x0
    movaps xmm6,xmm7 ; xmm6 = x0
    movaps xmm5,xmm7 ; xmm5 = x0
    movaps xmm4,dqword [g_snoise_C]
    movaps xmm3,xmm4
    movaps xmm2,dqword [g_snoise_D]
    shufps xmm4,xmm4,0x00 ; xmm4 = C.xxx
    shufps xmm3,xmm3,0x55 ; xmm3 = C.yyy
    shufps xmm2,xmm2,0x55 ; xmm2 = D.yyy
    subps xmm5,xmm0 ; xmm5 = x0 - i1
    subps xmm6,xmm1 ; xmm6 = x0 - i2
    subps xmm7,xmm2 ; xmm7 = x0 - D.yyy
    addps xmm5,xmm4 ; xmm5 = x0 - i1 + C.xxx
    addps xmm6,xmm3 ; xmm6 = x0 - i2 + C.yyy
    movaps [x1],xmm5
    movaps [x2],xmm6
    movaps [x3],xmm7
    ;
    ; Compute permutations (p)
    ;
    movaps xmm0,[i]
    MOD289
    movaps xmm7,xmm0
    movaps xmm6,xmm0
    movaps xmm5,xmm0
    shufps xmm7,xmm7,10101010b ; xmm7 = i.zzzz
    shufps xmm6,xmm6,01010101b ; xmm6 = i.yyyy
    shufps xmm5,xmm5,00000000b ; xmm5 = i.xxxx
    movaps xmm4,[i1] ; xmm4 = i1
    movaps xmm3,[i2] ; xmm3 = i2
    ;
    movaps xmm0,xmm4 ; xmm0 = i1
    movaps xmm1,xmm3 ; xmm1 = i2
    shufps xmm0,xmm0,10101010b ; xmm0 = | i1.z i1.z i1.z i1.z |
    shufps xmm1,xmm1,10101010b ; xmm1 = | i2.z i2.z i2.z i2.z |
    andps xmm0,dqword [g_mask_0010];xmm0 = | 0 0 i1.z 0 |
    andps xmm1,dqword [g_mask_0100];xmm1 = | 0 i2.z 0 0 |
    orps xmm0,xmm1 ; xmm0 = | 0 i2.z i1.z 0 |
    orps xmm0,dqword [g_1_0_w] ; xmm0 = | 1 i2.z i1.z 0 |
    addps xmm0,xmm7 ; xmm0 = i.zzzz + | 1 i2.z i1.z 0 |
    PERMUTE
    movaps xmm8,xmm0 ; xmm8 = p
    movaps xmm0,xmm4 ; xmm0 = i1
    movaps xmm1,xmm3 ; xmm1 = i2
    shufps xmm0,xmm0,01010101b ; xmm0 = | i1.y i1.y i1.y i1.y |
    shufps xmm1,xmm1,01010101b ; xmm1 = | i2.y i2.y i2.y i2.y |
    andps xmm0,dqword [g_mask_0010];xmm0 = | 0 0 i1.y 0 |
    andps xmm1,dqword [g_mask_0100];xmm1 = | 0 i2.y 0 0 |
    orps xmm0,xmm1 ; xmm0 = | 0 i2.y i1.y 0 |
    orps xmm0,dqword [g_1_0_w] ; xmm0 = | 1 i2.y i1.y 0 |
    addps xmm0,xmm6 ; xmm0 = i.yyyy + | 1 i2.y i1.y 0 |
    addps xmm0,xmm8
    PERMUTE
    movaps xmm8,xmm0 ; xmm8 = p
    movaps xmm0,xmm4 ; xmm0 = i1
    movaps xmm1,xmm3 ; xmm1 = i2
    shufps xmm0,xmm0,00000000b ; xmm0 = | i1.x i1.x i1.x i1.x |
    shufps xmm1,xmm1,00000000b ; xmm1 = | i2.x i2.x i2.x i2.x |
    andps xmm0,dqword [g_mask_0010];xmm0 = | 0 0 i1.x 0 |
    andps xmm1,dqword [g_mask_0100];xmm1 = | 0 i2.x 0 0 |
    orps xmm0,xmm1 ; xmm0 = | 0 i2.x i1.x 0 |
    orps xmm0,dqword [g_1_0_w] ; xmm0 = | 1 i2.x i1.x 0 |
    addps xmm0,xmm5 ; xmm0 = i.xxxx + | 1 i2.x i1.x 0 |
    addps xmm0,xmm8
    PERMUTE
    movaps xmm8,xmm0 ; xmm8 = p
    ;
    ; Compute gradients
    ;
    movaps xmm0,dqword [g_snoise_D]
    movaps xmm1,xmm0
    shufps xmm0,xmm0,11100111b ; xmm0 = | D.w D.z D.y D.w |
    shufps xmm1,xmm1,11001000b ; xmm1 = | D.w D.x D.z D.x |
    mulps xmm0,dqword [g_1_div_7]
    subps xmm0,xmm1
    movaps xmm7,xmm0 ; xmm7 = ns
    ; xmm8 = j = p - 49.0 * floor(p * ns.z * ns.z)
    shufps xmm0,xmm0,10101010b ; xmm0 = ns.zzzz
    mulps xmm0,xmm0
    mulps xmm0,xmm8
    FLOOR
    mulps xmm0,dqword [g_49_0]
    subps xmm8,xmm0 ; xmm8 = j
    ; x_ = floor(j * ns.zzzz)
    movaps xmm0,xmm7
    shufps xmm0,xmm0,10101010b ; xmm0 = ns.zzzz
    mulps xmm0,xmm8
    FLOOR ; xmm0 = x_
    movaps xmm6,xmm0 ; xmm6 = x_
    ; y_ = floor(j - 7.0 * x_)
    mulps xmm0,dqword [g_7_0]
    movaps xmm1,xmm8
    subps xmm1,xmm0
    movaps xmm0,xmm1
    FLOOR
    movaps xmm5,xmm0 ; xmm5 = y_
    ; x = x_ * ns.xxxx + ns.yyyy
    ; y = y_ * ns.xxxx + ns.yyyy
    movaps xmm0,xmm7 ; xmm0 = ns
    movaps xmm1,xmm7 ; xmm1 = ns
    shufps xmm0,xmm0,00000000b ; xmm0 = ns.xxxx
    shufps xmm1,xmm1,01010101b ; xmm1 = ns.yyyy
    mulps xmm6,xmm0 ; xmm6 = x_ * ns.xxxx
    mulps xmm5,xmm0 ; xmm5 = y_ * ns.xxxx
    addps xmm6,xmm1 ; xmm6 = x = x_ * ns.xxxx + ns.yyyy
    addps xmm5,xmm1 ; xmm5 = y = y_ * ns.xxxx + ns.yyyy
    ; h = 1.0 - abs(x) - abs(y)
    movaps xmm4,dqword [g_1_0] ; xmm4 = h = | 1 1 1 1 |
    movaps xmm0,xmm6 ; xmm0 = x
    ABS ; xmm0 = abs(x)
    movaps xmm3,xmm0 ; xmm3 = abs(x)
    movaps xmm0,xmm5 ; xmm0 = y
    ABS ; xmm0 = abs(y)
    subps xmm4,xmm3 ; xmm4 = h = 1.0 - abs(x)
    subps xmm4,xmm0 ; xmm4 = h = 1.0 - abs(x) - abs(y)
    ; b0 = vec4(x.xy, y.xy)
    movaps xmm0,xmm6 ; xmm0 = x
    movaps xmm1,xmm5 ; xmm1 = y
    unpcklps xmm0,xmm1 ; xmm0 = | y.y x.y y.x x.x |
    shufps xmm0,xmm0,11011000b ; xmm0 = | y.y y.x x.y x.x |
    movaps xmm7,xmm0 ; xmm7 = b0
    ; b1 = vec4(x.zw, y.zw)
    movaps xmm0,xmm6 ; xmm0 = x
    movaps xmm1,xmm5 ; xmm1 = y
    unpckhps xmm0,xmm1 ; xmm0 = | y.w x.w y.z x.z |
    shufps xmm0,xmm0,11011000b ; xmm0 = | y.w y.z x.w x.z |
    movaps xmm3,xmm0 ; xmm3 = b1
    ; s0 = floor(b0) * 2.0 + 1.0
    movaps xmm0,xmm7
    FLOOR
    addps xmm0,xmm0
    addps xmm0,dqword [g_1_0]
    movaps xmm15,xmm0 ; xmm15 = s0
    ; s1 = floor(b1) * 2.0 + 1.0
    movaps xmm0,xmm3
    FLOOR
    addps xmm0,xmm0
    addps xmm0,dqword [g_1_0]
    movaps xmm14,xmm0 ; xmm14 = s1
    ; sh = -step(h, vec4(0.0))
    movaps xmm0,xmm4 ; xmm0 = h
    xorps xmm1,xmm1 ; xmm1 = | 0 0 0 0 |
    STEP
    xorps xmm1,xmm1
    subps xmm1,xmm0
    movaps xmm13,xmm1 ; xmm13 = sh
    ; a0 = b0.xzyw + s0.xzyw * sh.xxyy
    shufps xmm7,xmm7,11011000b ; xmm7 = b0 = | w y z x |
    shufps xmm15,xmm15,11011000b ; xmm15 = s0 = | w y z x |
    movaps xmm0,xmm13
    shufps xmm0,xmm0,01010000b ; xmm0 = | y y x x |
    mulps xmm0,xmm15
    addps xmm0,xmm7
    movaps xmm7,xmm0 ; xmm7 = a0
    ; a1 = b1.xzyw + s1.xzyw * sh.zzww
    shufps xmm3,xmm3,11011000b ; xmm3 = b1 = | w y z x |
    shufps xmm14,xmm14,11011000b ; xmm14 = s1 = | w y z x |
    shufps xmm13,xmm13,11111010b ; xmm13 = sh = | w w z z |
    mulps xmm14,xmm13
    addps xmm14,xmm3
    movaps xmm6,xmm14 ; xmm6 = a1
    ; p0 = vec3(a0.xy, h.x)
    movaps xmm0,xmm7 ; xmm0 = a0
    shufps xmm0,xmm4,00000100b ; | h.x h.x a0.y a0.x |
    movaps xmm5,xmm0 ; xmm5 = p0
    ; p1 = vec3(a0.zw, h.y)
    shufps xmm7,xmm4,01011110b ; xmm7 = p1 = | h.y h.y a0.w a0.z |
    ; p2 = vec3(a1.xy, h.z)
    movaps xmm0,xmm6 ; xmm0 = a1
    shufps xmm0,xmm4,10100100b ; | h.z h.z a1.y a1.x |
    movaps xmm3,xmm0 ; xmm3 = p2
    ; p3 = vec3(a1.zw, h.w)
    shufps xmm6,xmm4,11111110b ; xmm6 = p3 = | h.w h.w a1.w a1.z |
    ;
    movaps xmm4,xmm3 ; xmm4 = p2
    ;
    ; Normalize gradients
    ;
    ; xmm5 = p0, xmm7 = p1, xmm4 = p2, xmm6 = p3
    ;
    ; xmm15 = dot(p0, p0)
    movaps xmm0,xmm5 ; xmm0 = p0
    movaps xmm1,xmm5 ; xmm1 = p0
    DOT3
    movaps xmm15,xmm0 ; xmm15 = dot(p0, p0)
    ; xmm14 = dot(p1, p1)
    movaps xmm0,xmm7 ; xmm0 = p1
    movaps xmm1,xmm7 ; xmm1 = p1
    DOT3
    movaps xmm14,xmm0 ; xmm14 = dot(p1, p1)
    ; xmm13 = dot(p2, p2)
    movaps xmm0,xmm4 ; xmm0 = p2
    movaps xmm1,xmm4 ; xmm1 = p2
    DOT3
    movaps xmm13,xmm0 ; xmm13 = dot(p2, p2)
    ; xmm12 = dot(p3, p3)
    movaps xmm0,xmm6 ; xmm0 = p3
    movaps xmm1,xmm6 ; xmm1 = p3
    DOT3
    movaps xmm12,xmm0 ; xmm12 = dot(p3, p3)
    ;
    movaps xmm0,dqword [g_taylor_scale]
    movaps xmm1,dqword [g_taylor_bias]
    mulps xmm15,xmm0
    mulps xmm14,xmm0
    mulps xmm13,xmm0
    mulps xmm12,xmm0
    addps xmm15,xmm1
    addps xmm14,xmm1
    addps xmm13,xmm1
    addps xmm12,xmm1
    ; normalize
    mulps xmm5,xmm15 ; xmm5 = p0
    mulps xmm7,xmm14 ; xmm7 = p1
    mulps xmm4,xmm13 ; xmm4 = p2
    mulps xmm6,xmm12 ; xmm6 = p3
    ;
    ; Mix final noise value
    ;
    ; xmm15 = dot(x0, x0)
    movaps xmm0,[x0] ; xmm0 = x0
    movaps xmm1,xmm0 ; xmm1 = x0
    DOT3
    movaps xmm15,xmm0 ; xmm15 = dot(x0, x0)
    ; xmm14 = dot(x1, x1)
    movaps xmm0,[x1] ; xmm0 = x1
    movaps xmm1,xmm0 ; xmm1 = x1
    DOT3
    movaps xmm14,xmm0 ; xmm14 = dot(x1, x1)
    ; xmm13 = dot(x2, x2)
    movaps xmm0,[x2] ; xmm0 = x2
    movaps xmm1,xmm0 ; xmm1 = x2
    DOT3
    movaps xmm13,xmm0 ; xmm13 = dot(x2, x2)
    ; xmm12 = dot(x3, x3)
    movaps xmm0,[x3] ; xmm0 = x3
    movaps xmm1,xmm0 ; xmm1 = x3
    DOT3
    movaps xmm12,xmm0 ; xmm12 = dot(x3, x3)
    ;
    andps xmm15,dqword [g_mask_0001]
    andps xmm14,dqword [g_mask_0010]
    andps xmm13,dqword [g_mask_0100]
    andps xmm12,dqword [g_mask_1000]
    orps xmm15,xmm14
    orps xmm13,xmm12
    orps xmm15,xmm13
    movaps xmm0,dqword [g_0_6]
    subps xmm0,xmm15
    maxps xmm0,dqword [g_0_0] ; xmm0 = m
    mulps xmm0,xmm0
    mulps xmm0,xmm0
    movaps xmm10,xmm0 ; xmm10 = m^4
    ;
    ; xmm15 = dot(x0, p0)
    movaps xmm0,[x0] ; xmm0 = x0
    movaps xmm1,xmm5 ; xmm1 = p0
    DOT3
    movaps xmm15,xmm0 ; xmm15 = dot(x0, p0)
    ; xmm14 = dot(x1, p1)
    movaps xmm0,[x1] ; xmm0 = x1
    movaps xmm1,xmm7 ; xmm1 = p1
    DOT3
    movaps xmm14,xmm0 ; xmm14 = dot(x1, p1)
    ; xmm13 = dot(x2, p2)
    movaps xmm0,[x2] ; xmm0 = x2
    movaps xmm1,xmm4 ; xmm1 = p2
    DOT3
    movaps xmm13,xmm0 ; xmm13 = dot(x2, p2)
    ; xmm12 = dot(x3, p3)
    movaps xmm0,[x3] ; xmm0 = x3
    movaps xmm1,xmm6 ; xmm1 = p3
    DOT3
    movaps xmm12,xmm0 ; xmm12 = dot(x3, p3)
    ; put all above dots into xmm15
    andps xmm15,dqword [g_mask_0001]
    andps xmm14,dqword [g_mask_0010]
    andps xmm13,dqword [g_mask_0100]
    andps xmm12,dqword [g_mask_1000]
    orps xmm15,xmm14
    orps xmm13,xmm12
    orps xmm15,xmm13
    ;
    movaps xmm0,xmm10 ; xmm0 = m^4
    movaps xmm1,xmm15
    DOT4
    mulps xmm0,dqword [g_42_0]
    mov rsp,rbp
    pop rbp
    ret
restore v,i,x0,x1,x2,x3,i1,i2
;-------------------------------------------------------------------------------
; NAME: Main
; DESC: Program main function.
;-------------------------------------------------------------------------------
even 16
Main:
ImgPtr equ rbp-8
    push rbp
    mov rbp,rsp
    sub rsp,128
    ; alloc memory for the image
    lea eax,[MemStrt]
    mov [ImgPtr],rax
    mov ebx,eax
    ; begin loops
    xor r13d,r13d ; .LoopY index
.LoopY:
    xor r12d,r12d ; .LoopX index
.LoopX:
    ; compute
    xorps xmm0,xmm0
    xorps xmm1,xmm1
    cvtsi2ss xmm0,r12d ; xmm0 = | 0 0 0 x |
    cvtsi2ss xmm1,r13d ; xmm1 = | 0 0 0 y |
    unpcklps xmm0,xmm1 ; xmm0 = | 0 0 y x |
    divps xmm0,dqword [g_size]
    addps xmm0,xmm0
    addps xmm0,xmm0
    call SNoise3
    mulps xmm0,dqword [g_0_5]
    addps xmm0,dqword [g_0_5]
    ; clamp to [0.0,1.0]
    minps xmm0,dqword [g_1_0]
    maxps xmm0,dqword [g_0_0]
    ; convert from [0.0,1.0] to [0,255]
    mulps xmm0,dqword [g_255_0]
    cvttps2dq xmm0,xmm0
    movd eax,xmm0
    mov [rbx+2],al ; red
    pshufd xmm1,xmm0,00000001b
    movd eax,xmm1
    mov [rbx+1],al ; green
    pshufd xmm1,xmm0,00000010b
    movd eax,xmm1
    mov [rbx+0],al ; blue
    mov byte [rbx+3],255 ; alpha
    ; advance pixel pointer
    add rbx,4-1
    ; continue .LoopX
    inc r12d
    cmp r12d,SIZE
    jne .LoopX
    ; continue .LoopY
    inc r13d
    cmp r13d,SIZE
    jne .LoopY
    ; create TGA file
    mov ah,3Ch
    lea edx,[g_tga_name]
    xor ecx,ecx
    int 21h
    xchg ebx,eax
    ; write header
    lea edx,[g_tga_head]
    mov ecx,18
    call WriteToFile
    ; write pixel data
    mov edx,[ImgPtr]
    mov ecx,SIZE*SIZE*3
    call WriteToFile
    ; close file
    mov ah,3Eh
    int 21h
    mov rsp,rbp
    pop rbp
    ret
restore ImgPtr
;-------------------------------------------------------------------------------
; NAME: Debug
;-------------------------------------------------------------------------------
even 16
Debug:
v equ rbp-16
    push rbp
    mov rbp,rsp
    sub rsp,128
    mov dword [v+0],1.2
    mov dword [v+4],2.4
    mov dword [v+8],3.5
    mov dword [v+12],0.0
    movaps xmm0,[v]
    call SNoise3
    mov rsp,rbp
    pop rbp
    ret
restore v
;-------------------------------------------------------------------------------
; NAME: Start64
; DESC: Program entry point.
;-------------------------------------------------------------------------------
Start64:
    push 0
    syscall
    mov [ExitAddr],r8
    mov [BufferVar],rcx
    call Main
    xor al,al
    jmp [ExitAddr]
;-------------------------------------------------------------------------------
g_tga_name db 'snoise.tga',0
g_tga_head db 0,0,2,9 dup 0
    db (SIZE and 0x00FF),(SIZE and 0xFF00) shr 8
    db (SIZE and 0x00FF),(SIZE and 0xFF00) shr 8,32-8,0
even 16
SIZE = 600 ;800
g_size dd 4 dup 600.0 ;800.0
g_snoise_C dd 0.166666667,0.333333333,0.0,0.0
g_snoise_D dd 0.0,0.5,1.0,2.0
g_0_0 dd 4 dup 0.0
g_0_5 dd 4 dup 0.5
g_0_6 dd 4 dup 0.6
g_1_0 dd 4 dup 1.0
g_7_0 dd 4 dup 7.0
g_34_0 dd 4 dup 34.0
g_42_0 dd 4 dup 42.0
g_49_0 dd 4 dup 49.0
g_255_0 dd 4 dup 255.0
g_289_0 dd 4 dup 289.0
g_1_div_7 dd 4 dup 0.142857142857
g_1_div_289 dd 4 dup 0.003460208
g_mask_0001 dd 0xFFFFFFFF,0x00000000,0x00000000,0x00000000
g_mask_0010 dd 0x00000000,0xFFFFFFFF,0x00000000,0x00000000
g_mask_0100 dd 0x00000000,0x00000000,0xFFFFFFFF,0x00000000
g_mask_1000 dd 0x00000000,0x00000000,0x00000000,0xFFFFFFFF
g_1_0_w dd 0.0,0.0,0.0,1.0
g_taylor_bias dd 4 dup 1.79284291400159
g_taylor_scale dd 4 dup -0.85373472095314
;-------------------------------------------------------------------------------
ExitAddr dq ?
;
BufferVar:
LinBuff dd ? ;transfer buffer *linear address
SegBuff dd ? ;transfer buffer segment address
;
Regs RMCS
;-------------------------------------------------------------------------------
even 16
MemStrt rb SIZE*SIZE*3
; fasm you can download from: http://board.flatassembler.net/topic.php?t=12811

_________________
smaller is better
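As a reading aid for the FLOOR, MOD289 and PERMUTE macros, here is a scalar sketch in C of what one lane computes. The function names are illustrative and not part of the posted source, and floor_ref is a small libm-free reference added only for comparison. One thing the sketch makes visible: truncating and then subtracting the sign bit matches floor() for non-integer inputs, but gives one less than floor() for negative whole numbers such as -2.0. Whether that matters for the inputs the noise function actually sees is not something the thread establishes.

```c
#include <stdint.h>
#include <string.h>

/* One lane of the FLOOR macro: cvttps2dq, psrld 31, psubd, cvtdq2ps.
   Only meaningful while the value fits in a 32-bit integer. */
float floor_trick(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    int32_t truncated = (int32_t)x;        /* cvttps2dq: round toward zero */
    int32_t sign = (int32_t)(bits >> 31);  /* psrld: 1 for negative inputs */
    return (float)(truncated - sign);      /* psubd, then cvtdq2ps */
}

/* Reference floor written without libm, for comparison only. */
float floor_ref(float x)
{
    int32_t t = (int32_t)x;
    return ((float)t > x) ? (float)(t - 1) : (float)t;
}

/* mod289(s) = s - floor(s * (1.0/289.0)) * 289.0, using the macro's floor. */
float mod289(float s)
{
    return s - floor_trick(s * (1.0f / 289.0f)) * 289.0f;
}

/* perm(s) = mod289(((s*34.0)+1.0)*s) */
float permute(float s)
{
    return mod289((s * 34.0f + 1.0f) * s);
}
```

In SNoise3 the permutation inputs are non-negative after MOD289, so for that stage the two floor variants behave the same way on non-integer products.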
Madis731 13 Jul 2012, 05:18
I just thought about it today. Don't you only need 3D noise for clouds and really complex terrain? For simple heightmaps a 2D noise will do fine, which I guess is faster in implementation.
An example where 3D noise is used: http://dl.dropbox.com/u/12637402/tutorial1/scale64.png (from this http://forums.bukkit.org/threads/intermediate-wgen-more-interesting-terrain-using-3d-simplex-noise.71813/)
randall 13 Jul 2012, 17:58
Madis731 wrote:
I just thought about it today. Don't you only need 3D noise for clouds and really complex terrain? For simple heightmaps a 2D noise will do fine, which I guess is faster in implementation.

Yes, you are right. 2D noise is enough. I have started to work on 2D simplex noise (which is much simpler and faster) but haven't finished it yet. Currently I am more in the C++ and OpenGL world... Anyway, thanks for the tip.
Madis731 13 Jul 2012, 19:27
I dug into http://webstaff.itn.liu.se/~stegu/simplexnoise/simplexnoise.pdf today and thought that the explanation for 2D simplex was really nice and that an SSE implementation would be feasible and extendable (to AVX, for example). Usually textures are power-of-2-sized squares (512, 4096), and dealing with SSE/AVX under these constraints is great because anything above 32 will guarantee you don't have problems with alignment or edges.
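The alignment and edge remark can be checked with a little arithmetic. The sketch below is in C and rests on assumptions not stated in the thread: one 32-bit float per pixel, tightly packed rows, and an image base that is already 32-byte aligned. The helper names are hypothetical.

```c
#include <stddef.h>

/* Pixels left over at the end of a row after processing `lanes` at a time
   (4 floats per SSE register, 8 per AVX register). Zero means no scalar tail loop. */
size_t tail_pixels(size_t width, size_t lanes)
{
    return width % lanes;
}

/* Byte offset of row y from the image base, for tightly packed float rows. */
size_t row_offset_bytes(size_t width, size_t y)
{
    return y * width * sizeof(float);
}
```

With a width of 512, tail_pixels is zero for both lane counts and every row offset is a multiple of 32 bytes, so aligned loads stay aligned from row to row. A width such as 100 would leave a 4-pixel tail for 8-lane processing.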
Madis731 18 Jul 2012, 16:57
Here's my poke at 2D. I don't like the lookup table very much and that mod 12 would be totally unnecessary if I were to replace the LUT with some bit wizardry. It takes about 220 clocks per iteration right now so making it fully vectorized could potentially mean 55 or 28 clk/px with SSE or AVX respectively (that is ideally - of course).
Code:
snoise2:; xmm0 : ?, ?, y, x
    ; xmm13: F2
    ; xmm14: G2
    ; xmm15: G2b2m1
    push rbx
    movd xmm13,[F2]
    pshufd xmm3,xmm0,01010101b
    pshufd xmm4,xmm0,00000000b
    movaps xmm6,xmm3
    movaps xmm7,xmm4
    movaps xmm2,xmm3
    addps xmm2,xmm4
    mulps xmm2,xmm13 ; s = (xin+yin)*F2
    addps xmm3,xmm2 ; xin+s
    cvttps2dq xmm0,xmm3 ; TODO: Optimize
    psrld xmm3,31
    psubd xmm0,xmm3
    cvtdq2ps xmm3,xmm0 ; i = floor(xin+s)
    addps xmm4,xmm2 ; yin+s
    cvttps2dq xmm0,xmm4 ; TODO: Optimize
    psrld xmm4,31
    psubd xmm0,xmm4
    cvtdq2ps xmm4,xmm0 ; j = floor(yin+s)
    movaps xmm5,xmm3
    addps xmm5,xmm4
    mulps xmm5,xmm14 ; t = (i+j)*G2
    subps xmm6,xmm3 ; x0 = xin-i+t
    addps xmm6,xmm5
    subps xmm7,xmm4 ; y0 = yin-j+t
    addps xmm7,xmm5
    movaps xmm12,xmm7
    cmpltps xmm12,xmm6
    movaps xmm13,xmm6
    cmpleps xmm13,xmm7
    psrld xmm12,31
    cvtdq2ps xmm12,xmm12 ; i1 = y0<x0 ? 1 : 0
    psrld xmm13,31
    cvtdq2ps xmm13,xmm13 ; j1 = y0<x0 ? 0 : 1
    ; movaps xmm13,xmm6
    ; cmpleps xmm13,xmm7
    ; psrld xmm13,31
    ; cvtdq2ps xmm13,xmm13 ; j1 = x0<y0 ? 1 : 0
    movaps xmm8,xmm6
    subps xmm8,xmm12
    addps xmm8,xmm14 ; x1 = x0-i1+G2
    movaps xmm9,xmm7
    subps xmm9,xmm13
    addps xmm9,xmm14 ; y1 = y0-j1+G2
    movaps xmm10,xmm6
    addps xmm10,xmm15 ; x2 = x0 + (-1+2*G2)
    movaps xmm11,xmm7
    addps xmm11,xmm15 ; y2 = y0 + (-1+2*G2)
    cvtps2dq xmm3,xmm3
    cvtps2dq xmm4,xmm4
    cvtps2dq xmm12,xmm12
    cvtps2dq xmm13,xmm13
    movd ebx,xmm3
    and ebx,0FFh ; ii
    movd ecx,xmm4
    and ecx,0FFh ; jj
    movd r8d,xmm12
    and r8d,1 ; i1
    movd r9d,xmm13
    and r9d,1 ; j1
    movzx eax,[perm+rcx]
    movzx eax,[perm+rbx+rax]
    call mod12 ; gi0
    mov r10d,eax
    movzx eax,[perm+rcx+r9]
    add r8,rax
    movzx eax,[perm+rbx+r8]
    call mod12 ; gi1
    mov r11d,eax
    movzx eax,[perm+rcx+1]
    movzx eax,[perm+rbx+rax+1]
    call mod12 ; gi2
    mov r12d,eax
    movaps xmm0,dqword[g_0_5]
    movaps xmm1,xmm0
    movaps xmm2,xmm0
    movaps xmm12,xmm6
    mulps xmm12,xmm12 ; x0*x0
    movaps xmm13,xmm7
    mulps xmm13,xmm13 ; y0*y0
    subps xmm0,xmm12
    subps xmm0,xmm13 ; t0 = 0.5 - x0*x0 - y0*y0
    movaps xmm12,xmm0
    cmpltps xmm12,dqword[g_0_0]
    movd eax,xmm12
    cmp eax,0
    jne .t0lt0
    mulps xmm0,xmm0 ; t0 *= t0
    mulps xmm0,xmm0 ; t0 *= t0
    mulss xmm6,[grad3+r10*8+0]
    mulss xmm7,[grad3+r10*8+4]
    addps xmm6,xmm7 ; g[0]*x+g[1]*y
    mulps xmm0,xmm6 ; dot(g3[gi0],x0,y0)
    jmp .t0ge0
.t0lt0:
    xorps xmm0,xmm0
.t0ge0:
    movaps xmm12,xmm8
    mulps xmm12,xmm12
    movaps xmm13,xmm9
    mulps xmm13,xmm13
    subps xmm1,xmm12
    subps xmm1,xmm13
    movaps xmm12,xmm1
    cmpltps xmm12,dqword[g_0_0]
    movd eax,xmm12
    cmp eax,0
    jne .t1lt0
    mulps xmm1,xmm1
    mulps xmm1,xmm1
    mulss xmm8,[grad3+r11*8+0]
    mulss xmm9,[grad3+r11*8+4]
    addps xmm8,xmm9
    mulps xmm1,xmm8 ;dot(g3[gi1],x1,y1)
    jmp .t1ge0
.t1lt0:
    xorps xmm1,xmm1
.t1ge0:
    movaps xmm12,xmm10
    mulps xmm12,xmm12
    movaps xmm13,xmm11
    mulps xmm13,xmm13
    subps xmm2,xmm12
    subps xmm2,xmm13
    movaps xmm12,xmm2
    cmpltps xmm12,dqword[g_0_0]
    movd eax,xmm12
    cmp eax,0
    jne .t2lt0
    mulps xmm2,xmm2
    mulps xmm2,xmm2
    mulss xmm10,[grad3+r12*8+0]
    mulss xmm11,[grad3+r12*8+4]
    addps xmm10,xmm11
    mulps xmm2,xmm10 ;dot(g3[gi2],x2,y2)
    jmp .t2ge0
.t2lt0:
    xorps xmm2,xmm2
.t2ge0:
    addps xmm0,xmm1
    addps xmm0,xmm2
    mulps xmm0,dqword[g_70_0]
    pop rbx
    ret
mod12:;Snippet MOD12 (find largest integer and subtract)
    push rbx
    mov edx,055555556h
    mov ebx,eax
    mul edx ; divide by 3
    and edx,0FFFFFFFCh
    lea edx,[edx*3]
    sub ebx,edx
    mov eax,ebx
    pop rbx
    ret

Code:
main:
    mov ebx,screen3
    movd xmm13,[F2]
    movd xmm14,[G2]
    movd xmm15,[G2b2m1]
    ; begin loops
    xor edx,edx ; .LoopY index
.LoopY:
    xor ecx,ecx ; .LoopX index
.LoopX:
    ; compute
    push rcx rdx
    shl rdx,32
    add rdx,rcx
    movq xmm0,rdx
    cvtdq2ps xmm0,xmm0
    mulps xmm0,dqword[g_size]
    call snoise2
    pop rdx rcx
    mulps xmm0,dqword [g_0_5]
    addps xmm0,dqword [g_0_5]
    ; clamp to [0.0,1.0]
    minps xmm0,dqword [g_1_0]
    maxps xmm0,dqword [g_0_0]
    ; convert from [0.0,1.0] to [0,255]
    mulps xmm0,dqword [g_255_0]
    cvttps2dq xmm0,xmm0
    movd eax,xmm0
    imul eax,010101h
    mov [rbx],eax
    ; advance pixel pointer
    add ebx,3
    ; continue .LoopX
    inc ecx
    cmp ecx,SIZE
    jne .LoopX
    ; continue .LoopY
    inc edx
    cmp edx,SIZE
    jne .LoopY
    ret

Defines:
Code:
SIZE = 512
SIZE_D equ 0.0078125 ; 4/512
align 4
F2 dd 4 dup 0.3660254
G2 dd 4 dup 0.2113249
G2b2m1 dd 4 dup -0.5773503
align 16
g_size dd 4 dup SIZE_D
g_0_0 dd 4 dup 0.0
g_0_5 dd 4 dup 0.5
g_1_0 dd 4 dup 1.0
g_70_0 dd 4 dup 70.0
g_255_0 dd 4 dup 255.0
grad3 dd 1.0,1.0, -1.0,1.0, 1.0,-1.0,-1.0,-1.0
      dd 1.0,0.0, -1.0,0.0, 1.0,0.0, -1.0,0.0
      dd 0.0,1.0, 0.0,-1.0, 0.0,1.0, 0.0,-1.0
perm db 151,160,137,91,90,15,131,13,201,95,96,53,194,233,7,225
     db 140,36,103,30,69,142,8,99,37,240,21,10,23,190,6,148
     db 247,120,234,75,0,26,197,62,94,252,219,203,117,35,11,32
     db 57,177,33,88,237,149,56,87,174,20,125,136,171,168,68,175
     db 74,165,71,134,139,48,27,166,77,146,158,231,83,111,229,122
     db 60,211,133,230,220,105,92,41,55,46,245,40,244,102,143,54
     db 65,25,63,161,1,216,80,73,209,76,132,187,208,89,18,169
     db 200,196,135,130,116,188,159,86,164,100,109,198,173,186,3,64
     db 52,217,226,250,124,123,5,202,38,147,118,126,255,82,85,212
     db 207,206,59,227,47,16,58,17,182,189,28,42,223,183,170,213
     db 119,248,152,2,44,154,163,70,221,153,101,155,167,43,172,9
     db 129,22,39,253,19,98,108,110,79,113,224,232,178,185,112,104
     db 218,246,97,228,251,34,242,193,238,210,144,12,191,179,162,241
     db 81,51,145,235,249,14,239,107,49,192,214,31,181,199,106,157
     db 184,84,204,176,115,121,50,45,127,4,150,254,138,236,205,93
     db 222,114,67,29,24,72,243,141,128,195,78,66,215,61,156,180
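The mod12 snippet above is compact enough that a scalar restatement may help. The following C sketch follows the same steps: multiply by 055555556h and keep the high dword (a division by 3 for small inputs), clear the low two bits (leaving 4 times the quotient by 12), multiply by 3 and subtract. The function name simply reuses the label from the post; the 64-bit intermediate stands in for the EDX:EAX result of MUL. It is only claimed to be valid for the byte-range values that come out of the perm table.

```c
#include <stdint.h>

/* x mod 12 without a division, mirroring the posted mod12 routine. */
uint32_t mod12(uint32_t x)
{
    /* mul edx: high dword of x * 0x55555556 is x / 3 for small x */
    uint32_t third = (uint32_t)(((uint64_t)x * 0x55555556u) >> 32);
    /* and edx,0FFFFFFFCh: round x / 3 down to a multiple of 4, i.e. 4 * (x / 12) */
    uint32_t four_q = third & 0xFFFFFFFCu;
    /* lea edx,[edx*3] and sub: x - 12 * (x / 12) */
    return x - four_q * 3u;
}
```

Since the index fed to mod12 is always a byte read from perm, checking the range 0 to 255 against the plain remainder covers every value the routine will see.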
randall 20 Jul 2012, 14:57
Very nice! Thanks for sharing.
kalambong 06 Sep 2012, 06:30
gunblade wrote:
Very neat..

There is a closely related program called Terramaker: http://www.terraproject.de/terramaker/ It has a lot of different kinds of Perlin noise to choose from.
catafest 07 Dec 2012, 17:11
Can you make a 32-bit ELF? I don't have a 64-bit processor. Thanks.
|
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.