flat assembler
Message board for the users of flat assembler.
Flexer (FASM Lexer)

yakupcemilk 24 Aug 2024, 18:29
Thank you.
Last edited by yakupcemilk on 24 Aug 2024, 19:27; edited 2 times in total 


bitRAKE 24 Aug 2024, 21:23
Claude 3.5 Sonnet can do some basic functions:
Code:
; Function to calculate the dot product of two 3D vectors
; and return the result as a floating-point number
;
; Arguments:
;   RCX: Pointer to the first vector (3 double-precision floats)
;   RDX: Pointer to the second vector (3 double-precision floats)
;
; Returns:
;   XMM0: Dot product result (double-precision float)
dot_product_3d:
    ; Adjust stack to preserve XMM6-XMM8 (16 bytes each)
    sub rsp, 48
    movdqu [rsp], xmm6
    movdqu [rsp + 16], xmm7
    movdqu [rsp + 32], xmm8

    ; Load vector components
    movsd xmm0, [rcx]        ; x1
    movsd xmm1, [rcx + 8]    ; y1
    movsd xmm2, [rcx + 16]   ; z1
    movsd xmm6, [rdx]        ; x2
    movsd xmm7, [rdx + 8]    ; y2
    movsd xmm8, [rdx + 16]   ; z2

    ; Perform element-wise multiplication and accumulate
    mulsd xmm0, xmm6         ; x1 * x2
    mulsd xmm1, xmm7         ; y1 * y2
    mulsd xmm2, xmm8         ; z1 * z2
    addsd xmm0, xmm1
    addsd xmm0, xmm2
    ; Result is now in xmm0

    ; Restore preserved registers and adjust stack
    movdqu xmm8, [rsp + 32]
    movdqu xmm7, [rsp + 16]
    movdqu xmm6, [rsp]
    add rsp, 48
    ret

The x86 has gotten much better over the last year.
_________________
¯\(°_o)/¯
“languages are not safe - uses can be” Bjarne Stroustrup


bitRAKE 24 Aug 2024, 21:29
Me wrote:
What about a sphere-sphere intersection test?

Code:
; Function to test intersection between two spheres
;
; Arguments:
;   RCX: Pointer to first sphere (x, y, z, radius - 4 doubles)
;   RDX: Pointer to second sphere (x, y, z, radius - 4 doubles)
;
; Returns:
;   RAX: 1 if spheres intersect, 0 otherwise
sphere_intersection:
    ; Preserve XMM6-XMM9 as per Win64 ABI
    sub rsp, 64
    movdqu [rsp], xmm6
    movdqu [rsp + 16], xmm7
    movdqu [rsp + 32], xmm8
    movdqu [rsp + 48], xmm9

    ; Load sphere 1 data
    movsd xmm0, [rcx]        ; x1
    movsd xmm1, [rcx + 8]    ; y1
    movsd xmm2, [rcx + 16]   ; z1
    movsd xmm3, [rcx + 24]   ; r1

    ; Load sphere 2 data
    movsd xmm4, [rdx]        ; x2
    movsd xmm5, [rdx + 8]    ; y2
    movsd xmm6, [rdx + 16]   ; z2
    movsd xmm7, [rdx + 24]   ; r2

    ; Calculate distance vector
    subsd xmm0, xmm4         ; dx = x1 - x2
    subsd xmm1, xmm5         ; dy = y1 - y2
    subsd xmm2, xmm6         ; dz = z1 - z2

    ; Calculate squared distance
    mulsd xmm0, xmm0         ; dx^2
    mulsd xmm1, xmm1         ; dy^2
    mulsd xmm2, xmm2         ; dz^2
    addsd xmm0, xmm1
    addsd xmm0, xmm2
    ; xmm0 now contains squared distance

    ; Calculate sum of radii
    addsd xmm3, xmm7         ; r1 + r2
    mulsd xmm3, xmm3         ; (r1 + r2)^2

    ; Compare squared distance with squared sum of radii
    comisd xmm0, xmm3
    setbe al                 ; Set AL to 1 if distance <= sum of radii, 0 otherwise
    movzx rax, al            ; Zero-extend AL to RAX

    ; Restore preserved XMM registers
    movdqu xmm9, [rsp + 48]
    movdqu xmm8, [rsp + 32]
    movdqu xmm7, [rsp + 16]
    movdqu xmm6, [rsp]
    add rsp, 64
    ret
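
For checking generated assembly like the routine above, a plain scalar reference can be handy. The following is only a sketch in C, not part of the original post: the `Sphere` struct and the name `spheres_intersect` are hypothetical, and the field order simply mirrors the "x, y, z, radius" layout stated in the comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical layout mirroring the comments above: x, y, z, radius as 4 doubles. */
typedef struct {
    double x, y, z, radius;
} Sphere;

/* Returns true when the two spheres touch or overlap.
   Compares the squared centre distance with the squared radius sum,
   so no square root is needed (same idea as the comisd/setbe pair). */
bool spheres_intersect(const Sphere *a, const Sphere *b)
{
    assert(a != NULL && b != NULL);

    double dx = a->x - b->x;
    double dy = a->y - b->y;
    double dz = a->z - b->z;
    double dist2 = dx * dx + dy * dy + dz * dz;

    double rsum = a->radius + b->radius;
    return dist2 <= rsum * rsum;
}
```

Feeding the same inputs to this function and to the assembly routine, then comparing the 0/1 results, is one way to catch errors before looking at speed.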


bitRAKE 24 Aug 2024, 21:41
Imagine the flood of code with random errors at every level, but then there is: https://www.swebench.com/ AI to resolve software errors. The best current systems are at about 20% and improving. Mostly high-level languages, which they are more accurate in.



bitRAKE 25 Aug 2024, 03:30
The important thing to understand about the LLMs is that language use drives the output. If the question is a beginner question or the terminology is incorrect, this warps the perspective of the conversation.
Above I specifically asked for a demonstration of the Windows 64-bit ABI - the model does that regardless of efficiency. Then I refined the context to just create isolated functions. The user might need to reset the interface to clear the perspective once the problem is refined, especially with the long-context models. Usually, I can prime the model with the first few lines of AVX2 code (how I want to load the registers, the order I want the data to be processed in) and the model will continue using those constraints.
Code:
vmovapd ymm0, [rdi]          ; Load x, y, z, and radius1 into ymm0
vmovapd ymm1, [rsi]          ; Load x', y', z', and radius2 into ymm1
vaddsd xmm3, xmm0, xmm1      ; radius1 + radius2 in lower part of xmm3
vsubpd ymm0, ymm0, ymm1      ; Compute x1-x2, y1-y2, z1-z2, (r1-r2 is discarded)
vmulpd ymm0, ymm0, ymm0      ; Square the differences
; Horizontal addition to sum squared differences for the distance
vextractf128 xmm1, ymm0, 1   ; Extract upper half of ymm0 into xmm1
vaddpd xmm0, xmm0, xmm1      ; Add the high and low parts of ymm0
vpermilpd xmm1, xmm0, 0b01   ; Shuffle to get the z component into lower xmm1
vaddsd xmm0, xmm0, xmm1      ; Final sum: x^2 + y^2 + z^2 in xmm0
vmulsd xmm3, xmm3, xmm3      ; Square (radius1 + radius2)
vucomisd xmm0, xmm3          ; Compare distance squared (xmm0) with radius squared (xmm3)
setbe al                     ; Set AL if distance squared is less than or equal
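
When a model continues a packed snippet like this, it can help to write down lane by lane what the result should be and compare. The sketch below is only a C model of the intended arithmetic, assuming the lane order {x, y, z, radius} from the first comment; the function name `spheres_intersect_lanes` is hypothetical and nothing here comes from the original post.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed lane order (hypothetical): v[0]=x, v[1]=y, v[2]=z, v[3]=radius. */
bool spheres_intersect_lanes(const double a[4], const double b[4])
{
    double sq[4];

    assert(a != NULL && b != NULL);

    /* Packed subtract and square over all four lanes,
       which is what a vsubpd/vmulpd pair produces. */
    for (size_t i = 0; i < 4; ++i) {
        double diff = a[i] - b[i];
        sq[i] = diff * diff;
    }

    /* Horizontal sum over the x, y, z lanes only.
       Lane 3 holds (r1 - r2)^2 and must stay out of the distance. */
    double dist2 = sq[0] + sq[1] + sq[2];

    /* With this layout the radius sum comes from lane 3, not lane 0. */
    double rsum = a[3] + b[3];

    return dist2 <= rsum * rsum;
}
```

Any packed version should agree with this model for the same inputs; a mismatch usually points at a lane that was summed or read from the wrong position.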


bitRAKE 06 Sep 2024, 11:54
Anthropic wrote:
Some of Anthropic's prompt engineering experts—Amanda Askell (Alignment Finetuning), Alex Albert (Developer Relations), David Hershey (Applied AI), and Zack Witten (Prompt Engineering)—reflect on how prompt engineering has evolved, practical tips, and thoughts on how prompting might change as AI capabilities grow.

... many interesting perspectives on getting better responses from the models.


Roman 06 Sep 2024, 15:41
Sphere-sphere intersection, SSE.
Code:
;data
align 16
Sfer1  dd 5.0,4.0,6.0,0
Sfer2  dd 5.0,4.0,4.0,0
radius dd 9.0,2.0 ;radius1 & radius2

; Load sphere 1 data
movaps xmm0,dqword [ecx]
subps xmm0,dqword [edx]
movss xmm1, [radius] ; r1 ; Load radius1 & radius2
addss xmm1, [radius+4] ; r1+r2
mulss xmm1, xmm1 ; (r1 + r2)^2
mulps xmm0,xmm0
haddps xmm0,xmm0
haddps xmm0,xmm0
; Compare squared distance with squared sum of radii
comiss xmm0, xmm1
setbe al ; Set AL to 1 if distance <= sum of radii, 0 otherwise
movzx eax, al ; Zero-extend AL to EAX


bitRAKE 07 Sep 2024, 14:14
SSE
Code:
{const:16} .A dd 1.0,-1.0,-1.0,-1.0

movaps xmm0,dqword [rcx] ; {r1, x1, y1, z1}
movaps xmm1,dqword [rdx] ; {r2, x2, y2, z2}
mulps xmm1, dqword[.A] ; {1.0,-1.0,-1.0,-1.0}
addps xmm0, xmm1 ; {r1+r2, x1-x2, y1-y2, z1-z2}
mulps xmm0, xmm0 ; Square all elements
dpps xmm0, dqword[.A], 11110001b ; position result with low nibble

... this code works very well because we want to unroll and gather many intersections, for example if we had millions of spheres. We do AABB partitioning into subgroups and then intersection testing. 25% less memory bandwidth. (Of course, we have vfmadd231ps on later processors.)
Last edited by bitRAKE on 09 Sep 2024, 00:37; edited 2 times in total
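
The arithmetic behind the sign vector can be written out in scalar form. This is a sketch in C under stated assumptions, not the poster's code: the lane order {r, x, y, z} is taken from the comments, the name `spheres_intersect_signed` is hypothetical, and the final `>= 0` test is an assumption about how the dpps result would be consumed, since the posted snippet stops before that step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed lane order from the comments above: {r, x, y, z} as 4 floats. */
bool spheres_intersect_signed(const float a[4], const float b[4])
{
    /* Sign vector: keep the radius lane positive, negate the coordinate lanes. */
    static const float sign[4] = {1.0f, -1.0f, -1.0f, -1.0f};
    float acc = 0.0f;

    assert(a != NULL && b != NULL);

    for (size_t i = 0; i < 4; ++i) {
        /* mulps + addps: {r1+r2, x1-x2, y1-y2, z1-z2} */
        float t = a[i] + sign[i] * b[i];
        /* mulps (square) + signed dot product: add the radius term,
           subtract the coordinate terms. */
        acc += sign[i] * (t * t);
    }

    /* acc = (r1+r2)^2 - dx^2 - dy^2 - dz^2.
       Assumption: non-negative means the spheres touch or overlap. */
    return acc >= 0.0f;
}
```

Reusing one constant for both the negation and the dot product is what lets the whole test collapse into a single signed sum.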


revolution 07 Sep 2024, 14:25
In case anyone isn't aware: dpps requires SSE4.1 support.



Roman 07 Sep 2024, 16:25
Dpps is a little slow.
But nice for coding. Sadly, there is no single SSE instruction for the cross product. Intel should have created a cross instruction rather than dpps. My opinion.


revolution 08 Sep 2024, 01:16
Roman wrote:
Dpps is a little slow.

For example it can help with I-cache thrashing because it can make the code smaller. Also, if it gets used more, Intel/AMD might allocate more silicon for it to improve the performance in future CPUs. There isn't any way to know from simply reading a line of code whether it will be "slow" or not. Always test your assumptions.


Roman 08 Sep 2024, 01:22



revolution 08 Sep 2024, 01:37
You can't judge performance by looking at timings for a single instruction. Don't be misled by the numbers, they mean nothing in isolation.
Always test your assumptions. Don't blindly read a number and assume it is valid for everything everywhere. 
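
One way to act on that advice is to time whole loops over realistic data rather than single instructions. The harness below is a minimal sketch in C, and everything in it is an assumption for illustration: the names `sphere_test_fn`, `test_compare`, `test_signed` and `time_variant`, the {r, x, y, z} float layout, and the use of `clock()`. Timings of compiled C say nothing about a hand-written dpps routine; the same loop shape could call assembly routines instead.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Signature shared by the variants under test.
   Each sphere is 4 floats, assumed order {r, x, y, z}. */
typedef int (*sphere_test_fn)(const float *a, const float *b);

/* Variant A: compare squared distance with squared radius sum. */
int test_compare(const float *a, const float *b)
{
    float dx = a[1] - b[1], dy = a[2] - b[2], dz = a[3] - b[3];
    float rs = a[0] + b[0];
    return dx * dx + dy * dy + dz * dz <= rs * rs;
}

/* Variant B: one signed sum, the sign of the result decides. */
int test_signed(const float *a, const float *b)
{
    float dx = a[1] - b[1], dy = a[2] - b[2], dz = a[3] - b[3];
    float rs = a[0] + b[0];
    return rs * rs - dx * dx - dy * dy - dz * dz >= 0.0f;
}

/* Runs fn over every adjacent pair of the n spheres, `repeats` times.
   Stores the number of hits in *hits (so the work cannot be dropped and
   the variants can be checked against each other) and returns the
   elapsed CPU time in seconds. */
double time_variant(sphere_test_fn fn, const float *spheres, size_t n,
                    int repeats, long *hits)
{
    long count = 0;
    clock_t start, end;

    assert(fn != NULL && spheres != NULL && hits != NULL);

    start = clock();
    for (int r = 0; r < repeats; ++r) {
        for (size_t i = 0; i + 1 < n; ++i) {
            count += fn(spheres + 4 * i, spheres + 4 * (i + 1));
        }
    }
    end = clock();

    *hits = count;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Comparing the hit counts first confirms the variants agree; only then are the timings worth looking at, and only for the data set and machine they were measured on.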


uu 08 Sep 2024, 08:27
I admire those who can write code in SSE and AVX.
I only used two SSE instructions before in my programs.
Code:
xorps xmm1,xmm1
movups [rsi],xmm1

That was because I could move 16-byte data with a single operation.



Copyright © 1999-2024, Tomasz Grysztar.