flat assembler
Message board for the users of flat assembler.

Circle gone wrong

keantoken



Joined: 19 Mar 2008
Posts: 69
keantoken 20 Mar 2013, 11:06
Hello all.

Inspired by Randall's example I've started attempting to code my own rendering program. I intend to use real perspective. The sight rays are calculated based on user input for the size of screen and distance from it, so it should appear that the screen is simply a window into the 3D environment.

I succeeded in getting the image generation code working. It took quite a long time, since assembly is my first programming language and this is my biggest project yet.

I was able to render some colorful 2D images, and made some fantastic mistakes, but now I am ready to move on to the rendering section. Right now I am writing a test function to plot an anti-aliased 2D circle on the screen to get a grasp on how to use the SSE instructions.

I can't figure out what is going wrong. I'm using the Pythagorean theorem to find the distance to the circle center. If this is within a pixel of the circle radius, the pixel brightness is set according to how far within the pixel the point is.

But it won't work right. I don't have a clue what could be wrong and I've tried everything I know of.

Code:
;-------------
; POVPixel      Determines color at the end of input ray
; Input:        xmm0.xyz?       Ray starts at origin 0,0,0
; Output:       xmm1.rgb?       What's transparency useful for?
;-------------
align 64

        POVPixel:
        push    rcx
        ; Test: plot simple anti-aliased 2D circle
        movaps  xmm3,xmm0
        andps   xmm0,dqword [andXY__ps]
        movaps  xmm1,dqword [testcir.center]
        subps   xmm1,xmm0                       ; circle coord - pixel coord
        mulps   xmm1,xmm1                       ; difference squared
        phaddd  xmm1,xmm1                       ; squares added
        sqrtss  xmm1,xmm1                       ; square root of sum            sqrt(a^2+b^2)
        subss   xmm1,[testcir.radius]           ; This is not doing ANYTHING, but it should
        pabsd   xmm1,xmm1                       
        comiss  xmm1,[Lpix]
        jb      @f
        inc     rdx
        divss   xmm1,[Lpix]                     ; Here on appears to work right
        mulss   xmm1,[bytesizeps]
        movaps  dqword [dqbufx],xmm1
        vbroadcastss    xmm1,[dqbufx]
        andps   xmm1,dqword [andRGB_ps]
        jmp     .ret
        @@:
        xorps   xmm1,xmm1
        .ret:
        movaps  xmm0,xmm3
        pop     rcx
        ret                                            

Code:
struc   circle  cx,cy,cz,r

                        {
        .center dd      cx
                dd      cy
                dd      cz
                dd      0
        .radius dd      r
                rd      3
                }
align 64
testcir circle  -64.0,-64.0,0.0,64.0         


For some reason the subss instruction just isn't doing anything. I've tried loading the radius in multiple ways, all with the same result, so it can't be a mistake with data definitions.


Attachment: kean8.png (13.11 KB)


revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20300
Location: In your JS exploiting you and your system
revolution 21 Mar 2013, 04:55
If you could post a full piece of test code that can actually be compiled and tested, without us having to add bits around it to make it work, then you might get some more help. The more work you make us do for you, the less chance you have of getting the help you want. It is just a fact of life: people are generally lazy, so we have to make it easier for them to help us.

Am I right in assuming that you are trying to compute the distance from a point to the centre of a circle? I'm not sure how this gives you antialiasing. I would expect you would get a bright spot in the centre and the image gets progressively dimmer towards the outside further away from the centre.



keantoken 21 Mar 2013, 05:57
Thanks for the response! I would post the full code, but I didn't expect anyone to try and run it, so I just pasted the relevant parts. I thought I must be making some simple mistake that would be easy to spot. Tell me how to make it more convenient, and I will do.

You are right, except I subtract the radius from the distance, return if the difference is larger than a pixel. If not, I divide the difference by the length of a pixel, which gives me a fraction which I multiply the pixel brightness by. Thus, if the circle edge lies on the boundary between two pixels, it will have a distance of .5 pixels from each pixel; thus each pixel will have half the brightness. This should be minimally computationally intensive, unless the div instruction takes too long, and I wonder if using reciprocal and multiply could be faster. I have a TSC routine I could use to time it. I think the best looking AA scheme is probably not this simple though, and may depend on color calibration. Eventually I'd like to try subpixel AA.
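The scheme described above can be sketched as follows, in Python rather than assembly so the arithmetic is easy to check. The function name, the parameter names and the `1 - offset / pixel_len` ramp are assumptions for illustration. The post only says that the offset from the circle edge is divided by the pixel length and used to scale the brightness.

```python
import math

def edge_brightness(px, py, cx, cy, radius, pixel_len, full=255.0):
    """Brightness of one pixel for an anti-aliased circle outline.

    Hypothetical helper: the names and the linear ramp are assumptions,
    not taken from the original program.
    """
    dist = math.hypot(px - cx, py - cy)       # sqrt(dx^2 + dy^2)
    offset = abs(dist - radius)               # distance from the circle edge
    if offset >= pixel_len:                   # a full pixel away or more: no hit
        return 0.0
    return (1.0 - offset / pixel_len) * full  # closer to the edge = brighter
```

With the edge exactly halfway between two pixels, each pixel sees an offset of 0.5 pixel lengths and therefore gets half brightness, which matches the behaviour the post describes.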

Here is a zip file containing all the necessary files. I borrow from the Linux 64 examples to write debug info in the terminal and began with the basic scheme of Randall's code.

To run the compiled program, cd to its directory and use ./kean6. You may need to give it executable permissions.


Attachment: kean_6.zip (11.8 KB)




keantoken 21 Mar 2013, 06:02
Here is what the terminal looks like if it terminates successfully. W is rdx, which counts the number of times a pixel hits the circle length, so I know if the conditional logic is working. X, Y and Z are the pixel coordinates, but they are not converted to integers, so they are just garbage and serve only as an indication.

Code:
$ ./kean6
Started

One fourth finished at: 
X: 3263102976
Y: 3263102976
Z: 0
W: 1108
One fourth finished at: 
X: 1045168128
Y: 0
Z: 0
W: 1389
One fourth finished at: 
X: 1115704320
Y: 1115619328
Z: 0
W: 1389
Render successful
Created image successfully
Termination successful
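The X, Y and Z values in the log above are easier to read once decoded. A small sketch, assuming they are raw single-precision bit patterns printed as unsigned 32-bit integers (the helper name is made up for illustration):

```python
import struct

def dword_as_float(value):
    """Reinterpret an unsigned 32-bit integer as an IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<I", value))[0]

# Values copied from the log above.
for raw in (3263102976, 1115619328, 1045168128):
    print(raw, "->", dword_as_float(raw))
```

Under that assumption, 3263102976 is 0xC27F0000, which decodes to -63.75, and 1115619328 decodes to 63.75. Both are plausible pixel coordinates for a circle centered at -64.0,-64.0 with radius 64.0.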
    


revolution 21 Mar 2013, 07:54
keantoken wrote:
I would post the full code, but I didn't expect anyone to try and run it, so I just pasted the relevant parts. I thought I must be making some simple mistake that would be easy to spot. Tell me how to make it more convenient, and I will do.
Well it goes a little further than that. Often it happens here that someone posts some piece of code they think is causing a problem but later find it is some other part completely that is the real cause. And from our point of view we would not know what you have done in the rest of the code so we can't know where the problem might be. Anyhow, I see you have posted a file for someone to test with, so now we can feel better about taking time to look at the code to find the problem.

I personally don't run Linux, so I won't be able to test it, but there are others here who might be able to help now.



keantoken 21 Mar 2013, 09:38
I've tried running it in several debuggers. They can't simply display the machine code for some reason and let you step through the execution. They require some "symbol" file and won't do anything unless you provide it, even though they are perfectly capable of disassembly. I think this is moronic.



keantoken 21 Mar 2013, 11:19
On a related topic, it seems, based on the timing charts here, that an rcpps followed by a mulps could be much faster than a divps:

http://instlatx64.atw.hu/

For my processor at least, which is an AMD FX8350.

I think this is probably because rcpps is approximate, in the words of the FASM manual. But HOW approximate, I wonder? Approximate enough to cause noticeable aliasing on, for instance, color gradients?


revolution 21 Mar 2013, 11:31
I would suggest that you get it working first before you start to worry about how fast it is or isn't. There is not much point in having a lightning fast function that doesn't work as needed.

As for rcpps, the Intel manual explains it all. Max error is ~ 1.5 * 2^-12 (i.e. you get about 12 good bits).
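To put that bound in terms of pixel values, here is a rough calculation. It assumes the brightness is scaled to a 0..255 range (suggested by the `bytesizeps` constant in the posted code, but not confirmed); the names are illustrative only.

```python
RCP_MAX_REL_ERR = 1.5 * 2.0 ** -12   # bound quoted above for rcpps/rcpss

def worst_case_level_error(full=255.0, rel_err=RCP_MAX_REL_ERR):
    """Largest absolute error, in brightness levels, when a value of at most
    `full` is produced with a reciprocal whose relative error is `rel_err`."""
    return full * rel_err

print(worst_case_level_error())
```

The result stays below a tenth of one brightness level, so after conversion to an 8-bit integer an rcpps-based result should differ from a divps-based one by at most one level, and only where the exact value sits close to an integer boundary.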



keantoken 21 Mar 2013, 15:52
Well, here is the riddle. If I take out phaddd, the result is either no output or something glitchy with no clear center. With phaddd, it's not a circle but it is centered on the specified coordinates for the circle and moves accordingly. BUT, if I take out sqrtss, nothing changes. Absolutely nothing. That's preposterous. Tell me how that's possible according to the architecture specifications.

After comiss I was using an unsigned conditional instead of signed; when corrected to jg instead of jb, the star is only an outline (attached).


Attachment: kean9.png (8.98 KB)


r22



Joined: 27 Dec 2004
Posts: 805
r22 21 Mar 2013, 15:53
Looks like your problem is you're shifting between INTEGER and SINGLE FP instructions.
Code:
        mulps   xmm1,xmm1                       ; difference squared
        phaddd  xmm1,xmm1                       ; squares added
        sqrtss  xmm1,xmm1                       ; square root of sum            sqrt(a^2+b^2)
        subss   xmm1,[testcir.radius]           ; This is not doing ANYTHING, but it should
        pabsd   xmm1,xmm1  
    


MULPS - Multiply SINGLE Precision
PHADDD - Packed Horizontal Add DWORD
I believe you want HADDPS
SUBSS - Subtract Scalar SINGLE
PABSD - Packed Absolute[ Value] DWORD
I believe you want to just mask off the sign bit ANDPS with a mask of 7fffffff
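To see why the integer instructions give nonsense on float data, the bit patterns can be modelled in Python. This is only a sketch of the per-lane arithmetic, not of the full packed instructions, and the helper names are invented for illustration.

```python
import struct

def f2u(x):
    """Bit pattern of a single-precision float as an unsigned 32-bit int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def u2f(u):
    """Single-precision float whose bit pattern is the low 32 bits of u."""
    return struct.unpack("<f", struct.pack("<I", u & 0xFFFFFFFF))[0]

def int_add_lanes(a, b):
    """Integer dword add (the arithmetic PHADDD does) applied to float bits."""
    return u2f(f2u(a) + f2u(b))

def int_abs_lane(x):
    """Integer dword absolute value (the arithmetic PABSD does) on float bits."""
    u = f2u(x)
    signed = u - (1 << 32) if u & 0x80000000 else u
    return u2f(abs(signed))

def mask_abs_lane(x):
    """Float absolute value by clearing the sign bit (ANDPS with 7fffffff)."""
    return u2f(f2u(x) & 0x7FFFFFFF)

print(int_add_lanes(4.0, 9.0))   # not 13.0: the bit patterns were added as integers
print(int_abs_lane(-2.5))        # not 2.5: two's-complement negation of the bits
print(mask_abs_lane(-2.5))       # 2.5
```

Adding the bit patterns of 4.0 and 9.0 as integers gives 0x81900000, a tiny negative float, where the float addition done by HADDPS would give 13.0.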



keantoken 21 Mar 2013, 17:13
mulps is a multiply on packed single-precision values. mulss is for scalars.

There is no such thing as haddps, at least to FASM. I guess that's a problem?

You'll notice all instructions after phaddd are scalar. It doesn't matter what's in the 3 upper packed dwords, because phaddd gives us the X^2+Y^2 sum in the first packed value. We can then proceed to operate on that with scalar instructions (but it would work just as well with packed instructions).

From the FASM manual:

Quote:
Each of the SSE arithmetic operations has two variants. When the mnemonic ends with ps, the source operand can be a 128–bit memory location or a SSE register, the destination operand must be a SSE register and the operation is performed on packed four single precision values, for each pair of the corresponding data elements separately, the result is stored in the destination register. When the mnemonic ends with ss, the source operand can be a 32–bit memory location or a SSE register, the destination operand must be a SSE register and the operation is performed on single precision values, only low double words of SSE registers are used in this case, the result is stored in the low double word of destination register.
addps and addss add the values, subps and subss substract the source value from destination value, mulps and mulss multiply the values, divps and divss divide the destination value by the source value, rcpps and rcpss compute the approximate reciprocal of the source value, sqrtps and sqrtss compute the square root of the source value, rsqrtps and rsqrtss compute the approximate reciprocal of square root of the source value, maxps and maxss compare the source and destination values and return the greater one, minps and minss compare the source and destination values and return the lesser one.


I'm looking through the Intel instruction reference and I can't tell what the difference is between phaddd and haddps:

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf



keantoken 21 Mar 2013, 17:18
Well, it is not straightforward to me AT ALL, but it SEEMS that when you call something a "dword" it means it's an integer, and when you call it a "single precision" it's a float. I did not make the connection, and nowhere did I find this very critical information made explicit.

Now it jumps out at me that FASM not supporting haddps is a very serious omission.
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1178
Location: Unknown
HaHaAnonymous 21 Mar 2013, 17:18
[ Post removed by author. ]


Last edited by HaHaAnonymous on 28 Feb 2015, 21:22; edited 1 time in total



keantoken 21 Mar 2013, 18:34
I swear, I searched for it in the manual and everything and I didn't find it!

I loosened the hit conditions and got this image. The circular aliasing patterns would seem to indicate it's working, but now I'm having trouble getting the conditional jump to work.


Attachment: kean11.jpg (630.88 KB)





keantoken 22 Mar 2013, 00:32
I finally got it working!!! Thanks for your help.

Apparently the computed pixel length needs to be 1.5x (1.414x?) the actual pixel length to get really smooth AA, so I added a coefficient. Unfortunately the rcpss->mulss approximation causes worse aliasing, and I wonder if there is a fix for this.
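One standard way to tighten a reciprocal estimate is a single Newton-Raphson step, x1 = x0 * (2 - a * x0), which roughly squares the relative error. The sketch below models this in Python doubles. `approx_rcp` is a made-up stand-in that applies the worst-case rcpss error bound of 1.5 * 2^-12; it is not how the hardware actually computes the estimate.

```python
RCP_MAX_REL_ERR = 1.5 * 2.0 ** -12   # documented worst-case bound for rcpss

def approx_rcp(a, rel_err=RCP_MAX_REL_ERR):
    """Stand-in for rcpss: exact reciprocal skewed by the worst-case error."""
    return (1.0 / a) * (1.0 + rel_err)

def refine_rcp(a, x0):
    """One Newton-Raphson step toward 1/a."""
    return x0 * (2.0 - a * x0)

a = 3.0
x0 = approx_rcp(a)
x1 = refine_rcp(a, x0)
print(abs(a * x0 - 1.0), abs(a * x1 - 1.0))   # relative error before and after
```

In assembly the step would cost roughly two multiplies and one subtract on top of rcpss, so whether it still beats divss would need to be measured on the target CPU.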

The program treats pixels as circular, but since pixels are longer diagonally, non-orthogonal circle edges will be thicker. I think there is a simple way to dynamically correct the coefficient, without needing interpolation, while adding only 5 or so clock cycles.

From there on I think I'll try subpixel AA, and also rotating subpixel AA for CRT monitors.

So how does it look? It looks pretty smooth on my CRT monitor.


Attachment: kean17.png (33.23 KB)





keantoken 22 Mar 2013, 01:20
Nope, I was wrong, the rcp->mul method gives perfectly matching results with the div instruction in this case. My mistake. 12 bits of accuracy is good enough here.

EDIT: okay, there are 10 pixels that are a value of 1 instead of 0. Not noticeable though.

AND it turns out the program does adjust for pixel irregularities as an aspect of the algorithm.



HaHaAnonymous 22 Mar 2013, 03:38
[ Post removed by author. ]


Last edited by HaHaAnonymous on 28 Feb 2015, 21:21; edited 1 time in total



keantoken 22 Mar 2013, 05:30
The comiss instruction is confusing. It uses unsigned conditionals instead of signed - so when I changed ja to jg, I broke the program. So it does an unsigned comparison of signed single-precision floats. WHICH ONE IS IT!? All I know is that it only worked correctly when I used the unsigned conditional. But the values I am comparing are signed float! It seems assembly programming is going to make me a more detail-oriented person.
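The behaviour makes sense once the flags are written out. comiss reports its result through ZF, PF and CF and clears SF and OF, so only the conditions that read CF and ZF (ja, jae, jb, jbe, je, jne) are meaningful afterwards. The comparison itself still respects the sign of the floats. A small Python model of this (the function names are illustrative):

```python
import math

def comiss_flags(a, b):
    """Flags after 'comiss a, b': result in ZF/PF/CF, SF and OF cleared."""
    if math.isnan(a) or math.isnan(b):
        zf, pf, cf = 1, 1, 1          # unordered
    elif a > b:
        zf, pf, cf = 0, 0, 0
    elif a < b:
        zf, pf, cf = 0, 0, 1
    else:
        zf, pf, cf = 1, 0, 0
    return {"ZF": zf, "PF": pf, "CF": cf, "SF": 0, "OF": 0}

def ja(f):
    return f["CF"] == 0 and f["ZF"] == 0

def jb(f):
    return f["CF"] == 1

def jg(f):
    return f["ZF"] == 0 and f["SF"] == f["OF"]

def jl(f):
    return f["SF"] != f["OF"]

flags = comiss_flags(-3.0, 1.0)        # -3.0 is less than 1.0
print(jb(flags), ja(flags))            # jb taken, ja not taken: correct
print(jg(flags), jl(flags))            # jg taken, jl never taken: misleading
```

Because SF and OF are always zero after comiss, jg degenerates into "not equal" and jl is never taken, which is consistent with the program breaking when ja was changed to jg.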

If I'm reading the chart for my FX-8350 correctly, using sqrtps instead of sqrtss even when you're only using scalars will be 13 times faster (!!). This seems important. Am I right about this?

http://instlatx64.atw.hu/

I just coded in a TSC timer and found no difference between sqrtss xmm1,xmm1 and sqrtps xmm1,xmm1. BUT, when I did sqrtss xmm2,xmm1, speed dropped by 1/3! If I use sqrtps xmm2,xmm1 instead, speed just barely catches up to where it was before. Funny thing is I did this to decrease the total clock count, but I think speed is actually slightly worse even with sqrtps. SO, the instruction timing charts don't necessarily help. Hmm.

However I did ascertain that the rcp->mul sequence is 12% faster than divss.


revolution 22 Mar 2013, 14:18
Also remember that all the timings will change for each different CPU and Mobo/RAM combination that it is used on. There is no one-size-fits-all for speed comparisons.



keantoken 23 Mar 2013, 02:37
I could make the program draw up to 4 circles at once, and this would be more efficient than running multiple instances of the program. However, I don't think this program would really be useful to anyone, since graphics is usually done on the GPU these days, and all I'm doing is rendering to an image file. If I were to take performance seriously I would use multiple CPU cores. On 1 CPU core on my FX8350, the program theoretically makes 94 FPS (depending on the total area of the circle drawn), which would be multiplied by the number of cores it's run on. One major bottleneck is that several instructions run for every pixel, and maybe I could fix this by skipping pixels using inscribed and circumscribed octagons without taking up more CPU time than necessary. For now, I think I'll just use it for experiments and fun.
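One possible reading of the octagon idea, sketched in Python: a pixel can be skipped without a square root if it lies outside an octagon circumscribed around the circle of radius `radius + pixel_len`, or inside an octagon inscribed in the circle of radius `radius - pixel_len`. The function names and the choice of axis-aligned regular octagons are assumptions; the post does not specify the construction.

```python
import math

SQRT1_2 = math.sqrt(0.5)
COS_PI_8 = math.cos(math.pi / 8.0)

def octagon_reach(dx, dy):
    """Apothem of the smallest axis-aligned regular octagon containing (dx, dy)."""
    ax, ay = abs(dx), abs(dy)
    return max(ax, ay, (ax + ay) * SQRT1_2)

def can_skip(px, py, cx, cy, radius, pixel_len):
    """True if the pixel is certainly more than pixel_len from the circle edge."""
    reach = octagon_reach(px - cx, py - cy)
    outer = radius + pixel_len                # circumscribed octagon apothem
    inner = (radius - pixel_len) * COS_PI_8   # inscribed octagon apothem
    return reach > outer or reach < inner
```

Pixels for which `can_skip` is false still need the full distance test. The check itself uses only absolute values, one add, one multiply and comparisons, all of which have direct SSE counterparts.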

I've noticed instructions which have a memory address as the destination operand tend to take twice as long, so I use registers whenever possible.

Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
