flat assembler
Message board for the users of flat assembler.
keantoken

Joined: 19 Mar 2008
Posts: 69

# Circle gone wrong

Hello all.

Inspired by Randall's example I've started attempting to code my own rendering program. I intend to use real perspective. The sight rays are calculated based on user input for the size of screen and distance from it, so it should appear that the screen is simply a window into the 3D environment.

I succeeded in getting the image generation code working. It took quite a long time, since assembly is my first programming language and this is my biggest project yet.

I was able to render some colorful 2D images, and made some fantastic mistakes, but now I am ready to move on to the rendering section. Right now I am writing a test function to plot an anti-aliased 2D circle on the screen to get a grasp on how to use the SSE instructions.

I can't figure out what is going wrong. I'm using the Pythagorean theorem to find the distance from each pixel to the circle center. If this is within a pixel of the circle radius, the pixel brightness is set according to how far within the pixel the point is.

But it won't work right. I don't have a clue what could be wrong and I've tried everything I know of.

Code:

```asm
;-------------
; POVPixel      Determines color at the end of input ray
; Input:        xmm0.xyz?       Ray starts at origin 0,0,0
; Output:       xmm1.rgb?       What's transparency useful for?
;-------------
align 64
        POVPixel:
        push    rcx
        ; Test: plot simple anti-aliased 2D circle
        movaps  xmm3,xmm0
        andps   xmm0,dqword [andXY__ps]
        movaps  xmm1,dqword [testcir.center]
        subps   xmm1,xmm0                       ; circle coord - pixel coord
        mulps   xmm1,xmm1                       ; difference squared
        phaddd  xmm1,xmm1                       ; squares added
        sqrtss  xmm1,xmm1                       ; square root of sum            sqrt(a^2+b^2)
        subss   xmm1,[testcir.radius]           ; This is not doing ANYTHING, but it should
        pabsd   xmm1,xmm1
        comiss  xmm1,[Lpix]
        jb      @f
        inc     rdx
        divss   xmm1,[Lpix]                     ; Here on appears to work right
        mulss   xmm1,[bytesizeps]
        movaps  dqword [dqbufx],xmm1
        vbroadcastss    xmm1,[dqbufx]
        andps   xmm1,dqword [andRGB_ps]
        jmp     .ret
        @@:
        xorps   xmm1,xmm1
        .ret:
        movaps  xmm0,xmm3
        pop     rcx
        ret
```

Code:

```asm
struc   circle  cx,cy,cz,r
        {
        .center dd      cx
                dd      cy
                dd      cz
                dd      0
        .radius dd      r
                rd      3
        }
align 64
testcir circle  -64.0,-64.0,0.0,64.0
```

For some reason the subss instruction just isn't doing anything. I've tried loading the radius in multiple ways, all with the same result, so it can't be a mistake with data definitions.


20 Mar 2013, 11:06
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 15324
Location: Bigweld Industries
If you could post a full piece of test code that can actually be compiled and tested, without us having to add bits around it to make it work, then you might get some more help. The more work you make us do for you, the less chance there is that you will get the help you want. It is just a fact of life: people are generally lazy, so we have to make it easier for them to help us.

Am I right in assuming that you are trying to compute the distance from a point to the centre of a circle? I'm not sure how this gives you antialiasing. I would expect you would get a bright spot in the centre and the image gets progressively dimmer towards the outside further away from the centre.
21 Mar 2013, 04:55
keantoken

Thanks for the response! I would post the full code, but I didn't expect anyone to try and run it, so I just pasted the relevant parts. I thought I must be making some simple mistake that would be easy to spot. Tell me how to make it more convenient, and I will do.

You are right, except that I subtract the radius from the distance and return if the difference is larger than a pixel. If not, I divide the difference by the length of a pixel, which gives me a fraction by which I multiply the pixel brightness. Thus, if the circle edge lies on the boundary between two pixels, it will have a distance of .5 pixels from each pixel, so each pixel will have half the brightness. This should be minimally computationally intensive, unless the div instruction takes too long, and I wonder if using reciprocal and multiply could be faster. I have a TSC routine I could use to time it. I think the best looking AA scheme is probably not this simple though, and may depend on color calibration. Eventually I'd like to try subpixel AA.

Here is a zip file containing all the necessary files. I borrow from the Linux 64 examples to write debug info in the terminal and began with the basic scheme of Randall's code.

To run the compiled program, cd to its directory and use ./kean6. You may need to give it executable permissions.

21 Mar 2013, 05:57
keantoken

Here is what the terminal looks like if it terminates successfully. W is rdx, which counts the number of times a pixel hits the circle, so I know whether the conditional logic is working. X, Y and Z are the pixel coordinates, but they are not converted to integer, so the printed values are garbage and serve only as an indication.

Code:

```
$ ./kean6
Started
One fourth finished at:  X: 3263102976 Y: 3263102976 Z: 0 W: 1108
One fourth finished at:  X: 1045168128 Y: 0 Z: 0 W: 1389
One fourth finished at:  X: 1115704320 Y: 1115619328 Z: 0 W: 1389
Render successful
Created image successfully
Termination successful
```
21 Mar 2013, 06:02
revolution
 keantoken wrote: I would post the full code, but I didn't expect anyone to try and run it, so I just pasted the relevant parts. I thought I must be making some simple mistake that would be easy to spot. Tell me how to make it more convenient, and I will do.
Well it goes a little further than that. Often it happens here that someone posts some piece of code they think is causing a problem but later find it is some other part completely that is the real cause. And from our point of view we would not know what you have done in the rest of the code so we can't know where the problem might be. Anyhow, I see you have posted a file for someone to test with, so now we can feel better about taking time to look at the code to find the problem.

I personally don't run Linux, so I won't be able to test it, but there are others here who might be able to help now.
21 Mar 2013, 07:54
keantoken

I've tried running it in several debuggers. They can't simply display the machine code for some reason and let you step through the execution. They require some "symbol" file and won't do anything unless you provide it, even though they are perfectly capable of disassembly. I think this is moronic.
21 Mar 2013, 09:38
keantoken

On a related topic, based on the timing charts here, it seems that an rcpps followed by mulps could be much faster than a divps:

http://instlatx64.atw.hu/

For my processor at least, which is an AMD FX8350.

I think this is probably because rcpps is approximate, in the words of the FASM manual. But HOW approximate, I wonder? Approximate enough to cause noticeable aliasing on, for instance, color gradients?
21 Mar 2013, 11:19
revolution
I would suggest that you get it working first before you start to worry about how fast it is or isn't. There is not much point in having a lightning fast function that doesn't work as needed.

As for rcpps, the Intel manual explains it all. Max error is ~ 1.5 * 2^-12 (i.e. you get about 12 good bits).
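
If roughly 12 bits ever turns out not to be enough, one commonly used option is a single Newton-Raphson step, x1 = x0 * (2 - d * x0), which roughly doubles the number of correct bits. The following is only a sketch under assumed register usage: the label two_ss and the choice of xmm0 to xmm3 are hypothetical and not taken from the posted code.

```asm
; Assumed on entry: xmm0 = divisor d, xmm2 = dividend n (low singles).
; On exit: xmm2 ~ n/d. Clobbers xmm0, xmm1, xmm3.
        rcpss   xmm1,xmm0               ; x0 ~ 1/d, about 12 good bits
        mulss   xmm0,xmm1               ; d * x0
        movss   xmm3,[two_ss]           ; 2.0
        subss   xmm3,xmm0               ; 2 - d * x0
        mulss   xmm1,xmm3               ; x1 = x0 * (2 - d * x0)
        mulss   xmm2,xmm1               ; n * (1/d)

; Hypothetical constant, to be placed with the other data:
two_ss  dd      2.0
```

The extra multiply and subtract eat into the speed advantage over divss, so whether this is worthwhile would have to be measured on the target CPU.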
21 Mar 2013, 11:31
keantoken

Well, here is the riddle. If I take out phaddd, the result is either no output or something glitchy with no clear center. With phaddd, it's not a circle but it is centered on the specified coordinates for the circle and moves accordingly. BUT, if I take out sqrtss, nothing changes. Absolutely nothing. That's preposterous. Tell me how that's possible according to the architecture specifications.

After comiss I was using an unsigned conditional instead of signed; when corrected to jg instead of jb, the star is only an outline (attached).


21 Mar 2013, 15:52
r22

Joined: 27 Dec 2004
Posts: 805
Looks like your problem is you're shifting between INTEGER and SINGLE FP instructions.
Code:

```asm
        mulps   xmm1,xmm1                       ; difference squared
        phaddd  xmm1,xmm1                       ; squares added
        sqrtss  xmm1,xmm1                       ; square root of sum            sqrt(a^2+b^2)
        subss   xmm1,[testcir.radius]           ; This is not doing ANYTHING, but it should
        pabsd   xmm1,xmm1
```

MULPS - Multiply SINGLE Precision
SUBSS - Subtract Scalar SINGLE
PABSD - Packed Absolute[ Value] DWORD
I believe you want to just mask off the sign bit ANDPS with a mask of 7fffffff
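
A minimal sketch of what the distance computation might look like with floating-point instructions throughout, assuming the CPU and assembler support the SSE3 instruction haddps. The label absmask_ps is hypothetical; testcir.center and testcir.radius come from the listing in the first post, and xmm0 is assumed to hold the pixel coordinates with the upper two lanes already zeroed.

```asm
        movaps  xmm1,dqword [testcir.center]
        subps   xmm1,xmm0                       ; circle coord - pixel coord
        mulps   xmm1,xmm1                       ; difference squared
        haddps  xmm1,xmm1                       ; float add: low lane = dx^2 + dy^2
        sqrtss  xmm1,xmm1                       ; distance to the circle centre
        subss   xmm1,[testcir.radius]           ; signed distance to the circle edge
        andps   xmm1,dqword [absmask_ps]        ; clear sign bit = absolute value

; Hypothetical constant, to be placed with the other data:
align 16
absmask_ps      dd      0x7FFFFFFF,0x7FFFFFFF,0x7FFFFFFF,0x7FFFFFFF
```

Only the two integer instructions (phaddd and pabsd) are swapped for float equivalents; the rest of the sequence is unchanged.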
21 Mar 2013, 15:53
keantoken

mulps is a multiply on packed single-precision values. mulss is for scalars.

There is no such thing as haddps, at least to FASM. I guess that's a problem?

You'll notice all instructions after phaddd are scalar. It doesn't matter what's in the 3 upper packed dwords, because phaddd gives us the X^2+Y^2 sum in the first packed value. We can then proceed to operate on that with scalar instructions (but it would work just as well with packed instructions).

From the FASM manual:

 Quote: Each of the SSE arithmetic operations has two variants. When the mnemonic ends with ps, the source operand can be a 128-bit memory location or a SSE register, the destination operand must be a SSE register and the operation is performed on packed four single precision values, for each pair of the corresponding data elements separately, the result is stored in the destination register. When the mnemonic ends with ss, the source operand can be a 32-bit memory location or a SSE register, the destination operand must be a SSE register and the operation is performed on single precision values, only low double words of SSE registers are used in this case, the result is stored in the low double word of destination register. addps and addss add the values, subps and subss subtract the source value from destination value, mulps and mulss multiply the values, divps and divss divide the destination value by the source value, rcpps and rcpss compute the approximate reciprocal of the source value, sqrtps and sqrtss compute the square root of the source value, rsqrtps and rsqrtss compute the approximate reciprocal of square root of the source value, maxps and maxss compare the source and destination values and return the greater one, minps and minss compare the source and destination values and return the lesser one.

I'm looking through the Intel instruction reference and I can't tell what the difference is between phaddd and haddps:

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
21 Mar 2013, 17:13
keantoken

Well, it is not straightforward to me AT ALL, but it SEEMS that when you call something a "dword" it means it's an integer, and when you call it a "single precision" it's a float. I did not make the connection, and nowhere did I find this very critical information made explicit.
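
To make the distinction concrete, here is a small sketch (the label ones_ps is hypothetical) in which the same 32-bit patterns are added once as floats and once as integers. The registers themselves carry no type; the instruction decides how the bits are interpreted.

```asm
        movaps  xmm0,dqword [ones_ps]   ; every lane = 0x3F800000, i.e. 1.0
        movaps  xmm1,xmm0
        addps   xmm0,xmm0               ; float add: every lane = 2.0 (0x40000000)
        paddd   xmm1,xmm1               ; integer add: every lane = 0x7F000000,
                                        ; which read back as a single is 2^127

; Hypothetical constant, to be placed with the other data:
align 16
ones_ps dd      1.0,1.0,1.0,1.0
```

The same applies to phaddd, which adds the bit patterns of the squared differences as integers rather than adding their float values.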

Now it jumps out at me that FASM not supporting haddps is a very serious omission.
21 Mar 2013, 17:18
HaHaAnonymous

Joined: 02 Dec 2012
Posts: 1171
Location: Unknown
Stupid post removed.

Last edited by HaHaAnonymous on 28 Feb 2015, 21:22; edited 1 time in total
21 Mar 2013, 17:18
keantoken

I swear, I searched for it in the manual and everything and I didn't find it!

I loosened the hit conditions and got this image. The circular aliasing patterns would seem to indicate it's working, but now I'm having trouble getting the conditional jump to work.


21 Mar 2013, 18:34
keantoken

I finally got it working!!! Thanks for your help.

Apparently the computed pixel length needs to be 1.5x (1.414x?) the actual pixel length to get really smooth AA, so I added a coefficient. Unfortunately the rcpss->mulss approximation causes worse aliasing, and I wonder if there is a fix for this.

The program treats pixels as circular, but since pixels are longer diagonally, non-orthogonal circle edges will be thicker. I think there is a simple way to dynamically correct the coefficient, without needing interpolation, while adding only 5 or so clock cycles.

From there on I think I'll try subpixel AA, and also rotating subpixel AA for CRT monitors.

So how does it look? It looks pretty smooth on my CRT monitor.


22 Mar 2013, 00:32
keantoken

Nope, I was wrong, the rcp->mul method gives perfectly matching results with the div instruction in this case. My mistake. 12 bits of accuracy is good enough here.

EDIT: okay, there are 10 pixels that are a value of 1 instead of 0. Not noticeable though.

AND it turns out the program does adjust for pixel irregularities as an aspect of the algorithm.
22 Mar 2013, 01:20
HaHaAnonymous

Stupid post removed.

Last edited by HaHaAnonymous on 28 Feb 2015, 21:21; edited 1 time in total
22 Mar 2013, 03:38
keantoken

The comiss instruction is confusing. It uses unsigned conditionals instead of signed - so when I changed ja to jg, I broke the program. So it does an unsigned comparison of signed single-precision floats. WHICH ONE IS IT!? All I know is that it only worked correctly when I used the unsigned conditional. But the values I am comparing are signed float! It seems assembly programming is going to make me a more detail-oriented person.
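
For reference, comiss compares its operands as ordinary signed floats, but it reports the result only in ZF, PF and CF (OF, SF and AF are cleared). Those are the flags the "unsigned" conditions test, which is why ja/jb work, while jg (which looks at SF and OF) ends up taken for both "greater" and "less". A sketch of the branch pattern, with hypothetical local labels and Lpix taken from the original listing:

```asm
        comiss  xmm1,[Lpix]     ; compare low singles: xmm1 against Lpix
        jp      .unordered      ; PF=1: at least one operand is NaN
        ja      .greater        ; CF=0 and ZF=0: xmm1 >  Lpix
        jb      .less           ; CF=1:          xmm1 <  Lpix
        ; falls through here when ZF=1:          xmm1 == Lpix
```

The jp test goes first because an unordered result also sets CF and ZF, so jb alone would treat NaN as "less".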

If I'm reading the chart for my FX-8350 correctly, using sqrtps instead of sqrtss even when you're only using scalars will be 13 times faster (!!). This seems important. Am I right about this?

http://instlatx64.atw.hu/

I just coded in a TSC timer and found no difference between sqrtss xmm1,xmm1 and sqrtps xmm1,xmm1. BUT, when I did sqrtss xmm2,xmm1, speed dropped by 1/3! If I use sqrtps xmm2,xmm1 instead, speed just barely catches up to where it was before. Funny thing is I did this to decrease the total clock count, but I think speed is actually slightly worse even with sqrtps. SO, the instruction timing charts don't necessarily help. Hmm.

However I did ascertain that the rcp->mul sequence is 12% faster than divss.
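
For anyone wanting to repeat this kind of measurement, a TSC timer can be sketched roughly as follows. This is not the routine used in the program; the use of r8 is an arbitrary choice, and how strictly lfence orders rdtsc against surrounding instructions depends on the CPU, so treat the result as approximate and average over many runs.

```asm
        lfence                  ; let earlier instructions finish first
        rdtsc                   ; edx:eax = time-stamp counter
        shl     rdx,32
        or      rax,rdx         ; rax = full 64-bit start count
        mov     r8,rax

        ; ... code under test goes here ...

        lfence
        rdtsc
        shl     rdx,32
        or      rax,rdx
        sub     rax,r8          ; rax = elapsed TSC ticks
```

Note that this clobbers rax, rdx and r8, so any counter kept in rdx would need to be saved around it.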
22 Mar 2013, 05:30
revolution
Also remember that all the timings will change for each different CPU and Mobo/RAM combination that it is used on. There is no one-size-fits-all for speed comparisons.
22 Mar 2013, 14:18
keantoken

I could make the program draw up to 4 circles at once, and this would be more efficient than running multiple instances of the program. However I don't think this program would really be useful to anyone, since graphics is usually done on the GPU these days, and all I'm doing is rendering to an image file. If I were to take performance seriously I would use multiple CPU cores. On 1 CPU core on my FX8350, the program makes theoretically 94 FPS (depending on the total area of the circle drawn), which would be multiplied by the number of cores it's run on. One major bottleneck is that several instructions run for every pixel, and maybe I could fix this by skipping pixels using inscribed and circumscribed octagons without taking up more CPU time than necessary. For now, I think I'll just use it for experiments and fun.

I've noticed instructions which have a memory address as the destination operand tend to take twice as long, so I use registers whenever possible.
23 Mar 2013, 02:37