flat assembler
Message board for the users of flat assembler.
Linux > Circle gone wrong
revolution 21 Mar 2013, 04:55
If you could post a full piece of test code that can actually be compiled and tested without us having to add bits around it to make it work, then you might get some more help. The more work you make us do for you, the less chance you have of getting the help you want. It is just a fact of life: people are generally lazy, so we have to make it easier for them to help us.
Am I right in assuming that you are trying to compute the distance from a point to the centre of a circle? I'm not sure how this gives you antialiasing. I would expect you to get a bright spot in the centre, with the image getting progressively dimmer further away from the centre.
keantoken 21 Mar 2013, 05:57
Thanks for the response! I would post the full code, but I didn't expect anyone to try and run it, so I just pasted the relevant parts. I thought I must be making some simple mistake that would be easy to spot. Tell me how to make it more convenient, and I will.
You are right, except that I subtract the radius from the distance and return if the difference is larger than a pixel. If not, I divide the difference by the length of a pixel, which gives me a fraction that I multiply the pixel brightness by. Thus, if the circle edge lies on the boundary between two pixels, it will be .5 pixels from each pixel, so each pixel will have half the brightness. This should be minimally computationally intensive, unless the div instruction takes too long, and I wonder if using reciprocal and multiply could be faster. I have a TSC routine I could use to time it. I think the best looking AA scheme is probably not this simple though, and may depend on color calibration. Eventually I'd like to try subpixel AA.
Here is a zip file containing all the necessary files. I borrow from the Linux 64 examples to write debug info in the terminal, and began with the basic scheme of Randall's code. To run the compiled program, cd to its directory and use ./kean6. You may need to give it executable permissions.
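The scheme described above can be sketched in a few lines of Python, just to make the arithmetic concrete. This is not the code from the zip file: the function name `edge_brightness` and its parameters are made up for illustration, and the linear falloff (full brightness exactly on the edge, zero one pixel away) is an assumption, since the post does not say which way the fraction is applied.

```python
import math

def edge_brightness(px, py, cx, cy, radius, pixel_len=1.0):
    """Brightness in [0, 1] for a pixel centred at (px, py) (hypothetical helper)."""
    # Distance from the pixel centre to the circle centre.
    dist = math.hypot(px - cx, py - cy)
    # How far the pixel centre is from the circle's edge.
    diff = abs(dist - radius)
    # More than one pixel away from the edge: the pixel is not lit.
    if diff > pixel_len:
        return 0.0
    # Assumption: brightness falls off linearly with distance from the edge.
    return 1.0 - diff / pixel_len

if __name__ == "__main__":
    # Edge at x = 10.5 lies halfway between the pixels at x = 10 and x = 11.
    for x in range(8, 14):
        print(x, edge_brightness(x, 0, 0, 0, 10.5))
```

With a radius of 10.5 the edge sits exactly between pixels 10 and 11, and both come out at half brightness, which is the case described in the post.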
keantoken 21 Mar 2013, 06:02
Here is what the terminal looks like if it terminates successfully. W is rdx, which counts the number of times a pixel hits the circle length, so I know if the conditional logic is working. X, Y and Z are the pixel coordinates, but they are not converted to integer, so they are just garbage and are only an indication.
Code:
$ ./kean6
Started
One fourth finished at:
X: 3263102976 Y: 3263102976 Z: 0 W: 1108
One fourth finished at:
X: 1045168128 Y: 0 Z: 0 W: 1389
One fourth finished at:
X: 1115704320 Y: 1115619328 Z: 0 W: 1389
Render successful
Created image successfully
Termination successful
revolution 21 Mar 2013, 07:54
keantoken wrote: I would post the full code, but I didn't expect anyone to try and run it, so I just pasted the relevant parts. I thought I must be making some simple mistake that would be easy to spot. Tell me how to make it more convenient, and I will do.
I personally don't run Linux, so I won't be able to test it, but there are others here who might be able to help now.
keantoken 21 Mar 2013, 09:38
I've tried running it in several debuggers. They can't simply display the machine code for some reason and let you step through the execution. They require some "symbol" file and won't do anything unless you provide it, even though they are perfectly capable of disassembly. I think this is moronic.
keantoken 21 Mar 2013, 11:19
On a related topic, it seems, based on the timing charts here, that an rcpps followed by mulps could be much faster than a divps:
http://instlatx64.atw.hu/
For my processor at least, which is an AMD FX8350. I think this is probably because rcpps is approximate, in the words of the FASM manual. But HOW approximate, I wonder? Approximate enough to cause noticeable aliasing on, for instance, color gradients?
revolution 21 Mar 2013, 11:31
I would suggest that you get it working first before you start to worry about how fast it is or isn't. There is not much point in having a lightning fast function that doesn't work as needed.
As for rcpps, the Intel manual explains it all. Max relative error is ~ 1.5 * 2^-12 (i.e. you get about 12 good bits).
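To put that bound in terms of pixel brightness, here is a rough back-of-the-envelope sketch in Python. It does not emulate rcpps itself; it only plugs the quoted worst-case relative error into an 8-bit brightness range. The helper names are invented for illustration, and the Newton-Raphson step is a standard refinement trick added here as an aside, not something used in the thread.

```python
# Worst-case relative error of the approximate reciprocal, as quoted above.
RCPPS_MAX_REL_ERROR = 1.5 * 2.0 ** -12

def worst_case_levels(max_level=255):
    """Largest error, in brightness levels, caused by one approximate reciprocal."""
    return max_level * RCPPS_MAX_REL_ERROR

def newton_refine(x, approx):
    """One Newton-Raphson step for 1/x; roughly squares the relative error."""
    return approx * (2.0 - x * approx)

if __name__ == "__main__":
    print("relative error bound:", RCPPS_MAX_REL_ERROR)
    print("worst case on a 0..255 scale:", worst_case_levels())
    x = 3.0
    approx = (1.0 / x) * (1.0 + RCPPS_MAX_REL_ERROR)  # a worst-case estimate
    print("error before refinement:", abs(approx * x - 1.0))
    print("error after refinement: ", abs(newton_refine(x, approx) * x - 1.0))
```

On a 0..255 scale the worst case is under a tenth of one brightness level, so at most a value sitting right on a rounding boundary could tip over by one.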
keantoken 21 Mar 2013, 15:52
Well, here is the riddle. If I take out phaddd, the result is either no output or something glitchy with no clear center. With phaddd, it's not a circle but it is centered on the specified coordinates for the circle and moves accordingly. BUT, if I take out sqrtss, nothing changes. Absolutely nothing. That's preposterous. Tell me how that's possible according to the architecture specifications.
After comiss I was using an unsigned conditional instead of signed; when corrected to jg instead of jb, the star is only an outline (attached).
r22 21 Mar 2013, 15:53
Looks like your problem is you're shifting between INTEGER and SINGLE FP instructions.
Code:
mulps xmm1,xmm1 ; difference squared
phaddd xmm1,xmm1 ; squares added
sqrtss xmm1,xmm1 ; square root of sum sqrt(a^2+b^2)
subss xmm1,[testcir.radius] ; This is not doing ANYTHING, but it should
pabsd xmm1,xmm1

MULPS - Multiply SINGLE Precision
PHADDD - Packed Horizontal Add DWORD
I believe you want HADDPS
SUBSS - Subtract Scalar SINGLE
PABSD - Packed Absolute Value DWORD
I believe you want to just mask off the sign bit: ANDPS with a mask of 7fffffff
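To see why mixing the two instruction families goes wrong, the bit patterns can be played with in Python. This is only an illustration of what an integer add does to single-precision bit patterns compared with a real float add; the helper names are invented and nothing here runs actual SSE instructions.

```python
import struct

def float_to_bits(x):
    """32-bit pattern of x stored as a single-precision float."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b):
    """Reinterpret the low 32 bits of b as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def integer_add_of_floats(a, b):
    """Add the raw bit patterns as dwords, the way an integer add would."""
    return bits_to_float(float_to_bits(a) + float_to_bits(b))

def float_add(a, b):
    """Add the values as floats and round the result to single precision."""
    return bits_to_float(float_to_bits(a + b))

def abs_by_mask(x):
    """Absolute value by clearing the sign bit (mask 7fffffff)."""
    return bits_to_float(float_to_bits(x) & 0x7FFFFFFF)

if __name__ == "__main__":
    a, b = 9.0, 16.0  # e.g. dx^2 and dy^2 for dx = 3, dy = 4
    print(hex(float_to_bits(a)), hex(float_to_bits(b)))
    print("float add:  ", float_add(a, b))
    print("integer add:", integer_add_of_floats(a, b))
    print("masked abs: ", abs_by_mask(-3.5))
```

For 9.0 and 16.0 the float add gives 25.0, while adding the bit patterns as integers produces a tiny negative number, so everything computed after that point is meaningless.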
keantoken 21 Mar 2013, 17:13
mulps is a multiply on packed single-precision values. mulss is for scalars.
There is no such thing as haddps, at least to FASM. I guess that's a problem? You'll notice all instructions after phaddd are scalar. It doesn't matter what's in the 3 upper packed dwords, because phaddd gives us the X^2+Y^2 sum in the first packed value. We can then proceed to operate on that with scalar instructions (but it would work just as well with packed instructions).
From the FASM manual:
Quote: Each of the SSE arithmetic operations has two variants. When the mnemonic ends
I'm looking through the Intel instruction reference and I can't tell what the difference is between phaddd and haddps:
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
keantoken 21 Mar 2013, 17:18
Well, it is not straightforward to me AT ALL, but it SEEMS that when you call something a "dword" it means it's an integer, and when you call it "single precision" it's a float. I did not make the connection, and nowhere did I find this very critical information made explicit.
Now it jumps out at me that FASM not supporting haddps is a very serious omission.
HaHaAnonymous 21 Mar 2013, 17:18
[ Post removed by author. ]
Last edited by HaHaAnonymous on 28 Feb 2015, 21:22; edited 1 time in total
keantoken 21 Mar 2013, 18:34
I swear, I searched for it in the manual and everything and I didn't find it!
I loosened the hit conditions and got this image. The circular aliasing patterns would seem to indicate it's working, but now I'm having trouble getting the conditional jump to work.
keantoken 22 Mar 2013, 00:32
I finally got it working!!! Thanks for your help.
Apparently the computed pixel length needs to be 1.5x (1.414x?) the actual pixel length to get really smooth AA, so I added a coefficient. Unfortunately the rcpss->mulss approximation causes worse aliasing, and I wonder if there is a fix for this.
The program treats pixels as circular, but since pixels are longer diagonally, non-orthogonal circle edges will be thicker. I think there is a simple way to dynamically correct the coefficient, without needing interpolation, while adding only 5 or so clock cycles. From there on I think I'll try subpixel AA, and also rotating subpixel AA for CRT monitors.
So how does it look? It looks pretty smooth on my CRT monitor.
keantoken 22 Mar 2013, 01:20
Nope, I was wrong, the rcp->mul method gives perfectly matching results with the div instruction in this case. My mistake. 12 bits of accuracy is good enough here.
EDIT: okay, there are 10 pixels that have a value of 1 instead of 0. Not noticeable though. AND it turns out the program does adjust for pixel irregularities as an aspect of the algorithm.
HaHaAnonymous 22 Mar 2013, 03:38
[ Post removed by author. ]
Last edited by HaHaAnonymous on 28 Feb 2015, 21:21; edited 1 time in total
keantoken 22 Mar 2013, 05:30
The comiss instruction is confusing. It uses unsigned conditionals instead of signed, so when I changed ja to jg, I broke the program. So it does an unsigned comparison of signed single-precision floats. WHICH ONE IS IT!? All I know is that it only worked correctly when I used the unsigned conditional. But the values I am comparing are signed floats! It seems assembly programming is going to make me a more detail-oriented person.
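For what it's worth, the comparison itself is a proper signed float comparison; it is only the way the result is reported that matches unsigned integer compares. Per the Intel manual, comiss writes its result into ZF, PF and CF and clears SF and OF, and ja/jb test CF and ZF while jg/jl test SF and OF. A small Python model of just the flag logic (function names chosen to mirror the mnemonics, nothing more) shows the consequence:

```python
import math

def comiss(a, b):
    """Flags left behind by `comiss a, b` (model of the documented behaviour)."""
    if math.isnan(a) or math.isnan(b):
        zf, pf, cf = 1, 1, 1  # unordered
    elif a > b:
        zf, pf, cf = 0, 0, 0
    elif a < b:
        zf, pf, cf = 0, 0, 1
    else:
        zf, pf, cf = 1, 0, 0
    # SF and OF are always cleared by comiss.
    return {"ZF": zf, "PF": pf, "CF": cf, "SF": 0, "OF": 0}

def ja(f):  # jump if above: CF = 0 and ZF = 0
    return f["CF"] == 0 and f["ZF"] == 0

def jb(f):  # jump if below: CF = 1
    return f["CF"] == 1

def jg(f):  # jump if greater: ZF = 0 and SF = OF
    return f["ZF"] == 0 and f["SF"] == f["OF"]

def jl(f):  # jump if less: SF != OF
    return f["SF"] != f["OF"]

if __name__ == "__main__":
    for a, b in [(2.0, 1.0), (-2.0, 1.0), (1.0, 1.0)]:
        f = comiss(a, b)
        print(a, b, "ja:", ja(f), "jb:", jb(f), "jg:", jg(f), "jl:", jl(f))
```

Because SF and OF are always zero afterwards, jg fires whenever the operands differ (and are ordered) and jl never fires, which would explain why switching from ja to jg broke the program even though the operands are signed floats.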
If I'm reading the chart for my FX-8350 correctly, using sqrtps instead of sqrtss, even when you're only using scalars, will be 13 times faster (!!). This seems important. Am I right about this?
http://instlatx64.atw.hu/
I just coded in a TSC timer and found no difference between sqrtss xmm1,xmm1 and sqrtps xmm1,xmm1. BUT, when I did sqrtss xmm2,xmm1, speed dropped by 1/3! If I use sqrtps xmm2,xmm1 instead, speed just barely catches up to where it was before. Funny thing is I did this to decrease the total clock count, but I think speed is actually slightly worse even with sqrtps. SO, the instruction timing charts don't necessarily help. Hmm. However, I did ascertain that the rcp->mul sequence is 12% faster than divss.
revolution 22 Mar 2013, 14:18
Also remember that all the timings will change for each different CPU and Mobo/RAM combination that it is used on. There is no one-size-fits-all for speed comparisons.
keantoken 23 Mar 2013, 02:37
I could make the program draw up to 4 circles at once, and this would be more efficient than running multiple instances of the program. However, I don't think this program would really be useful to anyone, since graphics is usually done on the GPU these days, and all I'm doing is rendering to an image file. If I were to take performance seriously I would use multiple CPU cores. On 1 CPU core of my FX8350, the program theoretically manages 94 FPS (depending on the total area of the circle drawn), which would be multiplied by the number of cores it's run on. One major bottleneck is that several instructions run for every pixel, and maybe I could fix this by skipping pixels using inscribed and circumscribed octagons without taking up more CPU time than necessary. For now, I think I'll just use it for experiments and fun.
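One possible reading of the octagon idea, sketched in Python: a pixel outside an octagon drawn around the circle (grown by one pixel) cannot touch the edge, and neither can a pixel inside an octagon drawn inside the circle (shrunk by one pixel), and both tests need only absolute values, adds and compares. All names here are hypothetical and the one-pixel margin is an assumption about how wide the antialiased band is.

```python
import math

SQRT2 = math.sqrt(2.0)

def outside_circumscribed_octagon(dx, dy, r):
    """True if (dx, dy) is outside the regular octagon drawn around a circle of radius r."""
    ax, ay = abs(dx), abs(dy)
    return ax > r or ay > r or ax + ay > r * SQRT2

def inside_inscribed_octagon(dx, dy, r):
    """True if (dx, dy) is inside the regular octagon whose corners lie on a circle of radius r."""
    a = r * math.cos(math.pi / 8)  # distance from the centre to each side
    ax, ay = abs(dx), abs(dy)
    return ax <= a and ay <= a and ax + ay <= a * SQRT2

def needs_sqrt(dx, dy, radius, pixel_len=1.0):
    """False when the cheap octagon tests already prove the pixel is not on the edge."""
    if outside_circumscribed_octagon(dx, dy, radius + pixel_len):
        return False
    if radius > pixel_len and inside_inscribed_octagon(dx, dy, radius - pixel_len):
        return False
    return True

if __name__ == "__main__":
    radius = 20.0
    total = skipped = 0
    for dy in range(-30, 31):
        for dx in range(-30, 31):
            total += 1
            if not needs_sqrt(dx, dy, radius):
                skipped += 1
    print("pixels skipped without a sqrt:", skipped, "of", total)
```

Whether this actually saves time would have to be measured, since the extra compares and branches cost cycles of their own.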
I've noticed that instructions which have a memory address as the destination operand tend to take twice as long, so I use registers whenever possible.
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.