flat assembler
Message board for the users of flat assembler.

Index > Linux > Circle gone wrong

Author
keantoken



Joined: 19 Mar 2008
Posts: 69
keantoken 23 Mar 2013, 17:29
The timing charts are still useful though. By using rcpss->rsqrtss, I improved speed by 20%. Since both instructions are accurate to 1.5*2**-12, you get a total accuracy of about 1.5*2**-11. That means that for a 1280-pixel-wide screen, you will have barely less than a pixel of error.
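For reference, the substitution looks roughly like this (register choice is hypothetical; the accuracy figures are the documented maximum relative errors of the two instructions):

Code:
        ; xmm0 = squared distance
        rsqrtss xmm0,xmm0               ; ~ 1/sqrt(x), max relative error 1.5*2**-12
        rcpss   xmm0,xmm0               ; ~ sqrt(x), the two errors add up to roughly 1.5*2**-11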

I just discovered that internal CPU optimization can be very important... Both of these do the same thing equally fast:

Code:
andps    xmm1,dqword [unsignmask]    ; clear the sign bits (absolute value)

Code:
cmpeqps  xmm2,xmm2                   ; sets all bits (assuming xmm2 holds no NaNs)
psrld    xmm2,1                      ; 0x7FFFFFFF in every dword: the sign-clear mask
andps    xmm1,xmm2                   ; same absolute-value effect as the version above


Perhaps on a different CPU the first would be slower? BUT, if I replace xmm2 with xmm7 in the lower version, it's 1/3 slower! Apparently, the CPU bets on the next register used being adjacent to the last register. So, lesson learned: use adjacent registers when possible. Perhaps start with a middle register so that you have two adjacent registers to choose from.
keantoken



Joined: 19 Mar 2008
Posts: 69
keantoken 23 Mar 2013, 22:07
Well, I went from 94FPS to 147FPS. That's a 56% increase. I think I am ready to try polygon culling.
keantoken



Joined: 19 Mar 2008
Posts: 69
keantoken 24 Mar 2013, 05:52
With circumscribed square culling, I managed 160FPS (20 instructions per pixel; 22-49 total instructions in the loop). So I'm up to 70% faster than the first working code version. I believe I'm reaching the point of diminishing returns without major modifications to the code. Unfortunately, I use a few AVX instructions so the program won't run on either of the other computers in my house. I think I should remove the AVX instructions so that I can do comparisons.

What I don't understand is why movaps xmm,[m128] doesn't cause any speed hit over movaps xmm,xmm. I thought memory access was supposed to be a bottleneck, but that does not appear to be the case. I'm running at 4GHz with 1880MHz RAM.

I also think I've gotten to know the instructions well enough to move on to an attempt at 3D. Just like my circle program, I want it to be anti-aliased from the start, and to have a raytracing system that could be extended to reflections. Understanding pipelines a bit, I now understand why Plücker coordinates might be more efficient, but I still have to wrap my head around the math. It seems the streets are just filled with uops for the taking, and the more parallel calculations you can slip in the better.

I think hand-optimized code could be produced much more quickly, and to a higher quality, if there were an interactive assembler with a visual indication of pipelines, timings and latencies, drawing from multiple sources. Like the information HLLs use to optimize code, except shown to the coder intuitively so they have control over the pipelines.

I had thought that as long as you can fill up the latencies with meaningful parallel computations, the instructions should only take as long as their throughput figures suggest. But I found that switching to instructions with minimal latencies, even at some cost in clock cycles, actually brought the largest improvements in speed.
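To make the latency/throughput distinction concrete (a hedged example with arbitrary registers): a dependent chain pays the full latency of every instruction, while independent operations only pay the throughput cost because the CPU overlaps them.

Code:
        ; dependent chain: each mulss has to wait for the previous result (latency-bound)
        mulss   xmm0,xmm1
        mulss   xmm0,xmm2
        mulss   xmm0,xmm3
        ; independent operations: the CPU can overlap these (throughput-bound)
        mulss   xmm0,xmm1
        mulss   xmm2,xmm3
        mulss   xmm4,xmm5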

Eventually I want to try something with FFT resampling, but that is something else quite new to me.

Attached is what happens when you turn the line width up to 128 pixels. This takes only 7 clocks extra per pixel! And the histogram shows it's in good health as well.


[Attachment: kean19.png (242.7 KB)]


revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20300
Location: In your JS exploiting you and your system
revolution 24 Mar 2013, 07:17
keantoken wrote:
What I don't understand is why movaps xmm,[m128] doesn't cause any speed hit over movaps xmm,xmm. I thought memory access was supposed to be a bottleneck, but that does not appear to be the case. I'm running at 4GHz with 1880MHz RAM.
You have just discovered the vagaries of optimising for a complex modern CPU. Today's CPUs are very complex, and simple rules like "memory is a bottleneck" no longer apply. There are many, many other things that affect the run times, including, but not limited to, code alignment, cache size/associativity, write buffer utilisation, previous instructions executed, future instructions executed, mobo devices active, etc.
keantoken wrote:
I think hand-optimized code could be produced much more quickly, and to a higher quality, if there were an interactive assembler with a visual indication of pipelines, timings and latencies, drawing from multiple sources. Like the information HLLs use to optimize code, except shown to the coder intuitively so they have control over the pipelines.
That would be almost impossible with any modern CPU. No software system could know the precise internal state of the CPU. Timing predictions would be meaningless, or at the very least, very inaccurate.
keantoken wrote:
I had thought that as long as you can fill up the latencies with meaningful parallel computations, the instructions should only take as long as their throughput figures suggest. But I found that switching to instructions with minimal latencies, even at some cost in clock cycles, actually brought the largest improvements in speed.
Perhaps so, on your current CPU. Other CPUs/systems will no doubt exhibit different behaviour.
comrade



Joined: 16 Jun 2003
Posts: 1150
Location: Russian Federation
comrade 24 Mar 2013, 08:52
revolution wrote:
keantoken wrote:
I think hand-optimized code could be produced much more quickly, and to a higher quality, if there were an interactive assembler with a visual indication of pipelines, timings and latencies, drawing from multiple sources. Like the information HLLs use to optimize code, except shown to the coder intuitively so they have control over the pipelines.
That would be almost impossible with any modern CPU. No software system could know the precise internal state of the CPU. Timing predictions would be meaningless, or at the very least, very inaccurate.


There is a feedback loop between the CPU and the program for some areas such as the branch predictor. I don't have a link handy, but years ago AMD CPUs came with a feature that allowed you to record the results of the internal branch predictor. The intention was that a JIT compiler could use this information to optimize the code on the fly based on runtime trace data.

_________________
comrade (comrade64@live.com; http://comrade.ownz.com/)
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall 24 Mar 2013, 11:35
Very good papers about optimization and microarchitectures:
http://www.agner.org/optimize/

Some tips on how to render fast on the CPU:
http://www.iquilezles.org/www/articles/cputiles/cputiles.htm

Great tool for performance measuring on Linux:
http://code.google.com/p/likwid/
keantoken



Joined: 19 Mar 2008
Posts: 69
keantoken 25 Mar 2013, 00:47
Thanks, I am checking those out. I have actually been going through the Agner manuals while working on this program.

Splitting the image into tiles was my first instinct when I thought about using multiple cores. However, different parts of an image render at different speeds, so some threads may lag behind. So it seems you would want a way for the slowest thread to fork if another thread finishes early, so the freed-up core can be recycled.

I stupidly used the vbroadcastss instruction in multiple places, so most people probably wouldn't be able to run that code. Haha... I've since replaced them with suitable legacy instructions that don't cause any slowdown.

I do something fairly odd with the image conversion. At the time, I left it that way because it worked, but I still have no explanation why it worked or why I couldn't make it work another way; I guess I don't understand the float format and conversions very well. It may be an advantage in that I only have to use an addss in the render loop rather than a mulss. When I have the pixel brightness value, I add 255.0 to it (-255.0 for white-on-black). Then, after converting it to integer, I take the second byte in the dword as the 8-bit pixel color rather than the first byte. This results in perfect behavior and histograms.
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall 25 Mar 2013, 15:53
keantoken wrote:

Splitting the image into tiles was my first instinct when I thought about using multiple cores. However, different parts of an image render at different speeds, so some threads may lag behind. So it seems you would want a way for the slowest thread to fork if another thread finishes early, so the freed-up core can be recycled.


Each thread takes the next tile (from the global pool) as soon as it finishes the current one. That way it doesn't matter that some tiles complete faster than others.
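A minimal sketch of such a pool, assuming a shared dword counter tile_counter (initialized to 0) and a constant TILE_COUNT; both names are made up:

Code:
NextTile:
        mov     eax,1
        lock xadd [tile_counter],eax    ; atomically claim the next tile; eax = its index
        cmp     eax,TILE_COUNT
        jae     .done                   ; pool is empty, this thread is finished
        ; ... render tile number eax ...
        jmp     NextTile
.done:
        ret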
keantoken



Joined: 19 Mar 2008
Posts: 69
keantoken 26 Mar 2013, 18:54
I guess you mean splitting it into more tiles than cores? More tiles means less total potential lag, but also more resources spent on syscalls and tile negotiation. Splitting tiles along the Y axis makes sense because you only need to update the X coordinate until you reach the end of the line. If you render line by line, the Y code only executes 1/1280th of the time, so if you put your tile negotiation and syscalls there, you should get minimal interference.

To reduce the chances of gross rendering-time differences between tiles, it is probably best to interleave the tiles so that slow sections are shared between threads. Burden sharing across threads favors square interleaved tiles, but interleaved line tiles should be enough, I think.

I think this will be my approach once I figure out how to create multiple threads and how they work.
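For what it's worth, a minimal sketch of the interleaved-scanline assignment (THREAD_ID, NUM_THREADS and HEIGHT are hypothetical constants; Y stands for whichever register holds the current line):

Code:
        mov     Y,THREAD_ID             ; thread n starts on scanline n
RenderLine:
        ; ... render scanline Y ...
        add     Y,NUM_THREADS           ; then takes every NUM_THREADS-th line
        cmp     Y,HEIGHT
        jb      RenderLine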
keantoken



Joined: 19 Mar 2008
Posts: 69
keantoken 02 Apr 2013, 06:06
I've included the beginnings of a separate project I'm messing with. I'd like some input on the readability and effectiveness of the layout and structure of the inline comments. If you don't read the rest of this post I'd at least like your input on this.

Code:
;----------------
; Graph
;
; Plots data and labels axes in a compact, minimalistic style
;
; Features:
;
; Plotting:
; 1: per-axis modes: anti-aliased, nearest
; 2: Connect mode
;       a: straight or curve (FFT interpolation; periodic FFT for periodic data such as phase)
; 3: Space mode
;       b: voronoi
; 4: Animations
; 5: background image/watermark
; 6: triggered area shading
;
; Axes:
; 1: Simple pixel font alphanumeric labels; vertical or horizontal
;       a: vertical, horizontal, or axis-aligned perpendicular or parallel
;       b: staggering
;       c: label rout packing w/staggering
;       d: superscript and subscript
;       e: on-plot point labels
;       f: transparent labels
;       g: label type: integer, fraction (reciprocal), ratio, normalized, exponent (scientific, engineering (mnemonic, exponent))
; 2: Flexible tick marks
; 3: Linear, logarithmic
;       a: octave, decade, decibel
; 4: 2 spatial dimensions; further dimensions via multiple lines or color
; 5: Modes:
; 6: Domain flag
; 7:
;
; Coordinates:
; 1: Cartesian, polar, multi-polar (modular?)
;       a: phase-wrap for polar and color representation of phase
;
; Modules:
; 1: Datatype specific transforms (hist, FFT, polar->cart) available through their respective modules
;
; Post-processing
; 1: blank border of specified pixels
; 2: integer resizing (x2, x4...)


format ELF64 executable 3
entry Main

Plot:
        ; Step domain until all data points are collected

        X       equ r15
        Y       equ r14
        Wxyps   equ xmm0
        RGBps   equ xmm1
        Spx     equ dqword [Xpx]
; Initiate scanloop

        mov     Y,[Ypx]
; Start scanloop
Scan:
        mov     X,[Xpx]
.X:
        ; For every pixel, its position is subtracted from that of every window point, and if it is farther than a pixel away from any point, it is skipped
                ; distance; |W(x,y)-P(x,y)|
                ; skip if further than a pixel away
                cmpltps
                movq    rdx,
                cmp     rdx,[ffffffffps]
                je      @f
                ; For AA:
                ; RGB% = 1-|W(x,y)-P(x,y)|/GPP
                rcpss                           ; 1.5*2**-12 relative accuracy; 11 or 12 bits
                mulss                           ; rcp->mul is a fast approximation of div. In an image you won't notice the difference, but you may notice the speedup.
                subss                           ; minus or plus 255 for black-on-white or white-on-black
                ; Convert to RGBA
                pshufb  RGBps,[RGBAmaskps]       ; format conversion
                movd    eax,RGBps

        @@:
        ; Store computed value
        mov     [rdi],eax                       ; Will using eax clear all 64-bit registers or just rax?
        xor     eax,eax                         ; in case the next pixel is skipped and this value is left in eax
        ; Bump pixel coordinates and window coordinates
        andps   Wxyps,
        dec     X
        jnz     .X

        add     Wy,[GPPy]
        dec     Y
        jnz     Scan

; Trigger point labels and store coordinates on stack
        ret

Label:
; calculate plot label sizes
        ; clipped labels trigger warning if image size is specified
; Process point label triggers on stack.
        ret

Main:
; Calculate image metrics
        ; plot section
                ; Pixel resolution
                ; grid points per pixel (Pixel length or GPP)
                        ; plot size (px) / window size = pixel length
; mmap image size
; Render plot
        call    Plot
; Render axes/labels
        call    Label
; Combine plot and labels
        ; Shift plot up and to the right

                            


While I work with Evan to try to get EDB running on my machine, I cannot make progress on this program, so I'm starting on another one. This time I'm experimenting with skeletonizing the program from the start using comments and then filling in between the comments once the structure is clear enough, sort of like an interlaced image. By outlining the features I want to implement from the start and making best guesses about the kind of code involved and how it is best optimized, I can attempt to set out with the optimal program layout and call/jump structure, as well as effective and readable inline documentation.

I have hit a snag in trying to implement multi-core. The only debugger which does what I need is EDB, but for some reason it really hates me and crashes whenever I open a file. The developer told me I must have crashed it more than everyone else combined. This reminds me of FreshIDE, which I still can't run because it's just too buggy. I seem to have problems no one else does, both on my new computer and on my old one, and across multiple Linux distributions.

As for the circle program, I think I could reduce render time significantly by inlining the circle-drawing function in the scan loop, which would eliminate an extra call and the associated lags and false dependencies. I could also unroll the scan loop by an amount that I know divides the total resolution; 5 seems low enough to support even strange resolutions, while 20 should cover all desktop resolutions. That would cut the number of jumps to as little as 1/20th, so jumps would essentially no longer be a factor in performance, if they ever were. According to Agner a call takes about 13 cycles; that's a significant portion of the 21 cycles my code takes to execute.
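A rough sketch of that unrolling with fasm's rept directive (hypothetical; assumes the per-pixel work is inlined and the row width is a multiple of 4):

Code:
ScanX:
        rept 4
        {
                ; per-pixel work, inlined here instead of being called
                add     rdi,4           ; advance one 32-bit pixel in the output buffer
        }
        sub     r15,4                   ; X pixel counter (register choice is arbitrary)
        jnz     ScanX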
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall 02 Apr 2013, 10:47
I am using fdbg and it works really well. What are the features that you need and are not present in fdbg?
keantoken



Joined: 19 Mar 2008
Posts: 69
keantoken 01 Aug 2017, 10:33
So... I dropped this project for 4 years and never touched assembly during that period. But that seems to be okay, because I've just been able to implement interleaved multithreading. It just splits the image into scanlines and then has each thread render its own lines.

My problem right now is that I can only have each thread write its own file with an incremented filename. I don't know how to create a shared memory space that all threads update together.

I have been trying to use sys_mmap with file backing. My guess is that I need to do sys_open, passing the file descriptor to sys_mmap with the appropriate flags, but one of the arguments involves the memory page size and I don't know how to get that number.

Any suggestions on how to share an mmap between threads?
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20300
Location: In your JS exploiting you and your system
revolution 01 Aug 2017, 11:29
Threads in a single process all share the same memory, so you can allocate a section of memory to hold the entire output. Each thread then stores its pixels at the appropriate position, and when all the threads are finished you save the memory section to disk.
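A minimal sketch of allocating that shared output buffer with a raw sys_mmap call (untested; IMAGE_SIZE is a made-up constant). An anonymous MAP_SHARED mapping also stays shared across sys_fork, which is what the current code uses, and since it isn't file-backed the offset can simply be 0, so the page size never has to be queried:

Code:
        mov     eax,9                   ; sys_mmap
        xor     edi,edi                 ; addr = 0, let the kernel choose
        mov     esi,IMAGE_SIZE          ; length of the whole output image in bytes
        mov     edx,3                   ; PROT_READ | PROT_WRITE
        mov     r10d,0x21               ; MAP_SHARED | MAP_ANONYMOUS
        mov     r8,-1                   ; no file descriptor needed
        xor     r9d,r9d                 ; offset 0
        syscall                         ; rax -> buffer visible to all threads/children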
keantoken



Joined: 19 Mar 2008
Posts: 69
keantoken 01 Aug 2017, 11:44
What is shared between threads depends on whether you call sys_clone or sys_fork and what flags you use. I am using sys_fork because I could never figure out how to use sys_clone. As far as I can tell no memory space was shared because otherwise at least one of the output images would have the entire image contents.

I should be able to use the CLONE_VM flag with sys_clone to share virtual memory (which I assume means mmap), although I might have to rewrite the spawn loop to do it.
Furs



Joined: 04 Mar 2016
Posts: 2493
Furs 01 Aug 2017, 12:25
But fork (and by extension, sys_fork) creates a new process, so of course it doesn't share the memory by default.

You need to use clone. I think the classic case of thread flags is CLONE_VM | CLONE_THREAD | CLONE_FS | CLONE_FILES | CLONE_SIGHAND

(or omit the CLONE_THREAD, or other stuff if you don't need them shared)
keantoken



Joined: 19 Mar 2008
Posts: 69
keantoken 01 Aug 2017, 14:16
Are there any examples? I only found an assembly example using fork.
Furs



Joined: 04 Mar 2016
Posts: 2493
Furs 01 Aug 2017, 15:27
You should really learn C (at least to read it) when using Linux, even if you don't program in it, since almost all reference material is in C. Wink

But I'll try to give an (untested!) example with a raw syscall. For info, see here: http://man7.org/linux/man-pages/man2/clone.2.html (note that it first describes the glibc interface, not the raw syscall)

The "signature" for x86-64 (what you need?) is like:
Code:
long clone(unsigned long flags, void* child_stack, int* ptid, int* ctid, unsigned long newtls);    
(not valid C calling convention obviously)

The last 3 parameters are optional and depend on flags; for the simplest case they are not needed, but you probably want ptid so your parent can know about the child. To use ptid (a pointer to memory in the parent that receives the thread ID of the child), use the flag CLONE_PARENT_SETTID.

You will have to allocate space for the thread's stack yourself and then pass a pointer to the end of the allocated memory in the second parameter. That's up to you.

Useful references (bookmark them Wink):

syscalls table (what goes in eax): 32-bit | 64-bit
Flags for clone are found in this file: https://github.com/torvalds/linux/blob/master/include/uapi/linux/sched.h

So I guess something like:
Code:
mov eax, 56          ; should be sys_clone
mov edi, 0x00110F00  ; CLONE_VM | CLONE_THREAD | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_PARENT_SETTID
mov rsi, end_of_allocated_stack
lea rdx, [rsp+ptid]  ; assuming it's a local variable, change it as needed
syscall    
Note that the syscall operates just like fork, i.e. the child will start execution at the same point (where syscall returns); inspect the return value to tell the parent from the child.

Of course I don't recommend using hardcoded values, you should use sed or something to convert the #define into FASM constants. Wink
keantoken



Joined: 19 Mar 2008
Posts: 69
keantoken 01 Aug 2017, 16:08
So, at risk of sounding stupid, am I right that I need to push a bunch of empty space onto the stack, however much I think my threads will need, and then tell sys_clone to put their stack pointers at the right positions? Or do I need to use another system call to allocate stack space?

Also, why [rsp+ptid]? Does this just give each thread quick unique stacks or is there a mathematical reason to use ptid here?

I haven't used the stack in my program except where some syscalls needed it; I'm not sure why I need it now, since I do everything in mmap.
Furs



Joined: 04 Mar 2016
Posts: 2493
Furs 01 Aug 2017, 17:03
The ptid is for the parent, not for the new thread. It's where sys_clone will store the ID of the thread, so that you can access it from the parent, in case you want to do stuff with the thread from the parent. Think of it as an extra return value for the parent (not for the new thread). Of course, if you let the thread run wild without the parent knowing, then you won't need it (in that case don't supply the flag).

The reason I placed it on the stack is that it is good programming practice to avoid globals in multi-threaded code. You can place it in a global variable if that global is only accessed by the parent thread. Or you can place it in an allocated data structure accessed only by the parent; that works fine too.

About the stack, yes you need to allocate new space with sys_mmap, sys_brk or something like that (or use a user-land library, depends on your use case). Using the same stack for both threads can be dangerous and is quite hacky.
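Roughly, that allocation could look like this (untested sketch; the 1 MiB size is arbitrary):

Code:
        mov     eax,9                   ; sys_mmap
        xor     edi,edi                 ; addr = 0
        mov     esi,0x100000            ; 1 MiB of stack for the new thread
        mov     edx,3                   ; PROT_READ | PROT_WRITE
        mov     r10d,0x20022            ; MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK
        mov     r8,-1                   ; anonymous mapping, no file descriptor
        xor     r9d,r9d                 ; offset 0
        syscall
        lea     rsi,[rax+0x100000]      ; the stack grows down, so pass the top as child_stack

The rsi value then feeds straight into the sys_clone call shown earlier.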
keantoken



Joined: 19 Mar 2008
Posts: 69
keantoken 01 Aug 2017, 17:41
By globals do you mean variables which all threads could change at any time? I have a long list of variables in the RW segment of my program. Using sys_fork they are not shared between threads, but I don't know how sys_clone will handle them. For each thread to have its own separate RW segment would be easiest to make work with the program as it already is.