flat assembler
Message board for the users of flat assembler.
Linux > Circle gone wrong
keantoken 23 Mar 2013, 22:07
Well, I went from 94FPS to 147FPS. That's a 56% increase. I think I am ready to try polygon culling.
keantoken 24 Mar 2013, 05:52
With circumscribed square culling, I managed 160FPS (20 instructions per pixel; 22-49 total instructions in the loop). So I'm up to 70% faster than the first working code version. I believe I'm reaching the point of diminishing returns without major modifications to the code. Unfortunately, I use a few AVX instructions so the program won't run on either of the other computers in my house. I think I should remove the AVX instructions so that I can do comparisons.
What I don't understand is why movaps xmm,[m128] doesn't cause any speed hit over movaps xmm,xmm. I thought memory access was supposed to be a bottleneck, but that does not appear to be the case. I'm running at 4GHz with 1880MHz RAM.

I also think I've gotten to know the instructions well enough to move on to an attempt at 3D. Just like my circle program, I want it to be anti-aliased from the start, and to have a raytracing system that could be extended to reflections. Understanding pipelines a bit, I now understand why Plücker coordinates might be more efficient, but I still have to wrap my head around the math. It seems the streets are just filled with uops for the taking, and the more parallel calculations you can slip in the better.

I think hand-optimized code could be produced much more quickly if there were an interactive assembler with a visual indication of pipelines, timings and latencies, drawing from multiple sources: like the information HLLs use to optimize code, except shown to the coder intuitively so they have control over the pipelines. I had thought that as long as you can fill up the latencies with meaningful parallel computations, the instructions should only take as long as their timing cycles. But I found that switching to instructions with minimal latencies, even at some cost in clock cycles, actually brought the largest improvements in speed.

Eventually I want to try something with FFT resampling, but that is something else quite new to me.

Attached is what happens when you turn the line width up to 128 pixels. This takes only 7 clocks extra per pixel! And the histogram shows it's in good health as well.
revolution 24 Mar 2013, 07:17
keantoken wrote:
What I don't understand, is why movaps xmm,[m128] doesn't cause any speed hit over movaps xmm,xmm. I thought memory access was supposed to be a bottleneck, but that does not appear to be the case. I'm running at 4GHz with 1880MHz RAM.

keantoken wrote:
I think the quality of hand-optimized code could be much more expedient if there were an interactive assembler with a visual indication of pipelines, timings and latencies, drawing from multiple sources. Like the information HLL's use to optimize code except show it to the coder intuitively so they have control over the pipelines.

keantoken wrote:
I had thought that as long as you can fill up the latencies with meaningful parallel computations, the instructions should only take as long as their timing cycles. But I found that switching to instructions with minimal latencies, with lenience towards clock cycles, actually brought the largest improvements in speed.
comrade 24 Mar 2013, 08:52
revolution wrote:
There is a feedback loop between the CPU and the program for some areas such as the branch predictor. I don't have a link handy, but years ago AMD CPUs came with a feature that allowed you to record the results of the internal branch predictor. The intention was that a JIT compiler could use this information to optimize the code on the fly based on runtime trace data.
randall 24 Mar 2013, 11:35
Very good papers about optimization and microarchitectures:
http://www.agner.org/optimize/
Some tips on how to render fast using the CPU: http://www.iquilezles.org/www/articles/cputiles/cputiles.htm
A great tool for performance measuring on Linux: http://code.google.com/p/likwid/
keantoken 25 Mar 2013, 00:47
Thanks, I am checking those out. I have actually been going through the Agner manuals while working on this program.
Splitting the image into tiles was my first instinct when I thought about using multiple cores. However, different parts of an image render at different speeds, so some threads may lag behind. So it seems you would want a way for the slowest thread to fork if another thread finishes early, so the extra core can be recycled.

I stupidly used the vbroadcastss instruction in multiple places, so most people probably won't be able to run that code. Haha... I've replaced them with suitable legacy code that doesn't cause any slowdown.

I do something fairly odd with the image conversion. At the time, I left it that way because it worked, but I still have no explanation why it worked or why I couldn't make it work another way; I guess I don't understand the float format and conversions very well. It may be an advantage in that I only have to use an addss in the render loop rather than a mulss. When I have the pixel brightness value, I add 255.0 to it (-255.0 for white-on-black). Then, after converting it to integer, I take the second byte in the dword as the 8-bit pixel color rather than the first byte. This results in perfect behavior and histograms.
randall 25 Mar 2013, 15:53
keantoken wrote:
Each thread will take the next tile (from the global pool) as soon as it finishes the current one. That way it doesn't matter that some tiles will be completed faster.
keantoken 26 Mar 2013, 18:54
I guess you mean splitting it into more tiles than cores? More tiles means less total potential lag, but also more resources taken with syscalls and tile negotiation. Splitting tiles by Y axis makes sense because you only need to update the X coordinate until you reach the end of the line. If you render line by line, Y code only executes 1/1280 of the time, so if you stick your tile negotiation and syscalls there, you should have minimal interference.
To reduce the chances of gross rendering-time differences between tiles, it is probably best to interleave the tiles, so slow sections are shared by threads. Burden sharing across threads favors square interleaved tiles, but interleaved line tiles should be enough, I think. This will be my approach once I figure out how to create multiple threads and how they work.
keantoken 02 Apr 2013, 06:06
I've included the beginnings of a separate project I'm messing with. I'd like some input on the readability and effectiveness of the layout and structure of the inline comments. If you don't read the rest of this post I'd at least like your input on this.
Code:
;----------------
; Graph
;
; Plots data and labels axes in a compact, minimalistic style
;
; Features:
;
; Plotting:
; 1: per-axis modes: anti-aliased, nearest
; 2: Connect mode
;    a: straight or curve (FFT interpolation; periodic FFT for periodic data such as phase)
; 3: Space mode
;    b: voronoi
; 4: Animations
; 5: background image/watermark
; 6: triggered area shading
;
; Axes:
; 1: Simple pixel font alphanumeric labels; vertical or horizontal
;    a: vertical, horizontal, or axis-aligned perpendicular or parallel
;    b: staggering
;    c: label rout packing w/staggering
;    d: superscript and subscript
;    e: on-plot point labels
;    f: transparent labels
;    g: label type: integer, fraction (reciprocal), ratio, normalized, exponent (scientific, engineering (mnemonic, exponent))
; 2: Flexible tick marks
; 3: Linear, logarithmic
;    a: octave, decade, decibel
; 4: 2 spatial dimensions; further dimensions via multiple lines or color
; 5: Modes:
; 6: Domain flag
; 7:
;
; Coordinates:
; 1: Cartesian, polar, multi-polar (modular?)
;    a: phase-wrap for polar and color representation of phase
;
; Modules:
; 1: Datatype specific transforms (hist, FFT, polar->cart) available through their respective modules
;
; Post-processing
; 1: blank border of specified pixels
; 2: integer resizing (x2, x4...)

format ELF64 executable 3
entry Main

Plot:
; Step domain until all data points are collected
X = r15
Y = r14
Wxyps = xmm0
RGBps = xmm1
Spx = dqword [Xpx]
; Initiate scanloop
        mov     Y,[Ypx]
; Start scanloop
Scan:
        mov     X,[Xpx]
.X:
; For every pixel, its position is subtracted from that of every window point,
; and if it is farther than a pixel away from any point, it is skipped
; distance: |W(x,y)-P(x,y)|
; skip if further than a pixel away
        cmpltps
        movq    rdx,
        cmp     rdx,[ffffffffps]
        je      @f
; For AA:
; RGB% = 1-|W(x,y)-P(x,y)|/GPP
        rcpss           ; 1.5*10**-11 accuracy; 11 or 12 bits
        mulss           ; rcp->mul is a fast approximation of div. In an image you won't notice the difference, but you may notice the speedup.
        subss           ; minus or plus 255 for black-on-white or white-on-black
; Convert to RGBA
        pshufb  RGBps,[RGBAmaskps]      ; format conversion
        movd    eax,RGBps
@@:
; Store computed value
        mov     [edi],eax       ; Will using eax clear all 64-bit registers or just rax?
        xor     eax,eax         ; in case the next pixel is skipped and this value is left in eax
; Bump pixel coordinates and window coordinates
        andps   Wxyps,
        dec     X
        jnz     .X
        add     Wy,[GPPy]
        dec     Y
        jnz     Scan
; Trigger point labels and store coordinates on stack
        return

Label:
; calculate plot label sizes
; clipped labels trigger warning if image size is specified
; Process point label triggers on stack.
        return

Main:
; Calculate image metrics
; plot section
; Pixel resolution
; grid points per pixel (Pixel length or GPP)
; plot size (px) / window size = pixel length
; mmap image size
; Render plot
        call    Plot
; Render axes/labels
        call    Label
; Combine plot and labels
; Shift plot up and to the right

While I cooperate with Evan to try and get EDB to work on my machine, I cannot progress with this program, so I'm starting on another one. This time I'm experimenting with skeletonizing the program from the start using comments and then filling in between the comments once the structure is clear enough, sort of like an interlaced image. By outlining the features I want to implement from the start and making best guesses about the type of code and how it is best optimized, I can attempt to set out with the optimal program layout and call/jump structure, as well as effective and readable inline documentation.

I have hit a snag in trying to implement multi-core. The only debugger which does what I need is EDB, but for some reason it really hates me and crashes whenever I open a file. The developer told me I must have crashed it more than everyone else combined. This reminds me of FreshIDE, which I still can't run because it's just too buggy. I seem to have problems no one else does, both on my new computer and on my old one, and across multiple Linux distributions.
As for the circle program, I think I could reduce render time significantly by inserting the circle drawing function inline with the scan loop, which would eliminate an extra call and the associated lags and false dependencies. I could also unroll the scan loop by an amount that I knew would be a factor of the total resolution; 5 seems low enough to support even strange resolutions, while I think 20 should support many common desktop resolutions. This would reduce code jumps to as little as 1/20 of their current count; essentially code jumps would no longer be a factor in performance, if they were to begin with. According to Agner, a call takes about 13 cycles; that's a significant portion of the 21 cycles my code takes to execute.
randall 02 Apr 2013, 10:47
I am using fdbg and it works really well. What features do you need that fdbg doesn't have?
keantoken 01 Aug 2017, 10:33
So... I dropped this project for 4 years and never touched assembly during that period. But that seems to be okay, because I've just been able to implement interleaved multithreading. It just splits the image into scanlines and then has each thread render its own lines.
My problem right now is that I can only have each thread write its own file with an incremented filename. I don't know how to create a shared memory space that all threads update together. I have been trying to use sys_mmap with file backing. My guess is that I need to do sys_open, passing the file descriptor to sys_mmap with the appropriate flags, but one of those arguments depends on the memory page size and I don't know how to get that number. Any suggestions on how to share an mmap between threads?
revolution 01 Aug 2017, 11:29
Threads in a single process all share the same memory, so you can allocate a section of memory to hold the entire output. Each thread then stores its pixels at the appropriate memory position, and when all the threads are finished you save the memory section to disk.
keantoken 01 Aug 2017, 11:44
What is shared between threads depends on whether you call sys_clone or sys_fork and what flags you use. I am using sys_fork because I could never figure out how to use sys_clone. As far as I can tell, no memory space was shared, because otherwise at least one of the output images would have the entire image contents.
I should be able to use the CLONE_VM flag with sys_clone to share virtual memory (which I assume covers mmap), although I might have to rewrite the spawn loop to do it.
Furs 01 Aug 2017, 12:25
But fork (and by extension, sys_fork) creates a new process, so of course it doesn't share the memory by default.
You need to use clone. I think the classic set of thread flags is CLONE_VM | CLONE_THREAD | CLONE_FS | CLONE_FILES | CLONE_SIGHAND (or omit CLONE_THREAD, or other flags, if you don't need those things shared).
keantoken 01 Aug 2017, 14:16
Are there any examples? I only found an assembly example using fork.
Furs 01 Aug 2017, 15:27
You should really learn C (at least enough to read it) when using Linux, even if you don't program in it, since almost all reference material is in C.
But I'll try to give an (untested!) example with the raw syscall. For info, see here: http://man7.org/linux/man-pages/man2/clone.2.html (note that it first describes the glibc interface, not the raw syscall). The "signature" for x86-64 (what you need?) is like:
Code:
long clone(unsigned long flags, void* child_stack, int* ptid, int* ctid, unsigned long newtls);

The last 3 parameters are optional and depend on flags; in the simplest case they are not needed, but you probably want ptid so your parent can know about the child. To use ptid (a pointer to memory in the parent that receives the thread ID of the child), use the flag CLONE_PARENT_SETTID.

You will have to allocate space for the thread's stack yourself and then pass a pointer to the end of the allocated memory in the second parameter. That's up to you.

Useful references (bookmark them): syscalls table (what goes in eax): 32-bit | 64-bit. Flags for clone are found in this file: https://github.com/torvalds/linux/blob/master/include/uapi/linux/sched.h

So I guess something like:
Code:
        mov     eax, 56                 ; should be sys_clone
        mov     edi, 0x00110F00         ; CLONE_VM | CLONE_THREAD | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_PARENT_SETTID
        mov     rsi, end_of_allocated_stack
        lea     rdx, [rsp+ptid]         ; assuming it's a local variable, change it as needed
        syscall

Of course I don't recommend using hardcoded values; you should use sed or something to convert the #defines into FASM constants.
keantoken 01 Aug 2017, 16:08
So, at the risk of sounding stupid, am I right that I need to push a bunch of empty space onto the stack, however much I think my threads will need, and then tell sys_clone to put their stack pointers at the right positions? Or do I need to use another system call to allocate stack space?
Also, why [rsp+ptid]? Does this just give each thread a quick unique stack, or is there a mathematical reason to use ptid here? I haven't used the stack in my program except where some syscalls needed it; I'm not sure why I need to, since I do everything in mmap.
Furs 01 Aug 2017, 17:03
The ptid is for the parent, not for the new thread. It's where sys_clone will store the ID of the thread, so that you can access it from the parent, in case you want to do stuff with the thread from the parent. Think of it as an extra return value for the parent (not for the new thread). Of course, if you let the thread run wild without the parent knowing, then you won't need it (in that case, don't supply the flag).
The reason I placed it on the stack is good programming practice: avoid globals in multi-threaded code. You can place it in a global variable if that global is only accessed by the parent thread. Or you can place it in an allocated data structure accessed only by the parent; that works fine too.

About the stack: yes, you need to allocate new space with sys_mmap, sys_brk or something like that (or use a user-land library, depending on your use case). Using the same stack for both threads can be dangerous and is quite hacky.
keantoken 01 Aug 2017, 17:41
By globals do you mean variables which all threads could change at any time? I have a long list of variables in the RW segment of my program. Using sys_fork they are not shared between the resulting processes, but I don't know how sys_clone will work. For each thread to have its own separate RW segment would be easiest to make work with the program as it already is.
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.