flat assembler
Message board for the users of flat assembler.
DOS > Bresenham line routine for the linear framebuffer with 8 bpp
sid123 19 Mar 2014, 02:26
Cool
edfed 19 Mar 2014, 09:48
You can add a sort of "out of bounds" check that makes put pixel ignore bad coordinates.
If the result of the pixel coordinate computation is less than 0, or more than 1920*1200 - 1, don't draw; then you don't have to care about the bounds of your line. With a little more computation you can also readjust the coordinates to fit on the screen, but that needs some mul and div instructions in the init phase, just after the bounds check.
A putpixel routine lets you play with the screen without having to care about the specifics of the framebuffer.
Code:
    mov ebx,x
    mov ecx,y
    mov eax,color
    call putpixel
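A minimal C sketch of the bounds-checked putpixel idea described above, for an 8 bpp linear framebuffer. The `framebuffer8` struct, its field names, and `putpixel_clipped` are hypothetical and do not come from the thread. Instead of testing the computed offset against 0 and 1920*1200 - 1, this sketch tests x and y separately, on the assumption that a pixel past the right edge should be dropped rather than wrap onto the next scanline.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of an 8 bpp linear framebuffer. */
typedef struct {
    uint8_t *base;   /* address of the first pixel */
    int      width;  /* visible pixels per scanline */
    int      height; /* number of scanlines */
    size_t   pitch;  /* bytes per scanline, may be larger than width */
} framebuffer8;

/* Draw one pixel. Returns 1 if it was drawn, 0 if it was off screen. */
int putpixel_clipped(framebuffer8 *fb, int x, int y, uint8_t color)
{
    if (x < 0 || y < 0 || x >= fb->width || y >= fb->height)
        return 0;                       /* silently ignore the pixel */
    fb->base[(size_t)y * fb->pitch + (size_t)x] = color;
    return 1;
}
```

A caller would fill in `framebuffer8` once from the video mode information and then pass it to every drawing routine, so only this one function needs to know the framebuffer layout.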
freecrac 20 Mar 2014, 09:25
Hello, and thanks to all for the feedback.
edfed wrote: you can add a sort of "out of bound" put pixel ignore if the result of the pixel coordinate computation is less than 0, or more than 1920*1200 -1, don't draw then, don't care about the bounds of your line.
Yes, that is possible, but I think my own main routines that call the line routine never need coordinates outside the resolution. If it happens anyway, then the main routine contains a wrong calculation, and I have to find and fix it.
Quote: a putpixel routine can let you play with the screen, without having anything to care about the specificities of the framebuffer.
I think calling a subroutine to draw every single pixel costs more clock cycles, so this method works against speed optimization, where the goal is to use as few instructions as possible, and only very simple ones that can be paired and executed together in two or more instruction pipelines. Maybe some additional NOP instructions between the calculating instructions can prevent dependencies between them, for example when they read a register right after writing it, and can also align the target addresses of the conditional jumps to 16 bytes.
I have not yet tried to assemble and use the line routine in 32-bit mode. I hope that is possible too.
Dirk
edfed 20 Mar 2014, 13:17
It results in more clock cycles, but you have plenty of them... in fact, the computer spends 99% of its CPU clocks doing nothing...
Later, the putpixel function can be implemented any way you want (OpenGL, write to a buffer, send over the network, ...). The abstraction you get from putpixel is clearly magic and lets you do many more things than when the pixel-writing code sits inside the line, image, or any other function. You can also scale, rotate, do 3D projection, etc., just because you have one putpixel routine shared by all graphics functions; you only have to modify putpixel to make all your graphics fit the new implementation. And the Bresenham algorithm likes the putpixel routine a lot. Here is mine:
Code:
;transparent return
line:
.call=0
.x=4
.y=8
.xl=12
.yl=16
.c=20
        push eax ebx ecx edx esi edi
        mov ecx,1               ;set the increments
        mov edx,1
        xor edi,edi             ;set the 0 value
        ;load the length of the line. in a conventional Bresenham algo it is
        ;computed, here it is given directly, and if i replace the line function
        ;by a rectangle function, it will trace the bounding box of the line,
        ;etc... it's cool
        mov eax,[esi+.xl]
        mov ebx,[esi+.yl]
        cmp eax,edi             ;compare the x length with 0
        jge @f                  ;if it is negative
        neg eax                 ;negate the length
        neg ecx                 ;and the increment
@@:
        cmp ebx,edi             ;compare the y length with 0
        jge @f                  ;if it is negative
        neg ebx                 ;negate the length
        neg edx                 ;and the increment
@@:
        cmp eax,ebx             ;compare the lengths
        jl .isy                 ;and set the increments as needed
.isx:                           ;x is the master axis, means the pixel will always move by one on this axis
        mov [.xinc1],ecx        ;set the first x increment
        mov [.xinc2],edi        ;the second is 0
        mov [.yinc1],edi        ;the first y increment is 0
        mov [.yinc2],edx        ;the second is to set
        mov [.dmax],eax         ;remember the delta values
        mov [.dmin],ebx         ;the min and the max
        jmp @f
.isy:                           ;y is the master axis
        mov [.xinc1],edi        ;same as above, but for the y axis
        mov [.xinc2],ecx
        mov [.yinc1],edx
        mov [.yinc2],edi
        mov [.dmax],ebx
        mov [.dmin],eax
@@:
        mov eax,[esi+.x]        ;load the first point
        mov ebx,[esi+.y]        ; in eax and ebx
        mov edx,[.dmax]         ;load the maximal delta
        shr edx,1               ;divide it by 2 to have a symmetric line
        mov cl,[esi+.c]         ;load the color
        mov edi,[.dmax]         ;load the maximal delta again, can be made from the edx value before the shift, save some time...
@@:
        call pixel              ;put the pixel at eax,ebx with color cl
        dec edi                 ;decrement the maximal delta (means the number of pixels of the line)
        jl @f                   ;if it is negative, it is the end, then, go out, nothing more to see
        add eax,[.xinc1]        ;increment the x coordinate
        add ebx,[.yinc1]        ;increment the y coordinate
        sub edx,[.dmin]         ;iterative division of the delta max by the delta min
        jge @b                  ;if the result is not negative, continue the first loop
        add eax,[.xinc2]        ;otherwise, the second loop is there to move the pixel on the slave axis
        add ebx,[.yinc2]
        add edx,[.dmax]         ;the iterative division restarts
        jmp @b                  ;and continue the loop
@@:
        pop edi esi edx ecx ebx eax
        ret                     ;end of the algo
align 4
.xinc1 rd 1     ;these are local variables, i didn't put them on the stack, later i will, because the ebp register is free, because i use a putpixel routine
.xinc2 rd 1     ;but i dislike the stack for data, i prefer the stack just for the execution flow...
.yinc1 rd 1     ;but only dumb people never change their opinion.
.yinc2 rd 1     ;then, these 6 ugly locals (that are global) will be replaced by stack variables.
.dmin rd 1
.dmax rd 1      ;good bye
putpixel is to graphics what putchar is to text. printstring will use putchar, not reimplement it.
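For readers who find the assembly hard to follow, the same loop can be sketched in C. This is an illustrative translation under assumptions, not edfed's code: the names `putpixel_fn`, `line_via_putpixel`, `pixel_trace`, and `trace_putpixel` are hypothetical, the parameters are passed as plain arguments instead of through a structure at `esi`, and the six increments and deltas are ordinary locals. As in the routine above, `xl` and `yl` are signed lengths, and a line with a master-axis length of `dmax` produces `dmax + 1` pixels.

```c
#include <stdint.h>

/* Hypothetical callback type: draws one pixel in whatever way the backend likes. */
typedef void (*putpixel_fn)(void *ctx, int x, int y, uint8_t color);

/* Draw a line starting at (x, y) and spanning xl by yl pixels (signed). */
void line_via_putpixel(putpixel_fn put, void *ctx,
                       int x, int y, int xl, int yl, uint8_t color)
{
    int sx = 1, sy = 1;                       /* per-axis step direction */
    if (xl < 0) { xl = -xl; sx = -1; }
    if (yl < 0) { yl = -yl; sy = -1; }

    int xinc1, yinc1, xinc2, yinc2, dmax, dmin;
    if (xl >= yl) {                           /* x is the master axis */
        xinc1 = sx; yinc1 = 0; xinc2 = 0; yinc2 = sy; dmax = xl; dmin = yl;
    } else {                                  /* y is the master axis */
        xinc1 = 0; yinc1 = sy; xinc2 = sx; yinc2 = 0; dmax = yl; dmin = xl;
    }

    int err = dmax >> 1;                      /* start in the middle for a symmetric line */
    for (int n = dmax; ; ) {
        put(ctx, x, y, color);
        if (--n < 0)
            break;                            /* dmax + 1 pixels have been drawn */
        x += xinc1; y += yinc1;               /* always step on the master axis */
        err -= dmin;
        if (err < 0) {                        /* time to step on the slave axis */
            x += xinc2; y += yinc2;
            err += dmax;
        }
    }
}

/* Example backend: records coordinates instead of touching video memory. */
typedef struct { int n; int xs[64]; int ys[64]; } pixel_trace;

void trace_putpixel(void *ctx, int x, int y, uint8_t color)
{
    pixel_trace *t = ctx;
    (void)color;
    if (t->n < 64) { t->xs[t->n] = x; t->ys[t->n] = y; t->n++; }
}
```

Swapping `trace_putpixel` for a routine that writes to a framebuffer, a back buffer, or a socket changes where the pixels go without touching the line code, which is the point being argued in the post.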
freecrac 20 Mar 2014, 19:15
Hello.
edfed wrote: it results in more clock cycles, but you have plenty of them... in fact, the computer spends 99% of it's cpu clocks to do nothing...
The CPU as a whole does not spend 99% of its time doing nothing while my line routine calculates and draws a line; only a part of the CPU is idle.
Quote: later, the putpixel function can be implemented in any way you want (opengl, write in a buffer, send to the network, ...). the abstraction induced by the usage of putpixel is clearly magic and can let do many many more things than when the putpixel routine in inside the line, image or anything else function. you can also scale, rotate, 3D projection, etc... just because you have a putpixel routine shared by all graphics functions, and you'll just have to modify the putpixel routine to make all your graphics fit the new implementation
I do not like to use one routine for all situations. I prefer several specialized routines, maybe each for only one purpose, because we have enough memory for more than one routine.
...
I have also written a fractal routine for the Linux framebuffer device (fb0). It runs on two Linux PCs. One PC sends some start parameters for the fractal calculation to the second PC via the network. Then the first PC begins to calculate the first line and the second PC the second line of the same fractal picture. When both lines are calculated and stored in a buffer, the second PC sends its complete scanline of 4096 bytes to the first PC, and the first PC writes both lines directly into its own framebuffer device, so we can watch the progress on the monitor two lines at a time. Both PCs use an AMD K6-2 @ 550 MHz and a Fast Ethernet card. The PC that shows the picture uses a Matrox PCI (4 MB) card and a Linux live CD that boots the framebuffer device in 1024x768 with 32-bit color by default.
I think sending only one pixel at a time over the network is a bad ratio.
Yes, I can see, your routine is a Swiss Army knife. But I do not like to push register values onto the stack. I only save a register, to a known memory location in the data segment, if it is really needed, which is rare. If possible, I use other registers to minimize RAM access. On older CPUs before the Pentium 4, mov/mov is faster than push/pop.
Dirk
revolution 20 Mar 2014, 19:52
freecrac wrote: But i do not like to push register values to the stack. I only save a register if it is really needed to a known memory location into the data segment, ...
freecrac wrote: ... but this is not always and more rarely. If it is possible i use other registers for to minimize the ram access. For push/pop versus mov/mov on older CPUs before Pentium 4 the mov/mov instructions are faster.
BTW: Do you have timing data to show you how long it takes to run? This is an important piece of information to guide you when trying to "make it run faster".
neville 21 Mar 2014, 08:07
revolution wrote: So no chance of using multiple threads then. With using only global variables you make it impossible to make this really fast by taking advantage of the extra cores lying idle.
_________________
FAMOS - the first memory operating system
freecrac 21 Mar 2014, 10:22
Hello.
revolution wrote: So no chance of using multiple threads then. With using only global variables you make it impossible to make this really fast by taking advantage of the extra cores lying idle.
Yes, that is true. At the moment I am not thinking about implementing multithreading or about starting up other cores for multiprocessing, so on multicore CPUs the other cores lie idle.
But if we load a value from the data segment, then it also ends up in the DCache, so there is only a marginal difference compared with getting a value from the stack. As a side effect, the stack size can be reduced. It is also possible to push and pop a value to/from the stack using only mov instructions plus decreasing/increasing the stack pointer.
Quote: I think you could be optimising for the wrong things and ignoring the things that can make a much larger difference to the performance.
I am not fully ignoring it, but to start up and use the other cores I need to learn more about how to handle a context switch and such things, and it is really not so simple to build a multiprocessing kernel just for one application. (I am not planning to build my own OS.)
Quote: Look into how you can stream the data to the video RAM. Use the internal DCache to store intermediate steps and then stream it all to screen. You could get an order of magnitude increase if you get it right.
If we want to copy the entire screen to the framebuffer, it is simple to draw the line into another address area in RAM by recalculating the address table for the line routine. But I have never used the memory type range registers (MTRR) to set write combining, because there are different ways to do it depending on the architecture. Some have the page attribute table (PAT) and others have no PAT.
Quote: BTW: Do you have timing data to show you how long it takes to run? This is an important piece of information to guide you when trying to "make it run faster".
Yes, we did some tests of push/pop vs mov/mov with a simple program written by Frank Kotler:
Code:
; nasm -f elf pushvsmov.asm -d_MOV (or "-d_PUSH")
; ld -o pushvsmov pushvsmov.o

global _start

section .bss
    eax_sav resd 1
    ebx_sav resd 1
    ecx_sav resd 1
    edx_sav resd 1
    esi_sav resd 1
    edi_sav resd 1

section .text
_start:
    nop
    xor eax, eax
    cpuid                   ; serialize before reading the time stamp counter
    rdtsc
    push edx                ; keep the start count on the stack
    push eax
%ifdef _MOV
    mov [eax_sav], eax
    mov [ebx_sav], ebx
    mov [ecx_sav], ecx
    mov [edx_sav], edx
    mov [esi_sav], esi
    mov [edi_sav], edi
    mov edi, [edi_sav]
    mov esi, [esi_sav]
    mov edx, [edx_sav]
    mov ecx, [ecx_sav]
    mov ebx, [ebx_sav]
    mov eax, [eax_sav]
%elifdef _PUSH
    push eax
    push ebx
    push ecx
    push edx
    push esi
    push edi
    pop edi
    pop esi
    pop edx
    pop ecx
    pop ebx
    pop eax
%else
%error 'must define _MOV or _PUSH'
%endif
    xor eax, eax
    cpuid
    rdtsc
    pop ebx                 ; low dword of the start count
    pop ecx                 ; high dword of the start count
    sub eax, ebx
    sbb edx, ecx            ; edx:eax = elapsed count
    call showeaxd
    xor ebx, ebx
    mov eax, 1              ; sys_exit
    int 80h
;---------------------------------
showeaxd:                   ; print eax as unsigned decimal on stdout
    push eax
    push ebx
    push ecx
    push edx
    push esi
    sub esp, 10h
    lea ecx, [esp + 12]
    mov ebx, 10
    xor esi, esi
    mov byte [ecx], 0
.top:
    dec ecx
    xor edx, edx
    div ebx
    add dl, '0'
    mov [ecx], dl
    inc esi
    or eax, eax
    jnz .top
    mov edx, esi
    mov ebx, 1
    mov eax, 4              ; sys_write
    int 80h
    add esp, 10h
    pop esi
    pop edx
    pop ecx
    pop ebx
    pop eax
    ret
;---------------------------------
On my K6-2 @ 550 MHz with Debian sarge (2.6) on my Asus board (Ali):
219 - 302 push/pop
115 - 116 mov/mov
    mov [eax_sav], eax
    mov [ebx_sav], ebx
    mov [ecx_sav], ecx
    mov [edx_sav], edx
    mov [esi_sav], esi
    mov [edi_sav], edi
In another arrangement:
    mov eax, [eax_sav]
    mov ebx, [ebx_sav]
    mov ecx, [ecx_sav]
    mov edx, [edx_sav]
    mov esi, [esi_sav]
    mov edi, [edi_sav]
19 - 77 mov/mov
Test with Knoppix 4.02 (live boot CD) with a GUI:
[AMD Tbred 2700+]
push/pop   76 - 132
mov/mov    76 - 111
mov/mov(2) 70 - 111
[AMD Palomino 1800+]
push/pop   76 - 134
mov/mov    76 - 112
mov/mov(2) 70 - 112
;-------------------
; Frank Kotler said:
Quote: Well, the lowest number I got on a AMD Duron 900 with MOV was 77; with
;-------------------
; sevagK said:
Quote: For this one, if you place the *_sav variables on the stack, you'll get
Dirk
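The figures above are given as ranges, which reads like the lowest and highest counts seen over several runs; that interpretation is an assumption. Under it, a small harness for collecting such a range could look like the C sketch below. Everything here is hypothetical: `tick_fn` stands in for whatever counter is available (the thread uses `rdtsc`), and `fake_now` / `fake_work` exist only so the sketch can be exercised without a hardware counter.

```c
#include <stdint.h>

typedef uint64_t (*tick_fn)(void);  /* hypothetical counter source, e.g. a TSC read */
typedef void (*work_fn)(void);      /* the code sequence being timed */

typedef struct { uint64_t min, max; } tick_range;

/* Run `work` `runs` times and report the smallest and largest elapsed count.
   With runs == 0 the result stays at { UINT64_MAX, 0 }. */
tick_range measure_range(work_fn work, tick_fn now, int runs)
{
    tick_range r = { UINT64_MAX, 0 };
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = now();
        work();
        uint64_t dt = now() - t0;
        if (dt < r.min) r.min = dt;
        if (dt > r.max) r.max = dt;
    }
    return r;
}

/* Stand-ins for demonstration: the first call "costs" 70 ticks, later ones 90. */
uint64_t fake_ticks = 0;
uint64_t fake_now(void) { return fake_ticks; }
void fake_work(void) { fake_ticks += (fake_ticks == 0) ? 70 : 90; }
```

Reporting both ends of the range, as the post does, at least shows how noisy a single measurement can be before any conclusion is drawn from it.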
revolution 21 Mar 2014, 10:34
Sorry, but artificial tests are not useful. In a real app the stack is shared by all the procedures. Anyhow, those figures are basically all the same as I see it. No real significant difference. But your ICache will suffer and a real app might start to show the problem. But even so, this type of thing would be very unlikely to make any noticeable difference unless you have some specific reason to optimise for one particular CPU/mobo/RAM combo and find that spending hours analysing and tuning timing results will then save you days or weeks in subsequent computing time.
edfed 21 Mar 2014, 10:49
Say... you can (should) also use the stack with mov when dealing with locals:
Code:
line:
    push ebp
    sub esp,6*4     ;reserve 6 dwords for my 6 local variables
    mov ebp,esp
    ...
    add esp,6*4
    pop ebp
    ret
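Why per-call locals matter for the threading argument in this thread can be sketched in C. This is a hypothetical illustration, not code from any poster: `line_state`, `line_begin`, and `line_next` are invented names, and the struct simply holds the six values that the earlier routine kept in fixed memory, plus the values it kept in registers. Because each caller owns its own `line_state` (for example on its own stack), two lines can be in progress at once without overwriting each other, which fixed global variables do not allow.

```c
typedef struct {
    int xinc1, xinc2, yinc1, yinc2, dmin, dmax; /* the six former globals */
    int x, y, err, left;                        /* values the asm keeps in registers */
} line_state;

/* Prepare to walk a line from (x, y) spanning xl by yl pixels (signed). */
void line_begin(line_state *s, int x, int y, int xl, int yl)
{
    int sx = 1, sy = 1;
    if (xl < 0) { xl = -xl; sx = -1; }
    if (yl < 0) { yl = -yl; sy = -1; }
    if (xl >= yl) {                  /* x is the master axis */
        s->xinc1 = sx; s->yinc1 = 0; s->xinc2 = 0; s->yinc2 = sy;
        s->dmax = xl; s->dmin = yl;
    } else {                         /* y is the master axis */
        s->xinc1 = 0; s->yinc1 = sy; s->xinc2 = sx; s->yinc2 = 0;
        s->dmax = yl; s->dmin = xl;
    }
    s->x = x; s->y = y;
    s->err = s->dmax >> 1;
    s->left = s->dmax + 1;           /* pixels still to emit */
}

/* Store the next pixel in *x, *y. Returns 0 once the line is finished. */
int line_next(line_state *s, int *x, int *y)
{
    if (s->left == 0)
        return 0;
    *x = s->x; *y = s->y;
    s->left--;
    s->x += s->xinc1; s->y += s->yinc1;      /* step on the master axis */
    s->err -= s->dmin;
    if (s->err < 0) {                        /* step on the slave axis */
        s->x += s->xinc2; s->y += s->yinc2;
        s->err += s->dmax;
    }
    return 1;
}
```

The same separation is what reserving the six dwords on the stack achieves in the assembly version: the state lives with the call, not with the routine.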
freecrac 21 Mar 2014, 12:32
revolution wrote: Sorry but artificial tests are not useful. In a real app the stack is shared by all the procedures. Anyhow those figures are basically all the same as I see it. No real significant difference. But your ICache will suffer and a real app might start to show the problem. But even so this type of thing would be very unlikely to make any noticeable difference unless you have some specific reason to optimise for one particular CPU/mobo/RAM combo and find that spending hours analysing and tuning timing results will then save you days or weeks in subsequent computing time.
From my point of view, it is mostly the older CPUs that profit from speed optimization. So I take a look at the clock cycles of the instructions:
Code:
80386 POP   4 clocks     mov 2 clocks
80386 PUSH  2 clocks     mov 2 clocks
---------------------    --------------
sum         6 clocks         4 clocks     difference 2 clocks
Code:
80486 POP   4 clocks     mov 1 clocks
80486 PUSH  1 clocks     mov 1 clocks
---------------------    --------------
sum         5 clocks         2 clocks     difference 3 clocks
;---------------------------------------
Here is startup code for extra cores in DOS (programmer: Allan Cruse). (But it needs a workaround to let them execute together.)
http://www.cs.usfca.edu/~cruse/cs630/mphello.s
Dirk
tthsqe 21 Mar 2014, 13:05
freecrac,
I think you are missing the big picture for the pixel (hehe). If the difference between push/pop and mov matters that much to your program, then inlining the function is probably a good idea. Plus, as revolution said, by writing to a fixed memory location, you abnegate the possibility of multithreaded drawing. BTW, modern CPUs have special stack engines that make stack operations fast.
freecrac 22 Mar 2014, 07:34
tthsqe wrote: freecrac,
Inline assembler for a high-level programming language?
Quote: Plus as revolution said, by writing to a fixed memory location, you abnegate the possibility of multithreaded drawing.
But we need to program the multithreading functionality first. At the moment I have no multithreading environment for DOS and no DOS application that uses multithreading. Do you have one?
Quote: BTW, modern cpu's have special stack engines that make stack operations fast.
Yes, with modern CPUs we can use a BASIC interpreter and a line routine written in BASIC, where every BASIC instruction has to be interpreted before it is executed, and in spite of that the line will be drawn faster than on an 80386 with a line routine written in assembler. This makes one thing clear: assembly language is more powerful on the older CPUs, and code optimization is less beneficial on modern CPUs than on the older ones, where the effect of optimizing is visible to a human observer.
Dirk
neville 22 Mar 2014, 09:25
freecrac wrote: Yes, with modern cpu's we can use a basic interpreter and a line routine written in the basic language where all basic instruction have to be interpreted before executing and in spite of that the line will be faster drawing as on an 80386 with a line routine written in assembler. And this makes one clear, to use the assembly language is more powerfull for the older CPUs and a code optimizing is lesser beneficial for the modern CPUs, but rather for the older CPUs, where the effect of optimizing is visible for a human observer.
BTW, thanks for posting Allan Cruse's multicore initialisation code too. I'll have a look at it soon.
_________________
FAMOS - the first memory operating system
revolution 22 Mar 2014, 14:32
neville wrote: I asked revolution the same thing in my post above, but there has been no response from him yet.
neville 23 Mar 2014, 01:03
revolution wrote: Ahem, him/her. Anyway, I think you missed the smiley I put after the statement discussing the other CPUs.
earlier, revolution wrote: So no chance of using multiple threads then. With using only global variables you make it impossible to make this really fast by taking advantage of the extra cores lying idle. :P
_________________
FAMOS - the first memory operating system
tthsqe 23 Mar 2014, 08:02
neville, are you saying that multithreading in DOS is not possible?!
I guess I made a donkey out of you and me when I just assumed that it was possible. If it is not possible, by all means make programs as thread-unsafe as you like. In that case the objections from my initial knee-jerk reaction are not valid. Also, you can't just assume that I am male. Also, if the professed benefits of mov over push/pop are so great, freecrack can still maintain some thread-safe habits by using
Code:
    sub esp,8*6
    mov [esp+8*0],esx
    mov [esp+8*1],edi
    ...
and the reverse at return.
sid123 23 Mar 2014, 08:47
Quote: neville, are you saying that multithreading in DOS is not possible?!
It is. You can initialize multiple cores in DOS; the only thing that gets in your way is the crap real mode addressing: you can't access memory above 1MB. I think the ACPI and I/O APIC tables are located between 3GB and 4GB (0xC0000000 to 0xFFFFFFFF I guess), which is seriously not reachable in RM. But who said it's not possible in DOS? All fasm people are aware of FRM (or Unreal, can't remember the difference; Tomasz talked about this in an old thread). The problem after initializing multiple cores will be the BS BIOS interrupts, which switch to Protected Mode and return to pure real mode, and then you're not in Unreal Mode at all.
Code:
    mov [esp+8*0],esx
ESX? lol. Is there any register like that in x86? Well, it does make me laugh for some reason or other.
Quote: Also, you can't just assume that I am male.
:Mad: Well, btw, not trying to become a revolution (pun intended), I am a he.
EDIT: Confirmed it.
OSDev.org Memory Map (x86) wrote:
_________________
"Those who can make you believe in absurdities can make you commit atrocities" -- Voltaire
https://github.com/Benderx2/R3X XD
neville 24 Mar 2014, 03:45
tthsqe wrote: I guess I made a donkey out of you and me when I just assumed that it was possible.
tthsqe wrote: Also, you can't just assume that I am male.
_________________
FAMOS - the first memory operating system
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.