flat assembler
Message board for the users of flat assembler.

Bresenham line routine for the linear framebuffer with 8 bpp (DOS forum)

freecrac 18 Mar 2014, 12:06
Hello, over the last few days I have reworked my line routine, and here is the result:
Code:
;--------------------------------------------------------
;  This is a Bresenham line subroutine for video modes
;  with 256 colors using the linear framebuffer (LFB).
;
;  This routine needs a table holding the start address
;  of each scanline of the LFB, placed at the beginning
;  of the data segment at OFFSET 0000.
;
;  The first entry of this address table has to be
;  the address of the LFB itself. For every following
;  entry we add the bytes per scanline of the resolution
;  in use, so that each entry in the table holds the
;  start address of one line on the screen.
;
;  The start and end coordinates and the color of the
;  line have to be placed in the following registers:
;  Startpoint:EBX,ESI  Endpoint:ECX,EDI  Color:AL
;
;  The end point itself is not drawn, and both points
;  must differ: with start = end the loop counter
;  starts at zero and wraps around.
;--------------------------------------------------------
LINE:     mov      edx, edi
          sub      ecx, ebx
          jl  T0
          add      ebx, [esi*4]
          sub      edx, esi
          jl  T1
          mov      esi, DWORD PTR[XMP1]
          cmp      ecx, edx
          jl  T2
          lea      edx, [edx+edx]
          mov      ebp, edx
          sub      edx, ecx
          mov      edi, edx
          sub      edx, ecx
;-------------------------------------
M00:      mov      [ebx], al
          and      edi, edi
          jge short M01
          lea      ebx, [ebx+1]
          lea      edi, [edi+ebp]
          dec      ecx
          jnz M00
          ret
;-------------------------------------
M01:      lea      ebx, [ebx+esi]
          lea      edi, [edi+edx]
          dec      ecx
          jnz M00
          ret
;-----------------------------------------------------
T0:       neg      ecx
          add      ebx, [esi*4]
          sub      edx, esi
          mov      esi, DWORD PTR[XMP1]
          jl  T01
          mov      esi, DWORD PTR[XMM1]
          cmp      ecx, edx
          jl  short T21
          lea      edx, [edx+edx]
          mov      ebp, edx
          sub      edx, ecx
          mov      edi, edx
          sub      edx, ecx
;-------------------------------------
M02:      mov      [ebx], al
          and      edi, edi
          jge short M03
          lea      ebx, [ebx-1]
          lea      edi, [edi+ebp]
          dec      ecx
          jnz M02
          ret
;-------------------------------------
M03:      lea      ebx, [ebx+esi]
          lea      edi, [edi+edx]
          dec      ecx
          jnz M02
          ret
;-----------------------------------------------------
T21:      lea      ecx, [ecx+ecx]
          mov      ebp, ecx
          sub      ecx, edx
          mov      edi, ecx
          sub      ecx, edx
;-------------------------------------
M04:      mov      [ebx], al
          and      edi, edi
          jge short M05
          add      ebx, DWORD PTR[XMAX]
          lea      edi, [edi+ebp]
          dec      edx
          jnz M04
          ret
;-------------------------------------
M05:      lea      ebx, [ebx+esi]
          lea      edi, [edi+ecx]
          dec      edx
          jnz M04
          ret
;-----------------------------------------------------
T01:      neg      edx
          cmp      ecx, edx
          jl  short T22
          lea      edx, [edx+edx]
          mov      ebp, edx
          sub      edx, ecx
          mov      edi, edx
          sub      edx, ecx
;-------------------------------------
M06:      mov      [ebx], al
          and      edi, edi
          jge short M07
          lea      ebx, [ebx-1]
          lea      edi, [edi+ebp]
          dec      ecx
          jnz M06
          ret
;-------------------------------------
M07:      sub      ebx, esi
          lea      edi, [edi+edx]
          dec      ecx
          jnz M06
          ret
;-----------------------------------------------------
T22:      lea      ecx, [ecx+ecx]
          mov      ebp, ecx
          sub      ecx, edx
          mov      edi, ecx
          sub      ecx, edx
;-------------------------------------
M08:      mov      [ebx], al
          and      edi, edi
          jge short M09
          sub      ebx, DWORD PTR[XMAX]
          lea      edi, [edi+ebp]
          dec      edx
          jnz M08
          ret
;-------------------------------------
M09:      sub      ebx, esi
          lea      edi, [edi+ecx]
          dec      edx
          jnz M08
          ret
;-----------------------------------------------------
T1:       neg      edx
          mov      esi, DWORD PTR[XMM1]
          cmp      ecx, edx
          jl  T12
          lea      edx, [edx+edx]
          mov      ebp, edx
          sub      edx, ecx
          mov      edi, edx
          sub      edx, ecx
;-------------------------------------
M10:      mov      [ebx], al
          and      edi, edi
          jge short M11
          lea      ebx, [ebx+1]
          lea      edi, [edi+ebp]
          dec      ecx
          jnz M10
          ret
;-------------------------------------
M11:      sub      ebx, esi
          lea      edi, [edi+edx]
          dec      ecx
          jnz M10
          ret
;-----------------------------------------------------
T12:      lea      ecx, [ecx+ecx]
          mov      ebp, ecx
          sub      ecx, edx
          mov      edi, ecx
          sub      ecx, edx
;-------------------------------------
M12:      mov      [ebx], al
          and      edi, edi
          jge short M13
          sub      ebx, DWORD PTR[XMAX]
          lea      edi, [edi+ebp]
          dec      edx
          jnz M12
          ret
;-------------------------------------
M13:      sub      ebx, esi
          lea      edi, [edi+ecx]
          dec      edx
          jnz M12
          ret
;-----------------------------------------------------
T2:       lea      ecx, [ecx+ecx]
          mov      ebp, ecx
          sub      ecx, edx
          mov      edi, ecx
          sub      ecx, edx
;-------------------------------------
M14:      mov      [ebx], al
          and      edi, edi
          jge short M15
          add      ebx, DWORD PTR[XMAX]
          lea      edi, [edi+ebp]
          dec      edx
          jnz M14
          ret
;-------------------------------------
M15:      lea      ebx, [ebx+esi]
          lea      edi, [edi+ecx]
          dec      edx
          jnz M14
          ret
;-----------------------------------------------------
;    Copyfree for all humans on planet earth
;-----------------------------------------------------
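
For readers who prefer a high-level view, the same idea (a scanline start table plus a Bresenham loop that only adds precomputed steps to a pointer) can be modelled in C. This is a minimal sketch under assumptions, not a translation of the routine above: the `Surface` struct and the names `init_rows` and `draw_line` are hypothetical, the framebuffer is an ordinary byte array, and a zero-length line simply draws nothing.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical surface description; all names are illustrative only. */
typedef struct {
    uint8_t  *base;      /* start of the (simulated) framebuffer         */
    uint32_t *row_start; /* byte offset of each scanline, like PIXTAB    */
    int       width;     /* visible pixels per line                      */
    int       height;    /* number of scanlines                          */
    int       pitch;     /* bytes per scanline, like XMAX                */
} Surface;

/* Fill the scanline table: entry y holds y * pitch. */
static void init_rows(Surface *s)
{
    for (int y = 0; y < s->height; y++)
        s->row_start[y] = (uint32_t)y * (uint32_t)s->pitch;
}

/* Draw from (x1,y1) up to, but not including, (x2,y2). */
static void draw_line(Surface *s, int x1, int y1, int x2, int y2, uint8_t color)
{
    assert(x1 >= 0 && x1 < s->width  && x2 >= 0 && x2 < s->width);
    assert(y1 >= 0 && y1 < s->height && y2 >= 0 && y2 < s->height);

    int dx = abs(x2 - x1);
    int dy = abs(y2 - y1);
    int step_x = (x2 >= x1) ? 1 : -1;                /* one pixel left/right */
    int step_y = (y2 >= y1) ? s->pitch : -s->pitch;  /* one scanline up/down */

    /* The major axis always advances; the minor axis advances on demand. */
    int major = dx, minor = dy, step_major = step_x, step_minor = step_y;
    if (dx < dy) {
        major = dy; minor = dx; step_major = step_y; step_minor = step_x;
    }

    uint8_t *p = s->base + s->row_start[y1] + x1;
    int err = 2 * minor - major;                     /* Bresenham decision term */

    for (int n = major; n > 0; n--) {
        *p = color;
        if (err >= 0) {                              /* diagonal step */
            p += step_major + step_minor;
            err += 2 * (minor - major);
        } else {                                     /* straight step */
            p += step_major;
            err += 2 * minor;
        }
    }
}
```

The combined `step_major + step_minor` plays the role of the precomputed XMP1 and XMM1 values: a diagonal move costs a single pointer addition.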
    

Below I show how to use the line routine.
Code:
;----------------------------
; Example of drawing a line in a VBE video mode with a horizontal resolution
; of 1920 and a vertical resolution of 1200 with 256 colors at 8 bits per pixel.
; (Hint: This example only shows which instructions and data are needed to use
; the line routine; the code has to be placed inside a working application.)
;----------------------------
     RES_X =  1920           ; horizontal resolution
     RES_Y =  1200           ; vertical resolution
     Color =  0Eh
;----------------------------
.DATA
;----
; The following table of LFB start addresses, one per scanline,
; has to be placed at offset 0000 in the data segment
;----
PIXTAB    DD RES_Y dup (0)   ; table of start addresses (used by the line routine)
XMAX      DD 0               ; ScanLine (used by the line routine)
XMP1      DD 0               ; ScanLine + 1 (used by the line routine)
XMM1      DD 0               ; ScanLine - 1 (used by the line routine)
;----
VBEINFO   DB 512 dup (0)     ; Buffer for VBE 4F00 Vbe Info Block
MODEINFO  DB 256 dup (0)     ; Buffer for VBE 4F01 Mode Info Block
;----------------------------
.code
          mov      ax, @DATA           ; segment address of the data segment
          mov      ds, ax
          mov      es, ax

          mov      di, OFFSET VBEINFO  ; get the Vbe Info Block
          mov      ax, 4F00h           ; es:di 512 byte
          int    10h                   ; Function 00h - Return VBE Controller Information
          cmp      ax, 4Fh
          jnz ERROR       ; ERROR has to print an error message and terminate the program (not shown)

          mov      dl, [di+5]          ; major version number of VBE
          cmp      dl, 2               ; version 2?
          jb  ERROR

; Get the VBE modenumber from the VBE modetable

          lfs      si, [di+0Eh]        ; VbeFarPtr to VideoModeList
GETMode:  mov      cx, fs:[si]         ; get the mode number
          lea      si, [si+2]
          cmp      cx, 0FFFFh          ; end of modelist ?
          jz  ERROR

          add      cx, 4000h           ; mode number + linear access
          mov      di, OFFSET MODEINFO
          mov      ax, 4F01h           ; get the Mode Info Block
          int    10h                   ; Function 01h - Return VBE Mode Information
          cmp      ax, 4Fh
          jnz ERROR

; Now we have to find the mode number that provides our desired resolution

          cmp     WORD PTR[di+12h], RES_X ; horizontal resolution in pixels
          jnz GETMode
          cmp     WORD PTR[di+14h], RES_Y ; vertical resolution in pixels
          jnz GETMode
          cmp     BYTE PTR[di+19h], 8     ; bits per pixel
          jnz GETMode
          test    WORD PTR[di], 80h       ; Linear frame buffer mode
          jz  GETMode
          cmp     DWORD PTR[di+28h], 0    ; physical address for flat memory frame buffer
          jz  GETMode

; Set the VBE mode

          mov      ax, 4F02h
          mov      bx, cx               ; modenumber             
          int    10h
          cmp      ax, 4Fh
          jnz ERROR

; create a table of start addresses of each line on the screen

          mov      si, OFFSET MODEINFO
          xor      ebx, ebx
          mov      bx, ds
          mov      eax, [si+28h]        ; Address of the LFB
          shl      ebx, 4
          sub      eax, ebx             ; LFB = ds:reg32

          xor      edx, edx
          mov      dx, [si+32h]         ; LinBytesPerScanLine
          mov     DWORD PTR[XMAX], edx
          mov      ebx, edx
          inc      ebx
          mov     DWORD PTR[XMP1], ebx  ; LinBytesPerScanLine + 1
          sub      ebx, 2
          mov     DWORD PTR[XMM1], ebx  ; LinBytesPerScanLine - 1

          xor      ecx, ecx
          mov      cx, [si+14h]         ; vertical resolution in pixels
          shl      ecx, 2               ; vertical resolution in pixels * 4

          xor      edi, edi             ; OFFSET 0000
AGAIN:    mov      [di], eax            ; fill the address table
          add      edi, 4
          add      eax, edx             ; plus scanline
          cmp      edi, ecx             ; vertical resolution in pixels * 4 ?
          jb  AGAIN

;-----------
;     ----  Switching to BIG REAL MODE  ----
; extend the limit of the DS segment to 4 GB + enable the A20 address line

          call BIGREALMODE   ; this subroutine is not part of this example

;-----------

; Drawing a line from the upper left corner to the lower right corner
; of the screen using a resolution of 1920 x 1200 with 8 bit color.

; Make sure that the start and end coordinates passed to the line routine
; stay inside the screen: the largest valid x/y coordinate is the horizontal
; or vertical resolution minus one. For the resolution of 1920x1200 the
; highest coordinates are X=1919 Y=1199.

          mov      ebx, 0      ; X1-Position
          mov      esi, 0      ; Y1-Position
          mov      ecx, 1919   ; X2-Position
          mov      edi, 1199   ; Y2-Position
          mov      al, Color
          call LINE
;-----------    
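
The mode search above tests a handful of fields at fixed offsets in the 256-byte Mode Info Block. The following C sketch restates those checks so the offsets are easier to follow. It assumes the block has already been copied into a byte array; the helper names (`rd16`, `rd32`, `mode_matches`, `lin_bytes_per_scanline`) are hypothetical, while the offsets are the ones used in the assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets into the VBE Mode Info Block, as used by the assembly example. */
enum {
    OFF_MODE_ATTR = 0x00,  /* ModeAttributes (word)        */
    OFF_XRES      = 0x12,  /* XResolution (word)           */
    OFF_YRES      = 0x14,  /* YResolution (word)           */
    OFF_BPP       = 0x19,  /* BitsPerPixel (byte)          */
    OFF_PHYS_BASE = 0x28,  /* PhysBasePtr (dword)          */
    OFF_LIN_PITCH = 0x32,  /* LinBytesPerScanLine (word)   */
    ATTR_LFB      = 0x80   /* linear framebuffer available */
};

/* Read little-endian values byte by byte, independent of host byte order. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Nonzero if the mode has the wanted size and depth and offers an LFB. */
static int mode_matches(const uint8_t *info, int width, int height, int bpp)
{
    assert(info != 0);
    return rd16(info + OFF_XRES) == width
        && rd16(info + OFF_YRES) == height
        && info[OFF_BPP] == bpp
        && (rd16(info + OFF_MODE_ATTR) & ATTR_LFB) != 0
        && rd32(info + OFF_PHYS_BASE) != 0;
}

/* Bytes per scanline in linear modes; may be larger than the width. */
static uint16_t lin_bytes_per_scanline(const uint8_t *info)
{
    return rd16(info + OFF_LIN_PITCH);
}
```

Because the pitch can exceed the visible width, the scanline table should be built from this value and not from RES_X, which is what the assembly does.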

Dirk
sid123 19 Mar 2014, 02:26
Cool :D
edfed 19 Mar 2014, 09:48
you can add a sort of "out of bounds" check that makes putpixel ignore the pixel :)

if the result of the pixel coordinate computation is less than 0, or more than 1920*1200 - 1, don't draw :)

then you don't have to care about the bounds of your line.

and with a little more computation, you can also readjust the coordinates to fit the screen :), but that needs some mul and div instructions in the init phase, just after the bounds check ;)

a putpixel routine lets you play with the screen without having to care about the specifics of the framebuffer.

Code:
mov ebx,x
mov ecx,y
mov eax,color
call putpixel
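
A bounds-checked putpixel of this kind could look as follows in C. This is only a sketch: the `Canvas` struct and `put_pixel` are hypothetical names, and the check is done on x and y separately instead of on the computed offset, which is a swapped-in variant that also rejects pixels that would otherwise wrap around into the neighbouring scanline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical drawing target; names are illustrative only. */
typedef struct {
    uint8_t *pixels;  /* first byte of the buffer        */
    int      width;   /* visible pixels per line         */
    int      height;  /* number of lines                 */
    int      pitch;   /* bytes from one line to the next */
} Canvas;

/* Write one 8 bpp pixel; returns 1 if drawn, 0 if it was outside. */
static int put_pixel(Canvas *c, int x, int y, uint8_t color)
{
    assert(c != 0 && c->pixels != 0);
    if (x < 0 || x >= c->width || y < 0 || y >= c->height)
        return 0;                                   /* silently ignore */
    c->pixels[(size_t)y * (size_t)c->pitch + (size_t)x] = color;
    return 1;
}
```

With such a routine the caller no longer has to clip, at the price of four comparisons per pixel.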
    
freecrac 20 Mar 2014, 09:25
Hello, and thanks to all for the feedback.
edfed wrote:
you can add a sort of "out of bound" put pixel ignore :) if the result of the pixel coordinate computation is less than 0, or more than 1920*1200 - 1, don't draw :) then, don't care about the bounds of your line.

and with a little more computations, you can also readjust the coordinates to fit in the screen :), but it needs some mul and div instructions in the init phase, just after the bound checking condition ;)

Yes, it is possible, but I think my own main routines that call the line routine have no need to use coordinates outside the resolution. So if it happens, the main routine contains a wrong calculation, and I have to find and fix it.

Quote:
a putpixel routine can let you play with the screen, without having anything to care about the specificities of the framebuffer.

Code:
mov ebx,x
mov ecx,y
mov eax,color
call putpixel
    

I think calling a subroutine to draw every single pixel costs more clock cycles, so this method works against the speed optimization of using as few and as simple instructions as possible, which can be paired and executed together in two or more instruction pipelines.

Maybe some additional NOP instructions between the calculating instructions can prevent dependencies between some of the instructions, for example when they read a register right after writing it. They can also be used to align the target addresses of the conditional jumps to 16 bytes.

So far I have not tested assembling and using the line routine in 32-bit mode. I hope it is possible as well.

Dirk
edfed 20 Mar 2014, 13:17
it results in more clock cycles, but you have plenty of them... in fact, the computer spends 99% of its cpu clocks doing nothing...

later, the putpixel function can be implemented in any way you want (opengl, write to a buffer, send to the network, ...). the abstraction that putpixel gives you is clearly magic and lets you do many more things than when the putpixel code is inside the line, image or any other function.

you can also scale, rotate, do 3D projection, etc... just because you have a putpixel routine shared by all graphics functions, and you'll just have to modify the putpixel routine to make all your graphics fit the new implementation :)

and the Bresenham algo likes the putpixel routine a lot. :)

here is mine:
Code:
;transparent return
line:
.call=0
.x=4
.y=8
.xl=12
.yl=16
.c=20
        push eax ebx ecx edx esi edi

        mov ecx,1               ;set the increments
        mov edx,1
        xor edi,edi             ;set the 0 value
        mov eax,[esi+.xl]       ;load the length of the line; a conventional Bresenham algo computes it, here it is given directly, so if i replace the line function by a rectangle function, it will trace the bounding box of the line, etc... it's cool :)
        mov ebx,[esi+.yl]
        cmp eax,edi             ;compare the x length with 0
        jge @f                ;if it is negative
        neg eax                 ;negate the length
        neg ecx                 ;and the increment
@@:
        cmp ebx,edi             ;compare the y length with 0
        jge @f                ;if it is negative
        neg ebx                 ;negate the length
        neg edx                 ;and the increment
@@:
        cmp eax,ebx             ;compare the lengths
        jl .isy                 ;and set the increments as needed
.isx:                         ;x is the master axis, means the pixel will always move by one on this axis
        mov [.xinc1],ecx        ;set the first x increment
        mov [.xinc2],edi        ;the second is 0
        mov [.yinc1],edi        ;the first y increment is 0
        mov [.yinc2],edx        ;the second is to set
        mov [.dmax],eax         ;remember the delta values
        mov [.dmin],ebx         ;the min and the max
        jmp @f
.isy:                         ;y is the master axis
        mov [.xinc1],edi        ;same as above, but for the y axis
        mov [.xinc2],ecx
        mov [.yinc1],edx
        mov [.yinc2],edi
        mov [.dmax],ebx
        mov [.dmin],eax
@@:
        mov eax,[esi+.x]        ;load the first point
        mov ebx,[esi+.y]        ; in eax and ebx
        mov edx,[.dmax]         ;load the maximal delta
        shr edx,1               ;divide it by 2 to have a symmetric line
        mov cl,[esi+.c]         ;load the color
        mov edi,[.dmax]         ;load the maximal delta again, can be made from the edx value before the shift, save some time...
@@:
        call pixel              ;put the pixel at eax,ebx with color cl
        dec edi                 ;decrement the maximal delta (means the number of pixel of the line)
        jl @f                   ;if it is negative, it is the end, then, go out, nothing more to see
        add eax,[.xinc1]        ;increment the x coordinate
        add ebx,[.yinc1]        ;increment the y coordinate
        sub edx,[.dmin]         ;iterative division on the delta max by the delta min
        jge @b                  ;if the result is not negative, continue the first loop
        add eax,[.xinc2]        ;otherwise, the second loop is there to make move the pixel on the slave axis
        add ebx,[.yinc2]
        add edx,[.dmax]         ;the iterative division restarts
        jmp @b                  ;and continue the loop
@@:
        pop edi esi edx ecx ebx eax

        ret                     ;end of the algo

align 4
.xinc1  rd 1                    ;these are local variables; i didn't put them on the stack. later i will, because the ebp register is free, since i use a putpixel routine
.xinc2  rd 1                    ;but i dislike the stack for data, i prefer the stack just for the execution flow...
.yinc1  rd 1                    ;but only dumb people never change their opinion.
.yinc2  rd 1                    ;so these 6 ugly locals (that are global) will be replaced by stack variables. :)
.dmin   rd 1
.dmax   rd 1
;good bye
    


putpixel is to graphics what putchar is to text. printstring will use putchar, not reimplement it.
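
The structure of the routine above (two increment pairs, an error term that starts at half the larger delta, and one pixel call per step) can be written in C roughly as follows. This is a sketch under assumptions: `PixelFn`, `Trace`, `record_pixel` and `line_via_pixel` are hypothetical names, and the callback here just records coordinates so the result can be inspected.

```c
#include <assert.h>
#include <stdint.h>

/* Any pixel sink can be plugged in through this callback type. */
typedef void (*PixelFn)(void *ctx, int x, int y, uint8_t color);

/* Example sink that only records which pixels were requested. */
typedef struct {
    struct { int x, y; } pt[64];
    int count;
} Trace;

static void record_pixel(void *ctx, int x, int y, uint8_t color)
{
    Trace *t = (Trace *)ctx;
    (void)color;                       /* the color is not stored here */
    assert(t->count < 64);
    t->pt[t->count].x = x;
    t->pt[t->count].y = y;
    t->count++;
}

/* Line from (x,y) with signed lengths xl and yl, end point included. */
static void line_via_pixel(PixelFn pixel, void *ctx,
                           int x, int y, int xl, int yl, uint8_t color)
{
    int sx = 1, sy = 1;
    if (xl < 0) { xl = -xl; sx = -1; }
    if (yl < 0) { yl = -yl; sy = -1; }

    int xinc1, yinc1, xinc2, yinc2, dmax, dmin;
    if (xl >= yl) {                    /* x is the master axis */
        xinc1 = sx; yinc1 = 0; xinc2 = 0; yinc2 = sy; dmax = xl; dmin = yl;
    } else {                           /* y is the master axis */
        xinc1 = 0; yinc1 = sy; xinc2 = sx; yinc2 = 0; dmax = yl; dmin = xl;
    }

    int err = dmax / 2;                /* start in the middle: symmetric line */
    int n = dmax;
    for (;;) {
        pixel(ctx, x, y, color);
        if (--n < 0)
            break;
        x += xinc1;                    /* always move on the master axis */
        y += yinc1;
        err -= dmin;
        if (err < 0) {                 /* time to move on the slave axis */
            x += xinc2;
            y += yinc2;
            err += dmax;
        }
    }
}
```

Swapping `record_pixel` for a framebuffer writer, a clipping writer or a network sender changes the output target without touching the line logic, which is the point being made about putpixel.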
freecrac 20 Mar 2014, 19:15
Hello.
edfed wrote:
it results in more clock cycles, but you have plenty of them... in fact, the computer spends 99% of its cpu clocks doing nothing...

While my line routine is calculating and drawing a line, it is not the entire CPU that spends 99% of its time doing nothing, but only a part of the CPU.

Quote:
later, the putpixel function can be implemented in any way you want (opengl, write in a buffer, send to the network, ...). the abstraction induced by the usage of putpixel is clearly magic and can let do many many more things than when the putpixel routine in inside the line, image or anything else function. you can also scale, rotate, 3D projection, etc... just because you have a putpixel routine shared by all graphics functions, and you'll just have to modify the putpixel routine to make all your graphics fit the new implementation :)

and the Bresenham algo likes a lot the putpixel routine. :)


I do not like to use only one routine for all situations. I prefer to have several specialized routines, maybe each for only one purpose, because we have enough memory for more than one routine.

...

I have also written a fractal routine for the Linux framebuffer device (fb0).
This routine is for two Linux PCs. One PC sends some start parameters for the fractal calculation to the second PC via the network. Then the first PC begins to calculate the first line and the second PC begins to calculate the second line of the same fractal picture. After both lines have been calculated and stored in a buffer, the second PC sends its complete scanline of 4096 bytes to the first PC. The first PC writes both lines directly into its own framebuffer device, so we can watch the progress on the monitor, two lines at a time.

Both PCs use an AMD K6-2 @ 550 MHz and a Fast Ethernet card.
The PC that shows the picture on the screen uses a Matrox PCI (4 MB) card and a Linux live CD that boots the framebuffer device in 1024x768 with 32-bit color by default.

I think it is a bad ratio to send only one pixel via the network.

Quote:

here is mine:
Code:
;transparent return
line:
.call=0
.x=4
.y=8
.xl=12
.yl=16
.c=20
        push eax ebx ecx edx esi edi
;        push dword[esi+.x] dword[esi+.y] dword[esi+.xl] dword[esi+.yl]
;        or edi,edi
;        je @f
;        call addxy
;@@:
        mov ecx,1
        mov edx,1
        xor edi,edi
        mov eax,[esi+.xl]
        mov ebx,[esi+.yl]
        cmp eax,edi
        jge @f
        neg eax
        neg ecx
@@:
        cmp ebx,edi
        jge @f
        neg ebx
        neg edx
@@:
        cmp eax,ebx
        jl .isy
.isx:
        mov [.xinc1],ecx
        mov [.xinc2],edi
        mov [.yinc1],edi
        mov [.yinc2],edx
        mov [.dmax],eax
        mov [.dmin],ebx
        jmp @f
.isy:
        mov [.xinc1],edi
        mov [.xinc2],ecx
        mov [.yinc1],edx
        mov [.yinc2],edi
        mov [.dmax],ebx
        mov [.dmin],eax
@@:
        mov eax,[esi+.x]
        mov ebx,[esi+.y]
        mov edx,[.dmax]
        shr edx,1
        mov cl,[esi+.c]
        mov edi,[.dmax]
@@:
        call pixel
        dec edi
        jl @f
        add eax,[.xinc1]
        add ebx,[.yinc1]
        sub edx,[.dmin]
        jge @b
        add eax,[.xinc2]
        add ebx,[.yinc2]
        add edx,[.dmax]
        jmp @b
@@:
;        pop dword[esi+.yl] dword[esi+.xl] dword[esi+.y] dword[esi+.x]
        pop edi esi edx ecx ebx eax
        ret
align 4
.xinc1  rd 1
.xinc2  rd 1
.yinc1  rd 1
.yinc2  rd 1
.dmin   rd 1
.dmax   rd 1
    


putpixel is to graphics what putchar is to text. printstring will use putchar, not reimplement it.

Yes, I can see, your routine is a Swiss army knife.

But I do not like to push register values onto the stack. I only save a register when it is really needed, and then to a known memory location in the data segment, but that is rarely necessary. If possible I use other registers to minimize RAM access. On older CPUs before the Pentium 4, mov/mov instructions are faster than push/pop.

Dirk
revolution 20 Mar 2014, 19:52
freecrac wrote:
But i do not like to push register values to the stack. I only save a register if it is really needed to a known memory location into the data segment, ...
So no chance of using multiple threads then. With using only global variables you make it impossible to make this really fast by taking advantage of the extra cores lying idle. :P
freecrac wrote:
... but this is not always and more rarely. If it is possible i use other registers for to minimize the ram access. For push/pop versus mov/mov on older CPUs before Pentium 4 the mov/mov instructions are faster.
I doubt that for two reasons 1) push/pop use less space in the ICache and 2) stack values are usually already in the DCache. Cache usage is probably the most important thing here. Individual instruction timings will be overwhelmed by the much slower bus transactions to the main memory and video memory. I think you could be optimising for the wrong things and ignoring the things that can make a much larger difference to the performance. Look into how you can stream the data to the video RAM. Use the internal DCache to store intermediate steps and then stream it all to screen. You could get an order of magnitude increase if you get it right.

BTW: Do you have timing data to show you how long it takes to run? This is an important piece of information to guide you when trying to "make it run faster".
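
The suggestion to draw into ordinary cached memory first and then send the finished picture to video RAM can be sketched in C like this. It is an illustration only: `present` is a hypothetical name, both buffers are plain arrays here, and whether this is actually faster depends on the hardware and has to be measured.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy a finished off-screen image to the framebuffer, one scanline at a
   time. The two buffers may use different pitches (bytes per scanline). */
static void present(uint8_t *dst, int dst_pitch,
                    const uint8_t *src, int src_pitch,
                    int width, int height)
{
    assert(width <= dst_pitch && width <= src_pitch);
    for (int y = 0; y < height; y++) {
        /* One long sequential write per line instead of scattered bytes. */
        memcpy(dst + (size_t)y * (size_t)dst_pitch,
               src + (size_t)y * (size_t)src_pitch,
               (size_t)width);
    }
}
```

All line drawing then happens in the off-screen buffer, and only `present` touches video memory.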
neville 21 Mar 2014, 08:07
revolution wrote:
So no chance of using multiple threads then. With using only global variables you make it impossible to make this really fast by taking advantage of the extra cores lying idle. :P
freecrac has kindly posted his code in the DOS category. So do you know how to use extra cores in DOS? Or in Big Real DOS? Please tell, I would be very interested :o

_________________
FAMOS - the first memory operating system
freecrac 21 Mar 2014, 10:22
Hello.
revolution wrote:
freecrac wrote:
But i do not like to push register values to the stack. I only save a register if it is really needed to a known memory location into the data segment, ...
So no chance of using multiple threads then. With using only global variables you make it impossible to make this really fast by taking advantage of the extra cores lying idle. :P

Yes, that is true. At the moment I am not thinking about implementing multithreading or about starting up other cores for multiprocessing, so on multicore CPUs the other cores lie idle.

Quote:
freecrac wrote:
... but this is not always and more rarely. If it is possible i use other registers for to minimize the ram access. For push/pop versus mov/mov on older CPUs before Pentium 4 the mov/mov instructions are faster.
I doubt that for two reasons 1) push/pop use less space in the ICache and 2) stack values are usually already in the DCache. Cache usage is probably the most important thing here. Individual instruction timings will be overwhelmed by the much slower bus transactions to the main memory and video memory.

But if we load a value from the data segment, it also ends up in the DCache, so there is only a marginal difference compared to getting a value from the stack. As a side effect, the stack size can be reduced. It is also possible to push and pop a value to/from the stack using only mov instructions plus decreasing/increasing the stack pointer.

Quote:
I think you could be optimising for the wrong things and ignoring the things that can make a much larger difference to the performance.

I am not fully ignoring it, but to start up and use the other cores I would need to learn more about how to handle a context switch and such things, and it is really not so simple to build a multiprocessing kernel only for one application. (I am not planning to build my own OS.)

Quote:
Look into how you can stream the data to the video RAM. Use the internal DCache to store intermediate steps and then stream it all to screen. You could get an order of magnitude increase if you get it right.


If we want to copy the entire screen to the framebuffer, it is simple to draw the line into another address area in RAM by recalculating the address table for the line routine. But I have never used the memory type range registers (MTRR) to set write combining, because there are different ways to do it depending on the architecture. Some use the page attribute table (PAT) and others have no PAT.

Quote:
BTW: Do you have timing data to show you how long it takes to run? This is an important piece of information to guide you when trying to "make it run faster".

Yes, we did some tests for push/pop vs mov/mov with a simple program written by Frank Kotler:
Code:
; nasm -f elf pushvsmov.asm -d_MOV (or "-d_PUSH")
; ld -o pushvsmov pushvsmov.o

global _start

section .bss
     eax_sav resd 1
     ebx_sav resd 1
     ecx_sav resd 1
     edx_sav resd 1
     esi_sav resd 1
     edi_sav resd 1

section .text
_start:
     nop
     xor eax, eax
     cpuid
     rdtsc
     push edx
     push eax
%ifdef _MOV
     mov [eax_sav], eax
     mov [ebx_sav], ebx
     mov [ecx_sav], ecx
     mov [edx_sav], edx
     mov [esi_sav], esi
     mov [edi_sav], edi

     mov edi, [edi_sav]
     mov esi, [esi_sav]
     mov edx, [edx_sav]
     mov ecx, [ecx_sav]
     mov ebx, [ebx_sav]
     mov eax, [eax_sav]
%elifdef _PUSH
     push eax
     push ebx
     push ecx
     push edx
     push esi
     push edi

     pop edi
     pop esi
     pop edx
     pop ecx
     pop ebx
     pop eax
%else
     %error 'must define _MOV or _PUSH'
%endif

     xor eax, eax
     cpuid
     rdtsc
     pop ebx
     pop ecx
     sub eax, ebx
     sbb edx, ecx

     call showeaxd

     xor ebx, ebx
     mov eax, 1
     int 80h

;---------------------------------
showeaxd:
     push eax
     push ebx
     push ecx
     push edx
     push esi

     sub esp, 10h
     lea ecx, [esp + 12]
     mov ebx, 10
     xor esi, esi
     mov byte [ecx], 0
.top:
     dec ecx
     xor edx, edx
     div ebx
     add dl, '0'
     mov [ecx], dl
     inc esi
     or eax, eax
     jnz .top

     mov edx, esi
     mov ebx, 1
     mov eax, 4
     int 80h

     add esp, 10h

     pop esi
     pop edx
     pop ecx
     pop ebx
     pop eax

     ret
;---------------------------------    

On my K6-2 @ 550 MHz with Debian Sarge (2.6) on my Asus board (Ali), measured in TSC clock cycles:
219 - 302 push/pop
115 - 116 mov/mov

mov [eax_sav], eax
mov [ebx_sav], ebx
mov [ecx_sav], ecx
mov [edx_sav], edx
mov [esi_sav], esi
mov [edi_sav], edi

In another arrangement:
mov eax, [eax_sav]
mov ebx, [ebx_sav]
mov ecx, [ecx_sav]
mov edx, [edx_sav]
mov esi, [esi_sav]
mov edi, [edi_sav]

19 - 77 mov/mov

Test with Knoppix 4.02(Live-Boot-CD) with a gui:
[AMD Tbred 2700+]
push/pop 76 - 132
mov/mov 76 - 111
mov/mov(2) 70 - 111

[AMD Palomino 1800+]
push/pop 76 - 134
mov/mov 76 - 112
mov/mov(2) 70 - 112

;-------------------
; Frank Kotler said:
Quote:
Well, the lowest number I got on a AMD Duron 900 with MOV was 77; with
PUSH was 95. On a Pentium MMX 233 the MOV was 250 and the PUSH was 252.

I added a count-to-40 loop to your NASM code and got these numbers:

AMD
MOV 356
PUSH 483

PENTIUM
MOV 546
PUSH 455

;-------------------
; sevagK said:
Quote:
For this one, if you place the *_sav variables on the stack, you'll get
different results. On my system (AMD64 3400), push/pop (and
pushad/popad) are faster than using the mov with the static memory.
Using mov with stack storage is faster than all of them.

pushad/popad had equivalent results to pushing/popping individually.

Dirk
revolution 21 Mar 2014, 10:34
Sorry but artificial tests are not useful. In a real app the stack is shared by all the procedures. Anyhow those figures are basically all the same as I see it. No real significant difference. But your ICache will suffer and a real app might start to show the problem. But even so this type of thing would be very unlikely to make any noticeable difference unless you have some specific reason to optimise for one particular CPU/mobo/RAM combo and find that spending hours analysing and tuning timing results will then save you days or weeks in subsequent computing time.
Post 21 Mar 2014, 10:34
edfed



Joined: 20 Feb 2006
Posts: 4347
Location: Now
edfed 21 Mar 2014, 10:49
say... you can (should) also use the stack with mov when dealing with locals Smile

Code:
line:
push ebp
sub esp,6*4;reserve 6 dwords for my 6 local variables
mov ebp,esp
...
add esp,6*4
pop ebp
ret
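
A minimal sketch of how such locals might then be addressed with plain mov, assuming six dword slots and fasm syntax; the routine name and the slot assignments are hypothetical:

Code:
line_locals:
push ebp
sub esp,6*4            ;reserve 6 dword locals
mov ebp,esp            ;ebp -> local 0
mov [ebp+0*4],ebx      ;local 0 = start x (illustrative)
mov [ebp+1*4],esi      ;local 1 = start y (illustrative)
mov eax,[ebp+0*4]      ;read a local back with a plain mov
add esp,6*4            ;release the locals
pop ebp
ret

Because every call gets its own slots, nothing here depends on fixed memory locations.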
    
Post 21 Mar 2014, 10:49
freecrac



Joined: 19 Oct 2011
Posts: 117
Location: Germany Hamburg
freecrac 21 Mar 2014, 12:32
revolution wrote:
Sorry but artificial tests are not useful. In a real app the stack is shared by all the procedures. Anyhow those figures are basically all the same as I see it. No real significant difference. But your ICache will suffer and a real app might start to show the problem. But even so this type of thing would be very unlikely to make any noticeable difference unless you have some specific reason to optimise for one particular CPU/mobo/RAM combo and find that spending hours analysing and tuning timing results will then save you days or weeks in subsequent computing time.

From my point of view, older CPUs benefit the most from speed optimization. So I took a look at the clock cycles of the instructions:
Code:
80386 POP   4 clocks     mov  2 clocks
80386 PUSH  2 clocks     mov  2 clocks
---------------------   --------------
sum         6 clocks          4 clocks           difference 2 clocks    

Code:
80486 POP   4 clocks     mov  1 clocks
80486 PUSH  1 clocks     mov  1 clocks
---------------------   --------------
sum         5 clocks          2 clocks           difference 3 clocks    
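
Scaled to the six registers saved in the listings above (an assumption; the totals simply multiply the per-pair sums and ignore memory and cache effects):
Code:
80386   6 x 6 clocks = 36     6 x 4 clocks = 24     difference 12 clocks
80486   6 x 5 clocks = 30     6 x 2 clocks = 12     difference 18 clocks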

;---------------------------------------

Here is startup code for the extra cores in DOS:
(programmer: ALLAN CRUSE)
(But it needs a workaround to let them execute together.)
http://www.cs.usfca.edu/~cruse/cs630/mphello.s

Dirk
Post 21 Mar 2014, 12:32
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 21 Mar 2014, 13:05
freecrac,
I think you are missing the big picture for the pixel (hehe).
If the difference between push/pop and mov is big to your program, then probably inlining the function is a good idea.
Plus, as revolution said, by writing to a fixed memory location, you abnegate the possibility of multithreaded drawing.
BTW, modern cpu's have special stack engines that make stack operations fast.
Post 21 Mar 2014, 13:05
freecrac



Joined: 19 Oct 2011
Posts: 117
Location: Germany Hamburg
freecrac 22 Mar 2014, 07:34
tthsqe wrote:
freecrac,
I think you are missing the big picture for the pixel (hehe).
If the difference between push/pop and mov is big to your program, then probably inlining the function is a good idea.

Inline assembler for a high level programming language?

Quote:
Plus as revolution said, by writing to a fixed memory location, you abnegate the possibility of multithreaded drawing.

But we need to program the multithreading functionality first. At the moment I have no multithreading environment for DOS and no DOS application that uses multithreading. Do you have one?

Quote:
BTW, modern cpu's have special stack engines that make stack operations fast.

Yes, with modern CPUs we can use a BASIC interpreter and a line routine written in BASIC, where every BASIC instruction has to be interpreted before it executes, and in spite of that the line will be drawn faster than on an 80386 with a line routine written in assembler. This makes one thing clear: assembly language is more powerful on the older CPUs, and code optimization is less beneficial on modern CPUs than on older ones, where the effect of optimizing is visible to a human observer.

Dirk
Post 22 Mar 2014, 07:34
neville



Joined: 13 Jul 2008
Posts: 507
Location: New Zealand
neville 22 Mar 2014, 09:25
freecrac wrote:
Yes, with modern CPUs we can use a BASIC interpreter and a line routine written in BASIC, where every BASIC instruction has to be interpreted before it executes, and in spite of that the line will be drawn faster than on an 80386 with a line routine written in assembler. This makes one thing clear: assembly language is more powerful on the older CPUs, and code optimization is less beneficial on modern CPUs than on older ones, where the effect of optimizing is visible to a human observer.

Dirk
Very well explained! And very patiently too Wink Also I hope tthsqe does have a multicore DOS environment and that he shares it with us. I asked revolution the same thing in my post above, but there has been no response from him yet.

BTW, thanks for posting Allan Cruse's multicore initialisation code too. I'll have a look at it soon.

_________________
FAMOS - the first memory operating system
Post 22 Mar 2014, 09:25
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20363
Location: In your JS exploiting you and your system
revolution 22 Mar 2014, 14:32
neville wrote:
I asked revolution the same thing in my post above, but there has been no response from him yet.
Ahem, him/her. Anyway, I think you missed the smiley I put after the statement discussing the other CPUs.
Post 22 Mar 2014, 14:32
neville



Joined: 13 Jul 2008
Posts: 507
Location: New Zealand
neville 23 Mar 2014, 01:03
revolution wrote:
Ahem, him/her. Anyway, I think you missed the smiley I put after the statement discussing the other CPUs.
earlier, revolution wrote:
So no chance of using multiple threads then. With using only global variables you make it impossible to make this really fast by taking advantage of the extra cores lying idle. Razz
Smile Laughing So you think Razz (Razz) is a smiley? Seems a little disingenuous? Or can we assume that you really don't know how "the extra cores lying idle" can be used in DOS, in which case your perceived limitation of freecrac's code was invalid? Hopefully tthsqe can still help...

_________________
FAMOS - the first memory operating system
Post 23 Mar 2014, 01:03
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 23 Mar 2014, 08:02
neville, are you saying that multithreading in DOS is not possible?!
I guess I made a donkey out of you and me when I just assumed that it was possible.Embarassed
If it is not possible, by all means make programs as least thread safe as possible. In that case my objections from my initial knee jerk reaction are not valid.
Also, you can't just assume that I am male. Mad

Also, if the professed benefits of mov over push/pop are so great, freecrac can still maintain some thread safe habits by using
Code:
sub  esp,8*6
mov [esp+8*0],esx
mov [esp+8*1],edi
...
    

and the reverse at return.
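
A fuller sketch of that idea for 32-bit code, assuming 4-byte slots and the six registers from freecrac's listing (the 8-byte stride above would suit 64-bit registers); the label is hypothetical:
Code:
line_saved:
sub esp,4*6            ;reserve six dword slots
mov [esp+4*0],eax      ;save with plain mov instead of push
mov [esp+4*1],ebx
mov [esp+4*2],ecx
mov [esp+4*3],edx
mov [esp+4*4],esi
mov [esp+4*5],edi
;... routine body, esp must be unchanged at this point ...
mov edi,[esp+4*5]      ;restore in any order
mov esi,[esp+4*4]
mov edx,[esp+4*3]
mov ecx,[esp+4*2]
mov ebx,[esp+4*1]
mov eax,[esp+4*0]
add esp,4*6            ;release the slots
ret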
Post 23 Mar 2014, 08:02
sid123



Joined: 30 Jul 2013
Posts: 339
Location: Asia, Singapore
sid123 23 Mar 2014, 08:47
Quote:
neville, are you saying that multithreading in DOS is not possible?!
I guess I made a donkey out of you and me when I just assumed that it was possible

It is.
You can initialize multiple cores in DOS; the only thing that prevents you is the crap real mode addressing: you can't access memory above 1MB. I think( Question ) the ACPI and I/O APIC tables are located between 3GB and 4GB (0xC0000000 to 0xFFFFFFFF I guess), which is seriously not reachable in RM. But who said it's not possible in DOS? All fasm people are aware of FRM (? or Unreal, can't remember the difference, Tomasz told about this in an old thread).
The problem after initializing multiple cores will be BS BIOS interrupts, which switch to Protected Mode and return back to pure real mode, and you're not in Unreal Mode at all. Sad
Code:
mov [esp+8*0],esx     

ESX? lol. Is there any register like that in x86? Well it does make me laugh for some or the other reason. Laughing
Quote:
Also, you can't just assume that I am male. :Mad:

Well,
Image
Btw not trying to become a revolution, (pun intended)
I am a he.
EDIT: Confirmed it.
OSDev.org Memory Map (x86) wrote:

Start: 0xC0000000 (sometimes, depends on motherboard and devices)
End: 0xFFFFFFFF
Use: Memory mapped PCI devices, PnP NVRAM?, IO APIC/s, local APIC/s, BIOS, ...
The region of RAM above 1 MiB is not standardized, well-defined, or contiguous. There are likely to be regions of it that contain memory mapped hardware, that nothing but a device driver should ever access. There are likely to be regions of it that contain ACPI tables which your initialization code will probably want to read

_________________
"Those who can make you believe in absurdities can make you commit atrocities" -- Voltaire https://github.com/Benderx2/R3X
XD
Post 23 Mar 2014, 08:47
neville



Joined: 13 Jul 2008
Posts: 507
Location: New Zealand
neville 24 Mar 2014, 03:45
tthsqe wrote:
I guess I made a donkey out of you and me when I just assumed that it was possible.Embarassed
Hee-Haw to you but I'm fine Smile
tthsqe wrote:
Also, you can't just assume that I am male. Mad
You are having an extraordinary identity crisis here Shocked Not just gender, but name too... Wink

_________________
FAMOS - the first memory operating system
Post 24 Mar 2014, 03:45
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
