flat assembler
Message board for the users of flat assembler.

fast copy/vertical text scrolling?

Ralph



Joined: 04 Oct 2003
Posts: 86
I'm trying to do vertical text scrolling in video mode. I'm using 800x600 at 16 bits per pixel (so about 1 MB of data), and I'm currently just copying the whole buffer up by one row using a pretty standard fild/fistp copy. For some reason that isn't particularly fast: I have an AMD64 3200 and there is a noticeable delay when the display is moved. I tried movaps and a few other instructions, but they're not much faster either. I can play games at 4 times that resolution at a higher framerate, so what am I doing wrong? It's not a double-buffering problem (it doesn't flicker or anything); it just seems slower than it should be on a CPU this fast.
Post 09 Dec 2004, 08:36
polygon7



Joined: 14 Aug 2003
Posts: 62
Location: Poznan, Poland
Hi,
you could try this (VC++ syntax):
Code:
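char *tbuf = new char[2048]; // 2 KB staging buffer, small enough to stay in L1

// Copies nbytes from src to dst in 2 KB blocks.
// nbytes must be a nonzero multiple of 2048: any remainder is not copied,
// and a count below 2048 makes the block counter start at zero and wrap.
void mem4(void *dst, void *src, int nbytes)
{
        __asm {
        mov esi, src
        mov ecx, nbytes
        mov ebx, ecx
        shr ebx, 11 // 2048 bytes at a time
        mov edi, dst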

loop2k: // Copy 2k into temporary buffer 
        push edi 
        mov edi, tbuf 
        mov ecx, 2048 
        shr ecx, 6 

loopMemToL1: 
        prefetchnta 64[ESI] // Prefetch next loop, non-temporal 
        prefetchnta 96[ESI] 

        movq mm1,  0[ESI] // Read in source data 
        movq mm2,  8[ESI] 
        movq mm3, 16[ESI] 
        movq mm4, 24[ESI] 
        movq mm5, 32[ESI] 
        movq mm6, 40[ESI] 
        movq mm7, 48[ESI] 
        movq mm0, 56[ESI] 

        movq  0[EDI], mm1 // Store into L1 
        movq  8[EDI], mm2 
        movq 16[EDI], mm3 
        movq 24[EDI], mm4 
        movq 32[EDI], mm5 
        movq 40[EDI], mm6 
        movq 48[EDI], mm7 
        movq 56[EDI], mm0 
        add esi, 64 
        add edi, 64 
        dec ecx 
        jnz loopMemToL1 

        pop edi // Now copy from L1 to system memory 
        push esi 
        mov esi, tbuf 
        mov ecx, 2048 
        shr ecx, 6 

loopL1ToMem: 
        movq mm1, 0[ESI] // Read in source data from L1 
        movq mm2, 8[ESI] 
        movq mm3, 16[ESI] 
        movq mm4, 24[ESI] 
        movq mm5, 32[ESI] 
        movq mm6, 40[ESI] 
        movq mm7, 48[ESI] 
        movq mm0, 56[ESI] 

        movntq 0[EDI], mm1 // Non-temporal stores 
        movntq 8[EDI], mm2 
        movntq 16[EDI], mm3 
        movntq 24[EDI], mm4 
        movntq 32[EDI], mm5 
        movntq 40[EDI], mm6 
        movntq 48[EDI], mm7 
        movntq 56[EDI], mm0 

        add esi, 64 
        add edi, 64 
        dec ecx 
        jnz loopL1ToMem 

        pop esi // Do next 2k block 
        dec ebx 
        jnz loop2k 

        sfence // Make the non-temporal stores globally visible
        emms
        } 
}
    


P.S. This code is not mine; I don't remember who the author is.

_________________
best regards
p7
Post 09 Dec 2004, 09:32
ASHLEY4



Joined: 28 Apr 2004
Posts: 376
Location: UK
Try here: http://www.programmersheaven.com/zone10/cat106/

\\\\||////
(@@)
ASHLEY4.

Batteries not included, Some assembly required.
Post 09 Dec 2004, 15:11
Matrix



Joined: 04 Sep 2004
Posts: 1171
Location: Overflow
Hello,
here's a whole thread on the fastest memory copying algorithms, with test programs; you just had to read a little further down the board:
http://board.flatassembler.net/topic.php?t=2571

You weren't copying from one memory-mapped port to another, were you? Because that's extremely slow.

BTW: if you are doing scrolling, a generic memcopy algorithm won't be a good fit unless you adapt it to the application in some way.
Post 09 Dec 2004, 17:25
Ralph



Joined: 04 Oct 2003
Posts: 86
I was copying from video memory to video memory at first, but I changed that and it's not much faster. Not as fast as I was hoping it would be anyway. I then added vsync and it's incredibly slow now. It's just annoying that I can scroll a 1600x1200 browser window faster than I can scroll text at 800x600, but I guess that's done with hardware acceleration?

Thanks for all the info, I'll give it another shot. That vc++ snippet looks good.
Post 09 Dec 2004, 21:11
Octavio



Joined: 21 Jun 2003
Posts: 366
Location: Spain
Ralph wrote:
I was copying from video memory to video memory at first, but I changed that and it's not much faster. Not as fast as I was hoping it would be anyway. I then added vsync and it's incredibly slow now. It's just annoying that I can scroll a 1600x1200 browser window faster than I can scroll text at 800x600, but I guess that's done with hardware acceleration?

Thanks for all the info, I'll give it another shot. That vc++ snippet looks good.

You can use VESA functions to do a hardware scroll.
And scroll several lines at a time, not one by one.
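
A hardware scroll moves the display start address instead of moving pixels. The bookkeeping for that can be sketched in C as below. This is only a model under stated assumptions: the video mode is assumed to expose a virtual buffer taller than the visible screen, the `viewport` struct and both function names are hypothetical, and the actual call that programs the display start (VBE function 4F07h on VESA BIOSes) is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a virtual screen that is taller than the display. */
struct viewport {
    size_t virt_rows;     /* pixel rows available in video memory */
    size_t visible_rows;  /* pixel rows shown on screen */
    size_t start_row;     /* first pixel row currently displayed */
};

/* Advance the display start by `rows` pixel rows.
   Returns 1 on success, or 0 when the visible window would run past the
   end of the virtual buffer and the caller has to fall back to a copy. */
static int hw_scroll(struct viewport *v, size_t rows)
{
    if (v->start_row + rows + v->visible_rows > v->virt_rows)
        return 0;
    v->start_row += rows;
    return 1;
}

/* Byte offset of the first displayed row, given the bytes per scanline. */
static size_t start_offset(const struct viewport *v, size_t pitch_bytes)
{
    assert(pitch_bytes > 0);
    return v->start_row * pitch_bytes;
}
```

With a 1200-row virtual buffer and 600 visible rows, scrolling by one 8-pixel text line is just `hw_scroll(&v, 8)`; no pixel data moves until the window reaches the end of the buffer.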
Post 09 Dec 2004, 21:56
Ralph



Joined: 04 Oct 2003
Posts: 86
I ended up using the 32-bit memory copy from the AMD optimization guide. It's faster, but not by a huge amount. The whole thing is still quite slow: it probably only scrolls around 10 lines per second. At 800x600x16 my blocks are around 0.9 MB and I need to do 2 copies (one from the buffer to video memory and one to sync the buffer up again), so at ~10 lines/s that's ~20 MB/s. According to AMD the throughput should be ~2 GB/s for an 8 MB block and ~4.7 GB/s for an 8 KB block. How can my code be ~100x slower?
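
The arithmetic in this post can be checked with a few lines of C. The helper names are made up for illustration; the inputs are the figures quoted above (800x600 at 16 bits per pixel, 2 copies per scroll step, about 10 steps per second).

```c
#include <assert.h>
#include <stdint.h>

/* Bytes in one frame: width * height * bytes per pixel. */
static uint32_t frame_bytes(uint32_t width, uint32_t height,
                            uint32_t bits_per_pixel)
{
    assert(bits_per_pixel % 8 == 0);
    return width * height * (bits_per_pixel / 8);
}

/* Bytes moved per second when every scroll step copies the whole
   frame `copies` times. */
static uint64_t scroll_bytes_per_second(uint32_t frame, uint32_t copies,
                                        uint32_t steps_per_second)
{
    return (uint64_t)frame * copies * steps_per_second;
}
```

One frame is 960,000 bytes (about 0.92 MB), so two copies at 10 steps per second come to 19,200,000 bytes/s. Against the quoted ~2 GB/s that is a factor of roughly 100, which matches the estimate in the post.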

Here is the whole scroll code by the way:

Code:
scroll:
    mov     edi,dword [_vram@]
    mov     esi,dword [_offscrn@]
    add     esi,px*depth*8
    mov     ecx,px*py*depth
    call    mcpy32amd
    mov     esi,dword [_vram@]
    mov     edi,dword [_offscrn@]
    mov     ecx,px*py*depth
    call    mcpy32amd
    sub     dword [_cursor],px/8
    ret    

; in:  esi = source, edi = destination (both 16-byte aligned)
;      ecx = byte count (multiple of 32)
mcpy32amd:
    add     esi,ecx
    add     edi,ecx
    shr     ecx,3
    neg     ecx
@@: prefetchnta [esi+ecx*8+256]   ;load and mark a line into l1
    movdqa  xmm0,[esi+ecx*8]
    movntdq [edi+ecx*8],xmm0
    movdqa  xmm1,[esi+ecx*8+16]
    movntdq [edi+ecx*8+16],xmm1
    add     ecx,4
    jnz     @b
    sfence                        ;flush write combine
    ret       
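
For readers who prefer C, here is a rough counterpart of the mcpy32amd loop above: aligned 16-byte loads paired with non-temporal stores, 32 bytes per iteration, followed by a store fence. It is a sketch, not the AMD routine. The name `stream_copy` is made up, the prefetch is left out for brevity, and it assumes the SSE2 intrinsics from `<emmintrin.h>`; on compilers that do not define `__SSE2__` it simply falls back to `memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_load_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Copy n bytes using aligned loads and non-temporal (streaming) stores.
   Assumes src and dst are 16-byte aligned and n is a multiple of 32. */
static void stream_copy(void *dst, const void *src, size_t n)
{
    assert(n % 32 == 0);
    assert((((uintptr_t)dst | (uintptr_t)src) % 16) == 0);
#if defined(__SSE2__)
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    for (size_t i = 0; i < n / 16; i += 2) {
        __m128i a = _mm_load_si128(s + i);       /* like movdqa  */
        __m128i b = _mm_load_si128(s + i + 1);
        _mm_stream_si128(d + i, a);              /* like movntdq */
        _mm_stream_si128(d + i + 1, b);
    }
    _mm_sfence();                                /* flush write combining */
#else
    memcpy(dst, src, n);                         /* portable fallback */
#endif
}
```

Streaming stores bypass the cache on the way out, which suits a write-only destination such as a framebuffer. They do nothing to speed up reads from video memory.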
Post 10 Dec 2004, 23:28
Ralph



Joined: 04 Oct 2003
Posts: 86
I got rid of that second read from video memory. Much, much faster now. Reading from VRAM is a bad idea, I guess. Thanks for all the info.
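
The shape of that fix, sketched in C under some assumptions: the scroll happens entirely in a shadow buffer in system RAM, and video memory is written once per step and never read back. The `screen` struct and `scroll_and_present` are hypothetical names, `vram` is an ordinary pointer standing in for the mapped framebuffer, and plain `memcpy` stands in for whatever optimized copy is actually used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical screen state. `shadow` lives in system RAM; `vram` stands
   in for the mapped framebuffer. Both hold width * height 16-bit pixels. */
struct screen {
    uint16_t *shadow;
    uint16_t *vram;
    size_t width, height;
};

/* Scroll up by `rows` pixel rows and show the result. All reads hit
   system RAM; video memory is only ever written. */
static void scroll_and_present(struct screen *s, size_t rows)
{
    assert(rows <= s->height);
    size_t total = s->width * s->height;
    size_t keep  = (s->height - rows) * s->width;   /* pixels that survive */

    /* Source and destination overlap, so memmove rather than memcpy. */
    memmove(s->shadow, s->shadow + rows * s->width, keep * sizeof *s->shadow);
    /* Blank the rows exposed at the bottom. */
    memset(s->shadow + keep, 0, (total - keep) * sizeof *s->shadow);
    /* One write pass to video memory, no read back. */
    memcpy(s->vram, s->shadow, total * sizeof *s->shadow);
}
```

Passing a larger `rows` value scrolls several text lines in one step for the same cost as scrolling one.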
Post 11 Dec 2004, 07:18
Copyright © 1999-2020, Tomasz Grysztar.
