flat assembler
Message board for the users of flat assembler.
polygon7 09 Dec 2004, 09:32
Hi,
you could try this (VC++ syntax):
Code:

    char *tbuf = new char[2048];    // 2 KB staging buffer, small enough to stay in L1

    void mem4(void *dst, void *src, int nbytes)
    {
        __asm {
            mov esi, src
            mov ecx, nbytes
            mov ebx, ecx
            shr ebx, 11             // 2048 bytes at a time
            mov edi, dst

        loop2k:                     // Copy 2k into temporary buffer
            push edi
            mov edi, tbuf
            mov ecx, 2048
            shr ecx, 6

        loopMemToL1:
            prefetchnta 64[ESI]     // Prefetch next loop, non-temporal
            prefetchnta 96[ESI]

            movq mm1, 0[ESI]        // Read in source data
            movq mm2, 8[ESI]
            movq mm3, 16[ESI]
            movq mm4, 24[ESI]
            movq mm5, 32[ESI]
            movq mm6, 40[ESI]
            movq mm7, 48[ESI]
            movq mm0, 56[ESI]

            movq 0[EDI], mm1        // Store into L1
            movq 8[EDI], mm2
            movq 16[EDI], mm3
            movq 24[EDI], mm4
            movq 32[EDI], mm5
            movq 40[EDI], mm6
            movq 48[EDI], mm7
            movq 56[EDI], mm0
            add esi, 64
            add edi, 64
            dec ecx
            jnz loopMemToL1

            pop edi                 // Now copy from L1 to system memory
            push esi
            mov esi, tbuf
            mov ecx, 2048
            shr ecx, 6

        loopL1ToMem:
            movq mm1, 0[ESI]        // Read in source data from L1
            movq mm2, 8[ESI]
            movq mm3, 16[ESI]
            movq mm4, 24[ESI]
            movq mm5, 32[ESI]
            movq mm6, 40[ESI]
            movq mm7, 48[ESI]
            movq mm0, 56[ESI]

            movntq 0[EDI], mm1      // Non-temporal stores
            movntq 8[EDI], mm2
            movntq 16[EDI], mm3
            movntq 24[EDI], mm4
            movntq 32[EDI], mm5
            movntq 40[EDI], mm6
            movntq 48[EDI], mm7
            movntq 56[EDI], mm0

            add esi, 64
            add edi, 64
            dec ecx
            jnz loopL1ToMem

            pop esi                 // Do next 2k block
            dec ebx
            jnz loop2k

            emms
        }
    }

P.S. This code is not mine; I don't remember who the author is.
_________________
best regards
p7
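One thing to watch with the routine as posted: the block counter is `nbytes >> 11`, so any remainder below 2048 bytes is never copied, and a size under 2048 makes the counter start at zero and wrap on the first `dec ebx`. Here is a minimal C sketch of the same two-pass structure with a tail copy added. Plain `memcpy` stands in for the MMX loops, and the name `staged_copy` is made up for illustration, so treat it as an outline of the control flow rather than a tuned replacement.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { BLOCK = 2048 };   /* same block size as the posted routine */

/* Hypothetical wrapper: copies n bytes in 2 KB blocks through a staging
   buffer, then copies the remainder that a block-only loop would skip. */
void staged_copy(void *dst, const void *src, size_t n)
{
    static unsigned char tbuf[BLOCK];   /* stands in for the heap-allocated tbuf */
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t blocks = n / BLOCK;          /* what "shr ebx, 11" computes */
    size_t tail = n % BLOCK;            /* bytes the posted loop never reaches */

    assert(dst != NULL && src != NULL);

    while (blocks--) {                  /* runs zero times when n < BLOCK */
        memcpy(tbuf, s, BLOCK);         /* pass 1: source -> staging buffer */
        memcpy(d, tbuf, BLOCK);         /* pass 2: staging buffer -> destination */
        s += BLOCK;
        d += BLOCK;
    }
    if (tail != 0)
        memcpy(d, s, tail);             /* leftover bytes */
}
```

In a real version the two `memcpy` calls would be the prefetch/`movq` loop and the `movntq` loop, and only the tail would go through an ordinary copy.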
ASHLEY4 09 Dec 2004, 15:11
Try here: http://www.programmersheaven.com/zone10/cat106/
\\\\||//// (@@) ASHLEY4. Batteries not included, Some assembly required.
Matrix 09 Dec 2004, 17:25
Hello,
here's a whole thread on fastest memory copying algorithms, with test programs; you just had to read a little further down: http://board.flatassembler.net/topic.php?t=2571
You weren't copying from memory-mapped port to memory-mapped port, were you? Because that's extremely slow.
BTW: if you are doing scrolling, the memcopy algorithm will not be good for you unless you make it fit the application in some way.
Ralph 09 Dec 2004, 21:11
I was copying from video memory to video memory at first, but I changed that and it's not much faster. Not as fast as I was hoping it would be anyway. I then added vsync and it's incredibly slow now. It's just annoying that I can scroll a 1600x1200 browser window faster than I can scroll text at 800x600, but I guess that's done with hardware acceleration?
Thanks for all the info, I'll give it another shot. That VC++ snippet looks good.
Octavio 09 Dec 2004, 21:56
Ralph wrote: "I was copying from video memory to video memory at first, but I changed that and it's not much faster. Not as fast as I was hoping it would be anyway. I then added vsync and it's incredibly slow now. It's just annoying that I can scroll a 1600x1200 browser window faster than I can scroll text at 800x600, but I guess that's done with hardware acceleration?"
You can use VESA functions to do a hardware scroll. And scroll many lines at a time, not one by one.
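For reference, the VESA call usually meant here is VBE function 07h (Set/Get Display Start), which moves the visible window within video memory instead of copying pixels. The sketch below only fills in the register values and steps the start line several scan lines at a time. The struct and both function names are made up for illustration. How INT 10h gets issued depends on the environment, so that part is left out. It also assumes a logical screen taller than the visible one has already been set up.

```c
#include <assert.h>
#include <stdint.h>

/* Register values for the VBE call; issuing INT 10h itself is
   environment-specific and deliberately not shown. */
struct vbe_regs { uint16_t ax, bx, cx, dx; };

/* Hypothetical helper: registers for VBE function 07h, subfunction 00h. */
struct vbe_regs vbe_set_display_start(uint16_t first_pixel, uint16_t first_line)
{
    struct vbe_regs r;
    r.ax = 0x4F07;        /* VBE function 07h: Set/Get Display Start */
    r.bx = 0x0000;        /* BH = 0 (reserved), BL = 00h: set display start */
    r.cx = first_pixel;   /* first displayed pixel in the scan line */
    r.dx = first_line;    /* first displayed scan line */
    return r;
}

/* Advance the start line by `step` scan lines, clamped to the last
   position where a full visible screen still fits in the virtual one. */
uint16_t next_start_line(uint16_t current, uint16_t step,
                         uint16_t virtual_height, uint16_t visible_height)
{
    uint32_t max_start, next;
    assert(virtual_height >= visible_height);
    max_start = (uint32_t)(virtual_height - visible_height);
    next = (uint32_t)current + step;
    return (uint16_t)(next > max_start ? max_start : next);
}
```

Once the start line hits the clamp, the contents have to be copied back to the top of the virtual screen once, which is still far less traffic than a full copy per scroll step.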
Ralph 10 Dec 2004, 23:28
I ended up using the 32-bit memory copy from the AMD optimization guide. It's faster, but not by a huge amount. The whole thing is still quite slow. It probably only scrolls around 10 lines per second. At 800x600x16 my blocks are around 0.9 MB and I need to do 2 copies (one from buffer to video and one to sync the buffer up again), so at ~10 lines/s that's ~20 MB/s. According to AMD the throughput should be ~2 GB/s for an 8 MB block and ~4.7 GB/s for an 8 KB block. How can my code be ~100x slower?
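The estimate can be checked with a few lines of C. The function names are made up for illustration, and the inputs are the figures quoted above (800x600 at 16 bits per pixel, two full-frame copies per scroll step, roughly 10 steps per second).

```c
#include <assert.h>

/* Bytes in one full frame at the given resolution and colour depth. */
unsigned long frame_bytes(unsigned long width, unsigned long height,
                          unsigned long bits_per_pixel)
{
    assert(bits_per_pixel % 8 == 0);    /* whole bytes per pixel only */
    return width * height * (bits_per_pixel / 8);
}

/* Bytes moved per second when every scroll step performs `copies`
   full-frame copies. */
double scroll_bandwidth(unsigned long bytes_per_frame, unsigned copies,
                        double steps_per_second)
{
    return (double)bytes_per_frame * copies * steps_per_second;
}
```

`frame_bytes(800, 600, 16)` gives 960,000 bytes, and two copies at 10 steps per second come to 19,200,000 bytes/s, so the ~20 MB/s figure holds. Against a quoted 2 GB/s that is a factor of roughly 100.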
Here is the whole scroll code by the way:
Code:

    scroll:
            mov     edi,dword [_vram@]
            mov     esi,dword [_offscrn@]
            add     esi,px*depth*8
            mov     ecx,px*py*depth
            call    mcpy32amd
            mov     esi,dword [_vram@]
            mov     edi,dword [_offscrn@]
            mov     ecx,px*py*depth
            call    mcpy32amd
            sub     dword [_cursor],px/8
            ret

    mcpy32amd:
            add     esi,ecx
            add     edi,ecx
            shr     ecx,3
            neg     ecx
    @@:
            prefetchnta [esi+ecx*8+256]     ;load and mark a line into l1
            movdqa  xmm0,[esi+ecx*8]
            movntdq [edi+ecx*8],xmm0
            movdqa  xmm1,[esi+ecx*8+16]
            movntdq [edi+ecx*8+16],xmm1
            add     ecx,4
            jnz     @b
            sfence                          ;flush write combine
            ret
Ralph 11 Dec 2004, 07:18
I got rid of that second read from video. Much much faster now. Bad idea to read from vram I guess. Thanks for all the info.
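Reads from video memory are typically uncached and much slower than writes, which fits the speedup reported here. A minimal C sketch of a scroll step that never reads the frame buffer: all the shifting happens in a system-RAM shadow buffer, and video memory is only written. The function and parameter names are made up for illustration, and `memcpy` stands in for a streaming-store copy such as `mcpy32amd`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Scroll a system-RAM shadow buffer up by `rows` scan lines, blank the
   exposed rows, then push the result to the frame buffer in one
   write-only pass. `pitch` is the number of bytes per scan line. */
void scroll_and_present(unsigned char *vram, unsigned char *shadow,
                        size_t pitch, size_t height, size_t rows)
{
    size_t keep;
    assert(vram != NULL && shadow != NULL && rows <= height);

    keep = (height - rows) * pitch;                 /* bytes that stay visible */
    memmove(shadow, shadow + rows * pitch, keep);   /* RAM to RAM, overlapping */
    memset(shadow + keep, 0, rows * pitch);         /* blank the new bottom rows */
    memcpy(vram, shadow, height * pitch);           /* video memory is only written */
}
```

Only `(height - rows)` lines are moved inside the shadow buffer, so nothing is read past its end, and new text can be drawn into the blanked rows before the final copy.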