flat assembler
Message board for the users of flat assembler.

Index > Projects and Ideas > optimizing data movement code

StarKnightD



Joined: 04 Jul 2007
Posts: 38
StarKnightD 21 Sep 2008, 15:07
I was wondering, in terms of optimizing data movement, is there any faster way to implement the following code....

Code:
mem_copy:
; assumes: esi = source, edi = destination, both 16-byte aligned,
; and ecx = byte count, a non-zero multiple of 16*8
movaps xmm0, [esi + 16*0]   ; load 8 x 16 bytes = 128 bytes per pass
movaps xmm1, [esi + 16*1]
movaps xmm2, [esi + 16*2]
movaps xmm3, [esi + 16*3]
movaps xmm4, [esi + 16*4]
movaps xmm5, [esi + 16*5]
movaps xmm6, [esi + 16*6]
movaps xmm7, [esi + 16*7]

movntps [edi + 16*0], xmm0  ; non-temporal stores: write around the cache
movntps [edi + 16*1], xmm1
movntps [edi + 16*2], xmm2
movntps [edi + 16*3], xmm3
movntps [edi + 16*4], xmm4
movntps [edi + 16*5], xmm5
movntps [edi + 16*6], xmm6
movntps [edi + 16*7], xmm7

add esi, 16*8
add edi, 16*8

add ecx, -16*8              ; count down; sets ZF on the final block
jnz mem_copy
sfence                      ; fence the weakly-ordered movntps stores
ret
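For context, a call looks something like this (a sketch; src and dst are hypothetical labels, both 16-byte aligned, with a byte count that is a multiple of 16*8):
Code:
mov esi, src            ; source buffer
mov edi, dst            ; destination buffer
mov ecx, 16*1024*1024   ; byte count (the 16 MB case tested below)
call mem_copy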


I'm thinking that borrowing the video card's ability to access memory (if you've got one, anyway), or some hardware like it, would significantly increase performance. However, that's dependent on each piece of hardware, and it also depends on multitasking to give the CPU something to do while you wait!

When use16 precedes the code and ESI/EDI are used, per the example above, I get 100 bytes of code.

When use16 precedes the code and SI/DI are used, I get 77 bytes.

When use32 precedes the code (ESI and EDI), I get 81 bytes.

When use64 precedes the code and all the esi/edi references in the xmm instructions are replaced with rsi/rdi, the result is the same size.

The cost per instruction is my primary concern; since there are so many similar instructions, the cost will multiply with the size of the array.

By the way, the following code is necessary to enable the xmm registers, as I found out recently ( ^_^;; ). That is, if you're running without an OS.
Code:
mov eax, cr4
or  eax, 1 shl 9    ; set CR4.OSFXSR (bit 9): without it, SSE instructions raise #UD
mov cr4, eax
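As a side note, bare-metal init code usually also sets CR4 bit 10 (OSXMMEXCPT), so that unmasked SSE exceptions raise #XM instead of #UD; combined:
Code:
mov eax, cr4
or  eax, (1 shl 9) or (1 shl 10)  ; OSFXSR + OSXMMEXCPT
mov cr4, eax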
I don't think using mm0-mm7 will help by much, but it might help some! hehehe, Evil Speed!!
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20446
Location: In your JS exploiting you and your system
revolution 21 Sep 2008, 15:25
The question is:
StarKnightD wrote:
I was wondering, in terms of optimizing data movement, is there any faster way to implement the following code....
The short answer is:

It depends.

The long answer is:

It depends upon many, many factors. Here is a list of some factors to consider, in no particular order (the list is not exhaustive):
  • The number of bytes being transferred
  • The alignment of the source and destination
  • Whether the data is used again shortly after moving
  • The availability of things like MMX, SSE, AVX etc.
  • Whether the data to be moved is already in the cache (and which level)
  • Whether the source or destination is on a slow interface
  • etc.
So there is no single answer to your question. Please specify the characteristics of the required movement in more detail.
edfed



Joined: 20 Feb 2006
Posts: 4353
Location: Now
edfed 21 Sep 2008, 22:28
I think, but I'm not sure, that DMA has an option to copy data from one segment to another.

What about UDMA? If that's not optimization, what is?

It's hard to understand, and harder still to try it and find the right datasheet.
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid 22 Sep 2008, 00:25
Yes, there definitely were memory-to-memory DMA transfers (nothing to do with segments), but there was something tricky about them; I'm not sure what. I just saw a lot of talk about it and no code, a few years back when I was interested in it.

Btw, I think DMA uses physical addresses, so you must run in kernel mode, the buffer must be physically contiguous, etc...
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 22 Sep 2008, 00:30
IIRC, DMA on the PC platform is severely limited and only really useful for device memory transfers. IIRC it's also pretty slow?
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid 22 Sep 2008, 01:58
I think the point of DMA was that you can use it to run the transfer in the background and spend the CPU power on something else (porn?) in the meantime.
StarKnightD



Joined: 04 Jul 2007
Posts: 38
StarKnightD 22 Sep 2008, 12:05
The idea is to make a set of library move functions which are multitasking friendly; that is, the fewest cycles are expended on the code which performs the move while attaining the highest transfer rates.

Like having one version that works from the bottom up and another that works from the top down (the one I gave works from the bottom up); if the data overlaps, then direction matters. Also, one hardware version that copies in the background while another thread is working.
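A top-down variant might look like this (a sketch, untested; four xmm registers per pass for brevity, so ecx must be a multiple of 16*4, and the flags set by the add survive the block because SSE moves don't touch them):
Code:
mem_copy_down:
add ecx, -16*4                    ; step back one 64-byte block; ZF set on the last
movaps xmm0, [esi + ecx + 16*0]
movaps xmm1, [esi + ecx + 16*1]
movaps xmm2, [esi + ecx + 16*2]
movaps xmm3, [esi + ecx + 16*3]
movntps [edi + ecx + 16*0], xmm0
movntps [edi + ecx + 16*1], xmm1
movntps [edi + ecx + 16*2], xmm2
movntps [edi + ecx + 16*3], xmm3
jnz mem_copy_down                 ; flags still from the add above
sfence
ret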

As for DMA, that was a side note, so I wasn't really expecting such a response about it. I read that DMA memory-to-memory only works at roughly 4 MB/s max. As for how to implement it, I've never seen code that does it.

By the way, the code snippet I gave is actual code I use now; it works. I haven't had a use for it in general data movement (I'm not the liberal C++ programmer who wants to throw memory around), and the data does in fact need to be contiguous. I used it to scroll the output on the screen, where it's roughly 4x faster than a BIOS call that does the same; but of course, that's device memory.

DMA overview: the data must be in the lower 16 MB physical region, and if a transfer crosses a 64 KB physical boundary, the address rolls over mod 2^16. I can't think of much else, but it's certainly not useful if it can only handle a throughput of 4 MB/s.

I just tested the code in real mode (at boot time): it runs at roughly 1 GB/s on my Athlon64 3700+ with 200 MHz DDR (400 effective). 16 MB were transferred in a single shot, using ESI/EDI of course. I am curious how it would fare in protected mode, but not much!
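Numbers like these can be taken at boot time with a plain rdtsc bracket (a sketch; assumes esi/edi/ecx are already set up, the elapsed count fits in 32 bits, and the TSC ticks at the core clock):
Code:
rdtsc               ; time stamp counter -> edx:eax
mov ebx, eax        ; keep the low 32 bits of the start count
call mem_copy
rdtsc
sub eax, ebx        ; elapsed core clocks for the whole copy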

The prefetch instruction hurts performance, by the way, by about 200 MB/s. Using MOVDQA instead of MOVAPS hurts by roughly 300 MB/s.
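For reference, the prefetch variant being dismissed would presumably place a single hint at the top of each iteration, one block ahead (a hypothetical reconstruction, not the measured code):
Code:
mem_copy_pf:
prefetchnta [esi + 16*8]    ; hint a line of the next block; measured slower here
movaps xmm0, [esi + 16*0]
; ... the rest of the loop unchanged ...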
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20446
Location: In your JS exploiting you and your system
revolution 22 Sep 2008, 12:23
StarKnightD: Have you seen the AMD manuals? There is an example in there showing how to make very fast memory-to-memory moves for large data sets; it takes advantage of the cache to reach very high rates.
StarKnightD



Joined: 04 Jul 2007
Posts: 38
StarKnightD 22 Sep 2008, 19:35
Actually, revolution, the code example I gave above is based on the AMD block prefetch paper. My version, if you would actually look at it instead of responding blindly with vast generalizations, is optimized for code density and per-instruction decoding speed.

My code uses MOVAPS, which is 1 byte smaller (per instruction) than MOVDQA. Also, my code uses a constant offset within EACH move instruction's address field, instead of:
Code:
 movq mm0, [esi + ecx*8 + 8] ; their code.    
which is at least 1 byte larger per instruction and a few cycles slower per instruction!
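The size difference is visible directly in the encodings (disp8 forms; byte counts per the Intel/AMD opcode maps):
Code:
movaps xmm0, [esi + 16]         ; 0F 28 46 10    = 4 bytes
movdqa xmm0, [esi + 16]         ; 66 0F 6F 46 10 = 5 bytes (extra 66h prefix)
movq   mm0, [esi + ecx*8 + 8]   ; 0F 6F 44 CE 08 = 5 bytes (needs a SIB byte)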

Slower? It's memory movement, so "who cares", right? The idea is to let the processor get on with other things as quickly as it can; I want multitasking-friendly move operations! The less code that needs to be copied into the processor cache, the more of the actual data can be copied in!
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20446
Location: In your JS exploiting you and your system
revolution 23 Sep 2008, 00:53
rep movsd
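Spelled out under the same esi/edi/ecx contract as the original routine (a sketch):
Code:
mem_copy_rep:
shr ecx, 2      ; byte count -> dword count
rep movsd       ; with DF clear, copies upward from [esi] to [edi]
ret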
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 23 Sep 2008, 01:41
StarKnightD, have you tried something along the lines of the quote below? How was the performance compared to your code?

AMD wrote:
5.16 Interleave Loads and Stores
When loading and storing data as in a copy routine, the organization of the sequence of loads and stores can affect performance.

Application
This optimization applies to:
• 32-bit software
• 64-bit software

Rationale
When using SSE and SSE2 instructions to perform loads and stores, it is best to interleave them in the following pattern—Load, Store, Load, Store, Load, Store, etc. This enables the processor to maximize the load/store bandwidth.
If using MMX loads and stores in 32-bit mode, the loads and stores should be arranged in the following pattern—Load, Load, Store, Store, Load, Load, Store, Store, etc.

Example
The following example illustrates a sequence of 128-bit loads and stores:
movdqa xmm0,[rdx+r8*8] ; Load
movntdq [rcx+r8*8],xmm0 ; Store
movdqa xmm1,[rdx+r8*8+16] ; Load
movntdq [rcx+r8*8+16],xmm1 ; Store
StarKnightD



Joined: 04 Jul 2007
Posts: 38
StarKnightD 23 Sep 2008, 09:32
revolution: cute.

LocoDelAssembly: the phrase AMD and Intel like to use is "dependency chain"; that's exactly what the AMD and Intel manuals have continuously stressed should not occur. I can't test whether it's faster or not right now, as I don't have code for getting into long mode. It should encode fine, since the default register width within an address (i.e. [rax]) follows whatever mode you're in.

[rcx + r8*8], however, will require extra cycles (the multiply by 8 and the addition aren't free; I've benchmarked it on XP64) and 1 extra byte per instruction. Also, I may be off right now, but I think there is a logical bug in multiplying by 8 (bytes) instead of 16 (bytes), since they're XMM registers instead of MMX.
StarKnightD



Joined: 04 Jul 2007
Posts: 38
StarKnightD 23 Sep 2008, 10:38
I just reordered them to be alternating, and it consistently reduced the cycle count (for 16 MB of data) by ~500,000 cycles: 39.66 million cycles, as opposed to the original's 40.17 million. 1.2% is nice, though the issue of the dependency chain makes me wonder what's going on in the hardware.
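The reordered loop body follows the AMD pattern quoted above; the first pairs would look like this (a sketch, loop control unchanged from the original routine):
Code:
mem_copy_i:
movaps  xmm0, [esi + 16*0]    ; load
movntps [edi + 16*0], xmm0    ; store
movaps  xmm1, [esi + 16*1]    ; load
movntps [edi + 16*1], xmm1    ; store
; ... and so on through xmm7, then the add/jnz tail as before ...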

Corrected: it's not bad, it's good!
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20446
Location: In your JS exploiting you and your system
revolution 23 Sep 2008, 12:05
StarKnightD wrote:
Actually, revolution, the code example I gave above is based on the AMD block prefetch paper. My version, if you would actually look at it instead of responding blindly with vast generalizations, is optimized for code density and per-instruction decoding speed.

My code uses MOVAPS, which is 1 byte smaller (per instruction) than MOVDQA. Also, my code uses a constant offset within EACH move instruction's address field, instead of:
Code:
 movq mm0, [esi + ecx*8 + 8] ; their code.    
which is at least 1 byte larger per instruction and a few cycles slower per instruction!

Slower? It's memory movement, so "who cares", right? The idea is to let the processor get on with other things as quickly as it can; I want multitasking-friendly move operations! The less code that needs to be copied into the processor cache, the more of the actual data can be copied in!
I think you still have not properly defined your requirements. You say in the last paragraph that you want to maximise the data in the cache and minimise the code in the cache, but you reject my suggestion of rep movsd. Why? Is it too slow? It is certainly the smallest code possible that meets the goal you state. Intel CPUs have special circuitry to make the rep movs* instructions perform very fast, so maybe it is not as slow as many think. Have you tested the speed on different systems? Do you need your code to perform well on only your system, or on many other systems?

I think you need some way of objectively stating what you require. If I can make your code 1 byte smaller, how much speed are you prepared to sacrifice? What if I can halve the size of your code; now how much speed are you prepared to sacrifice? What if I make your code 10 bytes larger; are you prepared to accept that if it is some amount faster?

If you can, please post a formula or set of criteria that we can follow, so we know whether our suggestions meet your goals.
