flat assembler
Message board for the users of flat assembler.
Projects and Ideas > optimizing data movement code
revolution 21 Sep 2008, 15:25
The question is:
StarKnightD wrote: I was wondering, in terms of optimizing data movement, is there any faster way to implement the following code....
The short answer is: it depends. The long answer is: it depends upon many, many factors. Here is a list of some of the factors to consider, in no particular order; the list is not exhaustive:
edfed 21 Sep 2008, 22:28
I think, but I'm not sure, that DMA has an option to copy data from one segment to another.
What about UDMA? If it's not an optimisation, what is it? It's hard to understand, and even harder to try it out and find the right datasheet.
vid 22 Sep 2008, 00:25
Yes, there definitely were memory-to-memory DMA transfers (nothing to do with segments), but there was something tricky about them; I'm not sure what. I saw a lot of talk about it and no code, a few years back when I was interested in it.
Btw, I think DMA uses physical addresses, so you must run in kernel mode, the buffer must be physically contiguous, etc...
f0dder 22 Sep 2008, 00:30
IIRC, DMA on the PC platform is severely limited, and only really useful for device memory transfers. IIRC it's also pretty slow?
vid 22 Sep 2008, 01:58
I think the point of DMA was that you could use it to make the transfer in the background, and spend the CPU power doing something else (porn?) in the meantime.
StarKnightD 22 Sep 2008, 12:05
The idea is to make a set of library move functions which are multitasking friendly; that is, the fewest cycles are expended on the code performing the move while attaining the highest transfer rates.

For example: one version that works from the bottom up, and another that works from the top down (the one I gave works from the bottom up); if the data overlaps, then direction matters. Also, a hardware version that copies in the background while another thread is working. As for DMA, that was a side note, so I wasn't really expecting such a response about it. I read that DMA memory-to-memory only works at roughly 4 MB/s MAX, and as for how to implement it, I've never seen code that does it.

By the way, the code snippet I gave is actual code I use now; it works. I haven't had a use for it in general data movement (I'm not the liberal C++ programmer who wants to throw memory around), and the data does in fact need to be contiguous. I used it to scroll the output on the screen, and it's roughly 4x faster than a BIOS call that does the same. But of course, that's device memory.

DMA overview: the data must be in the lower 16 MB physical region, and if the transfer crosses a 2^16-byte physical boundary, the address it's using rolls over mod 2^16. I can't think of much else, but it's certainly not useful if it can only handle a throughput of 4 MB/s.

I just tested the code in real mode (at boot time): it runs at roughly 1 GB/s on my Athlon64 3700+ with 200 MHz DDR (400 effective). 16 MB were transferred in a single shot, using ESI/EDI of course. I am curious how it would fare in protected mode, but not much! The PREFETCH instruction hurts performance, by the way, by about 200 MB/s. Using MOVDQA instead of MOVAPS hurts by roughly 300 MB/s.
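For reference, a hypothetical sketch of the kind of loop being described (the original snippet isn't reproduced in this thread): an unrolled bottom-up SSE copy using MOVAPS with constant offsets. The register use and the 64-byte unroll factor are assumptions, not the poster's actual code.
Code:
; esi = source, edi = destination (both 16-byte aligned)
; ecx = byte count, a multiple of 64
copy_up:
        movaps  xmm0, [esi]
        movaps  xmm1, [esi+16]
        movaps  xmm2, [esi+32]
        movaps  xmm3, [esi+48]
        movaps  [edi], xmm0
        movaps  [edi+16], xmm1
        movaps  [edi+32], xmm2
        movaps  [edi+48], xmm3
        add     esi, 64
        add     edi, 64
        sub     ecx, 64
        jnz     copy_up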
revolution 22 Sep 2008, 12:23
StarKnightD: Have you seen the AMD manuals? There is an example in there showing how to make very fast memory-to-memory moves for large data sets. They show how to take advantage of the cache to push the speed to very high rates.
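A rough sketch of the block-prefetch idea in the AMD material: preload a whole block into cache with dummy reads, then stream it out. The block size, register use, and the non-temporal store are my assumptions, not the manual's exact listing.
Code:
BLOCK = 4096

block_copy:
        ; phase 1: touch one dword per 64-byte line, walking backwards,
        ; so the whole block lands in cache before the copy starts
        mov     edx, BLOCK-64
.touch:
        mov     eax, [esi+edx]
        sub     edx, 64
        jns     .touch

        ; phase 2: stream the now-cached block out
        xor     edx, edx
.copy:
        movaps  xmm0, [esi+edx]
        movntps [edi+edx], xmm0
        add     edx, 16
        cmp     edx, BLOCK
        jb      .copy

        add     esi, BLOCK
        add     edi, BLOCK
        sub     ecx, BLOCK      ; ecx = total byte count, multiple of BLOCK
        jnz     block_copy
        sfence                  ; flush the non-temporal stores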
StarKnightD 22 Sep 2008, 19:35
Actually, revolution, the code example I gave above is based on the AMD Block Prefetch Paper. My version, if you would actually look at it instead of responding blindly with vast generalizations, is optimized for code density and per-instruction decoding speed.

My code uses MOVAPS, which is 1 byte smaller (per instruction) than MOVDQA. Also, my code uses a constant offset within EACH move instruction's address field, instead of their form:
Code: movq mm0, [esi + ecx*8 + 8] ; their code
Slower? It's memory movement, so "who cares", right? The idea is to let the processor get on with other things as quickly as it can; I want multitasking-friendly move operations! The less code that needs to be copied into the processor cache, the more of the actual data can be copied in.
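For concreteness, the 32-bit encodings back up both size claims (disp8 offsets assumed):
Code:
movaps xmm0, [esi]          ; 0F 28 06       - 3 bytes
movdqa xmm0, [esi]          ; 66 0F 6F 06    - 4 bytes (extra 66h prefix)
movaps xmm0, [esi+16]       ; 0F 28 46 10    - 4 bytes (disp8)
movq   mm0, [esi+ecx*8+8]   ; 0F 6F 44 CE 08 - 5 bytes (SIB byte + disp8)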
revolution 23 Sep 2008, 00:53
rep movsd
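(Spelled out with the usual setup, a minimal sketch of that suggestion; for large blocks the microcoded string copy is hard to beat for code size, if not always for speed:)
Code:
; esi = source, edi = destination, ecx = dword count
        cld
        rep     movsd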
LocoDelAssembly 23 Sep 2008, 01:41
StarKnightD, have you tried something along the lines of the quote below? How was the performance compared to your code?
AMD wrote: 5.16 Interleave Loads and Stores
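One possible reading of that heading, as a hedged sketch (this is not the manual's listing): pair each load with the store of a previously loaded register, so loads and stores alternate instead of clustering.
Code:
; esi = source, edi = destination, ecx = byte count (multiple of 64)
interleaved:
        movaps  xmm0, [esi]
        movaps  xmm1, [esi+16]
        movaps  [edi], xmm0
        movaps  xmm2, [esi+32]
        movaps  [edi+16], xmm1
        movaps  xmm3, [esi+48]
        movaps  [edi+32], xmm2
        movaps  [edi+48], xmm3
        add     esi, 64
        add     edi, 64
        sub     ecx, 64
        jnz     interleaved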
StarKnightD 23 Sep 2008, 09:32
revolution: cute
LocoDelAssembly: the phrase AMD and Intel like to use is "dependency chain"; that's exactly what the AMD and Intel manuals have continuously stressed should not occur. I can't test whether it's faster or not right now, as I don't have code for getting into long mode. It should encode fine, since the default register within an address (i.e. [rax]) is whatever mode you're in. [rcx + r8*8], however, will require extra cycles (the multiply by 8 and the addition aren't free; I've benchmarked it in XP64) and 1 extra byte per instruction. Also, I may be off right now, but I think there is a logical bug in multiplying by 8 (bytes) instead of 16 (bytes), since they're XMM registers instead of MMX.
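(A sketch of the scale issue: x86 SIB scales stop at *8, so a 16-byte XMM stride has to be handled another way. Both variants below are illustrations, not code from the thread.)
Code:
; either keep the *8 index and step it by 2 per register...
        movaps  xmm0, [rsi + rcx*8]
        movaps  xmm1, [rsi + rcx*8 + 16]
        add     rcx, 4          ; 2 registers * 16 bytes = 32 bytes = 4*8
; ...or drop the scaled index and use plain constant displacements
        movaps  xmm0, [rsi]
        movaps  xmm1, [rsi + 16]
        add     rsi, 32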
StarKnightD 23 Sep 2008, 10:38
I just reordered them to alternate loads and stores, and it consistently reduced the cycle count (for 16 MB of data) by ~500,000 cycles: 39.66 million cycles, as opposed to the original's 40.17 million. 1.2% is nice, though the issue of the dependency chain makes me wonder what's going on in the hardware.
Corrected: it's not bad, it's good!
revolution 23 Sep 2008, 12:05
StarKnightD wrote: actually, revolution, the code example I gave above is based on the AMD Block Prefetch Paper.. my version, if you would actually look at it instead of responding blindly with vast generalizations, is optimized for code density and per-instruction decoding speed.
I think you need some way of stating objectively what you require. If I can make your code 1 byte smaller, how much speed are you prepared to sacrifice? What if I can halve the size of your code; now how much speed are you prepared to sacrifice? What if I make your code 10 bytes larger: are you prepared to accept it if it is some amount faster? Please post a formula or set of criteria that we can follow, so we know whether our suggestions meet your goals.