flat assembler
Message board for the users of flat assembler.
revolution
Usually, to make "fast" data transfers you will want to make good use of the cache.
So you need to determine the cache size of your system and divide the data into chunks of about half the cache size. Read one section into a cache buffer, and write out a different section from the other cache buffer, so that you alternate reads and writes to/from the cache buffers. The details can get very tricky, and the performance is heavily system dependent, so posting generalised code is almost impossible: it would have to account for all the different system variants for memory channel timing, CPU type, etc. But with the right code to match the system you could get up to many GB/s.
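The chunking idea can be sketched roughly in fasm-style code. This is only an illustration, not a tuned implementation: it assumes a 256 KiB cache (so 128 KiB chunks), hypothetical src/dst buffers, and a TOTAL size that divides evenly into chunks; real code would query the actual cache size at runtime (e.g. via cpuid).

Code:
CHUNK = 128*1024                ; half of an assumed 256 KiB cache
        mov     rsi,src         ; hypothetical source buffer
        mov     rdi,dst         ; hypothetical destination buffer
        mov     rbp,TOTAL/CHUNK ; assumes TOTAL is a multiple of CHUNK
.chunk:
        mov     rcx,CHUNK/8
.read:                          ; pass 1: touch the chunk so it is pulled into cache
        mov     rax,[rsi]
        add     rsi,8
        dec     rcx
        jnz     .read
        sub     rsi,CHUNK       ; rewind to the start of the chunk
        mov     rcx,CHUNK/8
.copy:                          ; pass 2: reads now come from cache; write the chunk out
        mov     rax,[rsi]
        mov     [rdi],rax
        add     rsi,8
        add     rdi,8
        dec     rcx
        jnz     .copy
        dec     rbp
        jnz     .chunk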
Overclick
Caching is cool, but what about the instructions themselves? Do we have any alternative (faster) way to sum data cells?
For example:

Code:
        mov     ecx,0xFFFF
        xor     rbx,rbx
toBegin:
        imul    ebx,ecx,32
        mov     r8d,[Buffer1-32+ebx]
        add     [Buffer2-32+ebx],r8d
        mov     r9d,[Buffer1-28+ebx]
        add     [Buffer2-28+ebx],r9d
        mov     r10d,[Buffer1-24+ebx]
        add     [Buffer2-24+ebx],r10d
        mov     r11d,[Buffer1-20+ebx]
        add     [Buffer2-20+ebx],r11d
        mov     r12d,[Buffer1-16+ebx]
        add     [Buffer2-16+ebx],r12d
        mov     r13d,[Buffer1-12+ebx]
        add     [Buffer2-12+ebx],r13d
        mov     r14d,[Buffer1-8+ebx]
        add     [Buffer2-8+ebx],r14d
        mov     r15d,[Buffer1-4+ebx]
        add     [Buffer2-4+ebx],r15d
        loop    toBegin
Overclick
error: relative jump out of range. )))
revolution
It is impossible for us to know what will be faster on your system in your application.
But I see no reason why you are limiting yourself to 32-bit registers for the transfer. Perhaps using 64-bit reads and writes will help.
Overclick
The reason for 32-bit is that the data needs modifying, but yes, in this case it could be 64-bit. Aren't there some FPU/SSE instructions for working with packed data, something like "rep movs"?
Could you show some examples of the differences between systems? I understand about caching, but what about the basic operations? Do you mean bit alignment?
revolution
You can try using SSE/AVX to do 32-bit adds with 128/256/512-bit registers. It might or might not be "faster" on your system; it depends a lot on other factors how fast it will be at runtime.
Maybe a tight loop with a single operation is faster for you. Or maybe an unrolled loop with multiple operations per iteration is faster. Perhaps doing blocks of reads first, then blocks of adds, then blocks of writes will be faster. Maybe the implicit read-modify-write add as you show above is slower than an explicit two reads, add, then write. Or some combination of the above. Or something else entirely. You will have to try it in various ways and time it to see which works best in your case.
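As a rough sketch, the loop posted above could be rewritten with SSE to add four dwords at a time. This assumes Buffer1 and Buffer2 are 16-byte aligned (movdqa and paddd with a memory operand fault otherwise); it keeps the same 32 bytes per iteration, and replaces loop with dec/jnz, which uses a near jump and so avoids the out-of-range error with a large unrolled body.

Code:
        mov     ecx,0xFFFF               ; same block count as the loop above
        xor     rbx,rbx
toBegin:
        imul    ebx,ecx,32
        movdqa  xmm0,[Buffer1-32+rbx]    ; load 4 dwords
        movdqa  xmm1,[Buffer1-16+rbx]    ; load 4 more dwords
        paddd   xmm0,[Buffer2-32+rbx]    ; packed 32-bit adds
        paddd   xmm1,[Buffer2-16+rbx]
        movdqa  [Buffer2-32+rbx],xmm0    ; write the sums back
        movdqa  [Buffer2-16+rbx],xmm1
        dec     ecx                      ; dec/jnz instead of loop: near jump, no rel8 range limit
        jnz     toBegin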
Overclick
Actually, I cannot use 64 bits per operation, or even 32 bits except as a container. The data stream is 24+24+24... I have to move each calculated dword to the buffer, then shift the address back by one byte. Or I can shift it inside some huge register and then unload that register to the buffer as a block. I'll have to search around SSE/AVX for the best way. It seems I need to write a few different algorithms depending on which instructions/registers the users' CPUs support.
revolution
Consider that in most cases the CPU will be much faster than the memory bus. So doing more work in the CPU and transferring fewer bytes to/from memory is usually "faster". Having the memory format use tightly packed 24-bit data and using the CPU to unpack, compute, then repack might be a win for you.
Also, when using the cache, you can have data transferring to/from the cache while at the same time the CPU is computing other data. This overlap of operations is where a lot of the speed-up can be realised.
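For tightly packed 24-bit data, a minimal unpack/compute/repack step might look like this sketch. The names are assumptions: rsi points into the packed input stream, rdi into the packed output, and the add of ebx is just a placeholder for the real computation.

Code:
        movzx   eax,word [rsi]      ; low 16 bits of the 24-bit value
        movzx   edx,byte [rsi+2]    ; high 8 bits
        shl     edx,16
        or      eax,edx             ; eax = unpacked 24-bit value
        add     eax,ebx             ; compute something (placeholder)
        mov     [rdi],ax            ; repack: store low 16 bits
        shr     eax,16
        mov     [rdi+2],al          ; store high 8 bits
        add     rsi,3               ; advance 3 bytes in the packed streams
        add     rdi,3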
Overclick
How can I lock system-provided buffers in the cache?
Actually, the buffers are provided by the interfaces. I may try to VirtualLock them and split my job by memory page size to make sure each page is cached just once. I'm just not sure those buffers let me do anything like that.

Last edited by Overclick on 05 Apr 2021, 14:36; edited 1 time in total
revolution
Overclick wrote: How can I lock system-provided buffers in the cache?

To get data into cache you only need to read it. To write stuff out of cache you only need to write it.

Code:
        mov     rax,[rsi]    ; read from memory into cache, slow
        mov     rax,[rsi+64] ; keep reading more into cache, here we assume the cache line is 64 bytes long, slow
        ; repeat a few K times
        ;...
        mov     rax,[rsi]    ; get data out of the cached read from above, this should be very fast if the data hasn't been evicted yet
        add     rax,rbx      ; compute something
        mov     [rdi],rax    ; write to final destination, or use movnt, slow
Overclick
I edited my comment above. You're right. I just want to be sure that the memory pages are not reloaded at random times.
revolution
If you are worried about the OS paging out your memory then you are looking at the wrong level, IMO. That is way above in the hierarchy and won't have any effect on performance at the instruction level. If the OS is paging you out to disk then you have already lost, so don't worry about it.
Overclick
I mean prefetched lines, not pages (my mistake).
revolution
Overclick wrote: I mean prefetched lines, not pages (my mistake).
Overclick wrote: I may try to VirtualLock it ...
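For what it's worth, rather than trying to lock lines in the cache, you can hint the hardware with the SSE prefetch instructions. A sketch, assuming rsi walks a source buffer, rcx counts iterations, and 64-byte cache lines; note that prefetch is only a hint, and the line can still be evicted before use.

Code:
.next:
        prefetcht0 [rsi+512]   ; hint: request a line several iterations ahead, into all cache levels
        mov     rax,[rsi]      ; this load should usually hit cache thanks to earlier hints
        add     rax,rbx        ; compute something (placeholder)
        mov     [rdi],rax
        add     rsi,8
        add     rdi,8
        dec     rcx
        jnz     .next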
Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.