flat assembler
Message board for the users of flat assembler.
Add stream buffer to another
revolution 04 Apr 2021, 19:23
Usually, to make "fast" data transfers you want to make good use of the cache.
So you need to determine the cache size of your system and divide the data into chunks of about half the cache size. Read one section into a cache buffer, and write out a different section from the other cache buffer, so that you alternate reads and writes between the two cache buffers. The details can get very tricky, and the performance is heavily system dependent, so posting generalised code is almost impossible: it would have to account for all the different system variants, memory channel timings, CPU types, etc. But with the right code to match the system you can get up to many GB/s.
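(A minimal sketch of the chunk-at-a-time idea, not from the post above: it simply reads one chunk to warm the cache and then writes it out, rather than truly double-buffering. CHUNK, the register setup (rsi = source, rdi = destination, r8 = total byte count, assumed to be a multiple of CHUNK) and the chunk size itself are all assumptions to be tuned per system.)
Code:
CHUNK = 16384                           ; roughly half of a small cache; tune per CPU
process_chunks:
        mov     rcx,CHUNK/8
.touch:                                 ; pass 1: read the chunk so it lands in the cache
        mov     rax,[rsi+rcx*8-8]
        dec     rcx
        jnz     .touch
        mov     rcx,CHUNK/8
.copy:                                  ; pass 2: the data should now be cached, write it out
        mov     rax,[rsi+rcx*8-8]
        mov     [rdi+rcx*8-8],rax
        dec     rcx
        jnz     .copy
        add     rsi,CHUNK
        add     rdi,CHUNK
        sub     r8,CHUNK
        jnz     process_chunks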
Overclick 04 Apr 2021, 20:07
Caching is cool, but what about the instructions themselves? Do we have any alternative (faster) way to sum data cells?
For example:
Code:
        mov     ecx,0xFFFF
        xor     rbx,rbx
toBegin:
        imul    ebx,ecx,32
        mov     r8d,[Buffer1-32+ebx]
        add     [Buffer2-32+ebx],r8d
        mov     r9d,[Buffer1-28+ebx]
        add     [Buffer2-28+ebx],r9d
        mov     r10d,[Buffer1-24+ebx]
        add     [Buffer2-24+ebx],r10d
        mov     r11d,[Buffer1-20+ebx]
        add     [Buffer2-20+ebx],r11d
        mov     r12d,[Buffer1-16+ebx]
        add     [Buffer2-16+ebx],r12d
        mov     r13d,[Buffer1-12+ebx]
        add     [Buffer2-12+ebx],r13d
        mov     r14d,[Buffer1-8+ebx]
        add     [Buffer2-8+ebx],r14d
        mov     r15d,[Buffer1-4+ebx]
        add     [Buffer2-4+ebx],r15d
        loop    toBegin
Overclick 04 Apr 2021, 21:48
Assembling that gives: error: relative jump out of range. ))) (loop only encodes a short ±128-byte jump, and the unrolled body above is longer than that.)
revolution 05 Apr 2021, 02:23
It is impossible for us to know what will be faster on your system in your application.
But I see no reason why you are limiting yourself to using 32-bit registers for the transfer. Perhaps using 64-bit reads and writes will help.
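(A minimal sketch of a 64-bit variant of the loop above; the rbx index and register choice are assumptions. Note the two dwords are summed separately, so a carry out of the low dword cannot spill into the high one.)
Code:
        mov     rax,[Buffer1+rbx]       ; one 64-bit read = two source dwords
        mov     rcx,[Buffer2+rbx]       ; one 64-bit read = two destination dwords
        mov     edx,eax
        add     edx,ecx                 ; low-dword sum (upper half of rdx is cleared)
        shr     rax,32
        shr     rcx,32
        add     eax,ecx                 ; high-dword sum (upper half of rax is cleared)
        shl     rax,32
        or      rax,rdx                 ; repack both sums into one qword
        mov     [Buffer2+rbx],rax       ; one 64-bit write instead of two read-modify-writes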
Overclick 05 Apr 2021, 03:14
The reason for 32-bit is that the data needs modifying, but yes, in this case it could be 64-bit. Aren't we missing some FPU/SSE instructions that work on packed data, something like "rep movs"?
Could you show some examples of how this differs between systems? I understand it for caching, but for basic operations? Do you mean bit alignment?
revolution 05 Apr 2021, 05:03
You can try using SSE/AVX to do 32-bit adds with 128/256/512-bit registers. It might or might not be "faster" on your system; how fast it will be at runtime depends a lot on other factors.
Maybe a tight loop with a single operation is faster for you. Or maybe an unrolled loop with multiple operations per iteration is faster. Perhaps doing blocks of reads first, then blocks of adds, then blocks of writes will be faster. Maybe the implicit read-modify-write add you show above is slower than an explicit two reads, an add, then a write. Or some combination of the above. Or something else entirely. You will have to try it in various ways and time it to see which works best in your case.
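(For illustration, a minimal sketch of the explicit read/add/write pattern with SSE2, four dwords per step, and the AVX2 equivalent, eight dwords per step; the 16/32-byte alignment of the buffers and the rbx index are assumptions.)
Code:
        ; SSE2: four 32-bit adds at once
        movdqa  xmm0,[Buffer1+rbx]      ; explicit read
        movdqa  xmm1,[Buffer2+rbx]      ; explicit read
        paddd   xmm1,xmm0               ; four independent 32-bit adds
        movdqa  [Buffer2+rbx],xmm1      ; explicit write

        ; AVX2: eight 32-bit adds at once
        vmovdqa ymm0,[Buffer1+rbx]
        vpaddd  ymm0,ymm0,[Buffer2+rbx]
        vmovdqa [Buffer2+rbx],ymm0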
Overclick 05 Apr 2021, 12:47
Actually I cannot use 64 bits per operation, or even 32 bits as a plain container. The data stream is 24+24+24... I have to move each calculated dword to the buffer, then advance the address by one byte less (3 instead of 4). Or I can shift the values inside some huge register and then unload that register to the buffer as a block. I have to search around SSE/AVX for the best way. It seems I will need to write a few different algorithms depending on which instructions/registers the users' CPUs support.
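(For illustration, a hypothetical scalar sketch of unpacking one packed 24-bit sample into a dword; rsi as the stream pointer and unsigned samples are assumptions.)
Code:
        movzx   eax,word [rsi]          ; low 16 bits of the sample
        movzx   edx,byte [rsi+2]        ; top 8 bits
        shl     edx,16
        or      eax,edx                 ; eax now holds the sample as a dword
        add     rsi,3                   ; packed 24-bit samples are 3 bytes apart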
revolution 05 Apr 2021, 12:56
Consider that in most cases the CPU will be much faster than the memory bus. So doing more work in the CPU and transferring fewer bytes to/from memory is usually "faster". So keeping the memory format as tightly packed 24-bit data and using the CPU to unpack, compute, then repack might be a win for you.
Also, when using the cache, you can have data transferring to/from the cache while the CPU is computing other data at the same time. This overlap of operations is where a lot of the speed-up can be realised.
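(A matching hypothetical sketch of the repack step, writing a computed dword back into the packed 24-bit stream; it assumes the value in eax fits in 24 bits and rdi points at the packed output.)
Code:
        mov     [rdi],ax                ; low 16 bits of the result
        shr     eax,16
        mov     [rdi+2],al              ; top 8 bits
        add     rdi,3                   ; advance by 3 bytes, keeping the stream packed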
Overclick 05 Apr 2021, 14:23
How can I lock system-provided buffers in the cache?
Actually the buffers are provided by interfaces. I may try to VirtualLock them and split my job into parts of the memory page size, to make sure each part is cached just once. I'm just not sure those buffers will let me do anything like that.

Last edited by Overclick on 05 Apr 2021, 14:36; edited 1 time in total
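(Note: VirtualLock only pins pages into physical RAM, i.e. the working set; it does not pin anything into the CPU cache. If it is tried anyway, a hypothetical call sketch, assuming the Win64 calling convention and placeholder names pBuffer/cbBuffer for the interface-provided buffer:)
Code:
        sub     rsp,28h                 ; shadow space, keeps rsp 16-byte aligned
        mov     rcx,[pBuffer]           ; lpAddress of the buffer
        mov     rdx,[cbBuffer]          ; dwSize in bytes
        call    [VirtualLock]           ; returns non-zero in rax on success
        add     rsp,28h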
revolution 05 Apr 2021, 14:30
Overclick wrote:
How can I lock system-provided buffers in the cache?

To get data into the cache you only need to read it. To write stuff out of the cache you only need to write it.
Code:
        mov     rax,[rsi]       ; read from memory into cache, slow
        mov     rax,[rsi+64]    ; keep reading more into cache, here we assume the cache line is 64 bytes long, slow
        ; repeat a few K times
        ;...
        mov     rax,[rsi]       ; get data out of the cached read from above, this should be very fast if the data hasn't been evicted yet
        add     rax,rbx         ; compute something
        mov     [rdi],rax       ; write to final destination, or use movnt, slow
Overclick 05 Apr 2021, 14:42
I edited my comment above. You're right. I just want to be sure that the memory pages are not reloaded at random times.
revolution 05 Apr 2021, 14:56
If you are worried about the OS paging out your memory then you are looking at the wrong level, IMO. That is far above in the hierarchy and won't have any effect on performance at the instruction level. If the OS is paging you out to disk then you have already lost, so don't worry about it.
Overclick 05 Apr 2021, 23:31
I meant prefetched lines, not pages (my mistake).
revolution 06 Apr 2021, 04:46
Overclick wrote:
I meant prefetched lines, not pages (my mistake)

Overclick wrote:
I may try to VirtualLock them ...
Roman 01 May 2021, 04:55
If Intel designed good asm instructions, we would have instructions like these:
Code:
        ; Very easy, isn't it?
        ; data offset is a reg, value or [mem]; num loops is a reg, value or [mem]
        movsslp xmm1, source mem adr or reg, dest mem adr or reg, data offset, num loops
        addsslp xmm1, source mem adr or reg, dest mem adr or reg, data offset, num loops
        mulsslp xmm1, source mem adr or reg, dest mem adr or reg, data offset, num loops
But not this crazy style!
Code:
        mov     r8d,[Buffer1-32+ebx]
        add     [Buffer2-32+ebx],r8d
        mov     r9d,[Buffer1-28+ebx]
        add     [Buffer2-28+ebx],r9d
        mov     r10d,[Buffer1-24+ebx]
        add     [Buffer2-24+ebx],r10d
        mov     r11d,[Buffer1-20+ebx]
        add     [Buffer2-20+ebx],r11d
        mov     r12d,[Buffer1-16+ebx]
        add     [Buffer2-16+ebx],r12d
        mov     r13d,[Buffer1-12+ebx]
        add     [Buffer2-12+ebx],r13d
        mov     r14d,[Buffer1-8+ebx]
        add     [Buffer2-8+ebx],r14d
bitRAKE 01 May 2021, 08:36
Code:
        ; generates the same seven mov/add pairs (r8d..r14d) as the post above
        repeat 7, reg:8
                mov     r#reg#d,[Buffer1 - (9-%) shl 2 + rbx]
                add     [Buffer2 - (9-%) shl 2 + rbx],r#reg#d
        end repeat

_________________
¯\(°_o)/¯ "languages are not safe - uses can be" Bjarne Stroustrup
DimonSoft 01 May 2021, 08:43
Roman wrote:
If Intel designed good asm instructions...

Are you sure you’re trying to write in asm? Maybe Python is the way to go? And why are you mixing assemblers and Intel processors’ ISA?
Roman 01 May 2021, 09:34
DimonSoft,
Read the first post in this thread.
Second: why should I use Python? I want to use normal asm SSE commands!
Third: why are you surprised by my proposal (the movsslp family) but not surprised by the old asm command rep movsb?!
Overclick 01 May 2021, 10:22
I'm looking for FPU/SIMD instructions that do multiple add operations in the same clock tick.
My buffers' values are 32-bit and I can load several of them into some bigger register, from the lower dword to the higher. Example:
Code:
struct BUFFER
        frame1  dd ?
        frame2  dd ?
        ...
ends

buffer1 BUFFER
buffer2 BUFFER
...

.code
        vmovdqa ymm1,[buffer1]
        vmovdqa ymm2,ymm1               ; (the copy is not strictly needed: vpaddd ymm1,ymm1,[buffer2] would do)
        vpaddd  ymm1,ymm2,[buffer2]
        vmovdqa [buffer1],ymm1