flat assembler
Message board for the users of flat assembler.

Add stream buffer to another

Overclick
Joined: 11 Jul 2020
Posts: 394
Location: Ukraine
Hi
How can I add one stream buffer to another as fast as possible? SSE or whatever works.
Code:
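        mov             rcx,0xFFFF                      ; element counter, counts down to 1
toBegin:
        mov             r10d,[Buffer1-4+(rcx*4)]        ; load one dword from the source
        add             [Buffer2-4+(rcx*4)],r10d        ; add it to the destination in place
        loop            toBegin                         ; decrement rcx, repeat until it reaches 0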
    
Post 04 Apr 2021, 19:07
revolution
When all else fails, read the source
Joined: 24 Aug 2004
Posts: 18222
Location: In your JS exploiting you and your system
Usually to make "fast" data transfers you will make good use of the cache.

So you need to determine the cache size of your system.
Divide the data into chunks about half the cache size.
Read in one section to a cache buffer, and write out a different section from the other cache buffer. So you alternate the read and write to/from the cache buffers.

The details can get very tricky, and the performance is heavily system dependent. So posting generalised code is almost impossible, since it would have to account for all the different system variants: memory channel timing, CPU type, etc. But with the right code to match the system you could get up to many GB/s.
Post 04 Apr 2021, 19:23
Overclick
Caching is cool, but what about the instructions themselves? Is there an alternative (faster) way to sum the data cells?
For example:
Code:
        mov             ecx,0xFFFF
        xor             rbx,rbx
toBegin:
        imul            ebx,ecx,32
        mov             r8d,[Buffer1-32+ebx]
        add             [Buffer2-32+ebx],r8d
        
        mov             r9d,[Buffer1-28+ebx]
        add             [Buffer2-28+ebx],r9d
        
        mov             r10d,[Buffer1-24+ebx]
        add             [Buffer2-24+ebx],r10d
        
        mov             r11d,[Buffer1-20+ebx]
        add             [Buffer2-20+ebx],r11d
        
        mov             r12d,[Buffer1-16+ebx]
        add             [Buffer2-16+ebx],r12d
        
        mov             r13d,[Buffer1-12+ebx]
        add             [Buffer2-12+ebx],r13d
        
        mov             r14d,[Buffer1-8+ebx]
        add             [Buffer2-8+ebx],r14d
        
        mov             r15d,[Buffer1-4+ebx]
        add             [Buffer2-4+ebx],r15d
loop    toBegin      
Post 04 Apr 2021, 20:07
Overclick
error: relative jump out of range. )))
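A note on this error: loop only exists in a short form with an 8-bit signed displacement, so its target must lie within -128..+127 bytes of the end of the instruction, and the unrolled body above is larger than that. A minimal sketch of the usual workaround, assuming the same toBegin label and the counter in ecx (unlike loop, dec modifies the flags):
Code:
        dec             ecx                     ; count down; ZF is set when ecx reaches zero
        jnz             toBegin                 ; jnz also has a near form, so the assembler can reach a distant label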
Post 04 Apr 2021, 21:48
revolution
It is impossible for us to know what will be faster on your system in your application.

But I see no reason why you are limiting yourself to using 32-bit registers for the transfer. Perhaps using 64-bit reads and writes will help.
Post 05 Apr 2021, 02:23
Overclick
The reason for 32-bit is that the values need modifying, but yes, in this case it should be 64-bit. Aren't there some FPU/SSE instructions that work on whole blocks, like "rep movs"?
Could you show some examples of how this differs between systems? I understand it for caching, but for basic operations? Do you mean bit alignment?
Post 05 Apr 2021, 03:14
revolution
You can try using SSE/AVX to do 32-bit adds with 128/256/512-bit registers. It might or might not be "faster" on your system. How fast it is at runtime depends a lot on other factors.

Maybe a tight loop with a single operation is faster for you. Or maybe an unrolled loop with multiple operations per loop is faster.

Perhaps doing blocks of reads first, and then blocks of adds, and then blocks of writes will be faster.

Maybe using the implicit read-modify-write add as you show above is slower than an explicit two reads, add, then write.

Or some combination of the above.

Or something else entirely.

You will have to try it in various ways and time it to see which works best in your case.
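
As one concrete starting point for the SSE suggestion, here is a minimal SSE2 sketch that adds four dwords per iteration. It is only one variant to time against the others, not a claim about what is fastest. The constant COUNT and the label addLoop are hypothetical, and the code assumes both buffers are 16-byte aligned and that the dword count is a multiple of 4.
Code:
; Assumptions (not from the thread): Buffer1 and Buffer2 are 16-byte aligned,
; COUNT is the number of dwords and a multiple of 4.
COUNT = 0x10000
        xor             eax,eax                 ; rax = byte offset into both buffers
addLoop:
        movdqa          xmm0,[Buffer1+rax]      ; load 4 dwords from the source
        paddd           xmm0,[Buffer2+rax]      ; add 4 dwords from the destination (wrapping 32-bit adds)
        movdqa          [Buffer2+rax],xmm0      ; store the 4 sums back to the destination
        add             rax,16                  ; advance by one XMM register
        cmp             rax,COUNT*4             ; processed all bytes?
        jb              addLoop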
Post 05 Apr 2021, 05:03
Overclick
Actually I cannot use 64 bits per operation; even 32 bits serves only as a container. The data stream is 24+24+24... I have to move each calculated dword to the buffer, then advance the address by one byte less. Or I can shift the values inside some huge register and then store that register to the buffer as a block. I have to search around SSE/AVX for the best way. It seems I need to write a few different algorithms, depending on which instructions/registers the users' CPUs support.
Post 05 Apr 2021, 12:47
revolution
Consider that in most cases the CPU will be much faster than the memory bus. So doing more work in the CPU and transferring fewer bytes to/from memory is usually "faster". So having the memory format use tightly packed 24-bit data and using the CPU to unpack, compute, then repack might be a win for you.

Also, when using the cache, you can have the data transfer to/from the cache while at the same time the CPU is computing other data. This overlap of operations is where a lot of speed-up can be realised.
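
A minimal scalar sketch of the unpack, compute, repack idea for a single 24-bit sample. The register roles are assumptions for illustration: rsi and rdi point into the two packed streams, the samples are signed and little-endian, and at least one readable byte follows each sample because the dword load reads four bytes.
Code:
; Assumptions (not from the thread): signed little-endian 24-bit samples,
; rsi = source stream, rdi = destination stream, one spare byte readable past each sample.
        mov             eax,[rsi]               ; read 4 bytes: the sample plus 1 byte of the next one
        shl             eax,8                   ; push the extra top byte out
        sar             eax,8                   ; shift back, sign-extending 24 bits to 32
        mov             edx,[rdi]               ; same unpacking for the destination sample
        shl             edx,8
        sar             edx,8
        add             eax,edx                 ; compute on full 32-bit values
        mov             [rdi],ax                ; repack: store bits 0..15
        shr             eax,16
        mov             [rdi+2],al              ; store bits 16..23, leaving the next sample untouched
        add             rsi,3                   ; advance both pointers by one 24-bit sample
        add             rdi,3
Only the low 24 bits of the sum are stored, so a result that does not fit in 24 bits wraps.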
Post 05 Apr 2021, 12:56
Overclick
How can I lock system-provided buffers in the cache?

Actually the buffers are provided by interfaces. I may try to VirtualLock them and split my job into parts the size of a memory page to make sure each is cached just once. I'm just not sure the buffers let me do anything like that.


Last edited by Overclick on 05 Apr 2021, 14:36; edited 1 time in total
Post 05 Apr 2021, 14:23
revolution
Overclick wrote:
How can I lock system-provided buffers in the cache?
You just read and write memory using normal instructions. It is cached automatically.

To get data into cache you only need to read it. To write stuff out of cache you only need to write it.
Code:
mov rax,[rsi] ; read from memory into cache, slow
mov rax,[rsi+64] ; keep reading more into cache, here we assume the cache line is 64 bytes long, slow
; repeat a few K times
;...
mov rax,[rsi] ; get data out of the cached read from above, this should be very fast if the data hasn't been evicted yet
add rax,rbx ; compute something
mov [rdi],rax ; write to final destination, or use movnt, slow    
Also look into movnt for bypassing the cache when you know the data isn't needed again later.
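A minimal sketch of the movnt idea using movntdq, for the case where the sums go to a separate output buffer that is not read again soon. The register roles and the label ntLoop are assumptions for illustration: rsi and rdx point to the two sources, rdi to the output, all 16-byte aligned, and rcx holds the number of 16-byte blocks (at least 1).
Code:
; Assumptions (not from the thread): rsi, rdx = sources, rdi = output, all 16-byte aligned,
; rcx = number of 16-byte blocks, at least 1.
ntLoop:
        movdqa          xmm0,[rsi]              ; normal (cached) load from the first source
        paddd           xmm0,[rdx]              ; add 4 dwords from the second source
        movntdq         [rdi],xmm0              ; non-temporal store: write without keeping the line in cache
        add             rsi,16
        add             rdx,16
        add             rdi,16
        dec             rcx
        jnz             ntLoop
        sfence                                  ; order the non-temporal stores before any later stores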
Post 05 Apr 2021, 14:30
Overclick
I edited my comment above. You're right. I just want to be sure that memory pages are not reloaded at random times.
Post 05 Apr 2021, 14:42
revolution
If you are worried about the OS paging out your memory then you are looking at the wrong level IMO. That is way above in the hierarchy and won't have any effect on performance at the instruction level. If the OS is paging you out to disk then you have already lost, so don't worry about it.
Post 05 Apr 2021, 14:56
Overclick
I mean prefetched lines, not pages (my mistake)
Post 05 Apr 2021, 23:31
revolution
Overclick wrote:
I mean prefetched lines, not pages (my mistake)
Okay, but ...
Overclick wrote:
I may try to VirtualLock it ...
... VirtualLock won't help in any way. That is the wrong place to be looking.
Post 06 Apr 2021, 04:46
Roman
Joined: 21 Apr 2012
Posts: 1082
If Intel designed good asm instructions,
we would have instructions like these:
Code:
;Very easy isn't it ? 

;data offset this reg or value or [mem].  num loops this reg or value or [mem]
movsslp xmm1,Source mem adr or reg,Dest mem adr or reg,data offset,num loops 
addsslp xmm1,Source mem adr or reg,Dest mem adr or reg,data offset,num loops 
mulsslp xmm1,Source mem adr or reg,Dest mem adr or reg,data offset,num loops 
    

And not this crazy style!
Code:
        mov             r8d,[Buffer1-32+ebx]
        add             [Buffer2-32+ebx],r8d
        
        mov             r9d,[Buffer1-28+ebx]
        add             [Buffer2-28+ebx],r9d
        
        mov             r10d,[Buffer1-24+ebx]
        add             [Buffer2-24+ebx],r10d
        
        mov             r11d,[Buffer1-20+ebx]
        add             [Buffer2-20+ebx],r11d
        
        mov             r12d,[Buffer1-16+ebx]
        add             [Buffer2-16+ebx],r12d
        
        mov             r13d,[Buffer1-12+ebx]
        add             [Buffer2-12+ebx],r13d
        
        mov             r14d,[Buffer1-8+ebx]
        add             [Buffer2-8+ebx],r14d    
Post 01 May 2021, 04:55
bitRAKE
Joined: 21 Jul 2003
Posts: 3307
Location: vpcmipstrm
Code:
repeat 7, reg:8
        mov r#reg#d,[Buffer1 - (9-%)shl 2 + rbx]
        add [Buffer2 - (9-%)shl 2 + rbx],r#reg#d
end repeat    

Post 01 May 2021, 08:36
DimonSoft
Joined: 03 Mar 2010
Posts: 958
Location: Belarus
Roman wrote:
If Intel designed good asm instructions,
we would have instructions like these:
Code:
;Very easy isn't it ? 

;data offset this reg or value or [mem].  num loops this reg or value or [mem]
movsslp xmm1,Source mem adr or reg,Dest mem adr or reg,data offset,num loops 
addsslp xmm1,Source mem adr or reg,Dest mem adr or reg,data offset,num loops 
mulsslp xmm1,Source mem adr or reg,Dest mem adr or reg,data offset,num loops 
    

Are you sure you’re trying to write in asm? Maybe Python is the way to go? And why are you mixing up assemblers and the Intel processors’ ISA?
Post 01 May 2021, 08:43
Roman
DimonSoft
Read the first post in this thread.
Quote:

How can I add one stream buffer to another as fast as possible? SSE or whatever works.


Second.
Why must I use Python? I want to use normal asm SSE instructions!

Third.
Why are you surprised by my proposal (I mean movsslp) and not surprised by the old asm instruction repe movsb?!
Post 01 May 2021, 09:34
Overclick
I'm looking for FPU/SIMD instructions that do multiple add operations in the same clock cycle.
My buffer's values are 32-bit and I can load several of them into a bigger register, from the lowest dword to the highest. Example:
Code:
struct BUFFER
 frame1 dd ?
 frame2 dd ?
 ...
ends
buffer1 BUFFER
buffer2 BUFFER
...
.code
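 vmovdqa ymm1,[buffer1]       ; load 8 dwords; vmovdqa needs a 32-byte aligned address
 vpaddd ymm1,ymm1,[buffer2]   ; add 8 dwords from buffer2 (the extra copy into ymm2 is not needed)
 vmovdqa [buffer1],ymm1       ; store the 8 sums back to buffer1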
    
Post 01 May 2021, 10:22

Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.
