flat assembler
Message board for the users of flat assembler.
revolution
Usually, to make "fast" data transfers you will want to make good use of the cache.
So you need to determine the cache size of your system and divide the data into chunks of about half the cache size. Read one section into a cache buffer, and write out a different section from the other cache buffer, so that you alternate reads and writes to/from the cache buffers. The details can get very tricky, and the performance is heavily system dependent, so posting generalised code is almost impossible: it would have to account for all the different system variants for memory channel timing, CPU type, etc. But with the right code to match the system you could get up to many GB/s.
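The chunking idea can be sketched roughly in fasm-style code. This is only an illustration, not a tuned implementation: it assumes a 256 KiB cache (so 128 KiB chunks), hypothetical src/dst buffers, and a TOTAL size that divides evenly into chunks; real code would query the actual cache size at runtime (e.g. via cpuid).

Code:
CHUNK = 128*1024                ; half of an assumed 256 KiB cache
        mov     rsi,src         ; hypothetical source buffer
        mov     rdi,dst         ; hypothetical destination buffer
        mov     rbp,TOTAL/CHUNK ; assumes TOTAL is a multiple of CHUNK
.chunk:
        mov     rcx,CHUNK/8
.read:                          ; pass 1: touch the chunk so it is pulled into cache
        mov     rax,[rsi]
        add     rsi,8
        dec     rcx
        jnz     .read
        sub     rsi,CHUNK       ; rewind to the start of the chunk
        mov     rcx,CHUNK/8
.copy:                          ; pass 2: reads now come from cache; write the chunk out
        mov     rax,[rsi]
        mov     [rdi],rax
        add     rsi,8
        add     rdi,8
        dec     rcx
        jnz     .copy
        dec     rbp
        jnz     .chunk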
Overclick
Caching is cool, but what about the instructions themselves? Do we have any alternative (faster) way to sum data cells?
For example:

Code:
        mov     ecx,0xFFFF
        xor     rbx,rbx
toBegin:
        imul    ebx,ecx,32
        mov     r8d,[Buffer1-32+ebx]
        add     [Buffer2-32+ebx],r8d
        mov     r9d,[Buffer1-28+ebx]
        add     [Buffer2-28+ebx],r9d
        mov     r10d,[Buffer1-24+ebx]
        add     [Buffer2-24+ebx],r10d
        mov     r11d,[Buffer1-20+ebx]
        add     [Buffer2-20+ebx],r11d
        mov     r12d,[Buffer1-16+ebx]
        add     [Buffer2-16+ebx],r12d
        mov     r13d,[Buffer1-12+ebx]
        add     [Buffer2-12+ebx],r13d
        mov     r14d,[Buffer1-8+ebx]
        add     [Buffer2-8+ebx],r14d
        mov     r15d,[Buffer1-4+ebx]
        add     [Buffer2-4+ebx],r15d
        loop    toBegin
Overclick
error: relative jump out of range. )))
revolution
It is impossible for us to know what will be faster on your system in your application.
But I see no reason why you are limiting yourself to 32-bit registers for the transfer. Perhaps using 64-bit reads and writes will help.
Overclick
The reason for 32-bit is that the data needs modifying, but yes, in this case it could be 64-bit. Aren't there some FPU/SSE instructions for working with packed data, something like "rep movs"?
Could you show some examples of the differences between systems? I understand about caching, but what about the basic operations? Do you mean bit alignment?
revolution
You can try using SSE/AVX to do 32-bit adds with 128/256/512-bit registers. It might or might not be "faster" on your system; it depends a lot on other factors how fast it will be at runtime.
Maybe a tight loop with a single operation is faster for you. Or maybe an unrolled loop with multiple operations per iteration is faster. Perhaps doing blocks of reads first, then blocks of adds, then blocks of writes will be faster. Maybe the implicit read-modify-write add as you show above is slower than an explicit two reads, add, then write. Or some combination of the above. Or something else entirely. You will have to try it in various ways and time it to see which works best in your case.
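As a rough sketch, the loop posted above could be rewritten with SSE to add four dwords at a time. This assumes Buffer1 and Buffer2 are 16-byte aligned (movdqa and paddd with a memory operand fault otherwise); it keeps the same 32 bytes per iteration, and replaces loop with dec/jnz, which uses a near jump and so avoids the out-of-range error with a large unrolled body.

Code:
        mov     ecx,0xFFFF               ; same block count as the loop above
        xor     rbx,rbx
toBegin:
        imul    ebx,ecx,32
        movdqa  xmm0,[Buffer1-32+rbx]    ; load 4 dwords
        movdqa  xmm1,[Buffer1-16+rbx]    ; load 4 more dwords
        paddd   xmm0,[Buffer2-32+rbx]    ; packed 32-bit adds
        paddd   xmm1,[Buffer2-16+rbx]
        movdqa  [Buffer2-32+rbx],xmm0    ; write the sums back
        movdqa  [Buffer2-16+rbx],xmm1
        dec     ecx                      ; dec/jnz instead of loop: near jump, no rel8 range limit
        jnz     toBegin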
Overclick
Actually, I cannot use 64 bits per operation, or even 32 bits except as a container. The data stream is 24+24+24... I have to move each calculated dword to the buffer, then shift the address back by one byte. Or I can shift it inside some huge register and then unload that register to the buffer as a block. I'll have to search around SSE/AVX for the best way. It seems I need to write a few different algorithms depending on which instructions/registers the users' CPUs support.
revolution
Consider that in most cases the CPU will be much faster than the memory bus. So doing more work in the CPU and transferring fewer bytes to/from memory is usually "faster". Having the memory format use tightly packed 24-bit data and using the CPU to unpack, compute, then repack might be a win for you.
Also, when using the cache, you can have data transferring to/from the cache while at the same time the CPU is computing other data. This overlap of operations is where a lot of the speed-up can be realised.
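For tightly packed 24-bit data, a minimal unpack/compute/repack step might look like this sketch. The names are assumptions: rsi points into the packed input stream, rdi into the packed output, and the add of ebx is just a placeholder for the real computation.

Code:
        movzx   eax,word [rsi]      ; low 16 bits of the 24-bit value
        movzx   edx,byte [rsi+2]    ; high 8 bits
        shl     edx,16
        or      eax,edx             ; eax = unpacked 24-bit value
        add     eax,ebx             ; compute something (placeholder)
        mov     [rdi],ax            ; repack: store low 16 bits
        shr     eax,16
        mov     [rdi+2],al          ; store high 8 bits
        add     rsi,3               ; advance 3 bytes in the packed streams
        add     rdi,3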
Overclick
How can I lock system-provided buffers in the cache?
Actually, the buffers are provided by the interfaces. I may try to VirtualLock them and split my job by memory page size to make sure each page is cached just once. I'm just not sure those buffers let me do anything like that.

Last edited by Overclick on 05 Apr 2021, 14:36; edited 1 time in total
revolution
Overclick wrote: How can I lock system-provided buffers in the cache?

To get data into cache you only need to read it. To write stuff out of cache you only need to write it.

Code:
        mov     rax,[rsi]    ; read from memory into cache, slow
        mov     rax,[rsi+64] ; keep reading more into cache, here we assume the cache line is 64 bytes long, slow
        ; repeat a few K times
        ;...
        mov     rax,[rsi]    ; get data out of the cached read from above, this should be very fast if the data hasn't been evicted yet
        add     rax,rbx      ; compute something
        mov     [rdi],rax    ; write to final destination, or use movnt, slow
Overclick
I edited my comment above. You're right. I just want to be sure that the memory pages are not reloaded at random times.
revolution
If you are worried about the OS paging out your memory then you are looking at the wrong level, IMO. That is way above in the hierarchy and won't have any effect on performance at the instruction level. If the OS is paging you out to disk then you have already lost, so don't worry about it.
Overclick
I mean prefetched lines, not pages (my mistake).
revolution
Overclick wrote: I mean prefetched lines, not pages (my mistake).
Overclick wrote: I may try to VirtualLock it ...
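For what it's worth, rather than trying to lock lines in the cache, you can hint the hardware with the SSE prefetch instructions. A sketch, assuming rsi walks a source buffer, rcx counts iterations, and 64-byte cache lines; note that prefetch is only a hint, and the line can still be evicted before use.

Code:
.next:
        prefetcht0 [rsi+512]   ; hint: request a line several iterations ahead, into all cache levels
        mov     rax,[rsi]      ; this load should usually hit cache thanks to earlier hints
        add     rax,rbx        ; compute something (placeholder)
        mov     [rdi],rax
        add     rsi,8
        add     rdi,8
        dec     rcx
        jnz     .next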
Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.