flat assembler
Message board for the users of flat assembler.

Memory Move Optimization (BTW, I have read the other thread)

jbojarczuk



Joined: 21 Jun 2006
Posts: 27
Hi,

I have a question about memory moves with an unaligned source or destination in 32-bit code.

Suppose the source or the destination is unaligned, and it is only possible to align one of them. Would it not be faster to align the source and keep the pending unaligned destination bytes in a register, shifting each dword read from the source so that all reads and writes are aligned?

Code:
... align source

        mov eax , 3
        sub ecx , 8
        jb .Trim        ;; I will move bytes in chunks of 8.
        and eax , edi ;; this case could be implemented through
        jz .Aligned   ;; a table jump.
        cmp eax , 2
        jb .UnAlign1
        je .UnAlign2
        ;ja .UnAlign3
.UnAlign3:
                mov eax , [esi - 3]
.NextQword:
                mov edx , [esi + 1]
                shr eax , 24
                mov ebx , edx
                shl edx , 8
                or edx , eax
                mov [edi - 3] , edx
                
                mov edx , [esi + 5]
                shr ebx , 24
                mov eax , edx
                shl edx , 8
                or edx , ebx
                mov [edi + 1] , edx
                
                add esi , 8
                add edi , 8
                sub ecx , 8     ;; sub goes last so that ja tests its flags

                ja .NextQword
    


I am using about 4 (short, fast, DirectPath) instructions in order to keep all memory references aligned (that would be about 1½ cycles on an AMD CPU). Would this be faster or slower than the obvious solution?

Code:
...Initialization
.NextQword:
                mov eax , [esi]
                mov [edi] , eax
                mov eax , [esi + 4]     ;; second dword of the 8-byte chunk
                mov [edi + 4] , eax
                add esi , 8
                add edi , 8
                sub ecx , 8
                ja .NextQword           ;; same loop condition as the first listing
    


Thanks for your attention...

PS: I know that I could optimize and remove the add esi and add edi lines by using ecx in the address, but that is not the point here.
Post 21 Feb 2007, 00:51
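For readers who want to try the shift-and-merge idea from the post above before writing the assembly, here is a minimal C sketch. It is not jbojarczuk's routine: the names `shift_merge_copy`, `load32` and `store32` are made up for illustration, and it assumes a little-endian CPU and non-overlapping buffers. It aligns the destination with a byte loop, then combines aligned source dwords with shifts so that every dword access in the main loop is aligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Dword load/store. Callers only pass 4-byte-aligned addresses; memcpy
   keeps the C well defined and normally compiles to a single mov. */
static uint32_t load32(const unsigned char *p) { uint32_t w; memcpy(&w, p, 4); return w; }
static void store32(unsigned char *p, uint32_t w) { memcpy(p, &w, 4); }

/* Hypothetical helper: little-endian only, regions must not overlap. */
void *shift_merge_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: copy bytes until the destination is dword aligned. */
    while (n > 0 && ((uintptr_t)d & 3) != 0) { *d++ = *s++; n--; }

    unsigned off = (unsigned)((uintptr_t)s & 3);   /* source misalignment */
    if (off == 0) {
        /* Both pointers aligned: plain dword copy. */
        for (; n >= 4; n -= 4, s += 4, d += 4)
            store32(d, load32(s));
    } else {
        unsigned k  = 4 - off;      /* bytes left in the current source dword */
        unsigned rs = 8 * off;      /* right shift for the carried dword */
        unsigned ls = 8 * k;        /* left shift for the freshly loaded dword */
        if (n >= k + 4) {
            /* Prime the carry with s[0..k-1] in its top k bytes, without
               reading in front of the buffer. */
            uint32_t lo = 0;
            for (unsigned i = 0; i < k; i++)
                lo |= (uint32_t)s[i] << (8 * (off + i));
            while (n >= k + 4) {
                uint32_t hi = load32(s + k);           /* aligned load  */
                store32(d, (lo >> rs) | (hi << ls));   /* aligned store */
                lo = hi;
                s += 4; d += 4; n -= 4;
            }
        }
    }

    /* Tail: whatever is left goes byte by byte. */
    while (n > 0) { *d++ = *s++; n--; }
    return dst;
}
```

Comparing the result against memcpy for every source and destination offset is a cheap way to check the shifts before timing anything.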
r22



Joined: 27 Dec 2004
Posts: 805
I think aligning the Dest would be better, since store alignment is more important than load alignment.

The memmove/RtlMoveMemory function in the Windows XP ntdll is very well optimized, and it aligns the destination. Also, using prefetch for larger moves seems to improve performance.

Some optimization guides suggest pairing the load and store instructions,
i.e.
movdqa xmm0,[esi]
movntdq [edi],xmm0
movdqa xmm1,[esi+16]
movntdq [edi+16],xmm1
But I've found that this is slower than the following
movdqa xmm0,[esi]
movdqa xmm1,[esi+16]
movdqa [edi],xmm0
movdqa [edi+16],xmm1
Post 25 Feb 2007, 06:37
jbojarczuk



Joined: 21 Jun 2006
Posts: 27
Hi,

Thanks for your reply.

The question I am posing here is not whether it is better to align the destination or the source. I am asking about the possibility of keeping both of them aligned, at the cost of about 5 extra instructions for the shifts.

Has anyone tried this? Which would be faster (aligned with shift, losing about 1½ cycles to the shifting, or unaligned without shift)?
Post 27 Feb 2007, 18:20
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Quote:

Has anyone tried this? Which would be faster (aligned with shift, losing about 1½ cycles with the shifting, or unaligned without shift)?

Well, this is your semi-lucky day. I had a little extra time and energy, although not as much as I thought, and tried alignment with shift. See the file memcpy.asm at my website; the output is in memcpy.txt.
Post 01 Mar 2007, 10:58
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Hi Xorpd! How do you output the text? In my console window it's too long to fit and I can't scroll back far enough in the history.
"memcpy > memcpy.txt" prints nothing and even "memcpy | more" shows nothing. What seems to be the problem?

As much as I could gather (Q6600, 2 GB RAM):
Code:
   6    11    1152    1152
   6    12    1143    1134
   6    13    1179    1179
   6    14     765     765
   6    15    1143    1143
   7     0     990     990
   7     1    1125    1116
   7     2    1125    1125
   7     3    1116    1116
   7     4    1080    1080
   7     5    1143    1143
   7     6    1125    1125
   7     7     747     747
   7     8    1107    1107
   7     9    1098    1098
   7    10    1098    1098
   7    11    1161    1161
   7    12    1134    1143
   7    13    1188    1179
   7    14    1161    1161
   7    15     747     747
   8     0     639     639
   8     1    1134    1134
   8     2    1125    1116
   8     3    1089    1089
   8     4    1107    1107
   8     5    1134    1134
   8     6    1125    1125
   8     7    1116    1116
   8     8     729     729
   8     9    1143    1170
   8    10    1098    1089
   8    11    1080    1080
   8    12    1143    1143
   8    13    1179    1188
   8    14    1161    1161
   8    15    1170    1170
   9     0     990     990
   9     1     783     783
   9     2    1143    1143
   9     3    1098    1098
   9     4    1098    1098
   9     5    1152    1152
   9     6    1125    1116
   9     7    1116    1116
   9     8    1125    1116
   9     9     765     765
   9    10    1107    1098
   9    11    1089    1089
   9    12    1080    1080
   9    13    1179    1179
   9    14    1170    1179
   9    15    1161    1161
  10     0     981     981
  10     1    1116    1107
  10     2     756     756
  10     3    1098    1089
  10     4    1098    1107
  10     5    1143    1143
  10     6    1134    1143
  10     7    1107    1107
  10     8    1116    1125
  10     9    1116    1107
  10    10     756     747
  10    11    1107    1098
  10    12    1080    1080
  10    13    1152    1152
  10    14    1161    1161
  10    15    1161    1161
  11     0     990     990
  11     1    1116    1116
  11     2    1089    1089
  11     3     747     747
  11     4    1098    1098
  11     5    1143    1143
  11     6    1125    1125
  11     7    1125    1125
  11     8    1116    1116
  11     9    1107    1107
  11    10    1098    1098
  11    11     738     738
  11    12    1080    1080
  11    13    1152    1152
  11    14    1134    1134
  11    15    1152    1152
  12     0     990    1008
  12     1    1152    1161
  12     2    1134    1116
  12     3    1089    1089
  12     4     747     747
  12     5    1134    1134
  12     6    1125    1125
  12     7    1116    1116
  12     8    1098    1089
  12     9    1107    1107
  12    10    1098    1098
  12    11    1080    1071
  12    12     738     738
  12    13    1188    1152
  12    14    1125    1116
  12    15    1143    1143
  13     0     990     990
  13     1    1143    1161
  13     2    1134    1125
  13     3    1098    1098
  13     4    1098    1098
  13     5     783     792
  13     6    1125    1125
  13     7    1107    1116
  13     8    1116    1125
  13     9    1134    1134
  13    10    1089    1089
  13    11    1071    1080
  13    12    1080    1089
  13    13     783     774
  13    14    1125    1134
  13    15    1125    1116
  14     0     981     981
  14     1    1116    1125
  14     2    1143    1125
  14     3    1098    1098
  14     4    1107    1098
  14     5    1152    1143
  14     6     774     774
  14     7    1116    1107
  14     8    1125    1116
  14     9    1107    1107
  14    10    1116    1125
  14    11    1071    1080
  14    12    1089    1089
  14    13    1143    1134
  14    14     747     756
  14    15    1143    1143
  15     0     990     990
  15     1    1107    1107
  15     2    1161    1152
  15     3    1125    1116
  15     4    1098    1098
  15     5    1143    1143
  15     6    1125    1125
  15     7    8613     756
  15     8    1125    1116
  15     9    1107    1107
  15    10    1107    1098
  15    11    1098    1107
  15    12    1080    1080
  15    13    1143    1152
  15    14    1125    1125
  15    15     729     747
    
Post 16 Dec 2007, 12:40
r22



Joined: 27 Dec 2004
Posts: 805
jbojarczuk, the best thing you can do is optimize your code and benchmark it against the RtlMoveMemory API. How does your technique of keeping SRC and DST aligned scale to SSE instructions?

I tried to improve on XP's memory move, but the versions I created were never across-the-board improvements in ALL test scenarios.

As for trying to keep both the SRC and DEST aligned, it seems that on larger moves (> 4 KB) your extra logic would be slower than just aligning the DEST. If you assume 4 instructions at 1 cycle apiece (I'm being generous) versus an unaligned load with, at worst, a 3-cycle penalty, your method would still be slower.
BUT only a thorough benchmark using all possible alignments and a large subset of different move sizes would tell you for sure.
Post 16 Dec 2007, 23:23
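The "all possible alignments and a large subset of move sizes" sweep suggested above can be sketched in C as follows. This is only an illustration: `copy_fn` and `count_copy_failures` are hypothetical names, and the harness checks correctness against memmove rather than measuring time. A timing loop would wrap the `fn(...)` call in the same three nested loops.

```c
#include <stddef.h>
#include <string.h>

/* Signature shared by memmove and any hand-written copy routine. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Hypothetical harness: runs fn for every source and destination offset in
   0..15 and every length in 0..max_len (capped at 256). Returns the number
   of combinations whose destination buffer, guard bytes included, differs
   from what memmove produces. */
size_t count_copy_failures(copy_fn fn, size_t max_len)
{
    enum { PAD = 16, CAP = 256 };
    static unsigned char src[CAP + 2 * PAD];
    static unsigned char got[CAP + 2 * PAD];
    static unsigned char want[CAP + 2 * PAD];
    size_t failures = 0;

    if (max_len > CAP)
        max_len = CAP;
    for (size_t i = 0; i < sizeof src; i++)
        src[i] = (unsigned char)(i * 31u + 7u);     /* recognizable pattern */

    for (size_t s_off = 0; s_off < 16; s_off++)
        for (size_t d_off = 0; d_off < 16; d_off++)
            for (size_t n = 0; n <= max_len; n++) {
                memset(got, 0xA5, sizeof got);      /* guard value */
                memset(want, 0xA5, sizeof want);
                memmove(want + d_off, src + s_off, n);
                fn(got + d_off, src + s_off, n);
                if (memcmp(got, want, sizeof got) != 0)
                    failures++;
            }
    return failures;
}
```

Calling `count_copy_failures(my_copy, 256)` should return 0 for a correct routine; a nonzero count means some offset/length combination either copied the wrong bytes or wrote outside the requested range.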
Hayden



Joined: 06 Oct 2005
Posts: 132
To get rid of register dependencies, do something like this;
this is how you would move 1 paragraph (16 bytes). BTW, cache reads/writes usually occur 1 paragraph at a time.

mov ebx, [edi+00H]
mov edx, [edi+04H]
mov ecx, [edi+08H]
mov eax, [edi+0CH]
mov [esi+00H], ebx
mov [esi+04H], edx
mov [esi+08H], ecx
mov [esi+0CH], eax

_________________
New User.. Hayden McKay.
Post 17 Dec 2007, 08:01
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
OT Start:
---------
Have you noticed Windows Server 2003 doing some nifty HDD optimizations on its own? Yesterday I was repeatedly copying ~500 or so images (JPEGs, 2-8 MB in size) and the pattern was something like "12345........67890.......", and it was really impressive to see the background action.

The explanation is something like Hayden's post: it takes some 16 MB of data and (I think) moves it through the cache. The 16 MB read / 16 MB write / ... pattern shows up to the user as a jerky-looking transfer.
-----------
OT Ends

That seems to be the best way, and I usually do my memory moves like this:
Code:
movaps xmm0, [esi+00h]
movaps xmm1, [esi+10h]
movaps xmm2, [esi+20h]
movaps xmm3, [esi+30h]
movntps [edi+00h], xmm0
movntps [edi+10h], xmm1
movntps [edi+20h], xmm2
movntps [edi+30h], xmm3
    

MOVAPS has the shortest encoding of the 16-byte MOV group and averages 16 bytes per cycle of throughput. MOVNTPS has lower latency, just because the CPU thinks the data is already in single precision, and MOVNTPS is also shorter than the other non-temporal variants.
Post 17 Dec 2007, 11:00
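The load-four, store-four pattern from the post above can also be written with SSE intrinsics, which is convenient for experimenting from C. This is a sketch under assumptions, not Madis731's code: `copy_blocks64_nt` is a made-up name, both pointers must already be 16-byte aligned, the length must be a multiple of 64, and the compiler is free to schedule the loads and stores differently from the source order.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics; x86 and x86-64 only */

/* Hypothetical helper. Requirements (not checked): dst and src are
   16-byte aligned, do not overlap, and n_bytes is a multiple of 64. */
void copy_blocks64_nt(float *dst, const float *src, size_t n_bytes)
{
    size_t n_floats = n_bytes / sizeof(float);

    for (size_t i = 0; i < n_floats; i += 16) {
        /* Four aligned 16-byte loads (movaps) ... */
        __m128 x0 = _mm_load_ps(src + i);
        __m128 x1 = _mm_load_ps(src + i + 4);
        __m128 x2 = _mm_load_ps(src + i + 8);
        __m128 x3 = _mm_load_ps(src + i + 12);
        /* ... followed by four non-temporal stores (movntps). */
        _mm_stream_ps(dst + i,      x0);
        _mm_stream_ps(dst + i + 4,  x1);
        _mm_stream_ps(dst + i + 8,  x2);
        _mm_stream_ps(dst + i + 12, x3);
    }
    /* Non-temporal stores are weakly ordered; the fence orders them
       before any stores that follow. */
    _mm_sfence();
}
```

Because the stores bypass the cache, this form mainly makes sense for large blocks that will not be read again soon; for small copies the cached movaps store is the one to compare against.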
edfed



Joined: 20 Feb 2006
Posts: 4237
Location: 2018
don't forget

and esi,not 0fh
and edi,not 0fh
Post 17 Dec 2007, 12:15
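For reference, `and reg, not 0fh` clears the low four address bits, which rounds the pointer down to the previous 16-byte boundary. A small C sketch of that arithmetic follows; the function names are hypothetical, and the second helper reflects an assumption about how the hint would be used in a copy routine: since masking moves the pointer backwards, the bytes in front of the first boundary inside the buffer still have to be copied separately.

```c
#include <stddef.h>
#include <stdint.h>

/* Round an address down to a 16-byte boundary; the same arithmetic as
   `and reg, not 0fh`. */
uintptr_t align_down16(uintptr_t addr)
{
    return addr & ~(uintptr_t)0x0F;
}

/* Number of bytes from addr up to the next 16-byte boundary
   (0 if addr is already aligned). */
size_t bytes_until_aligned16(uintptr_t addr)
{
    return (size_t)(((uintptr_t)0 - addr) & 0x0F);
}
```

For example, address 0x1237 rounds down to 0x1230, and 9 head bytes separate it from the next boundary at 0x1240.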
Xorpd!



Joined: 21 Dec 2006
Posts: 161
To get all the output of my memcpy.asm program, issue the command
mode con: lines=1000
at the Windows console prompt before running. If that doesn't work, it may be because you are running the lame *32 cmd.exe. Microsoft wants to make this the default; has anyone ever noticed that Microsoft has better stuff that they want to give out by default? Find your x64 cmd.exe and try again. I don't know why redirection doesn't work. I noticed that myself and perhaps I'm doing some Windows thing wrong in my program.
Post 17 Dec 2007, 17:21
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/redirection.mspx?mfr=true

Quote:

To redirect all of the output, including handle 2 (that is, STDERR), from the ipconfig command to handle 1 (that is, STDOUT), and then redirect the output to Output.log, type:

ipconfig.exe>>output.log 2>&1
Post 17 Dec 2007, 23:58
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
The lines=1000 worked, but I don't get why outputting to a file doesn't work at all.

I was using the 64-bit version of cmd all the time, but for testing I now took the 32-bit version. Both act the same way, and memcpy>out.txt 2>&1 doesn't work.

http://support.microsoft.com/kb/110930
Actually the problem I'm having is that when ANY redirection is done, all the output is lost. Really weird...
Post 18 Dec 2007, 07:05
Xorpd!



Joined: 21 Dec 2006
Posts: 161
I don't get it either, else I would have fixed it. Maybe redirection is done by mainCRTStartup rather than cmd.exe. If anyone has a suggestion, here is the problem in detail: we assemble and link hello.asm:
Code:
format MS64 coff
; File: hello.asm
; Assembled with: fasm hello.asm
; Linked with link hello.obj /subsystem:console /defaultlib:kernel32.lib /entry:_main

extrn GetStdHandle
extrn 'WriteConsoleA' as WriteConsole
extrn ExitProcess

STD_OUTPUT_HANDLE = -11

section 'CODE' code readable executable align 16
align 16
public _main
_main:
; Align stack frame and leave room for procedure calls
   sub rsp, 40
; Save rbx
   mov [rsp+48], rbx
; Get handle for stdout
   mov ecx, STD_OUTPUT_HANDLE
   call GetStdHandle
   mov rbx, rax
; Write message
   mov rcx, rbx  ; hConsoleOutput
   mov rdx, mess ; lpBuffer
   mov r8d, mess_len ; nNumberOfCharsToWrite
   mov r9, result ; lpNumberOfCharsWritten
   mov qword[rsp+32], 0 ; lpReserved
   call WriteConsole
; Exit
   mov rbx, [rsp+48]
   mov ecx, 0
   call ExitProcess

section 'DATA' data readable writeable align 16
mess db 'Hello, world!', 0dh, 0ah
mess_len = $-mess
align 4
result dd ?
    

it runs fine:
Code:
C:\Asm\FASM\EXAMPLES\WIN64\my_example>fasm hello.asm
flat assembler  version 1.67.18  (1222417 kilobytes memory)
3 passes, 401 bytes.

C:\Asm\FASM\EXAMPLES\WIN64\my_example>link hello.obj /subsystem:console /defaultlib:kernel32.lib /entry:_main
Microsoft (R) Incremental Linker Version 8.00.40310.39
Copyright (C) Microsoft Corporation.  All rights reserved.


C:\Asm\FASM\EXAMPLES\WIN64\my_example>hello
Hello, world!
    

but it doesn't do redirection successfully
Code:
C:\Asm\FASM\EXAMPLES\WIN64\my_example>dir hello.txt
 Volume in drive C has no label.
 Volume Serial Number is CCC8-B430

 Directory of C:\Asm\FASM\EXAMPLES\WIN64\my_example

File Not Found

C:\Asm\FASM\EXAMPLES\WIN64\my_example>hello > hello.txt

C:\Asm\FASM\EXAMPLES\WIN64\my_example>dir hello.txt
 Volume in drive C has no label.
 Volume Serial Number is CCC8-B430

 Directory of C:\Asm\FASM\EXAMPLES\WIN64\my_example

12/18/2007  12:15 AM                 0 hello.txt
               1 File(s)              0 bytes
               0 Dir(s)  122,378,969,088 bytes free
    

Ah, here's a fix:
Code:
format MS64 coff
; File: hello.asm
; Assembled with: fasm hello.asm
; Linked with link hello.obj /subsystem:console /defaultlib:kernel32.lib /entry:_main

extrn GetStdHandle
extrn ExitProcess
extrn WriteFile

STD_OUTPUT_HANDLE = -11

section 'CODE' code readable executable align 16
align 16
public _main
_main:
; Align stack frame and leave room for procedure calls
   sub rsp, 40
; Save rbx
   mov [rsp+48], rbx
; Get handle for stdout
   mov ecx, STD_OUTPUT_HANDLE
   call GetStdHandle
   mov rbx, rax
; Write message
   mov rcx, rbx  ; hFile
   mov rdx, mess ; lpBuffer
   mov r8d, mess_len ; nNumberOfBytesToWrite
   mov r9, result ; lpNumberOfBytesWritten
   mov qword[rsp+32], 0 ; lpOverlapped
   call WriteFile
; Exit
   mov rbx, [rsp+48]
   mov ecx, 0
   call ExitProcess

section 'DATA' data readable writeable align 16
mess db 'Hello, world!', 0dh, 0ah
mess_len = $-mess
align 4
result dd ?
    

And now:
Code:
C:\Asm\FASM\EXAMPLES\WIN64\my_example>dir hello.txt
 Volume in drive C has no label.
 Volume Serial Number is CCC8-B430

 Directory of C:\Asm\FASM\EXAMPLES\WIN64\my_example

File Not Found

C:\Asm\FASM\EXAMPLES\WIN64\my_example>fasm hello.asm
flat assembler  version 1.67.18  (1195954 kilobytes memory)
3 passes, 397 bytes.

C:\Asm\FASM\EXAMPLES\WIN64\my_example>link hello.obj /subsystem:console /defaultlib:kernel32.lib /entry:_main
Microsoft (R) Incremental Linker Version 8.00.40310.39
Copyright (C) Microsoft Corporation.  All rights reserved.


C:\Asm\FASM\EXAMPLES\WIN64\my_example>hello > hello.txt

C:\Asm\FASM\EXAMPLES\WIN64\my_example>dir hello.txt
 Volume in drive C has no label.
 Volume Serial Number is CCC8-B430

 Directory of C:\Asm\FASM\EXAMPLES\WIN64\my_example

12/18/2007  12:26 AM                15 hello.txt
               1 File(s)             15 bytes
               0 Dir(s)  122,378,682,368 bytes free

C:\Asm\FASM\EXAMPLES\WIN64\my_example>type hello.txt
Hello, world!
    

So it seems that WriteConsoleA doesn't necessarily write anywhere if output has been redirected, but WriteFile works. Wish that Microsoft had made this more obvious in their documentation.
Post 18 Dec 2007, 07:30
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
Well, if you check MSDN/PlatformSDK... first, at the description:
Quote:

WriteConsole Function
Writes a character string to a console screen buffer beginning at the current cursor location.

- IMHO this does hint that it won't work with files, although I guess you could argue that "cursor location" could be interpreted as file pointer instead :)

And then there's this, from the parameter description:
Quote:

hConsoleOutput
A handle to the console screen buffer. The handle must have the GENERIC_WRITE access right. For more information, see Console Buffer Security and Access Rights.

- obviously a redirected STDOUT means your GetStdHandle() returns a file handle, not a console handle.

But the real and final indication would be the following, from Remarks (emphasis is mine):
Quote:

Both WriteConsole and WriteFile can be used for console I/O. While WriteConsole supports writing Unicode characters to a console screen buffer, WriteFile does not. However, WriteConsole fails if it is used with a standard handle that is redirected to a file.


But it's of course possible that this sentence hasn't been there in all PlatformSDKs and was only added relatively recently.

PS: add error checking after your API calls ;)
Post 18 Dec 2007, 12:28
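Putting the MSDN remark and the error-checking PS together, one way to write to standard output that survives redirection is to ask whether the handle is a console and pick the call accordingly. The C sketch below is an illustration only: `write_stdout` is a hypothetical helper, using GetConsoleMode as the console test is one common approach rather than the only one, and the non-Windows branch exists purely so the file compiles elsewhere.

```c
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: writes len bytes of buf to standard output and
   returns the number written, or -1 on failure. */
long write_stdout(const char *buf, unsigned long len)
{
#ifdef _WIN32
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0, written = 0;
    BOOL ok;

    if (h == NULL || h == INVALID_HANDLE_VALUE)
        return -1;

    if (GetConsoleMode(h, &mode))
        /* A real console: WriteConsoleA is fine here. */
        ok = WriteConsoleA(h, buf, (DWORD)len, &written, NULL);
    else
        /* Redirected to a file or pipe: WriteConsole would fail. */
        ok = WriteFile(h, buf, (DWORD)len, &written, NULL);

    if (!ok) {
        fprintf(stderr, "write failed, GetLastError() = %lu\n",
                (unsigned long)GetLastError());
        return -1;
    }
    return (long)written;
#else
    /* Fallback for non-Windows builds (an addition for portability only). */
    size_t done = fwrite(buf, 1, len, stdout);
    return done == len ? (long)done : -1;
#endif
}
```

In hello.asm the same idea would mean checking the return value in rax after GetStdHandle and after the write call, and reporting failures through something other than the call that just failed.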
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
grab FASMLIB and use it for easy formatted output.

Example:
Code:
format_string db "%10s %016X"

push [value]
push [name]
push format_string
call stdout.write.format
    


or with macros
Code:
stdcall stdout.write.format, "%10s %016X", [name], [value]
    
Post 18 Dec 2007, 12:36
Xorpd!



Joined: 21 Dec 2006
Posts: 161
@f0dder: There were several places the sentence about WriteConsole failing under redirection could be in the MSDN docs. Under which header did you find it? I was thinking about error checking, but I was going to output the results of the check using WriteConsole, which wouldn't have worked. I suppose I could have used a MessageBox, but I was somewhat flustered by the fact that I couldn't do any output at all via my first choice method.

@vid: So FASMLIB works under x64 now, too? I thought that project awaited a suitable FASM.DLL. It's hard enough to guess what invoke may do in x64, let alone what that stdcall macro does.
Post 18 Dec 2007, 16:18
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
xorpd: oh shit, I overlooked that the code is 64-bit :)

No, there is no 64-bit version of FASMLIB, and there won't be any time soon. That would have to be a complete rewrite. Sorry.
Post 18 Dec 2007, 16:44
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
Xorpd!: I simply looked up WriteConsole, but it is a pretty recent version of MSDN/PlatformSDK I'm using, and I know that (even important) information has been added over time, instead of being there from the beginning; nobody's perfect, and certainly not Microsoft either :)
Post 18 Dec 2007, 23:49
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17270
Location: In your JS exploiting you and your system
f0dder: The info in WriteConsole has been there since day one in the .hlp files and on MSDN.
Post 19 Dec 2007, 00:19
Xorpd!



Joined: 21 Dec 2006
Posts: 161
It's there all right. When using the MSDN docs I was more optimistic that one could fly through them as fast as I did without missing anything of significance. Obviously an incorrect assumption. Thank you for pointing this out as it's more reassuring to know that it's the documented behavior of the software and not just one's perception from limited experimentation.
Post 19 Dec 2007, 04:20
Copyright © 1999-2020, Tomasz Grysztar.
