flat assembler
Message board for the users of flat assembler.
Memory Move Optimization (BTW, I have read the other thread)
r22 25 Feb 2007, 06:37
I think aligning the destination would be better, since store alignment is more important than load alignment.
The memmove/RtlMoveMemory function in the Windows XP ntdll is very well optimized, and it aligns the destination. Using prefetch for larger moves also seems to improve performance. Some optimization guides suggest pairing the load and store instructions, i.e.
Code:
movdqa xmm0,[esi]
movntdq [edi],xmm0
movdqa xmm1,[esi+16]
movntdq [edi+16],xmm1
But I've found that this is slower than the following:
Code:
movdqa xmm0,[esi]
movdqa xmm1,[esi+16]
movdqa [edi],xmm0
movdqa [edi+16],xmm1
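A minimal sketch of the destination-aligning idea, written as a 32-bit fasm routine to be assembled inside a larger program. The label copy_dest_aligned, the register convention (esi = source, edi = destination, ecx = byte count) and the 32-byte threshold are assumptions for illustration; this is not how ntdll does it. It assumes SSE2, non-overlapping buffers and a clear direction flag.

```asm
; Hypothetical routine: copy ecx bytes from [esi] to [edi], forward only.
; Clobbers edx, xmm0 and xmm1.
copy_dest_aligned:
        cmp     ecx, 32
        jb      .tail               ; short copies: plain byte copy
        ; 1) copy single bytes until edi is 16-byte aligned
        mov     edx, edi
        neg     edx
        and     edx, 15             ; edx = bytes needed to align edi (0..15)
        sub     ecx, edx            ; ecx = bytes left after the head
        xchg    ecx, edx            ; ecx = head count, edx = bytes left
        rep movsb
        mov     ecx, edx
        jmp     .check
        ; 2) main loop: two loads grouped, then two stores, 32 bytes per pass
.loop:
        movdqu  xmm0, [esi]         ; source may be unaligned
        movdqu  xmm1, [esi+16]
        movdqa  [edi], xmm0         ; destination is aligned here
        movdqa  [edi+16], xmm1
        add     esi, 32
        add     edi, 32
        sub     ecx, 32
.check:
        cmp     ecx, 32
        jae     .loop
        ; 3) copy the remaining 0..31 bytes
.tail:
        rep movsb
        ret
```

Whether the unaligned loads cost more than they save depends on the CPU, so this would still need to be benchmarked against RtlMoveMemory.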
jbojarczuk 27 Feb 2007, 18:20
Hi,
Thanks for your reply. The question I am posing here is not whether it is better to align the destination or the source. I am asking about the possibility of keeping both of them aligned, even though it takes 5 more instructions to do the shifts. Has anyone tried this? Which would be faster: aligned with shift (losing about 1½ cycles to the shifting), or unaligned without shift?
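For readers unfamiliar with the "aligned with shift" idea, here is a rough SSE2 sketch for one fixed misalignment. The label copy_shift4 and the register convention are hypothetical: esi = source address with (esi and 15) = 4, edi = 16-byte aligned destination, ecx = nonzero multiple of 16. Every load and store is aligned, and the shifts recombine two neighbouring source blocks.

```asm
; Hypothetical routine for a source that is 4 bytes past a 16-byte boundary.
; Clobbers xmm0, xmm1 and xmm2.
copy_shift4:
        and     esi, not 0Fh        ; round the source down to its aligned block
        movdqa  xmm0, [esi]         ; current block; wanted bytes start at +4
.loop:
        movdqa  xmm1, [esi+16]      ; next aligned block
        movdqa  xmm2, xmm1          ; copy to shift; xmm1 stays intact
        psrldq  xmm0, 4             ; bytes 4..15 of current block -> bytes 0..11
        pslldq  xmm2, 12            ; bytes 0..3 of next block -> bytes 12..15
        por     xmm0, xmm2          ; 16 consecutive source bytes
        movdqa  [edi], xmm0         ; aligned store
        movdqa  xmm0, xmm1          ; next block becomes the current one
        add     esi, 16
        add     edi, 16
        sub     ecx, 16
        jnz     .loop
        ret
```

Because psrldq and pslldq only take immediate shift counts, a real implementation would need one such loop per possible misalignment (1 to 15), which is part of the cost being asked about.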
Xorpd! 01 Mar 2007, 10:58
Quote:
Well, this is your semi-lucky day. I had a little extra time and energy, although not as much as I thought, and tried alignment with shift. See the file memcpy.asm at my website. The output is in memcpy.txt.
Madis731 16 Dec 2007, 12:40
Hi Xorpd!, how do you output the text? In my console window it's too long to fit, and I can't scroll back far enough in the history.
"memcpy > memcpy.txt" prints nothing, and even "memcpy | more" shows nothing. What seems to be the problem? As much as I could gather (Q6600, 2GB RAM):
Code:
6 11 1152 1152
6 12 1143 1134
6 13 1179 1179
6 14 765 765
6 15 1143 1143
7 0 990 990
7 1 1125 1116
7 2 1125 1125
7 3 1116 1116
7 4 1080 1080
7 5 1143 1143
7 6 1125 1125
7 7 747 747
7 8 1107 1107
7 9 1098 1098
7 10 1098 1098
7 11 1161 1161
7 12 1134 1143
7 13 1188 1179
7 14 1161 1161
7 15 747 747
8 0 639 639
8 1 1134 1134
8 2 1125 1116
8 3 1089 1089
8 4 1107 1107
8 5 1134 1134
8 6 1125 1125
8 7 1116 1116
8 8 729 729
8 9 1143 1170
8 10 1098 1089
8 11 1080 1080
8 12 1143 1143
8 13 1179 1188
8 14 1161 1161
8 15 1170 1170
9 0 990 990
9 1 783 783
9 2 1143 1143
9 3 1098 1098
9 4 1098 1098
9 5 1152 1152
9 6 1125 1116
9 7 1116 1116
9 8 1125 1116
9 9 765 765
9 10 1107 1098
9 11 1089 1089
9 12 1080 1080
9 13 1179 1179
9 14 1170 1179
9 15 1161 1161
10 0 981 981
10 1 1116 1107
10 2 756 756
10 3 1098 1089
10 4 1098 1107
10 5 1143 1143
10 6 1134 1143
10 7 1107 1107
10 8 1116 1125
10 9 1116 1107
10 10 756 747
10 11 1107 1098
10 12 1080 1080
10 13 1152 1152
10 14 1161 1161
10 15 1161 1161
11 0 990 990
11 1 1116 1116
11 2 1089 1089
11 3 747 747
11 4 1098 1098
11 5 1143 1143
11 6 1125 1125
11 7 1125 1125
11 8 1116 1116
11 9 1107 1107
11 10 1098 1098
11 11 738 738
11 12 1080 1080
11 13 1152 1152
11 14 1134 1134
11 15 1152 1152
12 0 990 1008
12 1 1152 1161
12 2 1134 1116
12 3 1089 1089
12 4 747 747
12 5 1134 1134
12 6 1125 1125
12 7 1116 1116
12 8 1098 1089
12 9 1107 1107
12 10 1098 1098
12 11 1080 1071
12 12 738 738
12 13 1188 1152
12 14 1125 1116
12 15 1143 1143
13 0 990 990
13 1 1143 1161
13 2 1134 1125
13 3 1098 1098
13 4 1098 1098
13 5 783 792
13 6 1125 1125
13 7 1107 1116
13 8 1116 1125
13 9 1134 1134
13 10 1089 1089
13 11 1071 1080
13 12 1080 1089
13 13 783 774
13 14 1125 1134
13 15 1125 1116
14 0 981 981
14 1 1116 1125
14 2 1143 1125
14 3 1098 1098
14 4 1107 1098
14 5 1152 1143
14 6 774 774
14 7 1116 1107
14 8 1125 1116
14 9 1107 1107
14 10 1116 1125
14 11 1071 1080
14 12 1089 1089
14 13 1143 1134
14 14 747 756
14 15 1143 1143
15 0 990 990
15 1 1107 1107
15 2 1161 1152
15 3 1125 1116
15 4 1098 1098
15 5 1143 1143
15 6 1125 1125
15 7 8613 756
15 8 1125 1116
15 9 1107 1107
15 10 1107 1098
15 11 1098 1107
15 12 1080 1080
15 13 1143 1152
15 14 1125 1125
15 15 729 747
r22 16 Dec 2007, 23:23
jbojarczuk, the best thing you can do is optimize your code and benchmark it against the RtlMoveMemory API. How does your technique of keeping SRC and DST aligned scale to SSE instructions?
I tried to improve on XP's memory move, but the versions I created were never across-the-board improvements in ALL test scenarios. As for keeping the SRC and DEST both aligned, it seems that on larger moves (>4KB) your extra logic would be slower than just aligning the DEST. If you assume 4 instructions at 1 cycle apiece (I'm being generous) vs. an unaligned load with at worst a 3-cycle penalty, your method would still be slower. BUT only a thorough benchmark using all possible alignments and a large subset of different move sizes would tell you for sure.
Hayden 17 Dec 2007, 08:01
To get rid of register dependencies, do something like this.
This is how you would move 1 paragraph (16 bytes). BTW, cache reads/writes usually occur 1 paragraph at a time.
Code:
mov ebx, [edi+00H]
mov edx, [edi+04H]
mov ecx, [edi+08H]
mov eax, [edi+0CH]
mov [esi+00H], ebx
mov [esi+04H], edx
mov [esi+08H], ecx
mov [esi+0CH], eax
_________________
New User.. Hayden McKay.
Madis731 17 Dec 2007, 11:00
OT Start:
---------
Have you noticed Windows Server 2003 doing some nifty HDD optimizations on its own? Yesterday I was repeatedly copying ~500 or so images (JPEGs, 2-8MB in size) and the pattern was something like "12345........67890.......", and it was really impressive to see the background action. The explanation is something like Hayden's post: it takes some 16MB of data and (I think) moves it through cache. The 16MB R / 16MB W / ... pattern shows up to the user as a limping-looking transfer.
-----------
OT Ends
It seems to be the best way, and I've usually done my memory moves like this:
Code:
movaps xmm0, [esi+00h]
movaps xmm1, [esi+10h]
movaps xmm2, [esi+20h]
movaps xmm3, [esi+30h]
movntps [edi+00h], xmm0
movntps [edi+10h], xmm1
movntps [edi+20h], xmm2
movntps [edi+30h], xmm3
MOVAPS is the shortest of the MOV group and averages 16 bytes per cycle of throughput. MOVNTPS is lower latency just because the CPU thinks that the data is already in single precision, and MOVNTPS is also shorter than the other variants.
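As a sketch only, the eight instructions above could sit in a loop like the following. The label stream_copy, the 64-byte step and the 256-byte prefetch distance are assumptions to be tuned, not measured values. It assumes esi and edi are both 16-byte aligned and ecx is a nonzero multiple of 64.

```asm
; Hypothetical routine: streaming copy of ecx bytes from [esi] to [edi].
; Clobbers xmm0..xmm3.
stream_copy:
.loop:
        prefetchnta [esi+256]       ; hint only; never faults, distance is a guess
        movaps  xmm0, [esi+00h]     ; four aligned loads first
        movaps  xmm1, [esi+10h]
        movaps  xmm2, [esi+20h]
        movaps  xmm3, [esi+30h]
        movntps [edi+00h], xmm0     ; then four non-temporal stores
        movntps [edi+10h], xmm1
        movntps [edi+20h], xmm2
        movntps [edi+30h], xmm3
        add     esi, 64
        add     edi, 64
        sub     ecx, 64
        jnz     .loop
        sfence                      ; order the non-temporal stores before returning
        ret
```

The sfence matters because non-temporal stores are weakly ordered; without it, another thread could observe the copy as incomplete after the routine returns.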
edfed 17 Dec 2007, 12:15
don't forget
Code:
and esi,not 0fh
and edi,not 0fh
Xorpd! 17 Dec 2007, 17:21
To get all the output of my memcpy.asm program, issue the command
Code:
mode con: lines=1000
at the Windows console prompt before running. If that doesn't work, it may be because you are running the lame *32 cmd.exe. Microsoft wants to make this the default; has anyone ever noticed that Microsoft has better stuff that they want to give out by default? Find your x64 cmd.exe and try again. I don't know why redirection doesn't work. I noticed that myself, and perhaps I'm doing some Windows thing wrong in my program.
rugxulo 17 Dec 2007, 23:58
http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/redirection.mspx?mfr=true
Quote:
|
Madis731 18 Dec 2007, 07:05
The lines=1000 worked, but I don't get why outputting to a file doesn't work at all.
I was using the 64-bit version of cmd all the time, but for testing I now took the 32-bit version. Both act the same way, and memcpy>out.txt 2>&1 doesn't work. http://support.microsoft.com/kb/110930 Actually, the problem I'm having is that when ANY redirection is done, all the output is lost. Really weird...
Xorpd! 18 Dec 2007, 07:30
I don't get it either, else I would have fixed it. Maybe redirection is done by mainCRTStartup rather than cmd.exe. If anyone has a suggestion, here is the problem in detail: we assemble and link hello.asm:
Code:
format MS64 coff

; File: hello.asm
; Assembled with: fasm hello.asm
; Linked with link hello.obj /subsystem:console /defaultlib:kernel32.lib /entry:_main

extrn GetStdHandle
extrn 'WriteConsoleA' as WriteConsole
extrn ExitProcess

STD_OUTPUT_HANDLE = -11

section 'CODE' code readable executable align 16

align 16
public _main
_main:
; Align stack frame and leave room for procedure calls
   sub rsp, 40
; Save rbx
   mov [rsp+48], rbx
; Get handle for stdout
   mov ecx, STD_OUTPUT_HANDLE
   call GetStdHandle
   mov rbx, rax
; Write message
   mov rcx, rbx ; hConsoleOutput
   mov rdx, mess ; lpBuffer
   mov r8d, mess_len ; nNumberOfCharsToWrite
   mov r9, result ; lpNumberOfCharsWritten
   mov qword[rsp+32], 0 ; lpReserved
   call WriteConsole
; Exit
   mov rbx, [rsp+48]
   mov ecx, 0
   call ExitProcess

section 'DATA' data readable writeable align 16

mess db 'Hello, world!', 0dh, 0ah
mess_len = $-mess
align 4
result dd ?
It runs fine:
Code:
C:\Asm\FASM\EXAMPLES\WIN64\my_example>fasm hello.asm
flat assembler version 1.67.18 (1222417 kilobytes memory)
3 passes, 401 bytes.

C:\Asm\FASM\EXAMPLES\WIN64\my_example>link hello.obj /subsystem:console /defaultlib:kernel32.lib /entry:_main
Microsoft (R) Incremental Linker Version 8.00.40310.39
Copyright (C) Microsoft Corporation. All rights reserved.

C:\Asm\FASM\EXAMPLES\WIN64\my_example>hello
Hello, world!
But it doesn't do redirection successfully:
Code:
C:\Asm\FASM\EXAMPLES\WIN64\my_example>dir hello.txt
 Volume in drive C has no label.
 Volume Serial Number is CCC8-B430

 Directory of C:\Asm\FASM\EXAMPLES\WIN64\my_example

File Not Found

C:\Asm\FASM\EXAMPLES\WIN64\my_example>hello > hello.txt

C:\Asm\FASM\EXAMPLES\WIN64\my_example>dir hello.txt
 Volume in drive C has no label.
 Volume Serial Number is CCC8-B430

 Directory of C:\Asm\FASM\EXAMPLES\WIN64\my_example

12/18/2007 12:15 AM 0 hello.txt
 1 File(s) 0 bytes
 0 Dir(s) 122,378,969,088 bytes free
Ah, here's a fix:
Code:
format MS64 coff

; File: hello.asm
; Assembled with: fasm hello.asm
; Linked with link hello.obj /subsystem:console /defaultlib:kernel32.lib /entry:_main

extrn GetStdHandle
extrn ExitProcess
extrn WriteFile

STD_OUTPUT_HANDLE = -11

section 'CODE' code readable executable align 16

align 16
public _main
_main:
; Align stack frame and leave room for procedure calls
   sub rsp, 40
; Save rbx
   mov [rsp+48], rbx
; Get handle for stdout
   mov ecx, STD_OUTPUT_HANDLE
   call GetStdHandle
   mov rbx, rax
; Write message
   mov rcx, rbx ; hFile
   mov rdx, mess ; lpBuffer
   mov r8d, mess_len ; nNumberOfBytesToWrite
   mov r9, result ; lpNumberOfBytesWritten
   mov qword[rsp+32], 0 ; lpOverlapped
   call WriteFile
; Exit
   mov rbx, [rsp+48]
   mov ecx, 0
   call ExitProcess

section 'DATA' data readable writeable align 16

mess db 'Hello, world!', 0dh, 0ah
mess_len = $-mess
align 4
result dd ?
And now:
Code:
C:\Asm\FASM\EXAMPLES\WIN64\my_example>dir hello.txt
 Volume in drive C has no label.
 Volume Serial Number is CCC8-B430

 Directory of C:\Asm\FASM\EXAMPLES\WIN64\my_example

File Not Found

C:\Asm\FASM\EXAMPLES\WIN64\my_example>fasm hello.asm
flat assembler version 1.67.18 (1195954 kilobytes memory)
3 passes, 397 bytes.

C:\Asm\FASM\EXAMPLES\WIN64\my_example>link hello.obj /subsystem:console /defaultlib:kernel32.lib /entry:_main
Microsoft (R) Incremental Linker Version 8.00.40310.39
Copyright (C) Microsoft Corporation. All rights reserved.

C:\Asm\FASM\EXAMPLES\WIN64\my_example>hello > hello.txt

C:\Asm\FASM\EXAMPLES\WIN64\my_example>dir hello.txt
 Volume in drive C has no label.
 Volume Serial Number is CCC8-B430

 Directory of C:\Asm\FASM\EXAMPLES\WIN64\my_example

12/18/2007 12:26 AM 15 hello.txt
 1 File(s) 15 bytes
 0 Dir(s) 122,378,682,368 bytes free

C:\Asm\FASM\EXAMPLES\WIN64\my_example>type hello.txt
Hello, world!
So it seems that WriteConsoleA doesn't necessarily write anywhere if output has been redirected, but WriteFile works. I wish Microsoft had made this more obvious in their documentation.
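If a program wants to keep WriteConsoleA for a real console and fall back to WriteFile under redirection, one option is to ask GetConsoleMode whether the handle is a console, since that call fails for a file or pipe handle. The following is an untested sketch built on the hello.asm layout above; the mode variable and the local label names are my own additions.

```asm
format MS64 coff

; Sketch: pick WriteConsoleA or WriteFile depending on the stdout handle type.
; Link as before: link hello.obj /subsystem:console /defaultlib:kernel32.lib /entry:_main

extrn GetStdHandle
extrn GetConsoleMode
extrn 'WriteConsoleA' as WriteConsole
extrn WriteFile
extrn ExitProcess

STD_OUTPUT_HANDLE = -11

section 'CODE' code readable executable align 16

align 16
public _main
_main:
   sub rsp, 40              ; align stack, shadow space plus 5th argument slot
   mov [rsp+48], rbx        ; save rbx in the caller-provided home space
   mov ecx, STD_OUTPUT_HANDLE
   call GetStdHandle
   mov rbx, rax
; Ask whether the handle is a console (eax = 0 means it is not)
   mov rcx, rbx             ; hConsoleHandle
   mov rdx, mode            ; lpMode
   call GetConsoleMode
; Both write functions take the same first four arguments here
   mov rcx, rbx             ; handle
   mov rdx, mess            ; lpBuffer
   mov r8d, mess_len        ; number of chars/bytes to write
   mov r9, result           ; receives number written
   mov qword[rsp+32], 0     ; lpReserved / lpOverlapped = NULL
   test eax, eax            ; eax is untouched by the movs above
   jz .use_writefile
   call WriteConsole        ; real console
   jmp .exit
.use_writefile:
   call WriteFile           ; redirected to a file or pipe
.exit:
   mov rbx, [rsp+48]
   mov ecx, 0
   call ExitProcess

section 'DATA' data readable writeable align 16

mess db 'Hello, world!', 0dh, 0ah
mess_len = $-mess
align 4
result dd ?
mode dd ?
```

For plain byte output like this, calling WriteFile unconditionally is simpler and behaves the same in both cases; the split only pays off when console-specific behaviour is wanted.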
f0dder 18 Dec 2007, 12:28
Well, if you check MSDN/PlatformSDK... first, at the description:
Quote:
- IMHO this does hint that it won't work with files, although I guess you could argue that "cursor location" could be interpreted as file pointer instead.
And then there's this from the parameter description:
Quote:
- Obviously a redirected STDOUT means your GetStdHandle() returns a file handle, not a console handle. But the real and final indication would be the following, from Remarks (emphasis is mine):
Quote:
But it's of course possible that this sentence hasn't been there in all PlatformSDKs and has only been added relatively recently.
PS: add error checking after your API calls.
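A rough, untested sketch of what such error checking could look like for the WriteFile version of hello.asm. Returning the GetLastError value as the process exit code is my own choice for illustration (it can be inspected with echo %ERRORLEVEL%), so that the failure is visible even when no output works.

```asm
format MS64 coff

; Sketch: hello.asm with checks after GetStdHandle and WriteFile.

extrn GetStdHandle
extrn WriteFile
extrn GetLastError
extrn ExitProcess

STD_OUTPUT_HANDLE    = -11
INVALID_HANDLE_VALUE = -1

section 'CODE' code readable executable align 16

align 16
public _main
_main:
   sub rsp, 40              ; align stack, shadow space plus 5th argument slot
   mov ecx, STD_OUTPUT_HANDLE
   call GetStdHandle
   cmp rax, INVALID_HANDLE_VALUE
   je .fail                 ; GetStdHandle failed
   test rax, rax
   jz .no_handle            ; NULL: the process has no stdout handle
   mov rcx, rax             ; hFile
   mov rdx, mess            ; lpBuffer
   mov r8d, mess_len        ; nNumberOfBytesToWrite
   mov r9, result           ; lpNumberOfBytesWritten
   mov qword[rsp+32], 0     ; lpOverlapped
   call WriteFile
   test eax, eax
   jz .fail                 ; WriteFile returns 0 on failure
   mov ecx, 0               ; success
   call ExitProcess
.no_handle:
   mov ecx, 1               ; arbitrary nonzero exit code (assumption)
   call ExitProcess
.fail:
   call GetLastError
   mov ecx, eax             ; report the Win32 error code as the exit code
   call ExitProcess

section 'DATA' data readable writeable align 16

mess db 'Hello, world!', 0dh, 0ah
mess_len = $-mess
align 4
result dd ?
```

ExitProcess does not return, so no code is needed after each call to it.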
vid 18 Dec 2007, 12:36
grab FASMLIB and use it for easy formatted output.
Example:
Code:
format_string db "%10s %016X"
push [value]
push [name]
push format_string
call stdout.write.format
or with macros:
Code:
stdcall stdout.write.format, "%10s %016X", [name], [value]
Xorpd! 18 Dec 2007, 16:18
@f0dder: There were several places the sentence about WriteConsole failing under redirection could be in the MSDN docs. Under which header did you find it? I was thinking about error checking, but I was going to output the results of the check using WriteConsole, which wouldn't have worked. I suppose I could have used a MessageBox, but I was somewhat flustered by the fact that I couldn't do any output at all via my first choice method.
@vid: So FASMLIB works under x64 now, too? I thought that project awaited a suitable FASM.DLL. It's hard enough to guess what invoke may do in x64, let alone what that stdcall macro does.
vid 18 Dec 2007, 16:44
xorpd: oh shit, i overlooked that code is 64bit
No, there is no 64bit version of FASMLIB, and there won't be any time soon. That would have to be a complete rewrite. Sorry.
f0dder 18 Dec 2007, 23:49
Xorpd!: I simply looked up WriteConsole, but it is a pretty recent version of MSDN/PlatformSDK I'm using, and I know that (even important) information has been added over time instead of being there from the beginning; nobody's perfect, and certainly not Microsoft either.
|
revolution 19 Dec 2007, 00:19
f0dder: The info in WriteConsole has been there since day one in the .hlp files and on MSDN.
|
Xorpd! 19 Dec 2007, 04:20
It's there all right. When using the MSDN docs I was more optimistic that one could fly through them as fast as I did without missing anything of significance. Obviously an incorrect assumption. Thank you for pointing this out as it's more reassuring to know that it's the documented behavior of the software and not just one's perception from limited experimentation.
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.