flat assembler
Message board for the users of flat assembler.

Index > Windows > Optimization tips needed for my coding

Author
Thread Post new topic Reply to topic
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 08 Mar 2008, 06:14
First of all, if you have any optimization tips, please state them, because I'm trying to learn some.

Okay, my loop has this on every iteration:
Code:
repeat 4 {
; four mov instructions
; four xor instructions
mov [dest+(%-1)*4],eax
 }
    

But, stosd would not work well, because the destination is only four dwords in length, therefore after every couple iterations I would have to have a sub esi, 16. Is there something else, or will the code be better with stosd?

Second: I need to move 4 dwords of data from one variable memory pointer to one constant memory pointer, and after the move I have to perform four xor operations on the dword. Is this the best possible?
Code:
mov edi,[variable]
mov esi,[constant]
repeat 4
  lodsd
  xor eax,[other]
  stosd
end repeat
    

Third, I have this loop:
Code:
repeat 4
movzx eax,byte
movzx ebx,byte
movzx ecx,byte
movxz edx,byte
; four mov al/ah, [eax+byte_table]
mov [dest],eax
end repeat
    

Will this run faster?
Code:
mov ebx,byte_table
repeat 4
mov al, byte
mov ah,byte
shl eax,16
mov al,byte
mov ah,byte
xlatb
ror eax,8
xlatb
ror eax,8
xlatb
ror eax,8
xlatb
ror eax,8
mov [dest],eax
end repeat
    

That is all I can see for now, please help to get my code running faster![/code]
Post 08 Mar 2008, 06:14
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20621
Location: In your JS exploiting you and your system
revolution 08 Mar 2008, 06:35
I suggest you just go ahead and code up all the solutions you have presented above and time it in the real situation that you want it to run. There is more to optimisation than just looking at code in the screen, so many interactions between sections inside, and outside, the CPU have a very significant effect on the runtime. You can't look at a piece of code in isolation and say it is fast. There are too many dependencies upon other things.

Also look at how much time you are spending on optimising and compare that to how time will be saved when the program is running. If you spend 1 month optimising and only save seconds in runtime then your time probably has not been utilised efficiently.

The only caveat to that is if your code has a specific deadline where it must produce results within a certain time limit, then of course you have to optimise to meet the demand. But situations like this are rare so it is unlikely you will find this requirement.
Post 08 Mar 2008, 06:35
View user's profile Send private message Visit poster's website Reply with quote
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 08 Mar 2008, 16:20
Okay, from the start I've had many problems with endian-ness:

When moving from memory to a register, it goes from 1,2,3,4 to 2,1,4,3.

When moving from register to memory, it goes from 1,2,3,4 to 4,3,2,1

When using an operation (like xor) from register to memory, the memory operand goes from 1,2,3,4 in memory to 4,3,2,1 before the operation.

When using the bswap instruction (register), it goes from 1,2,3,4 to 4,3,2,1.

Everytime I receive user input I almost always have to perform bswap to fix it out before I even think of working with it.

I see two different types of byte-swapping in my computer, are my results normal?

Can someone please tell me why I have both middle-endian AND little-endian-ness in my computer? It is very hard to work with,


Last edited by AlexP on 08 Mar 2008, 17:14; edited 1 time in total
Post 08 Mar 2008, 16:20
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20621
Location: In your JS exploiting you and your system
revolution 08 Mar 2008, 17:09
It might be worth mentioning that what I mentioned above only applies to modern CPU super scalar architecture. If you are programming for some type of embedded system (which would be likely if you have a deadline to meet with the code timing) then that particular CPU may have more predicable software performance characteristics. eg. An 80386sx embedded system will be much easier to optimise for, even without running the code or actually doing timing tests. But still timing tests should be done to verify that your code is running as expected and required.


Last edited by revolution on 08 Mar 2008, 17:23; edited 1 time in total
Post 08 Mar 2008, 17:09
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20621
Location: In your JS exploiting you and your system
revolution 08 Mar 2008, 17:18
AlexP wrote:
Okay, from the start I've had many problems with endian-ness:

When moving from memory to a register, it goes from 1,2,3,4 to 2,1,4,3.

When moving from register to memory, it goes from 1,2,3,4 to 4,3,2,1

When using an operation (like xor) from register to memory, the memory operand goes from 1,2,3,4 in memory to 4,3,2,1 before the operation.

When using the bswap instruction (register), it goes from 1,2,3,4 to 4,3,2,1.

Everytime I receive user input I almost always have to perform bswap to fix it out before I even think of working with it.

I see two different types of byte-swapping in my computer, are my results normal?

But most importantly, what is with the 2,1,4,3 arrangement? I don't see why it does this when I use any mov instruction from memory to registers, because I have not heard of this type of endian-ness... What is with my computer?
It is all about how you are seeing it in your mind. x86 architecture is always little-endian, so when you load memory-to-register the lowest address from memory goes to the matching lowest significant byte in the register. Only bswap can do a 32bit endian-ness change.

Many of the hashing and encryption algos specify the byte order as big-endian. So to solve the problem just after loading the data from memory do a bswap. When working with the data internally always use little-endian, this is the native CPU format. Then just before storing the finalised value do a bswap to put it back to big-endian. This should save you many headaches and wasted time wondering about what goes where.

So do like this:
  1. load raw data
  2. bswap
  3. manipulate data
  4. bswap
  5. save final values
Post 08 Mar 2008, 17:18
View user's profile Send private message Visit poster's website Reply with quote
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 08 Mar 2008, 17:21
Thanks, I thought I had a perfectly normal Pentium laptop with little-endianness, but apparently I've got, uhh, what you said. I have no deadlines, but what I do have is time.

What this applies to is my attempt at taking a byte[] from memory, using it to create another byte[] after all operations are done with. The only problem is that when I ever get user input, I almost always have to swap it and then end up with the wrong type of array, I don't see any way to get it right...
Post 08 Mar 2008, 17:21
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20621
Location: In your JS exploiting you and your system
revolution 08 Mar 2008, 17:28
AlexP wrote:
Thanks, I thought I had a perfectly normal Pentium laptop with little-endianness, but apparently I've got, uhh, what you said. I have no deadlines, but what I do have is time.

What this applies to is my attempt at taking a byte[] from memory, using it to create another byte[] after all operations are done with. The only problem is that when I ever get user input, I almost always have to swap it and then end up with the wrong type of array, I don't see any way to get it right...
Sorry, I posted before I saw you second message, and it may have been misleading or make no sense. I have edited it slightly.

I expect your laptop, if it is under ~8-10 years old, will be super-scaler and thus trickier to optimise for.

As for the byte[] thing you describe I am not sure what you are trying to say/ask. If you algo is primarily a big-endian algo then have have little choice but to do bswaps. But you mention byte arrays where the endian thing doesn't make sense.
Post 08 Mar 2008, 17:28
View user's profile Send private message Visit poster's website Reply with quote
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 08 Mar 2008, 17:32
Yeah, I have to receive a byte[] from the user, then perform my operations to create a new byte[]. This new byte[] will be used by another one of my algorithms, and it is just easier for this new byte[] to be (a byte[] ), so my operations in the second algorithm can just use memory operations instead of lodsd, then performing the operation.

So far I've not found any easy way to do this, so I'm sticking with bswap'ing the input before using it. Thanks for insights, I will definitely research those things when I have time.
Post 08 Mar 2008, 17:32
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20621
Location: In your JS exploiting you and your system
revolution 08 Mar 2008, 17:39
I think you will need to provide details of you algo, we are talking about in circles. You need to be more specific about why a byte array needs any endian-ness attached. Endian-ness would imply you are using, at a minimum, word arrays, and more probably I expect you are using dword arrays.
Post 08 Mar 2008, 17:39
View user's profile Send private message Visit poster's website Reply with quote
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 08 Mar 2008, 18:17
Okay, here. You may recognize it, I have been studying many implementations of AES and am trying to combine the best of them.
Code:
; -----------------------------------------------------------------------------
;
; aes encryption - user interface
; -----------------------------------------------------------------------------
 aes_encuser:

; preserve registers

    sub     esp, aes_stack_space
    mov     [esp+12], esi
    mov     [esp+ 8], edi
    mov     [esp+ 4], ebp
    mov     [esp+ 0], ebx

; obtain input parameter

    cld
    mov     esi, [esp+aes_stack_space+aes_encrypt_in_blk] ; lodsd
    mov     edi, aes_state.state                          ; stosd

; format user input to little-endian architectures, first key addition

  rept 4 column {
    lodsd
    xor     eax, d[aes_state.keys+(column-1)*4]
    stosd         }

; perform the 13 normal rounds

 if aes_encrypt_unroll

  rept 13 round
  { table_lookup_key_add (aes_state.state),(aes_state.state),((round)*16) }

 else

    mov     edx, 13
    mov     edi, aes_state.keys+16  ; add edi,16?
@@: table_lookup_key_add (aes_state.state),(aes_state.state),edi
    add     edi, 4*4
    dec     edx
    jnz     @b

 end if

; final round and key addition

    table_lookup_final (aes_state.state),(esp+aes_stack_space+aes_encrypt_out_blk),(224)

; restore registers and return

    mov     ebx, [esp+ 0]
    mov     ebp, [esp+ 4]
    mov     edi, [esp+ 8]
    mov     esi, [esp+12]
    add     esp, aes_stack_space
    ret 8
    

Here is the table lookup macros, pretty much the ones you used.
Code:
; -----------------------------------------------------------------------------
;
; core aes routines
; -----------------------------------------------------------------------------
 macro table_lookup_key_add source,dest,key {
  rept 4 column {
    movzx   eax, b[source+((column-1) mod 4)*4+0]
    movzx   ebx, b[source+((column+0) mod 4)*4+1]
    movzx   ecx, b[source+((column+1) mod 4)*4+2]
    mov     eax, [eax*4+aes_tables.tab_t1]
    xor     eax, [ebx*4+aes_tables.tab_t2]
    movzx   ebx, b[source+((column+2) mod 4)*4+3]
    xor     eax, [ecx*4+aes_tables.tab_t3]
    xor     eax, [(column-1)*4+key]
    xor     eax, [ebx*4+aes_tables.tab_t4]
    mov     d[dest+(column-1)*4], eax \} } ; stosd . . . / sub edi,16

 macro table_lookup_final source,dest,key   {
  rept 4 column {
    movzx   eax, b[source+((column+1) mod 4)*4+2]
    movzx   ebx, b[source+((column+2) mod 4)*4+3]
    movzx   ecx, b[source+((column-1) mod 4)*4+0]
    movzx   edx, b[source+((column+0) mod 4)*4+1]
    mov     al,  b[eax+aes_tables.tab_s] ; xlatb gives more instructions...
    mov     ah,  b[ebx+aes_tables.tab_s]
    shl     eax, 16
    mov     al,  b[ecx+aes_tables.tab_s]
    mov     ah,  b[edx+aes_tables.tab_s]
    xor     eax, [(column-1)*4+key]
    mov     [dest+(column-1)*4], eax \} } ; stosd . . . / sub edi,16
    

Hope it helps, the key schedule I am using is a dword[] (reversed in memory). I have chosen this path so when I use memory operands for xor operations it translates to normal.

PS: In the user interface, when the first key addition happens I need to insert "bswap eax". Then it will work. (lol I've said that hundreds of times, rarely ever happens.)
Post 08 Mar 2008, 18:17
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20621
Location: In your JS exploiting you and your system
revolution 08 Mar 2008, 18:33
AlexP wrote:
Hope it helps, the key schedule I am using is a dword[] (reversed in memory). I have chosen this path so when I use memory operands for xor operations it translates to normal.
Therein lies your problem, you seem to have totally confused yourself with using different endian-nesses. x86 is little-endian, so don't fight it, because it will always win, use it as is was designed and it will play nicely with you.

One golden rule that I perhaps should have mentioned earlier, and I assumed you had already applied, is "get it working, then get it fast".
Post 08 Mar 2008, 18:33
View user's profile Send private message Visit poster's website Reply with quote
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 08 Mar 2008, 19:25
Thanks, I'll do that. Replace the input to the main algorithm with whatever is supposed to go there, to see if it works before worrying about all the formatting and stuff.
Post 08 Mar 2008, 19:25
View user's profile Send private message Visit poster's website Reply with quote
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 09 Mar 2008, 07:42
Thank you revolution, I took (both) your advice, and I finally hit that F7 and it worked. It's taken many dozens of hours (mostly spend reading and researching AES), and many more to finally get something working. This is definitely the best project I've worked on so far, and now all I have to do is have fun making it better.

Right now, it's a hybrid between good coding practice, specification, and table usage. I feel that the encryption algo is as good as I need it for now, though it took me awhile to figure out why, in your version, you switched the "source" and "destination" between every round, then I saw I was overwriting what I needed! Works great now.

I'm also planning to buy the official AES book, made by the designers and containing information that I REALLY want to learn. At a whopping $50, I'll have to get a little more money until I can order it. Thanks again!
Post 09 Mar 2008, 07:42
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.