flat assembler
Message board for the users of flat assembler.

Index > Main > Speed up Mersenne Twister

Author
Thread Post new topic Reply to topic
FrozenKnight



Joined: 24 Jun 2005
Posts: 128
FrozenKnight
i'm trying to speed up this section of code but i cant see how to remove all of the address interlocks. i managed to get the rest of the code to execute at an average of 8 cycles per iteration but this section where the math is much simpler takes 2 times that. any suggestions?
Code:
  mov     eax, [ebx]
  ;---interlock---
  mov     eax, [mt_buffer+eax*4]
  ;---interlock---
  mov     ebx, eax
  ;---interlock---

  shr     ebx, 0Bh
  ;---interlock---
  xor     eax, ebx
  ;---interlock---
  mov     ebx, eax

  shl     eax, 7
  ;---interlock---
  and     eax, 9D2C5680h
  ;---interlock---
  xor     eax, ebx
  ;---interlock---
  mov     ebx, eax

  shl     eax, 0Fh
  ;---interlock---
  and     eax, 0EFC60000h
  ;---interlock---
  xor     eax, ebx
  ;---interlock---
  mov     ebx, eax

  shr     eax, 12h
  ;---interlock---
  xor     eax, ebx    
Post 25 Jan 2007, 10:00
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17474
Location: In your JS exploiting you and your system
revolution
Run two (or more) iterations in parallel to fill in the gaps with other registers and memory accesses.
Post 25 Jan 2007, 10:30
View user's profile Send private message Visit poster's website Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
if unrolling the loop to multiple AND/OR parallel iterations isn't something you want to do you can always consider a different algorithm to try an accomplish the same goal as your code snippet above.
Post 26 Jan 2007, 04:10
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
rugxulo
FrozenKnight wrote:

any suggestions?


(EDIT: removed because some people can't take a joke Razz )


Last edited by rugxulo on 28 Jan 2007, 04:34; edited 1 time in total
Post 26 Jan 2007, 05:00
View user's profile Send private message Visit poster's website Reply with quote
FrozenKnight



Joined: 24 Jun 2005
Posts: 128
FrozenKnight
rugxulo - the first 2 are just stupid and the third doesn't make sense. the 486 doesn't have dual pipes to allow for 2 instructions to run at one time. which means that the address interlocks aren't as pronounced.

r22 - nice suggestion but i examined the code and I'm not entirely sure that the i could change the algorithm to get the same output any faster but I'll look into seeing if it might be possible (which i doubt) to combine a couple of those shifts ands or xor's.

revolution - i tried that approach but by tests showed that i actually added 10 cycles by doing that. i might have been able to cut a few of those off if i were to change the conditional jump before it so that it isn't alternating every iteration allowing for better processor prediction.

currently the MS rand with I've been testing against (16 bit return but uses 32 bit registers. copied from visual C++ 6) counts at only 7 cycles. while my mt test counts at 22 cycles (average per iteration)

while most implementations of the Mersenne Twister algorithm claim to be as fast as the standard rand function i've found that most of these either don't include this section or cheat on it and only do one xor some funny addition. the version i'm using is a true Mersenne Twister which includes all of the standard math.


i plugged the shift into msvc++ compiler and found that my code is just as fast as msvc++ except it uses one less register and because i like to preserve registers i actually gained a cycle because of not needing the extra register.
for reference
Code:
mov     ecx, eax
sar     eax, 0Bh
xor     ecx, eax
mov     edx, ecx
and     edx, 0FF3A58ADh
shl     edx, 7
xor     ecx, edx
mov     eax, ecx
and     eax, 0FFFFDF8Ch
shl     eax, 0Fh
xor     ecx, eax
mov     eax, ecx
sar     eax, 12h
xor     eax, ecx    

yes i did notice the slight differences in the math and once i saw them i adjusted my code and still managed to keep the same speed and still used one less register.
Post 27 Jan 2007, 08:51
View user's profile Send private message Reply with quote
FrozenKnight



Joined: 24 Jun 2005
Posts: 128
FrozenKnight
update i managed to gain an average of about half a cycle per iteration by changing one register shift from ebx to eax
new code
Code:
  mov     eax, [ebx]
  ;---interlock---
  mov     eax, [mt_buffer+eax*4]
  ;---interlock---
  mov     ebx, eax
  ;---interlock---removed by changeing following ebx to eax

  sar     eax, 0Bh
  ;---interlock---
  xor     eax, ebx
  ;---interlock---
  mov     ebx, eax

  shl     eax, 7
  ;---interlock---
  and     eax, 0FF3A58ADh;09D2C5680h
  ;---interlock---
  xor     eax, ebx
  ;---interlock---
  mov     ebx, eax

  shl     eax, 0Fh
  ;---interlock---
  and     eax, 0FFFFDF8Ch;0EFC60000h
  ;---interlock---
  xor     eax, ebx
  ;---interlock---
  mov     ebx, eax

  sar     eax, 12h
  ;---interlock---
  xor     eax, ebx    
the output has been adjusted so that it is now identical to what the compiled code would be.

code now averages at 21 cycles but the decimal behind runs a little high. (this code is now effectively faster than what msvc++ outputs.) the 21 cycles is an average which includes the twist algrothim this segment runs in about 7 cycles.
Post 27 Jan 2007, 21:31
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on YouTube, Twitter.

Website powered by rwasa.