flat assembler
Message board for the users of flat assembler.
revolution 25 Jan 2007, 10:30
Run two (or more) iterations in parallel to fill in the gaps with other registers and memory accesses.
r22 26 Jan 2007, 04:10
If unrolling the loop into multiple and/or parallel iterations isn't something you want to do, you can always consider a different algorithm to try to accomplish the same goal as your code snippet above.
rugxulo 26 Jan 2007, 05:00
FrozenKnight wrote:
(EDIT: removed because some people can't take a joke.)
Last edited by rugxulo on 28 Jan 2007, 04:34; edited 1 time in total
FrozenKnight 27 Jan 2007, 08:51
rugxulo - The first two are just stupid and the third doesn't make sense. The 486 doesn't have dual pipes to allow two instructions to run at one time, which means that the address interlocks aren't as pronounced.

r22 - Nice suggestion, but I examined the code and I'm not entirely sure that I could change the algorithm to get the same output any faster. I'll look into whether it might be possible (which I doubt) to combine a couple of those shifts, ANDs, or XORs.

revolution - I tried that approach, but my tests showed that I actually added 10 cycles by doing that. I might have been able to cut a few of those off if I changed the conditional jump before it so that it isn't alternating every iteration, allowing for better branch prediction.

Currently the MS rand which I've been testing against (16-bit return but uses 32-bit registers; copied from Visual C++ 6) counts at only 7 cycles, while my MT test counts at 22 cycles (average per iteration). While most implementations of the Mersenne Twister algorithm claim to be as fast as the standard rand function, I've found that most of these either don't include this section or cheat on it and only do one XOR and some funny addition. The version I'm using is a true Mersenne Twister which includes all of the standard math.

I plugged the shift into the MSVC++ compiler and found that my code is just as fast as MSVC++'s, except it uses one less register, and because I like to preserve registers I actually gained a cycle by not needing the extra register. For reference:

Code:
```
mov ecx, eax
sar eax, 0Bh
xor ecx, eax
mov edx, ecx
and edx, 0FF3A58ADh
shl edx, 7
xor ecx, edx
mov eax, ecx
and eax, 0FFFFDF8Ch
shl eax, 0Fh
xor ecx, eax
mov eax, ecx
sar eax, 12h
xor eax, ecx
```

Yes, I did notice the slight differences in the math, and once I saw them I adjusted my code and still managed to keep the same speed and still used one less register.
FrozenKnight 27 Jan 2007, 21:31
Update: I managed to gain an average of about half a cycle per iteration by changing one register shift from ebx to eax.

New code:

Code:
```
mov eax, [ebx]
;---interlock---
mov eax, [mt_buffer+eax*4]
;---interlock---
mov ebx, eax
;---interlock---removed by changing following ebx to eax
sar eax, 0Bh
;---interlock---
xor eax, ebx
;---interlock---
mov ebx, eax
shl eax, 7
;---interlock---
and eax, 0FF3A58ADh ;09D2C5680h
;---interlock---
xor eax, ebx
;---interlock---
mov ebx, eax
shl eax, 0Fh
;---interlock---
and eax, 0FFFFDF8Ch ;0EFC60000h
;---interlock---
xor eax, ebx
;---interlock---
mov ebx, eax
sar eax, 12h
;---interlock---
xor eax, ebx
```

The code now averages 21 cycles, but the fractional part runs a little high. (This code is now effectively faster than what MSVC++ outputs.) The 21 cycles is an average which includes the twist algorithm; this segment runs in about 7 cycles.
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.