flat assembler
Message board for the users of flat assembler.
Index
> Main > Fastest Next Power of Two function |
Author |
|
Kain 20 Jun 2006, 02:32
shr sets zero flag so "cmp eax,0" is unnecessary.
you might also try using add instead of inc. add is faster on many modern processors. anyway, here is mine, don't know if it's faster. Code: sub eax, 1 pow: mov ecx, eax add eax, 1 and ecx, eax jnz pow |
|||
20 Jun 2006, 02:32 |
|
Ivan2k2 20 Jun 2006, 03:47
2 Kain: it's much slower for big values
|
|||
20 Jun 2006, 03:47 |
|
LocoDelAssembly 20 Jun 2006, 04:09
Nice idea anyway, and if you replace sub/add with dec/inc you get a really small function, fits in an MMX register!!
|
|||
20 Jun 2006, 04:09 |
|
Ivan2k2 20 Jun 2006, 04:35
and mine:
Code: dec eax mov edx,eax shr edx,01h or edx,eax mov eax,edx shr eax,02h or eax,edx mov edx,eax shr edx,04h or edx,eax mov eax,edx shr eax,08h or eax,edx mov edx,eax shr edx,10h or edx,eax inc edx mov eax,edx |
|||
20 Jun 2006, 04:35 |
|
r22 21 Jun 2006, 04:00
Code: ;;in eax;;out eax nextpow2: bsr ecx,eax mov eax,1 jz .end add ecx,1 ;; could replace with LUT mov eax,dword[LUT+ecx*4] shl eax,cl ;; .data LUT dd 1,2,4,8,16,32,64,128, ... 2147483648 .end: ret 0 ON modern processors BSR is pretty fast 11clocks the only time when the above would be slower is when eax contains 0, 1, 2 ,3 or 4 but for all other numbers up to ~ 2billion the above is fastest. The reason it's wouldn't be fastest for the small numbers is because the looping methods would only do 1 or 2 iterations so they would be slightly faster. |
|||
21 Jun 2006, 04:00 |
|
mattst88 21 Jun 2006, 23:20
r22 wrote: ON modern processors BSR is pretty fast 11clocks Where can I find information on numbers of clock cycles? The only stuff I can find is for 486 clock cycles, and I believe x86 CPUs have progressed a little since then. |
|||
21 Jun 2006, 23:20 |
|
Ivan2k2 21 Jun 2006, 23:25
2 mattst88: http://agner.org/assem/
|
|||
21 Jun 2006, 23:25 |
|
mattst88 22 Jun 2006, 20:54
Thanks, I've tried that before. I assume I'm doing something wrong though. Here's my code:
Code: #include <stdio.h> #include "asmlib.h" inline unsigned int NextPow2(unsigned int value) { unsigned int x; asm("xorl %%ecx, %%ecx\n" "dec %0\n" "1:\n" "inc %%ecx\n" "shr $1, %0\n" "jnz 1b\n" "inc %0\n" "shl %%cl, %0\n" : "=r"(x) : "0"(value) : "ecx" ); return x; } inline unsigned int NextPow2_BSR(unsigned int value) { unsigned int x; asm("dec %1\n\t" "movl $2,%0\n\t" "bsr %1,%1\n\t" "shl %b1,%0\n\t" : "=r" (x) : "c" (value) ); return x; } int main() { unsigned int x = 0; unsigned int y = 0; int first, last; for (x = 0; x < 32770; x++) { first = ReadClock(); y = NextPow2_BSR(x); last = ReadClock(); printf("%u -> %u - %d clocks\n", x, y, last - first); } return 0; } When calling NextPow2 it takes 116 clocks for everything but the first interation, while calling NextPow2_BSR is 128 clocks. Is this correct? |
|||
22 Jun 2006, 20:54 |
|
Ivan2k2 22 Jun 2006, 22:38
What CPU do you have ???
|
|||
22 Jun 2006, 22:38 |
|
mattst88 22 Jun 2006, 22:43
Sempron 64 2800+.
|
|||
22 Jun 2006, 22:43 |
|
Ivan2k2 23 Jun 2006, 05:14
i thought that you have intel, AMD have slightly different optimization technics and different clocks
|
|||
23 Jun 2006, 05:14 |
|
mattst88 23 Jun 2006, 22:23
So the function utilizing bsr posted in my 3rd post is definitely faster than the other?
|
|||
23 Jun 2006, 22:23 |
|
mattst88 01 Dec 2007, 17:15
Code: ; FASM syntax ; Input - eax ; Output - eax nextpow2: bsr ecx,eax mov eax,2 jz .end shl eax,cl .end: ret Just change 'mov eax,2' to 'mov eax,1' if you want the previous power of two. Can anyone do any better? How about some SIMD versions? |
|||
01 Dec 2007, 17:15 |
|
levicki 22 Dec 2007, 15:18
Assuming that you are working with unsigned int how does that code handle input of 0x80000001? What does it return?
|
|||
22 Dec 2007, 15:18 |
|
LocoDelAssembly 22 Dec 2007, 18:04
$0 with "mov eax, 2" and $80000000 with "mov eax, 1", perfectly valid considering that unsigned numbers are always positive or zero, and obey the laws of arithmetic modulo 2^n, where n is the number of bits in the type (n=32 in this case).
|
|||
22 Dec 2007, 18:04 |
|
revolution 22 Dec 2007, 21:13
The code with mov eax,2 generates the wrong answer with an input of 0.
|
|||
22 Dec 2007, 21:13 |
|
< Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.