flat assembler
Message board for the users of flat assembler.

Index > Main > algoritm optimization: bitstreams

Goto page 1, 2  Next
Author
Thread Post new topic Reply to topic
decard



Joined: 11 Sep 2003
Posts: 1092
Location: Poland
decard 28 Oct 2004, 14:10
Bitstream is a small library for reading/writing on single bit. I wrote it for my LZSS compressor (it's easiest to achieve bigger compression ratio when you operate on single bits). Bitstreaming isn's a critical part of whe whole algo, but optimizing it could make whole compressor faster. Here I attach two functions: one that stores bits in a bit stream, and the other that reads them.
Could you help me in making some speed improvements to these functions? Meaby there is some other way that would be faster?

Here's a short description of the algorithm I used:
- BitStream is a structure that contains all data needed to do streaming:
Code:
struct BitStream
  .buf dd ?
  .ptr dd ?
  .bit_ptr dd ?
ends    

.buf is a pointer to the first byte of the stream
.ptr points on a current DWORD
.bit_ptr is an bit offset relative to the .ptr

function BitsPut stores given number of given bits into the stream, and then fixes [.ptr] and [bit_ptr] members. It can write at most 32 bits into the stream.
BitsGet reads given number of bits, and moves the pointer too.
Both functions firstly generate bit mask that is used to store bits, then they actually write them. The complexity comes because bits to be stored can "cross" two DWORDs.

So, could you make any improvements? I would be really grateful...
If you want to see how the bitstreams work in a real program, then just download lzss library and test code.

thanks,
decard

Code:
; puts given number of bits into stream
proc BitsPut, .stream,.bits,.bit_count
        begin
        push    eax ebx ecx edx esi edi
        ; generate mask in edi:esi
        xor     eax,eax
        dec     eax
        mov     esi,eax
        mov     edi,eax
        mov     ebx,[.stream]
        mov     ecx,[.bit_count]
        shl     esi,cl
        mov     ecx,[ebx+BitStream.bit_ptr]
        shld    edi,esi,cl
        shld    esi,eax,cl
        ; load new bits into edx:ecx; at this point ecx==[ebx+BitStream.bit_ptr]
        xor     edx,edx
        mov     eax,[.bits]
        shld    edx,eax,cl
        shl     eax,cl
        mov     ecx,eax
        ; finally, store new bits using
        mov     ebx,[ebx+BitStream.ptr]
        and     [ebx],esi
        or      [ebx],ecx
        add     ebx,4
        and     [ebx],edi
        or      [ebx],edx
        ; and fix the pointers
        mov     ebx,[.stream]
        mov     eax,[.bit_count]
        add     [ebx+BitStream.bit_ptr],eax
        cmp     [ebx+BitStream.bit_ptr],32
        jb      .finish
        sub     [ebx+BitStream.bit_ptr],32
        add     [ebx+BitStream.ptr],4
      .finish:
        pop     edi esi edx ecx ebx eax
        return
endp

; reads given number of bits form the stream
proc BitsGet, .stream,.bit_count
        begin
        push    ebx ecx edx esi
        ; generate mask (in esi)
        xor     esi,esi
        dec     esi
        mov     ecx,[.bit_count]
        shl     esi,cl
        not     esi
        mov     ebx,[.stream]
        mov     eax,[ebx+BitStream.ptr]
        mov     edx,[eax+4]
        mov     eax,[eax]
        mov     ecx,[ebx+BitStream.bit_ptr]
        shrd    eax,edx,cl
        and     eax,esi
        ; fix the pointers
        mov     ebx,[.stream]
        mov     edx,[.bit_count]
        add     [ebx+BitStream.bit_ptr],edx
        cmp     [ebx+BitStream.bit_ptr],32
        jb      .finish
        sub     [ebx+BitStream.bit_ptr],32
        add     [ebx+BitStream.ptr],4
     .finish:
        pop     esi edx ecx ebx
        return
endp    
Post 28 Oct 2004, 14:10
View user's profile Send private message Visit poster's website Reply with quote
Matrix



Joined: 04 Sep 2004
Posts: 1166
Location: Overflow
Matrix 28 Oct 2004, 15:27
Hello Decard,
my suggestrion is you should try to minimalize memory usage,
and
in core functions don't use push/pop unless you really need them
other improvements won't make your code Amaizingly fast,
you know, CPU is many times faster than memory,
and:
the smaller the memory range you are using the bigger chance your system can cache the data in Data cache, which is about at least 8KB, but in newer processors, 16KB, 32KB and above.

another suggestion, if you can make use of fpu in core functions, you should do it, FPU will compute things in the background.
Fast MATRIX routines take adventage of FPU in parallel.

i have a question now:
anyone knows if pushing/poping all registers is faster, or pushad/popad ?
i haven't tested it yet.
Post 28 Oct 2004, 15:27
View user's profile Send private message Visit poster's website Reply with quote
roticv



Joined: 19 Jun 2003
Posts: 374
Location: Singapore
roticv 28 Oct 2004, 15:42
Hi Decard,

Why don't you post it on win32asm where people like The Svin and BitRAKE and mark larson(if he visits it) and some other talented assembly coder can beat it to pieces.

I will take a look when I have time. Currently don't really have the time to digest the codes and do any testing. To increase speed you definitely need to use sse for simulatenous calculation if it is possible.
Post 28 Oct 2004, 15:42
View user's profile Send private message Visit poster's website MSN Messenger Reply with quote
decard



Joined: 11 Sep 2003
Posts: 1092
Location: Poland
decard 28 Oct 2004, 15:56
Thanks, Victor, I will post it there Smile
Post 28 Oct 2004, 15:56
View user's profile Send private message Visit poster's website Reply with quote
Reverend



Joined: 24 Aug 2004
Posts: 408
Location: Poland
Reverend 28 Oct 2004, 21:41
Elo decard,
So first of all: a very common problem with speed is about "partial stalls". They occur, when you use a 16- or 8-bit register just after using 32-bit one, and vice versa. As I see, you have this very often. Second thing is that, you should avoid as much as possible writing to the register that has been previously changed. For example:
Code:
mov eax,[edx]
add eax,100 ; stall
    

but
Code:
mov [edx],eax
add eax,100
    

will be better
Of course these are not the same situations Smile but I just wanted to show you that writing after reading is ok, but reading/writing after writing slows down your code.

So now maybe let's put theory aside Smile and I will show you some piece of code:
Code:
proc BitsPut, .stream,.bits,.bit_count 
        begin 
        push    eax ebx ecx edx esi edi 

        or      eax,-1
        mov     ecx,[.bit_count] 
        mov     esi,eax
        mov     ebx,[.stream]
        shl     esi,cl 
        mov     edi,eax 
        mov     ecx,[ebx+BitStream.bit_ptr] 
        xor     edx,edx 
        shld    edi,esi,cl 
        shld    esi,eax,cl 

        mov     eax,[.bits] ; i couldn't find a way to
        shld    edx,eax,cl  ; avoid this stall :/
        shl     eax,cl

        mov     ebx,[ebx+BitStream.ptr] 
        mov     ecx,eax 

        and     [ebx],esi 
        and     [ebx+4],edi 
        or      [ebx],ecx 
        or      [ebx+4],edx 

        mov     eax,[.bit_count] 
        mov     ebx,[.stream] 
        add     eax,[ebx+BitStream.bit_ptr]
        cmp     eax,32
        mov     [ebx+BitStream.bit_ptr],eax
        jb      .finish
        
        sub     [ebx+BitStream.bit_ptr],32 
        add     [ebx+BitStream.ptr],4 
      .finish: 
        pop     edi esi edx ecx ebx eax 
        return 
endp
    

As you see, I mixed the code as much as I could so that there should be no (or the least possible) register stalls. And also I used the fact that instruction 'mov' doesn't change any flags, so that I could make some calculations on registers, not on memory (which is much slower). Look here:
Code:
        mov     eax,[.bit_count] 
        [...]
        add     eax,[ebx+BitStream.bit_ptr]
        cmp     eax,32
        mov     [ebx+BitStream.bit_ptr],eax
        jb      .finish

You had:

        mov     eax,[.bit_count] 
        add     [ebx+BitStream.bit_ptr],eax 
        cmp     [ebx+BitStream.bit_ptr],32 
        jb      .finish
    

In your code there's a stall, because you put something to eax, and just after it, you read from eax. And even tough, you read from that memory you just changed. Look what I did Laughing it's a better solution
Post 28 Oct 2004, 21:41
View user's profile Send private message Visit poster's website Reply with quote
Octavio



Joined: 21 Jun 2003
Posts: 366
Location: Spain
Octavio 28 Oct 2004, 22:09
decard wrote:
Bitstream is a small library for reading/writing on single bit. I wrote it for my LZSS compressor (it's easiest to achieve bigger compression ratio when you operate on single bits). Bitstreaming isn's a critical part of whe whole algo, but optimizing it could make whole compressor faster. Here I attach two functions: one that stores bits in a bit stream, and the other that reads them.
Could you help me in making some speed improvements to these functions? Meaby there is some other way that would be faster?




Code:
 
;cl  bit offset 0-31
;ch number of bits to read\write 0-32 
;edx  bits to write or bits readed
;edi   ptr to bufer better if aligned
;eax  stores bits for write it must be write at the end of the bistream.
 
bitsput:
xor ebx,ebx    
shld ebx,edx,cl ;save bits of next dword in ebx
shl edx,cl         ;put bits in position
or eax,edx       ;eax temporaly stores bits until we have 32 bits to write
add cl,ch
btr ecx,5         ;32 bits?
jc .l1
stosd               ;store 32 bits
mov eax,ebx
.l1:
ret

bitsget:
xor eax,eax
movzx edx,ch  ;prepare mask
bts eax,edx  ;ej: for size=4  10000b -1=1111b  
dec eax  ;mask
mov edx,[edi]
mov ebx,[edi+4]
shrd edx,ebx,cl
and edx,eax
add cl,ch
btr ecx,5        ;if offset >=32
sbb eax,eax    ;increment ptr 
and eax,4
add edi,eax
ret

    


perhaps you will need to add some push,pop,mov instructions to store variables in memory ,but this will slow down things.
I have not tested the code , so start by debug.
Post 28 Oct 2004, 22:09
View user's profile Send private message Visit poster's website Reply with quote
S.T.A.S.



Joined: 09 Jan 2004
Posts: 173
Location: Ru#27
S.T.A.S. 29 Oct 2004, 07:07
Hi, I don't think I can win The Svin & bitRAKE, however I can give some hints:
- your routine is rater small, so don't bother calling it, it's better to inline the code: call & ret are slow;
- pass parameters via registers;
- avoid 'shld' instruction - it's deadly slow on all CPUs;
- branches could be substitued with linear code in some cases.

Here's my code (the first half), I didn't test it heavily, but it seems Ok for me.
It's written as a macro, not proc, because of above notes.
Code:
macro   move    dest, src
{
        if dest eq src
        else
                mov dest,src
        end if
}


macro   put_bits        stream, bits, bit_count
{
        ; mask out unnecessary bits of source dword value
        or      ebx, -1
        move    ecx, bit_count
        move    eax, bits
        shl     ebx, cl
        not     ebx
        and     eax, ebx

        ; calculate masks & pointers
        move    esi, stream
        mov     edx, [esi + BitStream.bit_ptr]
        add     [esi + BitStream.bit_ptr], ecx  ; I belive there's no need to mask higher bits
        mov     ecx, edx
        or      edx, -1
        mov     edi, [esi + BitStream.ptr]
        shl     ebx, cl
        setc    bl
        shl     edx, cl
        rol     eax, cl ; this is slow on PIV.. may be substitued with some shl / shr
        movzx   ebx, bl
        mov     ecx, edx

        ; append bits to bitsteream
        and     edx, eax
        or      eax, ecx
        xor     eax, ecx

        or      [edi], edx      ; first dword of destination sould be set to zero before this routine is called first time
        mov     [edi + 4], eax  ; this also clears next dword

        lea     edi, [edi + ebx * 4]
        mov     [esi + BitStream.ptr], edi
}     

edit: BitStream.ptr increase fixed


Last edited by S.T.A.S. on 29 Oct 2004, 16:48; edited 1 time in total
Post 29 Oct 2004, 07:07
View user's profile Send private message Reply with quote
Matrix



Joined: 04 Sep 2004
Posts: 1166
Location: Overflow
Matrix 29 Oct 2004, 11:31
helloo roticv,
i guess you love talented gods like bitrake, but you shouldn't bother them with these very easy problems, why don't you give us amateurs a chance?

hy decard,
i wrote some code for you, after you removed unnecesary push / pop instructions, you might wanna test it.

tell me if you have a problem with it.
its not long code so i put it in code instead, i see you don't like to download Smile

Code:
; Matrix bitstream functions
; here you are, i knew xchg was slow, it could be a size optimization
; use rol cx,8 if its faster for your processor

write_bitstream:; eax = input bitstream from lsb, edi - ptr to buffer, better if aligned
xor ebx,ebx     ; cl  = bit offset 0-31
or ch,ch        ; ch  = number of bits to read\write 0-32
jz .done
mov edx,$FFFFFFFF ; 11111111111111111111
rol cx,8   ;xchg cl,ch ; put bit count in cl
shl edx,cl ; bit count in cl 11111111111111000000
not edx    ; complement      00000000000000111111
rol cx,8   ;xchg cl,ch ; bit offset in cl
shl edx,cl ; offset          00000001111111000000
not edx    ; complement      11111110000000111111
shld ebx,eax,cl ; ebx = upper part
shl eax,cl      ; offset eax
and [edi],edx   ; mask out first dword
or  [edi],eax   ; set bits in first dword
add cl,ch       ; get second dword number of bits
cmp cl,32       ; compare second dword count with 0,
jbe .done       ; if second dword not modified then skip
sub cl,32       ; get number of bits in second dword
mov edx,$FFFFFFFF ; 11111111111111111111
rol cx,8   ;xchg cl,ch        ; put bit count in cl
shl edx,cl        ; bit count in cl 11111111111111000000
and [edi+4],edx   ; mask out first dword
or  [edi+4],ebx   ; set bits in first dword
.done:ret

read_bitstream:; eax = output bitstream from lsb, edi - ptr to buffer, better if aligned
or ch,ch       ; cl  = bit offset 0-31
jz .done       ; ch  = number of bits to read\write 0-32
mov eax,[edi]  ; get first dword
mov ebx,[edi+4]; get second dword
shrd eax,ebx,cl; adjust by offset
mov edx,$FFFFFFFF ; 11111111111111111111
rol cx,8       ;xchg cl,ch     ; put bit count in cl
shl edx,cl     ; bit count in cl 11111111111111000000
not edx        ; complement      00000000000000111111
and eax,edx    ; mask out eax
.done:ret
    


Last edited by Matrix on 30 Oct 2004, 13:26; edited 2 times in total
Post 29 Oct 2004, 11:31
View user's profile Send private message Visit poster's website Reply with quote
Octavio



Joined: 21 Jun 2003
Posts: 366
Location: Spain
Octavio 29 Oct 2004, 16:12
Matrix wrote:
helloo roticv,
i guess you love talented gods like bitrake, but you shouldn't bother them with these very easy problems, why don't you give us amateurs a chance?

hy decard,
i wrote some code for you, after you removed unnecesary push / pop instructions, you might wanna test it.

tell me if you have a problem with it.

you forgot "xchg cl,ch" at the end
Post 29 Oct 2004, 16:12
View user's profile Send private message Visit poster's website Reply with quote
S.T.A.S.



Joined: 09 Jan 2004
Posts: 173
Location: Ru#27
S.T.A.S. 30 Oct 2004, 02:48
Are we optimizing for size, not for speed?
Post 30 Oct 2004, 02:48
View user's profile Send private message Reply with quote
roticv



Joined: 19 Jun 2003
Posts: 374
Location: Singapore
roticv 30 Oct 2004, 06:49
Speed.

Quote:

Could you help me in making some speed improvements to these functions?
Post 30 Oct 2004, 06:49
View user's profile Send private message Visit poster's website MSN Messenger Reply with quote
Octavio



Joined: 21 Jun 2003
Posts: 366
Location: Spain
Octavio 30 Oct 2004, 09:25
Matrix wrote:
helloo roticv,
i guess you love talented gods like bitrake, but you shouldn't bother them with these very easy problems, why don't you give us amateurs a chance?

hy decard,
i wrote some code for you, after you removed unnecesary push / pop instructions, you might wanna test it.

tell me if you have a problem with it.

without testing ,i can say it doesn´t work.
in the putsbits routine you use the number off bits to make the mask
but remember that part of the bits are writen in on doubleword and part are writen in the next.
also the two 'not edx' instructions are unnecessary.
and your code will never be faster than the one i wrote. Twisted Evil
Post 30 Oct 2004, 09:25
View user's profile Send private message Visit poster's website Reply with quote
Matrix



Joined: 04 Sep 2004
Posts: 1166
Location: Overflow
Matrix 30 Oct 2004, 10:58
hy Octavio, you might be right, your code contains less instructions than my,
but i don't see the main purpose of the function in yours.
therer are errors in it.
you should compare what it does with my code.
ps.: have you been thinking of instruction pairings on pentium processors? Smile
for example these coud not pair on pentium class processors:

Code:
xor eax,eax
bts eax,edx
dec eax
    


these could pair i guess, both first 2 or second 2 lines:

Code:
mov edx,$FFFFFFFF ; 11111111111111111111
xchg cl,ch ; put bit count in cl
shl edx,cl
    
Post 30 Oct 2004, 10:58
View user's profile Send private message Visit poster's website Reply with quote
Matrix



Joined: 04 Sep 2004
Posts: 1166
Location: Overflow
Matrix 30 Oct 2004, 12:46
Octavio wrote:

without testing ,i can say it doesn´t work.
in the putsbits routine you use the number off bits to make the mask
but remember that part of the bits are writen in on doubleword and part are writen in the next.
also the two 'not edx' instructions are unnecessary.
and your code will never be faster than the one i wrote. Twisted Evil


well thank you, i thought i was understanding the purpose of logical operations,

i think you should test codes before quoting them as not working if you don't see the code executing.
Post 30 Oct 2004, 12:46
View user's profile Send private message Visit poster's website Reply with quote
Octavio



Joined: 21 Jun 2003
Posts: 366
Location: Spain
Octavio 30 Oct 2004, 13:06
Matrix wrote:
ps.: have you been thinking of instruction pairings on pentium processors? Smile
for example these coud not pair on pentium class processors:

Code:
xor eax,eax
bts eax,edx
dec eax
    


these could pair i guess, both first 2 or second 2 lines:

Code:
mov edx,$FFFFFFFF ; 11111111111111111111
xchg cl,ch ; put bit count in cl
shl edx,cl
    


have you readed intel and amd optimizations manual?
mov edx,0ffffffffh ;on amd this instructions can take 0 cpu clocks
;but increases cache misses because the opcode is 5 bytes
xchg cl,ch ;slow on some cpus i think is not pareable and can cause stalls
shl edx,cl ;another slow and not pareable instruction
not edx ;you forget this one, not pareable with previous instruction

it´s better to think on good algorithms than on these kind of optimizations
for example instead of writing some bits in memory on every funcion call
using 'and mem,mask or mem,bits' what i do is store bits on a register and when the register is full i wrote it to memory, so i make less memory
operations ,and my code is faster,and also smaller,but still i have not tested it ,can you tell me please where you found errors?
Post 30 Oct 2004, 13:06
View user's profile Send private message Visit poster's website Reply with quote
Matrix



Joined: 04 Sep 2004
Posts: 1166
Location: Overflow
Matrix 30 Oct 2004, 13:21
Well i have a Celeron II, i have not read all optimization manuals yet, maeby someday ...

here 's your code:

Code:
;cl  bit offset 0-31
;ch number of bits to read\write 0-32
;edx  bits to write or bits readed
;edi   ptr to bufer better if aligned
;eax  stores bits for write it must be write at the end of the bistream.

bitsput:
xor ebx,ebx
shld ebx,edx,cl ;save bits of next dword in ebx , <<< you have not read the words yet
shl edx,cl         ;put bits in position
or eax,edx       ;eax temporaly stores bits until we have 32 bits to write <<< you have not zeroed EAX
add cl,ch
btr ecx,5         ;32 bits? <<< you are using a complex microcode for testing a bit
jc .l1
stosd               ;store 32 bits <<< where is the upper part? you know write 32 bits with offset, or 16 bit with >16bit offset
mov eax,ebx     ;<<< sure Smile , but where's the rest?
.l1:
ret

bitsget:
xor eax,eax
movzx edx,ch  ;prepare mask <<< well actually you wrote mov edx,ch first, you can't do that, but are you sure this is the best way to generate a mask?
bts eax,edx  ;ej: for size=4  10000b -1=1111b ; <<< complex microcode again
dec eax  ;mask <<< you are using edx again
mov edx,[edi]
mov ebx,[edi+4]
shrd edx,ebx,cl
and edx,eax
; <<< what do you mean by the followings? i thought this is only a get function
add cl,ch
btr ecx,5        ;if offset >=32
sbb eax,eax    ;increment ptr
and eax,4
add edi,eax
ret
    
Post 30 Oct 2004, 13:21
View user's profile Send private message Visit poster's website Reply with quote
Octavio



Joined: 21 Jun 2003
Posts: 366
Location: Spain
Octavio 30 Oct 2004, 14:58
Matrix wrote:
Code:
; <<< what do you mean by the followings? i thought this is only a get function
add cl,ch
btr ecx,5        ;if offset >=32
sbb eax,eax    ;increment ptr
and eax,4
add edi,eax
ret
    

Oops you are right there is another bug ...... in your code Laughing
in the first post is especified :

>BitsGet reads given number of bits, and moves the pointer too.
but you don´t increment the bufer pointer.


Matrix wrote:

shld ebx,edx,cl ;save bits of next dword in ebx , <<< you have not read the words

;edx bits to write or bits readed
this means that the calling rutines places bits to be write in edx register, there is no need to read any word.
since i don´t want to write a book to explain my code ,because even you will not buy it, i suggest that you run the code with a debugger to understand it.
Post 30 Oct 2004, 14:58
View user's profile Send private message Visit poster's website Reply with quote
Matrix



Joined: 04 Sep 2004
Posts: 1166
Location: Overflow
Matrix 30 Oct 2004, 15:20
octavio wrote:
Oops you are right there is another bug ...... in your code
in the first post is especified :

i hate bugs, don't say such a thing
and i guess everyone knows how to increase a number

Smile
okey that was i thought it did, so i thought you need to read a number of bits from an array indexed by [edi] , number of bits=0-32 , bit offset = 0-31

here's my code without shld/shrd (corrected version 2004.11.01)
Code:
; Matrix bitstream functions
; here you are, i knew xchg was slow, it could be a size optimization
; use rol cx,8 if its faster for your processor
; of course note that call does a push and a pop too
; in this version i have avoided the use of shld, shrd

write_bitstream:; eax = input bitstream from lsb, edi - ptr to buffer, better if aligned
or ch,ch        ; cl  = bit offset 0-31
jz .done        ; ch  = number of bits to read\write 0-32
mov edx,$FFFFFFFF ; 11111111111111111111
rol eax,cl ; hehe
mov ebx,eax
; --- this part is for zeroing out unneeded bits ( if any ) in the upper part
rol cx,8   ;xchg cl,ch ; put bit count in cl
add cl,ch
cmp cl,32
jbe .notzeroe
sub cl,32
shl edx,cl ; bit count in cl 11111111111111111000
and [edi+4],edx   ; mask out first dword
not edx    ; complement      00000000000000000111
and ebx,edx
or  [edi+4],ebx   ; set bits in first dword
add cl,32
.notzeroe:
sub cl,ch
rol cx,8   ;xchg cl,ch ; put offset in cl
; --- this part is for zeroing out unneeded bits ( if any ) in the upper part
mov edx,$FFFFFFFF ; 11111111111111111111
shl edx,cl ; offset in cl 11111111111111000000
; if you are zeroing out use this instead
and eax,edx

rol cx,8   ;xchg cl,ch ; put bit count in cl
mov edx,$FFFFFFFF ; 11111111111111111111
shl edx,cl ; bit count in cl 11111111111111000000
rol cx,8   ;xchg cl,ch ; put offset in cl
shl edx,cl ; offset          00000001111111000000
not edx    ; complement      11111110000000111111
rol cx,8   ;xchg cl,ch ; put bit count in cl
and [edi],edx   ; mask out first dword
or  [edi],eax   ; set bits in first dword
ret

read_bitstream:; eax = output bitstream from lsb, edi - ptr to buffer, better if aligned
or ch,ch       ; cl  = bit offset 0-31
jz .done       ; ch  = number of bits to read\write 0-32
mov eax,[edi]  ; get first dword
mov ebx,[edi+4]; get second dword
mov edx,$FFFFFFFF ; 11111111111111111111
shr eax,cl
shr edx,cl ; offset in cl 00000011111111111111
ror ebx,cl
not edx    ; offset in cl 11111100000000000000
and ebx,edx
or eax,ebx
mov edx,$FFFFFFFF ; 11111111111111111111
rol cx,8       ;xchg cl,ch     ; put bit count in cl
shl edx,cl     ; bit count in cl 11111111111111000000
not edx        ; complement      00000000000000111111
and eax,edx    ; mask out eax
.done:ret
    

decard, tell me if i were all wrong.
ps.:
decard, the hi part of ecx is free for use Smile


Last edited by Matrix on 01 Nov 2004, 04:03; edited 1 time in total
Post 30 Oct 2004, 15:20
View user's profile Send private message Visit poster's website Reply with quote
S.T.A.S.



Joined: 09 Jan 2004
Posts: 173
Location: Ru#27
S.T.A.S. 31 Oct 2004, 04:53
Matrix wrote:
i guess everyone knows how to increase a number
Yes, you're right - it's rather simple task Smile
But we want conditional increase. Jcc isn't nice way to do that - branches misprediction is very bad.
Also, your last code heavily uses partial registers - IMHO, this could be considered slow, too.

I see you figured out how to use rol instead of shld Wink
But you miss some boolean algebra:
Code:
not edx
and ebx,edx
not edx
and [edi+4],edx   ; mask out first dword     
may be substituted with
Code:
or  ebx,edx
xor ebx,edx
and [edi+4],edx    
Also, use
Code:
or  edx, -1 ;11111111111111111111    
instead of
Code:
mov edx,$FFFFFFFF ; 11111111111111111111    
it's smaller in size, so better for decoding unit.

PS: I agree with Octavio - initial code is a bit HLL fashioned, assembly style is to accumulate bits in a register, and save it to memory then..
Post 31 Oct 2004, 04:53
View user's profile Send private message Reply with quote
Matrix



Joined: 04 Sep 2004
Posts: 1166
Location: Overflow
Matrix 31 Oct 2004, 09:11
S.T.A.S. wrote:
PS: I agree with Octavio - initial code is a bit HLL fashioned, assembly style is to accumulate bits in a register, and save it to memory then..

Hello S.T.A.S.
thank you for your quote.
i'll take your advice and think more.

my quote is:
"and [mem],reg" and "or [mem],reg" preserves storing the data in a register, wonder why you don't like it.

however i used to do several bytes of size & some noticable speed optimizations using logical operations instead of mov.
But i wasn't sure about what you said, its that mov ebx,$ffffffff is sllower than or ebx,$ffffffff , and out friend here said before that mov ebx,$ffffffff takes 0 ticks on some cpu-s.

S.T.A.S. wrote:

But we want conditional increase. Jcc isn't nice way to do that - branches misprediction is very bad.

i used adc reg,0 in some cases
Post 31 Oct 2004, 09:11
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.