flat assembler
Message board for the users of flat assembler.
Index
> Main > algoritm optimization: bitstreams Goto page 1, 2 Next |
Author |
|
decard 28 Oct 2004, 14:10
Bitstream is a small library for reading/writing on single bit. I wrote it for my LZSS compressor (it's easiest to achieve bigger compression ratio when you operate on single bits). Bitstreaming isn's a critical part of whe whole algo, but optimizing it could make whole compressor faster. Here I attach two functions: one that stores bits in a bit stream, and the other that reads them.
Could you help me in making some speed improvements to these functions? Meaby there is some other way that would be faster? Here's a short description of the algorithm I used: - BitStream is a structure that contains all data needed to do streaming: Code: struct BitStream .buf dd ? .ptr dd ? .bit_ptr dd ? ends .buf is a pointer to the first byte of the stream .ptr points on a current DWORD .bit_ptr is an bit offset relative to the .ptr function BitsPut stores given number of given bits into the stream, and then fixes [.ptr] and [bit_ptr] members. It can write at most 32 bits into the stream. BitsGet reads given number of bits, and moves the pointer too. Both functions firstly generate bit mask that is used to store bits, then they actually write them. The complexity comes because bits to be stored can "cross" two DWORDs. So, could you make any improvements? I would be really grateful... If you want to see how the bitstreams work in a real program, then just download lzss library and test code. thanks, decard Code: ; puts given number of bits into stream proc BitsPut, .stream,.bits,.bit_count begin push eax ebx ecx edx esi edi ; generate mask in edi:esi xor eax,eax dec eax mov esi,eax mov edi,eax mov ebx,[.stream] mov ecx,[.bit_count] shl esi,cl mov ecx,[ebx+BitStream.bit_ptr] shld edi,esi,cl shld esi,eax,cl ; load new bits into edx:ecx; at this point ecx==[ebx+BitStream.bit_ptr] xor edx,edx mov eax,[.bits] shld edx,eax,cl shl eax,cl mov ecx,eax ; finally, store new bits using mov ebx,[ebx+BitStream.ptr] and [ebx],esi or [ebx],ecx add ebx,4 and [ebx],edi or [ebx],edx ; and fix the pointers mov ebx,[.stream] mov eax,[.bit_count] add [ebx+BitStream.bit_ptr],eax cmp [ebx+BitStream.bit_ptr],32 jb .finish sub [ebx+BitStream.bit_ptr],32 add [ebx+BitStream.ptr],4 .finish: pop edi esi edx ecx ebx eax return endp ; reads given number of bits form the stream proc BitsGet, .stream,.bit_count begin push ebx ecx edx esi ; generate mask (in esi) xor esi,esi dec esi mov ecx,[.bit_count] shl esi,cl not esi mov ebx,[.stream] mov eax,[ebx+BitStream.ptr] mov edx,[eax+4] mov eax,[eax] mov ecx,[ebx+BitStream.bit_ptr] shrd eax,edx,cl and eax,esi ; fix the pointers mov ebx,[.stream] mov edx,[.bit_count] add [ebx+BitStream.bit_ptr],edx cmp [ebx+BitStream.bit_ptr],32 jb .finish sub [ebx+BitStream.bit_ptr],32 add [ebx+BitStream.ptr],4 .finish: pop esi edx ecx ebx return endp |
|||
28 Oct 2004, 14:10 |
|
roticv 28 Oct 2004, 15:42
Hi Decard,
Why don't you post it on win32asm where people like The Svin and BitRAKE and mark larson(if he visits it) and some other talented assembly coder can beat it to pieces. I will take a look when I have time. Currently don't really have the time to digest the codes and do any testing. To increase speed you definitely need to use sse for simulatenous calculation if it is possible. |
|||
28 Oct 2004, 15:42 |
|
decard 28 Oct 2004, 15:56
Thanks, Victor, I will post it there
|
|||
28 Oct 2004, 15:56 |
|
Reverend 28 Oct 2004, 21:41
Elo decard,
So first of all: a very common problem with speed is about "partial stalls". They occur, when you use a 16- or 8-bit register just after using 32-bit one, and vice versa. As I see, you have this very often. Second thing is that, you should avoid as much as possible writing to the register that has been previously changed. For example: Code: mov eax,[edx] add eax,100 ; stall but Code: mov [edx],eax add eax,100 will be better Of course these are not the same situations but I just wanted to show you that writing after reading is ok, but reading/writing after writing slows down your code. So now maybe let's put theory aside and I will show you some piece of code: Code: proc BitsPut, .stream,.bits,.bit_count begin push eax ebx ecx edx esi edi or eax,-1 mov ecx,[.bit_count] mov esi,eax mov ebx,[.stream] shl esi,cl mov edi,eax mov ecx,[ebx+BitStream.bit_ptr] xor edx,edx shld edi,esi,cl shld esi,eax,cl mov eax,[.bits] ; i couldn't find a way to shld edx,eax,cl ; avoid this stall :/ shl eax,cl mov ebx,[ebx+BitStream.ptr] mov ecx,eax and [ebx],esi and [ebx+4],edi or [ebx],ecx or [ebx+4],edx mov eax,[.bit_count] mov ebx,[.stream] add eax,[ebx+BitStream.bit_ptr] cmp eax,32 mov [ebx+BitStream.bit_ptr],eax jb .finish sub [ebx+BitStream.bit_ptr],32 add [ebx+BitStream.ptr],4 .finish: pop edi esi edx ecx ebx eax return endp As you see, I mixed the code as much as I could so that there should be no (or the least possible) register stalls. And also I used the fact that instruction 'mov' doesn't change any flags, so that I could make some calculations on registers, not on memory (which is much slower). Look here: Code: mov eax,[.bit_count] [...] add eax,[ebx+BitStream.bit_ptr] cmp eax,32 mov [ebx+BitStream.bit_ptr],eax jb .finish You had: mov eax,[.bit_count] add [ebx+BitStream.bit_ptr],eax cmp [ebx+BitStream.bit_ptr],32 jb .finish In your code there's a stall, because you put something to eax, and just after it, you read from eax. And even tough, you read from that memory you just changed. Look what I did it's a better solution |
|||
28 Oct 2004, 21:41 |
|
Octavio 28 Oct 2004, 22:09
decard wrote: Bitstream is a small library for reading/writing on single bit. I wrote it for my LZSS compressor (it's easiest to achieve bigger compression ratio when you operate on single bits). Bitstreaming isn's a critical part of whe whole algo, but optimizing it could make whole compressor faster. Here I attach two functions: one that stores bits in a bit stream, and the other that reads them. Code: ;cl bit offset 0-31 ;ch number of bits to read\write 0-32 ;edx bits to write or bits readed ;edi ptr to bufer better if aligned ;eax stores bits for write it must be write at the end of the bistream. bitsput: xor ebx,ebx shld ebx,edx,cl ;save bits of next dword in ebx shl edx,cl ;put bits in position or eax,edx ;eax temporaly stores bits until we have 32 bits to write add cl,ch btr ecx,5 ;32 bits? jc .l1 stosd ;store 32 bits mov eax,ebx .l1: ret bitsget: xor eax,eax movzx edx,ch ;prepare mask bts eax,edx ;ej: for size=4 10000b -1=1111b dec eax ;mask mov edx,[edi] mov ebx,[edi+4] shrd edx,ebx,cl and edx,eax add cl,ch btr ecx,5 ;if offset >=32 sbb eax,eax ;increment ptr and eax,4 add edi,eax ret perhaps you will need to add some push,pop,mov instructions to store variables in memory ,but this will slow down things. I have not tested the code , so start by debug. |
|||
28 Oct 2004, 22:09 |
|
S.T.A.S. 29 Oct 2004, 07:07
Hi, I don't think I can win The Svin & bitRAKE, however I can give some hints:
- your routine is rater small, so don't bother calling it, it's better to inline the code: call & ret are slow; - pass parameters via registers; - avoid 'shld' instruction - it's deadly slow on all CPUs; - branches could be substitued with linear code in some cases. Here's my code (the first half), I didn't test it heavily, but it seems Ok for me. It's written as a macro, not proc, because of above notes. Code: macro move dest, src { if dest eq src else mov dest,src end if } macro put_bits stream, bits, bit_count { ; mask out unnecessary bits of source dword value or ebx, -1 move ecx, bit_count move eax, bits shl ebx, cl not ebx and eax, ebx ; calculate masks & pointers move esi, stream mov edx, [esi + BitStream.bit_ptr] add [esi + BitStream.bit_ptr], ecx ; I belive there's no need to mask higher bits mov ecx, edx or edx, -1 mov edi, [esi + BitStream.ptr] shl ebx, cl setc bl shl edx, cl rol eax, cl ; this is slow on PIV.. may be substitued with some shl / shr movzx ebx, bl mov ecx, edx ; append bits to bitsteream and edx, eax or eax, ecx xor eax, ecx or [edi], edx ; first dword of destination sould be set to zero before this routine is called first time mov [edi + 4], eax ; this also clears next dword lea edi, [edi + ebx * 4] mov [esi + BitStream.ptr], edi } edit: BitStream.ptr increase fixed Last edited by S.T.A.S. on 29 Oct 2004, 16:48; edited 1 time in total |
|||
29 Oct 2004, 07:07 |
|
Matrix 29 Oct 2004, 11:31
helloo roticv,
i guess you love talented gods like bitrake, but you shouldn't bother them with these very easy problems, why don't you give us amateurs a chance? hy decard, i wrote some code for you, after you removed unnecesary push / pop instructions, you might wanna test it. tell me if you have a problem with it. its not long code so i put it in code instead, i see you don't like to download Code: ; Matrix bitstream functions ; here you are, i knew xchg was slow, it could be a size optimization ; use rol cx,8 if its faster for your processor write_bitstream:; eax = input bitstream from lsb, edi - ptr to buffer, better if aligned xor ebx,ebx ; cl = bit offset 0-31 or ch,ch ; ch = number of bits to read\write 0-32 jz .done mov edx,$FFFFFFFF ; 11111111111111111111 rol cx,8 ;xchg cl,ch ; put bit count in cl shl edx,cl ; bit count in cl 11111111111111000000 not edx ; complement 00000000000000111111 rol cx,8 ;xchg cl,ch ; bit offset in cl shl edx,cl ; offset 00000001111111000000 not edx ; complement 11111110000000111111 shld ebx,eax,cl ; ebx = upper part shl eax,cl ; offset eax and [edi],edx ; mask out first dword or [edi],eax ; set bits in first dword add cl,ch ; get second dword number of bits cmp cl,32 ; compare second dword count with 0, jbe .done ; if second dword not modified then skip sub cl,32 ; get number of bits in second dword mov edx,$FFFFFFFF ; 11111111111111111111 rol cx,8 ;xchg cl,ch ; put bit count in cl shl edx,cl ; bit count in cl 11111111111111000000 and [edi+4],edx ; mask out first dword or [edi+4],ebx ; set bits in first dword .done:ret read_bitstream:; eax = output bitstream from lsb, edi - ptr to buffer, better if aligned or ch,ch ; cl = bit offset 0-31 jz .done ; ch = number of bits to read\write 0-32 mov eax,[edi] ; get first dword mov ebx,[edi+4]; get second dword shrd eax,ebx,cl; adjust by offset mov edx,$FFFFFFFF ; 11111111111111111111 rol cx,8 ;xchg cl,ch ; put bit count in cl shl edx,cl ; bit count in cl 11111111111111000000 not edx ; complement 00000000000000111111 and eax,edx ; mask out eax .done:ret Last edited by Matrix on 30 Oct 2004, 13:26; edited 2 times in total |
|||
29 Oct 2004, 11:31 |
|
Octavio 29 Oct 2004, 16:12
Matrix wrote: helloo roticv, you forgot "xchg cl,ch" at the end |
|||
29 Oct 2004, 16:12 |
|
S.T.A.S. 30 Oct 2004, 02:48
Are we optimizing for size, not for speed?
|
|||
30 Oct 2004, 02:48 |
|
roticv 30 Oct 2004, 06:49
Speed.
Quote:
|
|||
30 Oct 2004, 06:49 |
|
Octavio 30 Oct 2004, 09:25
Matrix wrote: helloo roticv, without testing ,i can say it doesn´t work. in the putsbits routine you use the number off bits to make the mask but remember that part of the bits are writen in on doubleword and part are writen in the next. also the two 'not edx' instructions are unnecessary. and your code will never be faster than the one i wrote. |
|||
30 Oct 2004, 09:25 |
|
Matrix 30 Oct 2004, 10:58
hy Octavio, you might be right, your code contains less instructions than my,
but i don't see the main purpose of the function in yours. therer are errors in it. you should compare what it does with my code. ps.: have you been thinking of instruction pairings on pentium processors? for example these coud not pair on pentium class processors: Code: xor eax,eax bts eax,edx dec eax these could pair i guess, both first 2 or second 2 lines: Code: mov edx,$FFFFFFFF ; 11111111111111111111 xchg cl,ch ; put bit count in cl shl edx,cl |
|||
30 Oct 2004, 10:58 |
|
Matrix 30 Oct 2004, 12:46
Octavio wrote:
well thank you, i thought i was understanding the purpose of logical operations, i think you should test codes before quoting them as not working if you don't see the code executing. |
|||
30 Oct 2004, 12:46 |
|
Octavio 30 Oct 2004, 13:06
Matrix wrote: ps.: have you been thinking of instruction pairings on pentium processors? have you readed intel and amd optimizations manual? mov edx,0ffffffffh ;on amd this instructions can take 0 cpu clocks ;but increases cache misses because the opcode is 5 bytes xchg cl,ch ;slow on some cpus i think is not pareable and can cause stalls shl edx,cl ;another slow and not pareable instruction not edx ;you forget this one, not pareable with previous instruction it´s better to think on good algorithms than on these kind of optimizations for example instead of writing some bits in memory on every funcion call using 'and mem,mask or mem,bits' what i do is store bits on a register and when the register is full i wrote it to memory, so i make less memory operations ,and my code is faster,and also smaller,but still i have not tested it ,can you tell me please where you found errors? |
|||
30 Oct 2004, 13:06 |
|
Matrix 30 Oct 2004, 13:21
Well i have a Celeron II, i have not read all optimization manuals yet, maeby someday ...
here 's your code: Code: ;cl bit offset 0-31 ;ch number of bits to read\write 0-32 ;edx bits to write or bits readed ;edi ptr to bufer better if aligned ;eax stores bits for write it must be write at the end of the bistream. bitsput: xor ebx,ebx shld ebx,edx,cl ;save bits of next dword in ebx , <<< you have not read the words yet shl edx,cl ;put bits in position or eax,edx ;eax temporaly stores bits until we have 32 bits to write <<< you have not zeroed EAX add cl,ch btr ecx,5 ;32 bits? <<< you are using a complex microcode for testing a bit jc .l1 stosd ;store 32 bits <<< where is the upper part? you know write 32 bits with offset, or 16 bit with >16bit offset mov eax,ebx ;<<< sure , but where's the rest? .l1: ret bitsget: xor eax,eax movzx edx,ch ;prepare mask <<< well actually you wrote mov edx,ch first, you can't do that, but are you sure this is the best way to generate a mask? bts eax,edx ;ej: for size=4 10000b -1=1111b ; <<< complex microcode again dec eax ;mask <<< you are using edx again mov edx,[edi] mov ebx,[edi+4] shrd edx,ebx,cl and edx,eax ; <<< what do you mean by the followings? i thought this is only a get function add cl,ch btr ecx,5 ;if offset >=32 sbb eax,eax ;increment ptr and eax,4 add edi,eax ret |
|||
30 Oct 2004, 13:21 |
|
Octavio 30 Oct 2004, 14:58
Matrix wrote:
Oops you are right there is another bug ...... in your code in the first post is especified : >BitsGet reads given number of bits, and moves the pointer too. but you don´t increment the bufer pointer. Matrix wrote:
;edx bits to write or bits readed this means that the calling rutines places bits to be write in edx register, there is no need to read any word. since i don´t want to write a book to explain my code ,because even you will not buy it, i suggest that you run the code with a debugger to understand it. |
|||
30 Oct 2004, 14:58 |
|
Matrix 30 Oct 2004, 15:20
octavio wrote: Oops you are right there is another bug ...... in your code i hate bugs, don't say such a thing and i guess everyone knows how to increase a number okey that was i thought it did, so i thought you need to read a number of bits from an array indexed by [edi] , number of bits=0-32 , bit offset = 0-31 here's my code without shld/shrd (corrected version 2004.11.01) Code: ; Matrix bitstream functions ; here you are, i knew xchg was slow, it could be a size optimization ; use rol cx,8 if its faster for your processor ; of course note that call does a push and a pop too ; in this version i have avoided the use of shld, shrd write_bitstream:; eax = input bitstream from lsb, edi - ptr to buffer, better if aligned or ch,ch ; cl = bit offset 0-31 jz .done ; ch = number of bits to read\write 0-32 mov edx,$FFFFFFFF ; 11111111111111111111 rol eax,cl ; hehe mov ebx,eax ; --- this part is for zeroing out unneeded bits ( if any ) in the upper part rol cx,8 ;xchg cl,ch ; put bit count in cl add cl,ch cmp cl,32 jbe .notzeroe sub cl,32 shl edx,cl ; bit count in cl 11111111111111111000 and [edi+4],edx ; mask out first dword not edx ; complement 00000000000000000111 and ebx,edx or [edi+4],ebx ; set bits in first dword add cl,32 .notzeroe: sub cl,ch rol cx,8 ;xchg cl,ch ; put offset in cl ; --- this part is for zeroing out unneeded bits ( if any ) in the upper part mov edx,$FFFFFFFF ; 11111111111111111111 shl edx,cl ; offset in cl 11111111111111000000 ; if you are zeroing out use this instead and eax,edx rol cx,8 ;xchg cl,ch ; put bit count in cl mov edx,$FFFFFFFF ; 11111111111111111111 shl edx,cl ; bit count in cl 11111111111111000000 rol cx,8 ;xchg cl,ch ; put offset in cl shl edx,cl ; offset 00000001111111000000 not edx ; complement 11111110000000111111 rol cx,8 ;xchg cl,ch ; put bit count in cl and [edi],edx ; mask out first dword or [edi],eax ; set bits in first dword ret read_bitstream:; eax = output bitstream from lsb, edi - ptr to buffer, better if aligned or ch,ch ; cl = bit offset 0-31 jz .done ; ch = number of bits to read\write 0-32 mov eax,[edi] ; get first dword mov ebx,[edi+4]; get second dword mov edx,$FFFFFFFF ; 11111111111111111111 shr eax,cl shr edx,cl ; offset in cl 00000011111111111111 ror ebx,cl not edx ; offset in cl 11111100000000000000 and ebx,edx or eax,ebx mov edx,$FFFFFFFF ; 11111111111111111111 rol cx,8 ;xchg cl,ch ; put bit count in cl shl edx,cl ; bit count in cl 11111111111111000000 not edx ; complement 00000000000000111111 and eax,edx ; mask out eax .done:ret decard, tell me if i were all wrong. ps.: decard, the hi part of ecx is free for use Last edited by Matrix on 01 Nov 2004, 04:03; edited 1 time in total |
|||
30 Oct 2004, 15:20 |
|
S.T.A.S. 31 Oct 2004, 04:53
Matrix wrote: i guess everyone knows how to increase a number But we want conditional increase. Jcc isn't nice way to do that - branches misprediction is very bad. Also, your last code heavily uses partial registers - IMHO, this could be considered slow, too. I see you figured out how to use rol instead of shld But you miss some boolean algebra: Code: not edx and ebx,edx not edx and [edi+4],edx ; mask out first dword Code: or ebx,edx xor ebx,edx and [edi+4],edx Code: or edx, -1 ;11111111111111111111 Code: mov edx,$FFFFFFFF ; 11111111111111111111 PS: I agree with Octavio - initial code is a bit HLL fashioned, assembly style is to accumulate bits in a register, and save it to memory then.. |
|||
31 Oct 2004, 04:53 |
|
Matrix 31 Oct 2004, 09:11
S.T.A.S. wrote: PS: I agree with Octavio - initial code is a bit HLL fashioned, assembly style is to accumulate bits in a register, and save it to memory then.. Hello S.T.A.S. thank you for your quote. i'll take your advice and think more. my quote is: "and [mem],reg" and "or [mem],reg" preserves storing the data in a register, wonder why you don't like it. however i used to do several bytes of size & some noticable speed optimizations using logical operations instead of mov. But i wasn't sure about what you said, its that mov ebx,$ffffffff is sllower than or ebx,$ffffffff , and out friend here said before that mov ebx,$ffffffff takes 0 ticks on some cpu-s. S.T.A.S. wrote:
i used adc reg,0 in some cases |
|||
31 Oct 2004, 09:11 |
|
Goto page 1, 2 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.