flat assembler
Message board for the users of flat assembler.

Index > Projects and Ideas > My Inflate implementation. (was Working with bit streams...)

Goto page 1, 2  Next
Author
Thread Post new topic Reply to topic
JohnFound



Joined: 16 Jun 2003
Posts: 3500
Location: Bulgaria
JohnFound
Recently I am working on Deflate decompression algorithm. ( the recent version ).

I tried several ways of implementation, using SHLD instructions, but they seems to be too slow. At least my implementation was much slower than gzip on Linux and Windows.

Anyway, the second attempt was made, using different technique. Here, as an illustration, is a small sample of code, reading CL bits from the input stream:
Code:
        
; At the start: 
;    esi: pointer to the current byte in the input stream.
;    eax: the current dword of the stream.
;    ch: how many bits remains for reading in eax
;    cl: how many bits to be read in edx.
; At the end:
;    edx - bits read from the stream.
        or      edx, -1
        shl     edx, cl
        not     edx

        and     edx, eax
        shr     eax, cl

        sub     ch, cl
        jg      @f

        add     esi, 2
        cmp     esi, [.pEnd]
        jae     .error

        mov     cl, ch
        mov     eax, [esi]
        add     ch, $10
        rol     eax, cl

@@:    


This snippet illustrates the worst case reading arbitrary count of bits (Up to 16 at once actually).
When needed fixed count, I am using more simple code, but the ideas are the same:
Code:
; The same as the above, but read fixed count of 5 bits.
; It does not need bit mask to be constructed.
        mov     edx, eax
        shr     eax, 5

        and     edx, $1f

        sub     ch, 5
        jg      @f

        add     esi, 2
        cmp     esi, [.pEnd]
        jae     .error

        mov     cl, ch
        mov     eax, [esi]
        add     ch, $10
        rol     eax, cl

@@:    


This code works better that the first versions and works approximately with the same speed as GZIP. (which is pretty optimized piece of code) The size of the whole library is also acceptable (~1400 bytes).

So, how this code can be improved? Other, faster technique able to provide this functionality? MMX? Smile

[EDIT]The subject changed.[/EDIT]

_________________
Tox ID: 48C0321ADDB2FE5F644BB5E3D58B0D58C35E5BCBC81D7CD333633FEDF1047914A534256478D9


Last edited by JohnFound on 06 Dec 2014, 12:02; edited 1 time in total
Post 26 Nov 2014, 17:12
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
Tomasz Grysztar
Assembly Artist


Joined: 16 Jun 2003
Posts: 7420
Location: Kraków, Poland
Tomasz Grysztar
On a slight tangent, this reminds me that I still don't own a processor that would support BMI1/BMI2 (or even TBM) instructions. I really wanted to try them out! Wink
Post 26 Nov 2014, 18:58
View user's profile Send private message Visit poster's website Reply with quote
cod3b453



Joined: 25 Aug 2004
Posts: 619
cod3b453
What is the range of- or set of possible- shifts that need to be applied?
Post 26 Nov 2014, 19:10
View user's profile Send private message Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3500
Location: Bulgaria
JohnFound
@cod3b453: Needed bit fields are from 1 to 13 bits wide. The direction is only forward from the beginning to the end of the stream.
( If this is what you are asking about... Wink )
Post 26 Nov 2014, 19:15
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3500
Location: Bulgaria
JohnFound
@Tomasz: My new netbook has A4-1200 CPU, that IIRC supports BMI1 and ABM.
Post 26 Nov 2014, 19:16
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 331
Location: Australia
redsock
Having written the full boat of deflate and inflate in x86_64, I can only tell you how I implemented the bit buffering... both deflate and inflate in my library are on average 30% faster than the reference gzip/zlib.

I am sure you are probably already aware of how the reference C version does it, and that is it keeps a "hold" and a bitcount that is in it, and just snips off the lower bits as needed (and occasionally of course puts them back when necessary).

My macros to actually feed my own version wouldn't make sense here in standalone fashion, but since I am living in the land-o-plenty register-wise, I dedicated a register for each. My hold is 64 bits wide, and I use a 32bit register to keep track of its bitcount.

I disliked how the reference version only ever fed its hold with 8 bits at a time (and due to its portability requirements/etc it must be that way), so I feed my "accumulator" with 32 bits at a time, and it is quite fast. All of the non-general purpose instructions to deal with bit shuffling around are indeed much slower. Smile

My actual full-spec-compatible inflate routine is 12736 bytes including embedded table and jump data, then toss in the adler32, crc32, file/memory management, and everything else that is needed to make a standalone inflate, my end static binary is ~30K.

Anyway, using my accumulator method ended up being nearly 1:1 with how the reference zlib does it... in the below code snippet, the accumulator (especially early on) is prepopulated, so the macros don't actually do anything other than to verify there is indeed enough bits in them, and then it just accesses the accumulator lower bits directly (accumulator here is r12):

Code:
.mode_typedo:
        cmp     dword [rbx+zlib_istate_last_ofs], 0
        jne     .mode_typedo_last
        zlib_inflate_needbits 3
        mov     eax, r12d
        and     eax, 1
        mov     dword [rbx+zlib_istate_last_ofs], eax
        zlib_inflate_dropbits 1
        mov     eax, r12d
        and     eax, 3
        cmp     eax, 0
        je      .mode_typedo_stored
        cmp     eax, 1
        je      .mode_typedo_fixed
        cmp     eax, 2
        je      .mode_typedo_dynamic
        ; invalid block type
        zlib_inflate_dropbits 2
        jmp     .mode_bad
    


it ends up being pretty much the same amount of work in the end I spose..

I am not quite done doing the finishwork to my whole library yet :-/ coming soon, hahah

so a modified-so-it-makes-sense zlib_inflate_needbits macro:
Code:
        ; zlib_inflate_needbits:
        ; we smash ecx, edx
macro zlib_inflate_needbits n* {
        ; r12 == hold
        ; r13d == bits
        ; r14 == source pointer
        ; r15 == bytes remaining
        local   .getfour, .allgood
        cmp     r13d, n
        jae     .allgood
        cmp     r15, 4
        jae     .getfour
        ; less than 4 bytes remain, determine existing bits + (remaining bytes * Cool is >= n, else goto .inf_leave
        ; it will be safe here to pull a dword even if it is past the end of input
        mov     ecx, r13d
        mov     edx, dword [r14]
        shl     rdx, cl
        add     r12, rdx                ; add vs or makes no diff
        add     r14, r15                ; move pointer forward
        shl     r15d, 3
        add     r13d, r15d              ; number of bits we really added
        xor     r15d, r15d      ; no more bytes left
        cmp     r13d, n
        jae     .allgood
        jmp     .inf_leave
.getfour:
        mov     ecx, r13d
        mov     edx, dword [r14]
        shl     rdx, cl
        add     r12, rdx                ; add vs or makes no diff
        add     r13d, 32
        add     r14, 4
        sub     r15, 4
        ; fallthrough to allgood
.allgood:
}
    


and the zlib_dropbits macro as-is:
Code:
macro zlib_inflate_dropbits n* {
        shr     r12, n
        sub     r13d, n
}
    


$0.02 Smile
Post 26 Nov 2014, 20:33
View user's profile Send private message Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3500
Location: Bulgaria
JohnFound
@redsock - At first I never read the sources of zlib or gunzip, simply because I am not skillful enough in C/C++ and reading such sources is very boring for me. My Implementation is entirely following RFC-1951.

Although, your code looks pretty similar to one I use, except it is 64bit (It is good to have so many registers. Smile )
Anyway, it seems gzip for 64bit does not use all the power of 64 bit platform, because I didn't noticed any difference in speed between 32bit and 64bit versions.
Post 26 Nov 2014, 20:58
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 331
Location: Australia
redsock
redsock wrote:
it ends up being pretty much the same amount of work in the end I spose..


probably should have emphasized that a little more, haha Smile
Post 26 Nov 2014, 21:03
View user's profile Send private message Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3500
Location: Bulgaria
JohnFound
BTW, isn't 12K too much for such an algorithm? My implementation is 1400 bytes (well after some size optimizations) and it inlines all inner loops code because of the speed optimizations. (Read it as "All code", because Inflate is actually one big inner loop Wink )
Post 26 Nov 2014, 21:47
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 331
Location: Australia
redsock
JohnFound wrote:
BTW, isn't 12K too much for such an algorithm? My implementation is 1400 bytes (well after some size optimizations) and it inlines all inner loops code because of the speed optimizations. (Read it as "All code", because Inflate is actually one big inner loop Wink )


Hahah, about 2500 bytes of it is fixed inline tables, the actual spec-compliant inflate "one big inner loop" is ~2100 lines of assembler (which includes of course a good many inlined macro calls that expand to more still).

I think you'll find that in order to be zlib (wrap==1), gzip (wrap==2) compliant, it must be around this size Smile

Also re: your earlier comment of gzip/zlib not taking advantage of 64 bit, it has to do with several things that are a core part of the zlib design. I find it most amusing that the contributed x86_64 assembler "longest_match" function actually slows it down (and thus why it isn't a part of the actual x86_64 build by default).

Most people don't realize how it all looks from a profiling perspective. For example, this is a profiler run from my deflate routine (profiling of course adds a great deal of overhead, but gives a very accurate picture of each routine and where time is spent):

Image

The input for that profiler run is 460MB.

Inflate is obviously much simpler, and here is profiler run from inflating the output of the previous step:

Image

My code is already 30% faster than the reference version, but you can see a great deal of optimization room here, more so if I opted not to do proper crc32/adler32/etc...

I could spend a full year doing nothing but optimizing them further, haha... register stalls and dependency chains through it are nasty Razz

The 32bit versus 64bit zlib isn't "dramatic" due to the way it deals with bit buffering, windows, and longest match searching, and without gutting a good chunk of it (thus ruining portability), hmm, I'd say it is the way it is quite intentionally (the reference version that is)...

Smile
Post 26 Nov 2014, 22:07
View user's profile Send private message Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 331
Location: Australia
redsock
I should also add that by default I compile my library to align every single inner jump target by 16, so there is a fair few multibyte NOPs spread throughout it Smile
Post 26 Nov 2014, 22:20
View user's profile Send private message Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3500
Location: Bulgaria
JohnFound
So, after some optimization, my routine is now about 20% faster than gzip implementation and is under 1300 bytes in size, which I decided is good enough result.

Profiling the code, I discovered that the most time is wasted scanning the stream bit by bit and traversing the Huffman tree, so optimizations of memory access or multi-bit reading gives very small performance gain and is useless at all.

Also, all kinds of code alignments was useless - the speed was always the same.

The final (for now) code is in the repository.


P.S. It is interesting isn't it possible to traverse the Huffman tree (binary) by 2 bits? I.e. transforming the binary tree into a 4-tree. But how then to handle codes with odd length?...
Post 29 Nov 2014, 15:56
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
@JohnFound - http://www.imfm.si/preprinti/PDF/00600.pdf

The paper describes a mutation of the huffman tree with an added code size based lookup table that can improve the performance.

I was going to suggest starting with a LUT based on the code size then the search based on the actual code could be optimized on a per-size basis but the above papers solution is a bit more elegant.
Post 01 Dec 2014, 13:42
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 331
Location: Australia
redsock
JohnFound wrote:
So, after some optimization, my routine is now about 20% faster than gzip implementation and is under 1300 bytes in size, which I decided is good enough result.


When you say your routine is 20% faster than gzip/zlib, do you mean that you disabled zlib's error and integrity checking for your comparison?
Post 01 Dec 2014, 18:47
View user's profile Send private message Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3500
Location: Bulgaria
JohnFound
I am not using zlib for testing. Simply have one big compressed file and use "time gzip -d testfile.gz" and "time TestDeflate" in order to test the speed. Notice that I am not handling zip or gz files formats. I am implementing only Inflate algorithm. My test program reads the compressed file, decompress it and writes the output file. And this is implemented only as a quick test code, not to be released as a gzip replacement. But all the integrity and error checks for the Deflate stream are implemented and working. My main goal is to provide the algorithm for using in PNG decoding library. That is why for now only Inflate part is implemented. Maybe later, some day, I will write Deflate as well, but not now.
Post 01 Dec 2014, 22:41
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3500
Location: Bulgaria
JohnFound
r22 wrote:
@JohnFound - http://www.imfm.si/preprinti/PDF/00600.pdf

The paper describes a mutation of the huffman tree with an added code size based lookup table that can improve the performance.


I read this article and the trick is really interesting, but I am afraid it is not usable for Deflate/Inflate algorithm, because it needs the whole Huffman tree building algorithm to be changed. But this way it will be new compression/decompression algorithm, not compatible with Deflate/Inflate.

_________________
Tox ID: 48C0321ADDB2FE5F644BB5E3D58B0D58C35E5BCBC81D7CD333633FEDF1047914A534256478D9
Post 01 Dec 2014, 22:45
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 331
Location: Australia
redsock
JohnFound wrote:
I am not using zlib for testing. Simply have one big compressed file and use "time gzip -d testfile.gz" and "time TestDeflate" in order to test the speed. Notice that I am not handling zip or gz files formats. I am implementing only Inflate algorithm. My test program reads the compressed file, decompress it and writes the output file. And this is implemented only as a quick test code, not to be released as a gzip replacement. But all the integrity and error checks for the Deflate stream are implemented and working. My main goal is to provide the algorithm for using in PNG decoding library. That is why for now only Inflate part is implemented. Maybe later, some day, I will write Deflate as well, but not now.

Yeah that's what I figured Smile My only point was that from the code you posted, there were no crc32 verifications, nor code/distance error checking... so it might be misleading for you to say your routine is 20% faster than gzip -d, considering that gzip -d is also doing crc32 validation against all of its input, as well as inner-loop distance/code validations. I think you may find that modifying gzip/zlib's inflate_fast routine to behave as yours is, that zlib's would be faster still.

I also use my inflate routine for PNG support, though I didn't bother with the entire subset of PNG encodings as I only wanted PNG24 support.

Smile Cheers, and appreciate your effort.
Post 01 Dec 2014, 23:40
View user's profile Send private message Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3500
Location: Bulgaria
JohnFound
Well, crc32 is much faster than the Inflate decoder, so I don't think there will be significant difference. But still I am not trying to beat gzip. I only wanted to have a code with normal performance. Anyway, as I wrote above, my research shows that big performance increase can be achieved only with different Huffman tree traversing algorithms, not simple code optimizations.
Post 02 Dec 2014, 04:49
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 331
Location: Australia
redsock
@JohnFound: Absolutely no offense intended, but you are way off the mark here.

As opposed to discussing the internals of zlib (which is effectively gzip/gunzip), I have taken the liberty of modifying zlib-1.2.8 specifically to highlight the differences that really are happening when you do your simplistic timing view.

I would like to point out that the _only_ reason I feel compelled to do so is that by you stating your routine is 20% faster than gzip, you are doing a tremendous disrespect to Mark Adler (whom I do not know, and am not affiliated with whatsoever). Zlib itself has been the subject of a great many respected people carefully reviewing and optimising it. When I say my code is 30% faster than it, I am performing the exact same amount of work (to detect bitrot in archives, etc, etc). And it isn't because my own library and/or statements is correct, it is only that you don't seem to understand just how it all comes together (nor should you, given you have only read the RFC).

I appreciate that you are not trying to outperform gzip, but there are a great many factors involved in what gzip is that your code simply does not do, and your passing crc32 off as being insignificant/negligible is just plain wrong.

So, back to my case-in-point: I downloaded a fresh copy of zlib-1.2.8 from zlib.net, and then spent 10 minutes modifying its inflate routines to include provisions for JOHNFOUND #ifdefs.

Here is my modified library: http://nologic.com/zlib-1.2.8-JohnFound.tar.gz

Once you tar xf it, you'll end up with a normal zlib-1.2.8-JohnFound directory. When you get in there, configure it by doing:
Code:
CFLAGS="-O3 -DJOHNFOUND=1" ./configure    

then proceed with a normal
Code:
make    


then, you should have a few binaries in your directory that you didn't have before. Namely: minigzip.

It will accept the same syntax you have been doing to decompress files, but it will do _no_ integrity checking (noting here, all I disabled for the purpose of making my point was crc32/adler32).

If you wish to compare against the reference version (without my JOHNFOUND mods), either download a separate copy straight from zlib.net, or do:
Code:
make distclean
./configure
make    
again and re-run your tests.

Using the exact same output from my prior post with my own profiling runs, I tested the same with a 39,551,556 byte gzip input file (that uncompressed results in 460MB file) my results are:

Code:
JOHNFOUND minigzip: time ./minigzip -d -c test.input.gz >/dev/null

real    0m0.760s
user    0m0.745s
sys     0m0.016s

REFERENCE version (aka gzip): time ./minigzip -d -c test.input.gz >/dev/null

real    0m1.259s
user    0m1.248s
sys     0m0.012s
    


If I went further, and disabled all invalid lengths/distances/codes checking (and there are many), I think you'd find that zlib itself would go yet-again much faster.

My point mainly is to highlight to other readers that you can't say your code is faster than someone else's when in fact it is not, whether you appreciate the associated nasties involved in the what/why/how isn't really important.

$$rant complete$$ hahah again, I not trying to be rude or offensive, I have just spent a LOT of time dwelling in this area (and thus why I latched onto your thread in the first place).

Cheers again
Post 02 Dec 2014, 06:16
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16858
Location: In your JS exploiting you and your system
revolution
JohnFound wrote:
Well, crc32 is much faster than the Inflate decoder, so I don't think there will be significant difference.
Modern CPUs are very complex and I think you can't simply shrug off such a thing so easily without actually doing a test to see the difference. There are many things interacting and using resources that even a seemingly simple change can have a dramatic effect in the right (wrong?) conditions.
JohnFound wrote:
But still I am not trying to beat gzip. I only wanted to have a code with normal performance.
Then I guess you must have changed your target after the initial posting:
JohnFound wrote:
what is the fastest way
I expect many people are confused by your response since it appears to go against the initial purpose.
Post 02 Dec 2014, 07:20
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2019, Tomasz Grysztar.

Powered by rwasa.