flat assembler
Message board for the users of flat assembler.
 Home   FAQ   Search   Register 
 Profile   Log in to check your private messages   Log in 
flat assembler > Projects and Ideas > My Inflate implementation. (was Working with bit streams...)

Goto page 1, 2  Next
Author
Thread Post new topic Reply to topic
JohnFound



Joined: 16 Jun 2003
Posts: 3434
Location: Bulgaria
My Inflate implementation. (was Working with bit streams...)
Recently I am working on Deflate decompression algorithm. ( the recent version ).

I tried several ways of implementation, using SHLD instructions, but they seems to be too slow. At least my implementation was much slower than gzip on Linux and Windows.

Anyway, the second attempt was made, using different technique. Here, as an illustration, is a small sample of code, reading CL bits from the input stream:

Code:
        
; At the start: 
;    esi: pointer to the current byte in the input stream.
;    eax: the current dword of the stream.
;    ch: how many bits remains for reading in eax
;    cl: how many bits to be read in edx.
; At the end:
;    edx - bits read from the stream.
        or      edx, -1
        shl     edxcl
        not     edx

        and     edxeax
        shr     eaxcl

        sub     chcl
        jg      @f

        add     esi2
        cmp     esi, [.pEnd]
        jae     .error

        mov     clch
        mov     eax, [esi]
        add     ch$10
        rol     eaxcl

@@:



This snippet illustrates the worst case reading arbitrary count of bits (Up to 16 at once actually).
When needed fixed count, I am using more simple code, but the ideas are the same:

Code:

; The same as the above, but read fixed count of 5 bits.
; It does not need bit mask to be constructed.
        mov     edxeax
        shr     eax5

        and     edx$1f

        sub     ch5
        jg      @f

        add     esi2
        cmp     esi, [.pEnd]
        jae     .error

        mov     clch
        mov     eax, [esi]
        add     ch$10
        rol     eaxcl

@@:



This code works better that the first versions and works approximately with the same speed as GZIP. (which is pretty optimized piece of code) The size of the whole library is also acceptable (~1400 bytes).

So, how this code can be improved? Other, faster technique able to provide this functionality? MMX? Smile

[EDIT]The subject changed.[/EDIT]

_________________
Tox ID: A48DEF727DF44C3B5C2E576B65021F1A45D8FA52E2F8E257F1CAE148BBADB162FDF7820BD1F9


Last edited by JohnFound on 06 Dec 2014, 12:02; edited 1 time in total
Post 26 Nov 2014, 17:12
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
Tomasz Grysztar
Assembly Artist


Joined: 16 Jun 2003
Posts: 6435
Location: Kraków, Poland
On a slight tangent, this reminds me that I still don't own a processor that would support BMI1/BMI2 (or even TBM) instructions. I really wanted to try them out! Wink
Post 26 Nov 2014, 18:58
View user's profile Send private message Visit poster's website Reply with quote
cod3b453



Joined: 25 Aug 2004
Posts: 616
What is the range of- or set of possible- shifts that need to be applied?
Post 26 Nov 2014, 19:10
View user's profile Send private message Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3434
Location: Bulgaria
@cod3b453: Needed bit fields are from 1 to 13 bits wide. The direction is only forward from the beginning to the end of the stream.
( If this is what you are asking about... Wink )
Post 26 Nov 2014, 19:15
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3434
Location: Bulgaria
@Tomasz: My new netbook has A4-1200 CPU, that IIRC supports BMI1 and ABM.
Post 26 Nov 2014, 19:16
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 260
Location: Australia
Having written the full boat of deflate and inflate in x86_64, I can only tell you how I implemented the bit buffering... both deflate and inflate in my library are on average 30% faster than the reference gzip/zlib.

I am sure you are probably already aware of how the reference C version does it, and that is it keeps a "hold" and a bitcount that is in it, and just snips off the lower bits as needed (and occasionally of course puts them back when necessary).

My macros to actually feed my own version wouldn't make sense here in standalone fashion, but since I am living in the land-o-plenty register-wise, I dedicated a register for each. My hold is 64 bits wide, and I use a 32bit register to keep track of its bitcount.

I disliked how the reference version only ever fed its hold with 8 bits at a time (and due to its portability requirements/etc it must be that way), so I feed my "accumulator" with 32 bits at a time, and it is quite fast. All of the non-general purpose instructions to deal with bit shuffling around are indeed much slower. Smile

My actual full-spec-compatible inflate routine is 12736 bytes including embedded table and jump data, then toss in the adler32, crc32, file/memory management, and everything else that is needed to make a standalone inflate, my end static binary is ~30K.

Anyway, using my accumulator method ended up being nearly 1:1 with how the reference zlib does it... in the below code snippet, the accumulator (especially early on) is prepopulated, so the macros don't actually do anything other than to verify there is indeed enough bits in them, and then it just accesses the accumulator lower bits directly (accumulator here is r12):


Code:
.mode_typedo:
        cmp     dword [rbx+zlib_istate_last_ofs], 0
        jne     .mode_typedo_last
        zlib_inflate_needbits 3
        mov     eaxr12d
        and     eax1
        mov     dword [rbx+zlib_istate_last_ofs], eax
        zlib_inflate_dropbits 1
        mov     eaxr12d
        and     eax3
        cmp     eax0
        je      .mode_typedo_stored
        cmp     eax1
        je      .mode_typedo_fixed
        cmp     eax2
        je      .mode_typedo_dynamic
        ; invalid block type
        zlib_inflate_dropbits 2
        jmp     .mode_bad




it ends up being pretty much the same amount of work in the end I spose..

I am not quite done doing the finishwork to my whole library yet :-/ coming soon, hahah

so a modified-so-it-makes-sense zlib_inflate_needbits macro:

Code:
        ; zlib_inflate_needbits:
        ; we smash ecx, edx
macro zlib_inflate_needbits n* {
        ; r12 == hold
        ; r13d == bits
        ; r14 == source pointer
        ; r15 == bytes remaining
        local   .getfour.allgood
        cmp     r13dn
        jae     .allgood
        cmp     r154
        jae     .getfour
        ; less than 4 bytes remain, determine existing bits + (remaining bytes * Cool is >= n, else goto .inf_leave
        ; it will be safe here to pull a dword even if it is past the end of input
        mov     ecxr13d
        mov     edxdword [r14]
        shl     rdxcl
        add     r12rdx                ; add vs or makes no diff
        add     r14r15                ; move pointer forward
        shl     r15d3
        add     r13dr15d              ; number of bits we really added
        xor     r15dr15d      ; no more bytes left
        cmp     r13dn
        jae     .allgood
        jmp     .inf_leave
.getfour:
        mov     ecxr13d
        mov     edxdword [r14]
        shl     rdxcl
        add     r12rdx                ; add vs or makes no diff
        add     r13d32
        add     r144
        sub     r154
        ; fallthrough to allgood
.allgood:
}




and the zlib_dropbits macro as-is:

Code:
macro zlib_inflate_dropbits n* {
        shr     r12n
        sub     r13dn
}




$0.02 Smile
Post 26 Nov 2014, 20:33
View user's profile Send private message Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3434
Location: Bulgaria
@redsock - At first I never read the sources of zlib or gunzip, simply because I am not skillful enough in C/C++ and reading such sources is very boring for me. My Implementation is entirely following RFC-1951.

Although, your code looks pretty similar to one I use, except it is 64bit (It is good to have so many registers. Smile )
Anyway, it seems gzip for 64bit does not use all the power of 64 bit platform, because I didn't noticed any difference in speed between 32bit and 64bit versions.
Post 26 Nov 2014, 20:58
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 260
Location: Australia

redsock wrote:
it ends up being pretty much the same amount of work in the end I spose..



probably should have emphasized that a little more, haha Smile
Post 26 Nov 2014, 21:03
View user's profile Send private message Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3434
Location: Bulgaria
BTW, isn't 12K too much for such an algorithm? My implementation is 1400 bytes (well after some size optimizations) and it inlines all inner loops code because of the speed optimizations. (Read it as "All code", because Inflate is actually one big inner loop Wink )
Post 26 Nov 2014, 21:47
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 260
Location: Australia

JohnFound wrote:
BTW, isn't 12K too much for such an algorithm? My implementation is 1400 bytes (well after some size optimizations) and it inlines all inner loops code because of the speed optimizations. (Read it as "All code", because Inflate is actually one big inner loop Wink )



Hahah, about 2500 bytes of it is fixed inline tables, the actual spec-compliant inflate "one big inner loop" is ~2100 lines of assembler (which includes of course a good many inlined macro calls that expand to more still).

I think you'll find that in order to be zlib (wrap==1), gzip (wrap==2) compliant, it must be around this size Smile

Also re: your earlier comment of gzip/zlib not taking advantage of 64 bit, it has to do with several things that are a core part of the zlib design. I find it most amusing that the contributed x86_64 assembler "longest_match" function actually slows it down (and thus why it isn't a part of the actual x86_64 build by default).

Most people don't realize how it all looks from a profiling perspective. For example, this is a profiler run from my deflate routine (profiling of course adds a great deal of overhead, but gives a very accurate picture of each routine and where time is spent):

Image

The input for that profiler run is 460MB.

Inflate is obviously much simpler, and here is profiler run from inflating the output of the previous step:

Image

My code is already 30% faster than the reference version, but you can see a great deal of optimization room here, more so if I opted not to do proper crc32/adler32/etc...

I could spend a full year doing nothing but optimizing them further, haha... register stalls and dependency chains through it are nasty Razz

The 32bit versus 64bit zlib isn't "dramatic" due to the way it deals with bit buffering, windows, and longest match searching, and without gutting a good chunk of it (thus ruining portability), hmm, I'd say it is the way it is quite intentionally (the reference version that is)...

Smile
Post 26 Nov 2014, 22:07
View user's profile Send private message Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 260
Location: Australia
I should also add that by default I compile my library to align every single inner jump target by 16, so there is a fair few multibyte NOPs spread throughout it Smile
Post 26 Nov 2014, 22:20
View user's profile Send private message Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3434
Location: Bulgaria
So, after some optimization, my routine is now about 20% faster than gzip implementation and is under 1300 bytes in size, which I decided is good enough result.

Profiling the code, I discovered that the most time is wasted scanning the stream bit by bit and traversing the Huffman tree, so optimizations of memory access or multi-bit reading gives very small performance gain and is useless at all.

Also, all kinds of code alignments was useless - the speed was always the same.

The final (for now) code is in the repository.


P.S. It is interesting isn't it possible to traverse the Huffman tree (binary) by 2 bits? I.e. transforming the binary tree into a 4-tree. But how then to handle codes with odd length?...
Post 29 Nov 2014, 15:56
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
@JohnFound - http://www.imfm.si/preprinti/PDF/00600.pdf

The paper describes a mutation of the huffman tree with an added code size based lookup table that can improve the performance.

I was going to suggest starting with a LUT based on the code size then the search based on the actual code could be optimized on a per-size basis but the above papers solution is a bit more elegant.
Post 01 Dec 2014, 13:42
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 260
Location: Australia

JohnFound wrote:
So, after some optimization, my routine is now about 20% faster than gzip implementation and is under 1300 bytes in size, which I decided is good enough result.



When you say your routine is 20% faster than gzip/zlib, do you mean that you disabled zlib's error and integrity checking for your comparison?
Post 01 Dec 2014, 18:47
View user's profile Send private message Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3434
Location: Bulgaria
I am not using zlib for testing. Simply have one big compressed file and use "time gzip -d testfile.gz" and "time TestDeflate" in order to test the speed. Notice that I am not handling zip or gz files formats. I am implementing only Inflate algorithm. My test program reads the compressed file, decompress it and writes the output file. And this is implemented only as a quick test code, not to be released as a gzip replacement. But all the integrity and error checks for the Deflate stream are implemented and working. My main goal is to provide the algorithm for using in PNG decoding library. That is why for now only Inflate part is implemented. Maybe later, some day, I will write Deflate as well, but not now.
Post 01 Dec 2014, 22:41
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3434
Location: Bulgaria

r22 wrote:
@JohnFound - http://www.imfm.si/preprinti/PDF/00600.pdf

The paper describes a mutation of the huffman tree with an added code size based lookup table that can improve the performance.



I read this article and the trick is really interesting, but I am afraid it is not usable for Deflate/Inflate algorithm, because it needs the whole Huffman tree building algorithm to be changed. But this way it will be new compression/decompression algorithm, not compatible with Deflate/Inflate.

_________________
Tox ID: A48DEF727DF44C3B5C2E576B65021F1A45D8FA52E2F8E257F1CAE148BBADB162FDF7820BD1F9
Post 01 Dec 2014, 22:45
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 260
Location: Australia

JohnFound wrote:
I am not using zlib for testing. Simply have one big compressed file and use "time gzip -d testfile.gz" and "time TestDeflate" in order to test the speed. Notice that I am not handling zip or gz files formats. I am implementing only Inflate algorithm. My test program reads the compressed file, decompress it and writes the output file. And this is implemented only as a quick test code, not to be released as a gzip replacement. But all the integrity and error checks for the Deflate stream are implemented and working. My main goal is to provide the algorithm for using in PNG decoding library. That is why for now only Inflate part is implemented. Maybe later, some day, I will write Deflate as well, but not now.


Yeah that's what I figured Smile My only point was that from the code you posted, there were no crc32 verifications, nor code/distance error checking... so it might be misleading for you to say your routine is 20% faster than gzip -d, considering that gzip -d is also doing crc32 validation against all of its input, as well as inner-loop distance/code validations. I think you may find that modifying gzip/zlib's inflate_fast routine to behave as yours is, that zlib's would be faster still.

I also use my inflate routine for PNG support, though I didn't bother with the entire subset of PNG encodings as I only wanted PNG24 support.

Smile Cheers, and appreciate your effort.
Post 01 Dec 2014, 23:40
View user's profile Send private message Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3434
Location: Bulgaria
Well, crc32 is much faster than the Inflate decoder, so I don't think there will be significant difference. But still I am not trying to beat gzip. I only wanted to have a code with normal performance. Anyway, as I wrote above, my research shows that big performance increase can be achieved only with different Huffman tree traversing algorithms, not simple code optimizations.
Post 02 Dec 2014, 04:49
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 260
Location: Australia
@JohnFound: Absolutely no offense intended, but you are way off the mark here.

As opposed to discussing the internals of zlib (which is effectively gzip/gunzip), I have taken the liberty of modifying zlib-1.2.8 specifically to highlight the differences that really are happening when you do your simplistic timing view.

I would like to point out that the _only_ reason I feel compelled to do so is that by you stating your routine is 20% faster than gzip, you are doing a tremendous disrespect to Mark Adler (whom I do not know, and am not affiliated with whatsoever). Zlib itself has been the subject of a great many respected people carefully reviewing and optimising it. When I say my code is 30% faster than it, I am performing the exact same amount of work (to detect bitrot in archives, etc, etc). And it isn't because my own library and/or statements is correct, it is only that you don't seem to understand just how it all comes together (nor should you, given you have only read the RFC).

I appreciate that you are not trying to outperform gzip, but there are a great many factors involved in what gzip is that your code simply does not do, and your passing crc32 off as being insignificant/negligible is just plain wrong.

So, back to my case-in-point: I downloaded a fresh copy of zlib-1.2.8 from zlib.net, and then spent 10 minutes modifying its inflate routines to include provisions for JOHNFOUND #ifdefs.

Here is my modified library: http://nologic.com/zlib-1.2.8-JohnFound.tar.gz

Once you tar xf it, you'll end up with a normal zlib-1.2.8-JohnFound directory. When you get in there, configure it by doing:

Code:
CFLAGS="-O3 -DJOHNFOUND=1" ./configure


then proceed with a normal

Code:
make



then, you should have a few binaries in your directory that you didn't have before. Namely: minigzip.

It will accept the same syntax you have been doing to decompress files, but it will do _no_ integrity checking (noting here, all I disabled for the purpose of making my point was crc32/adler32).

If you wish to compare against the reference version (without my JOHNFOUND mods), either download a separate copy straight from zlib.net, or do:

Code:
make distclean
./configure
make

again and re-run your tests.

Using the exact same output from my prior post with my own profiling runs, I tested the same with a 39,551,556 byte gzip input file (that uncompressed results in 460MB file) my results are:


Code:
JOHNFOUND minigziptime ./minigzip -d -c test.input.gz >/dev/null

real    0m0.760s
user    0m0.745s
sys     0m0.016s

REFERENCE version (aka gzip): time ./minigzip -d -c test.input.gz >/dev/null

real    0m1.259s
user    0m1.248s
sys     0m0.012s




If I went further, and disabled all invalid lengths/distances/codes checking (and there are many), I think you'd find that zlib itself would go yet-again much faster.

My point mainly is to highlight to other readers that you can't say your code is faster than someone else's when in fact it is not, whether you appreciate the associated nasties involved in the what/why/how isn't really important.

$$rant complete$$ hahah again, I not trying to be rude or offensive, I have just spent a LOT of time dwelling in this area (and thus why I latched onto your thread in the first place).

Cheers again
Post 02 Dec 2014, 06:16
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 14923
Location: 6EQUJ5

JohnFound wrote:
Well, crc32 is much faster than the Inflate decoder, so I don't think there will be significant difference.

Modern CPUs are very complex and I think you can't simply shrug off such a thing so easily without actually doing a test to see the difference. There are many things interacting and using resources that even a seemingly simple change can have a dramatic effect in the right (wrong?) conditions.

JohnFound wrote:
But still I am not trying to beat gzip. I only wanted to have a code with normal performance.

Then I guess you must have changed your target after the initial posting:

JohnFound wrote:
what is the fastest way

I expect many people are confused by your response since it appears to go against the initial purpose.
Post 02 Dec 2014, 07:20
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2  Next

< Last Thread | Next Thread >

Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Powered by phpBB © 2001-2005 phpBB Group.

Main index   Download   Documentation   Examples   Message board
Copyright © 2004-2016, Tomasz Grysztar.