flat assembler
Message board for the users of flat assembler.
Index
> Projects and Ideas > My Inflate implementation. (was Working with bit streams...) Goto page 1, 2 Next |
Author |
|
JohnFound 26 Nov 2014, 17:12
Recently I am working on Deflate decompression algorithm. ( the recent version ).
I tried several ways of implementation, using SHLD instructions, but they seems to be too slow. At least my implementation was much slower than gzip on Linux and Windows. Anyway, the second attempt was made, using different technique. Here, as an illustration, is a small sample of code, reading CL bits from the input stream: Code: ; At the start: ; esi: pointer to the current byte in the input stream. ; eax: the current dword of the stream. ; ch: how many bits remains for reading in eax ; cl: how many bits to be read in edx. ; At the end: ; edx - bits read from the stream. or edx, -1 shl edx, cl not edx and edx, eax shr eax, cl sub ch, cl jg @f add esi, 2 cmp esi, [.pEnd] jae .error mov cl, ch mov eax, [esi] add ch, $10 rol eax, cl @@: This snippet illustrates the worst case reading arbitrary count of bits (Up to 16 at once actually). When needed fixed count, I am using more simple code, but the ideas are the same: Code: ; The same as the above, but read fixed count of 5 bits. ; It does not need bit mask to be constructed. mov edx, eax shr eax, 5 and edx, $1f sub ch, 5 jg @f add esi, 2 cmp esi, [.pEnd] jae .error mov cl, ch mov eax, [esi] add ch, $10 rol eax, cl @@: This code works better that the first versions and works approximately with the same speed as GZIP. (which is pretty optimized piece of code) The size of the whole library is also acceptable (~1400 bytes). So, how this code can be improved? Other, faster technique able to provide this functionality? MMX? [EDIT]The subject changed.[/EDIT] _________________ Tox ID: 48C0321ADDB2FE5F644BB5E3D58B0D58C35E5BCBC81D7CD333633FEDF1047914A534256478D9 Last edited by JohnFound on 06 Dec 2014, 12:02; edited 1 time in total |
|||
26 Nov 2014, 17:12 |
|
cod3b453 26 Nov 2014, 19:10
What is the range of- or set of possible- shifts that need to be applied?
|
|||
26 Nov 2014, 19:10 |
|
JohnFound 26 Nov 2014, 19:15
@cod3b453: Needed bit fields are from 1 to 13 bits wide. The direction is only forward from the beginning to the end of the stream.
( If this is what you are asking about... ) |
|||
26 Nov 2014, 19:15 |
|
JohnFound 26 Nov 2014, 19:16
@Tomasz: My new netbook has A4-1200 CPU, that IIRC supports BMI1 and ABM.
|
|||
26 Nov 2014, 19:16 |
|
redsock 26 Nov 2014, 20:33
Having written the full boat of deflate and inflate in x86_64, I can only tell you how I implemented the bit buffering... both deflate and inflate in my library are on average 30% faster than the reference gzip/zlib.
I am sure you are probably already aware of how the reference C version does it, and that is it keeps a "hold" and a bitcount that is in it, and just snips off the lower bits as needed (and occasionally of course puts them back when necessary). My macros to actually feed my own version wouldn't make sense here in standalone fashion, but since I am living in the land-o-plenty register-wise, I dedicated a register for each. My hold is 64 bits wide, and I use a 32bit register to keep track of its bitcount. I disliked how the reference version only ever fed its hold with 8 bits at a time (and due to its portability requirements/etc it must be that way), so I feed my "accumulator" with 32 bits at a time, and it is quite fast. All of the non-general purpose instructions to deal with bit shuffling around are indeed much slower. My actual full-spec-compatible inflate routine is 12736 bytes including embedded table and jump data, then toss in the adler32, crc32, file/memory management, and everything else that is needed to make a standalone inflate, my end static binary is ~30K. Anyway, using my accumulator method ended up being nearly 1:1 with how the reference zlib does it... in the below code snippet, the accumulator (especially early on) is prepopulated, so the macros don't actually do anything other than to verify there is indeed enough bits in them, and then it just accesses the accumulator lower bits directly (accumulator here is r12): Code: .mode_typedo: cmp dword [rbx+zlib_istate_last_ofs], 0 jne .mode_typedo_last zlib_inflate_needbits 3 mov eax, r12d and eax, 1 mov dword [rbx+zlib_istate_last_ofs], eax zlib_inflate_dropbits 1 mov eax, r12d and eax, 3 cmp eax, 0 je .mode_typedo_stored cmp eax, 1 je .mode_typedo_fixed cmp eax, 2 je .mode_typedo_dynamic ; invalid block type zlib_inflate_dropbits 2 jmp .mode_bad it ends up being pretty much the same amount of work in the end I spose.. I am not quite done doing the finishwork to my whole library yet :-/ coming soon, hahah so a modified-so-it-makes-sense zlib_inflate_needbits macro: Code: ; zlib_inflate_needbits: ; we smash ecx, edx macro zlib_inflate_needbits n* { ; r12 == hold ; r13d == bits ; r14 == source pointer ; r15 == bytes remaining local .getfour, .allgood cmp r13d, n jae .allgood cmp r15, 4 jae .getfour ; less than 4 bytes remain, determine existing bits + (remaining bytes * is >= n, else goto .inf_leave ; it will be safe here to pull a dword even if it is past the end of input mov ecx, r13d mov edx, dword [r14] shl rdx, cl add r12, rdx ; add vs or makes no diff add r14, r15 ; move pointer forward shl r15d, 3 add r13d, r15d ; number of bits we really added xor r15d, r15d ; no more bytes left cmp r13d, n jae .allgood jmp .inf_leave .getfour: mov ecx, r13d mov edx, dword [r14] shl rdx, cl add r12, rdx ; add vs or makes no diff add r13d, 32 add r14, 4 sub r15, 4 ; fallthrough to allgood .allgood: } and the zlib_dropbits macro as-is: Code: macro zlib_inflate_dropbits n* { shr r12, n sub r13d, n } $0.02 |
|||
26 Nov 2014, 20:33 |
|
JohnFound 26 Nov 2014, 20:58
@redsock - At first I never read the sources of zlib or gunzip, simply because I am not skillful enough in C/C++ and reading such sources is very boring for me. My Implementation is entirely following RFC-1951.
Although, your code looks pretty similar to one I use, except it is 64bit (It is good to have so many registers. ) Anyway, it seems gzip for 64bit does not use all the power of 64 bit platform, because I didn't noticed any difference in speed between 32bit and 64bit versions. |
|||
26 Nov 2014, 20:58 |
|
redsock 26 Nov 2014, 21:03
redsock wrote: it ends up being pretty much the same amount of work in the end I spose.. probably should have emphasized that a little more, haha |
|||
26 Nov 2014, 21:03 |
|
JohnFound 26 Nov 2014, 21:47
BTW, isn't 12K too much for such an algorithm? My implementation is 1400 bytes (well after some size optimizations) and it inlines all inner loops code because of the speed optimizations. (Read it as "All code", because Inflate is actually one big inner loop )
|
|||
26 Nov 2014, 21:47 |
|
redsock 26 Nov 2014, 22:07
JohnFound wrote: BTW, isn't 12K too much for such an algorithm? My implementation is 1400 bytes (well after some size optimizations) and it inlines all inner loops code because of the speed optimizations. (Read it as "All code", because Inflate is actually one big inner loop ) Hahah, about 2500 bytes of it is fixed inline tables, the actual spec-compliant inflate "one big inner loop" is ~2100 lines of assembler (which includes of course a good many inlined macro calls that expand to more still). I think you'll find that in order to be zlib (wrap==1), gzip (wrap==2) compliant, it must be around this size Also re: your earlier comment of gzip/zlib not taking advantage of 64 bit, it has to do with several things that are a core part of the zlib design. I find it most amusing that the contributed x86_64 assembler "longest_match" function actually slows it down (and thus why it isn't a part of the actual x86_64 build by default). Most people don't realize how it all looks from a profiling perspective. For example, this is a profiler run from my deflate routine (profiling of course adds a great deal of overhead, but gives a very accurate picture of each routine and where time is spent): The input for that profiler run is 460MB. Inflate is obviously much simpler, and here is profiler run from inflating the output of the previous step: My code is already 30% faster than the reference version, but you can see a great deal of optimization room here, more so if I opted not to do proper crc32/adler32/etc... I could spend a full year doing nothing but optimizing them further, haha... register stalls and dependency chains through it are nasty The 32bit versus 64bit zlib isn't "dramatic" due to the way it deals with bit buffering, windows, and longest match searching, and without gutting a good chunk of it (thus ruining portability), hmm, I'd say it is the way it is quite intentionally (the reference version that is)... |
|||
26 Nov 2014, 22:07 |
|
redsock 26 Nov 2014, 22:20
I should also add that by default I compile my library to align every single inner jump target by 16, so there is a fair few multibyte NOPs spread throughout it
|
|||
26 Nov 2014, 22:20 |
|
JohnFound 29 Nov 2014, 15:56
So, after some optimization, my routine is now about 20% faster than gzip implementation and is under 1300 bytes in size, which I decided is good enough result.
Profiling the code, I discovered that the most time is wasted scanning the stream bit by bit and traversing the Huffman tree, so optimizations of memory access or multi-bit reading gives very small performance gain and is useless at all. Also, all kinds of code alignments was useless - the speed was always the same. The final (for now) code is in the repository. P.S. It is interesting isn't it possible to traverse the Huffman tree (binary) by 2 bits? I.e. transforming the binary tree into a 4-tree. But how then to handle codes with odd length?... |
|||
29 Nov 2014, 15:56 |
|
r22 01 Dec 2014, 13:42
@JohnFound - http://www.imfm.si/preprinti/PDF/00600.pdf
The paper describes a mutation of the huffman tree with an added code size based lookup table that can improve the performance. I was going to suggest starting with a LUT based on the code size then the search based on the actual code could be optimized on a per-size basis but the above papers solution is a bit more elegant. |
|||
01 Dec 2014, 13:42 |
|
redsock 01 Dec 2014, 18:47
JohnFound wrote: So, after some optimization, my routine is now about 20% faster than gzip implementation and is under 1300 bytes in size, which I decided is good enough result. When you say your routine is 20% faster than gzip/zlib, do you mean that you disabled zlib's error and integrity checking for your comparison? |
|||
01 Dec 2014, 18:47 |
|
JohnFound 01 Dec 2014, 22:41
I am not using zlib for testing. Simply have one big compressed file and use "time gzip -d testfile.gz" and "time TestDeflate" in order to test the speed. Notice that I am not handling zip or gz files formats. I am implementing only Inflate algorithm. My test program reads the compressed file, decompress it and writes the output file. And this is implemented only as a quick test code, not to be released as a gzip replacement. But all the integrity and error checks for the Deflate stream are implemented and working. My main goal is to provide the algorithm for using in PNG decoding library. That is why for now only Inflate part is implemented. Maybe later, some day, I will write Deflate as well, but not now.
|
|||
01 Dec 2014, 22:41 |
|
JohnFound 01 Dec 2014, 22:45
r22 wrote: @JohnFound - http://www.imfm.si/preprinti/PDF/00600.pdf I read this article and the trick is really interesting, but I am afraid it is not usable for Deflate/Inflate algorithm, because it needs the whole Huffman tree building algorithm to be changed. But this way it will be new compression/decompression algorithm, not compatible with Deflate/Inflate. _________________ Tox ID: 48C0321ADDB2FE5F644BB5E3D58B0D58C35E5BCBC81D7CD333633FEDF1047914A534256478D9 |
|||
01 Dec 2014, 22:45 |
|
redsock 01 Dec 2014, 23:40
JohnFound wrote: I am not using zlib for testing. Simply have one big compressed file and use "time gzip -d testfile.gz" and "time TestDeflate" in order to test the speed. Notice that I am not handling zip or gz files formats. I am implementing only Inflate algorithm. My test program reads the compressed file, decompress it and writes the output file. And this is implemented only as a quick test code, not to be released as a gzip replacement. But all the integrity and error checks for the Deflate stream are implemented and working. My main goal is to provide the algorithm for using in PNG decoding library. That is why for now only Inflate part is implemented. Maybe later, some day, I will write Deflate as well, but not now. Yeah that's what I figured My only point was that from the code you posted, there were no crc32 verifications, nor code/distance error checking... so it might be misleading for you to say your routine is 20% faster than gzip -d, considering that gzip -d is also doing crc32 validation against all of its input, as well as inner-loop distance/code validations. I think you may find that modifying gzip/zlib's inflate_fast routine to behave as yours is, that zlib's would be faster still. I also use my inflate routine for PNG support, though I didn't bother with the entire subset of PNG encodings as I only wanted PNG24 support. Cheers, and appreciate your effort. |
|||
01 Dec 2014, 23:40 |
|
JohnFound 02 Dec 2014, 04:49
Well, crc32 is much faster than the Inflate decoder, so I don't think there will be significant difference. But still I am not trying to beat gzip. I only wanted to have a code with normal performance. Anyway, as I wrote above, my research shows that big performance increase can be achieved only with different Huffman tree traversing algorithms, not simple code optimizations.
|
|||
02 Dec 2014, 04:49 |
|
redsock 02 Dec 2014, 06:16
@JohnFound: Absolutely no offense intended, but you are way off the mark here.
As opposed to discussing the internals of zlib (which is effectively gzip/gunzip), I have taken the liberty of modifying zlib-1.2.8 specifically to highlight the differences that really are happening when you do your simplistic timing view. I would like to point out that the _only_ reason I feel compelled to do so is that by you stating your routine is 20% faster than gzip, you are doing a tremendous disrespect to Mark Adler (whom I do not know, and am not affiliated with whatsoever). Zlib itself has been the subject of a great many respected people carefully reviewing and optimising it. When I say my code is 30% faster than it, I am performing the exact same amount of work (to detect bitrot in archives, etc, etc). And it isn't because my own library and/or statements is correct, it is only that you don't seem to understand just how it all comes together (nor should you, given you have only read the RFC). I appreciate that you are not trying to outperform gzip, but there are a great many factors involved in what gzip is that your code simply does not do, and your passing crc32 off as being insignificant/negligible is just plain wrong. So, back to my case-in-point: I downloaded a fresh copy of zlib-1.2.8 from zlib.net, and then spent 10 minutes modifying its inflate routines to include provisions for JOHNFOUND #ifdefs. Here is my modified library: http://nologic.com/zlib-1.2.8-JohnFound.tar.gz Once you tar xf it, you'll end up with a normal zlib-1.2.8-JohnFound directory. When you get in there, configure it by doing: Code: CFLAGS="-O3 -DJOHNFOUND=1" ./configure then proceed with a normal Code: make then, you should have a few binaries in your directory that you didn't have before. Namely: minigzip. It will accept the same syntax you have been doing to decompress files, but it will do _no_ integrity checking (noting here, all I disabled for the purpose of making my point was crc32/adler32). If you wish to compare against the reference version (without my JOHNFOUND mods), either download a separate copy straight from zlib.net, or do: Code: make distclean
./configure
make Using the exact same output from my prior post with my own profiling runs, I tested the same with a 39,551,556 byte gzip input file (that uncompressed results in 460MB file) my results are: Code: JOHNFOUND minigzip: time ./minigzip -d -c test.input.gz >/dev/null real 0m0.760s user 0m0.745s sys 0m0.016s REFERENCE version (aka gzip): time ./minigzip -d -c test.input.gz >/dev/null real 0m1.259s user 0m1.248s sys 0m0.012s If I went further, and disabled all invalid lengths/distances/codes checking (and there are many), I think you'd find that zlib itself would go yet-again much faster. My point mainly is to highlight to other readers that you can't say your code is faster than someone else's when in fact it is not, whether you appreciate the associated nasties involved in the what/why/how isn't really important. $$rant complete$$ hahah again, I not trying to be rude or offensive, I have just spent a LOT of time dwelling in this area (and thus why I latched onto your thread in the first place). Cheers again |
|||
02 Dec 2014, 06:16 |
|
revolution 02 Dec 2014, 07:20
JohnFound wrote: Well, crc32 is much faster than the Inflate decoder, so I don't think there will be significant difference. JohnFound wrote: But still I am not trying to beat gzip. I only wanted to have a code with normal performance. JohnFound wrote: what is the fastest way |
|||
02 Dec 2014, 07:20 |
|
Goto page 1, 2 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.