flat assembler
Message board for the users of flat assembler.

Index > Main > carryless rangecoder

donkey7



Joined: 31 Jan 2005
Posts: 127
Location: Poland, Malopolska
donkey7 07 Jul 2005, 21:14
recently i made an entropy coder - the subbotin carryless rangecoder. it's quite similar to an arithmetic coder.
i've attached it. can anyone tell me how to improve its speed?


Description: Subbotin CarryLess Order-0 Range Coder
Download
Filename: subbotin.tgz
Filesize: 5.09 KB
Downloaded: 1271 Time(s)


_________________
Keep coding!
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 11 Jul 2005, 21:42
Set .firstloop to:
Code:
.firstloop:
        ; load the symbol and replicate it into all four bytes of eax
        movzx   eax,byte[frntbuf+ebp]
        xor     ebx,ebx
        xor     ecx,ecx
        imul    eax,01010101h
        push    eax
@@:
        ; scan the symbol table four bytes at a time: after the xor,
        ; eax has a zero byte in every lane that matches the symbol, and
        ; (x-01010101h) and (not x) and 80808080h is nonzero exactly
        ; when x contains a zero byte (the usual SWAR zero-byte test)
        mov     eax,[esp]
        xor     eax,dword[ecx+symbols]
        mov     edx,eax
        not     eax
        sub     edx,01010101h
        and     eax,80808080h
        and     eax,edx
        jne     @f
        ; no match in this group of four - accumulate their ranges, go on
        add     ebx,[ecx*4+ranges+00]
        add     ebx,[ecx*4+ranges+04]
        add     ebx,[ecx*4+ranges+08]
        add     ebx,[ecx*4+ranges+12]
        add     ecx,4
        jmp     @b
@@:
        ; the match is in this group - each shr by 8 moves the next
        ; lane's 80h flag into CF, so jc fires at the matching byte
        shr     eax,8
        jc      @f
        add     ebx,[ecx*4+ranges]
        inc     ecx
        shr     eax,8
        jc      @f
        add     ebx,[ecx*4+ranges]
        inc     ecx
        shr     eax,8
        jc      @f
        add     ebx,[ecx*4+ranges]
        inc     ecx
@@:
        pop     eax
;......until encode symbol
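
For reference, the zero-byte detection trick the loop above relies on can be sketched in C. This is a generic illustration of the bit trick, not code from the attachment; the helper name `has_byte` is made up here:

```c
#include <stdint.h>

/* Returns nonzero iff some byte of x equals b.
   XORing x with b replicated into every byte lane turns a matching
   byte into zero; (v - 0x01010101) & ~v & 0x80808080 is then nonzero
   exactly when v contains a zero byte. */
static int has_byte(uint32_t x, uint8_t b) {
    uint32_t v = x ^ (b * 0x01010101u);  /* zero lane where x's byte == b */
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}
```

The assembly version additionally uses the surviving 80h flag bits to locate *which* lane matched, by shifting them into the carry flag one byte at a time.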
    


Last edited by Madis731 on 12 Jul 2005, 22:07; edited 1 time in total
donkey7



Joined: 31 Jan 2005
Posts: 127
Location: Poland, Malopolska
donkey7 12 Jul 2005, 15:24
yeah, amazing!
it would be more readable if you changed the value 16843009 to its hexadecimal form $01010101. also, this algo can be used in .fourthloop too.
do you have any ideas about improving decoding speed (something like trees etc., i don't know)? that's more important to me.
anyway, thanks Madis!
THEWizardGenius



Joined: 14 Jan 2005
Posts: 382
Location: California, USA
THEWizardGenius 12 Jul 2005, 21:02
What is a range coder? What does it do and how does it work?
T&K(r)



Joined: 10 Jul 2005
Posts: 18
Location: Poland - > Krakow
T&K(r) 12 Jul 2005, 21:22
TheWizardGenius - visit http://www.arturocampos.com to learn more about how a range coder works (and what it is).
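
In short, a range coder keeps a current interval and narrows it to the sub-interval assigned to each symbol, in proportion to the symbol's frequency. The core encoding step can be sketched in C (this is only the narrowing step under simplified assumptions - renormalization and carry handling, which the Subbotin variant is designed around, are omitted; the names are made up here):

```c
#include <stdint.h>

/* Current interval [low, low+range). */
typedef struct { uint32_t low, range; } rc_t;

/* Narrow the interval to the slice belonging to one symbol:
   cum  = sum of frequencies of all symbols before this one,
   freq = this symbol's frequency,
   tot  = total frequency count. */
static void rc_encode(rc_t *rc, uint32_t cum, uint32_t freq, uint32_t tot) {
    uint32_t r = rc->range / tot;
    rc->low  += r * cum;   /* move to the start of the symbol's slice  */
    rc->range = r * freq;  /* shrink the interval to the slice's width */
    /* renormalization (emitting top bytes when range gets small) omitted */
}
```

Frequent symbols shrink the interval less, so they cost fewer output bits - the same principle as arithmetic coding, but working on whole bytes of `low` at a time.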
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 12 Jul 2005, 22:58
Have a look at these:
It's strange that the results are so symmetrical - 7zip, for example, is packing:unpacking = 1:10 in terms of speed.

I don't quite like the per-byte adding and comparing in the codec. I know it can be streamlined, but I have yet to discover how :-(


Description: 20% speed gain decompressor
Download
Filename: subb_adapt_d2.asm
Filesize: 8.94 KB
Downloaded: 1272 Time(s)

Description: 14% speed gain compressor
Download
Filename: subb_adapt_c2.asm
Filesize: 9.63 KB
Downloaded: 1260 Time(s)


_________________
My updated idol :-D http://www.agner.org/optimize/
donkey7



Joined: 31 Jan 2005
Posts: 127
Location: Poland, Malopolska
donkey7 13 Jul 2005, 13:32
Speeds are nearly identical because this is only a coder. A typical compressor has a model (which models the input, e.g. LZ, PPM or BWT) and a coder (which compresses the output of the previous stage). I'm currently writing an LZ-based compressor. At the moment it has only a model - the modelled data must be compressed externally, e.g. by this coder.
7-zip also uses a range coder (a slightly different one); ZIP uses Huffman or Shannon-Fano codes. A range coder seems to be the better solution because it is faster than Huffman in adaptive implementations (though far slower in semi-static ones). I made a Huffman coder a short time ago; it's available on my website. You can also use shcodec http://webcenter.ru/~xander/ and compare the results.

The range coder's results are symmetrical because the coder and decoder do almost the same work. ZIP decompresses faster than it compresses because the compressor must find repeated patterns, while the decompressor only repeats them based on the information the compressor gives it.
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 14 Jul 2005, 08:14
I was thinking of an optimization like NOT initializing the symbols at first, because you need to organize them anyway. So after you have put them in the right places (less movement required), you can fill the remaining (256-n) places with the symbols not yet used.
I bumped up the addition speed by making it DQWORD wide, but the best results are promised by variable-width additions: you can double the width on every successive NotAbove addition and fall back to DWORD when the addition goes over eax (the compared value).
Divisions should be gotten rid of, and any MUL/DIV should be checked for ZERO/ONE before executing - removing unnecessary chunks of code executed on results known in advance.

btw, is the goal to beat shcodec, or to prove that you are capable of making something comparable to shcodec's speed?
donkey7



Joined: 31 Jan 2005
Posts: 127
Location: Poland, Malopolska
donkey7 14 Jul 2005, 16:37
can you write some code for it? i don't understand these 'variable width additions' (my english is not so good ;-( ).
i think that checking for one before divisions will slow down the coder, because most of the time (over 99.9%) those cases won't appear.
to speed up, i should not sort frequencies on the first chunks, only on further chunks (i think, but i haven't tested this).

the goal is to make the fastest coder available (i mean, fast enough) that has the best performance of all adaptive coders.

i was thinking about an implementation with binary trees and without frequency sorting. this would be faster on 'random' data.

i need the coder as the last step in an lz-based compressor. it will code lengths of values (in bits) adaptively, distances and lengths statically (these can't be adaptive due to the high range of possible values), and literals (also adaptively). lengths of values are rather constant (and frequency sorting helps very much there), while literals are rather random (where frequency sorting will not help, or will even make things worse).

you can get and compare some other coders on Eugene Shelwien's website.
r22



Joined: 27 Dec 2004
Posts: 805
r22 15 Jul 2005, 02:15
If you want speed, you may want to consider using SSE instructions.
I'm not sure if it's possible, but you may also want to process more than one symbol at a time.
If you do use SSE, you'll want to align your variables and tables on 16-byte boundaries so you can use the aligned rather than the unaligned SSE instructions.
donkey7



Joined: 31 Jan 2005
Posts: 127
Location: Poland, Malopolska
donkey7 15 Jul 2005, 14:06
r22: processing more than one symbol at a time seems to be impossible, but some instructions can be parallelized (i think). anyway, i don't want to write code for a small group of processors (SSE is a relatively new technology), but to optimize the entire algorithm.

i will think over binary trees, but it may be impossible to get them to work ;-).
Madis731's improvements are quite good with frequency sorting; now there is one thing left to do - improve speed on 'random' data.
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 16 Jul 2005, 22:46
Hrm, use SSE in a coder? I thought the algorithm in question used integer and not FP data? And afaik, SSE only supports floating-point datatypes (SSE2 adds double-width MMX operations, though). Or am I wrong?
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 17 Jul 2005, 17:59
It takes me some time to settle in again because I have been on holiday. I'll try to figure out some better algos and post them if I find any. MMX came with the early Pentiums, and I think it is possible to use MMX instructions, but this means we have to sort all the data into 8-byte chunks, and I really can't think of any way right now to access 8 different memory locations at a time. Like donkey7 said - it seems to be impossible ;-)
donkey7



Joined: 31 Jan 2005
Posts: 127
Location: Poland, Malopolska
donkey7 17 Jul 2005, 18:53
yes, simd instructions won't help. we must find an algo with complexity like O(log n) or O(sqrt n), for example binary trees.

i thought a bit about binary trees and here are my suggestions:
- don't sort by frequencies (version for 'random' data),
- we can divide the entire set of symbols into two parts and precompute the cumulative frequency of the second part; this would on average double the speed of searching for a symbol's cumulative frequency,
- we can divide the symbols into smaller & smaller sets; this would result in something like a binary tree - the main problem is how to manage them (when increasing or decreasing the frequency of a given symbol),
- this would have O(log2 n) complexity - good enough for 256 symbols.
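
The "divide into smaller and smaller sets" idea, taken to its conclusion, is essentially a binary indexed (Fenwick) tree: both the cumulative-frequency query and the per-symbol frequency update run in O(log n) steps. A generic C sketch (not code from the attached coders; the names are made up here):

```c
#include <stdint.h>

/* Binary indexed (Fenwick) tree over 256 symbol frequencies.
   tree[i] holds the sum of a power-of-two-sized block of frequencies;
   i & -i isolates the lowest set bit, which is that block's size. */
enum { NSYM = 256 };
static uint32_t tree[NSYM + 1];          /* 1-based internal indexing */

/* add delta to the frequency of symbol s (0-based) */
static void freq_add(int s, uint32_t delta) {
    for (int i = s + 1; i <= NSYM; i += i & -i)
        tree[i] += delta;
}

/* cumulative frequency of symbols 0..s-1 */
static uint32_t cum_freq(int s) {
    uint32_t sum = 0;
    for (int i = s; i > 0; i -= i & -i)
        sum += tree[i];
    return sum;
}
```

For n = 256 both loops touch at most 8 entries, against up to 256 additions for a linear scan - the same O(log2 n) bound mentioned above.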
donkey7



Joined: 31 Jan 2005
Posts: 127
Location: Poland, Malopolska
donkey7 18 Jul 2005, 21:03
here are my recent implementations of the adaptive coder.
shindlet_bt_* - using something like binary search trees,
shindlet_sl_* - using simple linear search,
shindlet_fs_* - using frequency sorting.
i've also included Madis731's improvements.

bt has O(log n) complexity but it isn't much faster than sl. could anyone 'fix' it ;-?

i've changed the coder from subbotin to shindlet (a minor change).

please test them on different types of files.


Description: set of shindlet-based coders
Download
Filename: range.tgz
Filesize: 7.49 KB
Downloaded: 1195 Time(s)


_________________
Keep coding!
donkey7



Joined: 31 Jan 2005
Posts: 127
Location: Poland, Malopolska
donkey7 19 Jul 2005, 19:10
i've updated my coders again.
bt_* are greatly improved, and sl_* are slightly faster (i had made a mistake in the sources before - i forgot to replace all the search algos with Madis731's ones).

test them and tell me what can be changed to make the coder faster.

note that bt_* and sl_* produce the same output (fs_* produces different output, but it has the same length as the previous ones).


Description: shindlet coders
Download
Filename: range.tgz
Filesize: 7.81 KB
Downloaded: 1260 Time(s)


_________________
Keep coding!


Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
