ROLZ (Reduced Offset LZ, a Lempel-Ziv algorithm with a reduced set of offsets) is a dictionary-based data compression algorithm similar to LZ77, but one that uses contextual methods to reduce the number of active offsets. The concept of ROLZ was first introduced by Malcolm Taylor in his RK archiver in 1999, and it remains one of the more modern approaches to building fast and efficient compression algorithms.
In standard LZ77, a match is encoded by a pair:
- match length
- offset (also called distance)
The problem with this scheme is redundancy in the coding. For example, with a dictionary size of 4 KB there are 4096 possible offsets. Clearly most of them will never be used, and if out of 4096 options only, say, 512 are actually used, then 9 bits suffice to encode the offset instead of 12 (a 25% reduction).
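The arithmetic behind this saving can be checked with a short helper that computes the number of bits needed to address one of N distinct values (`bits_for` is a name chosen here for illustration, not part of any described algorithm):

```c
/* Number of bits needed to encode one of n distinct values (n >= 1). */
static int bits_for(unsigned n) {
    int bits = 0;
    while ((1u << bits) < n)
        bits++;
    return bits;
}
```

With 4096 stored offsets this gives 12 bits, with 512 only 9, matching the 25% saving above.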
Related algorithms
Many authors have attempted to reduce the set of possible offset values; notable examples include:
LZFG-C2
Authors: Edward R. Fiala, Daniel H. Greene, 1989.
Matches are encoded not by a [length, offset] pair but by a special index corresponding to a specific string in the dictionary. In other words, identical phrases share the same index, which makes the coding of matches economical.
LZRW4
Author: Ross Williams, 1991.
In fact, LZRW4 is quite close to ROLZ. Although the author never built a full-fledged program, his demonstration compressor already shows the ROLZ scheme in draft form.
LZP1-LZP4
Author: Charles Bloom, 1995.
LZP is a dictionary compression algorithm that, when encoding a match, dispenses with offsets altogether. This is possible because the offsets associated with the current context are stored in a special table, and the compressor and decompressor maintain this table in exactly the same way.
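The LZP idea can be sketched as follows. This is a minimal illustration, not Bloom's actual code: the table size, the order-2 context, and the counting of literals versus matches are all choices made here for demonstration.

```c
#define LZP_TAB 4096
#define LZP_MIN 1

/* Map the order-2 context (two preceding bytes) to a table slot. */
static unsigned lzp_hash(const unsigned char *p) {
    return ((unsigned)p[-2] << 8 | p[-1]) & (LZP_TAB - 1);
}

/* Count how many literals vs. predicted matches a toy LZP pass would emit.
 * No offset is ever coded: the table predicts the match position. */
static void lzp_count(const unsigned char *buf, int n,
                      int *literals, int *matches) {
    int tab[LZP_TAB];
    int pos = 2;                      /* two context bytes are needed */
    for (int i = 0; i < LZP_TAB; i++) tab[i] = -1;
    *literals = 2;                    /* first two bytes go out raw */
    *matches = 0;
    while (pos < n) {
        unsigned h = lzp_hash(buf + pos);
        int pred = tab[h];            /* position predicted by the context */
        tab[h] = pos;                 /* both sides update identically */
        int len = 0;
        if (pred >= 0)
            while (pos + len < n && buf[pred + len] == buf[pos + len])
                len++;
        if (len >= LZP_MIN) {         /* match: only a length is coded */
            (*matches)++;
            pos += len;
        } else {                      /* miss: code the literal itself */
            (*literals)++;
            pos++;
        }
    }
}
```

Since the decompressor rebuilds the same table from already-decoded data, the predicted position never has to be transmitted.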
LZ77-PM
Authors: Dzung T. Hoang, Philip M. Long, Jeffrey Scott Vitter, 1995.
This algorithm is similar to ROLZ, with the difference that variable-length contexts, rather than fixed-order contexts, are used to reduce the active offsets.
ROLZ
According to the author's description, this is a fast Lempel-Ziv algorithm with a large dictionary that is able to exploit matches at large distances in the dictionary both quickly and efficiently.
It should be noted that RK is a commercial, closed-source archiver, and many details of its algorithms have never been disclosed. Thanks to enthusiasts, however, the secret was partly uncovered, and several free programs based on this compression algorithm were even written.
To reduce the number of active offsets, a fixed-order context is used. In the original this is a first-order context (i.e., the single character preceding the current one); other contexts are possible as well, say a second-order context (the two characters preceding the current one), and so on.
Instead of searching for matches at every offset in the dictionary, the search is restricted to those offsets that were preceded by the current context. In the simplest case an offset table can be used:
// find the longest match
context = buf[position - 1];              // current first-order context
max_match = 0;                            // match length to encode
index = 0;                                // match index
for (i = 0; i < TAB_SIZE; i++) {          // for each offset stored in the table for this context
    offset = tab[context][i];             // fetch the offset
    match_length = get_match(offset);     // find the length of the match at that offset
    if (match_length > max_match) {       // found a longer match
        max_match = match_length;
        index = i;                        // remember its index
    }
}
// update the table
for (i = TAB_SIZE - 1; i > 0; i--)        // shift all offsets up, discarding the oldest
    tab[context][i] = tab[context][i - 1];
tab[context][0] = position;               // then add the current position
// encode a literal or a match
if (max_match >= MIN_MATCH) {
    encode_match_length(max_match);       // encode the match length
    encode_match_index(index);            // encode the match index
    position += max_match;
} else {                                  // no sufficiently long match found
    encode_literal(buf[position]);        // encode the literal
    position++;
}
This is the simplest approach. In practice, iterating over, say, 1024 offsets at every step can take too much time. To speed up the search, hashing and the various fast-lookup structures used in widespread implementations of the LZ77 family can be applied.
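For a first-order context the context byte itself can index the table directly, as in the listing above; for higher-order contexts the context is typically hashed down to a table index. A possible sketch (the multiplicative constant and table size are illustrative choices, not taken from any particular ROLZ implementation):

```c
#define CTX_TAB_BITS 16

/* Map an order-2 context (two preceding bytes) to a table slot. */
static unsigned ctx_slot(unsigned char c1, unsigned char c2) {
    unsigned h = (unsigned)c1 << 8 | c2;          /* combine the context bytes */
    h = (h * 2654435761u) >> (32 - CTX_TAB_BITS); /* multiplicative hash */
    return h;                                     /* index in [0, 2^16) */
}
```

The same function must be used by compressor and decompressor so that both sides select the same row of the offset table.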
In the original ROLZ, literals are encoded with a first-order context model, and this is no accident. The point is that this scheme encodes more literals than standard LZ77, since very short matches are simply never produced by the ROLZ scheme. For example, with a first-order context and a minimum match length of three characters, the effective minimum match length becomes four (1 character of context + 3 characters of minimum match = 4). A first-order context model, or a more complex order-1-0 PPM model, can compensate for this shortcoming of the algorithm when encoding literals.
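In its simplest form, an order-1 literal model keeps a separate adaptive frequency table for each possible preceding byte. A sketch of the counting part follows; the entropy coder that would consume these counts, and the increment of 32, are assumptions for illustration:

```c
/* Per-context adaptive counts for an order-1 literal model:
 * freq[ctx][sym] estimates P(sym | previous byte == ctx). */
static unsigned freq[256][256];
static unsigned total[256];

static void model_init(void) {
    for (int c = 0; c < 256; c++) {
        total[c] = 256;
        for (int s = 0; s < 256; s++)
            freq[c][s] = 1;               /* uniform prior */
    }
}

/* Report the probability numerator/denominator for `sym` in context
 * `ctx` (to be fed to an arithmetic coder), then update the model. */
static void model_encode(unsigned char ctx, unsigned char sym,
                         unsigned *num, unsigned *den) {
    *num = freq[ctx][sym];
    *den = total[ctx];
    freq[ctx][sym] += 32;                 /* adapt toward recent symbols */
    total[ctx] += 32;
}
```

Because the model conditions on the preceding byte, frequent pairs such as "q" followed by "u" quickly become cheap to encode.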
ROLZ2-ROLZ3
Author: Malcolm Taylor, 2005.
These algorithms represent a further development of ROLZ:
- ROLZ2 was designed for maximum decompression speed.
- ROLZ3 targets maximum compression, at a slight cost in decompression speed.
Both algorithms use a first-order context to reduce active offsets, and they are also able to use a table of up to 32,000 offsets for each context.
- ROLZ2 uses a simple and fast first-order model to encode literals.
- ROLZ3 uses a more sophisticated CM (Context Mixing) second-order model.
But the main distinguishing feature of these algorithms is the use of optimal parsing during encoding. The dynamic-programming technique applied here computes an optimal sequence of literals and matches many steps ahead, basing each choice on the real cost of encoding a literal or a match.
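The idea behind such optimal parsing can be illustrated with a small dynamic program over bit costs. This is a toy sketch, not Taylor's implementation: the flat cost constants stand in for the adaptive estimates a real coder would use, and the match lengths are assumed to come from a separate match finder.

```c
#define LIT_COST   9   /* assumed bit cost of coding one literal */
#define MATCH_COST 20  /* assumed flat bit cost of coding one match */

/* cost[i] = cheapest number of bits to encode buf[i..n).
 * match_len[i] is the longest match available at position i (0 if none),
 * supplied by the match finder. take_match[i] records the decision. */
static void optimal_parse(const int *match_len, int n,
                          int *cost, int *take_match) {
    cost[n] = 0;
    for (int i = n - 1; i >= 0; i--) {
        cost[i] = LIT_COST + cost[i + 1];         /* option 1: literal */
        take_match[i] = 0;
        int len = match_len[i];
        if (len > 0 && MATCH_COST + cost[i + len] < cost[i]) {
            cost[i] = MATCH_COST + cost[i + len]; /* option 2: match */
            take_match[i] = 1;
        }
    }
}
```

Working backward from the end of the block, each position ends up with the globally cheapest continuation, so a locally attractive match is rejected whenever coding a literal leads to a better parse later on.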
See also
- Arithmetic coding
- LZW
Links
- RK and WinRK Homepage
- Compressor BALZ (Public Domain)
- QUAD Compressor (LGPL)
- Description and source code of LZRW1-LZRW4 - a theoretical justification of the advantages of ROLZ
- Descriptions and source code of various LZP and LZCB variants
- Arturo Campos. Description of the LZP Algorithm