NOTE: This page is very incomplete and represents the author's working notes on the subject. Other pages give a much better overview of the various compression algorithms; types such as Huffman, LZW, LZH, bijective coding, RLE, and JPEG are well documented elsewhere.
Compression has emerged as an important, simple, and efficient way both to increase the speed of transmission and to increase the resistance to decoding of various well-known and widely used cryptographic algorithms. It can compensate for certain pre-existing weaknesses of algorithms such as DES. It is easy to see that, for example, it makes "known plaintext" attacks more difficult and creates difficulties somewhat similar to those of breaking a one-time pad. On one hand this allows extending the useful life of well-known algorithms such as DES, which are implemented in some old hardware; on the other hand it can be combined with "secure" algorithms such as triple DES to increase the difficulties for those who might try to break them. The context that still exists in compressed text due to the structure created by the compression algorithm is usually quite small (for example, LZ77 does not store any dictionary). Even very simple compression methods using Huffman encoding, arithmetic encoding (an elegant, effective method that allows dynamic adaptability), or differential encoding increase the complexity of breaking the cipher, because all of them change the alphabet and the length of symbols, breaking the "one byte -- one symbol" (or "two bytes -- one symbol") correspondence and thus making the "known text" somewhat "less known". They also eliminate a large part of the redundancy of the text.
If differential compression is used and the reference is unknown to the attacker (the pointer to the reference, say a URL, being made part of the key), this makes the cipher theoretically unbreakable (although one can still spy on HTTP or other traffic at the recipient end), as the text simply can't be restored without knowing the key. This also creates some similarity with the one-time pad cryptographic method.
In other words, compressing the data stream before encryption extends the useful life of well-known "weak" cryptographic algorithms such as DES and increases the strength of encryption with "strong" algorithms.
Another important modern application area for compression is storage of "Big Data". In many areas, such as the biosciences and imaging, the accumulated (and quite valuable) data now reach several petabytes. Storing this data uncompressed is very expensive, and here the quality of compression is "real money": each 100TB of space saved is a significant reduction of costs.
You would think that generic algorithms can't compete with algorithms specialized for a particular class of data, since compression is about elimination of redundancy (increasing entropy) and specialized knowledge allows doing this more efficiently. But this is not always true. Compression, like optimization (and compression can be viewed as optimization), is one of the areas where it is very easy to start digging "fool's gold" and "spoil the broth" by enthusiastically venturing into areas of little or no return.
For example, increasing the complexity of an algorithm often backfires, and the net result can be highly negative. At the same time it is clear that algorithms such as LZ77 do not work too well on certain important classes of data (for example genomic data represented as FASTA/FASTQ files), and other methods can achieve equal or better compression. The combination of LZ77 with Huffman encoding saves the face of gzip in this area, and it does not perform too badly. Still, its performance is nothing to boast about, and both bzip2 and xz beat it (although xz is way too slow for the size of a typical genomic data set). Here is a very simple test that demonstrates the performance issue quite well:
| # | Sample | Compressor | Options | Orig size (GB) | Compressed size (GB) | Compress time (min:sec) | Compress speed (MB/sec) | Decompress time (min:sec) | Speed relative to gzip -6 | Size relative to gzip -6 | Qratio (speed divided by the square of size, relative to gzip) | Fraction of space occupied by archive | Compression ratio |
|---|--------|------------|---------|----------------|----------------------|-------------------------|-------------------------|---------------------------|---------------------------|--------------------------|---------------------------------------------------------------|----------------------------------------|-------------------|
| 1 | FASTQ | gzip | -6 | 7.43 | 2.08 | 18:03 | 6.86 | 1:14 | 1.00 | 1.00 | 1.00 | 0.28 | 3.58 |
| 2 | FASTQ | pigz | | 7.43 | 2.08 | 1:16 | 97.78 | 0:51 | 14.39 | 1.00 | 14.37 | 0.28 | 3.57 |
| 3 | FASTQ | pigz | -9 | 7.43 | 2.04 | 2:32 | 48.89 | 0:51 | 7.20 | 0.98 | 7.47 | 0.27 | 3.64 |
| 4 | FASTQ | bzip2 | | 7.43 | 1.53 | 20:42 | 5.98 | 7:21 | 0.88 | 0.74 | 1.63 | 0.21 | 4.86 |
| 5 | FASTQ | pbzip2 | | 7.43 | 1.60 | 1:51 | 66.95 | 0:56 | 9.86 | 0.77 | 16.63 | 0.22 | 4.64 |
| 6 | FASTQ | xz | -9 | 7.43 | 1.34 | 400:05 | 0.31 | 3:19 | 0.05 | 0.64 | 0.11 | 0.18 | 5.56 |
| 7 | FASTQ | xz | | 7.43 | 1.53 | 206:01 | 0.60 | 3:19 | 0.09 | 0.74 | 0.16 | 0.21 | 4.86 |
Please note that for FASTA files a compression ratio close to 4 can be achieved simply by 2-bit encoding of the data stream (each of the four nucleotides fits into two bits). So you would expect a compression ratio of at least 5 as the lowest bound of "decent" compression of FASTA/FASTQ data for more complex algorithms.
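As an illustration of that 2-bit baseline, here is a minimal sketch of packing a nucleotide string into two bits per base; the code and names are written for this page, not taken from any FASTA tool, and it ignores N bases and quality lines entirely:

```python
# Toy 2-bit packer for DNA: four bases per byte, so the ratio approaches 4:1.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack_2bit(seq: str) -> bytes:
    """Pack a string of A/C/G/T into bytes; the last byte is zero-padded."""
    out = bytearray()
    acc, nbits = 0, 0
    for base in seq:
        acc = (acc << 2) | CODE[base]
        nbits += 2
        if nbits == 8:
            out.append(acc)
            acc, nbits = 0, 0
    if nbits:
        out.append(acc << (8 - nbits))
    return bytes(out)

seq = "ACGTACGTACGTACGTACGT"
print(len(seq), "bases ->", len(pack_2bit(seq)), "bytes")   # 20 bases -> 5 bytes
```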
But only xz, which uses LZMA, reaches this bound. And it does so at the cost of a tremendous increase in compression time: roughly ten times in comparison with gzip on a single CPU, and twenty times in comparison with the multithreaded implementations of gzip (pigz) or bzip2 (pbzip2), which are capable of using the large number of cores on modern servers (recently up to 60 for a two-socket server). So this is a classic example of a Pyrrhic victory, making xz suitable only for long-term storage of old, rarely used data.
So it is clear that universal compression algorithms are inadequate for this type of data. But that does not, by extension, mean that specialized algorithms that beat them are easy to write: for the last 20 years no clear leader for this type of data has emerged, despite many attempts made in this area.
In other words, compression, like optimization problems, is a very tricky area where "the road to hell is paved with good intentions".
As computer science started to stagnate in the late 1980s, the same happened with compression algorithms too. The field experienced a short renaissance in 2014, as described in a very good introductory book on the subject, Understanding Compression: Data Compression for Modern Developers by Colt McAnlis and Aleks Haecky:
The “Weissman Score”
I gotta hand it to Mike Judge, who really helped out the compression industry. Since the 1980s, attention and gains in compression have been relatively small and slow. Sure, we had BWT in the ’90s, and LZMA in the ’00s, and ZPAQ in the ’10s, but other than that, it’s been pretty quiet. However, in a single burst of laughter, suddenly the nerd world is interested in compression again. What happened?
In 2014, the TV show Silicon Valley burst onto the scene with much hilarity. The show follows a programmer’s journey into the startup world as he tries to build a company around a revolutionary new compression algorithm. While satire, at its core, the impact on the compression community has been pretty amazing. Suddenly, it’s cool to care about compression again. The media has been eating it up, and any story of a company doing something interesting with compression immediately gets compared to the TV show. So when Google released a new algorithm, the media took the opportunity to discuss art, imitating life, imitating art, and all that jazz.
Although the compression algorithm presented in the show is all fiction, the producers wanted to have a hint of truth in things, so they contacted professor Tsachy Weissman at the University of Stanford to help them walk that line. Now, Prof. Weissman created a method for measuring the performance of data compression, which takes the compression ratio of the data set and divides it by the encoding speed of the data set. The intention is to test out the performance of new compression algorithms by normalizing their ratio/encode speed by known existing encoders (like GZIP). This normalization provides some ability to compare algorithms against universal standard compressors, which can be helpful in the evaluation of the right system for the right data type.
This new metric was named in his honor by the show creators as the Weissman Score, and since has become the stuff of legend on the Internet, although it’s not clear whether the score has found roots in practical use in the compression world. (Or, at least no one on encode.ru seems to be using it yet…)
The takeaway here is that with a great new attention in the data compression space, we’re hopeful to see a new generation of algorithms and research focused on this area to help move compression into a new generation of breakthroughs.
This again demonstrates that computer science, like women's clothing, is by and large driven by fashion ;-)
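For reference, the score described in the excerpt above is usually quoted as W = alpha * (r / r_ref) * (log t_ref / log t), where r and t are the compression ratio and encoding time of the codec under test and r_ref, t_ref belong to a standard compressor such as gzip. A rough sketch (the function name and the alpha factor are illustrative assumptions, not any official implementation):

```python
import math

def weissman_score(ratio, seconds, ref_ratio, ref_seconds, alpha=1.0):
    """Weissman-style score: compression ratio and speed normalized against a
    reference compressor; values above alpha mean "better than the reference"."""
    return alpha * (ratio / ref_ratio) * (math.log(ref_seconds) / math.log(seconds))

# Using the FASTQ numbers from the table above: xz -9 versus gzip -6.
print(weissman_score(ratio=5.56, seconds=400 * 60 + 5,
                     ref_ratio=3.58, ref_seconds=18 * 60 + 3))   # ~1.08
```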
As compression can be viewed as a specialized area of optimization, efficiency of implementation is of paramount importance, and the only two viable implementation languages are C/C++ and assembly language (the latter destroys portability). Also, the existence of two types of computers, big-endian and little-endian, bites compression algorithm writers very hard.
As Intel CPUs are dominant, writing in assembly language is an option if you write the algorithm for a single CPU. Portability be damned ;-)
When writing compression algorithms you can greatly benefit from the ability to do various tricks on bit fields, which actually makes non-compiled languages slightly deficient as implementation languages. You can also benefit from the existence of various forms of integers on modern computers.
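As a small illustration of the kind of bit-field work involved, here is a sketch of a bit writer that packs variable-width codes into a byte stream. It is written in Python purely for readability (a real codec would do this in C), and every name in it is invented for the example:

```python
class BitWriter:
    """Accumulate (value, bit-width) pairs into a byte string, MSB first."""
    def __init__(self):
        self.buf = bytearray()
        self.acc = 0        # bit accumulator
        self.nbits = 0      # number of valid bits currently in the accumulator

    def write(self, value: int, width: int) -> None:
        self.acc = (self.acc << width) | (value & ((1 << width) - 1))
        self.nbits += width
        while self.nbits >= 8:                       # flush whole bytes
            self.nbits -= 8
            self.buf.append((self.acc >> self.nbits) & 0xFF)

    def getvalue(self) -> bytes:
        if self.nbits:                               # zero-pad the final partial byte
            return bytes(self.buf) + bytes([(self.acc << (8 - self.nbits)) & 0xFF])
        return bytes(self.buf)

w = BitWriter()
for code, width in [(0b0, 1), (0b10, 2), (0b110, 3), (0b111, 3)]:
    w.write(code, width)
print(w.getvalue().hex())    # 9 bits of codes packed into 2 bytes: '5b80'
```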
Bijective compression means that for any file X, F(F'(X)) == X, where F is either the compressor or the decompressor and F' is its opposite. This defines lossless compression. This type of compression is important both for compressing files and as a method of increasing the strength of crypto algorithms, as it can be fast enough to be used in real-time data transmission.
Compression is now used as a standard first phase of encryption of large chunks of text. It eliminates redundancy and thus creates significant additional difficulties for attempts to break a particular cipher, in many cases making them simply impossible (if salt is used to make the stream unique and a non-standard algorithm with no headers is used). It destroys the notion of a dictionary of plaintext.
There were several attempts to establish an upper limit on compression for a particular type of data and/or the optimality of particular algorithms or data representations (Huffman encoding). This is a complex area, but the following post gives you the general idea: Is there a theoretical limit to data compression? If so, how was it found? - Quora
Lossless compression reduces bits by identifying and eliminating statistical redundancy. The more heavily the source (say, a text file) is preprocessed and studied, the more efficiently it can be compressed, so the efficiency of compression is inversely proportional to its speed. For example, no generic algorithm can compete on regular text articles and books with algorithms designed specifically for compressing text. You can create an even more specialized algorithm, for example one designed specifically to compress Unix syslog files, and it will beat any generic text compression algorithm, and so on and so forth.

Compression always relies on the non-randomness of the text: a random sequence of bytes is poorly compressible.
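Both points -- the lossless round trip and the dependence on redundancy -- are easy to check with a throwaway experiment using Python's standard zlib module; the exact numbers will vary, but the ordering will not:

```python
import os
import zlib

redundant = b"Yes No n/a " * 10000            # highly repetitive "text"
noise = os.urandom(len(redundant))            # random bytes of the same length

for name, data in [("redundant", redundant), ("random", noise)]:
    packed = zlib.compress(data, 6)
    assert zlib.decompress(packed) == data    # lossless: decompress(compress(X)) == X
    print(name, len(data), "->", len(packed), "bytes",
          "(ratio %.1f)" % (len(data) / len(packed)))
```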
The Lempel–Ziv (LZ) compression methods are among the most popular algorithms for lossless storage.[7] DEFLATE is a variation on LZ optimized for decompression speed and compression ratio, but compression can be slow. DEFLATE is used in PKZIP, Gzip and PNG. LZW (Lempel–Ziv–Welch) is used in GIF images. Also noteworthy is the LZR (Lempel-Ziv–Renau) algorithm, which serves as the basis for the Zip method.[citation needed] LZ methods use a table-based compression model where table entries are substituted for repeated strings of data. For most LZ methods, this table is generated dynamically from earlier data in the input. The table itself is often Huffman encoded (e.g. SHRI, LZX). Current LZ-based coding schemes that perform well are Brotli and LZX. LZX is used in Microsoft's CAB format.
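To make the sliding-window idea concrete, here is a deliberately naive LZ77-style tokenizer -- nothing like the tuned match finders inside DEFLATE or LZX, and all parameters are arbitrary choices for the sketch:

```python
def lz77_tokens(data: bytes, window: int = 4096, min_match: int = 3):
    """Greedy LZ77 factorization: yield ("match", offset, length) for back-references
    into the window and ("lit", byte) for literals. A real codec would entropy-code these."""
    i, n = 0, len(data)
    while i < n:
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):        # naive longest-match search
            length = 0
            while i + length < n and data[j + length] == data[i + length] and length < 255:
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= min_match:
            yield ("match", best_off, best_len)
            i += best_len
        else:
            yield ("lit", data[i])
            i += 1

print(list(lz77_tokens(b"abracadabra abracadabra")))
```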
In general, compression can be viewed as the conversion of a static text into a special program. In other words, compression consists of converting a static text into some kind of code for a specially constructed (possibly unique for this particular case) abstract machine. Programs for this abstract machine are shorter than the original text, while the output of the program after processing by an interpreter (the decompressor) is exactly the same (original) text.
For example, if we first process the text, construct a dictionary of the words used, and then replace each word with its index in the dictionary, that will be a primitive compression scheme that might work well on text with a small dictionary and frequently repeated words. For example, it might work well for the answers to some lengthy form, like "Yes No n/a Yes No n/a n/a Yes No No No Yes n/a n/a No". In this case the dictionary consists of three entries and each reply can be encoded in two bits.
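A throwaway sketch of exactly this scheme -- build the three-entry dictionary, then emit two bits per reply (the variable names are made up for the example):

```python
answers = "Yes No n/a Yes No n/a n/a Yes No No No Yes n/a n/a No".split()

dictionary = sorted(set(answers))                      # ['No', 'Yes', 'n/a']
index = {word: i for i, word in enumerate(dictionary)}

bits = "".join(format(index[w], "02b") for w in answers)       # 2 bits per reply
packed = int(bits, 2).to_bytes((len(bits) + 7) // 8, "big")
print(dictionary, "->", len(answers), "replies in", len(packed), "bytes")

# decoding just walks the bit string two bits at a time
decoded = [dictionary[int(bits[i:i + 2], 2)] for i in range(0, len(bits), 2)]
assert decoded == answers
```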
It is important to understand that the more you know about the structure of the text, the more specialized a "processor" you can construct and the higher the level of compression you can achieve. For example, if you know that a particular text is, say, an HTTP log, then you can compress it several times better than with any generic algorithm.
The "diff" command of UNIX system implement an algorithm of finding the shortest sequence of editing command able to convert one string to another. Informally, the result of a diff algorithm gives the minimum number of operations (insert a symbol, or delete a symbol) to transform one string into the other. This is related to an edit distance, called the Levenshtein distance, with the additional operation of substitution, and with weights associated to operations.
Hirschberg (1975) presents the computation of the LCS in linear space. This idea can be used very efficiently for compression of HTTP logs (for example Apache logs): a window of, say, the last 1K lines often contains at least one string that is very close to the current one (in the Levenshtein metric), so you can encode the whole string as a reference to that "etalon" string plus the few operations needed to convert the "etalon" into the current string.
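A hedged sketch of that log-delta idea using Python's standard difflib; a real implementation would choose the reference line more cleverly and entropy-code the edit operations, and the 0.6 similarity threshold here is arbitrary:

```python
import difflib

def delta_encode(line, window):
    """Encode a log line as (index of the most similar earlier line, edits that differ),
    or as a literal when nothing in the window is close enough."""
    best_i, best_ratio = -1, 0.0
    for i, ref in enumerate(window):
        r = difflib.SequenceMatcher(None, ref, line).ratio()
        if r > best_ratio:
            best_i, best_ratio = i, r
    if best_ratio < 0.6:
        return ("lit", line)
    ops = difflib.SequenceMatcher(None, window[best_i], line).get_opcodes()
    edits = [(tag, i1, i2, line[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]
    return ("ref", best_i, edits)

window = ['10.0.0.1 - - "GET /index.html" 200 5120',
          '10.0.0.2 - - "GET /style.css" 200 812']
print(delta_encode('10.0.0.1 - - "GET /index.html" 404 210', window))
```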
In its turn, the search for the LCS is connected with efficient string matching algorithms. The first linear-time string-matching algorithm discovered is due to Morris and Pratt (1970); it was improved by Knuth, Morris, and Pratt (1976). The search behaves like a recognition process by an automaton, and a character of the text is compared to a character of the pattern no more than log_Φ(m+1) times, where Φ is the golden ratio (1+√5)/2. Boyer and Moore's algorithm (1977) is considered the most efficient string-matching algorithm in usual applications; a simplified version of it (or the entire algorithm) is often implemented in text editors for the "search" and "substitute" commands. Several variants of Boyer and Moore's algorithm avoid its quadratic behavior when searching for all occurrences of the pattern. In 1975, Aho and Corasick designed an O(n log σ) algorithm for searching a text for several patterns at once, with a running time independent of the number of patterns; it is implemented by the "fgrep" command under the UNIX operating system. In applications where the same text is to be searched for several patterns, it is the text that needs to be preprocessed. Even if no further information is known about its syntactic structure, it is possible, and indeed extremely efficient, to build an index that supports searches. Data structures for representing indexes on text files include suffix trees (Weiner 1973, McCreight 1976, Ukkonen 1994), the directed acyclic word graph (Blumer et al., 1985), suffix automata (Crochemore, 1986), and suffix arrays (Manber and Myers, 1993). All of these (except suffix arrays) build the index in time O(n log σ).
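For example, the Knuth-Morris-Pratt idea mentioned above fits in a few lines; this is written from the textbook description rather than taken from any library:

```python
def kmp_search(text: str, pattern: str):
    """Return the start indices of all occurrences of pattern in text in O(n + m) time."""
    m = len(pattern)
    failure = [0] * m     # failure[i]: longest proper prefix of pattern[:i+1] that is also a suffix
    k = 0
    for i in range(1, m):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k

    hits, k = [], 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == m:                         # a full match ends at position i
            hits.append(i - m + 1)
            k = failure[k - 1]
    return hits

print(kmp_search("abracadabra abracadabra", "abra"))   # [0, 7, 12, 19]
```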
If you can create a dictionary of the symbols or words in the text (or just split the source into meaningful, repeatable chunks), the natural choice of encoding is Huffman encoding. Please note that for ordinary English text, calculating the frequency of individual letters and compressing based on those frequencies is far from an optimal approach: words are a far better target. If separation into "words" is impossible, then digraphs are a better deal than single letters.
Huffman coding - Wikipedia, the free encyclopedia
In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding and/or using such a code proceeds by means of Huffman coding, an algorithm developed by David A. Huffman while he was a Ph.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".[1]
The output from Huffman's algorithm can be viewed as a variable-length code table for encoding a source symbol (such as a character in a file). The algorithm derives this table from the estimated probability or frequency of occurrence (weight) for each possible value of the source symbol. As in other entropy encoding methods, more common symbols are generally represented using fewer bits than less common symbols. Huffman's method can be efficiently implemented, finding a code in linear time to the number of input weights if these weights are sorted.[2] However, although optimal among methods encoding symbols separately, Huffman coding is not always optimal among all compression methods.
... ... ...
Compression
A source generates 4 different symbols {a1, a2, a3, a4} with probabilities {0.4; 0.35; 0.2; 0.05}. A binary tree is generated from left to right taking the two least probable symbols and putting them together to form another equivalent symbol having a probability that equals the sum of the two symbols. The process is repeated until there is just one symbol. The tree can then be read backwards, from right to left, assigning different bits to different branches. The final Huffman code is:
| Symbol | Code |
|--------|------|
| a1 | 0 |
| a2 | 10 |
| a3 | 110 |
| a4 | 111 |

The standard way to represent a signal made of 4 symbols is by using 2 bits/symbol, but the entropy of the source is 1.74 bits/symbol. If this Huffman code is used to represent the signal, then the average length is lowered to 1.85 bits/symbol; it is still far from the theoretical limit because the probabilities of the symbols are different from negative powers of two.
The technique works by creating a binary tree of nodes. These can be stored in a regular array, the size of which depends on the number of symbols, n. A node can be either a leaf node or an internal node. Initially, all nodes are leaf nodes, which contain the symbol itself, the weight (frequency of appearance) of the symbol and, optionally, a link to a parent node which makes it easy to read the code (in reverse) starting from a leaf node. Internal nodes contain a symbol weight, links to two child nodes and the optional link to a parent node. As a common convention, bit '0' represents following the left child and bit '1' represents following the right child. A finished tree has up to n leaf nodes and n − 1 internal nodes. A Huffman tree that omits unused symbols produces the most optimal code lengths.

The process essentially begins with the leaf nodes containing the probabilities of the symbol they represent; then a new node whose children are the 2 nodes with smallest probability is created, such that the new node's probability is equal to the sum of the children's probability. With the previous 2 nodes merged into one node (thus not considering them anymore), and with the new node being now considered, the procedure is repeated until only one node remains: the Huffman tree.
The simplest construction algorithm uses a priority queue where the node with lowest probability is given highest priority:
1. Create a leaf node for each symbol and add it to the priority queue.
2. While there is more than one node in the queue:
   1. Remove the two nodes of highest priority (lowest probability) from the queue.
   2. Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
   3. Add the new node to the queue.
3. The remaining node is the root node and the tree is complete.
Since efficient priority queue data structures require O(log n) time per insertion, and a tree with n leaves has 2n−1 nodes, this algorithm operates in O(n log n) time, where n is the number of symbols.
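A minimal sketch of this priority-queue construction using Python's standard heapq; the tie-breaking counter only exists so that equal weights never force a comparison of the node payloads, and this is an illustration, not a production encoder:

```python
import heapq

def huffman_code(freqs):
    """Build a prefix code from {symbol: weight} using the priority-queue algorithm."""
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)      # the two lowest-weight nodes
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, count, merged))
        count += 1
    return heap[0][2]

code = huffman_code({"a1": 0.4, "a2": 0.35, "a3": 0.2, "a4": 0.05})
print(code)   # code lengths 1, 2, 3, 3 -- matching the example above (exact bits may differ)
```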
If the symbols are sorted by probability, there is a linear-time (O(n)) method to create a Huffman tree using two queues, the first one containing the initial weights (along with pointers to the associated leaves), and combined weights (along with pointers to the trees) being put in the back of the second queue. This assures that the lowest weight is always kept at the front of one of the two queues:
1. Start with as many leaves as there are symbols.
2. Enqueue all leaf nodes into the first queue (by probability in increasing order so that the least likely item is at the head of the queue).
3. While there is more than one node in the queues:
   1. Dequeue the two nodes with the lowest weight by examining the fronts of both queues.
   2. Create a new internal node, with the two just-removed nodes as children (either node can be either child) and the sum of their weights as the new weight.
   3. Enqueue the new node into the rear of the second queue.
4. The remaining node is the root node; the tree has now been generated.
Although linear-time given sorted input, in the general case of arbitrary input, using this algorithm requires pre-sorting. Thus, since sorting takes O(n log n) time in the general case, both methods have the same overall complexity.
In many cases, time complexity is not very important in the choice of algorithm here, since n here is the number of symbols in the alphabet, which is typically a very small number (compared to the length of the message to be encoded); whereas complexity analysis concerns the behavior when n grows to be very large.
It is generally beneficial to minimize the variance of codeword length. For example, a communication buffer receiving Huffman-encoded data may need to be larger to deal with especially long symbols if the tree is especially unbalanced. To minimize variance, simply break ties between queues by choosing the item in the first queue. This modification will retain the mathematical optimality of the Huffman coding while both minimizing variance and minimizing the length of the longest character code.
Here's an example of optimized Huffman coding using the French subject string "j'aime aller sur le bord de l'eau les jeudis ou les jours impairs". Note that the original Huffman coding tree structure would be different from the given example:
Decompression
Generally speaking, the process of decompression is simply a matter of translating the stream of prefix codes to individual byte values, usually by traversing the Huffman tree node by node as each bit is read from the input stream (reaching a leaf node necessarily terminates the search for that particular byte value). Before this can take place, however, the Huffman tree must be somehow reconstructed. In the simplest case, where character frequencies are fairly predictable, the tree can be preconstructed (and even statistically adjusted on each compression cycle) and thus reused every time, at the expense of at least some measure of compression efficiency. Otherwise, the information to reconstruct the tree must be sent a priori. A naive approach might be to prepend the frequency count of each character to the compression stream. Unfortunately, the overhead in such a case could amount to several kilobytes, so this method has little practical use. If the data is compressed using canonical encoding, the compression model can be precisely reconstructed with just B 2^B bits of information (where B is the number of bits per symbol). Another method is to simply prepend the Huffman tree, bit by bit, to the output stream. For example, assuming that the value of 0 represents a parent node and 1 a leaf node, whenever the latter is encountered the tree building routine simply reads the next 8 bits to determine the character value of that particular leaf. The process continues recursively until the last leaf node is reached; at that point, the Huffman tree will thus be faithfully reconstructed. The overhead using such a method ranges from roughly 2 to 320 bytes (assuming an 8-bit alphabet). Many other techniques are possible as well. In any case, since the compressed data can include unused "trailing bits", the decompressor must be able to determine when to stop producing output. This can be accomplished by either transmitting the length of the decompressed data along with the compression model or by defining a special code symbol to signify the end of input (the latter method can adversely affect code length optimality, however).

Main properties
The probabilities used can be generic ones for the application domain that are based on average experience, or they can be the actual frequencies found in the text being compressed. This requires that a frequency table must be stored with the compressed text. See the Decompression section above for more information about the various techniques employed for this purpose.
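One of the tree-transmission techniques mentioned in the Decompression section above -- prepending the tree bit by bit, with 0 for an internal node and 1 plus eight bits for a leaf -- can be sketched as follows (a toy serializer, with the tree represented as nested tuples purely for the example):

```python
def serialize(node, out):
    """Pre-order dump: '0' marks an internal (left, right) tuple, '1' + 8 bits marks a leaf byte."""
    if isinstance(node, tuple):
        out.append("0")
        serialize(node[0], out)
        serialize(node[1], out)
    else:
        out.append("1" + format(node, "08b"))

def deserialize(bits, pos=0):
    """Inverse of serialize; returns (node, position after the node)."""
    if bits[pos] == "0":
        left, pos = deserialize(bits, pos + 1)
        right, pos = deserialize(bits, pos)
        return (left, right), pos
    return int(bits[pos + 1:pos + 9], 2), pos + 9

tree = ((ord("a"), ord("b")), ord("c"))      # toy tree for the bytes 'a', 'b', 'c'
out = []
serialize(tree, out)
header = "".join(out)
print(len(header), "header bits")            # 2 internal + 3 leaves -> 2 + 3*9 = 29 bits
assert deserialize(header)[0] == tree
```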
Optimality
Although Huffman's original algorithm is optimal for a symbol-by-symbol coding (i.e., a stream of unrelated symbols) with a known input probability distribution, it is not optimal when the symbol-by-symbol restriction is dropped, or when the probability mass functions are unknown. Also, if symbols are not independent and identically distributed, a single code may be insufficient for optimality. Other methods such as arithmetic coding and LZW coding often have better compression capability: Both of these methods can combine an arbitrary number of symbols for more efficient coding, and generally adapt to the actual input statistics, useful when input probabilities are not precisely known or vary significantly within the stream. However, these methods have higher computational complexity. Also, both arithmetic coding and LZW were historically a subject of some concern over patent issues. However, as of mid-2010, the most commonly used techniques for these alternatives to Huffman coding have passed into the public domain as the early patents have expired.
However, the limitations of Huffman coding should not be overstated; it can be used adaptively, accommodating unknown, changing, or context-dependent probabilities. In the case of known independent and identically distributed random variables, combining symbols ("blocking") reduces inefficiency in a way that approaches optimality as the number of symbols combined increases. Huffman coding is optimal when each input symbol is a known independent and identically distributed random variable having a probability that is the inverse of a power of two.

Prefix codes tend to have inefficiency on small alphabets, where probabilities often fall between these optimal points. The worst case for Huffman coding can happen when the probability of a symbol exceeds 2^-1 = 0.5, making the upper limit of inefficiency unbounded. These situations often respond well to a form of blocking called run-length encoding; for the simple case of Bernoulli processes, Golomb coding is a provably optimal run-length code.
For a set of symbols with a uniform probability distribution and a number of members which is a power of two, Huffman coding is equivalent to simple binary block encoding, e.g., ASCII coding. This reflects the fact that compression is not possible with such an input.
Variations
Many variations of Huffman coding exist, some of which use a Huffman-like algorithm, and others of which find optimal prefix codes (while, for example, putting different restrictions on the output). Note that, in the latter case, the method need not be Huffman-like, and, indeed, need not even be polynomial time. An exhaustive list of papers on Huffman coding and its variations is given by "Code and Parse Trees for Lossless Source Encoding".[4]
n-ary Huffman coding
The n-ary Huffman algorithm uses the {0, 1, ... , n − 1} alphabet to encode message and build an n-ary tree. This approach was considered by Huffman in his original paper. The same algorithm applies as for binary (n equals 2) codes, except that the n least probable symbols are taken together, instead of just the 2 least probable. Note that for n greater than 2, not all sets of source words can properly form an n-ary tree for Huffman coding. In these cases, additional 0-probability place holders must be added. This is because the tree must form an n to 1 contractor; for binary coding, this is a 2 to 1 contractor, and any sized set can form such a contractor. If the number of source words is congruent to 1 modulo n-1, then the set of source words will form a proper Huffman tree.
Adaptive Huffman coding
A variation called adaptive Huffman coding involves calculating the probabilities dynamically based on recent actual frequencies in the sequence of source symbols, and changing the coding tree structure to match the updated probability estimates. It is used rarely in practice, since the cost of updating the tree makes it slower than optimized adaptive arithmetic coding, that is more flexible and has a better compression.
Huffman template algorithm
Most often, the weights used in implementations of Huffman coding represent numeric probabilities, but the algorithm given above does not require this; it requires only that the weights form a totally ordered commutative monoid, meaning a way to order weights and to add them. The Huffman template algorithm enables one to use any kind of weights (costs, frequencies, pairs of weights, non-numerical weights) and one of many combining methods (not just addition). Such algorithms can solve other minimization problems, such as minimizing max_i [w_i + length(c_i)], a problem first applied to circuit design.

Length-limited Huffman coding/minimum variance Huffman coding
Length-limited Huffman coding is a variant where the goal is still to achieve a minimum weighted path length, but there is an additional restriction that the length of each codeword must be less than a given constant. The package-merge algorithm solves this problem with a simple greedy approach very similar to that used by Huffman's algorithm. Its time complexity is O(nL), where L is the maximum length of a codeword. No algorithm is known to solve this problem in linear or linearithmic time, unlike the presorted and unsorted conventional Huffman problems, respectively.

Huffman coding with unequal letter costs
In the standard Huffman coding problem, it is assumed that each symbol in the set that the code words are constructed from has an equal cost to transmit: a code word whose length is N digits will always have a cost of N, no matter how many of those digits are 0s, how many are 1s, etc. When working under this assumption, minimizing the total cost of the message and minimizing the total number of digits are the same thing.
Huffman coding with unequal letter costs is the generalization without this assumption: the letters of the encoding alphabet may have non-uniform lengths, due to characteristics of the transmission medium. An example is the encoding alphabet of Morse code, where a 'dash' takes longer to send than a 'dot', and therefore the cost of a dash in transmission time is higher. The goal is still to minimize the weighted average codeword length, but it is no longer sufficient just to minimize the number of symbols used by the message. No algorithm is known to solve this in the same manner or with the same efficiency as conventional Huffman coding, though it has been solved by Karp whose solution has been refined for the case of integer costs by Golin.
Optimal alphabetic binary trees (Hu–Tucker coding)
In the standard Huffman coding problem, it is assumed that any codeword can correspond to any input symbol. In the alphabetic version, the alphabetic order of inputs and outputs must be identical. Thus, for example, A = {a, b, c} could not be assigned code H(A, C) = {00, 1, 01}, but instead should be assigned either H(A, C) = {00, 01, 1} or H(A, C) = {0, 10, 11}. This is also known as the Hu–Tucker problem, after T. C. Hu and Alan Tucker, the authors of the paper presenting the first linearithmic solution to this optimal binary alphabetic problem,[5] which has some similarities to the Huffman algorithm but is not a variation of it. These optimal alphabetic binary trees are often used as binary search trees.

The canonical Huffman code
Main article: Canonical Huffman code

If weights corresponding to the alphabetically ordered inputs are in numerical order, the Huffman code has the same lengths as the optimal alphabetic code, which can be found by calculating these lengths, rendering Hu–Tucker coding unnecessary. The code resulting from numerically (re-)ordered input is sometimes called the canonical Huffman code and is often the code used in practice, due to ease of encoding/decoding. The technique for finding this code is sometimes called Huffman–Shannon–Fano coding, since it is optimal like Huffman coding, but alphabetic in weight probability, like Shannon–Fano coding. The Huffman–Shannon–Fano code corresponding to the example is {000, 001, 01, 10, 11}, which, having the same codeword lengths as the original solution, is also optimal. But in canonical Huffman code, the result is {110, 111, 00, 01, 10}.
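Canonical codes are attractive precisely because they can be rebuilt from the code lengths alone (this is what DEFLATE transmits). A hedged sketch of the usual lengths-to-codes assignment, using the five lengths from the example above:

```python
def canonical_codes(lengths):
    """Assign canonical codewords from {symbol: code length}: shorter codes first,
    ties broken by symbol order, each next code = previous + 1, shifted left as needed."""
    code, prev_len, out = 0, 0, {}
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)
        out[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return out

print(canonical_codes({"a": 3, "b": 3, "c": 2, "d": 2, "e": 2}))
# {'c': '00', 'd': '01', 'e': '10', 'a': '110', 'b': '111'} -- i.e. {110, 111, 00, 01, 10}
```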
HTTP compression - Wikipedia, the free encyclopedia
Content-Encoding tokens
The official list of tokens available to servers and client is maintained by IANA,[4] and it includes:
- compress – UNIX "compress" program method (historic; deprecated in most applications and replaced by gzip or deflate)
- deflate – compression based on the deflate algorithm (described in RFC 1951), wrapped inside the zlib data format (RFC 1950);
- exi – W3C Efficient XML Interchange
- gzip – GNU zip format (described in RFC 1952). This method is the most broadly supported as of March 2011.[5]
- identity – No transformation is used. This is the default value for content coding.
- pack200-gzip – Network Transfer Format for Java Archives[6]
In addition to these, a number of unofficial or non-standardized tokens are used in the wild by either servers or clients:
- br – Brotli, a new open-sourced compression algorithm specifically designed for HTTP content encoding, implemented in Mozilla Firefox release 44 and with intent-to-implement from Chromium.
- bzip2 – compression based on the free bzip2 format, supported by lighttpd[7]
- lzma – compression based on (raw) LZMA is available in Opera 20, and in elinks via a compile-time option[8]
- peerdist[9] – Microsoft Peer Content Caching and Retrieval
- sdch[10][11] – Google Shared Dictionary Compression for HTTP, based on VCDIFF (RFC 3284); supported natively in recent versions of Google Chrome, Chromium and Android, as well as on Google websites.
- xpress - Microsoft compression protocol used by Windows 8 and later for Windows Store application updates. LZ77-based compression optionally using a Huffman encoding.[12]
- xz - LZMA2-based content compression, supported by a non-official Firefox patch;[13] and fully implemented in mget since 2013-12-31.[14]
Servers that support HTTP compression
- SAP NetWeaver
- Microsoft IIS: built-in or using third-party module
- Apache HTTP Server, via mod_deflate (despite its name, only supporting gzip[15][self-published source?][16])
- Hiawatha HTTP server: serves pre-compressed files[17]
- Cherokee HTTP server, On the fly gzip and deflate compressions
- Oracle iPlanet Web Server
- Zeus Web Server
- lighttpd, via mod_compress and the newer mod_deflate (1.5.x)
- nginx – built-in
- Applications based on Tornado, if "compress_response" is set to True in the application settings (for versions prior to 4.0, set "gzip" to True)
- Jetty Server – built-into default static content serving and available via servlet filter configurations
- GeoServer
- Apache Tomcat
- IBM Websphere
- AOLserver
- Ruby Rack, via the Rack::Deflater middleware
- Varnish – built-in. Works also with ESI
The compression in HTTP can also be achieved by using the functionality of server-side scripting languages like PHP, or programming languages like Java.
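As a small client-side illustration of these tokens, here is a sketch using only the Python standard library that asks a server for gzip-encoded content and undoes the encoding; the URL is a placeholder, and real code would also handle identity, deflate and br responses:

```python
import gzip
import urllib.request

url = "http://example.com/"    # placeholder
req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
with urllib.request.urlopen(req) as resp:
    body = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)          # undo the content coding
print(len(body), "bytes after decoding")
```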
Jan 03, 2016 | www.biostars.org
Eric Normandeau (Quebec, Canada) wrote:
I would like to use another compression algorithm on fastq files than gzip for long term storage. Decompressing and re compressing could free about half the disk space I am using, which would save me approximately 20-30 To of space now and more in the future. (Funny, for some people this is probably big and for others ridiculously small :) The bottom line here is that disk space is money and up-scaling means the costs won't be linear but probably more expensive.
I have found a few contenders, but the most solid of them is fqzcomp-4.6 ( http://sourceforge.net/projects/fqzcomp/files/) . It boasts very good compression rates (about twice smaller compressed files when compared to gzip), is fast enough, installs easily.
Now these are important data, so I need to be sure that in 2 or 5 or 10 years I will be able to get them back.
I saw this other question ( https://www.biostars.org/p/156303/#163646) , but I really want to know if one of these tools is a) good enough to improve compression b) dependable. The article cited by Charles Plessy doesn't help much in this regard.
For those of you that work in big groups, institutes, organisations... What fastq compression tool would you recommend from the point of view of data safety? What have you used with success?
** EDIT **
I did a quick compression ratio comparison for gzip, bz2, and fqzcomp. I used the default parameter values for gzip and bz2 and -Q {0,3,5} -s5+ -e -q 3 for fqzcomp. Here are the compression ratios (compared to the non compressed files):
| Algo | ratio |
|------|-------|
| gzip | 0.272 |
| bz2 | 0.218 |
| fqzcomp | 0.101 - 0.181 (depending on the Q param value) |
| bz2 / gzip | 0.801 |
| fqzcomp / gzip | 0.371 |
| fqzcomp / bz2 | 0.463 |

From these figures, we can see that bz2 reduces files only by an additional ~20% when compared to gzip.
However, fqzcomp files can be as much as 2.7 times smaller than gzip ones and 2.2 times than bz2 files. This is why I am really considering this algorithm. Of course, these figures will change with different fastq files and it assumes you do not care much about the quality values (for Ion Proton, which we sequence in high volume, we actually don't really care that much) but the potential gain is considerable.
Tags: compression, reliability, fastq

Even assuming the absolute worst case of a computer 5-10 years from now being entirely incompatible with your toolkit of choice -- which frankly seems so unlikely a possibility as to be a non-issue -- so long as you have the source code for "custom" or specialized compression and extraction tools, consider that 5-10 years from now, you could run current operating systems (and older compilers) within a VM hypervisor, like VirtualBox, Docker, etc. to extract archives. To help rebuild the environment down the road, it might help to keep a manifest with an archive, a README that describes the host, development tools and versions of things.
-- Alex Reynolds

Is that enough to define dependability? What about bugs that are hard to spot? I don't know, I just want to find a tool I can trust with our group's data :) Mainly, I would like to get recommendations from big users too.
-- Eric Normandeau

I guess it may depend on the details of your use case, but to me this seems like a lot of worry over a very very tiny part of the problem. I think that as long as you pick a reasonable compression tool where you have the source code and enough instructions to build it on a clean system, you'll be fine.
I would be much more worried about other aspects of the long-term storage problem, like adequate off-site backups and catastrophic fail-safe plans at the physical level. For instance, what if your backup provider goes out of business, the AC and emergency power in your datacenter fails and all your hard drives are damaged, or your systems are compromised and a rogue user tries to permanently delete many files? Or maybe something simpler, like system or backup account credentials being lost or forgotten as people move on? At your timescale and data sizes, I might even be worried about (relatively) far-fetched problems like bit rot .
So in sum, I guess I think that if decompressing the data is the only problem you have after X years, you would be lucky. I would personally focus on having enough independent off-site backups and then think about applying redundancy at other filesystem layers, like maybe with ZFS.
-- matted

A non-lossy compression tool should be deterministic, which I'd think is a minimum standard for dependability. In other words, if you run the same input bytes through the same compression or extraction algorithm in the same environment, you should get the same expected bytes as output on repeated trials. If you can take the environment out of the equation with a VM, then you just need to worry about the compression tool, which you could probably set up post-compression extraction tests to verify functionality.
-- Alex Reynolds

Considering the files are on different servers, how would you go about spinning up a VM to (de)compress hundreds of files across different systems weighting a few dozen To? It doesn't seem too fun to me. I need something that will work, now and in the foreseeable future, on *NIX machines without giving me or a possible descendant headaches.
Also, I don't see the link between deterministic and dependable. It just means it will give the same output for the same input, not that the code is well written and not bug ridden.
I am looking for input about quality fastq compression tools (a very specific need) that I can depend on. Your answer is basically that all tools are equal as long as they work in a VM. I don't think I can agree.
-- Eric Normandeau

Good idea about the manifest and info about the system.
-- Eric Normandeau

So what did you settle on?
-- Yannick Wurm

Yannick, I am very interested too since we made our lossless compression algorithm Lossless ALAPY Fastq Compressor (now for MacOS X with 10-20% improved speed and compression ratio) and we think it is worth mentioning.
-- Petr Ponomarenko

I never felt 100% sure about any of the alternative compression softwares, so I continued depending on gzip at the cost of having to buy more disk space. I would still love to find a better way but a major tradeoff is that fasta.gz and fastq.gz can be read by most bioinfo pieces of software so deviating from that format means a bit more work. This could be fine for long term storage though.
Anybody adopted something other than gzip or bz2 and would like to report?
-- Eric Normandeau

Have you tried using compressed files with decompression in process substitution in Linux <(...) or read from stdin?
fastqc <(decompress compressed.file)

-- Petr Ponomarenko

You know, Clumpify's output is still just a gzipped fastq/fasta. The only difference is that the order of the reads is changed. So it's 100% compatible with all software that can read gzipped fastq. Or bzipped, for that matter.
-- Brian Bushnell

John wrote:

I tried LFQC, but it had a bug where if the bundled precompiled binaries (for lpaq and zpac) didn't work, the ruby script that controls them would still print "created successfully!", then delete the work space after moving an empty tar file over the top of your original fastq.
I then tried fqzcomp, and its really really fast, and the output is tiny (for ENCODE's ENCFF001LCY.fq, which is 600Mb+ after gzip, fqzcomp has got it down to 250Mb) - but it has an unfortunate condition where it can't decompress what it wrote :/
You do not want to be stuck with a "Floating point exception 8" 10 years from now, thats for sure - so i think the answer is "everything is terrible, just stick to lzma" :) You'll only squeeze out a few more Mb with the other tools. If however you start using some of the lossy functions of the other tools, filesize will drop quickly. For example, binning quality scores, renaming the ids to just 1, 2, 3, 4..., converting all Ns to low-quality As, not retaining the original order of the fastq and instead sort entries by what compresses best, etc. But these methods all lose some information, so it might not be that exciting.

-- John

I would be sort of wary of any "lossless" compression that uses floating point anywhere...
I like lzma for personal use, but it seems pretty slow compared to low-compression or multithreaded gzip, for a big-data production environment. Is there a multithreaded implementation?
Also, when you say "not retaining the original order of the fastq and instead sort entries by what compresses best"... technically that's kind of not lossy. At least with Illumina, I think you can probably recover the original order (or close to it) by looking at the read names, and I've never thought the order was important except when diagnosing machine problems.
But if you are willing to discard that information, you might want to try Clumpify , a tool I wrote. It re-orders reads so that sequences sharing kmers are close together, quickly and without any mapping, and using an arbitrarily small amount of memory (how small depends on your system's file handle limit). This allows gzip to compress error-free reads generated from a bacterial genome down to the size of around the bacterial genome, even if there is, say, 40x coverage. This near-perfect compression requires you to replace the quality scores with a fixed value, and give the reads very short names (like 1, 2, 3, etc), and it works much better on long, single-ended reads (or merged pairs). But even with paired reads containing sequencing errors, and using the raw quality scores and names, you get a substantial increase in compression. The output fastq file is still a valid fastq file, and for purposes where you don't care about the order of the sequences (which are most purposes I care about), it will be no different... except faster in most cases like mapping, assembly, or kmer-counting, due to improved caching and branch prediction from similar reads being adjacent. Of course if you rename the reads with numbers you get better compression and can easily recover the original order as well.
P.S. I should note that core-counts are increasing, while IPC and frequencies have stagnated, and essentially been flat for 4 years on workloads I care about. 10 years from now, I expect multithreaded compression and decompression to be very important; a fast program capable of using 128 threads is crippled if it can only decompress at 160 MB/s, roughly the current limit of gzip... let alone lzma, which on my computer is many times slower.
-- Brian Bushnell

Yes, a floating point error for something thats lossless and doesn't contain floating point numbers is a bit weird, but hey - at least it didn't delete my data -_-;
For parallel lzma, the official xz tool (which is the new compressor that does lzma2 compression) has a --threads option, but i've never used it. I'm unsure if you need to decompress with the same number of threads, or what exactly threads means here in terms of speedup. Theres are also a github project called pxz for "parallel xz", which looks like its the lzma cousin of pigz, but however you slice it your point of it not being anywhere close to as fast as gzip for big-data is valid.
Clumpify on the otherhand is an approach I haven't seen anywhere else. Usually in the encoding step before compression, data is split into names/dna/quality to help out the compressor, and converted to binary. You only sort on the DNA and don't convert to binary - which I actually prefer since that's where obscure "floating-point-esk" errors usually crop up, and it's going to be really fast compared to the other methods. And the output is valid FASTQ, and that FASTQ will be faster at being processed by downstream tools. I think thats a really really neat idea, and should probably be something built into the sequencers that are outputting FASTQs in the first place. It might not get the compressed file size down as low as fqzcomp, but I think it answers the OPs question perfectly by being undoubtedly the most reliable method.
-- John

Note: Clumpify.sh is part of the BBMap suite.
-- genomax

Petr Ponomarenko (United States / Los Angeles / ALAPY.com) wrote:

Eric, I am very interested in what have you selected for fastq compression. We developed ALAPY Compressor Lossless ALAPY Fastq Compressor (now with stdin/stdout support) and technically it is reliable based on about 2000 different files that were compressed, decompressed, md5 sums compared and found to be exact in all cases. It is available on GitHub for free https://github.com/ALAPY/alapy_arc as a compiled binary for Linux and Windows. We hope GitHub will be around in 10 years and it will provide the current functionality of distributing these files. Could you please tell us what do you need to consider compression tool reliable and dependable?
Overall I am very interested in your thought about ALAPY Compressor in general, ie if the compression ratio is good, memory usage, features, etc.
-- Petr Ponomarenko

Brian Bushnell wrote:

BAM gives you some compression over fastq.gz (particularly if you map and sort first). And pigz produces gzip files, faster than gzip, which allows you to increase the compression level. There's also bz2, which has a parallel implementation and gives better compression than gz.
If you want to be confident in recovering the data at some point in the future... I would go with one of those rather than something that is not widely used.
-- Brian Bushnell

I've gone the bz2 way for longer term storage. I haven't really studied the compression algorithm in detail so I don't know how much your data affects the level of compression. As recent examples, 2 x 8.9G fastq files compressed into a 3.3G tar.bz2 archive and 2 x 21G fastq files compressed into a 6.9G tar.bz2 archive. So, overall ca. 6X reduction in size. In these cases, the reads were quite heterogenic (QC'd HiSeq-sequenced metagenomes). LZMA could perhaps achieve a better ratio still although memory requirements may prevent its use..
-- 5heikki

Hi, by fastq do you actually mean fastq.gz? Because if you mean fastq, then 8.9G to 3.3G seems less efficient than compressing with gz, or am I wrong?
-- toharl

2x8.9G fastq (total 17.8G) into 3.3G (total), i.e. ca. 6X reduction.
-- 5heikki

In most cases, I do not have access to a reference genome or it would be incomplete and I would lose sequences, so BAM is not an option.
I know about pigz and use it on my computer but compression speed is not the major issue here. Space is. I am not convinced yet I want to go the more risky route of using a less known compression tool without some encouraging user stories from people who handle lots of data.
Thanks for the opinion.
-- Eric Normandeau

You can store unaligned reads in BAM just fine (all the mapping-related information is just blank). The sequences, quality scores, and read names are all there, so it's effectively lossless.
This has been proposed/pushed before (see here ) and discussed in various places like here and here , if you want to see some responses to it. For me, I like the cleanliness of the approach (a "universal" format), but it hasn't really caught on with the datasets and experiments that we see or do.
-- matted

Actually, If you save them as aligned BAM, the BAM files will be larger than fastq.gz, usually. I think. However, maybe you are right non-aligned BAM will be a little smaller compared with fastq.gz.
-- Shicheng Guo
May 22, 2018 | cs.brandeis.edu
The Data Compression Conference (DCC) is an international forum for current work on data compression and related applications.
- Compression of specific types of data (text, images, video, etc.).
- Compression in networking, communications, and storage.
- Applications to bioinformatics.
- Applications to mobile computing.
- Applications to information retrieval.
- Computational issues for compression related applications.
- Inpainting-based compression, perceptual coding.
- Compressed data structures.
- Quantization theory, and vector quantization (VQ).
- Joint source-channel coding.
- Compression related standards.
Both theoretical and experimental work are of interest.
May 21, 2018 | superuser.com
I find myself having to compress a number of very large files (80-ish GB), and I am surprised at the (lack of) speed my system is exhibiting. I get about 500 MB / min conversion speed; using top, I seem to be using a single CPU at approximately 100%.

I am pretty sure it's not (just) disk access speed, since creating a tar file (that's how the 80G file was created) took just a few minutes (maybe 5 or 10), but after more than 2 hours my simple gzip command is still not done.

In summary:

    tar -cvf myStuff.tar myDir/*

took <5 minutes to create an 87 G tar file, while

    gzip myStuff.tar

took two hours and 10 minutes, creating a 55G zip file.

My question: Is this normal? Are there certain options in gzip to speed things up? Would it be faster to concatenate the commands and use tar -cvfz? I saw reference to pigz - Parallel Implementation of GZip - but unfortunately I cannot install software on the machine I am using, so that is not an option for me. See for example this earlier question.
When I have the results of other trials available I will update this question - but if anyone has a particularly good trick available, I would really appreciate it. Maybe the gzip just takes more processing time than I realized...
UPDATE
As promised, I tried the tricks suggested below: change the amount of compression, and change the destination of the file. I got the following results for a tar that was about 4.1GB:
flag  user     system   size    sameDisk
-1    189.77s  13.64s   2.786G  +7.2s
-2    197.20s  12.88s   2.776G  +3.4s
-3    207.03s  10.49s   2.739G  +1.2s
-4    223.28s  13.73s   2.735G  +0.9s
-5    237.79s   9.28s   2.704G  -0.4s
-6    271.69s  14.56s   2.700G  +1.4s
-7    307.70s  10.97s   2.699G  +0.9s
-8    528.66s  10.51s   2.698G  -6.3s
-9    722.61s  12.24s   2.698G  -4.0s

So yes, changing the flag from the default -6 to the fastest -1 gives me a 30% speedup, with (for my data) hardly any change to the size of the compressed file. Whether I'm using the same disk or another one makes essentially no difference (I would have to run this multiple times to get any statistical significance).

If anyone is interested, I generated these timing benchmarks using the following two scripts:
#!/bin/bash
# compare compression speeds with different options
sameDisk='./'
otherDisk='/tmp/'
sourceDir='/dirToCompress'
logFile='./timerOutput'

rm $logFile

for i in {1..9}
do
    /usr/bin/time -a --output=timerOutput ./compressWith $sourceDir $i $sameDisk  $logFile
    /usr/bin/time -a --output=timerOutput ./compressWith $sourceDir $i $otherDisk $logFile
done

And the second script (compressWith):

#!/bin/bash
# use: compressWith sourceDir compressionFlag destinationDisk logFile
echo "compressing $1 to $3 with setting $2" >> $4
tar -c $1 | gzip -$2 > $3test-$2.tar.gz

Three things to note:
- Using /usr/bin/time rather than time, since the built-in command of bash has many fewer options than the GNU command
- I did not bother using the --format option, although that would make the log file easier to read
- I used a script-in-a-script since time seemed to operate only on the first command in a piped sequence (so I made it look like a single command...)

With all this learnt, my conclusions are:
- Speed things up with the -1 flag (accepted answer)
- Much more time is spent compressing the data than reading from disk
- Invest in faster compression software (pigz seems like a good choice)

Thanks everyone who helped me learn all this!

You can change the speed of gzip using --fast, --best or -# where # is a number between 1 and 9 (1 is fastest but less compression, 9 is slowest but more compression). By default gzip runs at level 6. The reason tar takes so little time compared to gzip is that there's very little computational overhead in copying your files into a single file (which is what it does). gzip, on the other hand, is actually using compression algorithms to shrink the tar file. The problem is that gzip is constrained (as you discovered) to a single thread.

Enter pigz, which can use multiple threads to perform the compression. An example of how to use this would be:

tar -c --use-compress-program=pigz -f tar.file dir_to_zip

There is a nice succinct summary of the --use-compress-program option over on a sister site.
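If the tar on hand does not support --use-compress-program, the plain pipe form below should be roughly equivalent; the archive and directory names are placeholders, and -p sets the number of pigz threads.

tar -cf - dir_to_zip | pigz -p 8 > tar.file.gz

# decompression (the inflate step of a .gz stream is inherently single-threaded)
unpigz -c tar.file.gz | tar -xf -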
Thanks for your answer and links. I actually mentioned pigz in the question.

David Spillett:
I seem to be using a single CPU at approximately 100%.
That implies there isn't an I/O performance issue but that the compression is only using one thread (which will be the case with gzip).
If you manage to achieve the access/agreement needed to get other tools installed, then 7zip also supports multiple threads to take advantage of multi core CPUs, though I'm not sure if that extends to the gzip format as well as its own.
If you are stuck using just gzip for the time being and have multiple files to compress, you could try compressing them individually - that way you'll use more of that multi-core CPU by running more than one process in parallel.
Be careful not to overdo it though because as soon as you get anywhere near the capacity of your I/O subsystem performance will drop off precipitously (to lower than if you were using one process/thread) as the latency of head movements becomes a significant bottleneck.
Thanks for your input. You gave me an idea (for which you get an upvote): since I have multiple archives to create, I can just write the individual commands, each followed by a &, and then let the system take care of it from there. Each will run on its own processor, and since I spend far more time on compression than on I/O, it will take the same time to do one as to do all 10 of them. So I get "multi-core performance" from an executable that's single-threaded...
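A minimal sketch of that idea, with placeholder archive names: background one gzip per file and wait for them all, keeping the number of simultaneous jobs at or below the core count so I/O does not become the bottleneck.

for f in archive1.tar archive2.tar archive3.tar archive4.tar; do
    gzip "$f" &          # one single-threaded gzip per file/core
done
wait                     # block until every background gzip has finished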
May 20, 2018 | uppmax.uu.se
Short answer: The best compression using a widely available format is provided by bzip2 and its parallel equivalent pbzip2. The best compression ratio for FastQ is provided by fqz_comp in the fqzcomp/4.6 module. However, this tool is experimental and is not recommended for general, everyday use.
Long answer

We conducted an informal examination of two specialty FastQ compression tools by recompressing an existing fastq.gz file. The first tool, fqzcomp (available in the module fqzcomp/4.6), uses a compiled executable (fqz_comp) that works similar to e.g. gzip, while the second tool (LFQC in the module lfqc/1.1) uses separate ruby-language scripts for compression (lfqc.rb) and decompression (lfqcd.rb). It does not appear the LFQC scripts accept piping, but the documentation is limited.
module load bioinfo-tools
module load fqzcomp/4.6
module load lfqc/1.1

Both modules have 'module help' available for more info. The help for fqzcomp gives the location of their README, which is very helpful in describing minor changes that might occur to the FastQ file during decompression (these do not affect the read name, sequence or quality data).
One thing we changed from the 'standard' implementation of LFQC was to make the scripts stand-alone with #! header lines, rather than requiring e.g. 'ruby lfqc.rb ...' as you see in their documentation.
Since piping is not available with LFQC, it is preferable to avoid creating a large intermediate decompressed FastQ file. So, create a named pipe using mkfifo that is named like a fastq file.
mkfifo UME_081102_P05_WF03.se.fastq
zcat UME_081102_P05_WF03.se.fastq.gz > UME_081102_P05_WF03.se.fastq &
lfqc.rb UME_081102_P05_WF03.se.fastq
rm UME_081102_P05_WF03.se.fastq

This took a long time, 310 wall seconds.
Next, fqz_comp from fqzcomp/4.6. Since this works like gzip, just use it in a pipe.

zcat UME_081102_P05_WF03.se.fastq.gz | fqz_comp > UME_081102_P05_WF03.se.fastq.fqz

This used a little multithreading (up to about 150% CPU) and was much faster than LFQC, just 2-3 seconds. There are other compression options (we tried -s1 and -s9+) but these did not outperform the default (equivalent to -s3). This is not necessarily a surprise; stronger compression means attempting to make better guesses, and sometimes these guesses are not correct. No speedup/slowdown was noticed with other settings, but the input file was relatively small.
-rw-rw-r-- 1 28635466 Mar 10 12:53 UME_081102_P05_WF03.se.fastq.fqz1
-rw-rw-r-- 1 29271063 Mar 10 12:52 UME_081102_P05_WF03.se.fastq.fqz9+
-rw-rw-r-- 1 46156932 Jun  6  2015 UME_081102_P05_WF03.se.fastq.gz
-rw-rw-r-- 1 28015892 Mar 10 12:53 UME_081102_P05_WF03.se.fastq.fqz
-rw-rw-r-- 1 24975360 Mar 10 12:45 UME_081102_P05_WF03.se.fastq.lfqc

We also compared against bzip2 and xz, which are general-use compressors. These both function like gzip (and thus like fqz_comp) and both outperform gzip, as expected. xz is becoming a more widely-used general compressor like bzip2, but for this file it required perhaps 20x as much time as bzip2 and did worse.
-rw-rw-r-- 1 35664555 Mar 10 13:10 UME_081102_P05_WF03.se.fastq.bz2
-rw-rw-r-- 1 36315260 Mar 10 13:10 UME_081102_P05_WF03.se.fastq.xz

Neither of these improved general-use compressors did as well with FastQ as the specialty compressors. This makes sense, given the specialty compressors can take advantage of the restrictions of the format.
Which is the best method in this trial?

From the results of this trial, the tool that provides the best compression ratio in a reasonable amount of time is fqz_comp in the fqzcomp/4.6 module. It is as fast as bzip2 (which is itself much better than gzip) and does a much better job of compressing FastQ. However, fqz_comp is experimental, so we do not recommend it for everyday use. We recommend using bzip2 or its parallel equivalent, pbzip2.
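A minimal example of the recommended pbzip2 route, with a placeholder file name; -p sets the number of processors and -d decompresses (decompression is only parallel for archives that pbzip2 itself produced, and -k keeps the input file).

pbzip2 -p8 reads.fastq          # produces reads.fastq.bz2 and removes the input
pbzip2 -d -p8 reads.fastq.bz2   # restores reads.fastq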
The fqz_comp executable could be used to decompress FastQ within a named pipe if FastQ is required for input:
... <(fqz_comp -d < file.fastq.fqz) ...

Note that fqz_comp is designed to compress FastQ files alone, and neither method here provides the blocked compression format suitable for random access that bgzip does; see http://www.uppmax.uu.se/support/faq/resources-faq/which-compression-format-should-i-use-for-ngs-related-files/ for more on that subject.
Why not LFQC?

Though LFQC has the best compression of FastQ, there are some strong disadvantages. First, it takes quite a long time, perhaps 50x longer than fqz_comp. Second, it apparently cannot be used easily within a pipe like many other compressors. Third, it contains multiple scripts with multiple auxiliary programs, rather than a single executable. Fourth, it is quite verbose during operation, which can be helpful but cannot be turned off. Finally, it was difficult to track down for installation; two different links were provided in the publications and neither worked. It was finally found in a github repository, the location of which is provided in the module help.
May 20, 2018 | dskernel.blogspot.com
2012-11-06 Comparing fastq compression with gzip, bzip2 and xz
Storage is expensive. Compression is a lossless approach to reduce the storage requirements.
Sébastien Boisvert, 2012-11-05

Table 1: Comparison of compression methods on SRR001665_1.fastq -- 10408224 DNA sequences of length 36. Tests were on Fedora 17 x86_64 with an Intel Core i5-3360M processor and an Intel SSDSC2BW180A3 drive. Tests were not run in parallel. The time is the real entry from the time command. Each test was done twice.

Compression                                                    Time                     Size (bytes)
none                                                           0                        2085533064 (100%)
time cat SRR001665_1.fastq | gzip -9 > SRR001665_1.fastq.gz    7m31.519s / 7m20.340s    376373619 (18%)
time cat SRR001665_1.fastq | bzip2 -9 > SRR001665_1.fastq.bz2  3m12.601s / 3m25.243s    295000140 (14%)
time cat SRR001665_1.fastq | xz -9 > SRR001665_1.fastq.xz      32m45.933s               257621508 (12%)

Table 2: Decompression tests. Each test was run twice.

Decompression                                                  Time
time cat SRR001665_1.fastq.gz | gunzip > SRR001665_1.fastq     0m14.612s / 0m13.247s
time cat SRR001665_1.fastq.bz2 | bunzip2 > SRR001665_1.fastq   1m3.412s / 1m4.337s
time cat SRR001665_1.fastq.xz | unxz > SRR001665_1.fastq       0m24.194s / 0m23.923s
It is strange that bzip2 is faster than gzip for compression.
Posted by Sébastien Boisvert at 00:07. 1 comment:
Mark Ziemann said: pbzip2 is very fast. (Monday, November 26, 2012 at 11:45:00 PM EST)
Jun 10, 2016 | link.springer.com
The Journal of Supercomputing, December 2016, Volume 72, Issue 12, pp 4696–4717
We present an experimental performance comparison of lossless compression programs for DNA raw data in FASTQ format files. General-purpose (PBZIP2, P7ZIP and PIGZ) and domain-specific compressors (SCALCE, QUIP, FASTQZ and DSRC) were analyzed in terms of compression ratio, execution speed, parallel scalability and memory consumption. Results showed that domain-specific tools increased the compression ratios up to 70%, while reducing the runtime of general-purpose tools up to 7× during compression and up to 3× during decompression. Parallelism scaled performance up to 13× when using 20 threads.
Our analysis indicates that QUIP, DSRC and PBZIP2 are the best tools in their respective categories, with acceptable memory requirements.
Nevertheless, the end user must consider the features of available hardware and define the priorities among its optimization objectives (compression ratio, runtime during compression or decompression, scalability, etc.) to properly select the best application for each particular scenario.
May 04, 2018 | unix.stackexchange.com
のbるしtyぱんky, Jan 6, 2014 at 18:39

More and more tar archives use the xz format based on LZMA2 for compression instead of the traditional bzip2 (bz2) compression. In fact kernel.org made a late "Good-bye bzip2" announcement, 27th Dec. 2013, indicating kernel sources would from this point on be released in both tar.gz and tar.xz format - and on the main page of the website what's directly offered is in tar.xz.

Are there any specific reasons explaining why this is happening, and what is the relevance of gzip in this context?
For distributing archives over the Internet, the following things are generally a priority:
- Compression ratio (i.e., how small the compressor makes the data);
- Decompression time (CPU requirements);
- Decompression memory requirements; and
- Compatibility (how wide-spread the decompression program is)
Compression memory & CPU requirements aren't very important, because you can use a large fast machine for that, and you only have to do it once.
Compared to bzip2, xz has a better compression ratio and lower (better) decompression time. It, however -- at the compression settings typically used -- requires more memory to decompress [1] and is somewhat less widespread. Gzip uses less memory than either.
So, both gzip and xz format archives are posted, allowing you to pick:
- Need to decompress on a machine with very limited memory (<32 MB): gzip. Given, not very likely when talking about kernel sources.
- Need to decompress minimal tools available: gzip
- Want to save download time and/or bandwidth: xz
There isn't really a realistic combination of factors that'd get you to pick bzip2, so it's being phased out.
I looked at compression comparisons in a blog post. I didn't attempt to replicate the results, and I suspect some of it has changed (mostly, I expect xz has improved, as it's the newest).

(There are some specific scenarios where a good bzip2 implementation may be preferable to xz: bzip2 can compress a file with lots of zeros and genome DNA sequences better than xz. Newer versions of xz now have an (optional) block mode which allows data recovery after the point of corruption and parallel compression and [in theory] decompression. Previously, only bzip2 offered these. [2] However none of these are relevant for kernel distribution.)
1: In archive size, xz -3 is around bzip -9. Then xz uses less memory to decompress. But xz -9 (as, e.g., used for Linux kernel tarballs) uses much more than bzip -9. (And even xz -0 needs more than gzip -9.)

2: F21 System Wide Change: lbzip2 as default bzip2 implementation
First of all, this question is not directly related to tar. Tar just creates an uncompressed archive; the compression is then applied later on.

Gzip is known to be relatively fast when compared to LZMA2 and bzip2. If speed matters, gzip (especially the multithreaded implementation pigz) is often a good compromise between compression speed and compression ratio, although there are alternatives if speed is an issue (e.g. LZ4).

However, if a high compression ratio is desired, LZMA2 beats bzip2 in almost every aspect. The compression speed is often slower, but it decompresses much faster and provides a much better compression ratio at the cost of higher memory usage.

There is not much reason to use bzip2 any more, except for backwards compatibility. Furthermore, LZMA2 was designed with multithreading in mind and many implementations by default make use of multicore CPUs (unfortunately xz on Linux does not do this, yet). This makes sense since the clock speeds won't increase any more but the number of cores will.

There are multithreaded bzip2 implementations (e.g. pbzip), but they are often not installed by default. Also note that multithreaded bzip2 only really pays off while compressing, whereas decompression uses a single thread if the file was compressed using a single-threaded bzip2, in contrast to LZMA2. Parallel bzip2 variants can only leverage multicore CPUs if the file was compressed using a parallel bzip2 version, which is often not the case.

Slyx, Jan 6, 2014 at 19:14
Short answer : xz is more efficient in terms of compression ratio. So it saves disk space and optimizes the transfer through the network.
You can see this Quick Benchmark so as to discover the difference by practical tests.

Mark Warburton, Apr 14, 2016 at 14:15
LZMA2 is a block compression system whereas gzip is not. This means that LZMA2 lends itself to multi-threading. Also, if corruption occurs in an archive, you can generally recover data from subsequent blocks with LZMA2 but you cannot do this with gzip. In practice, you lose the entire archive with gzip subsequent to the corrupted block. With an LZMA2 archive, you only lose the file(s) affected by the corrupted block(s). This can be important in larger archives with multiple files.
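As a side note to the answers above (which predate it), xz gained its own multi-threaded, block-based mode with the -T option in version 5.2; a rough sketch with a placeholder file name:

xz -T0 big.tar          # -T0 uses all available cores and splits the data into blocks
xz -d -T0 big.tar.xz    # threaded decompression needs xz >= 5.4 and a multi-block archive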
May 04, 2018 | www.rootusers.com
Gzip vs Bzip2 vs XZ Performance Comparison

Posted by Jarrod on September 17, 2015

Gzip, Bzip2 and XZ are all popular compression tools used in UNIX based operating systems, but which should you use? Here we are going to benchmark and compare them against each other to get an idea of the trade off between the level of compression and time taken to achieve it.
For further information on how to use gzip, bzip2 or xz see our guides below:
The Test Server

The test server was running CentOS 7.1.1503 with kernel 3.10.0-229.11.1 in use, with all updates fully applied. The server had 4 CPU cores and 16GB of available memory. During the tests only one CPU core was used, as all of these tools run single threaded by default, and while testing this CPU core would be fully utilized. With XZ it is possible to specify the number of threads to run, which can greatly increase performance; for further information see example 9 here.
All tests were performed on linux-3.18.19.tar, a copy of the Linux kernel from kernel.org . This file was 580,761,600 Bytes in size prior to compression.
The Benchmarking Process

The linux-3.18.19.tar file was compressed and decompressed 9 times each by gzip, bzip2 and xz at each available compression level from 1 to 9. A compression level of 1 indicates that the compression will be fastest but the compression ratio will not be as high, so the file size will be larger. Compression level 9 on the other hand is the best possible compression level, however it will take the longest amount of time to complete.
There is an important trade off here between CPU processing time and the compression ratio. To get a higher compression ratio and save a greater amount of disk space, more CPU processing time will be required. To reduce CPU processing time, a lower compression level can be used, which will result in a lower compression ratio and use more disk space.
Each time the compression or decompression command was run, the 'time' command was placed in front so that we could accurately measure how long the command took to execute.
Below are the commands that were run for compression level 1:
time bzip2 -1v linux-3.18.19.tar
time gzip -1v linux-3.18.19.tar
time xz -1v linux-3.18.19.tar

All commands were run with the time command, verbosity and the compression level of -1, which was stepped through incrementally up to -9. To decompress, the same command was used with the -d flag.
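A rough sketch of reproducing this sweep for one of the tools (xz here); the tar name matches the article, while the working-copy name and the timing format are just placeholders for however you prefer to record results.

for level in $(seq 1 9); do
    cp linux-3.18.19.tar work.tar
    /usr/bin/time -f "xz -$level: %e seconds" xz -$level work.tar   # leaves work.tar.xz
    rm -f work.tar.xz
done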
The versions of these tools were gzip 1.5, bzip2 1.0.6, and xz (XZ Utils) 5.1.2alpha.
Results

The raw data that the graphs below have been created from is provided in tables below and can also be accessed in this spreadsheet.
Compressed Size

The table below indicates the size in bytes of the linux-3.18.19.tar file after compression; the first column, numbered 1..9, shows the compression level passed to the compression tool.

Level  gzip       bzip2      xz
1      153617925  115280806  105008672
2      146373307  107406491  100003484
3      141282888  103787547  97535320
4      130951761  101483135  92377556
5      125581626  100026953  85332024
6      123434238  98815384   83592736
7      122808861  97966560   82445064
8      122412099  97146072   81462692
9      122349984  96552670   80708748

Compression Time

First we'll start with the compression time. The table below shows how long (in seconds) the compression took to complete at each compression level 1 through 9.

Level  gzip    bzip2   xz
1      13.213  78.831  48.473
2      14.003  77.557  65.203
3      16.341  78.279  97.223
4      17.801  79.202  196.146
5      22.722  80.394  310.761
6      30.884  81.516  383.128
7      37.549  82.199  416.965
8      48.584  81.576  451.527
9      54.307  82.812  500.859

[Graph: Gzip vs Bzip2 vs XZ Compression Time]
So far we can see that gzip takes slightly longer to complete as the compression level increases, bzip2 does not change very much, while xz increases quite significantly after a compression level of 3.
Compression Ratio

Now that we have an idea of how long the compression took, we can compare this with how well the file compressed. The compression ratio represents the percentage that the file has been reduced to. For example, if a 100mb file has been compressed with a compression ratio of 25%, the compressed version of the file is 25mb.

Level  gzip   bzip2  xz
1      26.45  19.8   18.08
2      25.2   18.49  17.21
3      24.32  17.87  16.79
4      22.54  17.47  15.9
5      21.62  17.22  14.69
6      21.25  17.01  14.39
7      21.14  16.87  14.19
8      21.07  16.73  14.02
9      21.06  16.63  13.89

[Graph: Gzip vs Bzip2 vs XZ Compression Ratio]

The overall trend is that the higher the compression level, the lower the compression ratio, meaning the overall file size is smaller. In this case xz always provides the best compression ratio, closely followed by bzip2, with gzip coming in last; however, as shown in the compression time table, xz takes a lot longer to get these results after compression level 3.
Compression Speed

The compression speed in MB per second can also be observed.

Level  gzip   bzip2  xz
1      43.95  7.37   11.98
2      41.47  7.49   8.9
3      35.54  7.42   5.97
4      32.63  7.33   2.96
5      25.56  7.22   1.87
6      18.8   7.12   1.52
7      15.47  7.07   1.39
8      11.95  7.12   1.29
9      10.69  7.01   1.16

[Graph: Gzip vs Bzip2 vs XZ Compression Speed]

Decompression Time

Next up is how long each file compressed at a particular compression level took to decompress.

Level  gzip   bzip2   xz
1      6.771  24.23   13.251
2      6.581  24.101  12.407
3      6.39   23.955  11.975
4      6.313  24.204  11.801
5      6.153  24.513  11.08
6      6.078  24.768  10.911
7      6.057  23.199  10.781
8      6.033  25.426  10.676
9      6.026  23.486  10.623

[Graph: Gzip vs Bzip2 vs XZ Decompression Time]
In all cases the file decompressed faster if it had been compressed with a higher compression level. Therefore if you are going to be serving out a compressed file over the Internet multiple times it may be worth compressing it with xz with a compression level of 9 as this will both reduce bandwidth over time when transferring the file, and will also be faster for everyone to decompress.
Decompression Speed

The decompression speed in MB per second can also be observed.

Level  gzip   bzip2  xz
1      85.77  23.97  43.83
2      88.25  24.1   46.81
3      90.9   24.24  48.5
4      91.99  24     49.21
5      94.39  23.7   52.42
6      95.55  23.45  53.23
7      95.88  25.03  53.87
8      96.26  22.84  54.4
9      96.38  24.72  54.67

[Graph: Gzip vs Bzip2 vs XZ Decompression Speed]

Performance Differences and Comparison
By default, when the compression level is not specified, gzip uses -6, bzip2 uses -9 and xz uses -6. The reason for this is pretty clear based on the results. For gzip and xz, -6 as a default provides a good level of compression yet does not take too long to complete; it's a fair trade off point, as higher compression levels take longer to process. Bzip2, on the other hand, is best used with the default compression level of 9, as is also recommended in the manual page. The results here confirm this: the compression improves, but the time taken is almost the same and differs by less than a second between levels 1 and 9.
In general xz achieves the best compression level, followed by bzip2 and then gzip. In order to achieve better compression however xz usually takes the longest to complete, followed by bzip2 and then gzip.
xz takes a lot more time with its default compression level of 6, while bzip2 only takes a little longer than gzip at compression level 9 and compresses a fair amount better. The difference between bzip2 and xz is smaller than the difference between bzip2 and gzip, making bzip2 a good trade off for compression.

Interestingly, the lowest xz compression level of 1 compresses better than gzip at its highest level of 9 and even completes faster. You can therefore use xz with a compression level of 1 instead of gzip to get a better compression ratio in less time.

Based on these results, bzip2 is a good middle ground for compression: gzip is only a little faster, while xz may not really be worth it at its default compression level of 6, as it takes much longer to complete for little extra gain.

However, decompressing with bzip2 takes much longer than with xz or gzip; xz is a good middle ground here, while gzip is again the fastest.
Conclusion

So which should you use? It's going to come down to using the right tool for the job and the particular data set that you are working with.
If you are interactively compressing files on the fly then you may want to do this quickly with gzip -6 (the default compression level) or xz -1. However, if you're configuring log rotation that will run automatically overnight during a low resource usage period, then it may be acceptable to use more CPU resources with xz -9 to save the greatest amount of space possible. For instance, kernel.org compresses the Linux kernel with xz: spending extra time to compress the file well once makes sense when it will be downloaded and decompressed thousands of times, resulting in bandwidth savings yet still decent decompression speeds.
Based on the results here, if you're simply after being able to compress and decompress files as fast as possible with little regard to the compression ratio, then gzip is the tool for you. If you want a better compression ratio to save more disk space and are willing to spend extra processing time to get it then xz will be best to use. Although xz takes the longest to compress at higher compression levels, it has a fairly good decompression speed and compresses quite fast at lower levels. Bzip2 provides a good trade off between compression ratio and processing speed however it takes the longest to decompress so it may be a good option if the content that is being compressed will be infrequently decompressed.
In the end the best option will come down to what you're after between processing time and compression ratio. With disk space continually becoming cheaper and available in larger sizes you may be fine with saving some CPU resources and processing time to store slightly larger files. Regardless of the tool that you use, compression is a great resource for saving storage space.
May 04, 2018 | en.wikipedia.org
Lempel–Ziv–Markov chain algorithm
The Lempel–Ziv–Markov chain algorithm (LZMA) is an algorithm used to perform lossless data compression. It has been under development either since 1996 or 1998 [1] and was first used in the 7z format of the 7-Zip archiver. This algorithm uses a dictionary compression scheme somewhat similar to the LZ77 algorithm published by Abraham Lempel and Jacob Ziv in 1977 and features a high compression ratio (generally higher than bzip2) [2] [3] and a variable compression-dictionary size (up to 4 GB), [4] while still maintaining decompression speed similar to other commonly used compression algorithms. [5]
LZMA2 is a simple container format that can include both uncompressed data and LZMA data, possibly with multiple different LZMA encoding parameters. LZMA2 supports arbitrarily scalable multithreaded compression and decompression and efficient compression of data which is partially incompressible.
Overview
LZMA uses a dictionary compression algorithm (a variant of LZ77 with huge dictionary sizes and special support for repeatedly used match distances), whose output is then encoded with a range encoder, using a complex model to make a probability prediction of each bit. The dictionary compressor finds matches using sophisticated dictionary data structures, and produces a stream of literal symbols and phrase references, which is encoded one bit at a time by the range encoder: many encodings are possible, and a dynamic programming algorithm is used to select an optimal one under certain approximations.
Prior to LZMA, most encoder models were purely byte-based (i.e. they coded each bit using only a cascade of contexts to represent the dependencies on previous bits from the same byte). The main innovation of LZMA is that instead of a generic byte-based model, LZMA's model uses contexts specific to the bitfields in each representation of a literal or phrase: this is nearly as simple as a generic byte-based model, but gives much better compression because it avoids mixing unrelated bits together in the same context. Furthermore, compared to classic dictionary compression (such as the one used in zip and gzip formats), the dictionary sizes can be and usually are much larger, taking advantage of the large amount of memory available on modern systems.
Compressed format overview
In LZMA compression, the compressed stream is a stream of bits, encoded using an adaptive binary range coder. The stream is divided into packets, each packet describing either a single byte, or an LZ77 sequence with its length and distance implicitly or explicitly encoded. Each part of each packet is modeled with independent contexts, so the probability predictions for each bit are correlated with the values of that bit (and related bits from the same field) in previous packets of the same type.
There are 7 types of packets:

packed code (bit sequence)  packet name  packet description
0 + byteCode                LIT          A single byte encoded using an adaptive binary range coder.
1+0 + len + dist            MATCH        A typical LZ77 sequence describing sequence length and distance.
1+1+0+0                     SHORTREP     A one-byte LZ77 sequence. Distance is equal to the last used LZ77 distance.
1+1+0+1 + len               LONGREP[0]   An LZ77 sequence. Distance is equal to the last used LZ77 distance.
1+1+1+0 + len               LONGREP[1]   An LZ77 sequence. Distance is equal to the second last used LZ77 distance.
1+1+1+1+0 + len             LONGREP[2]   An LZ77 sequence. Distance is equal to the third last used LZ77 distance.
1+1+1+1+1 + len             LONGREP[3]   An LZ77 sequence. Distance is equal to the fourth last used LZ77 distance.

LONGREP[*] refers to LONGREP[0-3] packets, *REP refers to both LONGREP and SHORTREP, and *MATCH refers to both MATCH and *REP.
LONGREP[n] packets remove the distance used from the list of the most recent distances and reinsert it at the front, to avoid useless repeated entry, while MATCH just adds the distance to the front even if already present in the list and SHORTREP and LONGREP[0] don't alter the list.
The length is encoded as follows:
Length code (bit sequence)  Description
0+ 3 bits                   The length encoded using 3 bits, gives the lengths range from 2 to 9.
1+0+ 3 bits                 The length encoded using 3 bits, gives the lengths range from 10 to 17.
1+1+ 8 bits                 The length encoded using 8 bits, gives the lengths range from 18 to 273.

As in LZ77, the length is not limited by the distance, because copying from the dictionary is defined as if the copy was performed byte by byte, keeping the distance constant.
Distances are logically 32-bit and distance 0 points to the most recently added byte in the dictionary.
The distance encoding starts with a 6-bit "distance slot", which determines how many further bits are needed. Distances are decoded as a binary concatenation of, from most to least significant, two bits depending on the distance slot, some bits encoded with fixed 0.5 probability, and some context encoded bits, according to the following table (distance slots 0−3 directly encode distances 0−3).
6-bit distance slot  Highest 2 bits  Fixed 0.5 probability bits  Context encoded bits
0                    00              0                           0
1                    01              0                           0
2                    10              0                           0
3                    11              0                           0
4                    10              0                           1
5                    11              0                           1
6                    10              0                           2
7                    11              0                           2
8                    10              0                           3
9                    11              0                           3
10                   10              0                           4
11                   11              0                           4
12                   10              0                           5
13                   11              0                           5
14–62 (even)         10              ((slot / 2) − 5)            4
15–63 (odd)          11              (((slot − 1) / 2) − 5)      4

Decompression algorithm details

No complete natural language specification of the compressed format seems to exist, other than the one attempted in the following text.
The description below is based on the compact XZ Embedded decoder by Lasse Collin included in the Linux kernel source [6] from which the LZMA and LZMA2 algorithm details can be relatively easily deduced: thus, while citing source code as reference isn't ideal, any programmer should be able to check the claims below with a few hours of work.
Range coding of bits

LZMA data is at the lowest level decoded one bit at a time by the range decoder, at the direction of the LZMA decoder.
Context-based range decoding is invoked by the LZMA algorithm passing it a reference to the "context", which consists of the unsigned 11-bit variable prob (typically implemented using a 16-bit data type) representing the predicted probability of the bit being 0, which is read and updated by the range decoder (and should be initialized to 2^10, representing 0.5 probability).
Fixed probability range decoding instead assumes a 0.5 probability, but operates slightly differently from context-based range decoding.
The range decoder state consists of two unsigned 32-bit variables, range (representing the range size), and code (representing the encoded point within the range).
Initialization of the range decoder consists of setting range to 2^32 − 1, and code to the 32-bit value starting at the second byte in the stream interpreted as big-endian; the first byte in the stream is completely ignored.
Normalization proceeds in this way:
- Shift both range and code left by 8 bits
- Read a byte from the compressed stream
- Set the least significant 8 bits of code to the byte value read
Context-based range decoding of a bit using the prob probability variable proceeds in this way:
- If range is less than 2^24, perform normalization
- Set bound to floor( range / 2^11) * prob
- If code is less than bound :
- Set range to bound
- Set prob to prob + floor((2^11 - prob ) / 2^5)
- Return bit 0
- Otherwise (if code is greater than or equal to the bound ):
- Set range to range - bound
- Set code to code - bound
- Set prob to prob - floor( prob / 2^5)
- Return bit 1
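As a purely illustrative transcription of the steps above into shell arithmetic (a real decoder would be written in C), the sketch below uses made-up variable names and a made-up input buffer; it restates normalization and one context-based bit decode exactly as listed.

MASK32=$(( (1 << 32) - 1 ))

# made-up example stream: byte 0 is ignored, bytes 1-4 initialize code (big-endian)
bytes=(0x00 0x12 0x34 0x56 0x78 0x9a 0xbc 0xde 0xf0)
pos=5                       # index of the next byte to read after initialization
range=$MASK32               # 2^32 - 1
code=$(( 0x12345678 ))      # bytes 1-4 of the stream
prob=1024                   # 11-bit probability variable, initialized to 2^10 (p = 0.5)

normalize() {
    range=$(( (range << 8) & MASK32 ))
    code=$(( ((code << 8) | bytes[pos]) & MASK32 ))
    pos=$(( pos + 1 ))
}

decode_bit() {              # context-based decode driven by the global prob
    (( range < (1 << 24) )) && normalize
    local bound=$(( (range >> 11) * prob ))
    if (( code < bound )); then
        range=$bound
        prob=$(( prob + ( ((1 << 11) - prob) >> 5 ) ))
        echo 0
    else
        range=$(( range - bound ))
        code=$(( code - bound ))
        prob=$(( prob - (prob >> 5) ))
        echo 1
    fi
}

decode_bit; decode_bit; decode_bit    # prints three decoded bits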
Fixed-probability range decoding of a bit proceeds in this way:
- If range is less than 2^24, perform normalization
- Set range to floor( range / 2)
- If code is less than range :
- Return bit 0
- Otherwise (if code is greater than or equal to range ):
- Set code to code - range
- Return bit 1
The Linux kernel implementation of fixed-probability decoding in rc_direct, for performance reasons, doesn't include a conditional branch, but instead subtracts range from code unconditionally, and uses the resulting sign bit to both decide the bit to return, and to generate a mask that is combined with code and added to range .
Note that:

- The division by 2^11 when computing bound and the floor operation are done before the multiplication, not after (apparently to avoid requiring fast hardware support for 32-bit multiplication with a 64-bit result)
- Fixed probability decoding is not strictly equivalent to context-based range decoding with any prob value, due to the fact that context-based range decoding discards the lower 11 bits of range before multiplying by prob as just described, while fixed probability decoding only discards the last bit

Range coding of integers
The range decoder also provides the bit-tree, reverse bit-tree and fixed probability integer decoding facilities, which are used to decode integers, and generalize the single-bit decoding described above. To decode unsigned integers less than limit , an array of ( limit − 1) 11-bit probability variables is provided, which are conceptually arranged as the internal nodes of a complete binary tree with limit leaves.
Non-reverse bit-tree decoding works by keeping a pointer to the tree of variables, which starts at the root. As long as the pointer doesn't point to a leaf, a bit is decoded using the variable indicated by the pointer, and the pointer is moved to either the left or right children depending on whether the bit is 0 or 1; when the pointer points to a leaf, the number associated with the leaf is returned.
Non-reverse bit-tree decoding thus happens from most significant to least significant bit, stopping when only one value in the valid range is possible (this conceptually allows range sizes that are not powers of two, even though LZMA doesn't make use of this).
Reverse bit-tree decoding instead decodes from least significant bit to most significant bits, and thus only supports ranges that are powers of two, and always decodes the same number of bits. It is equivalent to performing non-reverse bittree decoding with a power of two limit , and reversing the last log2( limit ) bits of the result.
Note that in the rc_bittree function in the Linux kernel, integers are actually returned in the [ limit , 2 * limit ) range (with limit added to the conceptual value), and the variable at index 0 in the array is unused, while the one at index 1 is the root, and the left and right children indices are computed as 2i and 2i + 1. The rc_bittree_reverse function instead adds integers in the [0, limit ) range to a caller-provided variable, where limit is implicitly represented by its logarithm, and has its own independent implementation for efficiency reasons.
Fixed probability integer decoding simply performs fixed probability bit decoding repeatedly, reading bits from the most to the least significant.
LZMA configuration

The LZMA decoder is configured by an lclppb "properties" byte and a dictionary size.
The value of the lclppb byte is lc + lp * 9 + pb * 9 * 5, where:
- lc is the number of high bits of the previous byte to use as a context for literal encoding (the default value used by the LZMA SDK is 3)
- lp is the number of low bits of the dictionary position to include in literal_pos_state (the default value used by the LZMA SDK is 0)
- pb is the number of low bits of the dictionary position to include in pos_state (the default value used by the LZMA SDK is 2)
In non-LZMA2 streams, lc must not be greater than 8, and lp and pb must not be greater than 4. In LZMA2 streams, ( lc + lp ) and pb must not be greater than 4.
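A quick arithmetic check of that packing, using the LZMA SDK defaults quoted above (lc=3, lp=0, pb=2, which pack to 93); purely illustrative shell arithmetic:

b=93                    # lclppb = lc + lp*9 + pb*9*5
lc=$(( b % 9 ))
lp=$(( (b / 9) % 5 ))
pb=$(( b / 45 ))
echo "lc=$lc lp=$lp pb=$pb"    # prints lc=3 lp=0 pb=2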
In the 7-zip LZMA file format, configuration is performed by a header containing the "properties" byte followed by the 32-bit little-endian dictionary size in bytes. In LZMA2, the properties byte can optionally be changed at the start of LZMA2 LZMA packets, while the dictionary size is specified in the LZMA2 header as later described.
LZMA coding contexts

The LZMA packet format has already been described, and this section specifies how LZMA statistically models the LZ-encoded streams, or in other words which probability variables are passed to the range decoder to decode each bit.
Those probability variables are implemented as multi-dimensional arrays; before introducing them, a few values that are used as indices in these multidimensional arrays are defined.
The state value is conceptually based on which of the patterns in the following table match the latest 2-4 packet types seen, and is implemented as a state machine state updated according to the transition table listed in the table every time a packet is output.
The initial state is 0, and thus packets before the beginning are assumed to be LIT packets.
state  previous packets (older -> most recent)  next state when next packet is
                                                LIT  MATCH  LONGREP[*]  SHORTREP
0      LIT, LIT, LIT                            0    7      8           9
1      MATCH, LIT, LIT                          0    7      8           9
2      LONGREP[*], LIT, LIT
       or *MATCH, SHORTREP, LIT, LIT            0    7      8           9
3      LIT, SHORTREP, LIT, LIT                  0    7      8           9
4      MATCH, LIT                               1    7      8           9
5      LONGREP[*], LIT
       or *MATCH, SHORTREP, LIT                 2    7      8           9
6      LIT, SHORTREP, LIT                       3    7      8           9
7      LIT, MATCH                               4    10     11          11
8      LIT, LONGREP[*]                          5    10     11          11
9      LIT, SHORTREP                            6    10     11          11
10     *MATCH, MATCH                            4    10     11          11
11     *MATCH, *REP                             5    10     11          11

The pos_state and literal_pos_state values consist of respectively the pb and lp (up to 4, from the LZMA header or LZMA2 properties packet) least significant bits of the dictionary position (the number of bytes coded since the last dictionary reset modulo the dictionary size). Note that the dictionary size is normally a multiple of a large power of 2, so these values are equivalently described as the least significant bits of the number of uncompressed bytes seen since the last dictionary reset.
The prev_byte_lc_msbs value is set to the lc (up to 4, from the LZMA header or LZMA2 properties packet) most significant bits of the previous uncompressed byte.
The is_REP value denotes whether a packet that includes a length is a LONGREP rather than a MATCH.
The match_byte value is the byte that would have been decoded if a SHORTREP packet had been used (in other words, the byte found at the dictionary at the last used distance); it is only used just after a *MATCH packet.
literal_bit_mode is an array of 8 values in the 0-2 range, one for each bit position in a byte, which are 1 or 2 if the previous packet was a *MATCH and it is either the most significant bit position or all the more significant bits in the literal to encode/decode are equal to the bits in the corresponding positions in match_byte , while otherwise it is 0; the choice between the 1 or 2 values depends on the value of the bit at the same position in match_byte .
The literal/Literal set of variables can be seen as a "pseudo-bit-tree" similar to a bit-tree but with 3 variables instead of 1 in every node, chosen depending on the literal_bit_mode value at the bit position of the next bit to decode after the bit-tree context denoted by the node.
The claim, found in some sources, that literals after a *MATCH are coded as the XOR of the byte value with match_byte is incorrect; they are instead coded simply as their byte value, but using the pseudo-bit-tree just described and the additional context listed in the table below.
The probability variable groups used in LZMA are those:
- is_match (IsMatch): parameterized by state and pos_state; used at packet start; codes one bit: 0 = LIT, 1 = *MATCH
- is_rep (IsRep): parameterized by state; used after bit sequence 1; codes one bit: 0 = MATCH, 1 = *REP
- is_rep0 (IsRepG0): parameterized by state; used after bit sequence 11; codes one bit: 0 = SHORTREP/LONGREP[0], 1 = LONGREP[1-3]
- is_rep0_long (IsRep0Long): parameterized by state and pos_state; used after bit sequence 110; codes one bit: 0 = SHORTREP, 1 = LONGREP[0]
- is_rep1 (IsRepG1): parameterized by state; used after bit sequence 111; codes one bit: 0 = LONGREP[1], 1 = LONGREP[2/3]
- is_rep2 (IsRepG2): parameterized by state; used after bit sequence 1111; codes one bit: 0 = LONGREP[2], 1 = LONGREP[3]
- literal (Literal): parameterized by prev_byte_lc_msbs, literal_pos_state, literal_bit_mode[bit position] and the bit-tree context; used after bit sequence 0; codes the literal byte value as a 256-value pseudo-bit-tree
- dist_slot (PosSlot): parameterized by min(match_length, 5) and the bit-tree context; used at the start of a distance; codes the distance slot as a 64-value bit-tree
- dist_special (SpecPos): parameterized by distance_slot and the reverse bit-tree context; used for distance slots 4-13; codes the low bits of the distance as a ((distance_slot >> 1) − 1)-bit reverse bit-tree
- dist_align (Align): parameterized by the reverse bit-tree context; used for distance slots 14+, after the fixed probability bits; codes the low bits of the distance as a 4-bit reverse bit-tree
- len_dec.choice (LenChoice): parameterized by is_REP; used at match length start; codes one bit: 0 = length 2-9, 1 = length 10+
- len_dec.choice2 (LenChoice2): parameterized by is_REP; used after bit sequence 1; codes one bit: 0 = length 10-17, 1 = length 18+
- len_dec.low (LenLow): parameterized by is_REP, pos_state and the bit-tree context; used after bit sequence 0; codes the low bits of the length as an 8-value bit-tree
- len_dec.mid (LenMid): parameterized by is_REP, pos_state and the bit-tree context; used after bit sequence 10; codes the middle bits of the length as an 8-value bit-tree
- len_dec.high (LenHigh): parameterized by is_REP and the bit-tree context; used after bit sequence 11; codes the high bits of the length as a 256-value bit-tree

LZMA2 format

The LZMA2 container supports multiple runs of compressed LZMA data and uncompressed data. Each LZMA compressed run can have a different LZMA configuration and dictionary. This improves the compression of partially or completely incompressible files and allows multithreaded compression and multithreaded decompression by breaking the file into runs that can be compressed or decompressed independently in parallel.
The LZMA2 header consists of a byte indicating the dictionary size:
- 40 indicates a 4 GB − 1 dictionary size
- Even values v less than 40 indicate a 2^(v/2 + 12) bytes dictionary size
- Odd values v less than 40 indicate a 3 × 2^((v − 1)/2 + 11) bytes dictionary size
- Values higher than 40 are invalid
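A throwaway shell function (the name is made up) that just restates the dictionary-size rule above in arithmetic form:

lzma2_dict_size() {
    local v=$1
    if   (( v >  40 )); then echo "invalid" >&2; return 1
    elif (( v == 40 )); then echo $(( (1 << 32) - 1 ))            # 4 GB - 1
    elif (( v % 2 == 0 )); then echo $(( 2 << (v / 2 + 11) ))     # 2^(v/2 + 12)
    else echo $(( 3 << ((v - 1) / 2 + 11) ))                      # 3 * 2^((v-1)/2 + 11)
    fi
}
lzma2_dict_size 0     # 4096
lzma2_dict_size 1     # 6144
lzma2_dict_size 40    # 4294967295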
LZMA2 data consists of packets starting with a control byte, with the following values:
- 0 denotes the end of the file
- 1 denotes a dictionary reset followed by an uncompressed chunk
- 2 denotes an uncompressed chunk without a dictionary reset
- 3-0x7f are invalid values
- 0x80-0xff denotes an LZMA chunk, where the lowest 5 bits are used as bits 16-20 of the uncompressed size minus one, and bits 5-6 indicate what should be reset
Bits 5-6 for LZMA chunks can be:
- 0: nothing reset
- 1: state reset
- 2: state reset, properties reset using properties byte
- 3: state reset, properties reset using properties byte, dictionary reset
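The same kind of throwaway sketch for the control byte (decode_lzma2_control is a made-up name), simply restating the rules above:

decode_lzma2_control() {
    local b=$1
    if   (( b == 0 ));   then echo "end of file"
    elif (( b == 1 ));   then echo "uncompressed chunk, dictionary reset"
    elif (( b == 2 ));   then echo "uncompressed chunk, no dictionary reset"
    elif (( b < 0x80 )); then echo "invalid control byte"
    else
        echo "LZMA chunk: size bits 16-20 = $(( b & 0x1f )), reset mode = $(( (b >> 5) & 0x03 ))"
    fi
}
decode_lzma2_control $(( 0xe0 ))    # LZMA chunk, reset mode 3 (state, properties and dictionary reset)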
LZMA state resets cause a reset of all LZMA state except the dictionary, and specifically:
- The range coder
- The state value
- The last distances for repeated matches
- All LZMA probabilities
Uncompressed chunks consist of:
- A 16-bit big-endian value encoding the data size minus one
- The data to be copied verbatim into the dictionary and the output
LZMA chunks consist of:
- A 16-bit big-endian value encoding the low 16 bits of the uncompressed size minus one
- A 16-bit big-endian value encoding the compressed size minus one
- A properties/lclppb byte if bit 6 in the control byte is set
- The LZMA compressed data, starting with the 5 bytes (of which the first is ignored) used to initialize the range coder (which are included in the compressed size)

xz and 7z formats
The .xz format, which can contain LZMA2 data, is documented at tukaani.org, [7] while the .7z file format, which can contain either LZMA or LZMA2 data, is documented in the 7zformat.txt file contained in the LZMA SDK. [8]
Compression algorithm details

Similar to the decompression format situation, no complete natural language specification of the encoding techniques in 7-zip or xz seems to exist, other than the one attempted in the following text.
The description below is based on the XZ for Java encoder by Lasse Collin, [9] which appears to be the most readable among several rewrites of the original 7-zip using the same algorithms: again, while citing source code as reference isn't ideal, any programmer should be able to check the claims below with a few hours of work.
Range encoder

The range encoder cannot make any interesting choices, and can be readily constructed based on the decoder description.
Initialization and termination are not fully determined; the xz encoder outputs 0 as the first byte which is ignored by the decompressor, and encodes the lower bound of the range (which matters for the final bytes).
The xz encoder uses an unsigned 33-bit variable called low (typically implemented as a 64-bit integer, initialized to 0), an unsigned 32-bit variable called range (initialized to 2^32 − 1), an unsigned 8-bit variable called cache (initialized to 0), and an unsigned variable called cache_size which needs to be large enough to store the uncompressed size (initialized to 1, typically implemented as a 64-bit integer).
The cache / cache_size variables are used to properly handle carries, and represent a number defined by a big-endian sequence starting with the cache value, and followed by cache_size 0xff bytes, which has been shifted out of the low register, but hasn't been written yet, because it could be incremented by one due to a carry.
Note that the first byte output will always be 0 due to the fact that cache and low are initialized to 0, and the encoder implementation; the xz decoder ignores this byte.
Normalization proceeds in this way:
- If low is less than (2^32 - 2^24):
- Output the byte stored in cache to the compressed stream
- Output cache_size - 1 bytes with 0xff value
- Set cache to bits 24-31 of low
- Set cache_size to 0
- If low is greater than or equal to 2^32:
- Output the byte stored in cache plus one to the compressed stream
- Output cache_size - 1 bytes with 0 value
- Set cache to bits 24-31 of low
- Set cache_size to 0
- Increment cache_size
- Set low to the lowest 24 bits of low shifted left by 8 bits
- Set range to range shifted left by 8 bits
Context-based range encoding of a bit using the prob probability variable proceeds in this way:
- If range is less than 2^24, perform normalization
- Set bound to floor( range / 2^11) * prob
- If encoding a 0 bit:
- Set range to bound
- Set prob to prob + floor((2^11 - prob ) / 2^5)
- Otherwise (if encoding a 1 bit):
- Set range to range - bound
- Set low to low + bound
- Set prob to prob - floor( prob / 2^5)
Fixed-probability range encoding of a bit proceeds in this way:
- If range is less than 2^24, perform normalization
- Set range to floor( range / 2)
- If encoding a 1 bit:
- Set low to low + range
Termination proceeds this way:
- Perform normalization 5 times
Bit-tree encoding is performed like decoding, except that bit values are taken from the input integer to be encoded rather than from the result of the bit decoding functions.
For algorithms that try to compute the encoding with the shortest post-range-encoding size, the encoder also needs to provide an estimate of that.
Dictionary search data structures

The encoder needs to be able to quickly locate matches in the dictionary. Since LZMA uses very large dictionaries (potentially on the order of gigabytes) to improve compression, simply scanning the whole dictionary would result in an encoder too slow to be practically usable, so sophisticated data structures are needed to support fast match searches.
Hash chains

The simplest approach, called "hash chains", is parameterized by a constant N which can be either 2, 3 or 4, which is typically chosen so that 2^(8×N) is greater than or equal to the dictionary size.
It consists of creating, for each k less than or equal to N , a hash table indexed by tuples of k bytes, where each of the buckets contains the last position where the first k bytes hashed to the hash value associated with that hash table bucket.
Chaining is achieved by an additional array which stores, for every dictionary position, the last seen previous position whose first N bytes hash to the same value of the first N bytes of the position in question.
To find matches of length N or higher, a search is started using the N-sized hash table, and continued using the hash chain array; the search stops after a pre-defined number of hash chain nodes has been traversed, or when the hash chain "wraps around", indicating that the portion of the input that has been overwritten in the dictionary has been reached.

Matches of size less than N are instead found by simply looking at the corresponding hash table, which either contains the latest such match, if any, or a string that hashes to the same value; in the latter case, the encoder won't be able to find the match. This issue is mitigated by the fact that for distant short matches using multiple literals might require fewer bits, and having hash conflicts in nearby strings is relatively unlikely; using larger hash tables or even direct lookup tables can reduce the problem at the cost of a higher cache miss rate and thus lower performance.

Note that all matches need to be validated to check that the actual bytes currently at that specific dictionary position match, since the hashing mechanism only guarantees that at some past time there were characters hashing to the hash table bucket index (some implementations may not even guarantee that, because they don't initialize the data structures).
LZMA uses Markov chains , as implied by "M" in its name.
Binary trees

The binary tree approach follows the hash chain approach, except that it logically uses a binary tree instead of a linked list for chaining.
The binary tree is maintained so that it is always both a search tree relative to the suffix lexicographic ordering, and a max-heap for the dictionary position [10] (in other words, the root is always the most recent string, and a child cannot have been added more recently than its parent): assuming all strings are lexicographically ordered, these conditions clearly uniquely determine the binary tree (this is trivially provable by induction on the size of the tree).
Since the string to search for and the string to insert are the same, it is possible to perform both dictionary search and insertion (which requires rotating the tree) in a single tree traversal.
Patricia tries

Some old LZMA encoders also supported a data structure based on Patricia tries, but such support has since been dropped since it was deemed inferior to the other options. [10]
LZMA encoder

LZMA encoders can freely decide which match to output, or whether to ignore the presence of matches and output literals anyway.
The ability to recall the 4 most recently used distances means that, in principle, using a match with a distance that will be needed again later may be globally optimal even if it is not locally optimal, and as a result of this, optimal LZMA compression probably requires knowledge of the whole input and might require algorithms too slow to be usable in practice.
Due to this, practical implementations tend to employ non-global heuristics.
The xz encoders use a value called nice_len (the default is 64): when any match of length at least nice_len is found, the encoder stops the search and outputs it, with the maximum matching length.
Fast encoder

The XZ fast encoder [11] (derived from the 7-zip fast encoder) is the shortest LZMA encoder in the xz source tree.
It works like this:
- Perform combined search and insertion in the dictionary data structure
- If any repeated distance matches with length at least nice_len :
- Output the most frequently used such distance with a REP packet
- If a match was found of length at least nice_len :
- Output it with a MATCH packet
- Set the main match to the longest match
- Look at the nearest match of every length in decreasing length order, and until no replacement can be made:
- Replace the main match with a match which is one character shorter, but whose distance is less than 1/128 the current main match distance
- Set the main match length to 1 if the current main match is of length 2 and distance at least 128
- If a repeated match was found, and it is shorter by at most 1 character than the main match:
- Output the repeated match with a REP packet
- If a repeated match was found, and it is shorter by at most 2 characters than the main match, and the main match distance is at least 512:
- Output the repeated match with a REP packet
- If a repeated match was found, and it is shorter by at most 3 characters than the main match, and the main match distance is at least 32768:
- Output the repeated match with a REP packet
- If the main match size is less than 2 (or there isn't any match):
- Output a LIT packet
- Perform a dictionary search for the next byte
- If the next byte is shorter by at most 1 character than the main match, with distance less than 1/128 times the main match distance, and if the main match length is at least 3:
- Output a LIT packet
- If the next byte has a match at least as long as the main match, and with less distance than the main match:
- Output a LIT packet
- If the next byte has a match at least one character longer than the main match, and such that 1/128 of its distance is less or equal than the main match distance:
- Output a LIT packet
- If the next byte has a match more than one character longer than the main match:
- Output a LIT packet
- If any repeated match is shorter by at most 1 character than the main match:
- Output the most frequently used such distance with a REP packet
- Output the main match with a MATCH packet
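The repeated "distance less than 1/128 of the current main match distance" test above is, in the spirit of the 7-zip/xz sources, a simple shift comparison rather than a division. A hedged sketch (the helper name much_closer is ours, not the actual macro name):

```c
#include <stdbool.h>
#include <stdint.h>

/* True when small_dist is "much closer" than big_dist, i.e. less than
   roughly 1/128 of it; used when deciding whether a shorter match with a
   dramatically smaller distance should replace the current main match.   */
static bool much_closer(uint32_t small_dist, uint32_t big_dist)
{
    return (big_dist >> 7) > small_dist;
}
```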
Normal encoder

The XZ normal encoder [12] (derived from the 7-zip normal encoder) is the other LZMA encoder in the xz source tree; it adopts a more sophisticated approach that tries to minimize the post-range-encoding size of the generated packets.
Specifically, it encodes portions of the input using the result of a dynamic programming algorithm, where the subproblems are finding the approximately optimal encoding (the one with minimal post-range-encoding size) of the substring of length L starting at the byte being compressed.
The size of the portion of the input processed by the dynamic programming algorithm is the maximum of the longest dictionary match and the longest repeated match found at the start position (capped by the maximum LZMA match length, 273). Furthermore, if a match longer than nice_len is found at any point in this range, the dynamic programming algorithm stops, the solution for the subproblem up to that point is output, the nice_len-sized match is output, and a new dynamic programming problem instance is started at the byte after that match is output.
Subproblem candidate solutions are incrementally updated with candidate encodings, each constructed by taking the solution for a shorter substring of length L' and extending it with all possible "tails": sets of 1-3 packets, subject to certain constraints, that encode the input at position L'. Once the final solution of a subproblem is found, the LZMA state and last used distances for it are computed, and are then used to appropriately compute the post-range-encoding sizes of its extensions.
At the end of the dynamic programming optimization, the whole optimal encoding of the longest substring considered is output, and encoding continues at the first uncompressed byte not already encoded, after updating the LZMA state and last used distances.
Each subproblem is extended by a packet sequence which we call "tail", which must match one of the following patterns:
1st packet / 2nd packet / 3rd packet:
- any
- LIT, LONGREP[0]
- *MATCH, LIT, LONGREP[0]

The reason for not extending with single packets only is that, for performance and algorithmic-complexity reasons, subproblems have only the substring length as their parameter, while an optimal dynamic programming approach would also require the last used distances and the LZMA state as parameters; extending with multiple packets thus allows a better approximation of the optimal solution, and in particular better use of LONGREP[0] packets.
The following data is stored for each subproblem (of course, the values stored are for the candidate solution with minimum price), where by "tail" we refer to the packets extending the solution of the smaller subproblem, which are described directly in the following structure (member names are those used in XZ for Java):

- price : the quantity to be minimized: the number of post-range-encoding bits needed to encode the string
- optPrev : the uncompressed size of the substring encoded by all packets except the last one
- backPrev : -1 if the last packet is LIT, 0-3 if it is a repetition using the last used distance number 0-3, 4 + distance if it is a MATCH (this is always 0 if prev1IsLiteral is true, since the last packet can only be a LONGREP[0] in that case)
- prev1IsLiteral : true if the "tail" contains more than one packet (in which case the one before the last is a LIT)
- hasPrev2 : true if the "tail" contains 3 packets (only valid if prev1IsLiteral is true)
- optPrev2 : the uncompressed size of the substring encoded by all packets except the "tail" (only valid if prev1IsLiteral and hasPrev2 are true)
- backPrev2 : -1 if the first packet in the "tail" is LIT, 0-3 if it is a repetition using the last used distance number 0-3, 4 + distance if it is a MATCH (only valid if prev1IsLiteral and hasPrev2 are true)
- reps[4] : the values of the 4 last used distances after the packets in the solution (computed only after the best subproblem solution has been determined)
- state : the LZMA state value after the packets in the solution (computed only after the best subproblem solution has been determined)

Note that in the XZ for Java implementation, the optPrev and backPrev members are reused to store a forward single-linked list of packets as part of outputting the final solution.
LZMA2 encoder

The XZ LZMA2 encoder processes the input in chunks (of up to 2 MB uncompressed size or 64 KB compressed size, whichever is lower), handing each chunk to the LZMA encoder, and then deciding whether to output an LZMA2 LZMA chunk including the encoded data, or to output an LZMA2 uncompressed chunk, depending on which is shorter (LZMA, like any other compressor, will necessarily expand rather than compress some kinds of data).
The LZMA state is reset only in the first block, when the caller requests a change of properties, and every time a compressed chunk is output. The LZMA properties are changed only in the first block, or when the caller requests a change of properties. The dictionary is reset only in the first block.
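A hedged sketch of the chunk-level decision, using hypothetical helper names (this is not the real liblzma API): compress the chunk into a scratch buffer, then emit whichever representation is smaller.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, not the real liblzma interfaces. */
size_t lzma_compress_chunk(const uint8_t *in, size_t in_len,
                           uint8_t *out, size_t out_cap);
void   emit_lzma_chunk(const uint8_t *packed, size_t packed_len);
void   emit_uncompressed_chunk(const uint8_t *raw, size_t raw_len);

/* Encode one LZMA2 chunk: keep whichever representation is shorter. */
void lzma2_encode_chunk(const uint8_t *in, size_t in_len,
                        uint8_t *scratch, size_t scratch_cap)
{
    size_t packed = lzma_compress_chunk(in, in_len, scratch, scratch_cap);

    if (packed != 0 && packed < in_len)
        emit_lzma_chunk(scratch, packed);      /* compression paid off       */
    else
        emit_uncompressed_chunk(in, in_len);   /* LZMA would have expanded   */
}
```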
Upper encoding layers

Before LZMA2 encoding, depending on the options provided, xz can apply the BCJ filter, which filters executable code to replace relative offsets with absolute ones that are more repetitive, or the delta filter, which replaces each byte with the difference between it and the byte N bytes before it.
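The delta filter in particular is tiny. A minimal sketch, assuming the distance is passed explicitly as a parameter (in xz it comes from the filter options); the function names are ours:

```c
#include <stddef.h>
#include <stdint.h>

/* Forward transform: replace each byte with its difference from the byte
   `dist` positions earlier; the first `dist` bytes pass through unchanged.
   Processed back-to-front so that the reference bytes are still original.  */
void delta_encode(uint8_t *buf, size_t len, size_t dist)
{
    for (size_t i = len; i-- > dist; )
        buf[i] = (uint8_t)(buf[i] - buf[i - dist]);
}

/* Inverse transform used on decompression; front-to-back, since the
   reference bytes must already be restored.                                */
void delta_decode(uint8_t *buf, size_t len, size_t dist)
{
    for (size_t i = dist; i < len; i++)
        buf[i] = (uint8_t)(buf[i] + buf[i - dist]);
}
```

For data such as 16-bit or 24-bit samples, a well chosen distance turns slowly varying values into many small (and therefore more compressible) byte differences.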
Parallel encoding is performed by dividing the file into chunks, distributing the chunks to threads, and encoding each chunk separately (for instance, as an xz block), which results in a dictionary reset between chunks in the output file.
7-Zip reference implementation

The LZMA implementation extracted from 7-Zip is available as the LZMA SDK. It was originally dual-licensed under both the GNU LGPL and Common Public License, [13] with an additional special exception for linked binaries, but was placed by Igor Pavlov in the public domain on December 2, 2008, with the release of version 4.62. [8]
LZMA2 compression, which is an improved version of LZMA, [14] is now the default compression method for the .7z format, starting with version 9.30 on October 26th, 2012. [15]
The reference open source LZMA compression library is written in C++ and has the following properties:
- Compression speed: approximately 1 MB/s on a 2 GHz CPU .
- Decompression speed: 10–20 MB/s on a 2 GHz CPU.
- Support for multithreading .
In addition to the original C++ , the LZMA SDK contains reference implementations of LZMA compression and decompression ported to ANSI C , C# , and Java . [8] There are also third-party Python bindings for the C++ library, as well as ports of LZMA to Pascal , Go and Ada . [16] [17] [18] [19]
The 7-Zip implementation uses several variants of hash chains , binary trees and Patricia tries as the basis for its dictionary search algorithm.
Decompression-only code for LZMA generally compiles to around 5 KB, and the amount of RAM required during decompression is principally determined by the size of the sliding window used during compression. Small code size and relatively low memory overhead, particularly with smaller dictionary lengths, and free source code make the LZMA decompression algorithm well-suited to embedded applications.
Other implementations

In addition to the 7-Zip reference implementation, the following support the LZMA format.
- xz : a streaming implementation that contains a gzip-like command line tool, supporting both LZMA and LZMA2 in its xz file format. It has made its way into much software in the Unix-like world thanks to its higher compression performance (compared to bzip2) and smaller output size (compared to gzip). [2] The Linux kernel, dpkg and RPM systems contain xz code, and many software distributors such as kernel.org, Debian [20] and Fedora now use xz for compressing their releases.
- lzip : another LZMA implementation, mostly for Unix-like systems, that competes directly with xz. [21] It mainly features a simpler file format and therefore easier error recovery.
- ZIPX : an extension of the ZIP compression format, created by WinZip starting with version 12.1. It can also use various other compression methods such as BZip and PPMd. [22]
- DotNetCompression : a streaming implementation of LZMA in managed C#.
LZHAM

LZHAM (LZ, Huffman, Arithmetic, Markov) is an LZMA-like implementation that trades compression throughput for very high compression ratios and higher decompression throughput. [23]
References
- [1] Igor Pavlov (2004-02-19). "LZMA spec?". Archived from the original on 2009-08-25. Retrieved 2013-06-16. (Pavlov has asserted multiple times on SourceForge that the algorithm is his own creation.)
- [2] Lasse Collin (2005-05-31). "A Quick Benchmark: Gzip vs. Bzip2 vs. LZMA". Retrieved 2015-10-21. (The LZMA Unix port was eventually replaced by xz, which features better and faster compression; even the LZMA Unix port was considerably better than gzip and bzip2.)
- [3] Klausmann, Tobias (2008-05-08). "Gzip, Bzip2 and Lzma compared". Blog of an Alpha animal. Retrieved 2013-06-16.
- [4] Igor Pavlov (2013). "7z Format". Retrieved 2013-06-16.
- [5] Mahoney, Matt. "Data Compression Explained". Retrieved 2013-11-13.
- [6] Collin, Lasse; Pavlov, Igor. "lib/xz/xz_dec_lzma2.c". Retrieved 2013-06-16.
- [7] "The .xz File Format" (2009-08-27). Retrieved 2013-06-16.
- [8] Igor Pavlov (2013). "LZMA SDK (Software Development Kit)". Retrieved 2013-06-16.
- [9] "XZ in Java". Retrieved 2013-06-16.
- [10] Solomon, David (2006-12-19). Data Compression: The Complete Reference (4th ed.). Springer Publishing. p. 245. ISBN 978-1846286025.
- [11] Collin, Lasse; Pavlov, Igor. "LZMAEncoderFast.java". Archived from the original on 2012-07-16. Retrieved 2013-06-16.
- [12] Collin, Lasse; Pavlov, Igor. "LZMAEncoderNormal.java". Archived from the original on 2012-07-08. Retrieved 2013-06-16.
- [13] "Browse /LZMA SDK/4.23". SourceForge. Retrieved 2014-02-12.
- [14] "Inno Setup Help". jrsoftware.org. Retrieved 2013-06-16. ("LZMA2 is a modified version of LZMA that offers a better compression ratio for uncompressible data (random data expands about 0.005%, compared to 1.35% with original LZMA), and optionally can compress multiple parts of large files in parallel, greatly increasing compression speed but with a possible reduction in compression ratio.")
- [15] "HISTORY of the 7-Zip" (2012-10-26). Retrieved 2013-06-16.
- [16] Bauch, Joachim (2010-04-07). "PyLZMA – Platform independent python bindings for the LZMA compression library". Retrieved 2013-06-16.
- [17] Birtles, Alan (2006-06-13). "Programming Help: Pascal LZMA SDK". Retrieved 2013-06-16.
- [18] Vieru, Andrei (2012-06-28). "compress/lzma package for Go 1". Retrieved 2013-06-16.
- [19] "Zip-Ada".
- [20] Guillem Jover. "Accepted dpkg 1.17.0 (source amd64 all)". Debian Package QA. Retrieved 2015-10-21.
- [21] Diaz, Diaz. "Lzip Benchmarks". LZIP (nongnu).
- [22] "What is a Zipx File?". WinZip.com. Retrieved 2016-03-14.
- [23] "LZHAM – Lossless Data Compression Codec". Richard Geldreich.
LZHAM is a lossless data compression codec written in C/C++ with a compression ratio similar to LZMA but with 1.5x-8x faster decompression speed.

External links

- Official home page
- Lzip format specification
- XZ format specification
- LZMA SDK (Software Development Kit)
- LZMA Utils = XZ Utils
- Windows Binaries for XZ Utils
- Data compression, Compressors & Archivers
Retrieved from https://en.wikipedia.org/w/index.php?title=Lempel–Ziv–Markov_chain_algorithm&oldid=831085554 (text available under the Creative Commons Attribution-ShareAlike License).
Mar 13, 2018 | www.linuxtoday.com
zstd is an open-source lossless data compression algorithm designed to offer fast real-time compression and decompression speeds, even faster than xz or gzip.
freshmeat.net
Lzip is a lossless data compressor based on the LZMA algorithm, with very safe integrity checking and a user interface similar to the one of gzip or bzip2. Lzip decompresses almost as quickly as gzip and compresses better than bzip2, which makes it well suited for software distribution and data archiving. Lziprecover is a data recovery tool for lzip compressed files able to repair slightly damaged files, recover badly damaged files from two or more copies, and extract undamaged members from multi-member files.
Re: Other LZMA tools
I think you are right that the standalone 'lzma' program (replacing the older lzmash) has a very basic data format. But still, it works, and is the more established tool. I would be happy for lzip to replace it if lzip is better, but to do that it should include support for decompressing legacy .lzma files.
(I note that the gzip format has provision for alternative compression methods but nobody ever seems to use it.)
> As for lrzip, it is actually
> an extension of rzip---and the two are
> more of a proof-of-concept than a
> realworld-workable format.
The file format may be basic but the tool is very good. It usually compresses better than plain LZMA (the algorithm, used in both lzma-utils and lzip) and faster too. LZMA is better for all-purpose use but for batch compression tasks where you don't mind relatively high memory usage, lrzip can give a big improvement. For some Subversion dump files I back up overnight it gave a fourfold increase in compression for about the same speed.
Re: Other LZMA tools
As I gather, lzma-utils-produced files lack magic identification bytes and a checksum, and if you believe the forum archives, lzma-utils did not manage to come up with a suitable new format in more than two years. It is about time lzip came along; 7z sounds nice too, but it seems to have gained no ground in the Unix world due to subpar Unix integration.
As for lrzip, it is actually an extension of rzip, and the two are more of a proof-of-concept than a real-world-workable format.
In 1987, I was asked by a magazine editor to write an article about data compression. I wrote a manuscript and an accompanying program, sent them to the editor, and forgot about them. The next time I heard from him I was told that the magazine was discontinued. So I uploaded my program, a simple implementation of the LZSS compression algorithm (see below) to PC-VAN, a big Japanese BBS run by NEC. That was May 1, 1988.
Soon a number of hobby programmers gathered and began improving on that program. The project culminated in Kazuhiko Miki's archiver LArc, which was fairly widely used in Japan. (Dr. Miki was then a medical specialist working at a governmental office. I heard he left office and began work on freeware/shareware promotion.)
The LZSS algorithm is based on a very simple idea. Suppose I'm going to write "compression" here. But probably I've already used that word before in this file. If I used that word 57 characters before, I might as well write "go 57 characters backward, and read 11 characters," or <57,11> for short. In general, when I've already used the string of characters among the recent 4096 characters, say, I encode the string by a <position,length> pair.
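A deliberately naive sketch of the <position,length> search over a 4096-character window (quadratic in time; the tree-based speedups are discussed below). The function name and the length limits are ours, chosen only for illustration:

```c
#include <stddef.h>

#define WINDOW   4096   /* how far back a pointer may reach  */
#define MIN_LEN  3      /* shorter matches are not worth it  */
#define MAX_LEN  18     /* typical LZSS length limit         */

/* Find the longest match for data[pos..] inside the preceding window.
   Returns the match length (0 if below MIN_LEN); *dist says how far back. */
static size_t longest_match(const unsigned char *data, size_t pos, size_t len,
                            size_t *dist)
{
    size_t best_len = 0, best_dist = 0;
    size_t start = pos > WINDOW ? pos - WINDOW : 0;

    for (size_t cand = start; cand < pos; cand++) {
        size_t l = 0;
        while (l < MAX_LEN && pos + l < len && data[cand + l] == data[pos + l])
            l++;
        if (l > best_len) {
            best_len = l;
            best_dist = pos - cand;
        }
    }
    if (best_len < MIN_LEN)
        return 0;
    *dist = best_dist;
    return best_len;
}
```

The encoder then emits either a literal character or a <dist,len> pair, advancing by 1 or by len accordingly.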
In Storer's [8] terminology, this is a sliding dictionary algorithm, analyzed first by Ziv and Lempel [14] and then by Storer and Szymanski [9], among others.
Later versions of my LZSS implementations and Miki's LArc used binary search trees to make string search faster; see Bell [1].
Incidentally, there are two distinct Ziv-Lempel (LZ) methods: sliding dictionary [14] and dynamic dictionary [15] in Storer's [8] terminology. The LZW algorithm [12] belongs to the latter. Most pre-LHarc compression tools, such as 'compress', 'ARC', and 'PKARC', used LZW.
During the summer of 1988, I wrote another compression program, LZARI. This program is based on the following observation: Each output of LZSS is either a single character or a <position,length> pair. A single character can be coded as an integer between 0 and 255. As for the <length> field, if the range of <length> is 2 to 257, say, it can be coded as an integer between 256 and 511. Thus, I can say that there are 512 kinds of "characters," and the "characters" 256 through 511 are accompanied by a <position> field. These 512 "characters" can be Huffman-coded, or better still, arithmetic-coded. The <position> field can be coded in the same manner. In LZARI I used adaptive arithmetic compression [13], [2] to encode the "characters," and static arithmetic compression to encode the <position> field. (There were several versions of LZARI; some of them were slightly different from the above description.) The compression of LZARI was very tight, though rather slow.
Haruyasu Yoshizaki (Yoshi), a physician and guru hobby programmer, worked very hard to make LZARI faster. Most importantly, he replaced LZARI's arithmetic compression with dynamic Huffman coding.
His program, LZHUF, was very successful. It was much faster than my LZARI. As for compression ratio, Huffman cannot beat arithmetic compression, but the difference turned out to be very small.
Yoshi rewrote the compression engine of LZHUF in assembler, and added a nifty user interface. His archiver, LHarc, soon became the de facto standard among Japanese BBS users. After Prof. Kenjirou Okubo, a mathematician, introduced LHarc to the United States, it became world-famous. Other vendors began using similar techniques: a sliding dictionary plus statistical compression such as Huffman and Shannon-Fano. (I wondered why they used Shannon-Fano rather than Huffman, which is guaranteed to compress tighter than Shannon-Fano. As it turned out, a then-popular book on compression published in the U.S. contained a wrong description and buggy sample programs, such that Shannon-Fano outperformed the (buggy) Huffman code on many files.)
Although LHarc was much faster than LZARI, we weren't quite satisfied with its speed. Because LHarc was based on dynamic Huffman, it had to update the Huffman tree every time it received a character. Yoshi and I tried other dynamic Huffman algorithms [5], [10], [11], but improvements were not as great as we desired.
So I took a different step: replacing LHarc's dynamic Huffman by a static Huffman method.
Traditional static Huffman coding algorithm first scans the input file to count character distribution, then builds Huffman tree and encodes the file. In my approach, the input file is read only once. It is first compressed by a sliding dictionary method like LZARI and LHarc, and at the same time the distributions of the "characters" (see above) and positions are counted. The output of this process is stored in main memory. When the buffer in memory is full (or the input is exhausted), the Huffman trees are constructed, and the half-processed content of the buffer is actually compressed and output.
In static Huffman, the Huffman tree must be stored in the compressed file. In the traditional approach this information consumes hundreds of bytes. My approach was to standardize Huffman trees so that (1) each left subtree is no deeper than its right counterpart, and (2) the leaves at the same level are sorted in ascending order. In this way the Huffman tree can be uniquely specified by the lengths of the codewords. Moreover, the resulting table is again compressed by the same Huffman algorithm.
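This "the codeword lengths alone specify the tree" idea is what is now usually called canonical Huffman coding. A minimal sketch of code assignment from lengths, using the DEFLATE-style convention (which differs in tie-breaking details from the standardization described above, but rests on the same observation); the function name is ours:

```c
#include <stdint.h>

#define MAX_BITS 16

/* Given code lengths len[0..n-1] (0 = symbol unused), fill code[0..n-1]
   with canonical codes.  This is essentially the RFC 1951 procedure.     */
void canonical_codes(const uint8_t *len, uint32_t *code, int n)
{
    int      bl_count[MAX_BITS + 1]  = {0};
    uint32_t next_code[MAX_BITS + 1] = {0};

    for (int i = 0; i < n; i++)
        bl_count[len[i]]++;
    bl_count[0] = 0;                       /* unused symbols get no code  */

    /* smallest code of each length: shorter codes come first numerically */
    uint32_t c = 0;
    for (int bits = 1; bits <= MAX_BITS; bits++) {
        c = (c + bl_count[bits - 1]) << 1;
        next_code[bits] = c;
    }
    /* assign codes to symbols in symbol order within each length */
    for (int i = 0; i < n; i++)
        if (len[i] != 0)
            code[i] = next_code[len[i]]++;
}
```

Because the decoder can rebuild exactly the same codes from the lengths alone, only the length table has to be stored in the compressed file, which is the saving described above.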
To make the decoding program simpler, the Huffman tree is adjusted so that the codeword lengths do not exceed 16 bits. Since this adjusting is rarely needed, the algorithm is made very simple. It does not create optimal length-limited Huffman trees; see e.g. [6] for an optimal algorithm. Incidentally, my early program had a bug here, which was quickly pointed out and corrected by Yoshi.
The sliding dictionary algorithm is also improved by Yoshi using a "PATRICIA tree" data structure; see McCreight [7] and Fiala and Greene [4].
After completing my algorithm, I learned that Brent [3] also used a sliding dictionary plus Huffman coding. His method, SLH, is simple and elegant, but since it doesn't find the most recent longest match, the distribution of match position becomes flat. This makes the second-stage Huffman compression less efficient.
On the basis of these new algorithms, Yoshi began to rewrite his LHarc, but it took him so long (remember he was a busy doctor!) that I decided to write my own archiver. My archiver was quite recklessly named 'ar'. (Actually I appended version numbers as in 'ar002' for version 0.02.) I should have named it 'har' (after my name), say, because 'ar' collides with the name of UNIX's archiver. I didn't want my program to compete with LHarc, but I wanted many people to try the algorithm, so I wrote it in pure ANSI C. This is the reason 'ar' lacked many bells and whistles necessary for a real archiver.
Note: The version of 'ar002' most often found in the U.S. had a bug. Line 24 of maketbl.c should read, of course,

    while (i <= 16) { weight[i] = 1U << (16 - i); i++; }

Somehow the bug didn't show up when compiled by Turbo C.

Yoshi finally showed us his new archiver written in C. It was tentatively named LHx. He then rewrote the main logic in assembler. Yoshi and I wrote an article describing his new archiver, which would be named LH, in the January, 1991, issue of "C Magazine" (in Japanese). The suffix 'arc' of LHarc was deliberately dropped because the people who sold ARC did not want others to use the name.
Then we learned that in the new DOS 5.0, LH meant LoadHigh, an internal command. We decided to rename LH to LHA.
Also, I was told that the algorithm described in Fiala and Greene [4] got patented ("Textual Substitution Data Compression With Finite Length Search Windows," U.S. Patent 4,906,991, Mar. 6, 1990. Actually they got three patents! The other two were: "Start, Step, Stop Unary Encoding for Data Compression," Application Ser. No. 07/187,697, and "Search Tree Data Structure Encoding for Textual Substitution Data Compression Systems," Application Ser. No. 07/187,699.)
Furthermore, I learned that the original Ziv-Lempel compression method (Eastman et al., U.S. Patent 4,464,650, 8/1984) and the LZW method (Welch, 4,558,302, 12/1985) were patented. I also heard that Richard Stallman, of the Free Software Foundation, author of the EMACS editor and leader of the GNU project, ceased to use 'compress' program any more because its LZW algorithm got patented.
Are algorithms patentable? (See [16].) If these patents should turn out to be taken seriously, all compression programs now in use may infringe some of these patents. (Luckily, not all claims made by those algorithm patents seem to be valid.)
The foregoing is a slight modification of what I wrote in 1991. The year 1991 was a very busy year for me. In 1992, I joined the faculty of Matsusaka University. This opportunity should have given me more free time, but as it turned out I got ever busier. I stopped hacking on my compression algorithms; so did Yoshi.
Luckily, all good things in LHA were taken over, and all bad things abandoned, by the new great archiver zip and the compression tool gzip. I admire the efforts of Jean-loup Gailly and others.
A brief historical comment on PKZIP: At one time a programmer for PK and I were in close contact. We exchanged a lot of ideas. No wonder PKZIP and LHA are so similar.
Another historical comment: LHICE and ICE are definitely not written by Yoshi (or me or anyone I know). I think they are faked versions of LHarc.
REFERENCES
- [1]
- Timothy C. Bell. Better OPM/L text compression. IEEE Transactions on Communications, COM-34(12):1176--1182, 1986.
- [2]
- Timothy C. Bell, John G. Cleary, and Ian H. Witten. Text Compression. Prentice Hall, 1990.
- [3]
- R. P. Brent. A linear algorithm for data compression. The Australian Computer Journal, 19(2):64--68, 1987.
- [4]
- Edward R. Fiala and Daniel H. Greene. Data compression with finite windows. Communications of the ACM, 32(4):490--505, 1989.
- [5]
- Donald E. Knuth. Dynamic Huffman coding. Journal of Algorithms, 6:163--180, 1985.
- [6]
- Lawrence L. Larmore and Daniel S. Hirschberg. A fast algorithm for optimal length-limited Huffman codes. Journal of the Association for Computing Machinery, 37(3):464--473, 1990.
- [7]
- Edward M. McCreight. A space-economical suffix tree construction algorithm. Journal of the Association for Computing Machinery, 23(2):262--272, 1976.
- [8]
- James A. Storer. Data Compression: Methods and Theory. Computer Science Press, Rockville, MD., 1988.
- [9]
- James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. Journal of the Association for Computing Machinery, 29(4):928--951, 1982.
- [10]
- Jeffrey Scott Vitter. Design and analysis of dynamic Huffman codes. Journal of the Association for Computing Machinery, 34(4):825--845, 1987.
- [11]
- Jeffrey Scott Vitter. Algorithm 673: Dynamic Huffman coding. ACM Transactions on Mathematical Software, 15(2):158--167, 1989.
- [12]
- Terry A. Welch. A technique for high-performance data compression. IEEE Computer, 17(6):8--19, 1984.
- [13]
- Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520--540, 1987.
- [14]
- Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, IT-23(3):337--343, 1977.
- [15]
- Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, IT-24(5):530--536, 1978.
- [16]
- Edward N. Zalta. Are algorithms patentable? Notices of the American Mathematical Society, 35(6):796--799, 1988.
Haruhiko Okumura,
[email protected]
Last modified: Tue Mar 17 17:02:03 1998
The column "html" shows the size of the article in html format and if clicked opens it in a new window. The "zip" column contains the zipped article and images or source if any. Note that brighter rows mean new or updated article. If you take the time to read the articles, please take the time to tell me what do you think of them.
- Arithmetic coding, entropy coder (html 20k, zip 7k, 22-Jul-1999)
- Bwt, block reduction (html 21k, zip 8k, 22-Jul-1999)
- Canonical huffman (html 15k, zip 5k, 23-Jul-1999)
- Crc-32, the standard Crc (html 6k, zip 7k, 10-Aug-1999)
- Finite context modeling (html 37k, zip 12k, 16-Nov-1999)
- Flexible parsing, improvement of lz (html 10k, zip 4k, 24-Jul-1999)
- Lz77, also called lzss (html 40k, zip 14k, 23-Jul-1999)
- Lzw, for gif decoding (html 27k, zip 9k, 23-Jul-1999)
- Mtf, a transformation scheme (html 9k, zip 3k, 24-Jul-1999)
- Implementing ppmc with hash tables (html 111k, zip 327k, 21-March-2000)
- Quasi Static model (html 19k, zip 6k, 13-Aug-1999)
- Range coder (html 24k, zip 8k, 17-Nov-1999)
- Rle, Basic scheme (html 7k, zip 3k, 24-Jul-1999)

As you can read in the index, I'm going to publish new articles as soon as possible; if you want to get an email when that happens, please fill in this form, and you can also use it to tell me which file format you prefer. If you have any questions about the articles, feel free to email me.
My goal is to offer the best articles about compression for free. Thus I need your feedback on the articles to further improve them: what you don't know, what was confusing or unclear, what difficulties you found while implementing it, or what you miss. Telling me so is the only way of further improving the articles for you, and it depends only on you.
Note that since my first articles both my knowledge of compression and my English have greatly improved, so you may notice a difference in quality between the old articles and the latest ones. Unfortunately I don't have the time to rewrite all the articles (as I would like to), so instead please tell me which articles you find worth rewriting, and I'll do so.
The spread of computing has led to an explosion in the volume of data to be stored on hard disks and sent over the Internet. This growth has led to a need for "data compression", that is, the ability to reduce the amount of storage or Internet bandwidth required to handle data.
This document provides a survey of data compression techniques. The focus is on the most prominent data compression schemes, particularly popular archivers, JPEG image compression, and MPEG video and audio compression.
Gone are the days... (Score:5) by zpengo on Saturday April 22 2000, @06:31PM

37 (Score:4) by roman_mir on Saturday April 22 2000, @06:36PM

Oh, the memories...
I used to go down to the local computer store, which had bins and bins of the latest shareware, all on precious 5 1/4 disks. Each one held some sort of magic that would transform my XT with Hercules graphics into a completely absorbing experience.
Video games, clones of major applications, dinky little Pascal compilers, my first version of Spacewar....
But there was a key to all of that magic. Back then, there were no auto-installing CDs. There was no "setup.exe" There would just be a single file, with that ever-familiar extension: ".ZIP"
I had been on the scene long enough to know what was up, so I not only had PKZIP/PKUNZIP installed on my 4 meg harddrive, but I even had it in the PATH.
A few keystrokes later, the magic was unlocked.
We don't know how much we owe to this great man. I genuinely mourn his passing.
Sad day.... by maelstrom (Score:1) Saturday April 22 2000, @06:37PM

So many celebrities, poets, actors, revolutionaries, warriors, politicians etc. have died at 33 or 37; I tell you, if you pass 37 you'll probably live a long life.
(To those of us who remember Vladimir Vysotsky: "At the figure 37, a treacherous God put the question point-blank: either, or. At that milestone both Byron and Rimbaud lay down, but today's ones have somehow slipped past...")

*sigh* (Score:5) by Seumas on Saturday April 22 2000, @06:37PM
Re:PK vs. SEA controversy (Score:4) by farrellj on Saturday April 22 2000, @07:04PM

This is the first I've heard of his death and I have to say that it really makes me feel sad. I'm not aware of much that he's done outside of PKZIP, but I sure remember using ZIP for everything online (especially when a 2400 baud modem was considered fast and a zipped file could halve your online time).
Huffman, Postel, Stevens . . . Now P.W. Katz. I feel guilty for not ever considering any of these people beyond what their program does or does not do for me -- or how I benefitted from their books, until after their death. To think that while we're all out there unzipping our latest copy of the Jargon file or stashing a bunch of porn in a password protected ZIP file, this guy was suffering a serious problem which eventually took his life at the age of *thirty-seven*.
I'm only 22. I spend all my time working at a desk. I haven't been in-shape for almost six years. I could be next. I could be next and I haven't offered a damn thing to the computer or internet community. These people -- and many others, have.
I hope that we'll remember these things in subsequent posts in reply to this article. The last thing we need is another disgustingly barbaric replay of the posts we saw when W. Richard Stevens died.
I hope you have peace, Phillip.
W. Richard Stevens Slashdot Article [slashdot.org]
W. Richard Stevens Home Page [kohala.com]
David Huffman Slashdot Article [slashdot.org]
Jon Postel Slashdot Article [slashdot.org]
Jon Postel's Home Page [isi.edu]

This is really sad! (Score:5) by DeepDarkSky on Saturday April 22 2000, @07:04PM

O.K., here is the story as I remember it.
Phil wrote a better compression program that was compatible with System Enhancements Associates' (SEA) program called ARC. So they litigated. And so Phil went off, found a better algorithm for compression, and brought out PKZIP.
Many people in the BBS community thought that SEA was a little heavyhanded (Perception, I don't know the reality), and moved to PKZIP. Others moved over for the speed and the better compression. The rest is history.
See also "arc wars" [cs.hut.fi]MIT Jargon File ver 299. This story seems to have been dropped from the current Jargon File [jargon.org] for some reason.
ttyl
Farrell McGovernFormer Sysop, Data/SFnet (One of the first few hundred Fidonet BBSs!) and Solsbury Hill, founding member of PODSnet.
Re:Would like to know the rest of the story (Score:3) by Trepidity <delirium-slashdot.hackish@org> on Saturday April 22 2000, @07:18PM

I definitely remember Phil Katz and all the controversy surrounding him, and how grateful I was to have discovered his programs. I remember the first compression program, which was SEA's ARC program. It was very slow. Then my friend and I discovered PKARC and PKXARC, which were much faster than ARC. As PKARC gained popularity because of its overall superiority, SEA sued Phil Katz, and he in turn created PKPAK/PKUNPAK (I think it was still paired like that). The PKPAK series didn't last long. The PKZIP series came out next, and that was the series that created the ubiquitous ZIP format that we see today. If I remember correctly, PKZ204G was the last official DOS version of the program, and there were plenty of trojans, etc. that were going around, and Phil created self-authenticating zip files, etc. Lots of neat little cool things. I also remember that other programs were giving PK a run for his money, such as ARJ and LHARC, but they never achieved the overall speed/performance/compression that PKZIP did (they were often better at one thing or another but not overall). Then WinZip came out, and I kind of lost sight of PK.
I still have thousands of ZIP files that were zipped with PKZIP. If it weren't for him, I wouldn't have been as into computers as I am; it was those early days of playing around with PKARC and PKXARC that really got me started. I am terribly sad to see him go, and in such an (I think) undignified way.
Re:Would like to know the rest of the story (Score:5) by Harinath on Saturday April 22 2000, @07:28PM

Well, PKZIP seems to have stopped development around 1993, well before WinZip became popular. PKZIP v2.04g was pretty much the last version I know of, and it came out February 1, 1993. Up until then there had been fairly frequent updates, but throughout 1993, 1994, and 1995, PKZIP v2.04g for DOS remained the standard compression tool.
Only then, after 2-3 years of no updates, did other tools like WinZip become popular. PKZIP finally got a Windows version in 1997 or 1998, but they were long gone by then. I'm not sure what led to the development halt, but the original stuff is fine coding...
Some more links... (Score:2) by Megane on Saturday April 22 2000, @07:31PM

IIRC, the ZIP file format was made public domain, thus allowing the Info-ZIP people to write a program that reads ZIP archives, which in turn allowed WinZip to not have to license software from PKWARE.
Unlike LZW, the ZIP "deflate" algorithm (LZ77 plus Huffman coding) is unencumbered. These compression algorithms are used in GNU zip (gzip) partly for that reason. I think gzip can even read .zip archives with only one file inside. The deflate algorithm is also in the zlib library, which is used in the PNG format, for one, and it is described in RFC 1951.
So, thanks PK, for providing one of the tools that enable us to thumb our noses at Unisys :-)
Phillip Katz's patents (Score:2) by ContinuousPark on Saturday April 22 2000, @07:32PM

From the Milwaukee Journal Sentinel:
The original obituary notice [jsonline.com]
A more complete article [jsonline.com]

One interesting quote:
"It was just a hobby," he said. "I didn't expect it to turn into a business."I had a moderately successful shareware program myself during the '80s, and it sure didn't help my life much. Fortunately I have no interest in booze or drugs -- they just get in the way of hacking. And also fortunately, I let it go when it wasn't successful any more. Maybe a little later than I should have, but I did move on.
Re:How many people here registered pkzip? (Score:1) by Anonymous Coward on Saturday April 22 2000, @06:53PM

Couldn't find much so far about him, but I came across this page where several patents on data compression algorithms are mentioned, and this led me to one of his patents [ibm.com], a so-called string searcher and compressor.
It would be interesting to know if what the patent decribes is the technology behind PKZIP.
I first found out about this on CNET Download.com's front page; there was this little message in memoriam of PK but I don't think it was mentioned on News.com; that was strange. This is a sad event and I think it would be more convenient and respectful if we didn't get to know the details of his death just because it turned out to be a morbid and attention-attracting story for the media.
i didn't either.
do you think it would be appropriate to register now, and contribute the check to his estate?
Katz, Phillip W.

Publication Date: April 19, 2000
Age 37. Passed away unexpectedly on Fri., April 14, 2000. Beloved son of Hildegard and beloved brother of Cynthia. Also survived by other relatives and friends. Phil was a graduate of UWM Computer Science Engineering Program. He was the author of the PKZIP/PKUNZIP software and owner of PKWARE Inc. Co. Private services have been held. Memorials to the charity of your choice would be appreciated.

ZWASKA FUNERAL HOME 354-5330 Serving the Family
From: [email protected] (John Barkaus)
Newsgroups: comp.graphics
Subject: GIF file format responses 5/5
Date: 21 Apr 89 20:58:01 GMT
Organization: The Cooper Union (NY, NY)
LZW and GIF explained -- Steve Blackstock
I hope this little document will help enlighten those of you out there who want to know more about the Lempel-Ziv Welch compression algorithm, and, specifically, the implementation that GIF uses.
Before we start, here's a little terminology, for the purposes of this document:
- "character":
- a fundamental data element. In normal text files, this is just a single byte. In raster images, which is what we're interested in, it's an index that specifies the color of a given pixel. I'll refer to an arbitrary character as "K".
- "charstream":
- a stream of characters, as in a data file.
- "string":
- a number of continuous characters, anywhere from one to very many characters in length. I can specify an arbitrary string as "[...]K".
- "prefix":
- almost the same as a string, but with the implication that a prefix immediately precedes a character, and a prefix can have a length of zero. So, a prefix and a character make up a string. I will refer to an arbitrary prefix as "[...]".
- "root":
- a single-character string. For most purposes, this is a character, but we may occasionally make a distinction. It is [...]K, where [...] is empty.
- "code":
- a number, specified by a known number of bits, which maps to a string.
- "codestream":
- the output stream of codes, as in the "raster data"
- "entry":
- a code and its string.
- "string table":
- a list of entries; usually, but not necessarily, unique.
That should be enough of that.
LZW is a way of compressing data that takes advantage of repetition of strings in the data. Since raster data usually contains a lot of this repetition, LZW is a good way of compressing and decompressing it.
For the moment, lets consider normal LZW encoding and decoding. GIF's variation on the concept is just an extension from there.
LZW manipulates three objects in both compression and decompression: the charstream, the codestream, and the string table. In compression, the charstream is the input and the codestream is the output. In decompression, the codestream is the input and the charstream is the output. The string table is a product of both compression and decompression, but is never passed from one to the other.
The first thing we do in LZW compression is initialize our string table. To do this, we need to choose a code size (how many bits) and know how many values our characters can possibly take. Let's say our code size is 12 bits, meaning we can store 0->FFF, or 4096 entries in our string table. Lets also say that we have 32 possible different characters. (This corresponds to, say, a picture in which there are 32 different colors possible for each pixel.) To initialize the table, we set code#0 to character#0, code #1 to character#1, and so on, until code#31 to character#31. Actually, we are specifying that each code from 0 to 31 maps to a root. There will be no more entries in the table that have this property.
Now we start compressing data. Let's first define something called the "current prefix". It's just a prefix that we'll store things in and compare things to now and then. I will refer to it as "[.c.]". Initially, the current prefix has nothing in it. Let's also define a "current string", which will be the current prefix plus the next character in the charstream. I will refer to the current string as "[.c.]K", where K is some character. OK, look at the first character in the charstream. Call it P. Make [.c.]P the current string. (At this point, of course, it's just the root P.) Now search through the string table to see if [.c.]P appears in it. Of course, it does now, because our string table is initialized to have all roots. So we don't do anything. Now make [.c.]P the current prefix. Look at the next character in the charstream. Call it Q. Add it to the current prefix to form [.c.]Q, the current string. Now search through the string table to see if [.c.]Q appears in it. In this case, of course, it doesn't. Aha! Now we get to do something. Add [.c.]Q (which is PQ in this case) to the string table for code#32, and output the code for [.c.] to the codestream. Now start over again with the current prefix being just the root Q. Keep adding characters to [.c.] to form [.c.]K, until you can't find [.c.]K in the string table. Then output the code for [.c.] and add [.c.]K to the string table. In pseudo-code, the algorithm goes something like this:
- Initialize string table;
- [.c.] <- empty;
- K <- next character in charstream;
- Is [.c.]K in string table?
- yes:
- [.c.] <- [.c.]K;
- go to [3];
- no:
- add [.c.]K to the string table;
- output the code for [.c.] to the codestream;
- [.c.] <- K;
- go to [3];
It's as simple as that! Of course, when you get to step [3] and there aren't any more characters left, you just output the code for [.c.] and throw the table away. You're done.
Wanna do an example? Let's pretend we have a four-character alphabet: A,B,C,D. The charstream looks like ABACABA. Let's compress it. First, we initialize our string table to: #0=A, #1=B, #2=C, #3=D. The first character is A, which is in the string table, so [.c.] becomes A. Next we get AB, which is not in the table, so we output code #0 (for [.c.]), and add AB to the string table as code #4. [.c.] becomes B. Next we get [.c.]A = BA, which is not in the string table, so output code #1, and add BA to the string table as code #5. [.c.] becomes A. Next we get AC, which is not in the string table. Output code #0, and add AC to the string table as code #6. Now [.c.] becomes C. Next we get [.c.]A = CA, which is not in the table. Output #2 for C, and add CA to table as code#7. Now [.c.] becomes A. Next we get AB, which IS in the string table, so [.c.] gets AB, and we look at ABA, which is not in the string table, so output the code for AB, which is #4, and add ABA to the string table as code #8. [.c.] becomes A. We can't get any more characters, so we just output #0 for the code for A, and we're done. So, the codestream is #0#1#0#2#4#0.
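A compact C rendering of the pseudo-code above, using a linear-search string table for clarity (a real implementation would hash, as the next paragraph notes). Table entries are stored as a prefix code plus one appended character; the names and the choice of a 256-symbol byte alphabet are ours:

```c
#include <stdio.h>

#define TABLE_SIZE 4096
#define ALPHABET   256       /* one root per possible byte value */

/* Each non-root entry is a prefix code plus one appended character ([...]K). */
static int prefix[TABLE_SIZE];
static int suffix[TABLE_SIZE];
static int table_len;

static int find(int pfx, int k)
{
    for (int i = ALPHABET; i < table_len; i++)
        if (prefix[i] == pfx && suffix[i] == k)
            return i;
    return -1;
}

/* One code per call; here we simply print it. */
static void output(int code) { printf("#%d ", code); }

void lzw_compress(const unsigned char *in, size_t n)
{
    table_len = ALPHABET;             /* codes 0..255 are the roots        */
    if (n == 0) return;

    int cur = in[0];                  /* current prefix [.c.], as a code   */
    for (size_t i = 1; i < n; i++) {
        int k   = in[i];
        int hit = find(cur, k);
        if (hit >= 0) {
            cur = hit;                /* [.c.]K is in the table: extend    */
        } else {
            output(cur);              /* emit the code for [.c.]           */
            if (table_len < TABLE_SIZE) {   /* add [.c.]K to the table     */
                prefix[table_len] = cur;
                suffix[table_len] = k;
                table_len++;
            }
            cur = k;                  /* restart with the root K           */
        }
    }
    output(cur);                      /* flush the final prefix            */
}
```

Run on "ABACABA" this emits #65 #66 #65 #67 #256 #65, which is the same sequence as #0#1#0#2#4#0 in the four-letter walkthrough above, just with byte values instead of a 4-symbol alphabet.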
A few words (four) should be said here about efficiency: use a hashing strategy. The search through the string table can be computationally intensive, and some hashing is well worth the effort. Also, note that "straight LZW" compression runs the risk of overflowing the string table - getting to a code which can't be represented in the number of bits you've set aside for codes. There are several ways of dealing with this problem, and GIF implements a very clever one, but we'll get to that.
An important thing to notice is that, at any point during the compression, if [...]K is in the string table, [...] is there also. This fact suggests an efficient method for storing strings in the table. Rather than store the entire string of K's in the table, realize that any string can be expressed as a prefix plus a character: [...]K. If we're about to store [...]K in the table, we know that [...] is already there, so we can just store the code for [...] plus the final character K.
Ok, that takes care of compression. Decompression is perhaps more difficult conceptually, but it is really easier to program.
Here's how it goes: We again have to start with an initialized string table. This table comes from what knowledge we have about the charstream that we will eventually get, like what possible values the characters can take. In GIF files, this information is in the header as the number of possible pixel values. The beauty of LZW, though, is that this is all we need to know. We will build the rest of the string table as we decompress the codestream. The compression is done in such a way that we will never encounter a code in the codestream that we can't translate into a string.
We need to define something called a "current code", which I will refer to as "<code>", and an "old-code", which I will refer to as "<old>". To start things off, look at the first code. This is now <code>. This code will be in the intialized string table as the code for a root. Output the root to the charstream. Make this code the old-code <old>. *Now look at the next code, and make it <code>. It is possible that this code will not be in the string table, but let's assume for now that it is. Output the string corresponding to <code> to the codestream. Now find the first character in the string you just translated. Call this K. Add this to the prefix [...] generated by <old> to form a new string [...]K. Add this string [...]K to the string table, and set the old-code <old> to the current code <code>. Repeat from where I typed the asterisk, and you're all set. Read this paragraph again if you just skimmed it!!! Now let's consider the possibility that <code> is not in the string table. Think back to compression, and try to understand what happens when you have a string like P[...]P[...]PQ appear in the charstream. Suppose P[...] is already in the string table, but P[...]P is not. The compressor will parse out P[...], and find that P[...]P is not in the string table. It will output the code for P[...], and add P[...]P to the string table. Then it will get up to P[...]P for the next string, and find that P[...]P is in the table, as the code just added. So it will output the code for P[...]P if it finds that P[...]PQ is not in the table. The decompressor is always "one step behind" the compressor. When the decompressor sees the code for P[...]P, it will not have added that code to it's string table yet because it needed the beginning character of P[...]P to add to the string for the last code, P[...], to form the code for P[...]P. However, when a decompressor finds a code that it doesn't know yet, it will always be the very next one to be added to the string table. So it can guess at what the string for the code should be, and, in fact, it will always be correct. If I am a decompressor, and I see code#124, and yet my string table has entries only up to code#123, I can figure out what code#124 must be, add it to my string table, and output the string. If code#123 generated the string, which I will refer to here as a prefix, [...], then code#124, in this special case, will be [...] plus the first character of [...]. So just add the first character of [...] to the end of itself. Not too bad. As an example (and a very common one) of this special case, let's assume we have a raster image in which the first three pixels have the same color value. That is, my charstream looks like: QQQ.... For the sake of argument, let's say we have 32 colors, and Q is the color#12. The compressor will generate the code sequence 12,32,.... (if you don't know why, take a minute to understand it.) Remember that #32 is not in the initial table, which goes from #0 to #31. The decompressor will see #12 and translate it just fine as color Q. Then it will see #32 and not yet know what that means. But if it thinks about it long enough, it can figure out that QQ should be entry#32 in the table and QQ should be the next string output. So the decompression pseudo-code goes something like:
- Initialize string table;
- get first code: <code>;
- output the string for <code> to the charstream;
- <old> = <code>;
- <code> <- next code in codestream;
- does <code> exist in the string table?
- yes:
- output the string for <code> to the charstream;
- [...] <- translation for <old>;
- K <- first character of translation for <code>;
- add [...]K to the string table;
- <old> <- <code>;
- no:
- [...] <- translation for <old>;
- K <- first character of [...];
- output [...]K to charstream and add it to string table;
- <old> <- <code>
- go to [5];
Again, when you get to step [5] and there are no more codes, you're finished. Outputting of strings, and finding of initial characters in strings are efficiency problems all to themselves, but I'm not going to suggest ways to do them here. Half the fun of programming is figuring these things out!
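A matching sketch of the decompression loop, including the special case where the incoming code is not yet in the table; it reuses the prefix/suffix table layout from the compression sketch above, and the names are again ours:

```c
#include <stdio.h>

#define TABLE_SIZE 4096
#define ALPHABET   256

static int prefix[TABLE_SIZE];
static int suffix[TABLE_SIZE];
static int table_len;

/* Write the string for `code` to stdout (walking the [...]K links back to
   a root, then unwinding a small stack) and return its first character.  */
static int output_string(int code)
{
    unsigned char stack[TABLE_SIZE];
    int sp = 0;
    while (code >= ALPHABET) {
        stack[sp++] = (unsigned char)suffix[code];
        code = prefix[code];
    }
    putchar(code);                       /* the root character comes first */
    while (sp > 0)
        putchar(stack[--sp]);
    return code;
}

static int first_char(int code)
{
    while (code >= ALPHABET)
        code = prefix[code];
    return code;
}

void lzw_decompress(const int *codes, size_t n)
{
    if (n == 0) return;
    table_len = ALPHABET;

    int old = codes[0];
    output_string(old);                  /* the first code is always a root */

    for (size_t i = 1; i < n; i++) {
        int code = codes[i];
        if (code >= table_len) {
            /* special case: <code> is exactly the entry about to be added,
               i.e. string(<old>) plus the first character of string(<old>) */
            if (table_len < TABLE_SIZE) {
                prefix[table_len] = old;
                suffix[table_len] = first_char(old);
                table_len++;
            }
            output_string(code);
        } else {
            output_string(code);
            if (table_len < TABLE_SIZE) {       /* add [...]K as usual      */
                prefix[table_len] = old;
                suffix[table_len] = first_char(code);
                table_len++;
            }
        }
        old = code;
    }
}
```

Feeding it the codes produced by the compression sketch (e.g. {65, 66, 65, 67, 256, 65}) reproduces "ABACABA", and a run like "QQQ" (codes {81, 256}) exercises the special-case branch.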
Now for the GIF variations on the theme. In part of the header of a GIF file, there is a field, in the Raster Data stream, called "code size". This is a very misleading name for the field, but we have to live with it. What it is really is the "root size". The actual size, in bits, of the compression codes actually changes during compression/decompression, and I will refer to that size here as the "compression size". The initial table is just the codes for all the roots, as usual, but two special codes are added on top of those. Suppose you have a "code size", which is usually the number of bits per pixel in the image, of N. If the number of bits/pixel is one, then N must be 2: the roots take up slots #0 and #1 in the initial table, and the two special codes will take up slots #4 and #5. In any other case, N is the number of bits per pixel, and the roots take up slots #0 through #(2**N-1), and the special codes are (2**N) and (2**N + 1). The initial compression size will be N+1 bits per code. If you're encoding, you output the codes (N+1) bits at a time to start with, and if you're decoding, you grab (N+1) bits from the codestream at a time. As for the special codes: <CC> or the clear code, is (2**N), and <EOI>, or end-of-information, is (2**N + 1). <CC> tells the compressor to re-initialize the string table, and to reset the compression size to (N+1). <EOI> means there's no more in the codestream. If you're encoding or decoding, you should start adding things to the string table at <CC> + 2. If you're encoding, you should output <CC> as the very first code, and then whenever after that you reach code #4095 (hex FFF), because GIF does not allow compression sizes to be greater than 12 bits. If you're decoding, you should reinitialize your string table when you observe <CC>. The variable compression sizes are really no big deal. If you're encoding, you start with a compression size of (N+1) bits, and, whenever you output the code (2**(compression size)-1), you bump the compression size up one bit. So the next code you output will be one bit longer. Remember that the largest compression size is 12 bits, corresponding to a code of 4095. If you get that far, you must output <CC> as the next code, and start over. If you're decoding, you must increase your compression size AS SOON AS YOU write entry #(2**(compression size) - 1) to the string table. The next code you READ will be one bit longer. Don't make the mistake of waiting until you need to add the code (2**compression size) to the table. You'll have already missed a bit from the last code. The packaging of codes into a bitsream for the raster data is a potential stumbling block for the novice encoder or decoder. The lowest order bit in the code should coincide with the lowest available bit in the first available byte in the codestream. For example, if you're starting with 5-bit compression codes, and your first three codes are, say, <abcde>, <fghij>, <klmno>, where e, j, and o are bit#0, then your codestream will start off like:
byte#0: hijabcde
byte#1: .klmnofg

So the differences between straight LZW and GIF LZW are: two additional special codes and variable compression sizes. If you understand LZW, and you understand those variations, you understand it all!
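A small sketch of this LSB-first packing (codes are appended starting at the lowest free bit of the lowest unfilled byte); the function names are ours. With three 5-bit codes it reproduces the two bytes shown just above:

```c
#include <stdint.h>
#include <stdio.h>

static uint32_t bit_buf = 0;   /* pending bits; bit 0 is the oldest bit    */
static int      bit_cnt = 0;

/* Append `width` bits of `code`, least significant bit first. */
static void put_code(uint32_t code, int width, FILE *out)
{
    bit_buf |= code << bit_cnt;
    bit_cnt += width;
    while (bit_cnt >= 8) {            /* flush every completed byte        */
        fputc(bit_buf & 0xFF, out);
        bit_buf >>= 8;
        bit_cnt  -= 8;
    }
}

/* Call once at the end to flush a partially filled final byte. */
static void flush_bits(FILE *out)
{
    if (bit_cnt > 0) {
        fputc(bit_buf & 0xFF, out);
        bit_buf = 0;
        bit_cnt = 0;
    }
}
```

The decoder mirrors this by reading bytes into a buffer and extracting (compression size) bits at a time from the low end.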
Just as sort of a P.S., you may have noticed that a compressor has a little bit of flexibility at compression time. I specified a "greedy" approach to the compression, grabbing as many characters as possible before outputting codes. This is, in fact, the standard LZW way of doing things, and it will yield the best compression ratio. But there's no rule saying you can't stop anywhere along the line and just output the code for the current prefix, whether it's already in the table or not, and add that string plus the next character to the string table. There are various reasons for wanting to do this, especially if the strings get extremely long and make hashing difficult. If you need to, do it.
Hope this helps out. ----steve blackstock
Data Compression Debra A. Lelewer and Daniel S. Hirschberg
Sometimes you can find old O'Reilly graphic formats book on the web too:
Encyclopedia of Graphics File Formats, 2nd Edition
The Complete Reference on CD-ROM with Links to Internet Resources
By James D. Murray, William vanRyper
2nd Edition May 1996
It contains a chapter about graphic algorithms, see TOC
AnandTech - All About Compression, Part I
Queue, Huffman, Compression and Encryption Applets in C++
Forums - Efficient Huffman implementation
...it is certainly possible to construct the Huffman code without an explicit tree. The main observation is that there is a canonical Huffman code entirely defined by the code lengths of the letters. And, there is a simple data structure to compute the latter in linear time after a sorting phase without explicitly constructing the tree. This does save half the memory required for an explicit tree. For example, see my Huffman implementation in the Vcodex package at www.research.att.com/sw/tools.
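The idea in that note, sketched in Python (this is not the Vcodex code, just an illustration): once the code length of every letter is known, canonical codes can be assigned directly from the lengths, with no explicit tree.

def canonical_codes(lengths):
    """lengths: dict mapping symbol -> code length in bits.
    Returns dict mapping symbol -> (code, length) for the canonical Huffman code."""
    codes = {}
    code = 0
    prev_len = 0
    # process symbols in order of (length, symbol); consecutive codes of the same
    # length are consecutive integers, and each length increase shifts the code left
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len
        codes[sym] = (code, length)
        code += 1
        prev_len = length
    return codes

# e.g. lengths {"a": 1, "b": 2, "c": 3, "d": 3} yield the codes 0, 10, 110, 111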
LZW (Lempel-Ziv-Welch) is the most common compression algorithm used in computer graphics. This lossless method of data compression is found in several image file formats, such as GIF and TIFF, and is also part of the V.42bis modem compression standard and PostScript Level 2.
In 1977, Abraham Lempel and Jacob Ziv created the first of what we now call the LZ family of substitutional compressors. The LZ77 family of compression algorithms is commonly found in text compression and archiving programs, such as zoo, lha, pkzip, and arj. The LZ78 family, of which LZW is the best-known member (it is the basis of the Unix compress utility), is more commonly used to compress binary data, such as bitmaps.
In 1984, while working at Sperry (which later became part of Unisys), Terry Welch modified the LZ78 compressor for implementation in high-performance disk controllers. The result was the LZW algorithm that is commonly found today. It is covered by U.S. Patent 4,558,302 (plus its foreign counterparts, issued or pending), all held by Unisys Corporation. That meant that, until the US patent expired in June 2003 (see the discussion of the patent situation below), you formally could not read or write a GIF file without obtaining a license from Unisys :-(. See TF_WP_unisys for more information.
From: [email protected] (John Barkaus)
Newsgroups: comp.graphics
Subject: GIF file format responses 5/5
Date: 21 Apr 89 20:58:01 GMT
Organization: The Cooper Union (NY, NY)
LZW and GIF explained----Steve Blackstock
I hope this little document will help enlighten those of you out there who want to know more about the Lempel-Ziv Welch compression algorithm, and, specifically, the implementation that GIF uses.
Before we start, here's a little terminology, for the purposes of this document:
"character": a fundamental data element. In normal text files this is just a single byte; in raster image data it is an index that specifies the color of a given pixel.
"charstream": a stream of characters, i.e. the uncompressed data.
"string": a number of consecutive characters.
"code": a number, of a certain bit width, that stands for a string.
"codestream": a stream of codes, i.e. the compressed data.
"entry": a code and the string it stands for.
"string table": a list of entries mapping codes to strings.
"root": a single-character string; the codes for all roots are placed in the string table when it is initialized.
That should be enough of that.
LZW is a way of compressing data that takes advantage of repetition of strings in the data. Since raster data usually contains a lot of this repetition, LZW is a good way of compressing and decompressing it.
For the moment, let's consider normal LZW encoding and decoding. GIF's variation on the concept is just an extension from there.
LZW manipulates three objects in both compression and decompression: the charstream, the codestream, and the string table. In compression, the charstream is the input and the codestream is the output. In decompression, the codestream is the input and the charstream is the output. The string table is a product of both compression and decompression, but is never passed from one to the other.
The first thing we do in LZW compression is initialize our string table. To do this, we need to choose a code size (how many bits) and know how many values our characters can possibly take. Let's say our code size is 12 bits, meaning we can store 0->FFF, or 4096 entries in our string table. Let's also say that we have 32 possible different characters. (This corresponds to, say, a picture in which there are 32 different colors possible for each pixel.) To initialize the table, we set code#0 to character#0, code#1 to character#1, and so on, until code#31 to character#31. Actually, we are specifying that each code from 0 to 31 maps to a root. There will be no more entries in the table that have this property.
Now we start compressing data. Let's first define something called the "current prefix". It's just a prefix that we'll store things in and compare things to now and then. I will refer to it as "[.c.]". Initially, the current prefix has nothing in it. Let's also define a "current string", which will be the current prefix plus the next character in the charstream. I will refer to the current string as "[.c.]K", where K is some character. OK, look at the first character in the charstream. Call it P. Make [.c.]P the current string. (At this point, of course, it's just the root P.) Now search through the string table to see if [.c.]P appears in it. Of course, it does now, because our string table is initialized to have all roots. So we don't do anything. Now make [.c.]P the current prefix. Look at the next character in the charstream. Call it Q. Add it to the current prefix to form [.c.]Q, the current string. Now search through the string table to see if [.c.]Q appears in it. In this case, of course, it doesn't. Aha! Now we get to do something. Add [.c.]Q (which is PQ in this case) to the string table for code#32, and output the code for [.c.] to the codestream. Now start over again with the current prefix being just the root Q. Keep adding characters to [.c.] to form [.c.]K, until you can't find [.c.]K in the string table. Then output the code for [.c.] and add [.c.]K to the string table. In pseudo-code, the algorithm goes something like this:
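(Rendered here as a minimal Python sketch; the function name is hypothetical, and the bracketed step numbers in the comments are the ones referred to below.)

def lzw_compress(charstream, alphabet):
    # [1] initialize the string table: one code per root
    table = {ch: i for i, ch in enumerate(alphabet)}
    next_code = len(alphabet)
    codestream = []
    prefix = ""                            # [2] the current prefix [.c.] starts out empty
    for k in charstream:                   # [3] grab the next character K
        if prefix + k in table:            # [4] is [.c.]K in the string table?
            prefix = prefix + k            #     yes: [.c.] <- [.c.]K, back to [3]
        else:
            table[prefix + k] = next_code  #     no: add [.c.]K to the string table,
            next_code += 1
            codestream.append(table[prefix])   # output the code for [.c.],
            prefix = k                     #     and start over with [.c.] <- K
    if prefix:                             # no characters left at [3]: flush [.c.]
        codestream.append(table[prefix])
    return codestream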
It's as simple as that! Of course, when you get to step [3] and there aren't any more characters left, you just output the code for [.c.] and throw the table away. You're done.
Wanna do an example? Let's pretend we have a four-character alphabet: A,B,C,D. The charstream looks like ABACABA. Let's compress it. First, we initialize our string table to: #0=A, #1=B, #2=C, #3=D. The first character is A, which is in the string table, so [.c.] becomes A. Next we get AB, which is not in the table, so we output code #0 (for [.c.]), and add AB to the string table as code #4. [.c.] becomes B. Next we get [.c.]A = BA, which is not in the string table, so output code #1, and add BA to the string table as code #5. [.c.] becomes A. Next we get AC, which is not in the string table. Output code #0, and add AC to the string table as code #6. Now [.c.] becomes C. Next we get [.c.]A = CA, which is not in the table. Output #2 for C, and add CA to table as code#7. Now [.c.] becomes A. Next we get AB, which IS in the string table, so [.c.] gets AB, and we look at ABA, which is not in the string table, so output the code for AB, which is #4, and add ABA to the string table as code #8. [.c.] becomes A. We can't get any more characters, so we just output #0 for the code for A, and we're done. So, the codestream is #0#1#0#2#4#0.
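For what it's worth, the sketch given earlier reproduces exactly this codestream (assuming the hypothetical lzw_compress defined above):

lzw_compress("ABACABA", alphabet="ABCD")   # -> [0, 1, 0, 2, 4, 0]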
A few words (four) should be said here about efficiency: use a hashing strategy. The search through the string table can be computationally intensive, and some hashing is well worth the effort. Also, note that "straight LZW" compression runs the risk of overflowing the string table - getting to a code which can't be represented in the number of bits you've set aside for codes. There are several ways of dealing with this problem, and GIF implements a very clever one, but we'll get to that.
An important thing to notice is that, at any point during the compression, if [...]K is in the string table, [...] is there also. This fact suggests an efficient method for storing strings in the table. Rather than store the entire string in the table, realize that any string can be expressed as a prefix plus a character: [...]K. If we're about to store [...]K in the table, we know that [...] is already there, so we can just store the code for [...] plus the final character K.
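A sketch of that storage idea (the helper names are mine): the compressor's string table can be a hash map keyed by (code for [...], K), so no string is ever stored more than once, and the decompressor can keep the mirror image, code -> (code for [...], K), recovering a full string by walking the prefix links back down to a root.

def add_string(compress_table, prefix_code, k, next_code):
    """Compressor side: store [...]K as (code for [...], final character K) -> new code."""
    compress_table[(prefix_code, k)] = next_code
    return next_code + 1

def expand(decompress_table, code, roots):
    """Decompressor side: rebuild the string for `code` by walking prefix links back to a root."""
    out = []
    while code >= len(roots):                   # codes below len(roots) are the roots themselves
        prefix_code, k = decompress_table[code] # maps code -> (prefix code, final character)
        out.append(k)
        code = prefix_code
    out.append(roots[code])
    return "".join(reversed(out))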
Ok, that takes care of compression. Decompression is perhaps more difficult conceptually, but it is really easier to program.
Here's how it goes: We again have to start with an initialized string table. This table comes from what knowledge we have about the charstream that we will eventually get, like what possible values the characters can take. In GIF files, this information is in the header as the number of possible pixel values. The beauty of LZW, though, is that this is all we need to know. We will build the rest of the string table as we decompress the codestream. The compression is done in such a way that we will never encounter a code in the codestream that we can't translate into a string.
We need to define something called a "current code", which I will refer to as "<code>", and an "old-code", which I will refer to as "<old>". To start things off, look at the first code. This is now <code>. This code will be in the initialized string table as the code for a root. Output the root to the charstream. Make this code the old-code <old>. *Now look at the next code, and make it <code>. It is possible that this code will not be in the string table, but let's assume for now that it is. Output the string corresponding to <code> to the charstream. Now find the first character in the string you just translated. Call this K. Add this to the prefix [...] generated by <old> to form a new string [...]K. Add this string [...]K to the string table, and set the old-code <old> to the current code <code>. Repeat from where I typed the asterisk, and you're all set. Read this paragraph again if you just skimmed it!!!
Now let's consider the possibility that <code> is not in the string table. Think back to compression, and try to understand what happens when you have a string like P[...]P[...]PQ appear in the charstream. Suppose P[...] is already in the string table, but P[...]P is not. The compressor will parse out P[...], and find that P[...]P is not in the string table. It will output the code for P[...], and add P[...]P to the string table. Then it will get up to P[...]P for the next string, and find that P[...]P is in the table, as the code just added. So it will output the code for P[...]P if it finds that P[...]PQ is not in the table.
The decompressor is always "one step behind" the compressor. When the decompressor sees the code for P[...]P, it will not have added that code to its string table yet, because it needed the beginning character of P[...]P to add to the string for the last code, P[...], to form the code for P[...]P. However, when a decompressor finds a code that it doesn't know yet, it will always be the very next one to be added to the string table. So it can guess at what the string for the code should be, and, in fact, it will always be correct. If I am a decompressor, and I see code#124, and yet my string table has entries only up to code#123, I can figure out what code#124 must be, add it to my string table, and output the string. If code#123 generated the string, which I will refer to here as a prefix [...], then code#124, in this special case, will be [...] plus the first character of [...]. So just add the first character of [...] to the end of itself. Not too bad.
As an example (and a very common one) of this special case, let's assume we have a raster image in which the first three pixels have the same color value. That is, my charstream looks like: QQQ.... For the sake of argument, let's say we have 32 colors, and Q is color#12. The compressor will generate the code sequence 12,32,.... (If you don't know why, take a minute to understand it.) Remember that #32 is not in the initial table, which goes from #0 to #31. The decompressor will see #12 and translate it just fine as color Q. Then it will see #32 and not yet know what that means. But if it thinks about it long enough, it can figure out that QQ should be entry#32 in the table and QQ should be the next string output. So the decompression pseudo-code goes something like this:
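(Again rendered as a minimal Python sketch; the names are hypothetical, and step [5], referred to below, is where the next code is read.)

def lzw_decompress(codestream, alphabet):
    table = {i: ch for i, ch in enumerate(alphabet)}  # [1] initialize with the roots
    next_code = len(alphabet)
    codes = iter(codestream)
    old = next(codes)                    # [2] the first code is always a root;
    charstream = [table[old]]            # [3] output its string; [4] it becomes <old>
    for code in codes:                   # [5] grab the next code (stop when there are none left)
        if code in table:                # [6] known code: translate it as usual
            entry = table[code]
        else:                            #     unknown code: it must be <old>'s string
            entry = table[old] + table[old][0]   # plus the first character of that string
        charstream.append(entry)
        table[next_code] = table[old] + entry[0] # add [...]K, K = first char of the translation
        next_code += 1
        old = code
    return "".join(charstream)

# e.g. lzw_decompress([0, 1, 0, 2, 4, 0], "ABCD") gives back "ABACABA"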
Again, when you get to step [5] and there are no more codes, you're finished. Outputting of strings, and finding of initial characters in strings are efficiency problems all to themselves, but I'm not going to suggest ways to do them here. Half the fun of programming is figuring these things out!
Now for the GIF variations on the theme. At the start of the Raster Data stream of a GIF file there is a field called "code size". This is a very misleading name for the field, but we have to live with it. It is really the "root size". The actual size, in bits, of the compression codes changes during compression/decompression, and I will refer to that size here as the "compression size".
The initial table is just the codes for all the roots, as usual, but two special codes are added on top of those. Suppose you have a "code size", which is usually the number of bits per pixel in the image, of N. If the number of bits per pixel is one, then N must be 2: the roots take up slots #0 and #1 in the initial table, and the two special codes take up slots #4 and #5. In any other case, N is the number of bits per pixel, the roots take up slots #0 through #(2**N - 1), and the special codes are (2**N) and (2**N + 1). The initial compression size is N+1 bits per code: if you're encoding, you output the codes (N+1) bits at a time to start with, and if you're decoding, you grab (N+1) bits from the codestream at a time.
As for the special codes: <CC>, the clear code, is (2**N), and <EOI>, end-of-information, is (2**N + 1). <CC> tells the compressor to re-initialize the string table and to reset the compression size to (N+1). <EOI> means there's no more in the codestream. Whether you're encoding or decoding, you should start adding things to the string table at <CC> + 2. If you're encoding, you should output <CC> as the very first code, and then again whenever you reach code #4095 (hex FFF), because GIF does not allow compression sizes greater than 12 bits. If you're decoding, you should reinitialize your string table whenever you observe <CC>.
The variable compression sizes are really no big deal. If you're encoding, you start with a compression size of (N+1) bits and, whenever you output the code (2**(compression size) - 1), you bump the compression size up one bit, so the next code you output will be one bit longer. Remember that the largest compression size is 12 bits, corresponding to a code of 4095. If you get that far, you must output <CC> as the next code, and start over. If you're decoding, you must increase your compression size AS SOON AS YOU write entry #(2**(compression size) - 1) to the string table; the next code you READ will be one bit longer. Don't make the mistake of waiting until you need to add the code (2**(compression size)) to the table; you'll have already missed a bit from the last code.
The packaging of codes into a bitstream for the raster data is a potential stumbling block for the novice encoder or decoder. The lowest order bit in the code should coincide with the lowest available bit in the first available byte in the codestream. For example, if you're starting with 5-bit compression codes, and your first three codes are, say, <abcde>, <fghij>, <klmno>, where e, j, and o are bit #0, then your codestream will start off like:
byte#0: hijabcde
byte#1: .klmnofg
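The same packaging in a small Python sketch (the helper name is hypothetical; once the compression size starts growing, each code simply carries its own width):

def pack_codes(codes_and_widths):
    """codes_and_widths: iterable of (code, width in bits).
    The lowest-order bit of each code lands in the lowest free bit of the stream."""
    out = bytearray()
    acc = 0       # bit accumulator
    nbits = 0     # number of valid bits currently held in acc
    for code, width in codes_and_widths:
        acc |= code << nbits
        nbits += width
        while nbits >= 8:
            out.append(acc & 0xFF)   # flush the lowest eight bits as the next byte
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)       # final partial byte, padded with zero bits
    return out

# three 5-bit codes produce exactly the two bytes shown above:
# pack_codes([(a, 5), (b, 5), (c, 5)]) -> byte#0 = hijabcde, byte#1 = .klmnofg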
So the differences between straight LZW and GIF LZW are: two additional special codes and variable compression sizes. If you understand LZW, and you understand those variations, you understand it all!
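To make the GIF-specific bookkeeping described above concrete, here is a rough sketch of it (the names are mine, not from the GIF specification):

def gif_code_parameters(code_size):
    """code_size is the "code size" (really the root size) N from the Raster Data stream."""
    n = max(code_size, 2)            # a 1 bit/pixel image still uses N = 2
    clear_code = 1 << n              # <CC>  = 2**N
    end_of_info = clear_code + 1     # <EOI> = 2**N + 1
    first_free = clear_code + 2      # first slot available for new strings
    initial_width = n + 1            # initial compression size, in bits per code
    return clear_code, end_of_info, first_free, initial_width

def bump_width_if_needed(last_written, width):
    """Decoder-side rule as described above: call right after writing entry
    `last_written` to the string table; never grow past 12 bits."""
    if last_written == (1 << width) - 1 and width < 12:
        return width + 1
    return width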
Just as sort of a P.S., you may have noticed that a compressor has a little bit of flexibility at compression time. I specified a "greedy" approach to the compression, grabbing as many characters as possible before outputting codes. This is, in fact, the standard LZW way of doing things, and it will yield the best compression ratio. But there's no rule saying you can't stop anywhere along the line and just output the code for the current prefix, whether it's already in the table or not, and add that string plus the next character to the string table. There are various reasons for wanting to do this, especially if the strings get extremely long and make hashing difficult. If you need to, do it.
Hope this helps out. ----steve blackstock
I recommend first checking DataCompression.info - Tutorials, Reference, Presentations
The Mandala Centre - Compression and Security - One to one compression FAQ
One to One Compression -- This site discusses a characteristic of some compression algorithms that the author refers to as One to One compression. In a nutshell, this property means that for any file X, F( F'( X ) ) == X. (F is either the compressor or decompressor, and F' is its opposite number.) This is definitely not the case for most conventional compression algorithms.
Bijective Arithmetic Encoding with Really Good End Treatment
DataCompression.info - Adaptive Huffman Coding
... Adaptive Huffman Encoding, Rate, A library to perform adaptive ... an implementation of Vitter's dynamic Huffman compressor, adapted so that it is bijective. ...
- Milwaukee Journal Sentinel report about Phil Katz at the Wayback Machine (archived December 21, 2007)
- Death notice for Phillip Katz at the Wayback Machine (archived September 11, 2007) Milwaukee Journal Sentinel.
- "Compression", a video documentary about the arc vs zip controversy (WMV format) from BBS: The Documentary
- Katz bio from BBS: The Documentary
- Hanging out with Phil Katz
- The 411 on ZIP files: How a standard was born
GIFs and patents
The patent for the LZW compression algorithm has been held by Unisys since 1985. CompuServe's GIF format uses this algorithm. In 1994, Unisys began enforcing its patent rights by asking for license fees from developers of products that read or write GIFs. This announcement took developers by surprise and caused quite a stir on the Internet. There were rumours that Unisys might charge web developers for the usage of GIFs on their sites. At the same time, it was argued that GIFs are the products of LZW compression and the patent does not extend to the usage of the end product. Actually, web developers had nothing to fear, since Unisys planned to collect license fees only from software companies whose products employ the LZW algorithm.
Web developers should not be worried. They are free to use GIFs on their web sites. However, if you have developed software that creates or modifies GIFs, you would have to pay licensing fees to Unisys.
The business acumen of the people at Unisys has to be admired. It seems that they had waited for GIFs to become popular and beneficial (from a web developer's point of view) before enforcing their patent rights. However, there was an interesting and fortunate (?) side effect of this story: it led to the development of the PNG file format. PNG is a much better and more versatile image format than GIF, and for many kinds of images it beats JPEG as well; it has most of the bells and whistles of both of these formats.
At the time of writing, browser support for PNG is still quite patchy, though the future does look bright.
LZW compression: conjugations patented
Google Groups View Thread LZW Patent Expiry
Can the LZW algorithm be used freely in all countries after the US patent expiry in June 2003? I have gathered this information from newsgroups:
1. The main US patent expires on 20 June 2003. The valid time cannot be extended, if they do not make adjustments/extensions to the patent. US patent: http://l2.espacenet.com/espacenet/viewer?PN=US4558302&CY=ep&LG=en&DB=EPD
2. There are other US patents covering this field, but these are filed in 1999, and are therefore not enforceable. Are there other US patents?
3. Unisys claim that they have patents in Canada, France, Germany, U.K. and Italy, but they have never sued any company in a European court, though they have had chances. There are some references in the US patent: CA1223965, DE3476617D, EP0129439. Are these patents enforceable? Are there other patents?
4. Patents are pending in Japan. The US patent refers to these: JP2610084B2, JP5068893B, JP60116228, JP7249996, JP8237138. Do we have to worry about these?
Can anyone confirm the information above and answer the questions? Does this mean that we can use the LZW algorithm freely in all countries after June 2003?
The Last but not Least: Technology is dominated by two types of people: those who understand what they do not manage and those who manage what they do not understand. ~Archibald Putt, Ph.D.
Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.
This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links as it develops like a living tree...
You can use PayPal to buy a cup of coffee for the authors of this site.
Disclaimer:
The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense, so you need to be aware of Google's privacy policy. If you do not want to be tracked by Google, please disable JavaScript for this site. This site is perfectly usable without JavaScript.
Last modified: June 07, 2018