Huffman coding

In computer science, Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol (such as a character in a file) where the variable-length code table has been derived in a particular way based on the estimated probability of occurrence for each possible value of the source symbol. It was developed by David A. Huffman as a Ph.D. student at MIT in 1952, and published in A Method for the Construction of Minimum-Redundancy Codes.
Huffman coding uses a specific method for choosing the representation for each symbol, resulting in a prefix-free code (that is, the bit string representing some particular symbol is never a prefix of the bit string representing any other symbol) that expresses the most common characters using shorter strings of bits than are used for less common source symbols. Huffman was able to design the most efficient compression method of this type: no other mapping of individual source symbols to unique strings of bits will produce a smaller average output size when the actual symbol frequencies agree with those used to create the code. (Huffman coding is such a widespread method for creating prefix-free codes that the term "Huffman code" is widely used as a synonym for "prefix-free code" even when such a code was not produced by Huffman's algorithm.)
For a set of symbols with a uniform probability distribution and a number of members which is a power of two, Huffman coding is equivalent to simple binary block encoding.
Assertions of the optimality of Huffman coding should be phrased carefully, because its optimality can sometimes accidentally be overstated. For example, arithmetic coding ordinarily has better compression capability, because it does not require the use of an integer number of bits for encoding each source symbol. LZW coding can also often be more efficient, particularly when the input symbols are not independently distributed, because it does not depend on encoding each input symbol one at a time (instead, it batches up a variable number of input symbols into each encoded syntax element). The efficiency of Huffman coding also depends heavily on having a good estimate of the true probability of the value of each input symbol.
History
In 1951, David Huffman and his MIT information theory classmates were given the choice of a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on the problem of finding the most efficient binary code. Huffman, unable to prove any codes were the most efficient, was about to give up and start studying for the final when he hit upon the idea of using a frequency-sorted binary tree, and quickly proved this method the most efficient.
In doing so, the student outdid his professor, who had worked with information theory inventor Claude Shannon to develop a similar code. Huffman avoided the major flaw of Shannon-Fano coding by building the tree from the bottom up instead of from the top down.
Problem definition
Informal description
Given. A set of symbols and their costs.
Find. A prefix-free binary character code (a set of codewords) with minimum weighted path length.
Note1. A code wherein each character is represented by a unique binary string (codeword) is called a binary character code.
Note2. A prefix-free code is a code having the property that no codeword is a prefix of any other codeword.
Note3. In the standard Huffman coding problem, it is assumed that each symbol in the set that the code words are constructed from has an equal cost to transmit: a code word whose length is N digits will always have a cost of N, no matter how many of those digits are 0s, how many are 1s, etc. When working under this assumption, minimizing the total cost of the message and minimizing the total number of digits are the same thing.
Formalized description
Input.
Alphabet A = {a[1], a[2], ..., a[n]} which is the symbol alphabet of size n.
Set C = {c[1], c[2], ..., c[n]} which is the set of the symbol costs, i.e., c[i] = cost (a[i]), 1 <= i <= n.
Output.
Code H(A,C) = {h[1], h[2], ..., h[n]}, which is the set of (binary) codewords, where h[i] is the codeword of a[i], 1 <= i <= n.
Goal.
Let S(H) = sum (c[i] * length (h[i])) (1 <= i <= n) be the weighted path length of code H. The requirement is that S(H) <= S(T) for any code T(A,C).
Samples
Sample 1
Input
  Symbol    a      b      c      d      e      f      g      h      i
  Cost      10     15     5      15     20     5      15     30     5
Output
  Codeword  000    010    0010   011    111    00110  110    10     00111
Weighted path length
  10*3 + 15*3 + 5*4 + 15*3 + 20*3 + 5*5 + 15*3 + 30*2 + 5*5 = 355
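The sample can be checked mechanically. The short sketch below verifies that the codewords above form a prefix-free set and that their weighted path length is 355:

```python
costs = [10, 15, 5, 15, 20, 5, 15, 30, 5]
codes = ["000", "010", "0010", "011", "111", "00110", "110", "10", "00111"]

# Prefix-free: no codeword is a prefix of any other codeword.
prefix_free = all(not y.startswith(x)
                  for x in codes for y in codes if x != y)

# Weighted path length: sum of cost * codeword length.
wpl = sum(c * len(h) for c, h in zip(costs, codes))

print(prefix_free, wpl)  # True 355
```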
Basic technique
The technique works by creating a binary tree of nodes. These can be stored in a regular array, the size of which depends on the number of symbols (N). A node can be either a leaf node or an internal node. Initially, all nodes are leaf nodes, each containing the symbol itself, the weight (frequency of appearance) of the symbol and, optionally, a link to a parent node, which makes it easy to read the code (in reverse) starting from a leaf node. Internal nodes contain a weight, links to two child nodes and an optional link to a parent node. As a common convention, bit '0' represents following the left child and bit '1' represents following the right child. A finished tree has N leaf nodes and N−1 internal nodes.
A fast way to create a Huffman tree is to use the heap data structure, which keeps the nodes in partially sorted order according to a predetermined criterion. In this case, the node with the lowest weight is always kept at the root of the heap for easy access.
Creating the tree:
 Start with as many leaves as there are symbols.
 Push all leaf nodes into the heap.
 While there is more than one node in the heap:
 Remove the two nodes with the lowest weight from the heap.
 Create a new internal node, using the two removed nodes as children and the sum of their weights as its weight, and put it into the tree, noting its location.
 If parent links are used, set each of the two children to point at the new internal node as its parent.
 Push the new internal node into the heap.
 The remaining node is the root node.
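The steps above can be sketched with Python's heapq module, which provides the min-heap described earlier (the symbol weights in the example are illustrative):

```python
import heapq
from itertools import count

def huffman_codes(weights):
    """Build a Huffman code from a {symbol: weight} mapping."""
    # Each heap entry is (weight, tie-break, node); the counter keeps
    # comparisons away from the nodes themselves when weights are equal.
    tiebreak = count()
    heap = [(w, next(tiebreak), sym) for sym, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Remove the two nodes with the lowest weight from the heap ...
        w1, _, n1 = heapq.heappop(heap)
        w2, _, n2 = heapq.heappop(heap)
        # ... and push a new internal node (a pair of children) whose
        # weight is the sum of the two weights.
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (n1, n2)))
    _, _, root = heap[0]  # the remaining node is the root

    # Read the codewords off the tree: '0' for the left child,
    # '1' for the right child.
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):       # internal node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                             # leaf node
            codes[node] = prefix or "0"   # lone-symbol edge case
    walk(root, "")
    return codes

codes = huffman_codes({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5})
```

With these weights the most frequent symbol, "a", receives a 1-bit codeword, while the rarest symbols, "e" and "f", receive 4-bit codewords.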
Note: it may be beneficial to generate codes with minimum length variance, since in the worst case, when several long codewords have to be transmitted in a row, this may lead to unwanted side effects (for example, with a buffer and a constant-rate transmitter, it increases the minimum buffer size). To reduce variance, every newly generated node should be favored among nodes of the same weight and placed as high as possible. This balances codeword lengths while keeping the same average rate.
Main properties
The frequencies used can be generic ones for the application domain that are based on average experience, or they can be the actual frequencies found in the text being compressed. (This variation requires that a frequency table or other hint as to the encoding must be stored with the compressed text; implementations employ various tricks to store these tables efficiently.)
Huffman coding is optimal when the probability of each input symbol is a negative power of two. Prefix-free codes tend to have slight inefficiency on small alphabets, where probabilities often fall between these optimal points. Expanding the alphabet size by coalescing multiple symbols into "words" before Huffman coding can help a bit. The worst case for Huffman coding can happen when the probability of a symbol exceeds 2^{−1} = 0.5, making the upper limit of inefficiency unbounded. To prevent this, run-length encoding can be used to preprocess the symbols.
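This can be illustrated numerically (the probability distributions below are illustrative). When every symbol probability is a negative power of two, the ideal codeword lengths −log2(p) are integers, and Huffman coding attains the source entropy exactly; a highly skewed source shows the gap at the other extreme, since any prefix-free binary code must spend at least one whole bit per symbol:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * log2(p) for p in probs)

# Dyadic probabilities: Huffman assigns lengths of exactly -log2(p).
dyadic = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]
avg_length = sum(p * l for p, l in zip(dyadic, lengths))
print(avg_length, entropy(dyadic))  # 1.75 1.75 -- no inefficiency

# Skewed source: Huffman still needs 1 bit per symbol, but the
# entropy is only about 0.081 bits per symbol.
skewed = [0.99, 0.01]
print(entropy(skewed))
```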
Extreme cases of Huffman codes are connected with the Fibonacci and Lucas numbers and the Wythoff array (http://mathworld.wolfram.com/WythoffArray.html).
Arithmetic coding produces slight gains over Huffman coding, but in practice these gains have seldom been large enough to offset arithmetic coding's higher computational complexity and patent royalties. (As of November 2001, IBM owns patents on the core concepts of arithmetic coding in several jurisdictions.)
Variations
Adaptive Huffman coding
A variation called adaptive Huffman coding calculates the frequencies dynamically based on recent actual frequencies in the source string. This is somewhat related to the LZ family of algorithms.
Length-limited Huffman coding
Length-limited Huffman coding is a variant where the goal is still to achieve a minimum weighted path length, but there is an additional restriction that the length of each codeword must be less than a given constant.
Huffman template algorithm
Most often, the weights used in implementations of Huffman coding represent numeric probabilities, but the algorithm given above does not require this; it requires only a way to order weights and to add them. The Huffman template algorithm enables one to use non-numerical weights (costs, frequencies).
n-ary Huffman coding
The n-ary Huffman algorithm uses the {0, 1, ..., n − 1} alphabet to encode messages and builds an n-ary tree.
Huffman coding with unequal letter costs
As noted above, the standard Huffman coding problem assumes that every digit of a code word costs the same to transmit, so minimizing the total cost of a message is the same as minimizing its total number of digits.
Huffman coding with unequal letter costs is the generalization in which this assumption no longer holds: the letters of the encoding alphabet may have non-uniform costs, due to characteristics of the transmission medium. An example is the encoding alphabet of Morse code, where a 'dash' takes longer to send than a 'dot', and therefore the cost of a dash in transmission time is higher. The goal is still to minimize the weighted average codeword length, but it is no longer sufficient just to minimize the number of symbols used by the message.
Applications
Arithmetic coding can be viewed as a generalization of Huffman coding. Although arithmetic coding offers better compression performance than Huffman coding, Huffman coding is still in wide use because of its simplicity, high speed and lack of encumbrance by patents.
Huffman coding today is often used as a "back end" to some other compression method. DEFLATE (PKZIP's algorithm) and multimedia codecs such as JPEG and MP3 have a front-end model and quantization stage followed by Huffman coding.
External links
 Huffman's original article: D.A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes (http://compression.graphicon.ru/download/articles/huff/huffman_1952_minimumredundancycodes.pdf)", Proceedings of the I.R.E., September 1952, pp. 1098–1102
 Background story: Profile: David A. Huffman (http://www.huffmancoding.com/david/scientific.html), Scientific American, September 1991, pp. 54–58
 n-ary Huffman Template Algorithm (http://alexvn.freeservers.com/s1/huffman_template_algorithm.html)
 Huffman codes' connection with Fibonacci and Lucas numbers (http://mathforum.org/discuss/sci.math/t/632220)
 Fibonacci connection between Huffman codes and Wythoff array (http://groups.google.com/groups?selm=2ss5aqF1oc42cU1%40uniberlin.de)
 Sloane A098950 (http://www.research.att.com/projects/OEIS?Anum=A098950): minimizing k-ordered sequences of maximum-height Huffman trees
 Computing Huffman codes on a Turing Machine (http://semillon.wpi.edu/~aofa/AofA/msg00040.html)
 Mordecai J. Golin, Claire Kenyon, Neal E. Young, "Huffman coding with unequal letter costs (http://www.cs.ust.hk/faculty/golin/pubs/LOP_PTAS_STOC.pdf)", STOC 2002 (http://www.informatik.unitrier.de/~ley/db/conf/stoc/stoc2002.html): 785–791
 Huffman Coding, implemented in Python (http://gumuz.looze.net/wordpress/index.php/archives/2004/11/25/huffmanencoding/)