About site: Algorithms/Compression - Practical Huffman Coding
Return to Computers
  About site: http://www.compressconsult.com/huffman/

Title: Algorithms/Compression - Practical Huffman Coding A tutorial by Michael Schindler.
Sparse_Matrix_Compression_Schemes Pseudocode for a spare matrix compression algorithm.

SPIHT The Set Partitioning In Hierarchical Trees algorithm for image and video compression. Documentation and resources.

Tutorial_on_Image_Compression A tutorial on Image Compression, with step by step procedural examples and list of references, along with sample code.

Vcodex Video and image coding tutorials, links, resources and research.

Alternate_Perspective_Online Specialize in computer animation and digital special effects for film and broadcast television. Network render farm services and web authoring.

Image_Refinery_Render_Farm_Services A digital design house, they also offer a render farm supporting Lightwave. How to contact them for more information.


  Alexa statistic for http://www.compressconsult.com/huffman/





Get your Google PageRank






Please visit: http://www.compressconsult.com/huffman/


  Related sites for http://www.compressconsult.com/huffman/
    MTP_Grafx_Render_Farm Supporting 3D Studio Max and Lightwave. Their capabilities and other information. Flat rate charged per frame for broadcast quality D1 renderings.
    RenderNOW!_-_Render_farm Computer animation render facilities, custom computer animation creation and video production.
    ResPower Offers 24/7 self-service Internet 3D rendering on 400+ computers (1,500 GHz) for Lightwave, Maya, 3d Studio Max, Brazil, mentalRay and final Render.
    SAEC_/_Kinetic_Vision_-_Design_Visualization Offering 3D Studio MAX network render and videotape transfer services. Please contact for more information.
    American_Design_Drafting_Association The ADDA is an individual membership society for the design drafting community across all industries.
    CAD_Society A not-for-profit organization which has the goal of encouraging open communication among those in the CAD industry.
    Enter An open source Enterprise 128 emulator for Windows.
    Enterprise_64/128_Depository Enterprise 64/128 depository and Emulator Project.
    EPTE Enterprise Tape Emulator. EPTE is a free virtual tape recorder for the Enterprise 128k home computer.
    Zilog_Realms EP32 an open source 128 emulator for Windows based on Kevin Thackers "enter" emulator, and a small collection of demos and programs.
    Bioeddie\'s_Psion_emulators Various emulators for Psion.
    pocketemulator_com Emulators for Pocket PC.
    Adobe_Systems Adobe Type Library offers a collection of quality fonts from internationally renowned foundries, as well as individual type designers and distinguished design studios.
    Agfa_Monotype_Corporation Purchase libraries of typefaces from designers, and independent foundries. Also offers custom font design and font technology development.
    Altemus_Collection Thousands of decorative and dingbat designs, fonts for Mac and PC. Over 50 unique fonts; stars, bursts, borders, sports, and spirals printers cuts.
    Berthold_Font_Store A source of high-quality fonts for design professionals and other font users available for immediate purchase and download.
    BuyFonts Vendor of TrueType and PostScript fonts, including Braille, OCR, bar codes, phonetic, mathematical, and symbolic fonts.
    Chank The Chank Company manufactures and distributes a large selection of freeware and commercial fonts by type designers from all over the world.
    Download-fonts_com Download mac fonts and windows fonts. Browse and download truetype fonts and postscript fonts.
    Educational_Fontware,_Inc Handwriting fonts for teachers and homeschoolers. Very comprehensive list, many variations (dotted, outlines, arrows, rules) of each family (D'Nealian, Zaner-Bloser, Getty Dubay).
    Faces Offers fonts from a number of foundries, royalty free photographs, picture fonts and clip art for download or on CD.
    Font_Technology_-_Bitstream Bitstream is the leading developer of font technology, digital fonts, and custom typeface designs for a wide variety of markets.
    Font_Works_Digital_Retail UK's principal font supplier representing over 100 foundries worldwide,
    Fonts4eva Free and for-sale fonts and software which you can purchase to make your own fonts.
    FontShop Offers font search by name, designer or foundry. Also provides custom creation and conversions, Font Feed, blog and magazine.
    GraphicObsession Royalty free images, footage and fonts. Search, compare, test, purchase and download thousands of images and fonts.
    Graphx_Edge Windows TrueType downloadable font collections includes script fonts, headline, unique and dingbat fonts.
    ITC_Fonts The site of the International Typeface Corporation. A comprehensive typeface library,up-to-date and accurate information on new fonts, and an interactive online typesetter to view fonts with. A lot of
    Kanecal Offers symbols used in drafting/blueprint callouts. Geometric Dimensioning and Tolerancing (GD&T) symbols (with Feature Control Frames) plus Statistical Process Control (SPC) symbols. Barcodes for
    MICR_Encoding_Fonts__com Distributors of MICR E-13B and CMC-7 fonts for bank check printing. Fonts are available as PostScript, TrueType and PCL for Windows, Macintosh and UNIX.
    MyFonts E-commerce offering a variety of choices, as well as an area to upload images. Includes information about designers and foundries.
    Nextag_com-_Fonts_Software Several font collections on CDs
    Phil\'s_Fonts Source for type and graphic software. Distributor of Postscript and TrueType fonts, from more than 75 foundries around the world also specializes in creating custom fonts.
    School_Fonts Font sets compatible with the D'Nealian, and Zaner-Bloser methods of learning handwriting. Offers, cursive handwriting, Block letters, and kindergarten dingbats. Exercise sheets, free samples and comp
    Terrapin_Solutions_Limited UK based company supplies typefaces from all of the major font manufacturers as well as a range of more modern and experimental designs from around the world. Language fonts, bar code and OCR fonts ar
    Typos_Type_&_Image_Design Provides fonts, typographic and corporate identity services, updates and licensing.
    ACE_32 The Jupiter ACE Emulator for MSDOS
    Spectravideo_software_tools Support site for the SVI-318/328 emulator by Jimmy MÃ¥rdell. Features a cas to wave and wave to cas converter, a frontend, and a manager for the cassette files.
    SVI-318/328_emulator An open source SVI-318/328 emulator for DOS.
    FLEX-ES System/390 on Intel-based servers.
This is now2007.com cache of m/ as retrieved on 2008.12.03 now2007.com's cache is the snapshot that we took of the page as we crawled the web. The page may have changed since that time.
Practical Huffman Coding

Practical Huffman coding

by Michael Schindler of >data Compression Consulting


Welcome to Compression Consulting's
huffman coding hints.  This page assumes that you are familiar with
huffman coding.  It will focus on practical issues you need to know
for writing a fast and reasonable memory efficient huffman coder.
It will not cover the essentials, the history, proof of optimality
(within the constraints) and other things you can find in textbooks.
More information about huffman coding can be found in 
The Data Compression Library.



This page is now mentioned in the
comp.compression
FAQ (part2) and on the compression
pointers page as well as in The Data Compression Library.
For a nice animated source code see the University of Western Australia
algorithms course


This is basically a cookbook recipe collection which fits most peoples
needs.  If you are not sure whether this fits your needs please contact me
at michael@compressconsult.com .
There is intentionally no source code as Huffman coding is a popular student exercise.

Table of content

Example Distribution and Tree Notation Used in this Text
Huffman Codes
Canonical Huffman Codes
Code Construction
Maximum Length of a Huffman Code
Calculating Codelengths for a Distribution
Encoding
Decoding


Example Distribution and Tree Notation Used in this Text


Throughout the text I will use the following probabilities for the
symbols A-H:

A: 3/28 B: 1/28 C: 2/28 D: 5/28
E: 5/28 F: 1/28 G: 1/28 H: 10/28



I will write trees in a bracket syntax, each bracket pair encloses
a subtree; subtrees are written left to right.
Example: ((a,b),c) denotes a binary tree where the left child of
the root is a node with a as left child, b as right child. c is the right
child of the root.  If the reader is not familiar with that I suggest
using paper and pencil to draw the tree.


Conventions -
Huffman Codes -
Canonical Huffman Codes -
Code Construction -
Maximum Length -
Calculating Codelengths -
Encoding -
Decoding


Huffman Codes


Textbooks usually will not tell you, but typically there is more than one
huffman tree for a distribution, so there is more than one huffman
code. These codes may even differ in length.
The following are huffman trees for the example distribution:
((((B,F),A),E),(((G,C),D),H))  Height 4
((D,((F,C),(B,G))),(H,(E,A)))  Height 4
((((C,((F,G),B)),A),(E,D)),H)  Height 6


Even if the codelengths are fixed (these lengths correspond with the
first tree) there is more than one code assignment:

   treecode code2 code3 canonical
A  001      111   100   010
B  0000     1101  0011  0000
C  1001     0010  1011  0001
D  101      000   000   011
E  01       01    11    10
F  0001     1100  1010  0010
G  1000     0011  0010  0011
H  11       10    01    11



The first code was derived directly from the tree; code2, code3 and
the code labeled canonical are some other prefix codes with the same
length.


Conventions -
Huffman Codes -
Canonical Huffman Codes -
Code Construction -
Maximum Length -
Calculating Codelengths -
Encoding -
Decoding


Canonical Huffman Codes

The name canonical neither comes from the copy company nor from
church; here and in maths canonical denotes one of many alternatives
that can be distinguished since it follows simple rules.  The rules
used here were: 
 Shorter codes have a numerically (if filled with zeros to the right)
higher value than longer codes.
 within the same length, numerical values increase with alphabet.

It should also be mentioned that the codelengths are the same
as with huffman codes since these are canonical huffman codes. So there
is no loss in compression when using these codes.

Advantages

There are some advantages of using these (or similar) rules and
produce a canonical huffman code:
 The first rule guarantees that no more than the
  ceil(log2(alphabetsize)) rightmost bits of the code can differ
  from zero - see below.
 The first rule also allows an efficient decoding - see below.
 Both rules together allow a complete reconstruction of the code
  knowing only the codelengths for each symbol.



Conventions -
Huffman Codes -
Canonical Huffman Codes -
Code Construction -
Maximum Length -
Calculating Codelengths -
Encoding -
Decoding


Code Construction

I assume you know the codelength for each symbol and how often
each length occurs. A method to do this will be given later.
To assign codes you need only a single pass over the symbols,
but before doing that you need to calculate where the codes for
each codelength start. To do so consider the following:
The longest code is all zeros and each code differs from the previous
by 1 (I store them such that the last bit of the code
is in the least significant bit of a byte/word).


In the example this means: 
codes with length 4 start at 0000
codes with length three start at (0000+4*1)>>1 = 010. There are 4 codes
 with length 4 (that is where the 4 comes from), so the next length 4
 code would start at 0100. But since it shall be a length 3 code we remove
 the last 0 (if we ever remove a 1 there is a bug in the codelengths!).
codes with length 2 start at (010+2*1)>>1 = 10.
codes with length 1 start at (10+2*1)>>1 = 10.
codes with length 0 start at (10+0*1)>>1 = 1. If anything else than
 1 is start for the codelength 0 there is a bug in the codelengths!



Then visit each symbol in alphabetical sequence (to ensure the second
condition) and assign the startvalue for the codelength of that symbol
as code to that symbol. After that increment the startvalue for that
codelength by 1. That's it.


This construction also ensures the claimed property, namely that only
the ceil(log2(alphabetsize)) rightmost bits can be nonzero.
Proof: The following is valid for all symbols: The code has been
incremented by one for each symbol with a larger or equal codelength.
There can be at most alphabetsize-1 such symbols, so it has been
incremented at most alphabetsize-1 times. This maximum number fulfills
the claimed property.


Conventions -
Huffman Codes -
Canonical Huffman Codes -
Code Construction -
Maximum Length -
Calculating Codelengths -
Encoding -
Decoding


Maximum Length of a Huffman Code

Apart from the ceil(log2(alphabetsize)) boundary for the nonzero bits
in this particular canonical huffman code it is useful to know the
maximum length a huffman code can reach. In fact there are two limits
which must both be fulfilled.


No huffman code can be longer than alphabetsize-1.  Proof: it is
impossible to construct a binary tree with N nodes and more than N-1
levels.


The maximum length of the code also depends on the number of samples you
use to derive your statistics from; the sequence is as follows (the
samples include the fake samples to give each symbol a nonzero
probability!):

 #samples    maximum codelength
 #samples    maximum codelength
   1....2     1
  21...33     6
   3....4     2
  34...54     7
   5....7     3
  55...88     8
   8...12     4
  89..143     9
  13...20     5
 144..232    10

the width of each range is the lower end of the previous range,
so the next would be: 233 to 233+144-1=376 samples give a maximum
codelength of 11.


An example for a tree with depth 6 and 21 samples (count for
each symbol given) would be ((((((1,1),1),2),3),5),8).
(oh no - Fibonacci numbers again :)


Conventions -
Huffman Codes -
Canonical Huffman Codes -
Code Construction -
Maximum Length -
Calculating Codelengths -
Encoding -
Decoding


Calculating Codelengths for a Distribution

There are several methods to calculate codelengths efficiently.
Either do as described below or look at

http://www.cs.mu.oz.au/~alistair/abstracts/inplace.html to give
you just a few ideas.


Textbooks usually describe huffman tree construction similar to the
following: 
 Setup: make a heap containing the symbols, lowest probable
 symbol on top.
 Loop: take the 2 least probable nodes out of the heap, form a
 new node having the two nodes used to form it as children. Insert
 the new node back into the heap. Repeat until only one node is left
 (or alphabet-1 times; this is the same.)
 that node is the root.



Practical purposes often demand a separation of intermediate and leaf
nodes during that process. If you do that store the leaf nodes in an
array of size alphabetsize-1 and fill it from left to right.
Since intermediate nodes are constructed in the sequence they are used
you just need two pointers; one pointing to the next unfilled place and
one pointing to the next unused intermediate node. You don't have to do
the heap sink down that often this way; you just compare the top of the
heap containing the symbols with the unused intermediate node.
If you like you could also sort the symbols by probability first and
then use them in the sequence of increasing probability. The result is
the same; if you sort first you might use quicksort, if you keep the heap
idea you (implicitly) use heapsort to sort the symbols.


If you have sorted probabilities you don't need the sorting step and
complexity for code generation will drop from O(n log(n)) to O(n).


If you are short of memory organize the data as follows: During the loop
phase store which intermediate node is the parent only; you may use
the space to store the huffman code lengths later in the leaves.
You also need to provide the space for alphabetsize-1 intermediate nodes;
simply use the place in the leaves to store the huffman code endings
later. So you don't need any extra space that depends on the
alphabetsize, you can all do it in the space you need to store the codes
later.


After that treegeneration phase set the depth of the last intermediate
node (root) to 0. Then loop over the intermediate nodes from the next
to last created to the first created, replacing the parent index
with 1 + the value you find at the parent index; this is the depth of
that node.


Proceed with the leaves; fill the length field with 1 + the value at the
parent instead of the parent index; this is the depth (codelength) for
that leaf. Count how often each length occurs during that loop.
If you process the leaves in sequence of increasing or decreasing
probability you can reuse the space of the intermediate nodes for the
counters provided you have an extra space of one word (for the
first/currently processed codelength counter). This is an additional
benefit from doing the probability sort first and not using a heap
for the symbols. Warning: zero each counter before
incrementing the first time and not before starting this loop, the
treedepths that occupy the same place are used in the loop!


Conventions -
Huffman Codes -
Canonical Huffman Codes -
Code Construction -
Maximum Length -
Calculating Codelengths -
Encoding -
Decoding


Encoding

In practice the log2(alphabetsize) for the nonzero bits in this
canonical code is the one that is important for memory layout.
You just store that number of bits of the code and the codelength for
each symbol.  To encode a symbol you look up the length and last bits
of the code.  Then shift the old output to the left by the length
(possibly producing bytes to output) and OR the last bits in.
You may want to pack the codelength and code ending into one
memory unit (16 or 32 bit) to reduce the number of memory accesses.


On some architectures it is faster to have an register containing
0..7(15) bits pending for output. To encode you leftshift the last
bits of the added code by the number of pending bits and OR the
result in. Then add the codelength to the number of pending bits.
Output bytes (or larger units) and rightshift the code until the
number of pending bits is less than 7(15). The leading zeros will
be shifted in as needed. Never do any bitwise IO.


Conventions -
Huffman Codes -
Canonical Huffman Codes -
Code Construction -
Maximum Length -
Calculating Codelengths -
Encoding -
Decoding


Decoding

Textbooks still explain decoding on a bit-by-bit method; if you see
a 0 go left in the tree, if you see a 1 go right; if you reach a leaf
you have a symbol. This is DEAD SLOW.

Lookup decoding

How about decoding the example canonical code the following way:
make a table with 16 entries. This table will tell you what symbol
you decoded and how many bits you used.
index 0000 would contain B,4
index 0001 C,4
...
index 1000 to 1011 would all contain E,2
index 1100 to 1111 would all contain H,2 


This is in fact a good idea for short maximum codelengths. But if
maximum codelength is 25 you would need a table with>32 million
entries -- not a good idea. The solution is to use different tables
for different lengths. Here it is important that the canonical code
chosen keeps codes of same length together.


You can make a table for each length and search the correct
table by looking at the input; all you need to know is where the
codes for each length start and search your input in there.


Or you make multilevel tables; first lookup the first few bits; they
will tell you what table to use. Then look up in the second table the
decoded symbol and the length. I personally prefer that one if the
memory is not tight.


You might also choose the table based on the amount of 0's 
preceding the next 1. But stop search for a 1 after a fixed
length. Modern processors have an assembler instruction for that
search.

Example for tables for each length

decode 000101110 (CDE):


you have an array containing the start of each length and which table to use


start codelength table to use
0000  4          table1
0100  3          table2
1000  2          table3



table1 contains the symbols with length 4 sorted by code: BCFG
table2 contains the symbols with length 3 sorted by code: AD
table3 contains the symbols with length 2 sorted by code: EH



get the first 4 bits (0001). (actually you could do with 1 bit less
 than the longest codelength since that omitted bit must be 0 at a
 length boundary)
do a search for these bits in the array; telling you to
 use table1.
subtract start from the bits you have, shift them if the code is
 short and use the result as index into table1. You find C.
get the next 4 bits (0111).
do a search for these bits in the array; telling you to
 use table2.
subtract start from the bits you have, shift them by 1 and use
 the result as index into table2. You find D.
get the first 4 bits (1000, the l was not used for the last symbol!).
do a search for these bits in the array; telling you to
 use table3.
subtract start from the bits you have, shift them by 2 and use
 the result as index into table3. You find E.


In a real implementation you could, for example, avoid the subtraction
by adjusting the table pointers.

Example for two-level tables


you have an array containing the following:


index table to use bits as index
00    table1       2
01    table2       1
10    table3       0
11    table4       0

some entries may point to the same table if you have shorter
codes than your index.


The other tables contain a symbol and a codelength. There are ways to
omit the codelengths, see below.
table1 contains: B4 C4 F4 G4
table2 contains: A3 D3
table3 contains: E2
table4 contains: H2
If a symbol has a shorter codelength than the symbol with the longest
codelength in that table it occupies more than one place - just like
with the full decoding table in the first attempt.


These tables are surprisingly small if you choose the size of the first
array that decides between tables large enough - 2^(maximum codelength/2)
is usually a good guess for the size of that array.



get the first 2 bits (00).
look into array and see that you need to use table1 with 2 bit as index.
look into table1 index next 2 bits (01) and find C and 4 bits used.
get the next 2 bits (01).
look into array and see that you need to use table2 with 1 bit as index.
look into table2 index next bit (1) and find D and 3 bits used.
get the next 2 bits (10).
look into array and see that you need to use table3 with 0 bit as index.
look into table2 no index bits and find E and 2 bits used.


Again optimizations are possible. Note that the code contains no if
at all.

Some Variants


You might try to decode short symbols with only one lookup, however
the decision whether to make a second lookup or not costs more than the
second lookup.  You might also consider decoding more than one symbol
at once; however this usually does not pay off unless the average
codelength used is very short (less than 2 bit).


You might want to have the first array point to functions instead of
tables; then a one-lookup (and possibly a 3-lookup with another
intermediate level) decoding can be done efficiently. But it will break
instruction pipeline.


With short memory you might want to avoid storing the codelengths
for each symbol like with the first method. If your first array
has 2^9 (512) entries and your maximum codelength is 18 you know
that only 9 of the 512 second level tables might have codes with
different lengths in them. Only these tables need to store the
length. Or search the length for codes using a
binary search like with the first method - but much faster.
For the search store the maximum and minimum codelengths
in each such table and do the binary search only in that range.
You might also use separate arrays where codelengths start for
each table with more than one codelength. You could even do the search
always; for symbols standing in tables with only one length it will
terminate immediately anyway.


If your symbols are variable size you can store pointers to the
symbols instead the symbols themselves. If you store them as a block
for each table you can easily avoid a termination symbol or a length
for those variable symbols; just look in the table where the next
variable length symbol starts. You will need an extra entry in the
table to terminate the last symbol of the table.


Instead of multiple second level tables you may use one big table
and create appropriate pointers or indices.


Conventions -
Huffman Codes -
Canonical Huffman Codes -
Code Construction -
Maximum Length -
Calculating Codelengths -
Encoding -
Decoding



<IMG SRC= szip and the >data</// logo are trademarks or registered trademarks of Michael Schindler. All other trademarks or registered trademarks are held by their owners.
 

A

tutorial

by

Michael

Schindler.

http://www.compressconsult.com/huffman/

Practical Huffman Coding 2008 December

dvd rental

dvd


A tutorial by Michael Schindler.

Rules




© 2005 Internet Explorer 5+ or Netscape 6+

Recommended Sites: 1. Arts - Business - Computers - Games - Health - Home - Kids and Teens - News - Recreation - Reference - Regional - Science - Shopping - Society - Sports - World Miss Gallery - Top Anime Hentai - DVD rental by mail - Car Loans - Auto Loans - Credit Cards - Bleach - Refinance
2008-12-03 02:14:08

Copyright 2005, 2006 by Webmaster
Websites is cool :)