Huffman Coding

1. Introduction

  • Huffman coding is a variable-length prefix coding algorithm for lossless data compression: it reduces the size of data without losing any information.
  • It is a greedy algorithm that constructs an optimal prefix code (a type of variable-length code) for a given set of symbols based on their frequencies.
  • The goal is to assign shorter codes to more frequent symbols and longer codes to less frequent symbols, minimizing the total number of bits required to represent the data.
  • Huffman coding is widely used in compression formats such as JPEG, MP3, and ZIP.

Key Concepts:

  1. Prefix Codes: In Huffman coding, no codeword is a prefix of another codeword. This ensures unambiguous decoding. For example, if “0” is the code for “A”, then “01” cannot be the code for “B.”
  2. Frequency Table: The algorithm starts by creating a frequency table for each character in the input data. This table stores how many times each character appears.
  3. Huffman Tree: The frequency table is used to build a binary tree, known as a Huffman tree. This tree is constructed bottom-up, combining the least frequent characters first.
  4. Encoding and Decoding: The path from the root of the Huffman tree to each character represents its codeword. Left branches typically represent ‘0’, and right branches represent ‘1’. Decoding is done by traversing the tree according to the received bit stream.
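
The prefix property in concept 1 can be checked mechanically. A minimal sketch (the code tables here are illustrative, not taken from the example below):

```python
def is_prefix_free(codes):
    """True if no codeword is a prefix of another (unambiguous decoding)."""
    words = list(codes.values())
    return all(a == b or not b.startswith(a)
               for a in words for b in words)

is_prefix_free({"A": "0", "B": "01"})             # False: "0" prefixes "01"
is_prefix_free({"A": "0", "B": "10", "C": "11"})  # True
```

With the first table, the bit string "01" could be read as "AB-start" or as "B", which is exactly the ambiguity the prefix property rules out.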

Steps in Huffman Coding:

    1. Create a Frequency Table: Count the occurrences of each character in the input.
    2. Build the Huffman Tree:
      • Create a leaf node for each character, with its frequency as its weight.
      • Repeatedly combine the two nodes with the smallest weights into a new parent node, whose weight is the sum of its children’s weights. Continue until a single root node remains. This root node represents the entire input.
    3. Assign Codes: Traverse the Huffman tree from the root to each leaf node. The path determines the code for that character.
    4. Encode and Decode: Use the generated codes to compress and decompress the data.
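
Step 1 is a simple counting pass. A minimal sketch in Python using the standard library (the input string is just an example):

```python
from collections import Counter

def frequency_table(text):
    """Count occurrences of each character (step 1 of Huffman coding)."""
    return Counter(text)

freq = frequency_table("abracadabra")
# 'a' appears 5 times, 'b' and 'r' twice each, 'c' and 'd' once each
```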

2. Huffman Algorithm

The Huffman algorithm builds a Huffman tree and uses it to generate the optimal prefix codes. Here are the steps:

  1. Input: A set of symbols and their frequencies.
  2. Output: A Huffman tree and the corresponding prefix codes for each symbol.

Steps:

  1. Create a min-heap (priority queue):
    • Each node in the heap represents a symbol and its frequency.
    • The heap is ordered by frequency, with the smallest frequency at the root.
  2. Build the Huffman tree:
    • While there is more than one node in the heap:
      • Extract the two nodes with the smallest frequencies (call them x and y).
      • Create a new internal node with frequency = frequency(x) + frequency(y).
      • Make x and y the left and right children of the new node.
      • Insert the new node back into the heap.
    • The last remaining node in the heap is the root of the Huffman tree.
  3. Generate the codes:
    • Traverse the Huffman tree from the root to each leaf node.
    • Assign 0 for left edges and 1 for right edges.
    • The path from the root to a leaf node gives the prefix code for the corresponding symbol.
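
The steps above can be sketched with the standard heapq module. This is a minimal sketch: because ties between equal frequencies can be broken either way, the exact bit patterns may differ from the worked example in the next section, but the code lengths and the total cost are the same.

```python
import heapq
import itertools

def build_huffman_tree(frequencies):
    # Heap entries are (frequency, tiebreak, node); the tiebreak counter
    # keeps tuple comparison away from the nodes themselves.
    counter = itertools.count()
    heap = [(f, next(counter), sym) for sym, f in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)  # extract the two smallest
        fy, _, y = heapq.heappop(heap)
        heapq.heappush(heap, (fx + fy, next(counter), (x, y)))
    return heap[0][2]  # root: a leaf symbol or a (left, right) pair

def generate_codes(node, prefix="", codes=None):
    # Walk the tree: left edges append '0', right edges append '1'.
    if codes is None:
        codes = {}
    if isinstance(node, tuple):       # internal node
        generate_codes(node[0], prefix + "0", codes)
        generate_codes(node[1], prefix + "1", codes)
    else:                             # leaf: the path is the codeword
        codes[node] = prefix or "0"   # lone-symbol edge case
    return codes
```

For the frequencies in the next section, any valid run produces code lengths {F: 1, C: 3, D: 3, E: 3, A: 4, B: 4} and a total cost of 224 bits.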

3. Example

Symbols and Frequencies:

  • A: 5
  • B: 9
  • C: 12
  • D: 13
  • E: 16
  • F: 45

 

Step 1: Initial Min-Heap

The initial min-heap contains the following nodes, ordered by frequency (smallest first):

 

(A,5), (B,9), (C,12), (D,13), (E,16), (F,45)

 

Step 2: Build the Huffman Tree

  1. Combine (A,5) and (B,9):
    • New internal node: (AB, 14)
    • Tree so far:

                 (AB,14)
                 /     \
             (A,5)   (B,9)

    • Updated heap: (C,12), (D,13), (AB,14), (E,16), (F,45)
  2. Combine (C,12) and (D,13):
    • New internal node: (CD, 25)
    • Tree so far:

                 (CD,25)
                 /     \
            (C,12)   (D,13)

    • Updated heap: (AB,14), (E,16), (CD,25), (F,45)

  3. Combine (E,16) and (AB,14):
    • New internal node: (EAB, 30)
    • Tree so far:

                 (EAB,30)
                 /      \
            (E,16)    (AB,14)
                      /     \
                  (A,5)   (B,9)

    • Updated heap: (CD,25), (EAB,30), (F,45)

  4. Combine (CD,25) and (EAB,30):
    • New internal node: (CDEAB, 55)
    • Tree so far:

                (CDEAB,55)
                /        \
          (CD,25)      (EAB,30)
          /     \       /     \
      (C,12) (D,13) (E,16) (AB,14)
                            /    \
                        (A,5)  (B,9)

    • Updated heap: (F,45), (CDEAB,55)

  5. Combine (F,45) and (CDEAB,55):
    • New internal node: (FCDEAB, 100)
    • Final Huffman tree:

              (FCDEAB,100)
              /          \
         (F,45)       (CDEAB,55)
                      /        \
                (CD,25)      (EAB,30)
                /     \       /     \
            (C,12) (D,13) (E,16) (AB,14)
                                  /    \
                              (A,5)  (B,9)

Step 3: Assign Codes

Traverse the tree from the root to each leaf node, assigning 0 for left edges and 1 for right edges:

  • F: 0 (left edge from root)
  • C: 100 (right from root → left from CDEAB → left from CD)
  • D: 101 (right from root → left from CDEAB → right from CD)
  • E: 110 (right from root → right from CDEAB → left from EAB)
  • A: 1110 (right from root → right from CDEAB → right from EAB → left from AB)
  • B: 1111 (right from root → right from CDEAB → right from EAB → right from AB)
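
These codes can be sanity-checked programmatically. The following sketch verifies the prefix property and computes the total encoded size: 224 bits, versus 300 bits for a fixed-length 3-bit code over the same 100 symbols.

```python
codes = {"F": "0", "C": "100", "D": "101", "E": "110", "A": "1110", "B": "1111"}
freqs = {"A": 5, "B": 9, "C": 12, "D": 13, "E": 16, "F": 45}

# Prefix property: no codeword is a prefix of any other codeword.
words = list(codes.values())
assert all(a == b or not b.startswith(a) for a in words for b in words)

# Weighted code length: bits needed to encode all 100 symbols.
total_bits = sum(freqs[s] * len(codes[s]) for s in codes)  # 224
# A fixed-length code for 6 symbols needs 3 bits each: 3 * 100 = 300 bits.
```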

Final Huffman Codes

  Symbol   Code
  ------   ----
  F        0
  C        100
  D        101
  E        110
  A        1110
  B        1111
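
With the table above, encoding and decoding are straightforward. A minimal sketch; decoding relies on the prefix property, so the first codeword that matches the buffered bits must be the intended symbol:

```python
codes = {"F": "0", "C": "100", "D": "101", "E": "110", "A": "1110", "B": "1111"}
decode_map = {code: sym for sym, code in codes.items()}

def encode(text):
    """Concatenate the codeword of each symbol."""
    return "".join(codes[ch] for ch in text)

def decode(bits):
    """Consume bits until the buffer matches a codeword, then emit it."""
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in decode_map:  # prefix property guarantees no ambiguity
            out.append(decode_map[buf])
            buf = ""
    return "".join(out)

encode("FACE")  # "0" + "1110" + "100" + "110" = "01110100110"
```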