My coursework project for my Data Structures and Algorithms module: a Huffman coding compression algorithm in Python. A demonstration of the program in use can be found here. Hosted on: GitHub
The project takes in a .txt file and, in my testing, can reduce the file's size by up to 45% using Huffman coding
- Python >= 3.7
- Modules used:
- json
- typing
- heapq
- os
- math
- bitstring
To run the program, simply run the main.py file. I used the Bee Movie script for testing; it has been uploaded to the GitHub repo, so if you would like to test with it, enter 'bee.txt' as your first input in the program
Returns a string of the text data in a provided file
Returns a list with characters in the provided string and their frequencies in descending order
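The frequency-counting step described above can be sketched as follows (the function name here is my own, not necessarily the one used in the repo), using `collections.Counter` and `typing` as the repo's module list suggests:

```python
from collections import Counter
from typing import List, Tuple

def char_frequencies(text: str) -> List[Tuple[str, int]]:
    """Return (character, count) pairs sorted by count, descending."""
    return sorted(Counter(text).items(), key=lambda pair: pair[1], reverse=True)
```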
Creates all the nodes needed to construct a Huffman coding tree and returns the root node of the tree
Assigns optimised Huffman codes to each character using the provided tree
- def calc_code_for_char(code: str, node: Node)
- Recursive helper within create_codes() that assigns codes to nodes (characters)
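The tree-building and code-assignment steps can be sketched like this. This is a minimal illustration of the standard algorithm, not the repo's exact code: the `Node` attributes and `build_tree` name are my own guesses, and the recursion is inlined into `create_codes()` rather than split into a separate helper.

```python
import heapq
from typing import Dict, Optional

class Node:
    """Minimal Huffman tree node; leaf nodes carry a character."""
    def __init__(self, freq: int, char: Optional[str] = None,
                 left: "Optional[Node]" = None, right: "Optional[Node]" = None):
        self.freq, self.char, self.left, self.right = freq, char, left, right

    def __lt__(self, other: "Node") -> bool:
        return self.freq < other.freq  # lets heapq order nodes by frequency

def build_tree(frequencies: Dict[str, int]) -> Node:
    # Repeatedly merge the two least-frequent nodes until one root remains
    heap = [Node(freq, char) for char, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, Node(a.freq + b.freq, left=a, right=b))
    return heap[0]

def create_codes(node: Node, code: str = "",
                 table: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    # Walk the tree: left edges append "0", right edges append "1"
    if table is None:
        table = {}
    if node.char is not None:
        table[node.char] = code or "0"  # lone-character edge case
    else:
        create_codes(node.left, code + "0", table)
        create_codes(node.right, code + "1", table)
    return table
```

Because rarer characters sit deeper in the tree, the most frequent character always receives the shortest code.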
Uses the generated optimised character codes to create a BIN file with the compressed text and a JSON file with the character codes
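The repo uses the bitstring library for the binary output; as a rough illustration of the same idea in plain Python (not the repo's actual approach), a string of "0"/"1" characters can be packed into bytes, recording the pad length so it can be stripped on decompression:

```python
def pack_bits(bitstr: str) -> bytes:
    # Pad to a whole number of bytes; store the pad length in the first byte
    pad = (8 - len(bitstr) % 8) % 8
    bitstr += "0" * pad
    body = bytes(int(bitstr[i:i + 8], 2) for i in range(0, len(bitstr), 8))
    return bytes([pad]) + body

def unpack_bits(data: bytes) -> str:
    # Reverse pack_bits: expand each byte and drop the padding
    pad, body = data[0], data[1:]
    bits = "".join(f"{byte:08b}" for byte in body)
    return bits[:len(bits) - pad] if pad else bits
```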
Reads the provided compressed file and its character codes file, decodes the data, and writes the decoded data to a new file
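The decoding step can be sketched as below (again a generic illustration, with my own function name): because Huffman codes are prefix-free, scanning the bit stream and emitting a character at the first table match is always unambiguous.

```python
from typing import Dict

def decode(bits: str, codes: Dict[str, str]) -> str:
    """Decode a bit string using a character -> code table."""
    lookup = {code: char for char, code in codes.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in lookup:       # prefix-freeness guarantees this is correct
            out.append(lookup[buf])
            buf = ""
    return "".join(out)
```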
Function that calls all the other functions to test the entire program's functionality
Constructor for a new node object; creates a node in the tree with the given child nodes
Defines the behaviour of the less-than operator by overriding the __lt__() method, ensuring heapq orders nodes as expected (ascending order of frequency)
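In isolation, the effect of the `__lt__()` override looks like this (a small demo with guessed attribute names, not the repo's class):

```python
import heapq

class Node:
    def __init__(self, freq: int, char: str):
        self.freq = freq
        self.char = char

    def __lt__(self, other: "Node") -> bool:
        # Without this override, heapq.heappush would raise TypeError,
        # since plain objects don't support the < comparison it relies on
        return self.freq < other.freq

heap = []
for freq, char in [(5, "a"), (1, "b"), (3, "c")]:
    heapq.heappush(heap, Node(freq, char))

assert heapq.heappop(heap).char == "b"  # lowest frequency comes out first
assert heapq.heappop(heap).char == "c"
```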
| File name | Original size (bytes) | Compressed size (bytes) | % reduction in file size |
| --- | --- | --- | --- |
| French E-Book 1 | 346563 | 202010 | 41.71% |
| French E-Book 2 | 1108709 | 600656 | 45.82% |
| English E-Book 1 | 860669 | 476703 | 44.61% |
| English E-Book 2 | 860669 | 476703 | 44.61% |
| Portuguese E-Book 1 | 302358 | 168707 | 44.20% |
| Portuguese E-Book 2 | 51321 | 32565 | 36.55% |
| Compression language | Average % reduction in file size |
| --- | --- |
| French E-Books | 43.77% |
| English E-Books | 44.61% |
| Portuguese E-Books | 40.37% |
As the data above shows, Huffman coding fairly consistently reduces file sizes by 35-45% across all the languages tested
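For reference, the percentage reductions in these tables follow the usual formula, (1 - compressed / original) * 100:

```python
def pct_reduction(original_bytes: int, compressed_bytes: int) -> float:
    """Percentage reduction in file size, rounded to two decimal places."""
    return round((1 - compressed_bytes / original_bytes) * 100, 2)

# Matches the French E-Book 1 row above
assert pct_reduction(346563, 202010) == 41.71
```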
| Data set name (type) | Original size (bytes) | Compressed size (bytes) | % reduction in file size |
| --- | --- | --- | --- |
| dblp.xml.00001.1 (pseudo-real) | 104857600 | 68781321 | 34.41% |
| Escherichia_Coli (real) | 112689515 | 31648046 | 71.92% |
| fib41 (artificial) | 267914296 | 33489307 | 87.50% |
As the data above shows, the reduction in file size varies widely between data sets, since they differ in how repetitive their content is
- https://www.geeksforgeeks.org/huffman-coding-greedy-algo-3/ - General intro to Huffman coding
- https://pypi.org/project/bitstring/ - Documentation used to understand writing to binary files
- https://docs.python.org/3/library/heapq.html - Documentation used to understand priority queues in Python
- https://www.tutorialspoint.com/python_data_structure/python_binary_tree.htm - Understanding how to implement a binary tree in Python