Skip to content

Latest commit

 

History

History
117 lines (88 loc) · 5.56 KB

File metadata and controls

117 lines (88 loc) · 5.56 KB

Huffman coding in Python

My coursework project for my Data Structures and Algorithms module - Huffman Coding compression algorithm in Python. A demonstration of the program in use can be found here. Hosted on: Github

The project takes in a .TXT file and (from my testing) is capable of reducing the file's size by up to 45% using Huffman Coding

Requirements

  • Python >= 3.7
  • Modules used:
    • json
    • typing
    • heapq
    • OS
    • math
    • bitstring

Usage

To run program, simply run the main.py file. I used the bee movie script for testing, it's been uploaded to the github repo so if you would like to test using that, enter 'bee.txt' as your first input in the program

Main functions

get_text_from_file(file: str)

Returns a string of the text data in a provided file

calc_letter_frequencies(text: str)

Returns a list with characters in the provided string and their frequencies in descending order

create_huffman_tree(text: str)

Creates all necessary nodes to construct a huffman coding tree, returns the root node of the tree

create_codes(root_node: Node)

Assigns optimised huffman codes to each letter using the provided tree

  • def calc_code_for_char(code: str, node: Node)
    • Recursive function within create_codes() that checks assigns codes to nodes (characters)

def compress_text(text: str, file, code_dict = None)

Uses generated optimised character codes to create a BIN file with the compressed text and a JSOn file with the character codes

def decompress_text(compressed_file, code_dict=None)

Reads the provided compressed file and it's character codes file, decodes, and writes decoded data to new file

def test_program()

Function that calls all other functions to test entire program's functionality

Node object functions

def __init__(self, char, frequency, left=None, right=None)

Constructor for a new node object, creates a node in the tree with defined child nodes

def __lt__(self, other):

Defines the behaviour of the less-than operator, overrides the __lt__() function to ensure heapq orders nodes as expected (ascending order of frequency)

Performance analysis

E-book compression analysis graph

File name Original size (bytes) Compressed size (bytes) % reduction in file size
French E-Book 1 346563 202010 41.71%
French E-Book 2 1108709 600656 45.82%
English E-Book 1 860669 476703 44.61%
English E-Book 2 860669 476703 44.61%
Portugese E-Book 1 302358 168707 44.20%
Portugese E-Book 2 51321 32565 36.55%

Compression analysis by language

Compression language Average % reduction in file size
French E-Books 43.77%
English E-Books 44.61%
Portugese E-Books 40.37%

As seen in the data above, Huffman coding is able to fairly consistently reduce file sizes for all languages tested by 35 - 45%

Compression analysis for datasets

Data set name and type Original size (bytes) Compressed size (bytes) % reduction in file size
dblp.xml.00001.1(pseudo-real) 104857600 68781321 34.41%
Escherichia_Coli(real) 112689515 31648046 71.92%
fib41(artificial) 267914296 33489307 87.50%

As seen in the data above, reduction in file size varies a lot with data sets as they may have differing levels of repetitiveness

List of resources used to implement compression algorithm

License

MIT