Skip to content

An implementation of compression and decompression using Huffman coding in Python

License

Notifications You must be signed in to change notification settings

lucas-ps/Huffman-Code-compression-python-project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

51 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Huffman coding in Python

My coursework project for my Data Structures and Algorithms module - Huffman Coding compression algorithm in Python. A demonstration of the program in use can be found here. Hosted on: Github

The project takes in a .TXT file and (from my testing) is capable of reducing the file's size by up to 45% using Huffman Coding

Requirements

  • Python >= 3.7
  • Modules used:
    • json
    • typing
    • heapq
    • OS
    • math
    • bitstring

Usage

To run program, simply run the main.py file. I used the bee movie script for testing, it's been uploaded to the github repo so if you would like to test using that, enter 'bee.txt' as your first input in the program

Main functions

get_text_from_file(file: str)

Returns a string of the text data in a provided file

calc_letter_frequencies(text: str)

Returns a list with characters in the provided string and their frequencies in descending order

create_huffman_tree(text: str)

Creates all necessary nodes to construct a huffman coding tree, returns the root node of the tree

create_codes(root_node: Node)

Assigns optimised huffman codes to each letter using the provided tree

  • def calc_code_for_char(code: str, node: Node)
    • Recursive function within create_codes() that checks assigns codes to nodes (characters)

def compress_text(text: str, file, code_dict = None)

Uses generated optimised character codes to create a BIN file with the compressed text and a JSOn file with the character codes

def decompress_text(compressed_file, code_dict=None)

Reads the provided compressed file and it's character codes file, decodes, and writes decoded data to new file

def test_program()

Function that calls all other functions to test entire program's functionality

Node object functions

def __init__(self, char, frequency, left=None, right=None)

Constructor for a new node object, creates a node in the tree with defined child nodes

def __lt__(self, other):

Defines the behaviour of the less-than operator, overrides the __lt__() function to ensure heapq orders nodes as expected (ascending order of frequency)

Performance analysis

E-book compression analysis graph

File name Original size (bytes) Compressed size (bytes) % reduction in file size
French E-Book 1 346563 202010 41.71%
French E-Book 2 1108709 600656 45.82%
English E-Book 1 860669 476703 44.61%
English E-Book 2 860669 476703 44.61%
Portugese E-Book 1 302358 168707 44.20%
Portugese E-Book 2 51321 32565 36.55%

Compression analysis by language

Compression language Average % reduction in file size
French E-Books 43.77%
English E-Books 44.61%
Portugese E-Books 40.37%

As seen in the data above, Huffman coding is able to fairly consistently reduce file sizes for all languages tested by 35 - 45%

Compression analysis for datasets

Data set name and type Original size (bytes) Compressed size (bytes) % reduction in file size
dblp.xml.00001.1(pseudo-real) 104857600 68781321 34.41%
Escherichia_Coli(real) 112689515 31648046 71.92%
fib41(artificial) 267914296 33489307 87.50%

As seen in the data above, reduction in file size varies a lot with data sets as they may have differing levels of repetitiveness

List of resources used to implement compression algorithm

License

MIT

About

An implementation of compression and decompression using Huffman coding in Python

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages