Data Rescue Project

This repository converts tabular data in PDF files into a machine-readable CSV format. It contains the following modules:

ocr.py: The main file. It converts the numbers and station names on a PDF page into a CSV file.

get_stations.py: Used by ocr.py to convert station names into machine-readable text using image segmentation and Tesseract.

convert_table_to_csv.py: A wrapper that converts the PDF files for all available years to CSV. It iterates over the files and, for each one, calls ocr.py to convert that page's data into CSV format. It is used by search.py to search the data for a particular station.
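
The wrapper pattern described above can be sketched as follows. This is a minimal illustration, not the code in convert_table_to_csv.py: the function name `convert_all`, the `convert_page` callback (a stand-in for whatever ocr.py exposes), and the one-PDF-per-year directory layout are all assumptions.

```python
import csv
from pathlib import Path
from typing import Callable, Iterable, List

Row = List[str]


def convert_all(
    pdf_dir: Path,
    out_dir: Path,
    convert_page: Callable[[Path], Iterable[Row]],
) -> List[Path]:
    """Run convert_page on every PDF in pdf_dir and write one CSV per PDF.

    convert_page is a hypothetical stand-in for the OCR step: it takes the
    path of a PDF and returns the table rows found in it.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    # Sort so the files are processed in a stable (e.g. chronological) order.
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        rows = convert_page(pdf_path)
        csv_path = out_dir / (pdf_path.stem + ".csv")
        with csv_path.open("w", newline="") as handle:
            csv.writer(handle).writerows(rows)
        written.append(csv_path)
    return written
```

Passing the OCR step in as a callback keeps the iteration logic testable without running Tesseract.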

search.py: Retrieves the data for a particular station across all available years and saves it to a separate CSV file for each year.
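
A station lookup of this kind might look like the sketch below. It is illustrative only and does not reproduce search.py: it assumes one CSV per year named after the year, with the station name in the first column, and the function name `search_station` is hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, List


def search_station(csv_dir: Path, station: str) -> Dict[str, List[List[str]]]:
    """Return, per CSV file stem (assumed to be the year), the rows for station.

    Assumes the station name is in the first column; matching ignores case
    and surrounding whitespace, since OCR output is often inconsistent.
    """
    wanted = station.strip().lower()
    matches: Dict[str, List[List[str]]] = {}
    for csv_path in sorted(csv_dir.glob("*.csv")):
        with csv_path.open(newline="") as handle:
            rows = [
                row
                for row in csv.reader(handle)
                if row and row[0].strip().lower() == wanted
            ]
        if rows:
            matches[csv_path.stem] = rows
    return matches
```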

reformat_data.py: Converts the data generated by search.py into the required format, i.e. (location, id, date, data).
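
As a rough illustration of the (location, id, date, data) layout, the sketch below flattens per-year values into one record per date. The input shape (a mapping from year to a list of monthly values), the `YYYY-MM` date format, and the name `to_long_format` are assumptions made for this example; the actual input and date granularity used by reformat_data.py may differ.

```python
from typing import Dict, List, Optional, Tuple

# One output record: (location, id, date, data).
Record = Tuple[str, str, str, Optional[float]]


def to_long_format(
    location: str,
    station_id: str,
    yearly: Dict[int, List[Optional[float]]],
) -> List[Record]:
    """Flatten {year: [value for month 1, month 2, ...]} into long records.

    Missing readings are kept as None so gaps stay visible downstream.
    """
    records: List[Record] = []
    for year in sorted(yearly):
        for month, value in enumerate(yearly[year], start=1):
            records.append((location, station_id, f"{year:04d}-{month:02d}", value))
    return records
```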

To replicate this project

This section describes the steps required to set up and retrain the categorization model.

$ git clone https://github.com/gwf-uwaterloo/data-rescue.git
$ cd data-rescue

$ python3 -m venv env

$ source env/bin/activate
(env) $ pip install -r requirements.txt

More to follow.
