paperpdf2xml

A set of Python 3 command-line tools that convert scientific papers in PDF format to XML documents with sections and tables.

Prerequisites

Make sure the pdftotext utility is installed; it is used for the initial PDF to text conversion.

For Ubuntu/Debian

sudo apt-get install poppler-utils

For RedHat/RHEL/Fedora/CentOS Linux

sudo yum install poppler-utils
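
Before moving on, it can help to confirm that pdftotext is actually reachable from the shell. The snippet below is a minimal sketch using only the Python standard library; the helper name `find_pdftotext` is made up for this example and is not part of this repository.

```python
import shutil


def find_pdftotext(name="pdftotext"):
    """Return the full path of the executable, or None if it is not on PATH."""
    # shutil.which searches the directories listed in the PATH environment variable.
    return shutil.which(name)


if __name__ == "__main__":
    path = find_pdftotext()
    if path is None:
        print("pdftotext not found; install poppler-utils first")
    else:
        print(f"pdftotext found at {path}")
```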

Create a Python virtual environment

python3 -m venv ~/pdf_env

Activate the virtual environment and install dependencies

source ~/pdf_env/bin/activate
pip install --upgrade pip
pip install pdftotree==0.2.13
pip install h5py==2.10.0
pip install tensorflow
pip install Keras
pip install spacy
python -m spacy download en_core_web_sm

Install the spaCy NLP library and models (a virtual environment is recommended)

pip install spacy
python -m spacy download en_core_web_sm
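
To check that the installs above landed in the active virtual environment, something like the following sketch may be useful. It looks the modules up without importing them, so it stays fast even with TensorFlow installed. The import names in `REQUIRED_MODULES` are an assumption (the lower-cased pip package names plus the spaCy model package), and `missing_modules` is a hypothetical helper, not part of this repository.

```python
import importlib.util

# Assumed import names for the packages installed with pip above.
REQUIRED_MODULES = ["pdftotree", "h5py", "tensorflow", "keras", "spacy", "en_core_web_sm"]


def missing_modules(names):
    """Return the module names that cannot be located in the current environment."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules were found")
```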

Usage

pdftotext paper.pdf paper.txt
python pdftext2pages.py -i paper.txt -o /tmp/paper1
python paper2xml.py -i /tmp/paper1/pdf.xml -o /tmp/paper1/paper.xml

The generated /tmp/paper1/paper.xml contains the paper's section and table information. Common page headers and footers (line numbers) are removed, and formula lines are detected heuristically and stripped. The generated XML can then be used for text mining applications.
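
The three usage commands can also be chained from Python, which is convenient when converting many papers. The sketch below makes several assumptions: it is run from the directory that contains pdftext2pages.py and paper2xml.py, pdftext2pages.py writes a file named pdf.xml into its output directory (as the usage example suggests), and the output directory is handled by the scripts themselves. The functions `build_commands` and `run_pipeline` are hypothetical helpers, not part of this repository.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(pdf_path, out_dir):
    """Return the three commands from the usage example as argument lists."""
    pdf = Path(pdf_path)
    out = Path(out_dir)
    txt = pdf.with_suffix(".txt")  # paper.pdf -> paper.txt
    return [
        ["pdftotext", str(pdf), str(txt)],
        [sys.executable, "pdftext2pages.py", "-i", str(txt), "-o", str(out)],
        # Assumption: the previous step leaves pdf.xml in the output directory.
        [sys.executable, "paper2xml.py", "-i", str(out / "pdf.xml"), "-o", str(out / "paper.xml")],
    ]


def run_pipeline(pdf_path, out_dir):
    """Run each step in order and stop at the first step that fails."""
    for cmd in build_commands(pdf_path, out_dir):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("paper.pdf", "/tmp/paper1")` would then issue the same three commands as the usage example. Passing argument lists instead of a single shell string avoids quoting problems with file names that contain spaces.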

python pdftext2pages.py -h 

usage: pdftext2pages.py [-h] -i I -o O

optional arguments:
  -h, --help  show this help message and exit
  -i I        input PDF Text file
  -o O        output directory

python paper2xml.py -h

usage: paper2xml.py [-h] -i I -o O

optional arguments:
  -h, --help  show this help message and exit
  -i I        input PDF XML file
  -o O        output XML file
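
Once paper2xml.py has written its output XML file, a first step for text mining is often to see which elements the file contains. The XML schema is not described in this README, so the sketch below assumes no element names and simply counts every tag it finds. `count_tags` and `summarize` are hypothetical helpers, not part of this repository.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(root):
    """Count how often each element tag occurs in the tree, including the root."""
    return Counter(elem.tag for elem in root.iter())


def summarize(xml_path):
    """Parse an XML file and return a Counter mapping tag name to occurrence count."""
    root = ET.parse(xml_path).getroot()
    return count_tags(root)


if __name__ == "__main__":
    import sys

    # Usage: python summarize_xml.py /tmp/paper1/paper.xml
    if len(sys.argv) == 2:
        for tag, count in summarize(sys.argv[1]).most_common():
            print(f"{tag}\t{count}")
```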
