A set of Python 3 command-line tools for converting scientific papers in PDF format to XML documents with sections and tables.
- Make sure the pdftotext utility is installed for the initial PDF-to-text conversion.
For Ubuntu/Debian
sudo apt-get install poppler-utils
For Red Hat/RHEL/Fedora/CentOS Linux
sudo yum install poppler-utils
- Create a Python virtual environment
python3 -m venv ~/pdf_env
- Activate the virtual environment and install dependencies
source ~/pdf_env/bin/activate
pip install --upgrade pip
pip install pdftotree==0.2.13
pip install h5py==2.10.0
pip install tensorflow
pip install Keras
pip install spacy
python -m spacy download en_core_web_sm
- Install the spaCy NLP library and model, if not already installed in the previous step (a virtual environment is recommended)
pip install spacy
python -m spacy download en_core_web_sm
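Before running the tools, it can help to confirm that the prerequisites listed above are actually available. The following is a minimal sketch, not part of this repository: the function name `missing_prerequisites` is hypothetical, and it assumes the installed packages are importable under the names `pdftotree`, `h5py`, `tensorflow`, `keras`, `spacy` and `en_core_web_sm`.

```python
import importlib.util
import shutil

# Assumed import names for the packages installed above.
DEFAULT_MODULES = ("pdftotree", "h5py", "tensorflow", "keras", "spacy", "en_core_web_sm")


def missing_prerequisites(tools=("pdftotext",), modules=DEFAULT_MODULES):
    """Return the names of command-line tools and Python modules that cannot be found."""
    missing = []
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            missing.append(tool)
    for module in modules:
        # find_spec returns None when a top-level module cannot be located;
        # it does not import the module itself.
        if importlib.util.find_spec(module) is None:
            missing.append(module)
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    print("missing:", ", ".join(problems) if problems else "nothing")
```

Run it inside the activated virtual environment so that the module lookup reflects the packages installed there.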
- Convert a paper (here paper.pdf) to XML
pdftotext paper.pdf paper.txt
python pdftext2pages.py -i paper.txt -o /tmp/paper1
python paper2xml.py -i /tmp/paper1/pdf.xml -o /tmp/paper1/paper.xml
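The three commands above can also be chained from Python. This is a sketch under stated assumptions: `build_commands` and `run_pipeline` are hypothetical helpers, the scripts are assumed to be in the current working directory, and the intermediate file name `pdf.xml` is taken from the example above.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(pdf_path, work_dir):
    """Build the three command lines from the example, without running them."""
    pdf = Path(pdf_path)
    work = Path(work_dir)
    # Mirror "pdftotext paper.pdf paper.txt": same name, .txt suffix.
    text_file = pdf.with_suffix(".txt")
    return [
        ["pdftotext", str(pdf), str(text_file)],
        [sys.executable, "pdftext2pages.py", "-i", str(text_file), "-o", str(work)],
        [sys.executable, "paper2xml.py",
         "-i", str(work / "pdf.xml"), "-o", str(work / "paper.xml")],
    ]


def run_pipeline(pdf_path, work_dir):
    """Run each step in order and stop at the first failure."""
    for command in build_commands(pdf_path, work_dir):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(command, check=True)
    return Path(work_dir) / "paper.xml"
```

Calling `run_pipeline("paper.pdf", "/tmp/paper1")` would then be equivalent to the three shell commands, using the current interpreter in place of `python`.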
The generated /tmp/paper1/paper.xml
contains the paper's section and table information, with common page headers and footers (line numbers) removed
and formula lines detected heuristically and stripped. The generated XML can then be used in text mining applications.
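As a first step toward text mining, the generated XML can be inspected with the standard library. The element names in paper.xml are not documented here, so the sketch below makes no assumption about the schema: it only counts how often each tag occurs. `tag_counts` and the `SAMPLE` string are hypothetical illustrations, not real output of the tools.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_counts(root):
    """Count how often each element tag occurs under root (root included)."""
    return Counter(element.tag for element in root.iter())


# Hypothetical stand-in document; the real paper.xml schema may differ.
SAMPLE = "<paper><section><title>Intro</title></section><section/><table/></paper>"

if __name__ == "__main__":
    counts = tag_counts(ET.fromstring(SAMPLE))
    for tag, count in sorted(counts.items()):
        print(tag, count)
    # For a real run, parse the generated file instead:
    # tag_counts(ET.parse("/tmp/paper1/paper.xml").getroot())
```

Listing the tags first is a cheap way to learn the structure of the output before writing extraction code against it.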
python pdftext2pages.py -h
usage: pdftext2pages.py [-h] -i I -o O
optional arguments:
  -h, --help  show this help message and exit
  -i I        input PDF Text file
  -o O        output directory
python paper2xml.py -h
usage: paper2xml.py [-h] -i I -o O
optional arguments:
  -h, --help  show this help message and exit
  -i I        input PDF XML file
  -o O        output XML file
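For readers who want to wrap or extend these scripts, an argparse definition along the following lines would produce the usage line shown for pdftext2pages.py. This is a reconstruction from the help text only; the actual source of the script may be organized differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Parser whose usage line matches the pdftext2pages.py help output above."""
    parser = argparse.ArgumentParser(prog="pdftext2pages.py")
    # Both options are required, which is why the usage line shows them
    # without square brackets.
    parser.add_argument("-i", required=True, help="input PDF Text file")
    parser.add_argument("-o", required=True, help="output directory")
    return parser


if __name__ == "__main__":
    parser = build_parser()
    args = parser.parse_args(["-i", "paper.txt", "-o", "/tmp/paper1"])
    # With no long option names, values are stored under the short names.
    print(args.i, args.o)
```

The same pattern with different help strings would cover paper2xml.py, which takes the same two options.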