BibHelioTech is a program for the recognition of temporal expressions and entities (satellites, instruments, regions) extracted from scientific articles in the field of heliophysics.
It was developed at IRAP (Institut de Recherche en Astrophysique et Planétologie, Toulouse, https://www.irap.omp.eu/) as part of an internship by A. Dablanc, supervised by V. Génot.
Its main purpose is to retrieve events of interest which have been studied and published, and associate them with the full context of the observations. It produces standardized catalogues of events (time intervals, satellites, instruments, regions, metrics) which can then be exploited in space physics visualization tools such as AMDA (http://amda.cdpp.eu/).
STEP 1: install all dependencies
In your shell, run: pip install -r requirements.txt
Don't forget to install the SUTime Java dependencies; see https://pypi.org/project/sutime/ for details.
Put the "english.sutime.txt" file under the SUTime install directory, in jars/stanford-corenlp-4.0.0-models.jar/edu/stanford/nlp/models/sutime/
STEP 2: Tesseract 5.1.0 installation (Ubuntu example)
sudo add-apt-repository ppa:alex-p/tesseract-ocr-devel
sudo apt update
sudo apt install -y tesseract-ocr
tesseract --version
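To confirm from Python that the installed Tesseract is recent enough, the output of "tesseract --version" can be parsed. The snippet below is a minimal sketch, not part of BibHelioTech: the function names are hypothetical, and it assumes the first line of the output looks like "tesseract 5.1.0".

```python
import re
import shutil
import subprocess


def parse_tesseract_version(version_output):
    """Return the version as a tuple of ints, or None if none is found."""
    # Assumption: the banner looks like "tesseract 5.1.0" (optionally "v5.1.0").
    match = re.search(r"tesseract\s+v?(\d+(?:\.\d+)*)", version_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def installed_tesseract_version():
    """Run `tesseract --version`; return None when the binary is not on PATH."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True
    )
    # Check both streams in case the banner is not written to stdout.
    return parse_tesseract_version(result.stdout + result.stderr)


if __name__ == "__main__":
    version = installed_tesseract_version()
    if version is None:
        print("tesseract was not found on PATH")
    elif version < (5, 1, 0):
        print("tesseract is older than 5.1.0:", version)
    else:
        print("tesseract version is OK:", version)
```

Tuples of integers compare element by element, so (5, 1, 0) sorts correctly against (5, 0, 1) or (4, 1, 1).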
STEP 3: GROBID (0.7.1) installation
Install GROBID under ../
Follow the installation instructions at: https://grobid.readthedocs.io/en/latest/Install-Grobid/
Make sure JVM 8 is the default Java version!
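One way to check which Java version is the default is to parse the output of "java -version". This is a minimal sketch with hypothetical function names; it assumes the usual banner format, in which Java 8 reports itself as version "1.8.x" and later releases as "11.x", "17.x", and so on.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Return the Java major version (e.g. 8 or 11), or None if not found."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Java 8 and earlier use the legacy "1.x" scheme, e.g. "1.8.0_312".
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def default_java_major():
    """Run `java -version`; return None when java is not on PATH."""
    if shutil.which("java") is None:
        return None
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    # `java -version` writes its banner to stderr; read both streams to be safe.
    return parse_java_major(result.stdout + result.stderr)


if __name__ == "__main__":
    major = default_java_major()
    if major == 8:
        print("Default Java is version 8")
    else:
        print("Default Java major version is", major, "(expected 8)")
```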
STEP 4: GROBID python client installation
Install the GROBID Python client under ../
Follow the installation instructions at: https://github.com/kermitt2/grobid_client_python
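The Python client talks to a running GROBID service, so it can help to check that the service answers before processing articles. The sketch below is not part of BibHelioTech: the function names are hypothetical, and it assumes GROBID is served at its default address, http://localhost:8070, and exposes the "/api/isalive" endpoint described in the GROBID documentation.

```python
import urllib.error
import urllib.request


def isalive_url(base_url):
    """Build the health-check URL for a GROBID service."""
    return base_url.rstrip("/") + "/api/isalive"


def grobid_is_alive(base_url="http://localhost:8070", timeout=5.0):
    """Return True if the GROBID service answers its health check."""
    # Assumption: default host and port; adjust base_url for your setup.
    try:
        with urllib.request.urlopen(isalive_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, HTTP error, etc.
        return False
```

For example, calling grobid_is_alive() before launching the processing gives an early, readable failure instead of an error in the middle of a run.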
Put heliophysics articles in PDF format under BibHelio_Tech/DATA/Papers.
Then run "MAIN.py".
Optionally, if you want AMDA catalogues per satellite,
also run "SATS_catalogue_generator.py".
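The run steps above can be sketched as a small launcher. This is only an illustration with hypothetical function names: it assumes it is started from the BibHelio_Tech directory, that "MAIN.py" and "SATS_catalogue_generator.py" sit in that directory and take no arguments, and that articles are stored directly under DATA/Papers.

```python
import subprocess
import sys
from pathlib import Path


def find_pdfs(papers_dir):
    """Return the PDF files directly under papers_dir, sorted by path."""
    directory = Path(papers_dir)
    if not directory.is_dir():
        return []
    return sorted(
        path
        for path in directory.iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


def run_pipeline(papers_dir="DATA/Papers", with_satellite_catalogues=False):
    """Run MAIN.py, then optionally SATS_catalogue_generator.py."""
    pdfs = find_pdfs(papers_dir)
    if not pdfs:
        raise FileNotFoundError(f"No PDF articles found under {papers_dir}")
    print(f"Found {len(pdfs)} article(s); starting MAIN.py")
    # Assumption: both scripts live in the current working directory.
    subprocess.run([sys.executable, "MAIN.py"], check=True)
    if with_satellite_catalogues:
        subprocess.run([sys.executable, "SATS_catalogue_generator.py"], check=True)
```

Calling run_pipeline(with_satellite_catalogues=True) would run both scripts in order; check=True stops at the first script that exits with an error.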
If you use or contribute to BibHelio_Tech, you agree to use it or share your contribution under this license.
[Axel Dablanc]: [email protected]
[Vincent Génot]: [email protected]