This system automatically discovers papers and datasets published by HHMI Janelia staff and stores them in a MongoDB database. The automated scripts, which are run periodically (nightly or weekly), also make "educated guesses" about metadata that are of strategic interest to Janelia, such as labs, teams, and employees who contributed to the work. Utility scripts allow the librarian to curate these metadata in a semi-automated fashion. A Flask-based application provides a user interface, visualizations, and a REST API.
This repository is split into four sections:
- api: Web-based user interface and REST API
- etl: programs for ETL (Extract-Transform-Load) that create and maintain the DIS database
- sync: programs meant to be run periodically in the background to sync the DIS database from external data sources
- utility: utility programs to be run interactively on the command line, for CRUD operations on database collections
The DIS system is based on a MongoDB database with collections to persist DOIs, ORCIDs, and project mappings. Python programs are used for ETL and updates. A Flask-based application provides a user interface, visualizations, and a REST API.
The DIS MongoDB database contains four collections:
- dois: local persistence of records from Crossref or DataCite along with Janelia metadata
- dois_to_process: transient storage for DOIs that are present in secondary systems (e.g. bioRxiv) but not yet available in Crossref/DataCite
- orcid: Janelia authors. Data in this collection is drawn from ORCID and the HHMI People system.
- project_map: mapping of alternate project names to approved tags
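As an illustration of the project_map idea, the following is a minimal sketch of resolving an alternate project name to its approved tag. It works on plain dictionaries rather than a live MongoDB connection. The field names `name` and `project`, the sample values, and the case-insensitive matching are all assumptions made for this example and may not match the actual collection.

```python
from typing import Dict, Iterable, Optional


def build_lookup(records: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Index mapping records by lower-cased alternate name.

    The "name" and "project" keys are hypothetical field names.
    """
    return {rec["name"].strip().lower(): rec["project"] for rec in records}


def approved_tag(name: str, lookup: Dict[str, str]) -> Optional[str]:
    """Return the approved tag for an alternate name, or None if unmapped."""
    return lookup.get(name.strip().lower())


# Made-up records standing in for documents from the project_map collection
sample_records = [
    {"name": "Alt Name A", "project": "Approved Tag A"},
    {"name": "Another Alt Name", "project": "Approved Tag A"},
]
sample_lookup = build_lookup(sample_records)
```

With this sketch, `approved_tag("alt name a", sample_lookup)` resolves to the approved tag, while an unknown name returns `None` so the caller can flag it for curation.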
The Python programs in the sync and utility sections of this repository are meant to be run from the Unix command line, preferably from inside a Python virtual environment. To see which command line parameters may be specified for programs, use --help:
    my_venv/bin/python3 update_dois.py --help
Most of the command line programs have a set of common parameters:
- --manifold: used to specify the MongoDB database manifold (dev or prod)
- --write: actually write to the database. If not specified, no documents will be updated in the MongoDB database
- --verbose: verbose mode for logging. Status messages are printed to STDOUT (this is chatty)
- --debug: debug mode for logging. Debug messages are printed to STDOUT (this is chatty in the extreme)
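The common parameters above could be declared with `argparse` roughly as follows. This is a sketch, not the code used by the programs in this repository: the helper names `build_parser` and `log_level`, the default manifold of `dev`, and the mapping of flags to `logging` levels are assumptions.

```python
import argparse
import logging


def build_parser(description: str) -> argparse.ArgumentParser:
    """Build a parser exposing the common command line parameters."""
    parser = argparse.ArgumentParser(description=description)
    # Defaulting to "dev" is an assumption made for this sketch
    parser.add_argument("--manifold", choices=["dev", "prod"], default="dev",
                        help="MongoDB database manifold (dev or prod)")
    # Without --write the program should behave as a dry run
    parser.add_argument("--write", action="store_true",
                        help="actually write to the database")
    parser.add_argument("--verbose", action="store_true",
                        help="print status messages to STDOUT")
    parser.add_argument("--debug", action="store_true",
                        help="print debug messages to STDOUT")
    return parser


def log_level(args: argparse.Namespace) -> int:
    """Map the logging flags to a logging level; --debug takes precedence."""
    if args.debug:
        return logging.DEBUG
    if args.verbose:
        return logging.INFO
    return logging.WARNING
```

A program would typically call `build_parser("...").parse_args()` once at startup and check `args.write` before every database update, so that a run without `--write` only reports what it would have changed.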
Other common parameters:
- --doi: a single DOI to process
- --file: a file of DOIs to process (one DOI per line)
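A program that accepts `--doi` and `--file` needs to turn them into a single list of DOIs to process. The sketch below shows one way to do that. The function name `collect_dois` is hypothetical, and allowing both parameters at once as well as skipping blank lines are assumptions rather than documented behavior.

```python
from pathlib import Path
from typing import List, Optional


def collect_dois(doi: Optional[str] = None,
                 file: Optional[str] = None) -> List[str]:
    """Gather DOIs from a single value and/or a file with one DOI per line."""
    dois: List[str] = []
    if doi:
        dois.append(doi.strip())
    if file:
        for line in Path(file).read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line:  # skip blank lines (an assumption for this sketch)
                dois.append(line)
    return dois
```

Calling `collect_dois(doi=args.doi, file=args.file)` with neither parameter set returns an empty list, which a program could treat as "nothing to do".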
While this system does use some config files, the database credentials are stored in the Configuration system.
The current production server is dis.int.janelia.org. If this changes, you'll need to modify nginx.conf.