Skip to content

JaneliaSciComp/dis-utilities

Repository files navigation

dis-utilities Picture

GitHub last commit GitHub commit merge status

Python Flask Bootstrap jQuery MongoDB Docker

DOI system and utilities for Data and Information Services

This system automatically discovers papers and datasets published by HHMI Janelia staff and stores them in a MongoDB database. The automated scripts, which are run periodically (nightly or weekly), also make "educated guesses" about metadata that are of strategic interest to Janelia, such as labs, teams, and employees who contributed to the work. Utility scripts allow the librarian to curate these metadata in a semi-automated fashion. A Flask-based application provides a user interface, visualizations, and a REST API.

This repository is split into four sections:

  • api: Web-based user interface and REST API
  • etl: programs for ETL (Extract-Transform-Load) for creating/maintaining DIS database
  • sync: programs meant to be periodically run in the backgroud to sync the DIS database from external data sources
  • utility: utility programs to be run interactively on the command line, for CRUD operations on database collections

DIS system architecture

DIS system architecture

The DIS system is based on a MongoDB database with collections to persist DOIs, ORCIDs, and project mappings. Python programs are used for ETL and updates. A Flask-based application provides user interface, visualizations, and a REST API.

The DIS MongoDB database contains four collections:

  • dois: local persistence of records from Crossref or DataCite along with Janelia metadata
  • dois_to_process: transient storage for DOIs that are present in secondary systems (e.g. bioRxiv) but not yet available in Crossref/DataCite
  • orcid: Janelia authors. Data in this collection is drawn fro ORCID and the HHMI People system.
  • project_map: mapping of alternate project names to approved tags

Python command line programs

The Python programs in the sync and utility sections of this repository are meant to be run from the Unix command line, preferably from inside a Python virtual environment. To see which command line parameters may be specified for programs, use --help:

my_venv/bin/python3 update_dois.py --help

Common command line parameters

Most of the command line programs have a set of common parameters:

  • --manifold: used to specify the MongoDB database manifold (dev or prod)
  • --write: actually write to the database. If not specified, no rows will be updated in the MongoDB database
  • --verbose: verbose mode for logging - status messages are printed to STDOUT - this is chatty
  • --debug: debug mode for logging - debug messages are printed to STDOUT - this is chatty in the extreme

Other common parameters:

  • --doi: a single DOI to process
  • --file: a file of DOIs to process (one DOI per line)

Configuration

While this system does use some config files, the database credentials are stored in the Configuration system.

Production server

The current production server is dis.int.janelia.org. If this changes, you'll need to modify nginx.conf.

About

Utilities for Data and Information Services

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published