OAI-PMH-Harvester is a Python CLI application for harvesting metadata from repositories (also known as "Data Providers") available via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).
- To preview a list of available Makefile commands:
make help
- To install with dev dependencies:
make install
- To update dependencies:
make update
- To run unit tests:
make test
- To lint the repo:
make lint
- To run the app:
pipenv run oai --help
Create a virtual environment and install dev dependencies: make install.
Additional notes:
- To execute the steps below, you can use the following sample URL to an OAI-PMH repo: https://aspace-staff-dev.mit.edu/oai
- To write the output file to an S3 bucket, pass an S3 URI as the -o/--output-file argument:
  - With AWS credentials:
    -o s3://<AWS_KEY>:<AWS_SECRET_KEY>@<BUCKET_NAME>/<output-filename>.xml
  - Without AWS credentials (if you have your credentials stored locally):
    -o s3://<BUCKET_NAME>/<output-filename>.xml
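As a rough illustration of how the two S3 URI forms differ, the following Python sketch splits an -o/--output-file value into its parts using only the standard library. The helper name parse_output_target and the returned dictionary layout are hypothetical; they are not taken from the harvester's code, which may handle S3 URIs differently.

```python
from urllib.parse import unquote, urlparse


def parse_output_target(output_file: str) -> dict:
    """Split an -o/--output-file value into its parts (hypothetical helper)."""
    parsed = urlparse(output_file)
    if parsed.scheme != "s3":
        # Anything that is not an s3:// URI is treated as a local filepath.
        return {"kind": "local", "path": output_file}
    return {
        "kind": "s3",
        # Note: urlparse lowercases the hostname; S3 bucket names are lowercase.
        "bucket": parsed.hostname,
        "key": parsed.path.lstrip("/"),
        # username/password are None when no credentials are embedded in the URI.
        "access_key": unquote(parsed.username) if parsed.username else None,
        "secret_key": unquote(parsed.password) if parsed.password else None,
    }


print(parse_output_target("s3://my-bucket/records.xml"))
```

In this sketch, a secret key containing a character such as / would have to be percent-encoded before being embedded in the URI, which is one practical reason to prefer locally stored credentials.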
- Run make dist-dev to build the Docker container image.
- To run a harvest in the Docker container, execute the following command in your terminal:
  docker run -it --volume <local-file-path>:<docker-file-path> oai-pmh-harvester-dev -h <url-to-oai-pmh-repo> -o <docker-file-path>/<output-filename>.xml harvest <optional-command-args>
  Note: The -v/--volume argument mounts <local-file-path> from the current directory into the container at <docker-file-path>, so the generated output file can be viewed in <local-file-path>.
- To run a harvest locally with pipenv, execute the following command in your terminal:
  pipenv run oai -h <url-to-oai-pmh-repo> -o <output-filename>.xml harvest <optional-command-args>
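If a harvest needs to be launched from a script rather than typed by hand, the command can be assembled as an argument list. The sketch below is only an illustration: build_harvest_command is a hypothetical helper, and the output filename records.xml is a placeholder. It places -h and -o before the harvest subcommand and the harvest options after it, matching the order in the command shown above.

```python
import shlex


def build_harvest_command(host, output_file, harvest_args=()):
    """Return the argv list for a local harvest (hypothetical helper)."""
    # -h and -o belong to the top-level command, so they precede "harvest";
    # harvest-specific options follow the subcommand name.
    return ["pipenv", "run", "oai", "-h", host, "-o", output_file, "harvest", *harvest_args]


command = build_harvest_command(
    "https://aspace-staff-dev.mit.edu/oai",
    "records.xml",
    ["--method", "get", "--exclude-deleted"],
)
# Print the shell-quoted form for inspection instead of executing it.
print(shlex.join(command))
```

Passing the list to subprocess.run(command, check=True) would execute it; keeping the arguments as a list avoids shell-quoting problems with URLs and file paths.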
# Set to dev for local development. Terraform sets this to 'stage' or 'prod' in those environments.
WORKSPACE=dev
# Required only when a source has records that cause errors during a harvest and --method=get is used. The value must be a space-separated list of OAI-PMH record identifiers to skip during harvest.
RECORD_SKIP_LIST=<oai-pmh-id1> <oai-pmh-id2>
# Sets the interval for logging status updates as records are written to the output file. Defaults to 1000, which will log a status update for every thousandth record.
STATUS_UPDATE_INTERVAL=1000
# If set to a valid Sentry DSN, enables Sentry exception monitoring. This is not needed for local development.
SENTRY_DSN=<sentry-dsn-for-oai-pmh-harvester>
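To make the expected value formats concrete, here is a minimal Python sketch of how these variables could be read. It is not the harvester's own configuration code: load_settings and should_log_status are hypothetical names, and the sketch simply assumes the space-separated skip list and the default interval of 1000 described in the comments above.

```python
import os


def load_settings(environ=None):
    """Read the documented environment variables (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    return {
        "workspace": environ.get("WORKSPACE"),
        # str.split() with no argument splits on any run of whitespace,
        # so an unset or empty variable yields an empty list.
        "record_skip_list": environ.get("RECORD_SKIP_LIST", "").split(),
        "status_update_interval": int(environ.get("STATUS_UPDATE_INTERVAL", "1000")),
        # Treat an empty SENTRY_DSN the same as an unset one.
        "sentry_dsn": environ.get("SENTRY_DSN") or None,
    }


def should_log_status(record_count, interval):
    """Return True on every interval-th record written."""
    return record_count % interval == 0


settings = load_settings({"WORKSPACE": "dev", "RECORD_SKIP_LIST": "oai:12345 oai:67890"})
print(settings["record_skip_list"])
```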
All CLI commands can be run with pipenv run.
Usage: -c [OPTIONS] COMMAND [ARGS]...
Options:
  -h, --host TEXT         Hostname of server for an OAI-PMH compliant source.
                          [required]
  -o, --output-file TEXT  Filepath for generated output (either an XML file
                          with harvested metadata or a JSON file describing
                          set structure of an OAI-PMH compliant source). This
                          value can be a local filepath or an S3 URI.
                          [required]
  -v, --verbose           Pass to log at debug level instead of info
  --help                  Show this message and exit.

Commands:
  harvest  Harvest command to retrieve records from an OAI-PMH compliant source.
  setlist  Create a JSON file describing the set structure of an OAI-PMH compliant source.
Usage: -c harvest [OPTIONS]
Harvest command to retrieve records from an OAI-PMH compliant source.
Options:
  --method [get|list]         Method for record retrieval. The 'list' method
                              is faster and should be used in most cases;
                              'get' method should be used for ArchivesSpace
                              due to errors retrieving a full record set with
                              the 'list' method. [default: list]
  -m, --metadata-format TEXT  Alternate metadata format for harvested records.
                              A record should only be returned if the format
                              specified can be disseminated from the item
                              identified by the value of the identifier
                              argument. [default: oai_dc]
  -f, --from-date TEXT        Filter for files modified on or after this date;
                              format YYYY-MM-DD.
  -u, --until-date TEXT       Filter for files modified before this date;
                              format YYYY-MM-DD.
  -s, --set-spec TEXT         SetSpec of set to be harvested. Limits harvest
                              to records in the provided set.
  -sr, --skip-record TEXT     Set of OAI-PMH identifiers for records to skip
                              during a harvest. Only works when --method=get.
                              Multiple identifiers can be provided using the
                              syntax: '-sr oai:12345 -sr oai:67890'. Values
                              can also be retrieved through the
                              RECORD_SKIP_LIST env var (see README for more
                              details).
  --exclude-deleted           Pass to exclude deleted records from harvest.
  --help                      Show this message and exit.
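The difference between the two retrieval methods can be pictured in terms of the requests sent to the repository. The Python sketch below builds OAI-PMH request URLs with the standard library: ListRecords for a bulk harvest and one GetRecord request per identifier otherwise. The verb and argument names (metadataPrefix, from, until, set, identifier) come from the OAI-PMH specification; the function names are hypothetical, and the way the options map onto requests is an assumption about the harvester, not a description of its code.

```python
from urllib.parse import urlencode


def selective_harvest_args(metadata_format="oai_dc", from_date=None, until_date=None, set_spec=None):
    """Translate harvest options into OAI-PMH request arguments (assumed mapping)."""
    args = {"metadataPrefix": metadata_format}
    if from_date:
        args["from"] = from_date
    if until_date:
        args["until"] = until_date
    if set_spec:
        args["set"] = set_spec
    return args


def list_records_url(host, **options):
    """'list' style: records are fetched in bulk with the ListRecords verb."""
    query = urlencode({"verb": "ListRecords", **selective_harvest_args(**options)})
    return f"{host}?{query}"


def get_record_urls(host, identifiers, metadata_format="oai_dc", skip=()):
    """'get' style: one GetRecord request per identifier, minus skipped ones."""
    skipped = set(skip)
    urls = []
    for identifier in identifiers:
        if identifier in skipped:
            continue  # mirrors the idea behind -sr/--skip-record
        query = urlencode(
            {"verb": "GetRecord", "identifier": identifier, "metadataPrefix": metadata_format}
        )
        urls.append(f"{host}?{query}")
    return urls


print(list_records_url("https://example.org/oai", from_date="2023-01-01", set_spec="collection"))
```

Because each record is requested separately in the second style, a single problematic identifier can be skipped without abandoning the rest of the harvest, which is consistent with the skip option only applying when --method=get.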
Usage: -c setlist [OPTIONS]
Create a JSON file describing the set structure of an OAI-PMH compliant
source.
Uses the OAI-PMH ListSets verb to retrieve all sets from a repository, and
writes the set names and specs to a JSON output file.
Options:
--help Show this message and exit.
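For readers unfamiliar with ListSets, the following Python sketch shows the general idea of turning a ListSets response into JSON. The sample response is invented for illustration, the function name sets_to_json is hypothetical, and the JSON keys ("name", "spec") are assumptions; the file that setlist actually writes may be structured differently. Paging through large set lists with a resumptionToken is left out.

```python
import json
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented sample response, trimmed to the elements used below.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>collection</setSpec><setName>Collections</setName></set>
    <set><setSpec>collection:maps</setSpec><setName>Maps</setName></set>
  </ListSets>
</OAI-PMH>
"""


def sets_to_json(response_xml):
    """Extract set names and specs from a ListSets response (hypothetical helper)."""
    root = ET.fromstring(response_xml.strip())
    sets = [
        {
            "name": element.findtext("oai:setName", namespaces=OAI_NS),
            "spec": element.findtext("oai:setSpec", namespaces=OAI_NS),
        }
        for element in root.iterfind("oai:ListSets/oai:set", OAI_NS)
    ]
    return json.dumps(sets, indent=2)


print(sets_to_json(SAMPLE_RESPONSE))
```

The setSpec values are what the -s/--set-spec option of the harvest command expects.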