This repository maintains the data storage system of the Human Cell Atlas. We use this Google Drive folder for design docs and meeting notes, and this Zenhub board to track our GitHub work.
The DSS is a replicated data storage system designed for hosting large sets of scientific experimental data on Amazon S3 and Google Storage. The DSS exposes an API for interacting with the data and is built using Chalice, API Gateway and AWS Lambda. The API also implements Step Functions to orchestrate Lambdas for long-running tasks such as large file writes. You can find the API documentation and give it a try here.
The DSS API uses Swagger to define the API specification according to the OpenAPI 2.0 specification. Connexion is used to map the API specification to its implementation in Python.
You can use the Swagger Editor to review and edit the API specification. When the API is live, the spec is also available at /v1/swagger.json.
You can find the DSS API documentation for each stage listed below:
- HCA DSS: The Human Cell Atlas Data Storage System
In this section, you'll configure and deploy a local API server and your own suite of cloud services to run a development version of the DSS.
Note that privileged access to cloud accounts (AWS, GCP, etc.) is required to deploy the data-store. If your deployment fails due to access restrictions, please consult your local systems administrators.
Also note that all commands given in this README should be run from the root of this repository after sourcing the correct environment (see the Configuration section below). The root directory of the repository is also available in the environment variable $DSS_HOME.
The DSS requires Python 3.6 to run.
Clone the repo and install dependencies:
git clone git@github.com:HumanCellAtlas/data-store.git
cd data-store
pip install -r requirements-dev.txt
Also install Terraform from HashiCorp using your favourite package manager.
The DSS is configured via environment variables. The required environment variables and their default values are defined in the file environment. To customize the values of these environment variables:
- Copy environment.local.example to environment.local
- Edit environment.local to add custom entries that override the default values in environment
- Run source environment now and whenever these environment files are modified.
The full list of configurable environment variables and their descriptions are documented here.
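As a quick sanity check after sourcing environment, a short script can confirm that the variables this README relies on are set. The sketch below is illustrative and not part of the repository. The names in EXPECTED_VARS are only a sample of variables mentioned in this document, not the full required set, and the helper name missing_vars is hypothetical.

```python
import os

# A sample of variables named in this README; NOT the complete required set.
EXPECTED_VARS = ["DSS_HOME", "DSS_DEPLOYMENT_STAGE", "AWS_DEFAULT_REGION", "GCP_DEFAULT_REGION"]


def missing_vars(env, names=EXPECTED_VARS):
    """Return the names that are unset or empty in the given mapping."""
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(os.environ)
    if missing:
        print("Missing: " + ", ".join(missing) + " (did you run `source environment`?)")
    else:
        print("All sampled variables are set.")
```

Passing the mapping in as an argument, rather than reading os.environ inside the function, keeps the check easy to exercise with a plain dictionary.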
The DSS uses the Amazon S3 backend for Terraform-deployed resources. To run deployment, make sure that the requisite bucket exists. The bucket is named after $DSS_TERRAFORM_BACKEND_BUCKET_TEMPLATE, by default org-humancellatlas-{account_id}-terraform, with {account_id} substituted with your AWS account ID.
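To see which bucket name the default template resolves to, the substitution can be reproduced in a couple of lines. This is a minimal sketch that assumes the placeholder is filled in the same way Python's str.format would fill it. The account ID shown is a made-up example and the function name is hypothetical.

```python
DEFAULT_TEMPLATE = "org-humancellatlas-{account_id}-terraform"


def terraform_backend_bucket(template, account_id):
    """Substitute the AWS account ID into the bucket name template."""
    return template.format(account_id=account_id)


# "123456789012" is a placeholder account ID, not a real one.
print(terraform_backend_bucket(DEFAULT_TEMPLATE, "123456789012"))
# prints org-humancellatlas-123456789012-terraform
```

Your own account ID can be looked up with aws sts get-caller-identity once the AWS CLI is configured.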
- Follow this tutorial to install the AWS command line utility and configure your AWS access credentials.
- Specify the names of S3 buckets in environment.local, using the environment variables DSS_S3_BUCKET_*, and verify that AWS_DEFAULT_REGION points to your preferred region. These buckets will be created with Terraform, and should not exist before deploying for the first time.
- Generate OAuth application secrets to be used for your instance:
  - Go to the GCP API and Service Credentials page. You may have to select Organization and Project again.
  - Click Create Credentials and select OAuth client
  - For Application type choose Other
  - Under application name, use hca-dss- followed by the stage name (i.e. the value of DSS_DEPLOYMENT_STAGE). This is a convention only and carries no technical significance.
  - Click Create, don't worry about noting the client ID and secret, click OK
  - Click the edit icon for the new credentials and click Download JSON
  - Place the downloaded JSON file into the project root as application_secrets.json
  - Run the command
    ### WARNING: RUNNING THIS COMMAND WILL
    ### CLEAR EXISTING SECRET VALUE
    cat $DSS_HOME/application_secrets.json | ./scripts/dss-ops.py secrets set --secret-name $GOOGLE_APPLICATION_SECRETS_SECRETS_NAME
- Download the gcloud command line utility.
- Run the command
  ./scripts/dss-ops.py lambda update
  This populates the environment variables defining your stage into an AWS Simple Systems Manager parameter store. These variables will be read from the parameter store and in-lined into the Lambda deployment packages during deployment. This command should be executed whenever the environment variables are updated. The environments of currently deployed Lambdas may optionally be updated in place with the flag --update-deployed.
- Choose a region that has support for Cloud Functions and set GCP_DEFAULT_REGION to that region. See the GCP locations list for a list of supported regions.
- Run gcloud config set project PROJECT_ID where PROJECT_ID is the ID, not the name (e.g. hca-store-21555, NOT just hca-store), of the GCP project you selected earlier.
- Enable required APIs:
  gcloud services enable cloudfunctions.googleapis.com
  gcloud services enable runtimeconfig.googleapis.com
  gcloud services enable iam.googleapis.com
- Specify the names of Google Cloud Storage buckets in environment.local, using the environment variables DSS_GS_BUCKET_*, and verify that GCP_DEFAULT_REGION points to your preferred region. These buckets will be created with Terraform, and should not exist before deploying for the first time.
The following environment variables must be set to enable user authentication and authorization.
- OIDC_AUDIENCE must be populated with the expected JWT (JSON Web Token) audience.
- OPENID_PROVIDER is the generator of the JWT, and is used to determine how the JWT is validated.
- OIDC_GROUP_CLAIM is the JWT claim that specifies the group the user belongs to.
- OIDC_EMAIL_CLAIM is the JWT claim that specifies the requester's email.
Also update authorizationUrl in dss/dss-api.yml to point to an authorization endpoint which returns a valid JWT.
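When choosing values for these variables it can help to look at the claims your provider actually issues. The following is a debugging sketch only: it decodes the token payload with the standard library and does not verify the signature, so it must never be used to authorize requests. It is not how the DSS validates tokens. The function names are hypothetical.

```python
import base64
import json


def b64url_decode(segment):
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def unverified_claims(token):
    """Return the payload claims WITHOUT verifying the signature (debugging only)."""
    _header, payload, _signature = token.split(".")
    return json.loads(b64url_decode(payload))


def claim_problems(claims, audience, group_claim, email_claim):
    """List mismatches between a token's claims and the configured expectations."""
    aud = claims.get("aud")
    # The "aud" claim may be a single string or a list of strings.
    audiences = aud if isinstance(aud, list) else [aud]
    problems = []
    if audience not in audiences:
        problems.append("audience mismatch")
    problems += [f"missing claim: {name}" for name in (group_claim, email_claim) if name not in claims]
    return problems
```

In practice the three expected values would be read from OIDC_AUDIENCE, OIDC_GROUP_CLAIM and OIDC_EMAIL_CLAIM in the environment, and an empty result from claim_problems suggests the configuration matches the tokens being issued.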
Optional: To configure a custom swagger auth before deployment run:
python scripts/swagger_auth.py -c='{"/path": "call"}'
Alternatively, to configure auth for all swagger endpoints, you can run:
python scripts/swagger_auth.py --secure
Note: Removing auth from endpoints will currently break tests, however adding auth should be fine (make test should run successfully).
Note: The auth config file for deployment can also be set in environment.local with AUTH_CONFIG_FILE.
Some daemons (dss-checkout-sfn for example) use Amazon SES to send emails. You must set DSS_NOTIFICATION_SENDER to your email address and then verify that address using the SES Console, which enables SES to send notification emails from it.
Run ./dss-api in the top-level data-store directory to deploy the DSS API on your localhost.
When deploying for the first time, a Google Cloud Platform service account must first be created and credentialed.
- Specify the name of the Google Cloud Platform service account in environment.local using the variable DSS_GCP_SERVICE_ACCOUNT_NAME.
- Provision a set of credentials that will allow you to run deployment.
  - In the Google Cloud Console, select the correct Google user account on the top right and the correct GCP project in the drop down in the top center. Go to "IAM & Admin", then "Service accounts".
  - Click "Create service account" and select "Furnish a new private key". Under "Roles", select a) "Project – Owner", b) "Service Accounts – Service Account User" and c) "Cloud Functions – Cloud Function Developer".
  - Create the account and download the service account key JSON file.
  - Place the file as $DSS_HOME/gcp-credentials.json. You will replace it later.
- Create the Google Cloud Platform service account using the command
  make -C infra COMPONENT=gcp_service_account apply
  This step can be skipped if you're rotating credentials.
- Place the downloaded JSON file into the project root as gcp-credentials.json
- Run the command
  ### WARNING: RUNNING THIS COMMAND WILL
  ### CLEAR EXISTING SECRET VALUE
  cat $DSS_HOME/gcp-credentials.json | ./scripts/dss-ops.py secrets set --secret-name $GOOGLE_APPLICATION_CREDENTIALS_SECRETS_NAME
Set admin account emails within AWS Secrets Manager:
### WARNING: RUNNING THIS COMMAND WILL
### CLEAR EXISTING SECRET VALUE
echo -n '<admin_email_1>,<admin_email_2>' | ./scripts/dss-ops.py secrets set --secret-name $ADMIN_USER_EMAILS_SECRETS_NAME
Assuming the tests have passed above, the next step is to manually deploy. See the section below for information on CI/CD with Travis if continuous deployment is your goal.
Several components in the DSS are deployed separately as daemons, found in $DSS_HOME/daemons. Daemon deployment may incorporate dependent infrastructure, such as SQS queues or SNS topics, by placing Terraform files in the daemon directory, e.g. $DSS_HOME/daemons/dss-admin/my_queue_defs.tf. This infrastructure is deployed non-interactively, without the usual plan/review Terraform workflow, and should therefore be lightweight in nature. Large infrastructure should be added to $DSS_HOME/infra instead.
Cloud resource names can collide in both AWS and GCP, so rename resources as needed.
Buckets within AWS and GCP need to be available for use by the DSS. Use Terraform to set up these resources:
make -C infra COMPONENT=buckets plan
make -C infra COMPONENT=buckets apply
The AWS Elasticsearch Service is used for metadata indexing. Currently the DSS uses version 5.5 of Elasticsearch. For typical development deployments the t2.small.elasticsearch instance type is sufficient. Use the DSS_ES_ variables to adjust the cluster as needed.
Add the allowed IPs for Elasticsearch to the secret manager as a comma-separated list:
### WARNING: RUNNING THIS COMMAND WILL
### CLEAR EXISTING SECRET VALUE
echo -n '1.1.1.1,2.2.2.2' | ./scripts/dss-ops.py secrets set --secret-name $ES_ALLOWED_SOURCE_IP_SECRETS_NAME
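Because setting the secret overwrites the existing value, it can be worth validating the address list before piping it in. The sketch below builds the comma-separated string with the standard ipaddress module. It assumes the secret holds plain IP addresses only (whether CIDR ranges are accepted is not stated here), and the function name is hypothetical.

```python
import ipaddress


def allowed_ips_secret(addresses):
    """Validate each address and join them with commas, with no trailing newline."""
    # ipaddress.ip_address raises ValueError for anything that is not a valid IP.
    validated = [str(ipaddress.ip_address(address.strip())) for address in addresses]
    return ",".join(validated)


print(allowed_ips_secret(["1.1.1.1", " 2.2.2.2"]), end="")
# prints 1.1.1.1,2.2.2.2
```

Printing with end="" mirrors the echo -n in the command above, so the output can be piped to the secrets command unchanged.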
Use Terraform to deploy ES resource:
make -C infra COMPONENT=elasticsearch plan
make -C infra COMPONENT=elasticsearch apply
A certificate matching your domain must be registered with AWS Certificate Manager. Set ACM_CERTIFICATE_IDENTIFIER to the identifier of the certificate, which can be found on the AWS console.
An AWS Route 53 zone must be available for your domain name and configured in environment.
Now deploy using make:
make plan-infra
make deploy-infra
make deploy
If successful, you should be able to see the Swagger API documentation at:
https://<domain_name>
And you should be able to list bundles like this:
curl -X GET "https://<domain_name>/v1/bundles" -H "accept: application/json"
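The same smoke test can be scripted. This sketch mirrors the curl call above using only the standard library. The domain is whatever you deployed to, the function names are hypothetical, and the shape of the JSON response is not shown here, so the result is returned as parsed without further interpretation.

```python
import json
import urllib.request


def bundles_request(domain_name):
    """Build the GET request for the bundles endpoint used in the curl example."""
    return urllib.request.Request(
        f"https://{domain_name}/v1/bundles",
        headers={"accept": "application/json"},
        method="GET",
    )


def list_bundles(domain_name):
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(bundles_request(domain_name)) as response:
        return json.load(response)
```

Calling list_bundles with your domain name raises urllib.error.HTTPError on a non-2xx response, which makes a failed deployment easy to spot in a script.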
Please see the data-store-monitor repo for additional monitoring tools.
We use Travis CI for continuous unit testing that does not involve deployed components. A private GitLab instance is used for deployment to the dev environment if unit tests pass, as well as further testing of deployed components, for every commit on the master branch. GitLab testing results are announced on the data-store-eng Slack channel in the HumanCellAtlas workspace. Travis behaviour is defined in .travis.yml, and GitLab behaviour is defined in .gitlab-ci.yml.
Encrypted environment variables give Travis CI the AWS credentials needed to run the tests and deploy the app. Run scripts/authorize_aws_deploy.sh IAM-PRINCIPAL-TYPE IAM-PRINCIPAL-NAME (e.g. authorize_aws_deploy.sh group travis-ci) to give that principal the permissions needed to deploy the app. Because a group policy has a higher size limit (5,120 characters) than a user policy (2,048 characters), it is advisable to apply this to a group and add the principal to that group. Because this is a limited set of permissions, it does not have write access to IAM. To set up the IAM policies for resources in your account that the app will use, run make deploy using privileged account credentials once from your workstation. After this is done, Travis CI will be able to deploy on its own. You must repeat the make deploy step from a privileged account any time you change the IAM policy templates in iam/policy-templates/.
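The size limits quoted above can be checked before a policy is attached. The sketch below is an approximation, not the check AWS performs: it serializes the policy without insignificant whitespace and compares the length against the two limits from the text. The function names and the sample policy are hypothetical.

```python
import json

# Inline policy size limits quoted in the text above, in characters.
LIMITS = {"user": 2048, "group": 5120}


def policy_size(policy):
    """Length of the policy serialized without insignificant whitespace."""
    return len(json.dumps(policy, separators=(",", ":")))


def fits(policy, principal_type):
    """Return True if the policy is within the limit for the principal type."""
    return policy_size(policy) <= LIMITS[principal_type]


# Made-up policy with many actions, to show a case that only fits a group.
sample = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [f"service:Action{i:04d}" for i in range(200)],
        "Resource": "*",
    }],
}
print(policy_size(sample), fits(sample, "user"), fits(sample, "group"))
```

A policy that fails the user check but passes the group check is the situation in which applying the policy to a group, as advised above, is required.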
Environment variables provide the AWS credentials needed to relay events originating from supported cloud platforms outside of AWS. Run scripts/create_config_aws_event_relay_user.py to create an AWS IAM user with the appropriate restricted access policy. This script also creates the user access key and stores it in an AWS Secrets Manager store.
Note: when executing the script above, ensure that the role or user used within AWS is authorized to perform iam:CreateUser.
Now that you have deployed the data store, the next step is to use the HCA Data Store CLI to upload and download data to the system. See data-store-cli for installation instructions. The client requires that you change hca/api_spec.json to point to the correct host, schemes, and, possibly, basePath. Examples of CLI use:
# list bundles
hca dss post-search --es-query "{}" --replica=aws | less
# upload full bundle
hca dss upload --replica aws --staging-bucket staging_bucket_name --src-dir ${DSS_HOME}/tests/fixtures/datafiles/example_bundle
Now that you've uploaded data, the next step is to confirm the indexing is working properly and you can query the indexed metadata.
hca dss post-search --replica aws --es-query '
{
"query": {
"bool": {
"must": [{
"match": {
"files.donor_organism_json.medical_history.smoking_history": "yes"
}
}, {
"match": {
"files.specimen_from_organism_json.genus_species.text": "Homo sapiens"
}
}, {
"match": {
"files.specimen_from_organism_json.organ.text": "brain"
}
}]
}
}
}
'
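Hand-writing nested JSON inside shell quotes is error-prone, so the query can also be generated. This sketch rebuilds the query above from a mapping of field names to values and prints a shell-safe command line. The helper name match_all_of is hypothetical, while the field names and CLI flags are the ones used in the example above.

```python
import json
import shlex


def match_all_of(fields):
    """Build a bool/must query with one match clause per field."""
    return {"query": {"bool": {"must": [{"match": {name: value}} for name, value in fields.items()]}}}


query = match_all_of({
    "files.donor_organism_json.medical_history.smoking_history": "yes",
    "files.specimen_from_organism_json.genus_species.text": "Homo sapiens",
    "files.specimen_from_organism_json.organ.text": "brain",
})

# shlex.quote protects the JSON document from shell interpretation.
command = ["hca", "dss", "post-search", "--replica", "aws", "--es-query", json.dumps(query)]
print(" ".join(shlex.quote(part) for part in command))
```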
- Check that software packages required to test and deploy are available, and install them if necessary:
  make --dry-run
- Populate test fixture buckets with test fixture data (this command will completely empty the given buckets before populating them with test fixture data, so please ensure the correct bucket names are provided):
  tests/fixtures/populate.py --s3-bucket $DSS_S3_BUCKET_TEST_FIXTURES --gs-bucket $DSS_GS_BUCKET_TEST_FIXTURES
- Set the environment variable DSS_TEST_ES_PATH to the path of the elasticsearch binary on your machine.
- Run tests with
  make test
All tests for the DSS fall into one of two categories:
- Standalone tests, which do not depend on deployed components, and
- Integration tests, which depend on deployed components.
As such, standalone tests can be expected to pass even if no deployment is configured, and in fact should pass before an initial deployment. For more information on tests, see tests/README.md.
The direct runtime dependencies of this project are defined in requirements.txt.in. Direct development dependencies are defined in requirements-dev.txt.in. All dependencies, direct and transitive, are defined in the corresponding requirements.txt and requirements-dev.txt files. The latter two can be generated using make requirements.txt or make requirements-dev.txt respectively. Modifications to any of these four files need to be committed. This process is aimed at making dependency handling more deterministic without accumulating the upgrade debt that would be incurred by simply pinning all direct and transitive dependencies. Avoid being overly restrictive when constraining the allowed version range of direct dependencies in requirements.txt.in and requirements-dev.txt.in.
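To make the direct versus transitive distinction concrete, the sketch below compares the package names in a direct-dependency file with those in the generated, fully pinned file and reports the names that appear only in the latter. It is an illustration, not part of the build: the parsing is deliberately simple, the function names are hypothetical, and the sample contents and version numbers are made up.

```python
import re


def package_names(requirements_text):
    """Extract normalized package names from requirements-style text."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.add(match.group(0).lower().replace("_", "-"))
    return names


def transitive_only(direct_text, pinned_text):
    """Names pinned in the generated file but not declared directly."""
    return sorted(package_names(pinned_text) - package_names(direct_text))


# Made-up file contents for illustration.
direct = "requests >= 2.20\n"
pinned = "requests==2.22.0\nurllib3==1.25.3\nidna==2.8\n"
print(transitive_only(direct, pinned))
# prints ['idna', 'urllib3']
```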
If you need to modify or add a direct runtime dependency declaration, follow the steps below:
- Make sure there are no pending changes to requirements.txt or requirements-dev.txt.
- Make the desired change to requirements.txt.in or requirements-dev.txt.in.
- Run make requirements.txt. Run make requirements-dev.txt if you have modified requirements-dev.txt.in.
- Visually check the changes to requirements.txt and requirements-dev.txt.
- Commit them with a message like Bumping dependencies.
You now have two commits, one that catches up with updates to transitive dependencies, and one that tracks your explicit change to a direct dependency. This process applies to development dependencies as well, using requirements-dev.txt and requirements-dev.txt.in respectively.
If you wish to re-pin all the dependencies, run make refresh_all_requirements. It is advisable to do a full test-deploy-test cycle after this (the test after the deploy is required to test the lambdas).
- Always use a module-level logger, call it logger and initialize it as follows:
  import logging
  logger = logging.getLogger(__name__)
- Do not configure logging at module scope. It should be possible to import any module without side-effects on logging. The dss.logging module contains functions that configure logging for this application, its Lambda functions and unit tests.
- When logging a message, pass either
  - an f-string as the first and only positional argument or
  - a %-string as the first argument and substitution values as subsequent arguments. Do not mix the two string interpolation methods. If you mix them, any percent sign in a substituted value will raise an exception.
  # In other words, use
  logger.info(f"Foo is {foo} and bar is {bar}")
  # or
  logger.info("Foo is %s and bar is %s", foo, bar)
  # but not
  logger.info(f"Foo is {foo} and bar is %s", bar)
  # Keyword arguments can be used safely in conjunction with f-strings:
  logger.info(f"Foo is {foo}", exc_info=True)
- To enable verbose logging by application code, set the environment variable DSS_DEBUG to 1. To enable verbose logging by dependencies set DSS_DEBUG to 2. To disable verbose logging unset DSS_DEBUG or set it to 0.
- To assert in tests that certain messages were logged, use the dss logger or one of its children
  dss_logger = logging.getLogger('dss')
  with self.assertLogs(dss_logger) as log_monitor:
      # do stuff
  # or
  import dss
  with self.assertLogs(dss.logger) as log_monitor:
      # do stuff
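The interpolation rule and the assertLogs advice above can be tried together in a small, self-contained example. This is a sketch under assumptions: the logger name dss.example stands in for a real child of the dss logger, and the helper interpolates_cleanly applies the % operator directly, which is the interpolation the logging module performs when it formats a record. It does not import the dss package.

```python
import logging
import unittest

# Module-level logger; "dss.example" is a hypothetical child of the dss logger.
logger = logging.getLogger("dss.example")


def interpolates_cleanly(message, *args):
    """Return True if %-interpolating message with args succeeds."""
    try:
        message % args
    except (TypeError, ValueError):
        return False
    return True


def report(foo, bar):
    # Safe: the %-string is constant and the values are passed as arguments.
    logger.info("Foo is %s and bar is %s", foo, bar)


class ReportTest(unittest.TestCase):
    def test_report_logs_values(self):
        # Records from the child logger propagate to the parent dss logger.
        with self.assertLogs(logging.getLogger("dss"), level="INFO") as log_monitor:
            report("%d items", "baz")
        self.assertEqual(log_monitor.output, ["INFO:dss.example:Foo is %d items and bar is baz"])


foo, bar = "%d items", "baz"
print(interpolates_cleanly("Foo is %s and bar is %s", foo, bar))   # prints True
print(interpolates_cleanly(f"Foo is {foo} and bar is %s", bar))    # prints False
```

The second call fails because the percent sign inside foo becomes part of the format string once the f-string has been evaluated. The test case can be run with python -m unittest in the usual way.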
AWS X-Ray tracing is used for profiling the performance of deployed lambdas. This can be enabled for chalice/app.py by setting the lambda environment variable DSS_XRAY_TRACE=1. For all other daemons you must also check "Enable active tracing" under "Debugging and error handling" in the AWS Lambda console.
See our Security Policy.
External contributions are welcome. Please review the Contributing Guidelines