- Learn how to provision computing resources for running Big Data analyses using the Infrastructure as Code (IaC) approach.
- Learn how to set up opinionated CI/CD pipelines to deploy cloud infrastructure.
- Learn how to utilize linters for detecting security vulnerabilities in cloud infrastructure.
- Learn how to run Apache Spark code in a distributed way on a Hadoop cluster using Vertex AI notebooks and Dataproc services on GCP.
- Learn how to use Workload Identity Federation for secure authentication from GitHub Actions to Google Cloud.
- Google Cloud SDK
- gsutil
- pre-commit
- Terraform (see Requirements)
- Python ~>3.8
- Linux/macOS
- pre-commit-terraform dependencies
- Redeem a GCP coupon to create a billing account
- Authenticate to GCP to obtain the default credentials used for running the code
# first remove the stored credentials if they exist
gcloud auth application-default revoke
# login and get the new application credentials
gcloud auth application-default login
- Export shared environment variables
export TF_VAR_tbd_semester=2024L
# format: 20xx for teachers, student ID number for students
export TF_VAR_user_id=9900
# use your own billing account id
export TF_VAR_billing_account=01F44C-CA9C7E-587C25
- Enter the bootstrap folder, then initialize the project and the Terraform state bucket
cd bootstrap
terraform init
terraform apply
cd ..
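The bootstrap step above depends on the three exported TF_VAR_ variables. A minimal sketch of a pre-flight check is shown below; the function name check_tf_vars is hypothetical and not part of this repository, and the check only confirms that each variable is exported and non-empty, not that its value is valid.

```shell
# Hypothetical helper (not part of this repo): verify that the shared
# Terraform variables from the export step are present in the environment.
check_tf_vars() {
  missing=0
  for name in TF_VAR_tbd_semester TF_VAR_user_id TF_VAR_billing_account; do
    # printenv prints nothing when the variable is not exported
    if [ -z "$(printenv "$name")" ]; then
      echo "missing or empty: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

One way to use it is to run check_tf_vars before terraform apply and stop if it returns a non-zero status.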
- CI/CD (GitHub Actions setup using Workload Identity Federation)
- Edit the env/backend.tfvars file and set the bucket variable to the Terraform state bucket
- Edit the env/project.tfvars file and set the project_name and iac_service_account variables using the output from the bootstrap phase, e.g.:
- Edit cicd_bootstrap/conf/github_actions.tfvars to set github_org and github_repo, e.g.:
github_org = "mwiewior"
github_repo = "tbd-workshop-1"
- Init state file and set env variables
cd cicd_bootstrap
terraform init -backend-config=../env/backend.tfvars
- Apply
# authenticate Docker backend with GCP
gcloud auth configure-docker
# create CI/CD integration using Workload Identity
terraform apply -var-file ../env/project.tfvars -var-file conf/github_actions.tfvars -compact-warnings
cd ..
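The CI/CD steps above rely on several tfvars files being edited first. As a rough sketch, the helper below checks that each expected key has an assignment line in its file. The function names tfvars_has_key and check_cicd_tfvars are hypothetical, the check is a plain text match rather than a real HCL parse, and check_cicd_tfvars assumes it is run from the repository root.

```shell
# Hypothetical helper: succeeds if FILE contains a "KEY = <something>" line.
tfvars_has_key() {
  file="$1"
  key="$2"
  grep -Eq "^[[:space:]]*${key}[[:space:]]*=[[:space:]]*[^[:space:]]" "$file"
}

# Hypothetical helper: check every file/key pair named in the steps above.
# Assumes the current directory is the repository root.
check_cicd_tfvars() {
  status=0
  for pair in \
    env/backend.tfvars:bucket \
    env/project.tfvars:project_name \
    env/project.tfvars:iac_service_account \
    cicd_bootstrap/conf/github_actions.tfvars:github_org \
    cicd_bootstrap/conf/github_actions.tfvars:github_repo
  do
    file="${pair%%:*}"
    key="${pair##*:}"
    if ! tfvars_has_key "$file" "$key"; then
      echo "not set: $key in $file" >&2
      status=1
    fi
  done
  return "$status"
}
```

A text match like this will not catch an empty quoted value or a wrong bucket name; it only guards against a forgotten edit.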
- Use the output variables to configure the GitHub Actions workflow .github/workflows/pull-request.yml. Do not hardcode these values in the YAML file; set them as GitHub Actions secrets instead, preserving the secret names, i.e. GCP_WORKLOAD_IDENTITY_PROVIDER_NAME and GCP_WORKLOAD_IDENTITY_SA_EMAIL.
- Install and configure pre-commit
pre-commit install
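The two GitHub Actions secrets named above can be set in the repository settings page. As an alternative, here is a minimal sketch using the GitHub CLI. It assumes gh is installed, authenticated, and run inside a clone of your repository; the function name set_wif_secrets is hypothetical, and its two arguments are placeholders for the values printed by the cicd_bootstrap Terraform apply.

```shell
# Hypothetical helper: store the Workload Identity values as repository secrets.
# $1 = workload identity provider name, $2 = service account email
# (both taken from the Terraform output of the cicd_bootstrap step).
set_wif_secrets() {
  provider_name="$1"
  sa_email="$2"
  gh secret set GCP_WORKLOAD_IDENTITY_PROVIDER_NAME --body "$provider_name"
  gh secret set GCP_WORKLOAD_IDENTITY_SA_EMAIL --body "$sa_email"
}
```

Keeping the secret names unchanged matters because the workflow file refers to them by name.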
- Commit changes, push to a branch and open a PR to YOUR repository's main/master branch. If you see a warning that workflows are disabled, please enable the workflows and push your changes again.
- Once all pull request checks have passed, merge your PR and wait until your release job finishes.
- Navigate to the Vertex AI Workbench menu item, find your notebook on the list, press CONNECT and follow the instructions
- Check if the pyspark kernel exists; if not, add a Python 3.8 kernel in your JupyterLab environment:
python3.8 -m ipykernel install --user --name pyspark
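The kernel check above can also be done from a JupyterLab terminal. The sketch below is one possible way, assuming the jupyter command is on the PATH; the function names has_kernel and ensure_pyspark_kernel are hypothetical.

```shell
# Hypothetical helper: reads `jupyter kernelspec list` output on stdin and
# succeeds if a kernel with the given name appears in the listing.
has_kernel() {
  grep -Eq "^[[:space:]]*$1[[:space:]]"
}

# Hypothetical helper: install the pyspark kernel only when it is not listed yet.
ensure_pyspark_kernel() {
  if jupyter kernelspec list | has_kernel pyspark; then
    echo "pyspark kernel already present"
  else
    python3.8 -m ipykernel install --user --name pyspark
  fi
}
```

Calling ensure_pyspark_kernel more than once is harmless, since the install only runs when the kernel is missing.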
- Add support for arbitrary machine types and worker nodes for a Dataproc cluster and JupyterLab instance
- Add support for preemptible/spot instances in a Dataproc cluster
- Perform additional hardening of the JupyterLab environment, i.e. disable sudo access and enable secure boot
- (Optional) Get access to Apache Spark WebUI
- Create a BigQuery dataset and an external table (change storage location if needed)
CREATE SCHEMA IF NOT EXISTS demo OPTIONS(location = 'europe-west1');
CREATE OR REPLACE EXTERNAL TABLE demo.shakespeare
OPTIONS (
format = 'ORC',
uris = ['gs://tbd-2023z-9900-data/data/shakespeare/*.orc']);
SELECT * FROM demo.shakespeare ORDER BY sum_word_count DESC LIMIT 5;
- Workshop 2 exercises are described in the Jupyter notebook
- IMPORTANT ❗ ❗ ❗ Please remember to destroy all the resources after the workshop:
terraform init -backend-config=env/backend.tfvars
terraform destroy -no-color -var-file env/project.tfvars
Name | Version |
---|---|
terraform | ~> 1.5.0 |
docker | 3.0.2 |
 | ~> 5.23.0 |
kubernetes | 2.24.0 |
Name | Version |
---|---|
 | 5.23.0 |
kubernetes | 2.24.0 |
Name | Source | Version |
---|---|---|
composer | ./modules/composer | n/a |
data-pipelines | ./modules/data-pipeline | n/a |
dataproc | ./modules/dataproc | n/a |
dbt_docker_image | ./modules/dbt_docker_image | n/a |
gcr | ./modules/gcr | n/a |
jupyter_docker_image | ./modules/jupyter_docker_image | n/a |
vertex_ai_workbench | ./modules/vertex-ai-workbench | n/a |
vpc | ./modules/vpc | n/a |
Name | Type |
---|---|
google_compute_firewall.allow-all-internal | resource |
kubernetes_service.dbt-task-service | resource |
google_client_config.provider | data source |
google_container_cluster.composer-gke-cluster | data source |
Name | Description | Type | Default | Required |
---|---|---|---|---|
ai_notebook_instance_owner | Vertex AI workbench owner | string | n/a | yes |
project_name | Project name | string | n/a | yes |
region | GCP region | string | "europe-west1" | no |
No outputs.