terraform-google-airbyte-flows

A Terraform module to programmatically deploy end-to-end ELT flows to BigQuery on Airbyte. It supports custom sources, integrates with Secret Manager to securely store sensitive configurations, and lets you specify flows as YAML.

Prerequisites

  • Terraform. Tested with v1.5.3. Install Terraform
  • An authenticated gcloud CLI
  • An up and running Airbyte instance on GCP
  • GCP permissions
    • Broad roles that will work, but are not recommended for service accounts or even individual users.
      • roles/owner
      • roles/editor
    • Recommended roles that respect the principle of least privilege.
      • roles/bigquery.dataOwner
      • roles/secretmanager.admin
      • roles/storage.admin
    • Granular permissions required to build a custom role specific to this deployment (a Terraform sketch of such a role follows this list).
      • bigquery.datasets.create
      • bigquery.datasets.delete
      • bigquery.datasets.update
      • secretmanager.secrets.create
      • secretmanager.secrets.delete
      • secretmanager.versions.add
      • secretmanager.versions.destroy
      • secretmanager.versions.enable
      • storage.buckets.create
      • storage.buckets.delete
      • storage.buckets.getIamPolicy
      • storage.buckets.setIamPolicy
      • storage.hmacKeys.create
      • storage.hmacKeys.delete
      • storage.hmacKeys.update
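
As a sketch of that last option, such a custom role can itself be managed with Terraform through the google provider's google_project_iam_custom_role resource (the role id and project below are hypothetical):

resource "google_project_iam_custom_role" "airbyte_flows_deployer" {
  project = "my-gcp-project"        # hypothetical project id
  role_id = "airbyteFlowsDeployer"  # hypothetical role id
  title   = "Airbyte flows deployer"

  # Granular permissions listed above
  permissions = [
    "bigquery.datasets.create",
    "bigquery.datasets.delete",
    "bigquery.datasets.update",
    "secretmanager.secrets.create",
    "secretmanager.secrets.delete",
    "secretmanager.versions.add",
    "secretmanager.versions.destroy",
    "secretmanager.versions.enable",
    "storage.buckets.create",
    "storage.buckets.delete",
    "storage.buckets.getIamPolicy",
    "storage.buckets.setIamPolicy",
    "storage.hmacKeys.create",
    "storage.hmacKeys.delete",
    "storage.hmacKeys.update",
  ]
}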

Usage

Go to the examples directory to view all the code samples.


Get started with the module through a minimal flow example.
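
As an illustration, here is a minimal sketch of such a flow, assuming the PokeAPI source from the Airbyte catalog and hypothetical project, dataset, and bucket names:

module "airbyte_flows" {
  source  = "artefactory/airbyte-flows/google"
  version = "~> 0"

  project_id                    = "my-gcp-project"                                  # hypothetical project id
  airbyte_service_account_email = "airbyte@my-gcp-project.iam.gserviceaccount.com"  # hypothetical service account

  flows_configuration = {
    poke_flow = {
      flow_name   = "PokeAPI to BigQuery"
      source_name = "PokeAPI"  # a source from the Airbyte catalog

      tables_to_sync = {
        pokemon = {}  # defaults: full_refresh sync mode, append destination mode
      }

      source_specification = {
        pokemon_name = "pikachu"  # PokeAPI's single configuration field; other sources differ
      }

      destination_specification = {
        dataset_name        = "airbyte_raw"          # existing BigQuery dataset
        dataset_location    = "EU"
        staging_bucket_name = "my-airbyte-staging"   # existing GCS bucket used for staging
      }
    }
  }
}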

Most sources need to be configured with secrets (DB passwords, API keys, tokens, etc.). This example shows how to configure the module to fetch secret values from GCP Secret Manager to avoid hard-coding them in your configuration.
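
For example, a source_specification can reference a secret by name instead of a literal value (the field names below are illustrative and depend on your source):

  source_specification = {
    host     = "db.example.com"          # illustrative, source-specific field
    username = "airbyte"                 # illustrative, source-specific field
    password = "secret:my-db-password"   # resolved from Secret Manager at deployment time
  }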

If the source you want to integrate is not in the Airbyte catalog, you can create a custom connector and use it in the module.
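
Once your connector image is published to a Docker registry, you can point a flow at it with the custom_source block (the repository, tag, and URL below are hypothetical):

  custom_source = {
    docker_repository = "europe-west1-docker.pkg.dev/my-project/connectors/source-internal-api"  # hypothetical registry path
    docker_image_tag  = "0.1.0"
    documentation_url = "https://example.com/docs/source-internal-api"
  }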

You can set your ELT pipelines to run on a cron schedule by setting cron_schedule and optionally cron_timezone.
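
For instance, inside a flow definition (the timezone below is just an illustration):

  cron_schedule = "0 0 12 * * ?"   # Quartz-style expression: sync at 12:00 PM every day
  cron_timezone = "Europe/Paris"   # any TZ database identifier; defaults to UTC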

This module is designed to be compatible with external YAML configuration files. It is a convenient way for users not proficient in Terraform to specify/modify ELT pipelines programmatically, or to integrate this module with other tools that can generate YAML files.
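
One possible way to wire this up, assuming a flows.yaml file stored next to your Terraform configuration and mirroring the flows_configuration structure:

  # flows.yaml (hypothetical file)
  #
  # poke_flow:
  #   flow_name: PokeAPI to BigQuery
  #   source_name: PokeAPI
  #   tables_to_sync:
  #     pokemon: {}
  #   source_specification:
  #     pokemon_name: pikachu
  #   destination_specification:
  #     dataset_name: airbyte_raw
  #     dataset_location: EU
  #     staging_bucket_name: my-airbyte-staging

  # In the module call, Terraform's built-in file() and yamldecode() functions
  # turn that file into the map the module expects:
  flows_configuration = yamldecode(file("${path.module}/flows.yaml"))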

Features

Programmatic deployment of ELT flows to BigQuery with minimal configuration

The module is highly opinionated to reduce the design load on its users. Within a few minutes or hours, you should be able to build data flows from your sources to BigQuery.

Deploying through Terraform rather than the Airbyte UI will allow you to benefit from all the advantages of config-based deployments.

  • Environment upgrades are easier and less error-prone.
  • Deployments can be automated in a CI/CD pipeline for better scalability, consistency, and efficiency.
  • All the configuration is centralized and versioned in git for reviews and tests.

Seamless integration with Secret Manager to secure sensitive configurations

Most sources need to be set up with sensitive information such as API keys, database passwords, and other secrets. To avoid having these in clear text in your repository, this module integrates with Secret Manager to fetch sensitive data at deployment time.

Custom sources support

Airbyte has a large catalog of sources, but if yours is not officially supported, you can create your own connector and this module will be able to use it.

YAML-compatible configuration

Even though this module is likely to be used mostly by data engineers who are proficient with Terraform, it can be useful to decouple the ELT configuration details from the Terraform code through a YAML file.

  • Users who don't know Terraform can update the config files themselves more easily.
  • It becomes possible to have a front end or a form that generates these YAML files, which are then automatically deployed by Terraform.
  • It separates concerns and avoids very long Terraform files if you have a lot of flows.

Out-of-the-box data staging in GCS

Under the hood, the data going from your sources through Airbyte to BigQuery is always staged in a GCS bucket as Avro files. This is important for disaster recovery, reprocessing, backfills, archival, compliance, etc.

Input validation for source configurations

A lot of attention was given to providing useful error messages when a source is misconfigured. If you're stuck, refer to the Airbyte connector catalog or to the full connector specs to check what your source requires.

Limitations

As this module depends on an available Airbyte deployment at plan time, it cannot live in the same Terraform state as the Airbyte infrastructure deployment itself. You will first need to deploy the Airbyte VM/cluster, and then deploy the ELT flows separately.

This module is very difficult to use from Terraform Cloud. You would either need to expose the Airbyte instance to the public internet, or find a way to create an SSH tunnel to it from the Terraform Cloud runner. If you find a neat way to work around this issue, hit me up at [email protected].

Reference: flows_configuration

When calling the module, you will need to specify a flows_configuration. This section documents that structure.

module "airbyte_flows" {
  source  = "artefactory/airbyte-flows/google"
  version = "~> 0"

  project_id                    = local.project_id
  airbyte_service_account_email = local.airbyte_service_account

  flows_configuration = {}  # <-- This right here
}

Full specification

map(object({
  flow_name   = string  # Display name for your data flow
  source_name = string  # Name of the source. Either one from https://docs.airbyte.com/category/sources or a custom one.

  custom_source = optional(object({  # Default: null. If source_name is not in the Airbyte sources catalog, you need to specify where to find it
    docker_repository = string       # Docker Repository URL (e.g. 112233445566.dkr.ecr.us-east-1.amazonaws.com/source-custom) or DockerHub identifier (e.g. airbyte/source-postgres)
    docker_image_tag  = string       # Docker image tag
    documentation_url = string       # Custom source documentation URL
  }))

  cron_schedule = optional(string, "manual")  # Default: manual. Cron expression for when syncs should run (ex. "0 0 12 * * ?" => Will sync at 12:00 PM every day)
  cron_timezone = optional(string, "UTC")     # Default: UTC. One of the TZ identifiers at https://en.wikipedia.org/wiki/List_of_tz_database_time_zones

  normalize = optional(bool, true)  # Default: true. Whether Airbyte should normalize the data after ingestion. https://docs.airbyte.com/understanding-airbyte/basic-normalization/

  tables_to_sync = map(object({                               # All streams to extract from the source and load to BigQuery
    sync_mode             = optional(string, "full_refresh")  # Allowed: full_refresh | incremental. Default: full_refresh
    destination_sync_mode = optional(string, "append")        # Allowed: append | overwrite | append_dedup. Default: append
    cursor_field          = optional(string)                  # Path to the field that will be used to determine if a record is new or modified since the last sync. This field is REQUIRED if sync_mode is incremental. Otherwise it is ignored.
    primary_key           = optional(string)                  # List of the fields that will be used as primary key (multiple fields can be listed for a composite PK). This field is REQUIRED if destination_sync_mode is *_dedup. Otherwise it is ignored.
  }))

  source_specification = map(string)  # Source-specific configurations. Refer to the connectors catalog for more info. For any string like "secret:<secret_name>", the module will fetch the value of `secret_name` in the Secret Manager.

  destination_specification = object({
    dataset_name        = string        # Existing dataset to which your data will be written
    dataset_location    = string        # Allowed: EU | US | Any valid BQ region as specified here https://cloud.google.com/bigquery/docs/locations
    staging_bucket_name = string        # Existing bucket in which your data will be written as avro files at each connection run.
  })
}))
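
To make the specification concrete, here is a sketch of a single flow using incremental, deduplicated syncs. The Postgres connection fields in source_specification are illustrative; check the connectors catalog for the exact keys your source expects.

{
  postgres_orders = {
    flow_name   = "Shop orders from Postgres"
    source_name = "Postgres"

    cron_schedule = "0 0 2 * * ?"   # sync every day at 2:00 AM
    cron_timezone = "UTC"

    normalize = true

    tables_to_sync = {
      orders = {
        sync_mode             = "incremental"
        destination_sync_mode = "append_dedup"
        cursor_field          = "updated_at"  # required because sync_mode is incremental
        primary_key           = "order_id"    # required because destination_sync_mode is append_dedup
      }
    }

    source_specification = {
      host     = "10.0.0.5"      # illustrative, source-specific fields
      port     = "5432"
      database = "shop"
      username = "airbyte"
      password = "secret:postgres-airbyte-password"  # fetched from Secret Manager
    }

    destination_specification = {
      dataset_name        = "raw_shop"
      dataset_location    = "EU"
      staging_bucket_name = "my-airbyte-staging"
    }
  }
}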

Auto-generated module documentation

Requirements

  • airbyte: ~>0.1
  • google: ~>4.75
  • http: ~>3.4

Providers

  • http: ~>3.4

Modules

  • airbyte_bigquery_flow (source: ./airbyte_bigquery_flow, version: n/a)

Resources

  • http_http.connectors_catalog (data source)

Inputs

  • airbyte_service_account_email (string, required): Email address of the service account used by the Airbyte VM.

  • flows_configuration (required): Definition of all the flows to BigQuery that will be Terraformed to your Airbyte instance. Type:

    map(object({
      flow_name   = string
      source_name = string

      custom_source = optional(object({
        docker_repository = string
        docker_image_tag  = string
        documentation_url = string
      }))

      cron_schedule = optional(string, "manual")
      cron_timezone = optional(string, "UTC")

      normalize = optional(bool, true)

      tables_to_sync = map(object({
        sync_mode             = optional(string, "full_refresh")
        destination_sync_mode = optional(string, "append")
        cursor_field          = optional(string)
        primary_key           = optional(string)
      }))

      source_specification = map(string)

      destination_specification = object({
        dataset_name        = string
        dataset_location    = string
        staging_bucket_name = string
      })
    }))

  • project_id (string, required): GCP project id in which the existing Airbyte instance resides.

Outputs

No outputs.
