The modules in this repo have been tested with Terraform and OpenTofu, and will help you deploy a production-grade Flyte instance on Microsoft Azure including:
- AKS cluster
- Azure Database for Postgres - Flexible Server
- Azure blob storage container
- Support for Workload Identity Federation with Entra ID
- NGINX Ingress controller
- cert-manager for TLS
- Azure Container Registry
- (Optional) AKS GPU-powered node pool
- Terraform (version 1.3.7)
- Azure CLI
- Helm
- Your Microsoft account should have access to an Azure subscription with at least the Contributor role.
- Log into Azure via
az login
- Once logged in, create a new:
- Storage account with default settings
- Storage container for the Terraform state
- Put these values, including the Resource Group where the above resources were created, into backend.tfvars (the file passed to terraform init through -backend-config)
- Go to locals.tf and update the values to match your desired configuration.
- To provide the input variables that the module needs automatically, uncomment and edit the values in terraform.tfvars.
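As a rough illustration of what the backend file can contain, the sketch below renders the four settings of Terraform's azurerm backend (resource_group_name, storage_account_name, container_name, key) from Python. The function name render_backend_config, the placeholder resource names, and the state key terraform.tfstate are assumptions for illustration only; substitute the names of the resources you created.

```python
def render_backend_config(resource_group: str, storage_account: str,
                          container: str, key: str = "terraform.tfstate") -> str:
    """Render azurerm backend settings as aligned HCL key/value lines."""
    settings = {
        "resource_group_name": resource_group,
        "storage_account_name": storage_account,
        "container_name": container,
        "key": key,  # name of the state blob inside the container (assumed default)
    }
    width = max(len(name) for name in settings)
    lines = [f'{name.ljust(width)} = "{value}"' for name, value in settings.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values, not taken from the repo.
    print(render_backend_config("flyte-tfstate-rg", "flytetfstate01", "tfstate"), end="")
```

Redirecting the printed text into the backend file gives terraform init the values it needs to locate the remote state.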
- From the environments/azure/flyte-core folder, initialize the Terraform backend:
cd environments/azure/flyte-core && terraform init -backend=true -backend-config=backend.tfvars
- Generate a Terraform plan:
terraform plan -out=flyte.plan
- Apply the plan:
terraform apply flyte.plan
Example output:
Apply complete! Resources: 9 added, 0 changed, 0 destroyed.
Outputs:
cluster_endpoint = "flytedeploy01.eastus.cloudapp.azure.com"
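If you would rather read the endpoint from a script than copy it from the terminal, terraform output -json prints each output as a JSON object with a value field. The following is a minimal sketch under that assumption; the helper name read_output is hypothetical, and the sample JSON only mirrors the example output above.

```python
import json


def read_output(outputs_json: str, name: str) -> str:
    """Return the value of one output from `terraform output -json` text."""
    outputs = json.loads(outputs_json)
    if name not in outputs:
        raise KeyError(f"Terraform output {name!r} not found")
    return outputs[name]["value"]


if __name__ == "__main__":
    # In practice this text would come from running: terraform output -json
    sample = """
    {
      "cluster_endpoint": {
        "sensitive": false,
        "type": "string",
        "value": "flytedeploy01.eastus.cloudapp.azure.com"
      }
    }
    """
    print(read_output(sample, "cluster_endpoint"))
```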
- Verify Flyte's backend status
kubectl get pods -n flyte
NAME READY STATUS RESTARTS AGE
datacatalog-6864645db6-99msb 1/1 Running 0 6m45s
flyte-pod-webhook-848d7db899-8wltj 1/1 Running 0 6m45s
flyteadmin-6cc67b49b4-cmt7j 1/1 Running 0 6m45s
flyteconsole-68f677797f-p4s98 1/1 Running 0 6m45s
flytepropeller-b88f7bf6d-lqc8s 1/1 Running 0 6m45s
flytescheduler-844db4658c-hfrhv 1/1 Running 0 6m45s
syncresources-767d7fc77b-5mj6n 1/1 Running 0 6m45s
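The same check can be scripted. The sketch below parses the table printed by kubectl get pods and lists any pod that is not Running with all of its containers ready. The helper name not_ready_pods is hypothetical, and the parsing assumes the default column layout shown above (NAME, READY, STATUS first).

```python
from __future__ import annotations


def not_ready_pods(kubectl_table: str) -> list[str]:
    """Return the names of pods that are not Running or not fully ready."""
    problems = []
    rows = kubectl_table.strip().splitlines()[1:]  # skip the header row
    for row in rows:
        name, ready, status = row.split()[:3]
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            problems.append(name)
    return problems


if __name__ == "__main__":
    sample = """\
NAME READY STATUS RESTARTS AGE
datacatalog-6864645db6-99msb 1/1 Running 0 6m45s
flyteadmin-6cc67b49b4-cmt7j 1/1 Running 0 6m45s
"""
    pending = not_ready_pods(sample)
    print("All pods ready" if not pending else f"Not ready: {pending}")
```

An empty list means every pod in the namespace reported Running with all containers ready at the time of the query.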
- Update your $HOME/.flyte/config.yaml and set endpoint to the value of the cluster_endpoint output:
NOTE: installing flytectl will typically create an initial config.yaml file. Learn more.
Example:
...
admin:
  endpoint: dns:///flytedeploy01.eastus.cloudapp.azure.com
  insecure: false # The connection uses SSL, even if the certificate is a temporary cert-manager one.
...
  # Uncomment only if you want to test CLI commands and the certificate has not been generated yet.
  # You can confirm the certificate either in the UI (a valid certificate should be used) or
  # from your terminal: kubectl get challenges.acme.cert-manager.io -n flyte (there should not be any pending challenge).
  # With this flag enabled, SSL is still used but the client doesn't verify the certificate chain.
  # insecureSkipVerify: true
NOTE: this configuration step is only needed for CLI access (flytectl or pyflyte), not for the UI.
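To make the shape of the admin block concrete, here is a small sketch that renders it from the cluster_endpoint value. The function name render_admin_config is hypothetical, and placing insecureSkipVerify under admin is an assumption based on the example above; the rest of your config.yaml is left out.

```python
def render_admin_config(cluster_endpoint: str, skip_verify: bool = False) -> str:
    """Render the admin block of a Flyte client config as YAML text."""
    lines = [
        "admin:",
        f"  endpoint: dns:///{cluster_endpoint}",
        "  insecure: false",  # keep SSL enabled
    ]
    if skip_verify:
        # Only for testing before the certificate has been issued (assumed nesting).
        lines.append("  insecureSkipVerify: true")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_admin_config("flytedeploy01.eastus.cloudapp.azure.com"), end="")
```

Generating the block this way avoids hand-editing mistakes such as a stray quote after the hostname.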
- Save the following "hello world" workflow definition:
cat << 'EOF' > hello_world.py
from flytekit import task, workflow

@task
def say_hello() -> str:
    return "hello world"

@workflow
def my_wf() -> str:
    res = say_hello()
    return res

if __name__ == "__main__":
    print(f"Running my_wf() {my_wf()}")
EOF
- Execute the workflow on the Flyte cluster:
pyflyte run --remote hello_world.py my_wf
Example output:
Running Execution on Remote.
[✔] Go to https://flytedeploy01.eastus.cloudapp.azure.com/console/projects/flytesnacks/domains/development/executions/fae18cf6750bd4d64bc7 to see execution in the console.
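The console link printed by pyflyte encodes the project, domain, and execution ID. If you want to reuse them in scripts, they can be pulled out of the URL as sketched below. The helper name parse_execution_url is hypothetical, and the path layout is assumed from the example output above rather than from a documented contract.

```python
from urllib.parse import urlparse


def parse_execution_url(url: str) -> dict:
    """Split a Flyte console execution URL into project, domain and execution ID."""
    parts = urlparse(url).path.strip("/").split("/")
    # Assumed layout: console/projects/<project>/domains/<domain>/executions/<id>
    layout_ok = (
        len(parts) == 7
        and parts[0] == "console"
        and parts[1] == "projects"
        and parts[3] == "domains"
        and parts[5] == "executions"
    )
    if not layout_ok:
        raise ValueError(f"Unexpected console URL layout: {url}")
    return {"project": parts[2], "domain": parts[4], "execution": parts[6]}


if __name__ == "__main__":
    link = (
        "https://flytedeploy01.eastus.cloudapp.azure.com/console/projects/"
        "flytesnacks/domains/development/executions/fae18cf6750bd4d64bc7"
    )
    print(parse_execution_url(link))
```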
- Go to the console and verify the successful execution:
Congratulations!
You have a fully working Flyte environment on Azure.
From this point on, you can continue your learning journey by going through the Flyte Fundamentals tutorials.
To be able to request GPUs on Azure directly from your Flyte tasks, you have multiple options. This section covers how to use some of them with the Terraform/OpenTofu modules.
The examples in this section use ImageSpec, a Flyte feature that builds a custom container image without a Dockerfile. Install the envd builder used by these examples with pip install flytekitplugins-envd.
- Go to aks.tf and set the locals.gpu_node_pool_count value to the number of GPU-enabled nodes you need in your AKS node pool.
- Run terraform plan followed by terraform apply.
- Save the following test workflow as hello_gpu.py:
from flytekit import ImageSpec, Resources, task

image = ImageSpec(
    base_image="ghcr.io/flyteorg/flytekit:py3.10-1.10.2",
    name="pytorch",
    python_version="3.10",
    packages=["torch"],
    builder="envd",
    registry="<YOUR_CONTAINER_REGISTRY>",
)

@task(container_image=image, requests=Resources(gpu="1"))
def gpu_available() -> bool:
    # Imported inside the task so torch is only required in the container image.
    import torch

    return torch.cuda.is_available()
- Execute it remotely on your Flyte cluster:
pyflyte run --remote hello_gpu.py gpu_available
It should return True.
- Go to values-aks.yaml and uncomment the key gpu-device-node-label under configmap.k8s.plugins.k8s. This is an arbitrary label that is applied as a nodeAffinity to Pods spawned from Tasks that request a specific accelerator.
- Go to aks.tf and confirm or change the locals.accelerator value to the GPU device model that you plan to use. Make sure to check the supported options.
- Run terraform plan followed by terraform apply.
- Save and execute the following test workflow, changing the GPU device model to match your environment (example with a V100):
from flytekit import ImageSpec, Resources, task
from flytekit.extras.accelerators import V100

image = ImageSpec(
    base_image="ghcr.io/flyteorg/flytekit:py3.10-1.10.2",
    name="pytorch",
    python_version="3.10",
    packages=["torch"],
    builder="envd",
    registry="<YOUR_CONTAINER_REGISTRY>",
)

@task(
    container_image=image,  # run the task in the image that has torch installed
    requests=Resources(gpu="1"),
    accelerator=V100,
)
def gpu_available() -> bool:
    # Imported inside the task so torch is only required in the container image.
    import torch

    return torch.cuda.is_available()
Learn more about accelerators in flytekit
- Go to aks.tf and adjust the value of the locals.partition_size key to your desired GPU partition size.
Learn more about the supported partition profiles for NVIDIA A100 devices
- Run terraform plan followed by terraform apply.
- Save and execute the following test workflow, changing the GPU partition size to match your needs:
from flytekit import ImageSpec, Resources, task
from flytekit.extras.accelerators import A100

image = ImageSpec(
    base_image="ghcr.io/flyteorg/flytekit:py3.10-1.10.2",
    name="pytorch",
    python_version="3.10",
    packages=["torch"],
    builder="envd",
    registry="<YOUR_CONTAINER_REGISTRY>",
)

@task(
    container_image=image,  # run the task in the image that has torch installed
    requests=Resources(gpu="1"),
    accelerator=A100.partition_2g_10gb,
)
def gpu_available() -> bool:
    # Imported inside the task so torch is only required in the container image.
    import torch

    return torch.cuda.is_available()
Learn more about GPU configuration in the Flyte docs.
- Once you're done testing/using Flyte, just invoke the following command:
terraform destroy