[improvement] add cloud-init bootstrap via cluster object store #586

Merged: cbang-akamai merged 5 commits into main from hack.metadata on Dec 12, 2024

Contributor

@cbang-akamai cbang-akamai commented Dec 3, 2024

What this PR does / why we need it:

This change introduces an optional Cluster Object Store in the LinodeCluster resource definition. The Cluster Object Store definition references an object storage bucket used for internal cluster operations, e.g. bootstrapping.

Currently, for all Cluster API bootstrap providers (e.g. Cluster API Provider RKE2), CAPL uses cloud-init as the default configuration engine to bootstrap workload cluster machines. Cloud-init bootstrap configurations (cloud-configs) are deployed to the cluster's Linodes via the Metadata service as user data. Unfortunately, the Metadata service currently limits the amount of data that can be transmitted to a Linode. The restrictions are:

  • Base64-encoded cloud-config data.
  • Cannot be modified after provisioning. To update, use either the Clone a Linode or Rebuild a Linode operations.
  • Must not be included when cloning to an existing Linode.
  • Unencoded data must not exceed 65535 bytes, or about 16kb encoded.

Source: https://techdocs.akamai.com/linode-api/reference/post-linode-instance

This limitation restricts the adoption of CAPL: users may need to include custom configuration in the bootstrap configuration of their Linodes, which can produce a cloud-config that exceeds the 16kB Metadata service limit. To bypass this limitation, this submission introduces the following changes to the Linode creation workflow:

  1. During LinodeMachine creation, check the Cluster API bootstrap Secret to determine if the generated bootstrap cloud-config fits within the Metadata or Stackscript service limits.
  2. If it does, continue the Linode creation as normal.
  3. Otherwise, if the cloud-config is too large, upload the original cloud-config to the cluster's Object Store and generate a signed URL for the uploaded object.
  4. Generate a "pointer" cloud-config (using cloud-init's Include file user data format) to deliver the cloud-config by reference to the Linode.
  5. Pass the pointer cloud-config to the cloud-init bootstrap service instead. Cloud-init will read the pointer cloud-config after instance initialization and download the original cloud-config, thereby bypassing any service limits.
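The workflow above can be sketched in a few lines of Go. This is a minimal illustration, not the code from this PR: `metadataLimit`, `uploader`, `userDataFor`, and the example URL are hypothetical names and values chosen for the sketch. The only external fact it relies on is cloud-init's include file format, which is a `#include` header followed by one URL per line.

```go
package main

import (
	"fmt"
	"strings"
)

// metadataLimit is an assumed size limit in bytes for this sketch; the real
// limit depends on the delivery service (Metadata or StackScript).
const metadataLimit = 65535

// pointerCloudConfig builds a cloud-init include file: a "#include" header
// followed by one URL per line. cloud-init downloads each URL and processes
// the content as user data.
func pointerCloudConfig(urls ...string) []byte {
	var b strings.Builder
	b.WriteString("#include\n")
	for _, u := range urls {
		b.WriteString(u)
		b.WriteString("\n")
	}
	return []byte(b.String())
}

// uploader is a hypothetical stand-in for the cluster Object Store client.
// It stores the data and returns a signed URL for the uploaded object.
type uploader func(data []byte) (signedURL string, err error)

// userDataFor returns the bootstrap data unchanged when it fits within the
// limit. Otherwise it uploads the data and returns a pointer cloud-config.
func userDataFor(bootstrap []byte, limit int, upload uploader) ([]byte, error) {
	if len(bootstrap) <= limit {
		return bootstrap, nil
	}
	url, err := upload(bootstrap)
	if err != nil {
		return nil, fmt.Errorf("upload bootstrap data: %w", err)
	}
	return pointerCloudConfig(url), nil
}

func main() {
	// A fake uploader that returns a placeholder URL instead of a real one.
	fake := func(data []byte) (string, error) {
		return "https://objects.example.invalid/bootstrap?signature=placeholder", nil
	}
	small, _ := userDataFor([]byte("#cloud-config\n"), metadataLimit, fake)
	large, _ := userDataFor(make([]byte, metadataLimit+1), metadataLimit, fake)
	fmt.Printf("small payload passed through: %d bytes\n", len(small))
	fmt.Printf("large payload replaced by pointer:\n%s", large)
}
```

Passing the uploader in as a function keeps the size decision testable without a real bucket.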

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Verification:

  1. For your selected flavor, modify the bootstrap configuration to generate a large (>16kb) cloud-config:
$ tr -dc A-Za-z0-9 < /dev/urandom | head -c 32kB >> chonk.txt

$ kubectl create secret generic chonk-secret --from-file=chonk.txt

# Example: Kubeadm
$ cat 0001-DEBUG-kubeadm-add-dummy-thicc-file.patch
From ea5a83b70eab030fb9df2a708cc5a974ae97e967 Mon Sep 17 00:00:00 2001
From: Example <[email protected]>
Date: Tue, 3 Dec 2024 09:28:37 -0500
Subject: [PATCH] DEBUG: kubeadm: add dummy thicc file

---
 .../flavors/kubeadm/default/kubeadmConfigTemplate.yaml      | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/templates/flavors/kubeadm/default/kubeadmConfigTemplate.yaml b/templates/flavors/kubeadm/default/kubeadmConfigTemplate.yaml
index 43e19db..3834eb3 100644
--- a/templates/flavors/kubeadm/default/kubeadmConfigTemplate.yaml
+++ b/templates/flavors/kubeadm/default/kubeadmConfigTemplate.yaml
@@ -6,6 +6,12 @@ metadata:
 spec:
   template:
     spec:
+      files:
+        - path: /chonk.txt
+          contentFrom:
+            secret:
+              key: chonk.txt
+              name: chonk-secret
       preKubeadmCommands:
         - curl -fsSL https://github.com/linode/cluster-api-provider-linode/raw/dd76b1f979696ef22ce093d420cdbd0051a1d725/scripts/pre-kubeadminit.sh | bash -s ${KUBERNETES_VERSION}
         - hostnamectl set-hostname '{{ ds.meta_data.label }}' && hostname -F /etc/hostname
--
2.25.1

$ git am --keep-cr 0001-DEBUG-kubeadm-add-dummy-thicc-file.patch
Applying: DEBUG: kubeadm: add dummy thicc file
  2. Create a cluster:
$ make local-release

$ clusterctl generate cluster $CLUSTER_NAME --flavor kubeadm-full --kubernetes-version v1.29.1 --infrastructure local-linode:v0.0.0 | kubectl apply -f -
  3. Check that all Machine(s) and LinodeMachine(s) provision successfully:
$ kubectl get linodemachine
NAME                                  CLUSTER           STATE     READY   PROVIDERID          MACHINE
${CLUSTER_NAME}-control-plane-j6twt   ${CLUSTER_NAME}   running   true    linode://00000000   ${CLUSTER_NAME}-control-plane-j6twt
${CLUSTER_NAME}-md-0-4ttnz-9js7n      ${CLUSTER_NAME}   running   true    linode://00000000   ${CLUSTER_NAME}-md-0-4ttnz-9js7n

$ kubectl get machine
NAME                                  CLUSTER           NODENAME                              PROVIDERID          PHASE     AGE     VERSION
${CLUSTER_NAME}-control-plane-j6twt   ${CLUSTER_NAME}   ${CLUSTER_NAME}-control-plane-j6twt   linode://00000000   Running   12m     v1.29.1
${CLUSTER_NAME}-md-0-4ttnz-9js7n      ${CLUSTER_NAME}   ${CLUSTER_NAME}-md-0-4ttnz-9js7n      linode://00000000   Running   3m35s   v1.29.1
  4. Check the Linode configuration:
$ export IP="$(kubectl get machine ${CLUSTER_NAME}-md-0-4ttnz-9js7n -oyaml | yq '.status.addresses[] | select (.type == "ExternalIP") | .address' | head -n 1)"

$ ssh root@$IP 'cloud-init query userdata && ls -lah /chonk.txt'
#include
🏃‍♀️💨🔫👮🚔

-rw-r--r-- 1 root root 32K Dec  3 17:29 /chonk.txt

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

@cbang-akamai cbang-akamai changed the title from "[improvement] metadata limit bypass" to "[improvement] add cloud-init bootstrap via cluster object store" on Dec 4, 2024
@cbang-akamai cbang-akamai force-pushed the hack.metadata branch 3 times, most recently from 8473efd to 6201a39 Compare December 4, 2024 16:53
@AshleyDumaine AshleyDumaine self-requested a review December 4, 2024 20:09
cloud/scope/common.go
@cbang-akamai cbang-akamai force-pushed the hack.metadata branch 4 times, most recently from 0e849d4 to 753c02c Compare December 6, 2024 21:15
Comment on lines 22 to 40
if mscope == nil {
	return "", errors.New("machine scope can't be nil")
}

if mscope.S3Client == nil {
	return "", errors.New("nil S3 client in machine scope")
}

if mscope.LinodeCluster.Spec.ObjectStore == nil {
	return "", errors.New("nil cluster object store")
}

if len(data) == 0 {
	return "", errors.New("got empty data")
}

bucket, err := mscope.GetBucketName(ctx)
if err != nil || bucket == "" {
	return "", errors.New("no bucket name")
Contributor

@AshleyDumaine AshleyDumaine Dec 9, 2024

Could we define these errors in a var block up top similar to internal/controller/linodemachine_controller_helpers.go? Might be useful for tests to have these defined. Also can then be reused in DeleteObject below.

Contributor

Might also make sense to create a helper function to check all the required variables are set and that we can get the bucket since there seems to be some overlap in the DeleteObject function

Contributor Author

@cbang-akamai cbang-akamai Dec 9, 2024

> define these errors in a var block

QQ: Why do we want(?) this level of insight into the implementation details in our unit tests?

> create a helper function to check all the required variables are set

Sounds good! 👮

Contributor

It's better to see that the test failed for the right reason, rather than only that it returned some error.

Contributor Author

@cbang-akamai cbang-akamai Dec 10, 2024

I didn't want to litter the package with one-off error globals here, so I just matched the error string in the tests 👼

// In the case that the IAM policy does not have sufficient
// permissions to get the object, we will attempt to delete it
// anyway for backwards compatibility reasons.
case errors.As(err, &ae) && ae.ErrorCode() == "Forbidden":
Contributor

I'm surprised this is a string instead of an int. Is there at least some kind of const for this from smithy?

Contributor Author

@cbang-akamai cbang-akamai Dec 9, 2024

So the github.com/aws/smithy-go.APIError type here is just a super generic error container. The actual error formats and values are specific to each service, e.g. Amazon S3.

// Clean up after instance creation.
if linodeInstance.Status == linodego.InstanceRunning && machineScope.Machine.Status.Phase == "Running" {
	if err := deleteBootstrapData(ctx, machineScope); err != nil {
		logger.Error(err, "Fail to bootstrap data")
Contributor

I think we might want the deleteBootstrapData function itself to log the deletion error. We might also not want to log it as an error if we're not doing anything with it, like requeuing to retry the deletion (though we might want to do that if we get an error such as rate limiting?)

Contributor Author

@cbang-akamai cbang-akamai Dec 9, 2024

I think we want to log it in the caller instead of in the deleteBootstrapData function, so the caller can determine what to do if deleteBootstrapData errors out.

Like, for now, I'm just logging it as an error in the caller(s), but not failing the reconciliation itself since I don't think object storage issues here should block reconciliation.

if err != nil {
	logger.Info(fmt.Sprintf("Failed to compress bootstrap data: %v", err))
	logger.Info("Falling back to Object Storage")
	goto obj
Contributor

Can we instead have a helper function called something like storeMetadataInObj?

Contributor Author

@cbang-akamai cbang-akamai Dec 9, 2024

Yeah, I can pull out the logic after this entire switch into its own helper function to make it easier to read.

However, are you asking to remove the goto statement here? I think we still should keep this since the aim of this switch (including the goto statements) is to bail out early when we know the delivery "method" to use for the bootstrap data.

Contributor

Yeah I meant doing something like this:

	case machineScope.LinodeMachine.Status.CloudinitMetadataSupport && gzipCompressionEnabled:
		compressed, err = compressUserData(bootstrapdata)
		// Fallback to Object Storage workaround on compression failure.
		if err != nil {
			logger.Info(fmt.Sprintf("Failed to compress bootstrap data: %v", err))
			logger.Info("Falling back to Object Storage")
			return storeMetadataInObj(ctx, machineScope, logger, bootstrapdata)
		}
		size = len(compressed)
		if size < limit {
			return compressed, nil
		}
		return storeMetadataInObj(ctx, machineScope, logger, bootstrapdata)
	// Worst case: Object Storage workaround.
	default:
		logger.Info("decoded bootstrap data exceeds size limit", "limit", limit, "size", size)
		return storeMetadataInObj(ctx, machineScope, logger, bootstrapdata)
	}

Contributor Author

@cbang-akamai cbang-akamai Dec 10, 2024

I reworked the logic inside the switch statement to:

  • Reduce repetition between the different paths while preserving the overall "layered" visual structure
  • Make the (now single) goto statement function more like a break out of the nested switch logic, after which the rest of the function continues executing

hit

@cbang-akamai cbang-akamai force-pushed the hack.metadata branch 3 times, most recently from 28936db to bf02585 Compare December 11, 2024 19:53
AshleyDumaine previously approved these changes Dec 12, 2024
Contributor

@AshleyDumaine AshleyDumaine left a comment

LGTM

@cbang-akamai cbang-akamai added the documentation Improvements or additions to documentation label Dec 12, 2024
@cbang-akamai cbang-akamai merged commit 3792726 into main Dec 12, 2024
17 checks passed
@cbang-akamai cbang-akamai deleted the hack.metadata branch December 12, 2024 18:48