
Prometheus integration in Watcher controller #40

Merged

Conversation

@amoralej (Contributor) commented Jan 16, 2025

Watcher will use Prometheus as its metrics store, so we need to integrate the Prometheus configuration into the Watcher controller.

To configure the Prometheus datasource in Watcher we use a secret that provides the required connection data. This secret will eventually be provided by the telemetry operator, although it can be any secret that carries the expected fields: host, port, ca_secret and ca_key.
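
For illustration, a minimal sketch of such a secret; the name matches the one used later in the kuttl tests in this PR, while the host, port and CA values are placeholder examples:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: metric-storage-prometheus-config
stringData:
  host: metric-storage-prometheus.openstack.svc  # illustrative hostname
  port: "9090"                                    # illustrative port
  ca_secret: combined-ca-bundle                   # example: secret holding the CA bundle
  ca_key: internal-ca-bundle.pem                  # example: key inside that secret
```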


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/a4c4be5e3fde466491956dbce76f2b96

✔️ noop SUCCESS in 0s
✔️ openstack-meta-content-provider SUCCESS in 1h 45m 20s
watcher-operator-validation FAILURE in 1h 12m 21s
watcher-operator-kuttl RETRY_LIMIT in 29m 12s

@amoralej (Contributor, Author) commented:

recheck

@amoralej force-pushed the addprometheus branch 2 times, most recently from f5f6817 to c51c6e9 on January 17, 2025 at 10:41

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/5f8c8ce0324f4f54b447479fb19e1ddc

✔️ noop SUCCESS in 0s
✔️ openstack-meta-content-provider SUCCESS in 1h 30m 58s
✔️ watcher-operator-validation SUCCESS in 1h 19m 09s
watcher-operator-kuttl RETRY_LIMIT in 51m 03s


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/bc81efe937e44deda7ff67e1a796a300

✔️ noop SUCCESS in 0s
✔️ openstack-meta-content-provider SUCCESS in 1h 41m 09s
✔️ watcher-operator-validation SUCCESS in 1h 29m 05s
watcher-operator-kuttl RETRY_LIMIT in 1h 07m 18s

@amoralej changed the title [WIP] Add prometheus configuration → [WIP] Prometheus integration in Watcher controller on Jan 17, 2025
@amoralej changed the title [WIP] Prometheus integration in Watcher controller → Prometheus integration in Watcher controller on Jan 20, 2025

This change depends on a change that failed to merge.

Change openstack-k8s-operators/ci-framework#2658 is needed.

@amoralej (Contributor, Author) commented:

recheck

@raukadah (Contributor) left a comment:

/approve

Overall, the code looks good.

api/v1beta1/common_types.go (outdated, resolved)
ci/playbooks/create-coo-subscription.yaml (outdated, resolved)
ci/playbooks/deploy_watcher_service.yaml (outdated, resolved)
}

if instance.Spec.PrometheusTLSCaCertSecret != nil {
	PrometheusCaCert = watcher.CustomPrometheusCaCertFolderPath + instance.Spec.PrometheusTLSCaCertSecret.Key
Review comment (Contributor):

Can we use filepath.Join to join these two path components? (It would also need the path/filepath import.)

Suggested change:
-	PrometheusCaCert = watcher.CustomPrometheusCaCertFolderPath + instance.Spec.PrometheusTLSCaCertSecret.Key
+	PrometheusCaCert = filepath.Join(watcher.CustomPrometheusCaCertFolderPath, instance.Spec.PrometheusTLSCaCertSecret.Key)

templates/watcher/config/00-default.conf (resolved)

openshift-ci bot commented Jan 20, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: raukadah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=65535
// +kubebuilder:validation:Optional
PrometheusPort int32 `json:"prometheusPort,omitempty"`
Review comment (Collaborator):

This is not really the correct interface.

If we were to pass the service endpoint to Watcher directly, we should either be passed the name of a Prometheus CR, which we could then use to retrieve a transport-URL equivalent as a secret, or be passed the k8s service to consume.

That is not how we should implement this, however.

At the end of the day, the planned way to do the integration is for Watcher to look up the Prometheus endpoint in Keystone, as aodh does.

That is work that still needs to be completed in the 2025.1 cycle upstream as a prerequisite for FR2.

Review comment (Collaborator):

I think the correct way to integrate this in the short term would be for the telemetry operator to render a secret with the port/host info and TLS flag, and for us to have a field that takes the secret by name.

We can default the name of that field so that it does not need to be specified explicitly.

For the CA secret we can have a separate field, as they do in the telemetry operator with PrometheusTLSCaCertSecret:
https://github.com/openstack-k8s-operators/telemetry-operator/blob/acd9c11a60caf6ff96d31bb56e290975eb94d785/api/v1beta1/autoscaling_types.go#L176

This is an interface we should try to get right before we merge anything.

Exposing the hostname in the CRD is something we should avoid doing.
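
For illustration, a rough sketch of the spec shape being proposed here; the field names, types and default secret name are illustrative assumptions, not the merged implementation:

```go
// Sketch only: names, default and types are assumptions based on the
// proposal above, not the code that was eventually merged.
package v1beta1

import corev1 "k8s.io/api/core/v1"

type WatcherSpec struct {
	// PrometheusSecret names the secret, rendered by the telemetry
	// operator, that carries the Prometheus host/port and TLS info.
	// +kubebuilder:validation:Optional
	// +kubebuilder:default=metric-storage-prometheus-config
	PrometheusSecret string `json:"prometheusSecret,omitempty"`

	// PrometheusTLSCaCertSecret selects the secret and key holding the
	// CA certificate for the Prometheus endpoint, mirroring the
	// telemetry operator's autoscaling CRD.
	// +kubebuilder:validation:Optional
	PrometheusTLSCaCertSecret *corev1.SecretKeySelector `json:"prometheusTLSCaCertSecret,omitempty"`
}
```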

@amoralej (Contributor, Author) commented Jan 20, 2025:

Let me explain how this is supposed to work. I've taken this approach from the autoscaling controller in telemetry-operator, but I'm fine with changing it if we prefer. With this approach we cover two ways to configure the Prometheus parameters:

  1. Provide a custom host, port and CA certificate set to whatever values the customer wants, which could even be an external Prometheus not managed by openstack-operator. For this case I am creating the three parameters PrometheusHost, PrometheusPort and PrometheusTLSCaCertSecret. These values are not intended to be used with the integrated Prometheus deployed by the telemetry operator.
  2. When using the Prometheus deployed by telemetry-operator, those three parameters are expected to be left empty. In that case the watcher-operator checks that a MetricStorage object exists and configures the host, port and CA cert based on the naming conventions used by telemetry-operator:
  • PrometheusHost = fmt.Sprintf("%s-prometheus.%s.svc", telemetryv1.DefaultServiceName, instance.Namespace)
  • PrometheusPort = telemetryv1.DefaultPrometheusPort
  • PrometheusCaCert = tls.DownstreamTLSCABundlePath

If we prefer to support only the configuration of the integrated Prometheus, we can simplify this, get rid of the Prometheus* parameters in the Spec entirely, and apply just approach 2.

Review comment (Collaborator):

So, before we merge anything downstream to enable this integration, my hope is that we will have completed the Keystone integration in the data source plugin, so that you do not need to specify any Prometheus host or port in Watcher.

To me, having the hostname in the CR is a security hardening opportunity: it's not a vulnerability, but we should avoid it. I don't think the current telemetry operator CRD is appropriate, and it likely should be changed.

We have no requirement to support an external Prometheus instance, and I don't think that is something we should do, at least not in the near term, without a documented use case and further discussion.

In the absence of the Keystone integration, my preference would be to modify the telemetry operator to store the required data in a pre-defined secret and for us to pass that secret by name to the Watcher CR.

If someone wanted to use an external Prometheus, which again I don't think we should support, they could just create a different secret with the correct fields and pass that in instead.

@amoralej (Contributor, Author) commented:

Makes sense. I will simplify it and make the operator rely on the values used by the telemetry-operator when deploying Prometheus. Thanks for the feedback.

Review comment (Contributor):

Just to be clear, this discussion and the decision not to support the external Prometheus case apply to the operator only, right? Upstream we will have to support both, as we cannot assume Prometheus will be integrated with Keystone via the telemetry operator in the same way.

Review comment (Collaborator):

You will be able to support an external Prometheus with this interface by creating a secret with the appropriate content and passing that in.

In our product we likely won't support that, but it will be possible to do.


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/3f3bf0e6167c4f3496ee9975ee5d65d5

✔️ noop SUCCESS in 0s
✔️ openstack-meta-content-provider SUCCESS in 1h 18m 00s
watcher-operator-validation FAILURE in 1h 06m 10s
watcher-operator-kuttl RETRY_LIMIT in 21m 05s

@amoralej removed the approved label on Jan 23, 2025
@amoralej changed the title Prometheus integration in Watcher controller → WIP Prometheus integration in Watcher controller on Jan 23, 2025

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/4cbcaf3b338441afb309c7306eb79ddc

✔️ noop SUCCESS in 0s
✔️ openstack-meta-content-provider SUCCESS in 57m 12s
watcher-operator-validation FAILURE in 46m 15s
✔️ watcher-operator-kuttl SUCCESS in 34m 49s

@amoralej (Contributor, Author) commented:

recheck


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/5df118c8dd3549f5a5fd6ce3a4393e0d

✔️ noop SUCCESS in 0s
✔️ openstack-meta-content-provider SUCCESS in 1h 43m 42s
✔️ watcher-operator-validation SUCCESS in 1h 12m 35s
watcher-operator-kuttl RETRY_LIMIT in 50m 30s

@amoralej force-pushed the addprometheus branch 2 times, most recently from 6a71c9b to 3ab9441 on January 27, 2025 at 11:27
@amoralej changed the title WIP Prometheus integration in Watcher controller → Prometheus integration in Watcher controller on Jan 27, 2025
@SeanMooney (Collaborator) left a comment:

This is pretty close.

I think there are still a few issues with the status and finalizer, however, which we should probably fix before merging.

Makefile (resolved)
Makefile (resolved)
ci/olm.sh (resolved)
api/v1beta1/common_types.go (resolved)
controllers/watcher_controller.go (resolved)
controllers/watcher_controller.go (resolved)
controllers/watcher_controller.go (resolved)
api/v1beta1/common_types.go (resolved)
apiVersion: v1
kind: Secret
metadata:
  name: metric-storage-prometheus-config
Review comment (Contributor):

Do we want "kuttl" in there somewhere to differentiate it from the actual default, e.g. "metric-storage-prometheus-config-kuttl"?

@amoralej (Contributor, Author) commented:

Good point; that way we will also avoid future conflicts when the telemetry operator creates it.

@amoralej (Contributor, Author) commented:

Thinking about it again, the idea of this test is to validate the defaults. We have another test case with custom values in scenario 04-deploy-with-precreated-account.yaml (we should rename it to 04-deploy-with-non-defaults.yaml in a follow-up, by the way).

To configure the Prometheus datasource in Watcher we use a secret that provides the required connection data. This secret will eventually be provided by the telemetry operator, although it can be any secret that carries the expected fields: host, port, ca_secret and ca_key.

This patch adds the new field to the spec, validates that the secret provides the required fields, and copies it from Watcher to the sub-CRs. It also adds a watch on the secret in the main controller (and adds the watch on the top-level .Spec.secret, which was missing).
This patch includes the Prometheus configuration data from the provided secret in the config files rendered by the watcherapi controller.
This patch adds the Prometheus secret required for the kuttl jobs to run and adds validation checks.

Also, it modifies the watcher_deploy make target to create a secret so that Watcher can be deployed until the secret is created by the telemetry operator.
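
To make the flow concrete, here is a minimal sketch of how a controller could read the connection data out of such a secret; the function name, the struct, and the choice of which fields are mandatory are illustrative assumptions, not the merged code:

```go
package controllers

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// prometheusConfig mirrors the fields expected in the secret:
// host, port, ca_secret and ca_key.
type prometheusConfig struct {
	Host     string
	Port     string
	CaSecret string
	CaKey    string
}

// getPrometheusConfig fetches the named secret and extracts the connection
// data. Treating host/port as required and the CA fields as optional is an
// assumption made for this sketch.
func getPrometheusConfig(ctx context.Context, c client.Client, namespace, name string) (*prometheusConfig, error) {
	secret := &corev1.Secret{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, secret); err != nil {
		return nil, err
	}
	for _, field := range []string{"host", "port"} {
		if _, ok := secret.Data[field]; !ok {
			return nil, fmt.Errorf("prometheus config secret %s is missing required field %q", name, field)
		}
	}
	return &prometheusConfig{
		Host:     string(secret.Data["host"]),
		Port:     string(secret.Data["port"]),
		CaSecret: string(secret.Data["ca_secret"]),
		CaKey:    string(secret.Data["ca_key"]),
	}, nil
}
```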
@marios (Contributor) left a comment:

Had another pass, couldn't spot anything to comment on ;)

@cescgina (Contributor) commented Jan 30, 2025:

LGTM. I think this is good to go once the ongoing discussion about the finalizer is complete.

@SeanMooney (Collaborator) left a comment:

So this looks pretty complete. We should have envtest coverage for adding/removing the finalizer, but I think we can proceed with this based on the kuttl testing.

The kuttl testing could also be extended to test the removal of the finalizer from the secret when the Watcher CR is removed.

I'm glad that you have added the condition setting for the validation error with regard to the Prometheus secrets.

Makefile (resolved)
watcherv1beta1.WatcherPrometheusSecretErrorMessage))
return ctrl.Result{}, err
}
}
Review comment (Collaborator):

So it would be better to do this in the other order. However, if the secret is deleted then the update will fail, so we will get an error and abort; this is therefore a safe ordering.

util.LogForObject(helper, "Removed finalizer from prometheus config secret", instance)
}
}
//
Review comment (Collaborator):

+1

name: metric-storage-prometheus-config
namespace: watcher-kuttl-default
finalizers:
- openstack.org/watcher
Review comment (Collaborator):

OK, this is what I was looking for.

I wanted to confirm we had some testing for the finalizer.

So we have kuttl testing for this; we should also have envtests for this, but this is the minimum required testing.
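
For context, a rough sketch of the add/remove finalizer pattern being tested here, using controller-runtime's controllerutil helpers; the helper name and call site are illustrative assumptions, while the finalizer string matches the kuttl assertion above:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const watcherFinalizer = "openstack.org/watcher"

// ensureSecretFinalizer adds the watcher finalizer to the prometheus config
// secret during normal reconciliation and removes it when the Watcher CR is
// being deleted, persisting the change only when something actually changed.
func ensureSecretFinalizer(ctx context.Context, c client.Client, secret *corev1.Secret, deleting bool) error {
	var changed bool
	if deleting {
		changed = controllerutil.RemoveFinalizer(secret, watcherFinalizer)
	} else {
		changed = controllerutil.AddFinalizer(secret, watcherFinalizer)
	}
	if changed {
		return c.Update(ctx, secret)
	}
	return nil
}
```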

@openshift-ci bot added the lgtm label on Jan 30, 2025
@viroel (Contributor) commented Jan 30, 2025:

nice work here @amoralej - lgtm

@openshift-merge-bot merged commit 6aacd44 into openstack-k8s-operators:main on Jan 30, 2025
6 checks passed