
Add Monitoring Metrics with aioprometheus for ResourcePools #103

Open
wants to merge 7 commits into
base: main

@marcosmamorim marcosmamorim commented Mar 31, 2024

Description:
This PR adds monitoring for ResourcePools using the aioprometheus library. With these changes, we can measure the execution time of specific events and monitor the current state of each ResourcePool, including its available and utilized resources.

Context:
The motivation behind this PR is the need for a clearer, real-time view of the performance and state of the ResourcePools. This view has become essential for optimizing resource allocation and identifying performance bottlenecks more quickly.

Implementation:
The implementation focuses on integrating aioprometheus with our existing system and defining specific metrics for the ResourcePools. The main changes include:

Execution Time Measurement: We use either the measure_execution_time decorator or the aioprometheus.timer() decorator to measure the execution time of the operator's methods. This helps us identify and optimize slow operations.
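As a rough illustration of the timing approach described above, the sketch below shows how a decorator such as measure_execution_time could wrap an async operator handler. The decorator's signature, the OBSERVATIONS store, and the example handler are assumptions made for illustration only. The PR itself records into aioprometheus metrics, and its actual implementation may differ.

```python
import asyncio
import functools
import time
from collections import defaultdict

# Hypothetical in-memory store standing in for an aioprometheus metric:
# maps a sorted tuple of label pairs to the list of observed durations.
OBSERVATIONS = defaultdict(list)


def measure_execution_time(**labels):
    """Record the wall-clock duration of an async function under `labels`.

    The keyword-label signature is an assumption, not the PR's API.
    """
    key = tuple(sorted(labels.items()))

    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return await func(*args, **kwargs)
            finally:
                # Record the duration even if the handler raises,
                # so failing calls are timed as well.
                OBSERVATIONS[key].append(time.perf_counter() - start)

        return wrapper

    return decorator


# Hypothetical handler; the label values mirror those in the sample output.
@measure_execution_time(method="on_create", resource_type="resourceclaims")
async def on_create():
    await asyncio.sleep(0.01)
    return "created"


result = asyncio.run(on_create())
```

Keeping the labels on the decorator rather than inside the handler means each handler declares its method and resource_type once, and the timing logic stays in a single place.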

Specific Metrics Added:
In addition to the basic execution-time metrics, this PR introduces a set of ResourcePool-specific metrics that allow detailed monitoring of pool state and utilization. The added metrics are:

resource_pool_min_available: This Gauge reports the minimum number of environments available in each ResourcePool. It is important for understanding reserve capacity and for checking that ResourcePools are adequately sized for demand.

resource_pool_available: This Gauge reports the current number of available environments in each ResourcePool, giving a real-time view of resource utilization.

resource_pool_used_total: A Counter that accumulates the total number of environments used over time in each ResourcePool. This metric is essential for tracking overall demand and usage patterns.

resource_pool_state: Complementing the metrics above, this Gauge captures the state of each ResourcePool, reporting the number of available and used resources as separate series distinguished by the state label.

Every metric carries the name and namespace labels of its ResourcePool, and resource_pool_state carries the additional state label. These labels make it possible to filter and aggregate per pool, which helps pinpoint the pools that need attention or adjustment.
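To make the label scheme concrete, here is a minimal sketch that renders the gauge samples for a single pool in the Prometheus text exposition format. The helper names format_sample and pool_samples are hypothetical and are not part of the PR or of aioprometheus; in the PR the rendering is handled by aioprometheus itself. The sketch also skips label-value escaping, which a real exporter would need.

```python
def format_sample(metric, labels, value):
    """Render one sample line: metric{label="value",...} value."""
    label_text = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"{metric}{{{label_text}}} {value}"


def pool_samples(name, namespace, min_available, available, used):
    """Return the gauge sample lines for one ResourcePool (illustrative only)."""
    base = {"name": name, "namespace": namespace}
    return [
        format_sample("resource_pool_available", base, available),
        format_sample("resource_pool_min_available", base, min_available),
        # resource_pool_state emits one series per value of the state label.
        format_sample("resource_pool_state", {**base, "state": "available"}, available),
        format_sample("resource_pool_state", {**base, "state": "used"}, used),
    ]


lines = pool_samples(
    "test-01", "poolboy-dev-metrics", min_available=2, available=2, used=0
)
for line in lines:
    print(line)
```

Because resource_pool_state splits the counts by the state label, a query can sum over state to get the pool's total, or select a single state to track only the available or used environments.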

Example:

# HELP resource_pool_available Number of available environments in each resource pool
# TYPE resource_pool_available gauge
resource_pool_available{name="test-01",namespace="poolboy-dev-metrics"} 2
# HELP resource_pool_min_available Minimum number of available environments in each resource pool
# TYPE resource_pool_min_available gauge
resource_pool_min_available{name="test-01",namespace="poolboy-dev-metrics"} 2
# HELP resource_pool_state State of each resource pool, including available and used resources
# TYPE resource_pool_state gauge
resource_pool_state{name="test-01",namespace="poolboy-dev-metrics",state="available"} 2
resource_pool_state{name="test-01",namespace="poolboy-dev-metrics",state="used"} 0
# HELP resource_pool_used_total Total number of environments used in each resource pool
# TYPE resource_pool_used_total counter
# HELP response_time_seconds Response time in seconds
# TYPE response_time_seconds summary
response_time_seconds{method="on_create",quantile="0.5",resource_type="resourceclaims"} 0.003202676773071289
response_time_seconds{method="on_create",quantile="0.9",resource_type="resourceclaims"} 0.003202676773071289
response_time_seconds{method="on_create",quantile="0.99",resource_type="resourceclaims"} 0.003202676773071289
response_time_seconds_count{method="on_create",resource_type="resourceclaims"} 1
response_time_seconds_sum{method="on_create",resource_type="resourceclaims"} 0.003202676773071289
response_time_seconds{method="on_event",quantile="0.5",resource_type="resourceproviders"} 0.0002906322479248047
response_time_seconds{method="on_event",quantile="0.9",resource_type="resourceproviders"} 0.0002906322479248047
response_time_seconds{method="on_event",quantile="0.99",resource_type="resourceproviders"} 0.0002906322479248047
response_time_seconds_count{method="on_event",resource_type="resourceproviders"} 1
response_time_seconds_sum{method="on_event",resource_type="resourceproviders"} 0.0002906322479248047
response_time_seconds{method="manage",quantile="0.5",resource_type="resourcepool"} 0.00021266937255859375
response_time_seconds{method="manage",quantile="0.9",resource_type="resourcepool"} 0.00021266937255859375
response_time_seconds{method="manage",quantile="0.99",resource_type="resourcepool"} 0.00021266937255859375
response_time_seconds_count{method="manage",resource_type="resourcepool"} 1
response_time_seconds_sum{method="manage",resource_type="resourcepool"} 0.00021266937255859375
response_time_seconds{method="on_event",quantile="0.5",resource_type="resourcepools"} 0.00026869773864746094
response_time_seconds{method="on_event",quantile="0.9",resource_type="resourcepools"} 0.00026869773864746094
response_time_seconds{method="on_event",quantile="0.99",resource_type="resourcepools"} 0.00026869773864746094
response_time_seconds_count{method="on_event",resource_type="resourcepools"} 1
response_time_seconds_sum{method="on_event",resource_type="resourcepools"} 0.00026869773864746094
response_time_seconds{method="on_create",quantile="0.5",resource_type="resourcehandles"} 0.020758867263793945
response_time_seconds{method="on_create",quantile="0.9",resource_type="resourcehandles"} 0.020758867263793945
response_time_seconds{method="on_create",quantile="0.99",resource_type="resourcehandles"} 0.020758867263793945
response_time_seconds_count{method="on_create",resource_type="resourcehandles"} 3
response_time_seconds_sum{method="on_create",resource_type="resourcehandles"} 0.0834805965423584
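
For reading the summary output above, the mean latency of a handler is its _sum divided by its _count. The sketch below parses a few of the sample lines and computes that mean per (method, resource_type) pair. The parser is a simplified illustration written for these lines only (it does not handle escaped label values, timestamps, or comment lines), and the function names are hypothetical.

```python
import re

# Matches lines of the form: metric{label="value",...} value
SAMPLE = re.compile(r'^(?P<metric>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')


def parse_sample(line):
    """Split one exposition line into (metric, labels dict, float value)."""
    match = SAMPLE.match(line.strip())
    if match is None:
        raise ValueError(f"not a labeled sample line: {line!r}")
    labels = dict(re.findall(r'(\w+)="([^"]*)"', match["labels"]))
    return match["metric"], labels, float(match["value"])


def mean_latencies(lines):
    """Return {(method, resource_type): mean seconds} from _sum and _count lines."""
    sums, counts = {}, {}
    for line in lines:
        metric, labels, value = parse_sample(line)
        key = (labels.get("method"), labels.get("resource_type"))
        if metric == "response_time_seconds_sum":
            sums[key] = value
        elif metric == "response_time_seconds_count":
            counts[key] = value
    # Skip series with a zero or missing count to avoid dividing by zero.
    return {key: sums[key] / counts[key] for key in sums if counts.get(key)}


SAMPLE_LINES = [
    'response_time_seconds{method="on_create",quantile="0.5",resource_type="resourcehandles"} 0.020758867263793945',
    'response_time_seconds_count{method="on_create",resource_type="resourcehandles"} 3',
    'response_time_seconds_sum{method="on_create",resource_type="resourcehandles"} 0.0834805965423584',
    'response_time_seconds_count{method="on_create",resource_type="resourceclaims"} 1',
    'response_time_seconds_sum{method="on_create",resource_type="resourceclaims"} 0.003202676773071289',
]

means = mean_latencies(SAMPLE_LINES)
```

For the resourcehandles on_create series this gives a mean of roughly 0.028 seconds across its three observations, while series with a single observation have a mean equal to their sum.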
