[WIP] Add E2E Integration Test For Adaptive Sampling Processor #5951

Draft
wants to merge 27 commits into
base: main
from

mahadzaryab1
Collaborator

Which problem is this PR solving?

Description of the changes

How was this change tested?

Checklist

codecov bot commented Sep 7, 2024

Codecov Report

Attention: Patch coverage is 72.22222% with 5 lines in your changes missing coverage. Please review.

Project coverage is 96.89%. Comparing base (f411b3c) to head (ca9a8c9).

Files with missing lines             Patch %   Lines
internal/safeexpvar/safeexpvar.go    0.00%     5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5951      +/-   ##
==========================================
- Coverage   96.91%   96.89%   -0.03%     
==========================================
  Files         349      349              
  Lines       16587    16598      +11     
==========================================
+ Hits        16076    16082       +6     
- Misses        328      333       +5     
  Partials      183      183              
Flag                     Coverage Δ
badger_v1                7.99% <0.00%> (-0.01%) ⬇️
badger_v2                1.82% <0.00%> (-0.01%) ⬇️
cassandra-4.x-v1         15.76% <0.00%> (-0.01%) ⬇️
cassandra-4.x-v2         1.74% <0.00%> (-0.01%) ⬇️
cassandra-5.x-v1         15.76% <0.00%> (-0.01%) ⬇️
cassandra-5.x-v2         1.74% <0.00%> (-0.01%) ⬇️
elasticsearch-6.x-v1     18.71% <0.00%> (+<0.01%) ⬆️
elasticsearch-7.x-v1     18.77% <0.00%> (-0.02%) ⬇️
elasticsearch-8.x-v1     18.95% <0.00%> (-0.03%) ⬇️
elasticsearch-8.x-v2     1.82% <0.00%> (-0.01%) ⬇️
grpc_v1                  9.37% <0.00%> (-0.01%) ⬇️
grpc_v2                  7.12% <0.00%> (-0.01%) ⬇️
kafka-v1                 9.70% <0.00%> (-0.01%) ⬇️
kafka-v2                 1.82% <0.00%> (-0.01%) ⬇️
memory_v2                1.82% <0.00%> (-0.01%) ⬇️
opensearch-1.x-v1        18.81% <0.00%> (+<0.01%) ⬆️
opensearch-2.x-v1        18.81% <0.00%> (-0.02%) ⬇️
opensearch-2.x-v2        1.82% <0.00%> (+0.01%) ⬆️
tailsampling-processor   0.46% <0.00%> (-0.01%) ⬇️
unittests                95.68% <72.22%> (-0.03%) ⬇️


# Copyright (c) 2024 The Jaeger Authors.
# SPDX-License-Identifier: Apache-2.0

BINARY ?= all-in-one
Member

all-in-one is a v1-style binary. I would prefer that we test the v2 version (or at least both, but v2 is the higher priority, and testing v1 at this point is wasted work since v1 will be EOLed in a year).

Collaborator Author

mahadzaryab1 Sep 7, 2024

@yurishkuro I was trying to test the jaeger binary, but I can't seem to access port 14268. Do you know why that is? I left more details on the issue: #5717 (comment)

@@ -152,7 +152,7 @@ func (a *aggregator) HandleRootSpan(span *span_model.Span, logger *zap.Logger) {
     }
     samplerType, samplerParam := span.GetSamplerParams(logger)
     if samplerType == span_model.SamplerTypeUnrecognized {
-        return
+        samplerType = span_model.SamplerTypeProbabilistic
Collaborator Author

@yurishkuro what kind of a config do we want to add to perform this override?

Member

something like "do not check sampler tags"

Collaborator Author

mahadzaryab1 Sep 30, 2024

@yurishkuro should this config be exposed as part of the YAML configuration, or do we want it to be internal only?

Member

It should be user-settable.
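A minimal Go sketch of what such a user-settable switch could look like, based only on the hunk above. The option name IgnoreSamplerTags, its struct tag, the string constants, and the helper resolveSamplerType are all hypothetical; nothing here is the PR's actual implementation.

```go
package main

import "fmt"

// Illustrative stand-ins for the sampler type values in span_model.
const (
	samplerUnrecognized  = "unrecognized"
	samplerProbabilistic = "probabilistic"
)

// Options is a hypothetical config struct. The field name and the
// struct tag are assumptions made for this sketch.
type Options struct {
	// IgnoreSamplerTags, when true, counts root spans that carry no
	// recognizable sampler tags as if they were probabilistic.
	IgnoreSamplerTags bool `mapstructure:"ignore_sampler_tags"`
}

// resolveSamplerType returns the sampler type to use and whether the
// span should be counted at all.
func resolveSamplerType(opts Options, fromTags string) (string, bool) {
	if fromTags != samplerUnrecognized {
		return fromTags, true
	}
	if opts.IgnoreSamplerTags {
		// Override: behave like the patched line in the hunk above.
		return samplerProbabilistic, true
	}
	// Default: behave like the original early return.
	return "", false
}

func main() {
	for _, opts := range []Options{{IgnoreSamplerTags: false}, {IgnoreSamplerTags: true}} {
		typ, counted := resolveSamplerType(opts, samplerUnrecognized)
		fmt.Printf("ignore=%v type=%q counted=%v\n", opts.IgnoreSamplerTags, typ, counted)
	}
}
```

Keeping the default at false would preserve the existing early return, so only users who opt in get the override.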

// }
// }
// return false
return true
Collaborator Author

@yurishkuro this causes the unit tests to fail, and I believe it's messing with the calculations as well. Any ideas on how we can get around this? If we don't hardcode this here, however, the probability only gets calculated once.

Member

It's very difficult to troubleshoot like this. I would suggest altering tracegen to manually add the sampler.type=probabilistic / sampler.param=0.5 (any value for now) attributes to the span, to see how the system reacts. To my knowledge, aside from this check, the probability used by the sampler should not affect the calculations, but I may be wrong.
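The suggestion above could be tried with something like the following sketch. The span type and the helper addSamplerTags are hypothetical stand-ins for whatever span API tracegen uses; only the two tag names and the 0.5 value come from the comment.

```go
package main

import "fmt"

// span is a hypothetical stand-in for the generator's span type.
type span struct {
	Operation string
	Tags      map[string]any
}

// addSamplerTags stamps the two sampler tags named in the comment
// above onto a root span.
func addSamplerTags(s *span, probability float64) {
	if s.Tags == nil {
		s.Tags = make(map[string]any)
	}
	s.Tags["sampler.type"] = "probabilistic"
	s.Tags["sampler.param"] = probability
}

func main() {
	root := &span{Operation: "lets-go"}
	addSamplerTags(root, 0.5) // any value for now, as suggested
	fmt.Println(root.Tags["sampler.type"], root.Tags["sampler.param"])
}
```

With the tags present on every root span, the aggregator's sampler check would no longer need to be bypassed for the experiment.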

Member

Another thing that would help is to expose internal state via expvar, so that we can actually monitor how that state changes.
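As a rough illustration of that idea, the standard library's expvar.Func can publish a snapshot of mutex-guarded state. The state type, the variable name sampling_probabilities, and the "service/operation" key format are assumptions for this sketch, not names from the PR.

```go
package main

import (
	"expvar"
	"fmt"
	"sync"
)

// state is a hypothetical piece of internal state guarded by a mutex.
type state struct {
	mu            sync.Mutex
	probabilities map[string]float64 // keyed by "service/operation"
}

func (s *state) set(key string, p float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.probabilities[key] = p
}

// snapshot returns a copy so that expvar never reads the live map.
func (s *state) snapshot() any {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[string]float64, len(s.probabilities))
	for k, v := range s.probabilities {
		out[k] = v
	}
	return out
}

// render returns the JSON that expvar would serve for this state.
func render(s *state) string {
	return expvar.Func(s.snapshot).String()
}

func main() {
	s := &state{probabilities: map[string]float64{}}
	// Publish panics if the name is already registered, so do it once.
	expvar.Publish("sampling_probabilities", expvar.Func(s.snapshot))
	s.set("tracegen/lets-go", 0.5)
	fmt.Println(expvar.Get("sampling_probabilities").String())
}
```

Importing expvar also registers a /debug/vars handler on the default HTTP mux, so published values can be polled while the test runs, provided the process serves that mux.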

Comment on lines +40 to +41
      - name: Setup Node.js version
        uses: ./.github/actions/setup-node.js
Member

not sure you need this, unless the test specifically checks that the UI is able to render the metrics

@mahadzaryab1
Collaborator Author

@yurishkuro I added the expvar reporting to debug the first element in the service cache. Here is what I see after the first few intervals. Do these calculations make sense to you?

"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{1 true}]]"
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{0.2083785217916667 true}]]"
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{0.021478606836127154 true}]]"
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{0.002214752631527174 true}]]"
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{0.00022832613436132902 true}]]"
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{2.3538892285428596e-05 true}]]"
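One quick way to read the sequence above is to look at the ratio between consecutive values. The sketch below does that with the six probabilities copied from the output; the helper name ratios is made up for illustration.

```go
package main

import "fmt"

// ratios returns p[i]/p[i-1] for each consecutive pair of values.
func ratios(p []float64) []float64 {
	out := make([]float64, 0, len(p))
	for i := 1; i < len(p); i++ {
		out = append(out, p[i]/p[i-1])
	}
	return out
}

func main() {
	// Probabilities copied from the expvar output above.
	observed := []float64{
		1,
		0.2083785217916667,
		0.021478606836127154,
		0.002214752631527174,
		0.00022832613436132902,
		2.3538892285428596e-05,
	}
	for i, r := range ratios(observed) {
		fmt.Printf("interval %d -> %d: x%.4f\n", i, i+1, r)
	}
}
```

After the first interval, each value is roughly a tenth of the previous one (a factor of about 0.103), so the probability keeps shrinking geometrically instead of settling.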

@@ -398,7 +401,7 @@ func (p *PostAggregator) isUsingAdaptiveSampling(
     // before.
     if len(p.serviceCache) > 1 {
         if e := p.serviceCache[1].Get(service, operation); e != nil {
-            return e.UsingAdaptive && !FloatEquals(e.Probability, p.InitialSamplingProbability)
+            return !FloatEquals(e.Probability, p.InitialSamplingProbability)
Collaborator Author

@yurishkuro with this patch, the numbers seem to make a bit more sense. Here's the output I see now:

"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{1 true}]]"
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{0.20840949054166666 false}]]"
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{0.021478798306074933 false}]]"
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{0.012634530863339348 false}]]"
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{0.02105483853348014 false}]]"
….
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{0.08421935413392057 false}]]"
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{0.1403881259748575 false}]]"
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{0.08254900503950167 false}]]"
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{0.051602341629876265 false}]]"

Collaborator Author

let me know if you have any thoughts on how to proceed here

Member

  • The name post_aggregator_service_cache is unclear: are these probabilities or throughput?
  • The Boolean value at the end looks suspicious. What does it mean? If it's the "using adaptive sampling" indicator, then we need to know why it goes to false.
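To make the effect of the patched check easier to reason about, the two variants from the hunk above can be compared in isolation. In this sketch, entry mirrors only the two fields visible in the diff, floatEquals is a stand-in tolerance comparison (not necessarily how Jaeger's FloatEquals works), and the initial probability of 0.001 is an arbitrary assumed value.

```go
package main

import (
	"fmt"
	"math"
)

// entry mirrors the two fields visible in the diff; the type is a stand-in.
type entry struct {
	Probability   float64
	UsingAdaptive bool
}

// floatEquals is an assumed tolerance-based comparison used for illustration.
func floatEquals(a, b float64) bool {
	return math.Abs(a-b) < 1e-10
}

// usingAdaptiveBefore is the check as it was before the patch.
func usingAdaptiveBefore(e entry, initial float64) bool {
	return e.UsingAdaptive && !floatEquals(e.Probability, initial)
}

// usingAdaptiveAfter is the check after the patch drops the flag.
func usingAdaptiveAfter(e entry, initial float64) bool {
	return !floatEquals(e.Probability, initial)
}

func main() {
	const initial = 0.001 // assumed initial sampling probability
	cases := []entry{
		{Probability: initial, UsingAdaptive: true},
		{Probability: 0.2084, UsingAdaptive: true},
		{Probability: 0.2084, UsingAdaptive: false},
	}
	for _, e := range cases {
		fmt.Printf("%+v before=%v after=%v\n",
			e, usingAdaptiveBefore(e, initial), usingAdaptiveAfter(e, initial))
	}
}
```

The two variants disagree only for an entry whose flag is false while its probability differs from the initial value, which is the case worth tracing when asking why the flag goes to false.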

Collaborator Author

@@ -346,6 +348,7 @@ func (p *PostAggregator) calculateProbability(service, operation string, qps flo
         Probability:   oldProbability,
         UsingAdaptive: usingAdaptiveSampling,
     })
+    safeexpvar.SetString("post_aggregator_service_cache[0]", fmt.Sprintf("%v", p.serviceCache[0].ToValue()))
Member

The string conversion loses important information; we should use a hierarchical expvar.Map.
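A possible shape for that, sketched with the standard library only: nest one expvar.Map per service and per operation, with named leaf values. The entry type, the key names probability and using_adaptive, and both helper functions are assumptions for illustration, not code from this PR.

```go
package main

import (
	"expvar"
	"fmt"
)

// entry is a stand-in for one cache entry (fields taken from the diff).
type entry struct {
	Probability   float64
	UsingAdaptive bool
}

// toExpvar builds a service -> operation -> fields hierarchy.
func toExpvar(cache map[string]map[string]entry) *expvar.Map {
	root := new(expvar.Map).Init()
	for service, ops := range cache {
		svc := new(expvar.Map).Init()
		for op, e := range ops {
			fields := new(expvar.Map).Init()
			p := new(expvar.Float)
			p.Set(e.Probability)
			fields.Set("probability", p)
			u := new(expvar.Int) // 1 when adaptive sampling is in use, else 0
			if e.UsingAdaptive {
				u.Set(1)
			}
			fields.Set("using_adaptive", u)
			svc.Set(op, fields)
		}
		root.Set(service, svc)
	}
	return root
}

// lookupProbability walks the hierarchy and reads one leaf back.
func lookupProbability(root *expvar.Map, service, op string) (float64, bool) {
	svc, ok := root.Get(service).(*expvar.Map)
	if !ok {
		return 0, false
	}
	fields, ok := svc.Get(op).(*expvar.Map)
	if !ok {
		return 0, false
	}
	p, ok := fields.Get("probability").(*expvar.Float)
	if !ok {
		return 0, false
	}
	return p.Value(), true
}

func main() {
	cache := map[string]map[string]entry{
		"tracegen": {"lets-go": {Probability: 0.2084, UsingAdaptive: true}},
	}
	root := toExpvar(cache)
	fmt.Println(root.String()) // nested JSON with named fields
}
```

Because each leaf is named, the JSON served by expvar states which number is the probability and which is the flag, which the "%v" string form does not.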

Development

Successfully merging this pull request may close these issues.

Create e2e integration tests for Adaptive Sampling
2 participants