Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve #[instrument(ret, err)] handling for tracing-opentelemetry #2316

Open
wants to merge 21 commits into
base: master
Choose a base branch
from

Conversation

notheotherben
Copy link

Motivation

This PR is intended to help resolve an issue with the way that the #[instrument] proc-macro handles return values when working with tracing-opentelemetry. In particular, this focuses on resolving the empty events generated by #[instrument(err)] in environments which use the span event's title for rendering purposes (making these events at best useless and at worst a significant issue for on-call engineers attempting to understand the system's behaviour).

The heart of this problem comes down to two things:

  1. tracing-opentelemetry handles the std::error::Error type specially, capturing relevant error information for these values in the OpenTelemetry semantic convention event fields exception.type, exception.message and exception.stacktrace (if available). To do this, it needs to receive the raw std::error::Error reference, however the only behavioural options available in #[instrument] are Display and Debug formatting (which both result in a String).
  2. The implementation of #[instrument] will cause the instrumented span to have an error = (bool) field populated and will emit a span event with an error = (string) field and no other metadata. This results in type-aware backends (like Honeycomb) erroneously coalescing the detailed string into a true/false value, hiding any useful information. It also means that backends which expect an otel.message will present an empty event title, making this harder to read for an engineer.

Solution

This PR attempts to overcome both of these problems in a relatively backwards compatible manner. To achieve this, it includes two changes:

  1. Adding a new display format (which is available for both err and ret) called Raw. This format will pass the value through without mutating it in any way, allowing layers like tracing-opentelemetry to handle them appropriately. In this case, that allows one to #[instrument(err(Raw))] to ensure that accurate exception information is captured in the automatically generated span events.
  2. Populating the generated span event's title with the display representation of the err value (and/or ret value) such that it can be easily consumed by engineers when analyzing the trace events associated with a span.

In practice, there are two areas where I'm marginally concerned about what the "best" way forward here is and ensuring backwards compatibility:

  1. I feel that Raw should likely be the default representation, with the option of using Display or Debug being available if the type being recorded is not supported by tracing (in much the same way as the current field annotations work). Perhaps this could even be achieved by using the same ?err and %err macros we use today for fields, further consolidating the API and making it easier to pick-up. That said, this would be a breaking change for most implementations which currently avail of the transparent Display/Debug representations baked into #[instrument].
  2. Adding a title to the span event generated for errors and return values introduces the risk of double the data emission for these events and could make it challenging for consumers to de-duplicate this in situations where they don't benefit from the improved usability and/or where they've already developed solutions to this issue. Without expanding the #[instrument] API to make this opt-in/opt-out, it's really challenging for someone to "fix" this without forking the project, and I suspect API expansions are likely not the right decision in the long-term.

This addresses an issue with the standard behaviour of #[instrument] on
OpenTelemetry installations, which prevents them from recording a useful
error event in these scenarios (resulting in `error = false` and no
other details in most cases).

I've opted to keep the default behaviour the same for backwards
compatibility reasons, since most of the test cases assume
`err: std::fmt::Display` instead of `err: std::err::Error` and anything
else would be a breaking change. This makes the fix opt-in for most
users.
This ensures that error events generated by this method will include a
useful name for easy consumption in common OpenTelemetry tracing tools.
This ensures that events generated by this instrumentation have a useful
name for consumption in common OpenTelemetry tracing tools.
@@ -177,9 +177,10 @@ impl Parse for EventArgs {
match ident.to_string().as_str() {
"Debug" => result.mode = FormatMode::Debug,
"Display" => result.mode = FormatMode::Display,
"Raw" => result.mode = FormatMode::Raw,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love the name "Raw" for this mode --- it's not super obvious to the user what this actually means. Since the error is being recorded as a dyn Error trait object, and the other two format modes are named "Debug" and "Display", it might make more sense for this to be "Error" instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, after taking a closer look, I see that this also controls how ret values, which are not errors, may be recorded. Maybe Value is a better name?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like Value here, but I also wonder whether we need to maybe subdivide this a bit more. For context, this PR is really introducing two things:

  1. An ability to record the err and ret value attributes "in their raw value form" instead of relying on std::fmt::* for coercion, which resolves issues with how OpenTelemetry handles Error internally.
  2. Adding event messages to represent the err and ret values so that they appear nicely in practical production UIs like Honeycomb/Jaeger etc.

The second of those points currently is always using Display semantics, however there are types where that is not implemented and Debug would be preferable. In those scenarios we may want to be able to specify Value+Debug and Value+Display to control these behaviours, or alternatively treat "log the attribute value in its raw form" as an option separate from the formatting of the messages.

I'll update this to use Value for now and we can discuss any other changes (either here or in a future PR depending on what you'd prefer).

tracing-attributes/src/lib.rs Outdated Show resolved Hide resolved
tracing-attributes/src/lib.rs Outdated Show resolved Hide resolved
tracing::event!(target: #target, #level_tokens, error = ?e, "{:?}", e)
)),
FormatMode::Raw => Some(quote!(
tracing::event!(target: #target, #level_tokens, error = e, "{}", e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some types, such as anyhow::Error, don't actually implement Error directly, but have AsRef<&dyn Error> implementations. is there anything we can do to make this mode also work nicely for events that have to be as_ref'd, without breaking returns for values which are Error implementers?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting area to think about. Presumably this falls into the problem domain of mapping attribute values to the record_error(...) hook within the Layer implementation?

I'm expecting that in this specific scenario we're not going to see too much by way of business logic around value types, though if I've missed something then please shout. As for implementing the business logic for handling AsRef<&dyn Error> I'm happy to chase that down, though I would propose landing that in a separate PR to keep the concepts neatly segregated.

tracing-attributes/src/attr.rs Outdated Show resolved Hide resolved
tracing-attributes/src/attr.rs Outdated Show resolved Hide resolved
tracing-attributes/src/attr.rs Outdated Show resolved Hide resolved
tracing-attributes/src/attr.rs Outdated Show resolved Hide resolved
tracing-attributes/src/attr.rs Outdated Show resolved Hide resolved
Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few more notes, one of which is actually important (the test won't actually pass as it's written currently). other than that, this looks good!

tracing-attributes/src/attr.rs Outdated Show resolved Hide resolved
tracing-attributes/src/lib.rs Outdated Show resolved Hide resolved
tracing-attributes/src/lib.rs Outdated Show resolved Hide resolved
tracing-attributes/src/lib.rs Outdated Show resolved Hide resolved
tracing-attributes/tests/err.rs Outdated Show resolved Hide resolved
@notheotherben
Copy link
Author

Thanks for the review there @hawkw, I think I've covered the docs issues (and some other stuff I noticed going through there) - shout if you spot anything else.

tracing-attributes/src/attr.rs Outdated Show resolved Hide resolved
tracing-attributes/src/attr.rs Outdated Show resolved Hide resolved
@hawkw
Copy link
Member

hawkw commented Aug 18, 2023

looks like cargo fmt needs to be run on this PR: https://github.com/tokio-rs/tracing/actions/runs/5900755392/job/16019086920?pr=2316

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants