Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core: fix missed register_callsite error #2938

Open
wants to merge 2 commits into
base: v0.1.x
Choose a base branch
from

Conversation

hds
Copy link
Contributor

@hds hds commented Apr 16, 2024

Note: This is a change against the v0.1.x branch.

Motivation

There are 2 triggers which will cause a subscriber to receive a call to
Subscriber::register_callsite for a specific callsite.

  1. The first time the event or span at that callsite is executed.
  2. When a new subscriber is added or removed (for example, calls to
    set_default or with_default)

It is trigger (2) that will cause a new subscriber to receive
Subscriber::register_callsite for all the callsites which had already
been registered before it became active.

When a callsite is registered for trigger (1), the callsite starts in
state UNREGISTERED.

The first thread to encounter the callsite will transition it to
REGISTERING and determine the overall interest for the callsite by
registering with all known dispatchers (which will call into
Subscriber::register_callsite).

Once that is complete, the callsite is added to the list of all known
callsites and its state is transitioned to REGISTERED.

is (re)built for all known dispatchers. The callsite starts in state
UNREGISTERED. The This calls down into
Subscriber::register_callsite for each subscriber. Once that is
complete, the callsite is added to the global list of known callsites.

While the callsite interest is being rebuilt, other threads that
encounter the callsite will be given Interest::sometimes() until the
registration is complete. However, if a new subscriber is added during
this window, all the interest for all callsites will be rebuilt, but
because the new callsite (in state REGISTERING) won't be included
because it isn't yet in the global list of callsites.

This can cause a case where that new subscriber being added won't
receive Subscriber::register_callsite before it receives the subsequent
call to Subscriber::event or Subscriber::new_span.

The documentation on Registering Callsites is not very explicit on
this point, but it does suggest that Subscriber::register_callsite
will be called before the call to either Subscriber::event or
Subscriber::new_span, and the current behavior can break this implicit
contract.

Solution

This change swaps the order of rebuilding the callsite interest and
adding the callsite to the global list so that the callsite gets pushed
first, avoiding this window in which a subscriber won't get a call to
register_callsite.

As such, a callsite may have its interest read before it is set. In this
case, the existing implementation will return Interest::sometimes()
for the DefaultCallsite implementation. Other implementations (outside
of the tracing project) may perform this differently, but in this
case, there is no documented guarantee regarding the ordering.

A regression test is included which provokes the race condition 100% of
the time before the changes in this fix.

Fixes: #2743

@hds hds force-pushed the hds/missed-register-callsite-fix-v0.1.x branch from f8db879 to 8b6dca6 Compare April 16, 2024 17:37
There are 2 triggers which will cause a subscriber to receive a call to
`Subscriber::register_callsite` for a specific callsite.
1. The first time the event or span at that callsite is executed.
2. When a new subscriber is added or removed (for example, calls to
   `set_default` or `with_default`)

It is trigger (2) that will cause a new subscriber to receive
`Subscriber::register_callsite` for all the callsites which had already
been registered before it became active.

When a callsite is registered for trigger (1), the callsite starts in
state `UNREGISTERED`.

The first thread to encounter the callsite will transition it to
`REGISTERING` and determine the overall interest for the callsite by
registering with all known dispatchers (which will call into
`Subscriber::register_callsite`).

Once that is complete, the callsite is added to the list of all known
callsites and its state is transitioned to `REGISTERED`.

is (re)built for all known dispatchers. The callsite starts in state
`UNREGISTERED`. The  This calls down into
`Subscriber::register_callsite` for each subscriber. Once that is
complete, the callsite is added to the global list of known callsites.

While the callsite interest is being rebuilt, other threads that
encounter the callsite will be given `Interest::sometimes()` until the
registration is complete. However, if a new subscriber is added during
this window, all the interest for all callsites will be rebuilt, but
because the new callsite (in state `REGISTERING`) won't be included
because it isn't yet in the global list of callsites.

This can cause a case where that new subscriber being added won't
receive `Subscriber::register_callsite` before it receives the subsequent
call to `Subscriber::event` or `Subscriber::new_span`.

The documentation on [Registering Callsites] is not very explicit on
this point, but it does suggest that `Subscriber::register_callsite`
will be called before the call to either `Subscriber::event` or
`Subscriber::new_span`, and the current behavior can break this implicit
contract.

[Registering Callsites]: https://docs.rs/tracing-core/0.1.32/tracing_core/callsite/index.html#registering-callsites

This change swaps the order of rebuilding the callsite interest and
adding the callsite to the global list so that the callsite gets pushed
first, avoiding this window in which a subscriber won't get a call to
`register_callsite`.

As such, a callsite may have its interest read before it is set. In this
case, the existing implementation will return `Interest::sometimes()`
for the `DefaultCallsite` implementation. Other implementations (outside
of the `tracing` project) may perform this differently, but in this
case, there is no documented guarantee regarding the ordering.

A regression test is included which provokes the race condition 100% of
the time before the changes in this fix.

Fixes: #2743
@hds
Copy link
Contributor Author

hds commented Apr 17, 2024

@hawkw I'd love to get your opinion on the change I've made in the callsite module. It seems fine for me (and certainly no tests failing), but I think you're probably the only person in the world who would know.

The test in tracing can probably go, and I might need another test in tracing-core to cover the dyn Callsite case. Also, interestingly, I couldn't reproduce this issue on master.

@hawkw
Copy link
Member

hawkw commented Apr 17, 2024

@hawkw I'd love to get your opinion on the change I've made in the callsite module. It seems fine for me (and certainly no tests failing), but I think you're probably the only person in the world who would know.

Thanks for the ping! I'll try to give it a look today or tomorrow.

@hawkw hawkw self-requested a review April 17, 2024 18:02
@mladedav
Copy link
Contributor

@hds you mentioned the docs not being explicit on the topic, could you maybe clarify them? You just might be the second most knowledgeable person on the topic now.

@hds hds marked this pull request as ready for review October 2, 2024 21:58
@hds hds requested a review from a team as a code owner October 2, 2024 21:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants