Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RUMM-1609 Invoke Kronos from a single thread #600

Closed
wants to merge 1 commit into from

Conversation

maxep
Copy link
Member

@maxep maxep commented Sep 21, 2021

What and why?

As per this issue comment, Clock.now and .sync from Kronos are expected to be called from the same thread.

Fix #588

How?

Invoke Kronos from a dedicated DispatchQueue.

Review checklist

  • Feature or bugfix MUST have appropriate tests (unit, integration)
  • Make sure each commit and the PR mention the Issue number or JIRA reference

@maxep maxep requested a review from a team as a code owner September 21, 2021 14:51
@maxep
Copy link
Member Author

maxep commented Sep 22, 2021

@ncreated @buranmert we could also make currentDate async but this would require some refactoring.
I'm also not sure how to test this since ServerDateProvider is mocked in our test suite 🤔

@maxep maxep self-assigned this Sep 22, 2021
Comment on lines +21 to +36
/// Queue used for invoking Kronos clock.
private let queue = DispatchQueue(
label: "com.datadoghq.date-provider",
target: .global(qos: .userInteractive)
)

func synchronize(with pool: String, completion: @escaping (Date?) -> Void) {
queue.async {
Clock.sync(from: pool, completion: { serverTime, _ in
completion(serverTime)
})
}
}

func currentDate() -> Date? {
return Clock.now
queue.sync { Clock.now }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we managed to reproduce this issue and are we sure this is the right fix 🤔💭?

This TSAN issue in Kronos is raised by two simultaneous operations accessing the same memory:

Digging deeper into the completion block execution in Kronos, it leads to this:

CFHostSetClient(hostReference, callback, &clientContext)
CFHostScheduleWithRunLoop(hostReference, CFRunLoopGetMain(), CFRunLoopMode.commonModes.rawValue)

which looks like always running the completion callback on the main thread. This is in fact what I can observe when putting two basic print() statements in our code and running Example project:

Screenshot 2021-09-22 at 12 08 42

To add further, Kronos maintainers say they call both APIs from the main thread:

In our app we only ever call this from the main thread. I think we'd be happy to support this if you wanted to submit a change for it though

So, by using the queue, we only synchronize calls to Clock.sync and Clock.now (which is not the issue), but we do not force the completion block to be called on the same thread as we later use Clock.now. The completion is always called on the main thread.

Before going further - does my observation make sense @maxep @buranmert ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point! This wasn't my understanding of this comment: So technically right now we expect you to call .now and .sync from the same thread.

If actually the .now should be called from the same thread as the completion block of .sync, then yes it should be called on the .main. Seeing this, your observation makes total sense! Now, having to call NTPServerDateProvider.currentData on the main thread will require async for sure. I will also look at submitting a change to Kronos.

I'm not able to reproduce but I will do more investigation.

Copy link
Member

@ncreated ncreated Sep 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, having to call NTPServerDateProvider.currentData on the main thread will require async for sure. I will also look at submitting a change to Kronos.

Yes, I think these are valid choices.

IMO fixing the issue in Kronos seems more correct - right now it seems to enforce main thread access to Clock.now, Clock.annotatedNow and Clock.timestamp to all users. As this is not stated in their public doc this might be a bug and we could contribute.

However, if we want to fix it on our side, IMO queue.async {} is not an easy way to go. That might enforce quite refactoring to data collection system, as we read currentDate() synchronously in many places.

A better idea would be to use our ValuePublisher which is meant for building async / sync bridge in a performant way. It uses concurrent queue optimised for simultaneous reads and exclusive writes, which fits perfectly our usage of Clock.sync (write) and Clock.now (read):

  • Clock.sync is used only once, in Datadog.initialize(), to receive the real NTP time;
  • Clock.now is called from different threads for every event we create (log / span / rum).

With above constraints, and ValuePublisher, we could try something this in NTPServerDateProvider:

internal class NTPServerDateProvider: ServerDateProvider {
    private let serverTimePublisher: ValuePublisher<Date?> = ValuePublisher(initialValue: .empty)

    func synchronize(with ntpPool: String, completion: @escaping (Date?) -> Void) {
        Clock.sync(from: ntpPool, completion: { serverTime, _ in
            self.serverTimePublisher.publishAsync(serverTime)
            completion(serverTime)
        })
    }

    func currentDate() -> Date? {
        return serverTimePublisher.currentValue
    }
}

This way, the real (NTP) time will be available after sync completes and all should be thread-safe. This would be regressive to what we have now, where Clock.now is available earlier than completion and it becomes more precise while NTP sync continues (to become stable when sync ends). To fill this gap further, we could have a look at Kronos first completion API and even look closer at their UserDefaults capability. By using both to additionally feed serverTimePublisher, we should achieve similar behaviour to what we have now, in a thread-safe manner and minimal change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks for pointing out the ValuePublisher, this would be a great approach indeed. I will first update this PR with a proposal and then look at modifying Kronos 👍

@maxep
Copy link
Member Author

maxep commented Sep 24, 2021

Closing in favor of #607

@maxep maxep closed this Sep 24, 2021
@maxep maxep deleted the maxep/RUMM-1609/fix-threadsanitizer-data-race branch March 6, 2023 16:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ThreadSanitizer data race
2 participants