Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RUMM-1609 Invoke Kronos from a single thread #600

Closed
wants to merge 1 commit into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 14 additions & 6 deletions Sources/Datadog/Core/System/Time/ServerDateProvider.swift
Original file line number Diff line number Diff line change
Expand Up @@ -11,20 +11,28 @@ import Kronos
internal protocol ServerDateProvider {
/// Start the clock synchronisation with NTP server.
/// Calls the `completion` by passing it the server time when the synchronization succeeds or`nil` if it fails.
func synchronize(with ntpPool: String, completion: @escaping (Date?) -> Void)
func synchronize(with pool: String, completion: @escaping (Date?) -> Void)
/// Returns the server time or `nil` if not yet determined.
/// This time gets more precise while synchronization is pending.
func currentDate() -> Date?
}

internal class NTPServerDateProvider: ServerDateProvider {
func synchronize(with ntpPool: String, completion: @escaping (Date?) -> Void) {
Clock.sync(from: ntpPool, completion: { serverTime, _ in
completion(serverTime)
})
/// Queue used for invoking Kronos clock.
private let queue = DispatchQueue(
label: "com.datadoghq.date-provider",
target: .global(qos: .userInteractive)
)

func synchronize(with pool: String, completion: @escaping (Date?) -> Void) {
queue.async {
Clock.sync(from: pool, completion: { serverTime, _ in
completion(serverTime)
})
}
}

func currentDate() -> Date? {
return Clock.now
queue.sync { Clock.now }
Comment on lines +21 to +36
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we managed to reproduce this issue and are we sure this is the right fix 🤔💭?

This TSAN issue in Kronos is raised by two simultaneous operations accessing the same memory:

Digging deeper into the completion block execution in Kronos, it leads to this:

CFHostSetClient(hostReference, callback, &clientContext)
CFHostScheduleWithRunLoop(hostReference, CFRunLoopGetMain(), CFRunLoopMode.commonModes.rawValue)

which looks like always running the completion callback on the main thread. This is in fact what I can observe when putting two basic print() statements in our code and running Example project:

Screenshot 2021-09-22 at 12 08 42

To add further, Kronos maintainers say they call both APIs from the main thread:

In our app we only ever call this from the main thread. I think we'd be happy to support this if you wanted to submit a change for it though

So, by using the queue, we only synchronize calls to Clock.sync and Clock.now (which is not the issue), but we do not force the completion block to be called on the same thread as we later use Clock.now. The completion is always called on the main thread.

Before going further - does my observation make sense @maxep @buranmert ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point! This wasn't my understanding of this comment: So technically right now we expect you to call .now and .sync from the same thread.

If actually the .now should be called from the same thread as the completion block of .sync, then yes it should be called on the .main. Seeing this, your observation makes total sense! Now, having to call NTPServerDateProvider.currentData on the main thread will require async for sure. I will also look at submitting a change to Kronos.

I'm not able to reproduce but I will do more investigation.

Copy link
Member

@ncreated ncreated Sep 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, having to call NTPServerDateProvider.currentData on the main thread will require async for sure. I will also look at submitting a change to Kronos.

Yes, I think these are valid choices.

IMO fixing the issue in Kronos seems more correct - right now it seems to enforce main thread access to Clock.now, Clock.annotatedNow and Clock.timestamp to all users. As this is not stated in their public doc this might be a bug and we could contribute.

However, if we want to fix it on our side, IMO queue.async {} is not an easy way to go. That might enforce quite refactoring to data collection system, as we read currentDate() synchronously in many places.

A better idea would be to use our ValuePublisher which is meant for building async / sync bridge in a performant way. It uses concurrent queue optimised for simultaneous reads and exclusive writes, which fits perfectly our usage of Clock.sync (write) and Clock.now (read):

  • Clock.sync is used only once, in Datadog.initialize(), to receive the real NTP time;
  • Clock.now is called from different threads for every event we create (log / span / rum).

With above constraints, and ValuePublisher, we could try something this in NTPServerDateProvider:

internal class NTPServerDateProvider: ServerDateProvider {
    private let serverTimePublisher: ValuePublisher<Date?> = ValuePublisher(initialValue: .empty)

    func synchronize(with ntpPool: String, completion: @escaping (Date?) -> Void) {
        Clock.sync(from: ntpPool, completion: { serverTime, _ in
            self.serverTimePublisher.publishAsync(serverTime)
            completion(serverTime)
        })
    }

    func currentDate() -> Date? {
        return serverTimePublisher.currentValue
    }
}

This way, the real (NTP) time will be available after sync completes and all should be thread-safe. This would be regressive to what we have now, where Clock.now is available earlier than completion and it becomes more precise while NTP sync continues (to become stable when sync ends). To fill this gap further, we could have a look at Kronos first completion API and even look closer at their UserDefaults capability. By using both to additionally feed serverTimePublisher, we should achieve similar behaviour to what we have now, in a thread-safe manner and minimal change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks for pointing out the ValuePublisher, this would be a great approach indeed. I will first update this PR with a proposal and then look at modifying Kronos 👍

}
}