-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SQL event.timestamp is not always unique #13019
SQL event.timestamp is not always unique #13019
Conversation
rasa/core/tracker_store.py
Outdated
@@ -1325,7 +1325,7 @@ def _event_query( | |||
) | |||
) | |||
|
|||
return event_query.order_by(self.SQLEvent.timestamp) | |||
return event_query.order_by(self.SQLEvent.timestamp, self.SQLEvent.id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the change to order by both? We do not need batch ordering, but simple overall ordering. The SQLEvent.id is unique and chronologically ordered by nature, which makes timestamp redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zackjn I've changed the order to order by ID
event_1 = random_user_uttered_event(1) | ||
event_2 = random_user_uttered_event(3) | ||
event_3 = random_user_uttered_event(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The arguments here refer to the timestamp of events. The test here was to assume that,
conversation1 = [event1, event2]
where event2
happens before event1. However this is not a fair assumption since the event timestamps are not externally assigned but depend on the timestamp of machine -- as defined here https://github.com/RasaHQ/rasa/blob/3.6.x/rasa/shared/core/events.py#L284
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to understand why the initial test setup assumed the order of events is [event1, event2]
if event2 happens first? Was that to test that the timestamp ordering happens? If so, why do you need to alter the test if you've sorted the events by timestamp in the exporter code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ancalita I'm not sure why did the test assumed that order of events. The changes we've made in the Tracker Store are only limited to SQL Tracker Store. However there's an in-memory tracker store being used here. My additional changes to sort events in exporter still apply. The test fails with the following diff,
> assert event_timestamps == [e.timestamp for e in events[_id]]
E assert [2, 3] == [3, 2]
E At index 0 diff: 2 != 3
E Full diff:
E - [3, 2]
E + [2, 3]
Which is, that it is trying to assert that the order of events returned is the same as given. I changed the given here since it is a false test case.
rasa/core/exporter.py
Outdated
# the order of events was changed after ATO-2192 | ||
# we should sort the events by timestamp to keep the order | ||
events.sort(key=lambda x: x["timestamp"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
certain tests in exporter failed nonetheless so I've added this statement to sort the event list. This sorting does not inflate the performance cost much as it is only used during rasa export
command -- which is a fire-and-forget command that is seldom used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we update the tests instead of adding this extra sort function 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The performance may be inflated quite well for large dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally this sorting shouldn't be required, we only need this here because exporter
relies on synthetic data that isn't realistic. However adding sorting here is a way to minimise the 2nd order effects from this change. In terms of performance, sorting the events explicitly here is better than sorting by timestamp in the tracker store query. The tracker store query is used much more frequently compared to rasa export
command
🚀 A preview of the docs have been deployed at the following URL: https://13019--rasahq-docs-rasa-v2.netlify.app/docs/rasa |
Proposed changes:
While the original query is quite complicated,
I've some quick results on the performance of this query. Order by ID understandably is the best option for performance.
Order by
timestamp
Order by
id
Order by
timestamp
andid
Status (please check what you already did):
black
(please check Readme for instructions)