Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL event.timestamp is not always unique #13019

Conversation

vcidst
Copy link
Contributor

@vcidst vcidst commented Mar 1, 2024

Proposed changes:

  • Works on https://rasahq.atlassian.net/browse/ATO-2192
  • The timestamps are being set by events.py when the events are created. There seem to be two losses at play here,
    • documentation of time.time() notes that "not all systems provide time with a better precision than 1 second". This is a little wild to me!
    • precision loss due to timestamp being a float type: Given that timestamp is created during event object creation. The case of (ev1.id > ev2.id) && (ev1.timestamp < ev2.timestamp) don't seem to be possible here. I also can't think of why sorting by timestamp would be good idea at all (possibility of collisions even in high-precision systems).

While the original query is quite complicated,

SELECT *
FROM events
WHERE sender_id = :sender_id
  AND (
    timestamp >= (
      SELECT max(events.timestamp) AS session_start
      FROM events
      WHERE events.sender_id = :sender_id
        AND events.type_name = :type_name
    )
    OR (
      (
        SELECT max(events.timestamp) AS session_start
        FROM events
        WHERE events.sender_id = :sender_id
          AND events.type_name = :type_name
      ) IS NULL
    )
  )
ORDER BY timestamp, id;

I've some quick results on the performance of this query. Order by ID understandably is the best option for performance.

Order by timestamp

zi=# explain select * from events order by timestamp;                                                       QUERY PLAN                             
-------------------------------------------------------------------
 Sort  (cost=544.08..553.94 rows=3945 width=490)
   Sort Key: "timestamp"
   ->  Seq Scan on events  (cost=0.00..308.45 rows=3945 width=490)
(3 rows)

Order by id

----------------------------------------------------------------------------------
 Index Scan using events_pkey on events  (cost=0.29..543.61 rows=10088 width=177)
(1 row)

Order by timestamp and id

zi=# explain select * from events order by timestamp, id;
                             QUERY PLAN                             
--------------------------------------------------------------------
 Sort  (cost=1040.75..1065.97 rows=10088 width=177)
   Sort Key: "timestamp", id
   ->  Seq Scan on events  (cost=0.00..369.88 rows=10088 width=177)
(3 rows)

Status (please check what you already did):

  • added some tests for the functionality
  • updated the documentation
  • updated the changelog (please check changelog for instructions)
  • reformat files using black (please check Readme for instructions)

@vcidst vcidst requested a review from a team as a code owner March 1, 2024 12:40
@@ -1325,7 +1325,7 @@ def _event_query(
)
)

return event_query.order_by(self.SQLEvent.timestamp)
return event_query.order_by(self.SQLEvent.timestamp, self.SQLEvent.id)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the change to order by both? We do not need batch ordering, but simple overall ordering. The SQLEvent.id is unique and chronologically ordered by nature, which makes timestamp redundant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zackjn I've changed the order to order by ID

Comment on lines +76 to +78
event_1 = random_user_uttered_event(1)
event_2 = random_user_uttered_event(3)
event_3 = random_user_uttered_event(2)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The arguments here refer to the timestamp of events. The test here was to assume that,
conversation1 = [event1, event2] where event2 happens before event1. However this is not a fair assumption since the event timestamps are not externally assigned but depend on the timestamp of machine -- as defined here https://github.com/RasaHQ/rasa/blob/3.6.x/rasa/shared/core/events.py#L284

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to understand why the initial test setup assumed the order of events is [event1, event2] if event2 happens first? Was that to test that the timestamp ordering happens? If so, why do you need to alter the test if you've sorted the events by timestamp in the exporter code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ancalita I'm not sure why did the test assumed that order of events. The changes we've made in the Tracker Store are only limited to SQL Tracker Store. However there's an in-memory tracker store being used here. My additional changes to sort events in exporter still apply. The test fails with the following diff,

>           assert event_timestamps == [e.timestamp for e in events[_id]]
E           assert [2, 3] == [3, 2]
E             At index 0 diff: 2 != 3
E             Full diff:
E             - [3, 2]
E             + [2, 3]

Which is, that it is trying to assert that the order of events returned is the same as given. I changed the given here since it is a false test case.

Comment on lines 245 to 247
# the order of events was changed after ATO-2192
# we should sort the events by timestamp to keep the order
events.sort(key=lambda x: x["timestamp"])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

certain tests in exporter failed nonetheless so I've added this statement to sort the event list. This sorting does not inflate the performance cost much as it is only used during rasa export command -- which is a fire-and-forget command that is seldom used

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we update the tests instead of adding this extra sort function 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The performance may be inflated quite well for large dataset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally this sorting shouldn't be required, we only need this here because exporter relies on synthetic data that isn't realistic. However adding sorting here is a way to minimise the 2nd order effects from this change. In terms of performance, sorting the events explicitly here is better than sorting by timestamp in the tracker store query. The tracker store query is used much more frequently compared to rasa export command

@vcidst vcidst requested a review from ancalita March 4, 2024 10:08
@vcidst vcidst requested a review from ancalita March 4, 2024 15:43
Copy link
Contributor

github-actions bot commented Mar 4, 2024

🚀 A preview of the docs have been deployed at the following URL: https://13019--rasahq-docs-rasa-v2.netlify.app/docs/rasa

@vcidst vcidst merged commit 0df5212 into 3.6.x Mar 4, 2024
98 of 99 checks passed
@vcidst vcidst deleted the ATO-2192-SQLEvent.timestamp-is-not-always-unique-should-use-SQLEvent.id branch March 4, 2024 16:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants