Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: filemanager partitions #615

Merged
merged 8 commits into from
Oct 25, 2024
Merged

feat: filemanager partitions #615

merged 8 commits into from
Oct 25, 2024

Conversation

mmalenic
Copy link
Member

@mmalenic mmalenic commented Oct 24, 2024

Related to #614, finishes first task.

Changes

  • Add is_current_state column to database for later partitioning.
    • Migrates existing data to the correct is_current_state.
    • API uses this column now instead of performing the calculation every time.
    • Tests are updated to take this into account.

I know this adds some overhead to the ingester, but it is performing pretty well in testing (< a few ms per insert), and I think it's worth it because it reduces the performance issues on the API side. If this ever is a problem for the ingester to handle, then it can always be run as a separate asynchronous process.

@mmalenic mmalenic self-assigned this Oct 24, 2024
@mmalenic mmalenic added filemanager an issue relating to the filemanager feature New feature labels Oct 24, 2024
Comment on lines +24 to +52
select
s3_object_id,
-- This finds the first value in the set which represents the most up-to-date state.
-- If ordered by the sequencer, the first row is the one that needs to have `is_current_state`
-- set to 'true' only for `Created` events, as `Deleted` events are always non-current state.
case when row_number() over (order by s3_object.sequencer desc) = 1 then
event_type = 'Created'
-- Set `is_current_state` to 'false' for all other rows, as this is now historical data.
else
false
end as updated_state
from s3_object
where
-- This should be fairly efficient as it's only targeting objects where `is_current_state` is true,
-- or objects with the highest sequencer values (in case of an out-of-order event). This means that
-- although there is a performance impact for running this on ingestion, it should be minimal with
-- the right indexes.
input.bucket = s3_object.bucket and
input.key = s3_object.key and
input.version_id = s3_object.version_id and
-- This is an optimization which prevents querying against all objects in the set.
(
-- Only need to update current objects
s3_object.is_current_state = true or
-- Or objects where there is a newer sequencer than the one being processed.
-- This is required in case an out-of-order event is encountered. This always
-- includes the object being processed as it's required for the above row-logic.
s3_object.sequencer >= input.sequencer
)
Copy link
Member Author

@mmalenic mmalenic Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope this makes sense! Let me know if I can improve the explanation or if it would be good to reduce some of the complexity. For example, that last nested or condition probably isn't necessary for correctness but I think it helps with performance.

@mmalenic mmalenic merged commit d45b20b into main Oct 25, 2024
6 checks passed
@mmalenic mmalenic deleted the feat/filemanager-partitions branch October 25, 2024 00:27
the performance impact of determining this is too great on every API call. This does incur a performance penalty for
ingestion. However, it should be minimal as only other current records need to be considered. This is because records
only need to transition from current to historical (and not the other way around). Records which are current represent
a smaller subset of all records, meaning that only a smaller part of the dataset needs to be queried when ingesting.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I completely follow here....
Why do other records need to be queried when a new one is ingested?

from current to historical (and not the other way around)
How would objects restored from the archive bucket be handled?

Also, perhaps for future consideration, if handling the data within one table get's tricky, consider moving records between tables ? E.g. a current_records vs deleted_records table? (I think the Django history functionality works like that)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do other records need to be queried when a new one is ingested?

A record starts as current, and then some logic needs to make it historical.

For example, say Created event "A" comes in for a given key/bucket. At that point in time, "A" has is_current_state set to true on ingestion. Then, another Created event "B" comes in for the same key/bucket, which represents overwriting that object. Now, "B" has is_current_state set to true given that it's the new current state. Something needs to flip is_current_state to false on "A" as it's no longer current. This requires querying the database, which is what this query does on ingestion.

This doesn't have to happen on ingestion, it could happen in another process some time afterwards. However, I think the cost of doing it at ingestion isn't too high so I've added it to make the state of the records as accurate as soon as possible.

Does that make sense (I'll add this in the docs for the next PR)?

Also, perhaps for future consideration, if handling the data within one table get's tricky, consider moving records between tables ?

Yeah, agreed, that's the idea with task 2 on #614. I'll do this by partitioning the table, which more-or-less creates two separate tables, but allows queries to treat it like one table. When this is set up in postgres, it happens automatically on a particular partition key. This is why I'm adding the is_current_state here, because this allows the table to be partitioned on that column.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes. I forgot we are dealing with object versions....
Querying with bucket + key + version should be quick and efficient... all good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature filemanager an issue relating to the filemanager
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants