refactor: Make CDF use TableConfiguration and refactor log replay #645

Open, wants to merge 14 commits into base: main

OussamaSaoudi
Collaborator

@OussamaSaoudi OussamaSaoudi commented Jan 15, 2025

What changes are proposed in this pull request?

This PR replaces the old CDF protocol and metadata checks with checks that go through TableConfiguration. This paves the way for checking deletion vector enablement and in-commit timestamp enablement, and for verifying column mapping support, in the future.

This PR also refactors the LogReplay logic to simplify its design. The LogReplayScanner is removed. Instead, each commit file is mapped to a ProcessedCdfCommit, which can then be mapped to an Iterator<TableChangesScanBatch>.
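
As a rough illustration of the two-phase shape described above (commit file, then processed commit, then an iterator of batches), here is a minimal sketch. All type and function names in it (`CommitFile`, `ProcessedCommit`, `ScanBatch`, `replay`) are hypothetical stand-ins and not the kernel's actual API; the real types carry much more state.

```rust
// Hypothetical stand-ins for illustration only; not the kernel's real types.
#[derive(Debug, Clone, PartialEq)]
pub struct CommitFile {
    pub version: u64,
    pub actions: Vec<String>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ProcessedCommit {
    version: u64,
    actions: Vec<String>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ScanBatch {
    pub version: u64,
    pub action: String,
}

type BatchIter = Box<dyn Iterator<Item = Result<ScanBatch, String>>>;

impl ProcessedCommit {
    /// Phase 1: inspect the commit once and keep what phase 2 needs.
    pub fn try_new(commit: CommitFile) -> Result<Self, String> {
        if commit.actions.is_empty() {
            return Err(format!("commit {} has no actions", commit.version));
        }
        Ok(Self { version: commit.version, actions: commit.actions })
    }

    /// Phase 2: turn the processed commit into a stream of batches.
    pub fn into_batches(self) -> impl Iterator<Item = ScanBatch> {
        let version = self.version;
        self.actions
            .into_iter()
            .map(move |action| ScanBatch { version, action })
    }
}

/// Chains both phases over every commit; a failed commit yields one `Err`.
pub fn replay(commits: Vec<CommitFile>) -> impl Iterator<Item = Result<ScanBatch, String>> {
    commits.into_iter().flat_map(|commit| -> BatchIter {
        match ProcessedCommit::try_new(commit) {
            Ok(processed) => Box::new(processed.into_batches().map(Ok)),
            Err(e) => Box::new(std::iter::once(Err(e))),
        }
    })
}
```

With this shape there is no separate scanner object holding state between the phases; each commit is consumed by value as it moves through the pipeline.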

Depends on: #644
Please only look at these commits

How was this change tested?

This is a refactor. All existing tests pass.

codecov bot commented Jan 15, 2025

Codecov Report

Attention: Patch coverage is 91.08527% with 23 lines in your changes missing coverage. Please review.

Project coverage is 84.31%. Comparing base (3305d3a) to head (7243c36).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| kernel/src/table_changes/log_replay.rs | 89.51% | 0 Missing and 15 partials ⚠️ |
| kernel/src/actions/visitors.rs | 41.66% | 0 Missing and 7 partials ⚠️ |
| kernel/src/table_changes/scan.rs | 87.50% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #645      +/-   ##
==========================================
+ Coverage   84.22%   84.31%   +0.09%     
==========================================
  Files          77       77              
  Lines       17694    17772      +78     
  Branches    17694    17772      +78     
==========================================
+ Hits        14902    14984      +82     
+ Misses       2080     2075       -5     
- Partials      712      713       +1     


@OussamaSaoudi OussamaSaoudi force-pushed the table_changes_refactor branch from 774fd50 to 23f4b6f Compare January 15, 2025 00:20
Comment on lines +35 to +42
configuration: HashMap::from([
("delta.enableChangeDataFeed".to_string(), "true".to_string()),
(
"delta.enableDeletionVectors".to_string(),
"true".to_string(),
),
("delta.columnMapping.mode".to_string(), "none".to_string()),
Collaborator

aside: I wish rust had better (more transparent) handling of String vs. &str... code like this gets so ugly and unwieldy.

Collaborator Author

yep the to_string everywhere is kind of gross, but at least we're aware of where the allocations are happening
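
One way to keep test setup like this readable while still making the allocations explicit is a small helper that converts borrowed pairs into an owned map. The helper name `config_from` is hypothetical and not part of the PR; this is only a sketch of the idea.

```rust
use std::collections::HashMap;

/// Hypothetical test helper: builds an owned configuration map from
/// borrowed key/value pairs. Every allocation happens in this one place.
pub fn config_from<'a>(
    pairs: impl IntoIterator<Item = (&'a str, &'a str)>,
) -> HashMap<String, String> {
    pairs
        .into_iter()
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect()
}
```

A call site then reads `config_from([("delta.enableChangeDataFeed", "true"), ("delta.columnMapping.mode", "none")])`, with no `to_string()` noise per entry.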

@@ -28,6 +28,30 @@ fn get_schema() -> StructType {
])
}

fn table_config() -> TableConfiguration {
let schema_string = serde_json::to_string(&get_schema()).unwrap();
let metadata = Metadata {
Collaborator

Out of curiosity, why is Metadata infallible while Protocol is fallible? Seems like any number of things could go wrong with it, such as partition column names that aren't actually in the schema?

Collaborator

Part of me wonders if we should make a clean break between P&M as "plain old data" vs. table configuration as the one stop shop for validation of that raw data? So then, any time you see a P or M in isolation, you have to assume it's broken in arbitrary ways. It's only "for sure" self-consistent and valid if a TC successfully wrapped it.

Collaborator Author

I think Metadata just doesn't have a constructor for kernel/rust data that does these checks. All it has is pub fn try_new_from_data(data: &dyn EngineData)

Collaborator Author

Part of me wonders if we should make a clean break between P&M as "plain old data" vs. table configuration as the one stop shop for validation of that raw data?

Ohhh so proposing that we move the validation that we do in Protocol::try_new to TC? Then this Metadata validation could also live there.

Moreover if we think of protocol as just "raw, unchecked data", I'm starting to wonder if we should move ensure_read_supported, has_writer_feature and has_reader_feature to TC?

Collaborator

Yeah... TC becomes the logical heart of the table, with P&M as simple raw information underneath
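
The split discussed in this thread (raw P&M underneath, a validating wrapper on top) might look roughly like the sketch below. The names `RawProtocol`, `RawMetadata`, and `CheckedConfig` are hypothetical, and the two checks are only examples; the actual validation rules are being tracked separately and are not shown here.

```rust
/// Hypothetical raw structs: plain data with no guarantees attached.
#[derive(Debug, Clone, PartialEq)]
pub struct RawProtocol {
    pub min_reader_version: i32,
    pub min_writer_version: i32,
}

#[derive(Debug, Clone, PartialEq)]
pub struct RawMetadata {
    pub schema_columns: Vec<String>,
    pub partition_columns: Vec<String>,
}

/// Hypothetical validated wrapper. Its fields are private, so the only way
/// to obtain one is through `try_new`, which runs every check.
#[derive(Debug, Clone, PartialEq)]
pub struct CheckedConfig {
    protocol: RawProtocol,
    metadata: RawMetadata,
}

impl CheckedConfig {
    pub fn try_new(protocol: RawProtocol, metadata: RawMetadata) -> Result<Self, String> {
        // Example check on the protocol alone.
        if protocol.min_reader_version < 1 || protocol.min_writer_version < 1 {
            return Err("protocol versions must be at least 1".to_string());
        }
        // Example check on the metadata: partition columns must be in the schema.
        if let Some(col) = metadata
            .partition_columns
            .iter()
            .find(|c| !metadata.schema_columns.contains(*c))
        {
            return Err(format!("partition column `{col}` is not in the schema"));
        }
        Ok(Self { protocol, metadata })
    }

    pub fn protocol(&self) -> &RawProtocol {
        &self.protocol
    }

    pub fn metadata(&self) -> &RawMetadata {
        &self.metadata
    }
}
```

Under this design, code that receives a bare `RawProtocol` or `RawMetadata` has to assume nothing about it, while code that receives a `CheckedConfig` can rely on the checks having passed.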

Collaborator Author

Note for anyone reading this thread, I'm tracking table config stuff in #650

Comment on lines 193 to 195
schema,
table_configuration: start_snapshot.table_configuration().clone(),
Collaborator

How does schema differ from table_configuration.schema()?
I'm seeing this pattern a lot in this new table configuration code. If they're the same, we should remove the redundant arg; otherwise we should document clearly why/how they could differ.

Collaborator Author

Schema is the schema of the CDF (taken from the end version). This table_configuration is from the start version. I'll change the naming to start_table_config to be clearer.

Comment on lines 360 to 361
Some([ReaderFeatures::DeletionVectors]),
Some([ReaderFeatures::ColumnMapping]),
Collaborator

Should one of those be WriterFeatures? Also, the spec says:

Reader features should be listed in both readerFeatures and writerFeatures simultaneously, while writer features should be listed only in writerFeatures. It is not allowed to list a feature only in readerFeatures but not in writerFeatures.

Collaborator Author

This seems to be the case for all reader features. If so, then ensure_read_supported should check that each reader feature appears in both the reader and the writer feature lists for a given protocol.
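
A minimal sketch of the check implied by the quoted spec rule, assuming features are plain strings. The function name is hypothetical, and this is not how `ensure_read_supported` is currently written.

```rust
use std::collections::HashSet;

/// Hypothetical check: every feature in `reader` must also appear in `writer`.
/// A missing list is treated as empty.
pub fn reader_features_listed_in_writer(
    reader: Option<&[String]>,
    writer: Option<&[String]>,
) -> Result<(), String> {
    let reader = reader.unwrap_or(&[]);
    let writer: HashSet<&str> = writer
        .unwrap_or(&[])
        .iter()
        .map(String::as_str)
        .collect();
    match reader.iter().find(|f| !writer.contains(f.as_str())) {
        Some(missing) => Err(format!(
            "reader feature `{missing}` is not listed in writerFeatures"
        )),
        None => Ok(()),
    }
}
```

Writer-only features pass this check untouched, since the rule only constrains what appears in the reader list.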

@@ -39,34 +40,32 @@ impl TableConfiguration {
column_mapping_mode,
})
}
pub fn with_protocol(self, protocol: impl Into<Option<Protocol>>) -> DeltaResult<Self> {
pub fn with_protocol(
Collaborator

aside: this was in the parent diff as well. Either the link in the PR description has the wrong set of commits, or this code leaked accidentally into the previous PR?

// Note: We do not perform data skipping yet because we need to visit all add and
// remove actions for deletion vector resolution to be correct.
//
// Consider a scenario with a pair of add/remove actions with the same path. The add
Collaborator

NOTE: Indentation change, best reviewed with whitespace changes hidden.

@OussamaSaoudi OussamaSaoudi force-pushed the table_changes_refactor branch from 95cac9c to 5474769 Compare January 17, 2025 06:08
@github-actions github-actions bot added the breaking-change Change that will require a version bump label Jan 17, 2025
Some::<Vec<String>>(vec![]),
Some::<Vec<String>>(vec![]),
Collaborator Author

Note: Creating a Protocol from None is actually pretty challenging, since the compiler can't infer the type T in Option::<T>::None. Also, the type try_new takes for reader/writer features is Option<impl IntoIterator<Item = impl Into<String>>>. Shouldn't we be using impl ToString?
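
The inference problem can be reproduced with a small stand-in function that takes the same parameter type. `collect_features` below is hypothetical and only mirrors the signature shape quoted in this comment.

```rust
/// Hypothetical function with the same parameter shape as described above.
pub fn collect_features(
    features: Option<impl IntoIterator<Item = impl Into<String>>>,
) -> Option<Vec<String>> {
    features.map(|list| list.into_iter().map(Into::into).collect())
}

/// A bare `None` gives the compiler nothing to infer the iterator type from,
/// so `collect_features(None)` does not compile. Annotating the `None` works.
pub fn no_features() -> Option<Vec<String>> {
    collect_features(None::<Vec<String>>)
}
```

Because the parameter uses `impl Trait`, the generic cannot be named with a turbofish on the function itself, which is why the annotation ends up on the `None` (or on `Some::<Vec<String>>(vec![])` as in the diff).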

Collaborator

Into<String> and ToString have different characteristics, especially wrt String:

  • Into<String> for String is a move
  • ToString for String makes a copy
  • More types impl ToString than impl Into<String>

Dunno which we "should" use here...
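
The difference between the two bounds can be seen in a short sketch. The function names are hypothetical; the behavior shown relies only on the standard library's `From<T> for T` identity conversion and on `ToString` producing a new `String`.

```rust
/// Accepts anything convertible into a String. For a `String` argument this
/// is the identity conversion, so the existing heap buffer is reused.
pub fn take_into(s: impl Into<String>) -> String {
    s.into()
}

/// Accepts anything displayable. The result is always a newly built String,
/// even when the argument already is one.
pub fn take_to_string(s: impl ToString) -> String {
    s.to_string()
}
```

`take_to_string(42)` compiles because integers implement `Display`, while `take_into(42)` does not, which is the "more types impl ToString" point from the list above.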

Collaborator Author

Hmm, I guess we should prefer moves. This is affected by the proposed change to move all logic to TableConfiguration, though, so I'll leave it for now and address it in #650.

@OussamaSaoudi OussamaSaoudi force-pushed the table_changes_refactor branch from 5474769 to 436852b Compare January 19, 2025 01:17
let string_list: DataType = ArrayType::new(STRING, false).into();
let string_string_map = MapType::new(STRING, STRING, false).into();
let str_list: DataType = ArrayType::new(STRING, false).into();
let str_str_map: DataType = MapType::new(STRING, STRING, false).into();
Collaborator Author

Note: Renaming so that the lines below don't get too long.

@OussamaSaoudi OussamaSaoudi marked this pull request as ready for review January 19, 2025 01:48
@OussamaSaoudi OussamaSaoudi changed the title [wip] refactor: Make CDF use TableConfiguration and refactor log replay refactor: Make CDF use TableConfiguration and refactor log replay Jan 19, 2025
@OussamaSaoudi OussamaSaoudi removed the breaking-change Change that will require a version bump label Jan 19, 2025
Collaborator

@scovich scovich left a comment

LGTM, except one possible latent bug to double-check

visitor.visit_rows_of(actions.as_ref())?;
let has_metadata = visitor.metadata.is_some();
match (visitor.protocol, visitor.metadata) {
(None, None) => {}
Collaborator

Suggested change
- (None, None) => {}
+ (None, None) => {} // no change

match (visitor.protocol, visitor.metadata) {
(None, None) => {}
(p, m) => {
let p = p.unwrap_or_else(|| table_configuration.protocol().clone());
Collaborator

Suggested change
- let p = p.unwrap_or_else(|| table_configuration.protocol().clone());
+ // at least one of P&M changed, so update the table configuration
+ let p = p.unwrap_or_else(|| table_configuration.protocol().clone());

Comment on lines 193 to 194
m,
p,
Collaborator

Given that we're anyway putting these on their own lines, probably don't need the let any more. Just move the code here directly.
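
A simplified sketch of the pattern under review, with the fallbacks inlined as suggested. `Config` and `apply_commit` are hypothetical, and strings stand in for the real protocol and metadata types.

```rust
/// Hypothetical stand-in for the table configuration.
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub protocol: String,
    pub metadata: String,
}

/// Returns the updated configuration, or `None` if the commit changed neither
/// protocol nor metadata. A missing half falls back to the current value.
pub fn apply_commit(
    current: &Config,
    protocol: Option<String>,
    metadata: Option<String>,
) -> Option<Config> {
    match (protocol, metadata) {
        (None, None) => None, // no change
        // At least one of P&M changed, so build an updated configuration.
        (p, m) => Some(Config {
            protocol: p.unwrap_or_else(|| current.protocol.clone()),
            metadata: m.unwrap_or_else(|| current.metadata.clone()),
        }),
    }
}
```

Using `unwrap_or_else` rather than `unwrap_or` means the clone of the current value only happens for the half that the commit did not provide.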

// The only (path, deletion_vector) pairs we must track are ones whose path is the
// same as an `add` action.
remove_dvs.retain(|rm_path, _| add_paths.contains(rm_path));
if has_metadata {
Collaborator

Maybe this makes it more self-describing?

Suggested change
- if has_metadata {
+ if metadata_changed {

Collaborator Author

Would this be better as this?

require!(
    !metadata_changed || table_schema.as_ref() == table_configuration.schema(),
    Error::change_data_feed_incompatible_schema(
        table_schema,
        table_configuration.schema()
    )
);

Collaborator Author

Hmm this may encourage an expensive swap

table_schema.as_ref() == table_configuration.schema() || !metadata_changed

I'll keep it using the if for now.
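
The concern here is about operand order: Rust's `||` evaluates left to right and short-circuits, so the cheap flag has to come first for the expensive comparison to be skipped. A small sketch, with a hypothetical `schema_ok` function and the schema comparison passed in as a closure:

```rust
/// Hypothetical stand-in for the condition above. `compare` plays the role of
/// the potentially expensive schema equality check.
pub fn schema_ok(metadata_changed: bool, compare: impl FnOnce() -> bool) -> bool {
    // `compare` runs only when `metadata_changed` is true.
    !metadata_changed || compare()
}
```

Writing the operands the other way round gives the same boolean result but runs the comparison on every commit, which is the "expensive swap" being avoided by keeping the explicit `if`.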

Comment on lines +380 to +381
getters[14].get_int(i, "protocol.min_reader_version")?
{
let protocol =
ProtocolVisitor::visit_protocol(i, min_reader_version, &getters[12..=15])?;
ProtocolVisitor::visit_protocol(i, min_reader_version, &getters[14..=17])?;
Collaborator

This looks wrong (existing bug)? Wouldn't we read [14] twice (or [12] previously)?
Or does it demand the full list even tho it won't use the first one?

Collaborator Author

visit_protocol expects all fields of protocol to be passed in, even though it only reads elements 1, 2, and 3 (it ignores 0).

visit_protocol:

require!(
    getters.len() == 4,
    Error::InternalError(format!(
        "Wrong number of ProtocolVisitor getters: {}",
        getters.len()
    ))
);
let min_writer_version: i32 = getters[1].get(row_index, "protocol.min_writer_version")?;
let reader_features: Option<Vec<_>> =
    getters[2].get_opt(row_index, "protocol.reader_features")?;
let writer_features: Option<Vec<_>> =
    getters[3].get_opt(row_index, "protocol.writer_features")?;

Protocol::try_new(
    min_reader_version,
    min_writer_version,
    reader_features,
    writer_features,
)

I could change this, but maybe that could be another PR?

Collaborator

@scovich scovich Jan 22, 2025

No need -- I was just verifying that we didn't have a double-use of that field by mentioning it twice.
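
For readers following this thread, the calling convention can be sketched with a slice pattern that makes the ignored first element explicit. This is a simplified stand-in, not the kernel's `visit_protocol`: plain `&str` values play the role of the getters, and the function only shows which positions are read.

```rust
/// Hypothetical stand-in: takes all four protocol columns but reads only the
/// last three. The caller has already read the first (min_reader_version).
pub fn visit_protocol<'a>(
    min_reader_version: i32,
    getters: &[&'a str],
) -> Result<(i32, [&'a str; 3]), String> {
    // The pattern matches only slices of exactly four elements.
    let [_already_read, min_writer, reader_features, writer_features] = getters else {
        return Err(format!(
            "Wrong number of ProtocolVisitor getters: {}",
            getters.len()
        ));
    };
    Ok((
        min_reader_version,
        [*min_writer, *reader_features, *writer_features],
    ))
}
```

A caller that has read position `n` for the reader version passes the inclusive range `n..=n + 3`, which matches the `&getters[14..=17]` call in the diff: the overlap at the first index is expected and does not cause a second read.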

@OussamaSaoudi OussamaSaoudi force-pushed the table_changes_refactor branch from fae5a21 to 0ece561 Compare January 22, 2025 01:03
@github-actions github-actions bot added the breaking-change Change that will require a version bump label Jan 22, 2025
@OussamaSaoudi OussamaSaoudi force-pushed the table_changes_refactor branch from 0ece561 to eb925a1 Compare January 28, 2025 03:23
Labels
breaking-change Change that will require a version bump
2 participants