refactor: Make CDF use TableConfiguration and refactor log replay #645
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #645 +/- ##
==========================================
+ Coverage 84.22% 84.31% +0.09%
==========================================
Files 77 77
Lines 17694 17772 +78
Branches 17694 17772 +78
==========================================
+ Hits 14902 14984 +82
+ Misses 2080 2075 -5
- Partials 712 713 +1
configuration: HashMap::from([
    ("delta.enableChangeDataFeed".to_string(), "true".to_string()),
    (
        "delta.enableDeletionVectors".to_string(),
        "true".to_string(),
    ),
    ("delta.columnMapping.mode".to_string(), "none".to_string()),
aside: I wish Rust had better (more transparent) handling of `String` vs. `&str`... code like this gets so ugly and unwieldy.
yep, the `to_string` everywhere is kind of gross, but at least we're aware of where the allocations are happening
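One way to keep the allocations visible while cutting the repetition is a small conversion helper. This is only a sketch: `string_map` and `example_configuration` are hypothetical names, not part of the kernel, and the map contents are copied from the snippet above.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not a kernel API): turns borrowed key/value pairs into
/// an owned map, so the two allocations per entry happen in one visible place.
fn string_map<'a>(
    pairs: impl IntoIterator<Item = (&'a str, &'a str)>,
) -> HashMap<String, String> {
    pairs
        .into_iter()
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect()
}

/// The same three properties as the test configuration in the diff.
fn example_configuration() -> HashMap<String, String> {
    string_map([
        ("delta.enableChangeDataFeed", "true"),
        ("delta.enableDeletionVectors", "true"),
        ("delta.columnMapping.mode", "none"),
    ])
}
```

The call site stays free of `to_string`, and the cost is still easy to find because it lives in a single function.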
@@ -28,6 +28,30 @@ fn get_schema() -> StructType {
    ])
}

fn table_config() -> TableConfiguration {
    let schema_string = serde_json::to_string(&get_schema()).unwrap();
    let metadata = Metadata {
Out of curiosity, why is Metadata infallible while Protocol is fallible? Seems like any number of things could go wrong with it, such as partition column names that aren't actually in the schema?
Part of me wonders if we should make a clean break between P&M as "plain old data" vs. table configuration as the one stop shop for validation of that raw data? So then, any time you see a P or M in isolation, you have to assume it's broken in arbitrary ways. It's only "for sure" self-consistent and valid if a TC successfully wrapped it.
I think Metadata just doesn't have a constructor for kernel/rust data that does these checks. All it has is pub fn try_new_from_data(data: &dyn EngineData)
> Part of me wonders if we should make a clean break between P&M as "plain old data" vs. table configuration as the one stop shop for validation of that raw data?

Ohhh, so you're proposing that we move the validation that we do in `Protocol::try_new` to TC? Then this `Metadata` validation could also live there.
Moreover, if we think of protocol as just "raw, unchecked data", I'm starting to wonder if we should move `ensure_read_supported`, `has_writer_feature` and `has_reader_feature` to TC?
Yeah... TC becomes the logical heart of the table, with P&M as simple raw information underneath
Note for anyone reading this thread, I'm tracking table config stuff in #650
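The split discussed in this thread could look roughly like the sketch below. All types here (`RawProtocol`, `RawMetadata`, `ValidatedConfig`) are hypothetical, heavily simplified stand-ins and not the kernel's `Protocol`, `Metadata`, or `TableConfiguration`; the two checks are illustrative only, with the partition-column check taken from the question earlier in the thread.

```rust
/// Hypothetical "plain old data" protocol: no validation of its own.
#[derive(Debug, Clone, PartialEq)]
struct RawProtocol {
    min_reader_version: i32,
}

/// Hypothetical "plain old data" metadata: no validation of its own.
#[derive(Debug, Clone, PartialEq)]
struct RawMetadata {
    schema_columns: Vec<String>,
    partition_columns: Vec<String>,
}

/// Can only be built through `try_new`, so holding one implies the checks passed.
#[derive(Debug)]
struct ValidatedConfig {
    protocol: RawProtocol,
    metadata: RawMetadata,
}

impl ValidatedConfig {
    fn try_new(protocol: RawProtocol, metadata: RawMetadata) -> Result<Self, String> {
        // Illustrative protocol check (an assumption made for this sketch).
        if protocol.min_reader_version < 1 {
            return Err(format!(
                "invalid min_reader_version: {}",
                protocol.min_reader_version
            ));
        }
        // Metadata check: every partition column must exist in the schema.
        for column in &metadata.partition_columns {
            if !metadata.schema_columns.contains(column) {
                return Err(format!("partition column `{column}` is not in the schema"));
            }
        }
        Ok(Self { protocol, metadata })
    }

    fn protocol(&self) -> &RawProtocol {
        &self.protocol
    }

    fn metadata(&self) -> &RawMetadata {
        &self.metadata
    }
}
```

With this shape, a bare `RawProtocol` or `RawMetadata` carries no guarantees, and every consumer that needs a self-consistent view goes through the wrapper.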
kernel/src/table_changes/mod.rs
Outdated
schema,
table_configuration: start_snapshot.table_configuration().clone(),
How does `schema` differ from `table_configuration.schema()`?
I'm seeing this pattern a lot in this new table configuration code. If they're the same, we should remove the redundant arg; otherwise we should document clearly why/how they could differ.
`schema` is the schema of the CDF (taken from the end version). This `table_configuration` is from the start version. I'll rename it to `start_table_config` to be clearer.
Some([ReaderFeatures::DeletionVectors]),
Some([ReaderFeatures::ColumnMapping]),
Should one of those be `WriterFeatures`? Also, the spec says:

> Reader features should be listed in both `readerFeatures` and `writerFeatures` simultaneously, while writer features should be listed only in `writerFeatures`. It is not allowed to list a feature only in `readerFeatures` but not in `writerFeatures`.
This seems to be the case for all reader features. If so, then `ensure_read_supported` should be checking that the reader features exist in both the reader and the writer features lists for a given protocol.
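A minimal sketch of the check suggested here, assuming features are compared by name. The function name is hypothetical and this is not how `ensure_read_supported` is currently written.

```rust
/// Hypothetical helper: returns every reader feature that is absent from the
/// writer feature list. An empty result means the protocol satisfies the rule
/// that reader features must be listed in both places.
fn reader_features_missing_from_writer<'a>(
    reader_features: &[&'a str],
    writer_features: &[&str],
) -> Vec<&'a str> {
    reader_features
        .iter()
        .copied()
        .filter(|feature| !writer_features.iter().any(|w| w == feature))
        .collect()
}
```

A caller could turn a non-empty result into an error that names the offending features, which is more helpful than a plain boolean.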
kernel/src/table_configuration.rs
Outdated
@@ -39,34 +40,32 @@ impl TableConfiguration {
            column_mapping_mode,
        })
    }
    pub fn with_protocol(self, protocol: impl Into<Option<Protocol>>) -> DeltaResult<Self> {
    pub fn with_protocol(
aside: this was in the parent diff as well. Either the link in the PR description has the wrong set of commits, or this code leaked accidentally into the previous PR?
// Note: We do not perform data skipping yet because we need to visit all add and
// remove actions for deletion vector resolution to be correct.
//
// Consider a scenario with a pair of add/remove actions with the same path. The add
Some::<Vec<String>>(vec![]),
Some::<Vec<String>>(vec![]),
Note: Creating `Protocol` from `None` is actually pretty challenging since it doesn't know the type of `T` in `Option::<T>::None`. Also, the type `try_new` takes for reader/writer features is `Option<impl IntoIterator<Item = impl Into<String>>>`. Shouldn't we be using `impl ToString`?
`Into<String>` and `ToString` have different characteristics, especially wrt `String`:
- `Into<String> for String` is a move
- `ToString for String` makes a copy
- More types impl `ToString` than impl `Into<String>`

Dunno which we "should" use here...
Hmm, I guess we should prefer moves. This is affected by the proposed changes to move all logic to `TableConfiguration`, though, so I'll leave it for now and address it in #650.
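The trade-offs in this thread can be seen in a small standalone sketch. The function names below are hypothetical; only the parameter type of `collect_features` mirrors the signature quoted above, and none of this is the kernel's `Protocol::try_new`.

```rust
/// Accepts anything convertible into an owned `String`.
/// For an owned `String` argument this is a move: the heap buffer is reused.
fn via_into(value: impl Into<String>) -> String {
    value.into()
}

/// Accepts anything that implements `ToString` (every `Display` type).
/// Always produces a freshly allocated `String`, even from a `String`.
fn via_to_string(value: impl ToString) -> String {
    value.to_string()
}

/// Hypothetical function with the same parameter shape as the quoted
/// reader/writer feature arguments.
fn collect_features(
    features: Option<impl IntoIterator<Item = impl Into<String>>>,
) -> Option<Vec<String>> {
    features.map(|list| list.into_iter().map(Into::into).collect())
}
```

Because the parameter is generic, a bare `None` gives the compiler nothing to infer the iterator type from, so callers need something like `collect_features(None::<Vec<String>>)`. Note also that `via_to_string(42)` compiles while `via_into(42)` does not, since integers implement `ToString` but not `Into<String>`.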
let string_list: DataType = ArrayType::new(STRING, false).into();
let string_string_map = MapType::new(STRING, STRING, false).into();
let str_list: DataType = ArrayType::new(STRING, false).into();
let str_str_map: DataType = MapType::new(STRING, STRING, false).into();
Note: Renaming so that the lines below don't get too long.
LGTM, except one possible latent bug to double-check
visitor.visit_rows_of(actions.as_ref())?;
let has_metadata = visitor.metadata.is_some();
match (visitor.protocol, visitor.metadata) {
    (None, None) => {}
(None, None) => {}
(None, None) => {} // no change
match (visitor.protocol, visitor.metadata) {
    (None, None) => {}
    (p, m) => {
        let p = p.unwrap_or_else(|| table_configuration.protocol().clone());
let p = p.unwrap_or_else(|| table_configuration.protocol().clone());
// at least one of P&M changed, so update the table configuration
let p = p.unwrap_or_else(|| table_configuration.protocol().clone());
m,
p,
Given that we're anyway putting these on their own lines, probably don't need the `let` any more. Just move the code here directly.
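For readers following the `match (protocol, metadata)` pattern in the snippets above, here is a simplified, self-contained sketch of the idea. `Config` and `apply_commit` are hypothetical, strings stand in for the real protocol and metadata types, and the metadata fallback is an assumption made to keep the example symmetric.

```rust
/// Hypothetical stand-in for a table configuration.
#[derive(Debug, Clone, PartialEq)]
struct Config {
    protocol: String,
    metadata: String,
}

/// Applies the protocol/metadata actions found in one commit, if any.
fn apply_commit(current: Config, protocol: Option<String>, metadata: Option<String>) -> Config {
    match (protocol, metadata) {
        (None, None) => current, // no change
        // At least one of P&M changed, so build an updated configuration,
        // falling back to the current value for whichever one is missing.
        (p, m) => Config {
            protocol: p.unwrap_or_else(|| current.protocol.clone()),
            metadata: m.unwrap_or_else(|| current.metadata.clone()),
        },
    }
}
```

Using `unwrap_or_else` means the clone only happens when the fallback is actually needed.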
// The only (path, deletion_vector) pairs we must track are ones whose path is the
// same as an `add` action.
remove_dvs.retain(|rm_path, _| add_paths.contains(rm_path));
if has_metadata {
Maybe this makes it more self-describing?
if has_metadata {
if metadata_changed {
Would this be better as this?
require!(
!metadata_changed || table_schema.as_ref() == table_configuration.schema(),
Error::change_data_feed_incompatible_schema(
table_schema,
table_configuration.schema()
)
);
Hmm, this may encourage someone to swap the operands later, which would evaluate the expensive schema comparison first:
table_schema.as_ref() == table_configuration.schema() || !metadata_changed
I'll keep it using the `if` for now.
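The concern is about short-circuit evaluation: `||` only evaluates its right-hand side when the left-hand side is false. A small sketch with a counter makes the difference observable; the function names and the slice-based "schema" are hypothetical.

```rust
use std::cell::Cell;

/// Stand-in for an expensive schema comparison; counts how often it runs.
fn schemas_equal(a: &[&str], b: &[&str], comparisons: &Cell<u32>) -> bool {
    comparisons.set(comparisons.get() + 1);
    a == b
}

/// Cheap flag first: the comparison is skipped when metadata did not change.
fn check_cheap_first(changed: bool, a: &[&str], b: &[&str], n: &Cell<u32>) -> bool {
    !changed || schemas_equal(a, b, n)
}

/// Operands swapped: the comparison always runs, even when it cannot matter.
fn check_expensive_first(changed: bool, a: &[&str], b: &[&str], n: &Cell<u32>) -> bool {
    schemas_equal(a, b, n) || !changed
}
```

Both functions return the same boolean for every input; only the amount of work differs.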
getters[14].get_int(i, "protocol.min_reader_version")?
{
let protocol =
ProtocolVisitor::visit_protocol(i, min_reader_version, &getters[12..=15])?;
ProtocolVisitor::visit_protocol(i, min_reader_version, &getters[14..=17])?;
This looks wrong (existing bug)? Wouldn't we read `[14]` twice (or `[12]` previously)?
Or does it demand the full list even tho it won't use the first one?
`visit_protocol` expects all fields of protocol to be passed in, even though it only reads elements 1, 2, and 3 (it ignores 0).
`visit_protocol`:
require!(
getters.len() == 4,
Error::InternalError(format!(
"Wrong number of ProtocolVisitor getters: {}",
getters.len()
))
);
let min_writer_version: i32 = getters[1].get(row_index, "protocol.min_writer_version")?;
let reader_features: Option<Vec<_>> =
getters[2].get_opt(row_index, "protocol.reader_features")?;
let writer_features: Option<Vec<_>> =
getters[3].get_opt(row_index, "protocol.writer_features")?;
Protocol::try_new(
min_reader_version,
min_writer_version,
reader_features,
writer_features,
)
I could change this, but maybe that could be another PR?
No need -- I was just verifying that we didn't have a double-use of that field by mentioning it twice.
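To illustrate the indexing convention discussed above, here is a reduced sketch in which column names stand in for getters. This `visit_protocol` is a hypothetical simplification, not the kernel's function; only the length check and the "index 0 is ignored" behaviour are taken from the quoted code.

```rust
/// Hypothetical stand-in for the protocol visitor. `fields` must hold all four
/// protocol columns, but index 0 (min_reader_version) is not read again
/// because the caller has already extracted it.
fn visit_protocol<'a>(
    min_reader_version: i32,
    fields: &[&'a str],
) -> Result<(i32, &'a str, &'a str, &'a str), String> {
    if fields.len() != 4 {
        return Err(format!(
            "Wrong number of ProtocolVisitor getters: {}",
            fields.len()
        ));
    }
    // Indices are relative to the sub-slice, so `fields[1]` is column 15 when
    // the caller passes `&columns[14..=17]`.
    Ok((min_reader_version, fields[1], fields[2], fields[3]))
}
```

So reading `getters[14]` at the call site and then passing `&getters[14..=17]` does not read the field twice; the sub-slice just needs to cover the whole protocol struct.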
What changes are proposed in this pull request?
This PR changes the old CDF protocol and metadata checks to use `TableConfiguration`. This paves the way for checking deletion vector enablement, in-commit timestamp enablement, and verifying column mapping support in the future.
This PR also refactors the LogReplay logic to simplify its design. The `LogReplayScanner` is removed. Instead, each commit file is mapped to a `ProcessedCdfCommit`, which can then be mapped to `Iterator<TableChangesScanBatch>`.
Depends on: #644
Please only look at these commits
How was this change tested?
This is a refactor. All existing tests pass.