feat: add schema evolution to merge statement #3136

JustinRush80 · 2025-01-16T13:53:09Z

Description

Add schema evolution (merge mode only) to the MERGE operation. New columns are added based on the column predicates in the MERGE operations (e.g. target.id = source.id). Using when_not_matched_insert_all and when_matched_update_all will add any new source column to the target schema.
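To make the described behaviour concrete, here is a minimal, library-free sketch of the idea. It is not the PR's implementation: `Field` and `evolve_schema` are hypothetical names, and the schema is modelled as a plain list. It assumes that every column referenced by an update/insert operation and missing from the target is copied from the source and marked nullable, as the reviewed diff suggests with `with_nullable(true)`.

```python
from dataclasses import dataclass


@dataclass
class Field:
    """Hypothetical stand-in for a schema field (not a delta-rs type)."""
    name: str
    dtype: str
    nullable: bool = True


def evolve_schema(target, source, assigned_columns):
    """Return the target fields plus any assigned column missing from the target.

    New fields are taken from the source schema and forced to nullable,
    because existing rows in the target have no value for them.
    """
    target_names = {f.name for f in target}
    source_by_name = {f.name: f for f in source}
    evolved = list(target)
    for col in assigned_columns:
        if col in target_names:
            continue  # already part of the target schema
        if col not in source_by_name:
            raise KeyError(f"column {col!r} exists in neither target nor source")
        src = source_by_name[col]
        evolved.append(Field(src.name, src.dtype, nullable=True))
        target_names.add(col)
    return evolved
```

With `when_matched_update_all` or `when_not_matched_insert_all`, the assigned columns would simply be all source columns, so every new source column ends up in the evolved schema.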

Related Issue(s)

closes Schema evolution on upsert (merge) #2282

Documentation

ion-elgreco · 2025-01-17T09:52:14Z

@JustinRush80 can you rebase your branch against main, or allow us to rebase it?

I will do a thorough review tomorrow then :)

ion-elgreco · 2025-01-17T13:53:54Z

@JustinRush80 could you rebase again? Something went wrong, since the number of files changed is huge.

codecov · 2025-01-18T00:51:27Z

Codecov Report

Attention: Patch coverage is 94.85792% with 38 lines in your changes missing coverage. Please review.

Project coverage is 72.12%. Comparing base (523c6d7) to head (dbebc49).

Files with missing lines                  Patch %   Lines
crates/core/src/operations/merge/mod.rs   95.63%    2 Missing and 30 partials ⚠️
python/src/merge.rs                       0.00%     4 Missing ⚠️
python/src/lib.rs                         0.00%     2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3136      +/-   ##
==========================================
+ Coverage   71.73%   72.12%   +0.38%     
==========================================
  Files         138      138              
  Lines       44362    45087     +725     
  Branches    44362    45087     +725     
==========================================
+ Hits        31825    32520     +695     
- Misses      10496    10504       +8     
- Partials     2041     2063      +22


crates/core/src/operations/merge/mod.rs

ion-elgreco

Thanks a lot for picking this up! Just a couple modifications required but looks good so far!

ion-elgreco · 2025-01-19T11:22:43Z

python/deltalake/table.py

@@ -972,6 +972,7 @@ def merge(
        predicate: str,
        source_alias: Optional[str] = None,
        target_alias: Optional[str] = None,
+        schema_mode: Optional[str] = None,


Let's make this simply "merge_schema: bool = False", since we only have one mode :)

I see! For both the Python and Rust APIs, or just the Python API?

ion-elgreco · 2025-01-19T11:25:36Z

python/tests/test_merge.py

+
+    assert last_action["operation"] == "MERGE"
+    assert result == expected
+


Can you add an assert on the new schema from the DeltaTable, as a sanity check in all the tests?

ion-elgreco · 2025-01-19T11:37:02Z

crates/core/src/operations/merge/mod.rs

+        )?;
+        let schema = Arc::new(schema_bulider.finish());
+        new_schema = Some(schema.clone());
+        if schema != snapshot.input_schema()? {


This might give false positives. A while ago I made merge_arrow_schema pass through the large/view types, but input_schema() will actually have the small types.

I think we should do the not_eq comparison on the Delta schema (StructType) instead. Can you also add a test where merge_schema is True and we write Large or View types to a table, but without any new columns? The result then shouldn't have a new schema action in the log history.
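The false positive can be illustrated with a small, library-free sketch. The type names and the `ARROW_TO_DELTA` mapping below are simplified assumptions for illustration, not the delta-rs conversion code. The point is that several Arrow physical types collapse to one Delta logical type, so comparing after conversion avoids reporting a change when only the physical encoding differs.

```python
# Assumed, simplified mapping from Arrow physical type names to Delta logical types.
ARROW_TO_DELTA = {
    "utf8": "string",
    "large_utf8": "string",
    "utf8_view": "string",
    "binary": "binary",
    "large_binary": "binary",
    "binary_view": "binary",
    "int32": "integer",
    "int64": "long",
}


def to_delta_schema(arrow_fields):
    """Convert (name, arrow_type) pairs to (name, delta_type) pairs."""
    return [(name, ARROW_TO_DELTA[arrow_type]) for name, arrow_type in arrow_fields]


def schema_changed(target_arrow, merged_arrow):
    """Compare on the logical (Delta) level instead of the physical (Arrow) level."""
    return to_delta_schema(target_arrow) != to_delta_schema(merged_arrow)


target = [("id", "int64"), ("name", "utf8")]
same_logical = [("id", "int64"), ("name", "large_utf8")]

# A direct Arrow comparison reports a difference, the Delta-level one does not.
arrow_level_differs = target != same_logical
delta_level_differs = schema_changed(target, same_logical)
```

Here `arrow_level_differs` is true while `delta_level_differs` is false, which is the situation the requested test (large/view types, no new columns) is meant to cover.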

@ion-elgreco is metrics.num_target_files_added the right field to see any new schema actions, or is there another way to see the added actions?

You would want to read the commit file, I think. Something like this should work:
let actions = crate::logstore::get_actions(version, self.read_commit_entry(version).await).await;
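For readers unfamiliar with the log format: a Delta commit file is newline-delimited JSON, one action per line, and a schema change shows up as a `metaData` action. The following is a rough sketch of the check such a test could make, working on the raw commit text instead of the Rust helper quoted above. `parse_commit` and `has_schema_change` are hypothetical names, and the sample commit contents are made up for illustration.

```python
import json


def parse_commit(text):
    """Parse a Delta commit file (newline-delimited JSON) into a list of actions."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def has_schema_change(actions):
    """A commit that rewrites the schema contains a `metaData` action."""
    return any("metaData" in action for action in actions)


# Made-up commit contents: a MERGE that only adds a data file.
commit_without_schema_change = "\n".join([
    json.dumps({"commitInfo": {"operation": "MERGE"}}),
    json.dumps({"add": {"path": "part-00000.parquet", "dataChange": True}}),
])

# Made-up commit contents: a MERGE that also evolves the schema.
commit_with_schema_change = "\n".join([
    json.dumps({"commitInfo": {"operation": "MERGE"}}),
    json.dumps({"metaData": {"id": "example", "schemaString": "{}"}}),
    json.dumps({"add": {"path": "part-00001.parquet", "dataChange": True}}),
])
```

A test for the large/view case would then assert that `has_schema_change` is false for the commit produced by the merge.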

ion-elgreco · 2025-01-19T11:40:05Z

crates/core/src/operations/merge/mod.rs

+    {
+        if target_schema.field_from_column(columns).is_err() {
+            let new_fields = source_schema.field_with_unqualified_name(columns.name())?;
+            ending_schema.push(new_fields.to_owned().with_nullable(true));


Before we insert the new fields in the schema, we should actually do some safety checks on the metadata. We cannot add new fields that enable generated columns by carrying generation expressions. You can look at the recent generated-columns PR to see how this is prevented.

Like returning an error message when the source data has a generated column and the end user wants to add it via schema evolution?
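A minimal sketch of the kind of guard being discussed, assuming fields are modelled as plain dictionaries. In the Delta protocol a generated column is marked by the `delta.generationExpression` key in the field metadata; `SchemaEvolutionError` and `check_new_fields` are hypothetical names and do not reflect how delta-rs implements the check.

```python
GENERATION_KEY = "delta.generationExpression"


class SchemaEvolutionError(ValueError):
    """Hypothetical error type for rejected schema changes."""


def check_new_fields(new_fields):
    """Reject new fields that would introduce a generated column.

    Each field is a dict with `name`, `type`, and an optional `metadata` dict.
    """
    for field in new_fields:
        if GENERATION_KEY in field.get("metadata", {}):
            raise SchemaEvolutionError(
                f"cannot add generated column {field['name']!r} via schema evolution"
            )
    return new_fields
```

Raising before the schema is extended keeps the table metadata untouched when the merge is rejected.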

ion-elgreco · 2025-01-19T11:41:56Z

crates/core/src/operations/merge/mod.rs

+        .filter(|ops| matches!(ops.r#type, OperationType::Update | OperationType::Insert))
+        .flat_map(|ops| ops.operations.keys())
+    {
+        if target_schema.field_from_column(columns).is_err() {


How about schema evolution in nested types such as structs? I think field_from_column only looks at top-level fields, doesn't it?

I believe so, but I will add a unit test and refactor if needed!

Yes, this process works for structs!
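To illustrate what schema evolution inside a struct means, here is a small recursive sketch. It is a hypothetical model, not the PR's code: a schema is a dict from field name to type, and a struct type is itself a dict. New fields are added at any nesting level, and conflicting types are rejected.

```python
def merge_struct(target, source):
    """Recursively add source fields that are missing from the target.

    Returns a new dict; the target is not modified.
    """
    merged = dict(target)
    for name, src_type in source.items():
        if name not in merged:
            merged[name] = src_type  # new field, top-level or nested
        elif isinstance(merged[name], dict) and isinstance(src_type, dict):
            merged[name] = merge_struct(merged[name], src_type)  # descend into struct
        elif merged[name] != src_type:
            raise TypeError(
                f"type mismatch for field {name!r}: {merged[name]!r} vs {src_type!r}"
            )
    return merged


target = {"id": "long", "address": {"city": "string"}}
source = {"id": "long", "address": {"city": "string", "zip": "string"}, "email": "string"}
merged = merge_struct(target, source)
```

In this example `merged` gains both the top-level `email` field and the nested `address.zip` field.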

ion-elgreco · 2025-01-19T11:43:41Z

crates/core/src/operations/merge/mod.rs

@@ -1714,6 +1829,183 @@ mod tests {

        assert_merge(table, metrics).await;
    }
+    #[tokio::test]
+    async fn test_merge_with_schema_mode_no_change_of_schema() {


We can extend this one to check that it doesn't update the schema if you use large/view Arrow types in the source.

ion-elgreco · 2025-01-19T12:11:58Z

crates/core/src/operations/merge/mod.rs

-                    .map(|c| Expr::Column(c.clone()))
-                    .collect_vec(),
-            )?
+            .select(select_columns)?


Maybe we don't need to keep track of null columns. I was doing some side improvements while working on streaming support for MERGE; you can see it here:

dead668#diff-12f59fe3c4440b7ae4ee1a5ac810b42c1d7357c246aae7b5770e840e52d3ec52R1218-R1230.

It essentially boils down to projecting early with the required metadata columns to filter down for cdf, and then we drop just these columns after the filter.
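The project, filter, drop pattern described above can be sketched on plain rows. This is only an illustration of the ordering of the steps; the function name and the `__op` metadata column are hypothetical, and the real change operates on a DataFusion plan, not Python lists.

```python
def project_filter_drop(rows, metadata_columns, predicate):
    """Add metadata columns early, filter on them, then drop only those columns.

    `metadata_columns` maps a column name to a function computing it from a row.
    """
    # 1. Project early: attach the metadata columns needed by the filter.
    projected = [
        {**row, **{name: fn(row) for name, fn in metadata_columns.items()}}
        for row in rows
    ]
    # 2. Filter using the metadata columns.
    kept = [row for row in projected if predicate(row)]
    # 3. Drop just the metadata columns, leaving the data columns untouched.
    return [
        {k: v for k, v in row.items() if k not in metadata_columns}
        for row in kept
    ]


rows = [{"id": 1, "value": "a"}, {"id": 2, "value": None}]
# Hypothetical metadata column marking rows that carry a value.
meta = {"__op": lambda row: "update" if row["value"] is not None else "noop"}
result = project_filter_drop(rows, meta, lambda row: row["__op"] == "update")
```

Because only the known metadata columns are dropped at the end, there is no need to track which data columns happen to be null.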

Great! I will take a look and refactor!

Yesterday the changes to the MERGE operation got merged. I have one more change lined up for streaming support, but I don't think that will affect you.

Signed-off-by: JustinRush80 <[email protected]>

github-actions bot added the binding/python and binding/rust labels Jan 16, 2025

JustinRush80 force-pushed the feat/schema_evo branch from fcc92b2 to 7f1b955 Compare January 16, 2025 13:56

JustinRush80 marked this pull request as ready for review January 16, 2025 14:16

JustinRush80 requested review from wjones127, fvaleye, roeap, ion-elgreco, rtyler and hntd187 as code owners January 16, 2025 14:16

JustinRush80 force-pushed the feat/schema_evo branch from 6daa9ec to 0a45cf0 Compare January 17, 2025 13:21

JustinRush80 force-pushed the feat/schema_evo branch from 0a45cf0 to 53042d8 Compare January 17, 2025 14:03

JustinRush80 commented Jan 18, 2025

ion-elgreco requested changes Jan 19, 2025

JustinRush80 added 7 commits January 22, 2025 21:13

fix merge conflict

67485b5

Signed-off-by: JustinRush80 <[email protected]>

fix merge conflict

dae7eff

Signed-off-by: JustinRush80 <[email protected]>

fix unit test and added another test for add actions

c2eafb7

Signed-off-by: JustinRush80 <[email protected]>

change schema_mode to merge_schema for both api

127c21b

Signed-off-by: JustinRush80 <[email protected]>

fix merge conflict

3beae1f

Signed-off-by: JustinRush80 <[email protected]>

comparison with structtype schema

0a6e4c6

Signed-off-by: JustinRush80 <[email protected]>

fix merge conflict

04cae01

Signed-off-by: JustinRush80 <[email protected]>

JustinRush80 force-pushed the feat/schema_evo branch from 7c88080 to 04cae01 Compare January 23, 2025 02:27

refactor MERGE cdf with schema evolution

dbebc49

Signed-off-by: JustinRush80 <[email protected]>
