Core, Spark: Scan only live entries in RewriteTablePathUtil #12006
Instead of scanning all entries in a manifest to identify the content files to copy, scan only the live ones. This is essential to keep rewrite table path from carrying over files that snapshot expiration has already removed in the source table.
The existing logic fetches all entries (added, existing, and deleted) from each manifest to collect the content files to copy, and relies on a reducer to deduplicate them by file name.
However, we want to avoid picking up a content file that appears only with deleted status in an older manifest: snapshot expiration may already have removed the snapshot that referenced the file, and deleted the file itself as part of that expiration.
As a concrete example: expiring the first snapshot, 8729031490038117099, removes d2.parquet from disk, but the second snapshot, 6024975807438659167, might still have a data manifest entry with deleted status (status=2) for d2.parquet. Including d2.parquet in the set of files for the path rewrite is not desired.
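The filtering idea behind the example above can be sketched in plain Java. This is a minimal illustration, not the code in this PR: `LiveEntriesSketch`, `Entry`, `isLive`, and `filesToCopy` are hypothetical names, and `d1.parquet` is an assumed second file added only for contrast. The enum order mirrors the manifest entry status codes (0 = existing, 1 = added, 2 = deleted).

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class LiveEntriesSketch {
  // Ordinals follow the manifest entry status codes: 0, 1, 2.
  enum Status { EXISTING, ADDED, DELETED }

  // Hypothetical stand-in for a manifest entry; not an Iceberg class.
  static final class Entry {
    final Status status;
    final String filePath;

    Entry(Status status, String filePath) {
      this.status = status;
      this.filePath = filePath;
    }
  }

  // An entry is live when its status is anything other than DELETED.
  static boolean isLive(Entry entry) {
    return entry.status != Status.DELETED;
  }

  // Collect the paths of live entries only; the set also deduplicates
  // a file that shows up in more than one manifest.
  static Set<String> filesToCopy(List<Entry> entries) {
    Set<String> paths = new LinkedHashSet<>();
    for (Entry entry : entries) {
      if (isLive(entry)) {
        paths.add(entry.filePath);
      }
    }
    return paths;
  }

  public static void main(String[] args) {
    List<Entry> manifest = List.of(
        new Entry(Status.EXISTING, "d1.parquet"),  // assumed file, still live
        new Entry(Status.DELETED, "d2.parquet"));  // status=2, may be gone from disk
    System.out.println(filesToCopy(manifest));     // prints [d1.parquet]
  }
}
```

With this filter, a file that only ever appears with deleted status, like d2.parquet here, never reaches the copy list, so the rewrite does not depend on a file that expiration may already have removed.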
CC @szehon-ho