Split metadata tables into separate modules #872

Merged
merged 1 commit into from
Jan 7, 2025

@rshkv rshkv commented Jan 3, 2025

Split metadata tables into separate modules.

This addresses #863 (comment), which pointed out that metadata_scan.rs will grow unwieldy if every metadata table implementation lives there, especially as we add extra utilities for those tables.

The structure in this PR is:

inspect/
  metadata_table.rs: contains MetadataTable
  snapshots.rs: contains "snapshots" table
  manifests.rs: contains "manifests" table

In the future this can expand as described in #863 (comment).
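
As a rough illustration of the layout above, the three files can be pictured as sibling modules under `inspect`. This is a minimal sketch only: inline modules stand in for the separate files, the struct bodies and the `name` helper are hypothetical placeholders, and the re-exports are an assumption about what a module root might do, not something taken from the PR diff.

```rust
// Hypothetical sketch: inline modules stand in for the files under inspect/.
// All bodies are placeholders, not the actual implementation in this PR.
pub mod inspect {
    // Would live in inspect/metadata_table.rs
    pub mod metadata_table {
        use super::manifests::ManifestsTable;
        use super::snapshots::SnapshotsTable;

        /// Entry point that hands out the individual metadata tables.
        #[derive(Debug, Default)]
        pub struct MetadataTable;

        impl MetadataTable {
            pub fn snapshots(&self) -> SnapshotsTable {
                SnapshotsTable
            }

            pub fn manifests(&self) -> ManifestsTable {
                ManifestsTable
            }
        }
    }

    // Would live in inspect/snapshots.rs
    pub mod snapshots {
        #[derive(Debug)]
        pub struct SnapshotsTable;

        impl SnapshotsTable {
            /// Hypothetical helper, used here only to tell the tables apart.
            pub fn name(&self) -> &'static str {
                "snapshots"
            }
        }
    }

    // Would live in inspect/manifests.rs
    pub mod manifests {
        #[derive(Debug)]
        pub struct ManifestsTable;

        impl ManifestsTable {
            /// Hypothetical helper, used here only to tell the tables apart.
            pub fn name(&self) -> &'static str {
                "manifests"
            }
        }
    }

    // Assumed re-exports so callers do not need the full module paths.
    pub use manifests::ManifestsTable;
    pub use metadata_table::MetadataTable;
    pub use snapshots::SnapshotsTable;
}
```

With this shape, adding another metadata table means adding one more sibling module and one more accessor on `MetadataTable`, without touching the existing table modules.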

rshkv commented Jan 5, 2025

cc @liurenjie1024 @Xuanwo

@rshkv rshkv force-pushed the wr/metadata-split branch from e8cabed to be77803 Compare January 5, 2025 19:05

@liurenjie1024 liurenjie1024 left a comment

Thanks @rshkv for this PR, LGTM! Some issues are tracked elsewhere, such as #870; I just left one question to discuss.

/// - <https://iceberg.apache.org/docs/latest/spark-queries/#querying-with-sql>
/// - <https://py.iceberg.apache.org/api/#inspecting-tables>
#[derive(Debug)]
pub struct MetadataTable(Table);

I don't quite understand why we need this struct here. It seems to be just a wrapper that exposes more API, when we could simply do this:

impl Table {
    pub fn snapshots(&self) -> SnapshotsTable {
        ...
    }
    pub fn manifests(&self) -> ManifestsTable {
        ...
    }
}

cc @Xuanwo @sdd @Fokko What do you think?

@Xuanwo Xuanwo Jan 7, 2025

> I don't quite understand why we need this data struct here. It seems just a wrapper to provide more api, while just like this:

Yes, I intend to make the API exposed on Table more organized. For example, users will have:

table.metadata_table().snapshots();
table.metadata_table().manifests();

instead of:

// Could be confused with `table.metadata().snapshots()` which returns `Snapshot`.
table.snapshots();

// Verbose and long
table.metadata_snapshots_table();
table.metadata_manifests_table();

While I believe we could use better API names, such as table.inspect().snapshots(), the overall structure looks good to me and aligns better with other implementations.
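
The namespacing argument can be sketched as follows. This is a simplified illustration, not the actual iceberg-rust code: `Table::new`, the `rows` method, the field layout, and the use of `clone` inside `metadata_table` are all hypothetical stand-ins. Only the `MetadataTable(Table)` newtype shape and the `metadata_table().snapshots()` call chain come from the discussion.

```rust
// Hypothetical stand-ins; the real types carry far more state.
#[derive(Debug, Clone, PartialEq)]
pub struct Snapshot {
    pub snapshot_id: i64,
}

#[derive(Debug, Clone)]
pub struct TableMetadata {
    snapshots: Vec<Snapshot>,
}

impl TableMetadata {
    /// Returns the raw `Snapshot` entries (not a queryable table).
    pub fn snapshots(&self) -> &[Snapshot] {
        &self.snapshots
    }
}

#[derive(Debug, Clone)]
pub struct Table {
    metadata: TableMetadata,
}

impl Table {
    /// Hypothetical constructor for the sketch.
    pub fn new(snapshot_ids: &[i64]) -> Self {
        let snapshots = snapshot_ids
            .iter()
            .map(|&snapshot_id| Snapshot { snapshot_id })
            .collect();
        Table {
            metadata: TableMetadata { snapshots },
        }
    }

    pub fn metadata(&self) -> &TableMetadata {
        &self.metadata
    }

    /// Single accessor that groups every metadata table.
    /// Cloning here is an assumption made to keep the sketch short.
    pub fn metadata_table(&self) -> MetadataTable {
        MetadataTable(self.clone())
    }
}

/// Newtype wrapper with the same shape as the struct under review.
#[derive(Debug)]
pub struct MetadataTable(Table);

impl MetadataTable {
    pub fn snapshots(&self) -> SnapshotsTable<'_> {
        SnapshotsTable { table: &self.0 }
    }
}

/// The "snapshots" metadata table, distinct from `&[Snapshot]`.
pub struct SnapshotsTable<'a> {
    table: &'a Table,
}

impl SnapshotsTable<'_> {
    /// Placeholder for a scan: one row (the snapshot id) per snapshot.
    pub fn rows(&self) -> Vec<i64> {
        self.table
            .metadata()
            .snapshots()
            .iter()
            .map(|s| s.snapshot_id)
            .collect()
    }
}
```

Here `table.metadata().snapshots()` and `table.metadata_table().snapshots()` return different types, and the intermediate accessor makes that difference visible at the call site without putting a second `snapshots` method on `Table` itself.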

Sounds reasonable to me.

Contributor Author

The rename from metadata_table to inspect is in #881.

}

/// Returns the schema of the snapshots table.
pub fn schema(&self) -> Schema {

Same question as #868, but we could defer this to a later issue.

}

/// Scans the snapshots table.
pub fn scan(&self) -> Result<RecordBatch> {

Same as #870


@liurenjie1024 liurenjie1024 left a comment

Thanks @rshkv for this PR, LGTM!

@liurenjie1024

I'll merge this first as it's a pure refactoring.

@liurenjie1024 liurenjie1024 merged commit 25e8909 into apache:main Jan 7, 2025
16 checks passed