
refactor: refine writer interface to support directory hierarchy #893

Draft

ZENOTME
Contributor

@ZENOTME ZENOTME commented Jan 16, 2025

This PR is intended to resolve #891. In our original design, the user creates the writer builders, combines them, and finally builds them. However, the original design does not support changing a builder's config after the builders have been combined. In some cases we want to recreate the writer for a different partition from the same writer builder. E.g. in #819, we want to regenerate the table location for each partition. So I follow the design from iceberg-java https://github.com/apache/iceberg/blob/d96901b843395fe669f6bd4f618f8e5e46c0eed4/core/src/main/java/org/apache/iceberg/io/BaseTaskWriter.java#L334 :

  1. separate the OutputFile from FileWriter.
  2. create a new abstract SinglePartitionWriterBuilder.
  3. introduce the OutputFileGenerator following https://github.com/apache/iceberg/blob/d96901b843395fe669f6bd4f618f8e5e46c0eed4/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java#L36

It's the responsibility of SinglePartitionWriterBuilder to create the OutputFile. The build function of SinglePartitionWriterBuilder takes a partition value, which means we can create writers for different partitions from the same builder.
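To illustrate the idea, here is a minimal sketch of a builder that receives the partition only at build time. All types and signatures below (`PartitionDir`, the `OutputFileGenerator` trait shape, `PartitionWriter`, `FlatGenerator`) are simplified assumptions for illustration and are not the actual code in this PR:

```rust
// Hypothetical, simplified stand-ins; not the actual iceberg-rust types.

/// A partition rendered as a relative directory, e.g. "col1=1".
#[derive(Debug, Clone, PartialEq)]
pub struct PartitionDir(pub String);

/// Decides where a new data file for a given partition should live.
pub trait OutputFileGenerator {
    fn generate(&self, partition: &PartitionDir, file_name: &str) -> String;
}

/// Assumed example generator: <data_dir>/<partition_dir>/<file_name>.
pub struct FlatGenerator {
    pub data_dir: String,
}

impl OutputFileGenerator for FlatGenerator {
    fn generate(&self, partition: &PartitionDir, file_name: &str) -> String {
        format!(
            "{}/{}/{}",
            self.data_dir.trim_end_matches('/'),
            partition.0,
            file_name
        )
    }
}

/// A writer bound to exactly one output location.
#[derive(Debug)]
pub struct PartitionWriter {
    pub location: String,
}

/// The builder owns the generator. The partition is not part of the
/// builder's config, so one builder can produce writers for many partitions.
pub struct SinglePartitionWriterBuilder<G: OutputFileGenerator> {
    generator: G,
    file_name: String,
}

impl<G: OutputFileGenerator> SinglePartitionWriterBuilder<G> {
    pub fn new(generator: G, file_name: impl Into<String>) -> Self {
        Self {
            generator,
            file_name: file_name.into(),
        }
    }

    /// Takes `&self`, so the builder stays usable after each call.
    pub fn build(&self, partition: &PartitionDir) -> PartitionWriter {
        PartitionWriter {
            location: self.generator.generate(partition, &self.file_name),
        }
    }
}
```

Because `build` borrows the builder instead of consuming it, a caller could keep one builder per task and call `build` once for every partition value it encounters.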

@hackintoshrao

hackintoshrao commented Jan 21, 2025

Thanks, @ZENOTME for the PR. I'm looking into it. Does the current approach handle multiple partition columns so that we can write into a multi-level directory structure? Or is it currently limited to just one partition column?

@ZENOTME
Contributor Author

ZENOTME commented Jan 21, 2025

> Thanks, @ZENOTME for the PR. I'm looking into it. Does the current approach handle multiple partition columns so that we can write into a multi-level directory structure? Or is it currently limited to just one partition column?

It can handle multiple levels. E.g. if we have the partition <col1=1,col2=2>, it will generate the location table/data/col1=1/col2=2. See:

fn partition_dir_path(partition: &Struct, partition_type: &StructType) -> Result<String> {

It also supports customizing the LocationGenerator.
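As a rough illustration of the multi-level layout, the sketch below joins one `name=value` segment per partition field. It is not the `partition_dir_path` function from this PR: the name `partition_dir_path_sketch` and its signature are hypothetical, it takes already-stringified values instead of `Struct` and `StructType`, and it does no escaping of special characters.

```rust
/// Hypothetical, simplified sketch: builds `<base>/<name>=<value>/...`
/// with one directory level per partition field, in field order.
/// Assumes values are already strings and need no escaping.
pub fn partition_dir_path_sketch(base: &str, fields: &[(&str, &str)]) -> String {
    let mut path = base.trim_end_matches('/').to_string();
    for (name, value) in fields {
        path.push('/');
        path.push_str(name);
        path.push('=');
        path.push_str(value);
    }
    path
}
```

With two fields, `partition_dir_path_sketch("table/data", &[("col1", "1"), ("col2", "2")])` yields `table/data/col1=1/col2=2`; an unpartitioned table (no fields) simply yields the base directory.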

Successfully merging this pull request may close these issues.

Partition writes not creating expected directory hierarchy on S3 (MinIO)