Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: make DataFile Serializable && Deserializable #774

Closed
ZENOTME opened this issue Dec 11, 2024 · 8 comments
Closed

Discussion: make DataFile Serializable && Deserializable #774

ZENOTME opened this issue Dec 11, 2024 · 8 comments

Comments

@ZENOTME
Copy link
Contributor

ZENOTME commented Dec 11, 2024

Context

Make Datafile Serializable && Deserializable is useful, e.g. In distributed compute engine, it will create multiple writers in multiple machines and write the data in parallel and get the DataFile as the results, these DataFiles will be sent to a coordinator and append using transaction. In this case, DataFile should able to be Serializable && Deserializable.

Solution

For now, we support Serialize DataFile in _serde module and we should convert the DataFile to _serde::DataFile first, the interface looks like: pub fn try_from(value: super::DataFile, partition_type: &StructType,is_version_1: bool) -> _serde::DataFile. More detail:

.

There is something we need to resolve to support Datafile Serializable && Deserializable:

  1. The related interface needs to be exposed to the public
  2. The interface is not friendly. If the DataFile can be self-contain, things will be easier, e.g. DataFile itself can be Serialize && Deserialize, the user doesn't need to convert it using an interface like pub fn try_from(value: super::DataFile, partition_type: &StructType,is_version_1: bool) -> _serde::DataFile

To solve the above, I think there are two solutions:

  1. Make DateFile self-contain, store the partition type and version in DataFile directly so that it converts into _serde::DataFile directly and it can be Serialize && Deserialize.
  2. Provide something like
struct SerializableDataFile {
  version: i32,
  partition_type: StructType
  data_file: DataFile
}

I prefer solution 1 because it looks more natural. Welcome to different opinions and solutions. cc @liurenjie1024 @Fokko @Xuanwo @c-thiel

@Fokko
Copy link
Contributor

Fokko commented Dec 11, 2024

Hey @ZENOTME thanks for raising this.

Technically the Datafile is already serializable, you can encode it into Iceberg Avro :) I know how this works in Java and Python, but less known in Rust. I would expect the writer to have a copy of the Table, which contains the current partition information, would that help?

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Dec 13, 2024

Hey @ZENOTME thanks for raising this.

Technically the Datafile is already serializable, you can encode it into Iceberg Avro :) I know how this works in Java and Python, but less known in Rust. I would expect the writer to have a copy of the Table, which contains the current partition information, would that help?

Thanks @Fokko. I think this is the solution that works. But we should expose the _serde::DataFile so that the user can use this serializable representation. I create the PR for this #797.

@Xuanwo
Copy link
Member

Xuanwo commented Dec 14, 2024

Hi, thank you @ZENOTME for starting this discussion. I prefer to make DataFile itself serializable. Maybe we can do this samething for DataFile like TableMetadata: TableMetadata itself is Serialize but we won't expose the underlying types.

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Dec 15, 2024

Hi, thank you @ZENOTME for starting this discussion. I prefer to make DataFile itself serializable. Maybe we can do this samething for DataFile like TableMetadata: TableMetadata itself is Serialize but we won't expose the underlying types.

To make DataFile itself serializable, I think we should contain the type info of DataFile in it.

Also I think contain type info in DataFile can also help #777.

The partition in DataFile should include types to facilitate validation. e.g. the field name and field id

I think that's a great thing to do anyway. It isn't super expensive, and will avoid folks bricking their table. Preferably by field-ID for both V1 and V2, otherwise order for V1, and field-IDs for V2.

There are two choices: contain the partition type info in DataFile or contain the whole struct type of DataFile. The reason I prefer to contain the whole struct type of DataFile is the better compatibility, e.g in the future the DataFile add more struct types.

@liurenjie1024
Copy link
Contributor

Thanks @ZENOTME for raising this. In your question, the typical use case is compute engine, which needs to serialized DataFile. But in fact, what a compute engines required is to serialize/deserialized planned task if I understad correctly. If there are use cases DataFile itself must be serialized/deserialized, I prefer solution 1.

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Dec 28, 2024

Thanks @ZENOTME for raising this. In your question, the typical use case is compute engine, which needs to serialized DataFile. But in fact, what a compute engines required is to serialize/deserialized planned task if I understad correctly. If there are use cases DataFile itself must be serialized/deserialized, I prefer solution 1.

For read part, the compute engines required is to serialize/deserialized planned task. But for write part, the compute engines required to pass the data file. #797 is implement used solution 1.

@liurenjie1024
Copy link
Contributor

Should we close this now?

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Jan 3, 2025

Yes, I think #797 is enough now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants