
Core: Add InternalData read and write builders #12060

Open · wants to merge 5 commits into main

Conversation

@rdblue (Contributor, Author) commented Jan 23, 2025

This adds InternalData with read and write builder interfaces that can be used with Avro and Parquet by passing a FileFormat. Formats are registered by calling InternalData.register with callbacks to create format-specific builders.

The class is InternalData because registered builders are expected to use the internal object model that is used for Iceberg metadata files. Using a specific object model avoids needing to register callbacks to create value readers and writers that produce the format needed by the caller.

To demonstrate the new interfaces, this PR implements them for both Avro and Parquet. Parquet can't be used yet because it would fail at runtime until #11904 is committed (it is also missing custom type support).

Avro is working. To demonstrate that the builders can be used for metadata, this updates ManifestWriter, ManifestReader, and ManifestListWriter to use InternalData builders. It was also necessary to migrate the metadata classes to implement StructLike for the internal writers instead of IndexedRecord.
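For illustration, a minimal sketch of how a caller might use the new builders; the entry points and method names here are inferred from the interfaces discussed in this PR and may differ from the final API:

// Writing internal metadata rows; the FileFormat selects the registered builder.
FileAppender<StructLike> appender =
    InternalData.write(FileFormat.AVRO, outputFile)
        .schema(manifestListSchema) // Iceberg schema for the rows
        .meta("format-version", "2") // file metadata property
        .build();

// Reading rows back with a projection; reuseContainers is optional.
CloseableIterable<StructLike> rows =
    InternalData.read(FileFormat.AVRO, inputFile)
        .project(projection)
        .reuseContainers()
        .build();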

DynMethods.StaticMethod registerParquet =
    DynMethods.builder("register")
        .impl("org.apache.iceberg.parquet.Parquet")
        .buildStaticChecked();
rdblue (author):

This uses DynMethods to call Parquet's register method directly, rather than using a ServiceLoader. There's no need for that complexity because we want to keep the set of supported formats small rather than allowing custom formats to be plugged in.

I'm also considering refactoring so that the register method here is package-private, so no one can easily call it.
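For context, a fuller sketch of this dynamic registration path, assuming it runs in a static initializer in InternalData (the invoke() call and catch block appear later in this diff):

static {
  try {
    // load Parquet.register() reflectively so core does not depend on iceberg-parquet
    DynMethods.StaticMethod registerParquet =
        DynMethods.builder("register")
            .impl("org.apache.iceberg.parquet.Parquet")
            .buildStaticChecked();
    registerParquet.invoke();
  } catch (NoSuchMethodException e) {
    // failing to load Parquet is normal when iceberg-parquet is not on the classpath
  }
}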

Member:

I don't understand this one: why can we call Avro.register() but not Parquet.register()? I'm also not clear on the ServiceLoader comment: is that just to note that we don't want to make this dynamic and only want hard-coded formats to be supported?

Contributor:

This is due to Gradle project-level isolation. Avro is currently included in core, but Parquet is in a separate subproject. I'm in favor of being explicit about what is supported (i.e. hard-coded), but we would like to keep Parquet in a separate project to reduce dependency proliferation from api/core.

public org.apache.avro.Schema getSchema() {
  return AVRO_SCHEMA;

public int size() {
  return MANIFEST_LIST_SCHEMA.columns().size();
rdblue (author):

Avro schemas are no longer needed when using StructLike rather than IndexedRecord.
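To make the difference concrete: StructLike is positional and carries no Avro schema, so a class only needs to expose its fields by index. A toy example, not an actual Iceberg metadata class:

import org.apache.iceberg.StructLike;

class PairStruct implements StructLike {
  private Object first;
  private Object second;

  @Override
  public int size() {
    return 2; // field count, analogous to MANIFEST_LIST_SCHEMA.columns().size()
  }

  @Override
  public <T> T get(int pos, Class<T> javaClass) {
    return javaClass.cast(pos == 0 ? first : second);
  }

  @Override
  public <T> void set(int pos, T value) {
    if (pos == 0) {
      this.first = value;
    } else {
      this.second = value;
    }
  }
}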

private DataFile wrapped = null;

IndexedDataFile(org.apache.avro.Schema avroSchema) {
  this.avroSchema = avroSchema;
  this.partitionWrapper = new IndexedStructLike(avroSchema.getField("partition").schema());
rdblue (author):

There is also no need for a wrapper to adapt PartitionData to IndexedRecord because it is already StructLike.

Member:

Big fan of this change

@@ -90,14 +103,18 @@ private enum Codec {
}

public static WriteBuilder write(OutputFile file) {
  if (file instanceof EncryptedOutputFile) {
rdblue (author):

Encryption is handled by adding this. The read side already has a similar check.
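A sketch of the check being described, assuming EncryptedOutputFile exposes its encrypting stream via encryptingOutputFile() as in iceberg-core; shown for illustration only:

public static WriteBuilder write(OutputFile file) {
  if (file instanceof EncryptedOutputFile) {
    // unwrap to the encrypting stream so the written bytes are encrypted
    return write(((EncryptedOutputFile) file).encryptingOutputFile());
  }

  return new WriteBuilder(file);
}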

@@ -76,6 +76,15 @@ public void setSchema(Schema schema) {
  initReader();
}

@Override
public void setCustomTypes(
rdblue (author):

Because the InternalReader is no longer created by ManifestReader, the custom types now need to be passed to the read builder and forwarded to the reader here. Custom type support will need to be implemented for Parquet as well.
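A hedged sketch of that forwarding; the builder-side map and method names here are illustrative rather than the exact code in this PR:

// in the read builder: collect the requested types by field ID ...
private final Map<Integer, Class<? extends StructLike>> customTypes = Maps.newHashMap();

public ReadBuilder setCustomType(int fieldId, Class<? extends StructLike> structClass) {
  customTypes.put(fieldId, structClass);
  return this;
}

// ... and forward them to the reader when it is constructed
reader.setCustomTypes(customTypes);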

@@ -20,7 +20,7 @@

import java.util.Map;

-/** An interface for Avro DatumReaders to support custom record classes. */
+/** An interface for Avro DatumReaders to support custom record classes by name. */
rdblue (author):

This is used to distinguish Iceberg's custom types, which use field IDs and StructLike, from Avro's, which worked by renaming Avro records to class names that would be dynamically loaded.

}

private static WriteBuilder writeInternal(OutputFile outputFile) {
  return write(outputFile);
rdblue (author):

This will be where the internal object model is injected for Parquet.
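For illustration, the injected version might look something like this; InternalWriter is hypothetical until #11904 lands:

private static WriteBuilder writeInternal(OutputFile outputFile) {
  // createWriterFunc swaps in value writers that accept the internal object model
  return write(outputFile).createWriterFunc(InternalWriter::create);
}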

@@ -1171,6 +1188,16 @@ public ReadBuilder withNameMapping(NameMapping newNameMapping) {
  return this;
}

@Override
public ReadBuilder setRootType(Class<? extends StructLike> rootClass) {
  throw new UnsupportedOperationException("Custom types are not yet supported");
rdblue (author):

When the internal object model is complete, this should be implemented to instantiate the expected metadata classes while reading.

}
}

public static void register(
Contributor:

I feel like we shouldn't be exposing this. This opens up registration of any reader/writer and we need to be more opinionated about what we support here.

Member:

I'm a little less cautious here: I don't mind it being open, but I don't think it's a huge deal to start out with this more protected either.

Contributor:

I would just counter that it's easier to open up in the future if there's a good use case, but it's hard to close this door if we expose registration.

rdblue (author):

Yeah, I agree. I'll reduce the exposure.

@@ -125,6 +126,18 @@ public class Parquet {

private Parquet() {}

public static void register() {
  InternalData.register(FileFormat.PARQUET, Parquet::writeInternal, Parquet::readInternal);
Contributor:

We should move this up into the InternalData initialization. I think we want to explicitly register, not rely on self registration.

rdblue (author):

I wanted to reduce the amount of reflection code, so I thought it would make sense to have the majority of this code in the format. We could load the writeInternal and readInternal methods dynamically, I guess.

registerParquet.invoke();

} catch (NoSuchMethodException e) {
  // failing to load Parquet is normal and does not require a stack trace
Member:

Is this only normal for now? Don't we expect this failure to be a bug in the future? I'm also a little interested in when we would actually fail here if we are using the Iceberg repo as-is.

Contributor:

I wouldn't say that it's normal to fail. I'm actually not aware of any situations where the api/core modules are used but parquet isn't included. I think in almost all scenarios, it'll be available.

rdblue (author):

This would be normal whenever the iceberg-parquet module isn't in the classpath. For instance, the manifest read and write tests that are currently using InternalData in this PR hit this but operate normally because Parquet isn't used.

return writeBuilder.apply(file);
}

throw new UnsupportedOperationException(
Member:

nit: Personally I think it may be a bit clearer to extract the handling of a missing writer/reader.

Maybe have:

writerFor(FileFormat format) {
  writer = WRITE_BUILDERS.get(format);
  if (writer == null) {
    throw new UnsupportedOperationException("Unsupported format: " + format);
  } else {
    return writer;
  }
}

so that this code is just:

return writerFor(format).apply(file);

Mostly I feel a little unease about the implicit else in the current logic, so having an explicit else would also make me feel a little better.

WriteBuilder meta(String property, String value);

/**
* Set a file metadata properties from a Map.
Member:

Suggested change:
- * Set a file metadata properties from a Map.
+ * Set file metadata properties from a Map.

/**
* Set a writer configuration property.
*
* <p>Write configuration affects writer behavior. To add file metadata properties, use {@link
Member:

Suggested change:
- * <p>Write configuration affects writer behavior. To add file metadata properties, use {@link
+ * <p>Write configuration affects this writer's behavior. To add metadata properties to the written file use {@link

?

ReadBuilder reuseContainers();

/** Set a custom class for in-memory objects at the schema root. */
ReadBuilder setRootType(Class<? extends StructLike> rootClass);
Member:

I'll probably get to this later in the PR, but I'm interested in why we need both this and setCustomType.

Member:

Ok, I see how it's used below. I'm wondering if, instead of needing this, we could just automatically set these readers based on the root type, i.e.

setRootType(ManifestEntry) automatically sets field types based on ManifestEntry?

Or do we have a plan for using this in a more custom manner in the future?

rdblue (author):

The problem this is solving is that we don't have an assigned ID for the root type. We could use a sentinel value like -1, but that could technically collide. I just don't want to rely on setCustomType(ROOT_FIELD_ID, SomeObject.class).
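For illustration, the intended split might look like this in use; GenericManifestEntry, GenericDataFile, and the field ID constant are illustrative names, not necessarily this PR's exact code:

CloseableIterable<StructLike> entries =
    InternalData.read(FileFormat.AVRO, inputFile)
        .project(projection)
        .setRootType(GenericManifestEntry.class) // the root struct has no field ID
        .setCustomType(DATA_FILE_FIELD_ID, GenericDataFile.class) // nested struct by ID
        .build();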

@RussellSpitzer (Member):
From a release perspective, should we merge this post-1.8? Just thinking we probably want it in the build for a bit before we ship it. I know that partition stats is downstream of this, so that is a dependency to consider, but I'm not sure we can get that all together rapidly if we want to do this in the next week or so.

@rdblue (Contributor, Author) commented Jan 23, 2025

> From a release perspective, should we merge this post-1.8? Just thinking we probably want it in the build for a bit before we ship it. I know that partition stats is downstream of this, so that is a dependency to consider, but I'm not sure we can get that all together rapidly if we want to do this in the next week or so.

I agree. There's no need to target this for 1.8, especially when it isn't clear that the Parquet internal object model will make it. I just wanted to get this out for discussion since we are currently blocked on creating Parquet metadata files until we either merge Parquet into core or implement something like this.
