Core: Add InternalData read and write builders #12060
base: main
Conversation
DynMethods.StaticMethod registerParquet =
    DynMethods.builder("register")
        .impl("org.apache.iceberg.parquet.Parquet")
        .buildStaticChecked();
This uses DynMethods to call Parquet's register method directly, rather than using a ServiceLoader. There is no need for the complexity because we want to keep the number of supported formats small rather than plugging in custom formats.

I'm also considering refactoring so that the register method here is package-private so that no one can easily call it.
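For context, here is a minimal self-contained sketch of the reflective lookup that a DynMethods-style static method handle performs; the nested `Parquet` class and its `registered` flag are stand-ins for illustration, not Iceberg's real classes:

```java
import java.lang.reflect.Method;

public class RegisterSketch {
  // Hypothetical stand-in for org.apache.iceberg.parquet.Parquet
  public static class Parquet {
    static boolean registered = false;

    public static void register() {
      registered = true;
    }
  }

  // Look up a static no-arg method by name and invoke it, roughly what
  // DynMethods.builder("register").impl(...).buildStaticChecked() enables
  public static void callStatic(String className, String methodName) {
    try {
      Method method = Class.forName(className).getMethod(methodName);
      method.invoke(null); // null receiver because the method is static
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException("could not call " + className + "." + methodName, e);
    }
  }
}
```

The point of the reflective lookup is that core never needs a compile-time dependency on the Parquet class it is calling.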
I don't understand this one: why can we call Avro.register() but not Parquet.register()? I'm also not clear on the ServiceLoader comment; is that just to note that we don't want to make this dynamic and only want hardcoded formats to be supported?
This is due to the Gradle project-level isolation. Avro is currently included in core, but Parquet is in a separate subproject. I'm in favor of being explicit about what is supported (i.e. hard-coded), but we would like to keep Parquet in a separate project to reduce dependency proliferation from api/core.
public org.apache.avro.Schema getSchema() {
  return AVRO_SCHEMA;

public int size() {
  return MANIFEST_LIST_SCHEMA.columns().size();
Avro schemas are no longer needed when using StructLike rather than IndexedRecord.
private DataFile wrapped = null;

IndexedDataFile(org.apache.avro.Schema avroSchema) {
  this.avroSchema = avroSchema;
  this.partitionWrapper = new IndexedStructLike(avroSchema.getField("partition").schema());
There is also no need for a wrapper to adapt PartitionData to IndexedRecord because it is already StructLike.
Big fan of this change
@@ -90,14 +103,18 @@ private enum Codec {
}

public static WriteBuilder write(OutputFile file) {
  if (file instanceof EncryptedOutputFile) {
Encryption is handled by adding this. The read side already has a similar check.
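The check being discussed can be sketched with simplified stand-in types (the `OutputFile` and `EncryptedOutputFile` interfaces below are trimmed-down assumptions, not Iceberg's real definitions):

```java
public class EncryptionDispatch {
  interface OutputFile {
    String location();
  }

  // Simplified stand-in for Iceberg's EncryptedOutputFile
  interface EncryptedOutputFile extends OutputFile {
    OutputFile encryptingOutputFile(); // the underlying file that encrypts on write
  }

  // Single write entry point: unwrap encrypted files here so callers
  // don't need to pick a separate overload for encryption
  static String write(OutputFile file) {
    if (file instanceof EncryptedOutputFile) {
      return write(((EncryptedOutputFile) file).encryptingOutputFile());
    }

    return "write-builder(" + file.location() + ")";
  }
}
```

The instanceof dispatch keeps encryption an implementation detail of the single entry point rather than part of the caller-facing API.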
@@ -76,6 +76,15 @@ public void setSchema(Schema schema) {
  initReader();
}

@Override
public void setCustomTypes(
Because the InternalReader is no longer created by ManifestReader, the custom types now need to be passed to the read builder and forwarded to the reader here. Custom type support will need to be implemented for Parquet as well.
@@ -20,7 +20,7 @@

import java.util.Map;

- /** An interface for Avro DatumReaders to support custom record classes. */
+ /** An interface for Avro DatumReaders to support custom record classes by name. */
This is used to distinguish Iceberg's custom types, which use field IDs and StructLike, from Avro's, which worked by renaming Avro records to class names that would be dynamically loaded.
}

private static WriteBuilder writeInternal(OutputFile outputFile) {
  return write(outputFile);
This will be where the internal object model is injected for Parquet.
@@ -1171,6 +1188,16 @@ public ReadBuilder withNameMapping(NameMapping newNameMapping) {
  return this;
}

@Override
public ReadBuilder setRootType(Class<? extends StructLike> rootClass) {
  throw new UnsupportedOperationException("Custom types are not yet supported");
When the internal object model is complete, this should be implemented to instantiate the expected metadata classes while reading.
}
}

public static void register(
I feel like we shouldn't be exposing this. This opens up registration of any reader/writer and we need to be more opinionated about what we support here.
I'm a little less cautious here: I don't mind it being open, but I also don't think it's a huge deal to start out with this more protected.
I would just counter that it's easier to open this up in the future if there's a good use case, but it's hard to close this door if we expose registration.
Yeah, I agree. I'll reduce the exposure.
@@ -125,6 +126,18 @@ public class Parquet {

private Parquet() {}

public static void register() {
  InternalData.register(FileFormat.PARQUET, Parquet::writeInternal, Parquet::readInternal);
We should move this up into the InternalData initialization. I think we want to explicitly register, not rely on self registration.
I wanted to reduce the amount of reflection code, so I thought it would make sense to have the majority of this code in the format. We could load the writeInternal and readInternal methods dynamically, I guess.
registerParquet.invoke();

} catch (NoSuchMethodException e) {
  // failing to load Parquet is normal and does not require a stack trace
Is this normal for now? Don't we expect this to be a failure bug in the future? I'm also a little interested in when we would actually fail here if we are using the Iceberg repo as-is.
I wouldn't say that it's normal to fail. I'm actually not aware of any situations where the api/core modules are used but parquet isn't included. I think in almost all scenarios, it'll be available.
This would be normal whenever the iceberg-parquet module isn't on the classpath. For instance, the manifest read and write tests that are currently using InternalData in this PR hit this but operate normally because Parquet isn't used.
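The "missing module is normal" behavior being discussed can be sketched as a best-effort registration; the class name below is Iceberg's real Parquet entry point, but the `status` flag and `tryRegister` helper are hypothetical, and the sketch assumes that module is absent from the classpath:

```java
public class OptionalRegistration {
  static String status = "unregistered";

  // Try to register a format whose module may not be on the classpath.
  // Absence is expected and silent; other reflective failures are real errors.
  static void tryRegister(String className) {
    try {
      Class.forName(className).getMethod("register").invoke(null);
      status = "registered";
    } catch (ClassNotFoundException | NoSuchMethodException e) {
      // optional module isn't present; continue without it, no stack trace needed
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Failed to register " + className, e);
    }
  }
}
```

Swallowing only the "not found" exceptions is what makes a missing iceberg-parquet jar a quiet no-op while still surfacing genuine registration bugs.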
return writeBuilder.apply(file);
}

throw new UnsupportedOperationException(
nit: Personally I think it may be a bit clearer to extract the handling of a missing writer/reader. Maybe have

writerFor(FileFormat format) {
  writer = WRITE_BUILDERS.get(format);
  if (writer == null) {
    throw new UnsupportedOperationException(...);
  } else {
    return writer;
  }
}

so that this code is just

return writerFor(format).apply(file);

Mostly I feel a little uneasy about the implicit else in the current logic, so having an explicit else would also make me feel a little better.
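The suggested refactor above can be sketched as a runnable registry; the enum, the String-based builder function, and the method names here are simplified stand-ins for Iceberg's actual types:

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Function;

public class BuilderRegistry {
  enum FileFormat { AVRO, PARQUET }

  private static final Map<FileFormat, Function<String, String>> WRITE_BUILDERS =
      new EnumMap<>(FileFormat.class);

  static void register(FileFormat format, Function<String, String> builder) {
    WRITE_BUILDERS.put(format, builder);
  }

  // Extracting the lookup keeps the null check and the error in one place,
  // so callers never see an implicit else branch
  static Function<String, String> writerFor(FileFormat format) {
    Function<String, String> writer = WRITE_BUILDERS.get(format);
    if (writer == null) {
      throw new UnsupportedOperationException(
          "Cannot write using unregistered format: " + format);
    } else {
      return writer;
    }
  }

  static String write(FileFormat format, String file) {
    return writerFor(format).apply(file);
  }
}
```

With the lookup extracted, `write` collapses to a single expression and an unregistered format fails with one well-defined exception.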
WriteBuilder meta(String property, String value);

/**
 * Set a file metadata properties from a Map.
Suggested change:
- * Set a file metadata properties from a Map.
+ * Set file metadata properties from a Map.
/**
 * Set a writer configuration property.
 *
 * <p>Write configuration affects writer behavior. To add file metadata properties, use {@link
Suggested change:
- * <p>Write configuration affects writer behavior. To add file metadata properties, use {@link
+ * <p>Write configuration affects this writer's behavior. To add metadata properties to the written file, use {@link
?
ReadBuilder reuseContainers();

/** Set a custom class for in-memory objects at the schema root. */
ReadBuilder setRootType(Class<? extends StructLike> rootClass);
I'll probably get to this later in the PR, but I'm interested in why we need both this and setCustomType.
Ok, I see how it's used below. I'm wondering if, instead of needing this, we could just automatically set these readers based on the root type, i.e. setRootType(ManifestEntry) automatically sets field types based on ManifestEntry? Or do we have a plan for using this in a more custom manner in the future?
The problem this is solving is that we don't have an assigned ID for the root type. We could use a sentinel value like -1, but that could technically collide. I just don't want to rely on setCustomType(ROOT_FIELD_ID, SomeObject.class).
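The collision concern above can be illustrated with a small sketch: the root type gets its own slot instead of a reserved sentinel ID. The class and method names here are hypothetical, mirroring the setRootType/setCustomType split rather than Iceberg's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class CustomTypeRegistry {
  private Class<?> rootType; // the root struct has no field ID of its own
  private final Map<Integer, Class<?>> typesByFieldId = new HashMap<>();

  // A dedicated setter for the root avoids reserving a sentinel ID like -1
  // that could, in principle, collide with a real field ID
  public void setRootType(Class<?> type) {
    this.rootType = type;
  }

  public void setCustomType(int fieldId, Class<?> type) {
    typesByFieldId.put(fieldId, type);
  }

  // null field ID means "the root"; any integer means a nested struct field
  public Class<?> typeFor(Integer fieldId) {
    return fieldId == null ? rootType : typesByFieldId.get(fieldId);
  }
}
```

Keeping the root out of the ID map makes the "no assigned ID for the root" case explicit in the API instead of encoding it as a magic value.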
From a release perspective, should we merge this post-1.8? Just thinking we probably want it in the build for a bit before we ship it. I know that partition stats is downstream of this, so that is a dependency to consider, but I'm not sure we can get that all together rapidly if we want to do this in the next week or so.
I agree. There's no need to target this for 1.8, especially when it isn't clear that the Parquet internal object model will make it. I just wanted to get this out for discussion since we are currently blocked on creating Parquet metadata files until we either merge Parquet into core or implement something like this.
This adds InternalData with read and write builder interfaces that can be used with Avro and Parquet by passing a FileFormat. Formats are registered by calling InternalData.register with callbacks to create format-specific builders.

The class is InternalData because registered builders are expected to use the internal object model that is used for Iceberg metadata files. Using a specific object model avoids needing to register callbacks to create value readers and writers that produce the format needed by the caller.

To demonstrate the new interfaces, this PR implements them using both Avro and Parquet. Parquet can't be used because it would fail at runtime until #11904 is committed (it is also missing custom type support). Avro is working. To demonstrate that the builders can be used for metadata, this updates ManifestWriter, ManifestReader, and ManifestListWriter to use InternalData builders. It was also necessary to migrate the metadata classes to implement StructLike for the internal writers instead of IndexedRecord.