forked from apache/spark
[SPARK-45827][SQL] Add Variant data type in Spark
## What changes were proposed in this pull request?

This PR adds the Variant data type to Spark. It doesn't actually introduce any binary encoding, but just has the `value` and `metadata` binaries.

This PR includes:
- The in-memory Variant representation in different types of Spark rows. All rows except `UnsafeRow` use the `VariantVal` object to store a Variant value. In `UnsafeRow`, the two binaries are stored contiguously.
- Spark Parquet writer and reader support for the Variant type. This is agnostic to the detailed binary encoding and just transparently reads the two binaries.
- A dummy Spark `parse_json` implementation so that I can manually test the writer and reader. It currently returns a `VariantVal` whose value is the raw bytes of the input string and whose metadata is empty. This is **not** a valid Variant value in the final binary encoding.

## How was this patch tested?

Manual testing. Some supported usages:

```
> sql("create table T using parquet as select parse_json('1') as o")
> sql("select * from T").show
+---+
|  o|
+---+
|  1|
+---+
> sql("insert into T select parse_json('[2]') as o")
> sql("select * from T").show
+---+
|  o|
+---+
|[2]|
|  1|
+---+
```

Closes apache#43707 from chenhao-db/variant-type.

Authored-by: Chenhao Li <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
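The placeholder `parse_json` behaviour described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the code from this commit: the class `DummyParseJsonSketch`, the nested `Variant` holder, and the method `dummyParseJson` are hypothetical names, and UTF-8 is assumed for the "raw bytes of the input string".

```java
import java.nio.charset.StandardCharsets;

public class DummyParseJsonSketch {
  // Hypothetical stand-in for VariantVal: it only carries the two binaries.
  static final class Variant {
    final byte[] value;
    final byte[] metadata;

    Variant(byte[] value, byte[] metadata) {
      this.value = value;
      this.metadata = metadata;
    }

    // With the placeholder encoding, the value bytes are the JSON text itself.
    @Override
    public String toString() {
      return new String(value, StandardCharsets.UTF_8);
    }
  }

  // Placeholder parse: the value is the raw input bytes (UTF-8 assumed here),
  // and the metadata is empty. This is not a valid final Variant encoding.
  static Variant dummyParseJson(String json) {
    return new Variant(json.getBytes(StandardCharsets.UTF_8), new byte[0]);
  }

  public static void main(String[] args) {
    Variant v = dummyParseJson("[2]");
    System.out.println(v);                 // prints [2]
    System.out.println(v.metadata.length); // prints 0
  }
}
```

Because the value is just the input text, showing the column prints the original JSON string back, which matches the manual test output above.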
1 parent cd19d6c commit aa10ac7
Showing 56 changed files with 545 additions and 19 deletions.
110 changes: 110 additions & 0 deletions
common/unsafe/src/main/java/org/apache/spark/unsafe/types/VariantVal.java
@@ -0,0 +1,110 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.unsafe.types;

import org.apache.spark.unsafe.Platform;

import java.io.Serializable;
import java.util.Arrays;

/**
 * The physical data representation of {@link org.apache.spark.sql.types.VariantType} that
 * represents a semi-structured value. It consists of two binary values: {@link VariantVal#value}
 * and {@link VariantVal#metadata}. The value encodes types and values, but not field names. The
 * metadata currently contains a version flag and a list of field names. We can extend/modify the
 * detailed binary format given the version flag.
 * <p>
 * A {@link VariantVal} can be produced by casting another value into the Variant type or parsing a
 * JSON string in the {@link org.apache.spark.sql.catalyst.expressions.variant.ParseJson}
 * expression. We can extract a path consisting of field names and array indices from it, cast it
 * into a concrete data type, or rebuild a JSON string from it.
 * <p>
 * The storage layout of this class in {@link org.apache.spark.sql.catalyst.expressions.UnsafeRow}
 * and {@link org.apache.spark.sql.catalyst.expressions.UnsafeArrayData} is: the fixed-size part is
 * a long value "offsetAndSize". The upper 32 bits is the offset that points to the start position
 * of the actual binary content. The lower 32 bits is the total length of the binary content. The
 * binary content contains: 4 bytes representing the length of {@link VariantVal#value}, content of
 * {@link VariantVal#value}, content of {@link VariantVal#metadata}. This is an internal and
 * transient format and can be modified at any time.
 */
public class VariantVal implements Serializable {
  protected final byte[] value;
  protected final byte[] metadata;

  public VariantVal(byte[] value, byte[] metadata) {
    this.value = value;
    this.metadata = metadata;
  }

  public byte[] getValue() {
    return value;
  }

  public byte[] getMetadata() {
    return metadata;
  }

  /**
   * This function reads the binary content described in `writeIntoUnsafeRow` from `baseObject`. The
   * offset is computed by adding the offset in {@code offsetAndSize} and {@code baseOffset}.
   */
  public static VariantVal readFromUnsafeRow(
      long offsetAndSize,
      Object baseObject,
      long baseOffset) {
    // offset and totalSize is the upper/lower 32 bits in offsetAndSize.
    int offset = (int) (offsetAndSize >> 32);
    int totalSize = (int) offsetAndSize;
    int valueSize = Platform.getInt(baseObject, baseOffset + offset);
    int metadataSize = totalSize - 4 - valueSize;
    byte[] value = new byte[valueSize];
    byte[] metadata = new byte[metadataSize];
    Platform.copyMemory(
      baseObject,
      baseOffset + offset + 4,
      value,
      Platform.BYTE_ARRAY_OFFSET,
      valueSize
    );
    Platform.copyMemory(
      baseObject,
      baseOffset + offset + 4 + valueSize,
      metadata,
      Platform.BYTE_ARRAY_OFFSET,
      metadataSize
    );
    return new VariantVal(value, metadata);
  }

  public String debugString() {
    return "VariantVal{" +
        "value=" + Arrays.toString(value) +
        ", metadata=" + Arrays.toString(metadata) +
        '}';
  }

  /**
   * @return A human-readable representation of the Variant value. It is always a JSON string at
   * this moment.
   */
  @Override
  public String toString() {
    // NOTE: the encoding is not yet implemented, this is not the final implementation.
    return new String(value);
  }
}
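
The `offsetAndSize` layout documented in the class comment can be exercised without Spark. The sketch below is an illustration only, under several assumptions: it uses `java.nio.ByteBuffer` in place of Spark's `Platform` calls, so the 4-byte length is written in the buffer's default big-endian order rather than the native order `Platform.getInt` would use, and the class `VariantLayoutSketch` and its `pack`, `write`, and `read` helpers are hypothetical names. The write side in particular is a guess at the inverse of `readFromUnsafeRow`, since the writer is not part of this file.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class VariantLayoutSketch {
  // Packs the offset into the upper 32 bits and the total size into the lower 32 bits.
  static long pack(int offset, int totalSize) {
    return ((long) offset << 32) | (totalSize & 0xFFFFFFFFL);
  }

  // Hypothetical write side: [4-byte value length][value bytes][metadata bytes] at `offset`.
  static long write(ByteBuffer buf, int offset, byte[] value, byte[] metadata) {
    buf.position(offset);
    buf.putInt(value.length);
    buf.put(value);
    buf.put(metadata);
    return pack(offset, 4 + value.length + metadata.length);
  }

  // Mirrors the arithmetic of readFromUnsafeRow and returns {value, metadata}.
  static byte[][] read(ByteBuffer buf, long offsetAndSize) {
    int offset = (int) (offsetAndSize >> 32);
    int totalSize = (int) offsetAndSize;
    int valueSize = buf.getInt(offset);            // absolute read of the length prefix
    int metadataSize = totalSize - 4 - valueSize;  // the rest of the content is metadata
    byte[] value = new byte[valueSize];
    byte[] metadata = new byte[metadataSize];
    ByteBuffer view = buf.duplicate();             // independent position, shared content
    view.position(offset + 4);
    view.get(value);
    view.get(metadata);
    return new byte[][] {value, metadata};
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(64);
    long packed = write(buf, 16, new byte[] {1, 2, 3}, new byte[] {9});
    byte[][] parts = read(buf, packed);
    System.out.println(Arrays.toString(parts[0])); // prints [1, 2, 3]
    System.out.println(Arrays.toString(parts[1])); // prints [9]
  }
}
```

Only the value length is stored explicitly. The metadata length is recovered as `totalSize - 4 - valueSize`, which is why the lower 32 bits of `offsetAndSize` must cover the length prefix as well as both binaries.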
43 changes: 43 additions & 0 deletions
sql/api/src/main/scala/org/apache/spark/sql/types/VariantType.scala
@@ -0,0 +1,43 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.types

import org.apache.spark.annotation.Unstable

/**
 * The data type representing semi-structured values with arbitrary hierarchical data structures. It
 * is intended to store parsed JSON values and most other data types in the system (e.g., it cannot
 * store a map with a non-string key type).
 *
 * @since 4.0.0
 */
@Unstable
class VariantType private () extends AtomicType {
  // The default size is used in query planning to drive optimization decisions. 2048 is arbitrarily
  // picked and we currently don't have any data to support it. This may need revisiting later.
  override def defaultSize: Int = 2048

  /** This is a no-op because values with VARIANT type are always nullable. */
  private[spark] override def asNullable: VariantType = this
}

/**
 * @since 4.0.0
 */
@Unstable
case object VariantType extends VariantType