This extension module lets Apache Sedona work seamlessly with Apache Iceberg: the UDT and serializer for geometry values are unified, and Apache Sedona's spatial predicates are pushed down to Iceberg tables for partition pruning and data skipping.
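For example, with the extension enabled, the spatial filter in a query like the following sketch can be pushed down to Iceberg. The table name `db.city_points` and column `geom` are illustrative only, and `spark` is assumed to be a `SparkSession` configured as described below:

```scala
// Hypothetical table and column names; `spark` is a SparkSession
// configured with the extension as shown later in this document.
val result = spark.sql("""
  SELECT id, geom
  FROM db.city_points
  WHERE ST_Contains(ST_GeomFromText('POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))'), geom)
""")
result.show()
```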
Add the sedona-iceberg extension jar to the `--jars` argument of the `spark-submit` command, and append `org.apache.iceberg.spark.extensions.SedonaIcebergExtensions` to the `spark.sql.extensions` config property.
A typical Spark job submission script looks like this:
```bash
spark-submit \
  --jars /path/to/iceberg-spark-runtime-jar,/path/to/sedona-iceberg-extension-jar,/path/to/geotools-wrapper-geotools-jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=org.apache.sedona.core.serde.SedonaKryoRegistrator \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.apache.iceberg.spark.extensions.SedonaIcebergExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  ...
```
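If you construct the `SparkSession` yourself rather than passing `--conf` flags to `spark-submit`, the same settings can be applied programmatically. The following is a minimal sketch using the config keys from the script above; the app name is arbitrary:

```scala
import org.apache.spark.sql.SparkSession

// Apply the same configuration programmatically. Note that
// spark.sql.extensions only takes effect when set before the
// session is created, as done here via the builder.
val spark = SparkSession.builder()
  .appName("sedona-iceberg-example") // arbitrary name
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.registrator", "org.apache.sedona.core.serde.SedonaKryoRegistrator")
  .config("spark.sql.extensions",
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions," +
      "org.apache.iceberg.spark.extensions.SedonaIcebergExtensions")
  .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
  .config("spark.sql.catalog.spark_catalog.type", "hive")
  .getOrCreate()
```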
Notes:
- Don't forget to register the Kryo serializers provided by Apache Sedona; otherwise you'll suffer from poor performance and high memory usage.
- Since GeoTools is published under the LGPL license, we cannot bundle it into our extension jar. You need to obtain and add the GeoTools jar yourself. Please refer to the Sedona documentation on GeoTools for details.
The `example` directory contains an example Spark job that processes geometries stored in Iceberg tables using Apache Sedona. Please refer to `example/launch.sh` for the `spark-submit` command used to launch the job.
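For orientation, here is a minimal sketch of what such a job can look like. This is not the code from the `example` directory: the `SedonaIcebergDemo` object, the table name `demo_points`, and the column names are all illustrative, and it assumes the session was launched with the configuration shown earlier.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.sedona.sql.utils.SedonaSQLRegistrator

object SedonaIcebergDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sedona-iceberg-demo").getOrCreate()

    // Register Sedona's SQL functions (ST_Point, ST_Within, ...) in case
    // the extension does not already register them.
    SedonaSQLRegistrator.registerAll(spark)

    // Write a few points into an Iceberg table. The unified geometry UDT
    // and serializer are what allow the geometry column to be stored.
    spark.sql("SELECT id, ST_Point(CAST(id AS DOUBLE), CAST(id AS DOUBLE)) AS geom FROM range(10)")
      .writeTo("demo_points").using("iceberg").createOrReplace()

    // Query with a spatial predicate; with the extension enabled, the
    // filter can be pushed down to Iceberg for data skipping.
    spark.sql(
      """SELECT id, geom FROM demo_points
        |WHERE ST_Within(geom, ST_GeomFromText('POLYGON ((0 0, 4 0, 4 4, 0 4, 0 0))'))""".stripMargin
    ).show()

    spark.stop()
  }
}
```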
You can build the extension jar for a specific Spark version yourself with the following command:

```bash
./gradlew -DsparkVersion=3.1 build
```
Currently, only Spark 3.1, 3.2, and 3.3 are supported.