
[SEDONA-664] Add native GeoPackage reader #1603

Merged (1 commit, Oct 15, 2024)
Conversation

@Imbruced (Member) commented Sep 24, 2024

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

  • No

What changes were proposed in this PR?

GeoPackage data source

How was this patch tested?

Integration tests

Did this PR include necessary documentation updates?

  • I'll adjust the docs

@Imbruced (Member Author):

As a follow-up to this PR, I need to add:

  • EWKB parsing (I think reading circular geometries won't work)
  • support for parallel loading via limit and offset
  • a kind of filtering based on the envelope offset
  • direct transformation from GeoPackage serialization to Sedona
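The parallel-loading idea from the list above can be sketched as planning one LIMIT/OFFSET window per partition. This is a hypothetical helper, not the PR's actual code; the class and method names are assumptions:

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionPlanner {
    /** One (limit, offset) window per partition. */
    public record Window(long limit, long offset) {}

    // Split rowCount rows into numPartitions contiguous LIMIT/OFFSET windows,
    // spreading the remainder over the first partitions.
    public static List<Window> plan(long rowCount, int numPartitions) {
        List<Window> windows = new ArrayList<>();
        long base = rowCount / numPartitions;
        long remainder = rowCount % numPartitions;
        long offset = 0;
        for (int i = 0; i < numPartitions; i++) {
            long limit = base + (i < remainder ? 1 : 0);
            if (limit > 0) {
                windows.add(new Window(limit, offset));
            }
            offset += limit;
        }
        return windows;
    }
}
```

Each window would then back a query like `SELECT * FROM <table> LIMIT ? OFFSET ?` executed on its own task.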

@Kontinuation (Member) commented Sep 25, 2024

Thank you for this great work! My major concern is whether it works with GeoPackage files stored on cloud storage such as HDFS or S3. As far as I know org.xerial:sqlite-jdbc only works with local files, so we may have to download the GeoPackage file from cloud storage to the local file system before reading it.

@Imbruced (Member Author):

@Kontinuation good point, I'll write a test to make sure it works. An integration test with MinIO should be enough, I think.

@Imbruced (Member Author):

WIP

@github-actions github-actions bot added the docs label Sep 30, 2024
@Imbruced Imbruced marked this pull request as ready for review October 1, 2024 21:15
Comment on lines 49 to 60
// skip srid for now
reader.getInt()

skipEnvelope(resolvedFlags._1, reader)

val wkb = new Array[Byte](reader.remaining())
reader.get(wkb)

val wkbReader = new org.locationtech.jts.io.WKBReader()
val geom = wkbReader.read(wkb)
Member:

I suggest that we take SRID into consideration. val wkbReader = new WKBReader(new GeometryFactory(new PrecisionModel(), srid)) would be sufficient.
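For context, the SRID sits in the GeoPackage binary header right after the flags byte, in whatever byte order the flags declare. A minimal, self-contained header parse — a sketch based on the GeoPackage spec, not Sedona's actual reader; the class and record names are assumptions — looks like:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class GpkgHeader {
    // Envelope byte lengths for envelope indicator codes 0..4 (flags bits 1-3):
    // 0 = none, 1 = XY, 2 = XYZ, 3 = XYM, 4 = XYZM.
    private static final int[] ENVELOPE_BYTES = {0, 32, 48, 48, 64};

    public record Header(int srid, int envelopeLength, boolean littleEndian) {}

    // Parse the fixed part of a GeoPackage geometry blob: magic "GP",
    // a version byte, a flags byte, then a 4-byte SRID in the declared order.
    public static Header parse(byte[] blob) {
        ByteBuffer buf = ByteBuffer.wrap(blob);
        if (buf.get() != 'G' || buf.get() != 'P') {
            throw new IllegalArgumentException("not a GeoPackage geometry blob");
        }
        buf.get();                                   // version (0 = spec version 1)
        int flags = buf.get() & 0xFF;
        boolean littleEndian = (flags & 0x01) == 1;  // bit 0: header byte order
        int envelopeIndicator = (flags >> 1) & 0x07; // bits 1-3: envelope shape
        buf.order(littleEndian ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN);
        int srid = buf.getInt();
        return new Header(srid, ENVELOPE_BYTES[envelopeIndicator], littleEndian);
    }
}
```

The parsed SRID could then seed the geometry factory along the lines suggested in the review comment, e.g. `new WKBReader(new GeometryFactory(new PrecisionModel(), header.srid()))`.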

Comment on lines 83 to 92
if (pathString.toLowerCase(Locale.ROOT).endsWith(".geopackage")) {
  val path = new Path(pathString)
  val fs = path.getFileSystem(hadoopConf)

  val isDirectory = Try(fs.getFileStatus(path).isDirectory).getOrElse(false)
  if (isDirectory) {
    pathString
  } else {
    pathString.substring(0, pathString.length - 3) + "???"
  }
Member:

If I understand it correctly, if pathString ends with ".geopackage", and it is not a directory, it will be transformed to "****.geopack???". I cannot grasp the idea of this transformation.

Member Author:

Yeah, you are right. I was sure that at this point I already had the list of files in the directory. I am wondering how it would behave if I pass a list of files, and whether it's actually needed (a user specifying paths with different file formats).

val serializableConf = new SerializableConfiguration(
sparkSession.sessionState.newHadoopConfWithOptions(options.asScala.toMap))

val tempFile = FileSystemUtils.copyToLocal(serializableConf.value, files.head.getPath)
Member:

Can we detect if the path is a local path and skip calling copyToLocal?
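One way to make that check is to look at the URI scheme. This is a plain-JVM sketch using `java.net.URI` (the actual implementation would likely go through Hadoop's `FileSystem` API; the class and method names here are assumptions):

```java
import java.net.URI;

public class PathUtils {
    // Treat a path as local when it has no scheme or uses the file:// scheme;
    // anything else (s3a://, hdfs://, ...) would still need copyToLocal.
    public static boolean isLocalPath(String pathString) {
        String scheme = URI.create(pathString).getScheme();
        return scheme == null || scheme.equalsIgnoreCase("file");
    }
}
```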

Member Author:

Yeah, sure.

@Imbruced (Member Author) commented Oct 2, 2024

@Kontinuation thanks for the review!

@@ -98,6 +97,36 @@
  <groupId>org.locationtech.jts</groupId>
  <artifactId>jts-core</artifactId>
</dependency>
<dependency>
  <groupId>org.testcontainers</groupId>
Member:

Is there a reason why these dependencies only appear in the spark-3.3 profile?

Member Author:

The test for loading from S3 (I used MinIO with Testcontainers) is only in the Spark 3.3 profile. I can duplicate it for the other Spark versions as well.

Member:

Please move these to the spark-common pom.xml so all Spark 3.X pom.xml will share it

Member Author:

I think test-scoped dependencies don't propagate when you put them in common, and I would avoid declaring them as regular dependencies, so I copied them for each Spark version.

Member:

Do we support other Spark versions 3.0, 3.1, and 3.2?

Member Author:

I didn't test it with those; I can add the data source for those versions as well.

Member:

Yes, if the implementation differs among the versions, we should definitely replicate the tests and run them for each version.

@atiannicelli:

@jiayuasu I can't wait to be able to use this for Overture Buildings Conflation. We are looking at using a couple of datasets that are packaged as .gpkg files.

@Imbruced Imbruced force-pushed the feature/geopackage-reader branch 4 times, most recently from 41d4965 to 18186b9 Compare October 12, 2024 22:39
@jiayuasu jiayuasu changed the title Feature/geopackage reader [SEDONA-664] Add native GeoPackage reader Oct 14, 2024
.option("showMetadata", "true")
.load(path)

df.where("data_type = 'tiles'").show(false)
Member:

Please remove or comment out all show() functions

Member Author:

I think we might add a linter for such cases; sorry about that.

@jiayuasu jiayuasu added this to the sedona-1.7.0 milestone Oct 14, 2024
docs/tutorial/sql.md (outdated, resolved)
@Imbruced (Member Author):

@jiayuasu removed the show() method calls.

@Imbruced (Member Author):

@jiayuasu I applied the changes. I also noticed that we keep running previous pipelines even after new ones start; I think it's worth adding a concurrency mechanism in GitHub Actions:
https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#concurrency
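For reference, a minimal concurrency block per the documented GitHub Actions workflow syntax would look like this (the group key is an assumption; it cancels superseded runs per branch and workflow):

```yaml
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```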

@jiayuasu jiayuasu merged commit 4c7da62 into master Oct 15, 2024
50 checks passed
@jiayuasu jiayuasu deleted the feature/geopackage-reader branch October 15, 2024 18:25
5 participants