Support duckdb as alternate sink/source...or more #482

Open
caufieldjh opened this issue Apr 1, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@caufieldjh
Collaborator

In KG construction call today (Apr 1 2024), discussion touched on DuckDB and its relevance to KG exchange.
At minimum, supporting this infrastructure could make it easier to query and access graphs.
Beyond that, this DB platform or its alternatives could replace some internal KGX operations, particularly the more memory-hungry ones like merge.

@caufieldjh caufieldjh added the enhancement New feature or request label Apr 1, 2024
@caufieldjh
Collaborator Author

Or perhaps this is just about Parquet support.

@cmungall
Contributor

cmungall commented Apr 3, 2024

I think these are separate asks. DuckDB is very flexible in what it reads: it can run directly off parquet, arrow, sqlite3, and plain csv (locally or remotely). It's kind of awesome. But there may still be a use case for having a native duckdb file.

It may be worthwhile taking a step back and talking about internal representation for a moment.

The biolink standard defines slots like synonym as multivalued. kgx has always serialized these as pipe-separated lists because, well, csv is the lowest common denominator. We don't have to live in that universe anymore. Any modern data science format has a better tabular data model, whether it's jsonl, parquet, arrow, or duckdb. Even stalwart databases like pg support arrays/lists as first-class types.

See https://github.com/orgs/linkml/discussions/1996
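To make the multivalued point concrete, here is a minimal sketch (not KGX code) that reads a pipe-delimited `synonym` column from CSV and re-emits each row as JSONL with a real list. The sample rows, the `MULTIVALUED` set, and the helper names are all assumptions made up for illustration; a real tool would take the list of multivalued slots from the schema.

```python
import csv
import io
import json

# Hypothetical KGX-style node table; the rows are invented for illustration.
NODES_CSV = (
    "id,category,name,synonym\n"
    "EX:1,biolink:Gene,gene one,g1|GENE-1|gene 1\n"
    "EX:2,biolink:Disease,disease two,\n"
)

# Assumed set of multivalued slots; a real tool would derive this from the schema.
MULTIVALUED = {"synonym"}


def rows_with_lists(csv_text, delimiter="|"):
    """Yield each row as a dict, splitting multivalued columns into lists."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        for slot in MULTIVALUED:
            raw = row.get(slot) or ""
            # An empty cell becomes an empty list rather than [""].
            row[slot] = raw.split(delimiter) if raw else []
        yield row


def to_jsonl(rows):
    """Serialize rows as one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)


jsonl = to_jsonl(rows_with_lists(NODES_CSV))
print(jsonl)
```

The same list-valued rows could then be loaded into any format with a native list type instead of being re-joined with pipes.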

This becomes even more relevant when we think about distributing closures with the kgx. I think this should be the default, and we should adopt the de facto monarch kgx standard devised by @kevinschaper, adapted from @kltm's golr. Basically subject_closure and object_closure. Except we can just represent each of these naturally as a list.
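As a rough sketch of what list-valued closures could look like on an edge, the snippet below walks a toy hierarchy and attaches `subject_closure` and `object_closure` as plain lists. The hierarchy, the identifiers, and the choice to include the term itself in its closure are assumptions for illustration only, not the Monarch implementation.

```python
# Hypothetical subclass hierarchy: term -> direct parents.
PARENTS = {
    "EX:c": ["EX:b"],
    "EX:b": ["EX:a"],
    "EX:a": [],
    "EX:x": ["EX:a"],
}


def closure(term, parents=PARENTS):
    """Return the term followed by all of its ancestors, without duplicates.

    Assumption: the closure is reflexive (it includes the term itself).
    """
    seen = []
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        stack.extend(parents.get(current, []))
    return seen


def add_closures(edge):
    """Return a copy of the edge with list-valued closure fields added."""
    enriched = dict(edge)
    enriched["subject_closure"] = closure(edge["subject"])
    enriched["object_closure"] = closure(edge["object"])
    return enriched


edge = {"subject": "EX:c", "predicate": "biolink:related_to", "object": "EX:x"}
enriched = add_closures(edge)
print(enriched)
```

With the closures stored as lists, a query such as "all edges whose subject is under EX:a" becomes a list-membership test rather than a string match on a delimited field.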

And modern tabular formats have better support for nested data models too, which could at last provide a solution to #218. duckdb supports both struct types and a json type, similar to pg's jsonb.

Of course these can all be serialized as csv using standard pipe delimiting, but this should be seen as the LCD serialization, for CSV only; we should give modern data scientists something better.
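The fallback direction can be sketched the same way: keep lists as the internal representation and flatten them to pipe-delimited strings only when writing CSV. This is a minimal illustration with hypothetical helper names and made-up rows; refusing values that already contain the delimiter is one possible policy, not something the thread specifies.

```python
import csv
import io


def flatten_row(row, delimiter="|"):
    """Join list values with the delimiter; leave scalar values untouched."""
    flat = {}
    for key, value in row.items():
        if isinstance(value, list):
            items = [str(item) for item in value]
            for item in items:
                # Assumed policy: fail loudly rather than write an ambiguous cell.
                if delimiter in item:
                    raise ValueError(f"{key!r} value contains {delimiter!r}: {item!r}")
            flat[key] = delimiter.join(items)
        else:
            flat[key] = value
    return flat


def to_csv(rows, fieldnames):
    """Write rows with list-valued fields as lowest-common-denominator CSV."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(flatten_row(row))
    return buffer.getvalue()


# Made-up rows using native lists for the multivalued slot.
rows = [
    {"id": "EX:1", "synonym": ["g1", "gene 1"]},
    {"id": "EX:2", "synonym": []},
]
csv_text = to_csv(rows, fieldnames=["id", "synonym"])
print(csv_text)
```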

@ptgolden

> Beyond that, this DB platform or its alternatives could replace some internal KGX operations, particularly the more memory-hungry ones like merge.

@caufieldjh, I noticed you ended up writing that merge code in kg-microbe-merge. Merging the Monarch ingests inevitably hits swap hard. Did you see a lot of memory savings there?

@caufieldjh
Collaborator Author

Ah, Harshad wrote that. I seem to recall that it did save on memory but not massively.

3 participants