Choose an embedded database that fits the "load" strategy use case of neume #246

TimDaub · 2022-08-30T09:26:24Z

fast writes
must it allow distributing data (single machine vs many machines)?
needs to be FOSS compatible and cannot be proprietary
needs to have a well-documented and well-maintained Node.js package
cannot be an extra process and must be embeddable (e.g. like better-sqlite)
must have as little feature overhead and complexity as possible
must allow us to create two-dimensional indexes, e.g. "The json data for block number 123 is at offset 456 in the database file". It seems a very basic key-value based storage is fine.
For ACID, we care most about
- Atomicity: A transaction must either be completely written to disk or fail completely
- Correctness: A transaction must never corrupt the database file
We don't care strictly about Isolation, and in some cases it can be fine if transactions race each other. Most of our crawler's results are additive and hence cumulative. Transactions that aren't topologically dependent can be written to disk in whatever order, and that's fine.
Durability: We don't care much what happens to in-flight transactions during a crash of neume, as long as we have a non-corrupt database that we can use to recover from the crash.
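
To make the "block number 123 is at offset 456" requirement concrete, here is a minimal sketch of an offset index over an append-only log. It is only an illustration of the idea, not a proposal for the final storage layer: the class name `OffsetIndexedLog` is hypothetical, and a `Buffer` stands in for the database file so that the example needs no dependencies.

```javascript
// Sketch only: a Buffer stands in for the database file, and a Map is the
// index from block number to { offset, length } within that "file".
class OffsetIndexedLog {
  constructor() {
    this.file = Buffer.alloc(0); // hypothetical stand-in for an on-disk file
    this.index = new Map(); // blockNumber -> { offset, length }
  }

  // Append the JSON for a block and remember where the record starts.
  put(blockNumber, data) {
    const record = Buffer.from(JSON.stringify(data) + "\n", "utf8");
    const offset = this.file.length;
    this.file = Buffer.concat([this.file, record]);
    this.index.set(blockNumber, { offset, length: record.length });
    return offset;
  }

  // Random access: jump straight to the record without scanning the file.
  get(blockNumber) {
    const entry = this.index.get(blockNumber);
    if (!entry) return undefined;
    const slice = this.file.subarray(entry.offset, entry.offset + entry.length);
    return JSON.parse(slice.toString("utf8"));
  }
}

const log = new OffsetIndexedLog();
log.put(122, { logs: [] });
const offset = log.put(123, { logs: ["0xabc"] });
console.log(offset, log.get(123));
```

An embedded key-value store would take over both the index and the atomic write to disk; the sketch only shows why a plain key-to-offset mapping seems sufficient for this use case.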

useful website

https://blog.actorsfit.com/a?ID=00750-2fcdb554-a2ca-444c-b681-11078940ffea

long-list

sqlite but with WAL mode enabled: https://github.com/WiseLibs/better-sqlite3
https://github.com/TryGhost/node-sqlite3
https://github.com/Level/level
https://github.com/typicode/lowdb
Berkeleydb and write our own C bindings
http://fallabs.com/kyotocabinet/
https://www.npmjs.com/package/rocksdb
https://www.npmjs.com/package/levelup
http://sophia.systems/drivers.html#node.js
https://github.com/hideo55/node-unqlite
https://github.com/cruppstahl/upscaledb

excluded:

https://github.com/louischatriot/nedb not maintained
redis/mysql/mariadb/postgres needs a separate process

il3ven · 2022-08-30T19:21:47Z

@TimDaub In issue #207 you mentioned that we are not interested in storing structured data. Does opening of this issue mean we are ready to convert our JSON into SQL tables?

TimDaub · 2022-08-30T21:37:28Z

For now, the most important thing is reducing the complexity of random access via indexes and complying neatly with the criteria outlined above. But since we're gonna build an API eventually, we might need to use a database that would allow us to join tables. For now, though, I personally don't see that need.

Unless, with music-os-accumulator, we're doing just that... joins...

TimDaub · 2022-09-08T08:00:43Z

Here's another use case for the load component.

We were able to confirm duplicate ids in all of
- call-block-logs-transformation
- logs-to-subgraph-transformation
- and eventually music-os-accumulation

by comparing against address/tokenId: https://github.com/neume-network/data/runs/8244367899?check_suite_focus=true#step:6:13

But circumventing these duplicates at the level of a call-block-logs-transformer or extractor is very difficult, at least as long as we're not generating unique ids and loading them into a database that ensures global consistency.

I also believe that we've never seen these duplicates before because, for catalog v1 and sound, we've been ensuring uniqueness through the nft id approach mentioned above:

strategies/src/strategies/music-os-accumulator/extractor.mjs

Lines 96 to 97 in 8f17695

map.set(id, data);

map.set(data.tokenURI, data);

But with Mintsongs v2 and catalog v2, this has now changed, and in music-os-accumulator we're not checking for duplicates anymore:

strategies/src/strategies/music-os-accumulator/extractor.mjs

Lines 199 to 200 in 8f17695

trackList = [...trackList, ...strategies[3].map];

trackList = [...trackList, ...strategies[4].map];

Here's the original issue from neume-network/data: "duplicate songs in music-os-accumulator" (data#43)
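
As a rough sketch of what a duplicate check keyed on address/tokenId could look like when merging several strategies' results: the function name `mergeUnique`, the track field names, and the choice to lowercase the address are all assumptions for illustration, not what the extractor currently does.

```javascript
// Hypothetical track shape: { address, tokenId, ... }. Field names are assumed.
function mergeUnique(trackLists) {
  const byId = new Map();
  const duplicates = [];
  for (const list of trackLists) {
    for (const track of list) {
      // Assumption: addresses are compared case-insensitively.
      const id = `${track.address.toLowerCase()}/${track.tokenId}`;
      if (byId.has(id)) {
        duplicates.push(id); // report it instead of silently appending
        continue; // the first occurrence wins
      }
      byId.set(id, track);
    }
  }
  return { tracks: [...byId.values()], duplicates };
}

const merged = mergeUnique([
  [{ address: "0xAbC", tokenId: "1", title: "a" }],
  [
    { address: "0xabc", tokenId: "1", title: "a (again)" },
    { address: "0xabc", tokenId: "2", title: "b" },
  ],
]);
console.log(merged.tracks.length, merged.duplicates);
```

This only catches duplicates within one accumulator run; a database with a unique key would enforce the same rule globally.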

TimDaub · 2022-09-16T15:28:20Z

note to myself: It'd be awesome if every strategy could define its identifier within neume itself, and then other identifiers could link to those buckets and identifiers with URIs, similar to how JSON-LD does it.

il3ven · 2022-09-22T15:45:16Z

I like https://www.sqlite.org/json1.html. Instead of having fixed tables we can store json in columns and also query it if needed. We can even create indexes on the json data for faster retrievals. Plus, sqlite is also battle tested.

> note to myself: It'd be awesome if every strategy could define its identifier within neume itself, and then other identifiers could link to those buckets and identifiers with URIs, similar to how JSON-LD does it.

In sqlite we should be able to do this with foreign keys.

TimDaub · 2022-09-28T15:51:03Z

I'd be all in for using the single thing that makes sqlite solve our use cases, but considering that we may want to distribute the crawl results later via a network like IPFS, as in this specification (neume-network/neuIPs#2), I think it'd be premature to use sqlite now. How about, for now, we add a load component and allow the strategy implementer to define an "identity" function for each line in the transformation flat file?

TimDaub · 2022-09-28T15:52:10Z

e.g. users should codify the ID function of this line and any other: https://github.com/neume-network/data/blob/8911801860195b50c743f0985c0a6abf61d3dcc0/results/mintsongs-get-tokenuri-transformation#L3
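
A minimal sketch of how such a load step might look, assuming each line of the flat file is a JSON document (that format, the `load` signature, and the `mintsongsIdentity` function with its `tokenId` field are hypothetical and only for illustration):

```javascript
// Sketch of a hypothetical load step: the strategy supplies identity(content),
// and the loader stores (id, content) pairs, rejecting ids it has seen before.
function load(flatFile, identity, store = new Map()) {
  const rejected = [];
  for (const line of flatFile.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines
    const content = JSON.parse(line);
    const id = identity(content);
    if (store.has(id)) {
      rejected.push(id);
      continue;
    }
    store.set(id, content);
  }
  return { store, rejected };
}

// Hypothetical identity function for one strategy; the field name is assumed.
const mintsongsIdentity = (content) => `mintsongs/${content.tokenId}`;

const flatFile = [
  JSON.stringify({ tokenId: "1", tokenURI: "ipfs://a" }),
  JSON.stringify({ tokenId: "2", tokenURI: "ipfs://b" }),
  JSON.stringify({ tokenId: "1", tokenURI: "ipfs://a" }),
].join("\n");

const result = load(flatFile, mintsongsIdentity);
console.log([...result.store.keys()], result.rejected);
```

The `Map` could later be swapped for whichever embedded store gets chosen, since the loader only needs `has` and `set` on a key.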

TimDaub pinned this issue Aug 30, 2022

TimDaub self-assigned this Aug 31, 2022

TimDaub mentioned this issue Sep 30, 2022

Write: String should always be a tuple of (id, content) #301

Open
