Drop MySQL dependency #145

Open
bmschmidt opened this issue Apr 23, 2021 · 1 comment

@bmschmidt

This is a big one that I put here partly just to see who's still interested in this repo!

I've been playing around a bit the last couple days with DuckDB, a columnar database that's the heir to MonetDB, which I had thought about for this project but never used.

Duck is much lighter than anything out there except SQLite, but unlike SQLite, it uses a column-oriented store, which is more appropriate for the queries here because related blocks of memory will be close together.

The purpose of MySQL for this project has always been:

  1. Handle the details of JOIN queries
  2. Build massive B-tree indices that allow reasonably fast access to wordid pages and put all the bookids for a given wordid contiguously on disk so that you don't have to seek to a million different places on the hard drive.
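To see why contiguity (point 2) matters so much, here is a toy illustration in Python. It is not how MySQL's B-trees or any real engine work internally; it only keeps a min/max summary per fixed-size block of rows and counts how many blocks a lookup for one wordid would have to touch. The data, the block size, and the function names are all made up for the example.

```python
import random


def build_block_summaries(wordids, block_size):
    """Return a (min, max, start, end) tuple for each fixed-size block of rows."""
    summaries = []
    for start in range(0, len(wordids), block_size):
        block = wordids[start:start + block_size]
        summaries.append((min(block), max(block), start, start + len(block)))
    return summaries


def blocks_to_scan(summaries, target):
    """Return the blocks whose [min, max] range could contain the target wordid."""
    return [s for s in summaries if s[0] <= target <= s[1]]


# Hypothetical toy data: 10,000 rows with wordids drawn from 0..99.
random.seed(0)
unsorted_ids = [random.randrange(100) for _ in range(10_000)]
sorted_ids = sorted(unsorted_ids)

# Same rows, same block size; only the physical order differs.
unsorted_hits = blocks_to_scan(build_block_summaries(unsorted_ids, 500), 9)
sorted_hits = blocks_to_scan(build_block_summaries(sorted_ids, 500), 9)

print(len(unsorted_hits), "blocks to read when unsorted")
print(len(sorted_hits), "blocks to read when sorted by wordid")
```

With unsorted rows, nearly every block's range covers the target, so the summaries exclude nothing. With rows sorted by wordid, the matches sit in one or two adjacent blocks.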

DuckDB can do the first fine. For the second, the built-in BRIN indexes turn out to be faster than MySQL, if you can handle the fairly difficult work of sorting a billion or so records before loading them into Duck. I've managed to sort the Rate My Professor bookworm master_bookcounts data in Apache Arrow Feather format (that's a whole different story: this can take days on MySQL for a trillion words, but I think I've got a decent O(N log N) multi-pass on-disk sort going), and once loaded it performs better than MySQL on some standard queries.
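For anyone unfamiliar with on-disk sorting, the general idea can be sketched as a textbook external merge sort. This is not the sort used for the bookworm data (that works on Arrow Feather files and is multi-pass); it is a minimal stdlib-only sketch that sorts memory-sized chunks, writes each one to a temporary run file, and then does a single k-way merge of the runs. The function names, the pickle-based run format, and the (wordid, bookid, count) row layout are assumptions for illustration.

```python
import heapq
import os
import pickle
import tempfile


def _write_run(sorted_rows, directory):
    """Write one already-sorted chunk to a temporary run file and return its path."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run")
    with os.fdopen(fd, "wb") as f:
        for row in sorted_rows:
            pickle.dump(row, f)
    return path


def _read_run(path):
    """Stream rows back from a run file one at a time."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return


def external_sort(rows, chunk_size, directory):
    """Yield rows in sorted order, buffering at most chunk_size rows while building runs."""
    run_paths, chunk = [], []
    for row in rows:
        chunk.append(row)
        if len(chunk) >= chunk_size:
            run_paths.append(_write_run(sorted(chunk), directory))
            chunk = []
    if chunk:
        run_paths.append(_write_run(sorted(chunk), directory))
    # heapq.merge keeps only one pending row per run in memory.
    yield from heapq.merge(*(_read_run(p) for p in run_paths))


# Hypothetical (wordid, bookid, count) rows; tuples sort by wordid, then bookid.
rows = [(3, 10, 1), (1, 12, 4), (2, 11, 2), (1, 10, 7), (3, 9, 5)]
with tempfile.TemporaryDirectory() as tmp:
    result = list(external_sort(rows, chunk_size=2, directory=tmp))
print(result)
```

At real scale a single merge over thousands of runs would hit open-file limits, which is one reason to merge in several passes instead.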

Plus, it doesn't have the expensive requirement for large in-memory tables, and the actual on-disk files seem to be a bit smaller, even though we're using 4-byte ints instead of 3-byte ints.

SELECT SUM(count), date_year FROM master_bookcounts NATURAL JOIN fastcat WHERE wordid = 9 GROUP BY date_year;

MySQL: 16 rows in set (5.246 sec)
DuckDB: 3.65 s

SELECT date_year, department, SUM(count) FROM master_bookcounts NATURAL JOIN catalog WHERE wordid = 118 GROUP BY date_year, department;

MySQL: 8624 rows in 1 min 2 sec (not a fair comparison, because we've always grouped on integer keys, not text keys, for department)
DuckDB: 3.78 s (still grouping on text keys!!)

@bmschmidt

Pull request here:

#146
