Drop MySQL dependency #145

bmschmidt · 2021-04-23T20:54:17Z

This is a big one that I put here partly just to see who's still interested in this repo!

I've been playing around a bit the last couple days with DuckDB, a columnar database that's the heir to MonetDB, which I had thought about for this project but never used.

Duck is much lighter than anything out there except SQLlite, but unlike SQLlite, does a column-oriented store more appropriate for the queries here because related blocks of memory will be close together.

The purpose of MySQL for this project has always been:

Handle the details of JOIN queries
Build massive B-tree indices that allow reasonably fast access to wordid pages and put all the bookids for a given wordid contiguously on disk so that you don't have to seek to a million different places on the hard drive.

Duck DB can do the first fine; and for the second, the builtin BRIN indexes turn out to be faster than MySQL if you can handle the fairly difficult work of sorting a billion records or so before loading it into duck. Having managed to sort the rate my professor bookworm master_bookcounts data in Apache Arrow feather format (that's a whole different story--this can takes days on MySQL for a trillion words, but I think I've got a decent O(N-log(N)) multi-pass on-disk sort going.), it performs better than MySQL on some standard queries.

Plus, it doesn't have the expensive request for large in-memory tables; and the actual on-disk files seems to be a bit smaller, even though we're using 4-byte ints instead of 3-byte ints.

SELECT SUM(count), date_year FROM master_bookcounts NATURAL JOIN fastcat WHERE wordid = 9 GROUP BY date_year;

MySQL: 16 rows in set (5.246 sec)
DuckDB: 3.65 s

SELECT date_year, department, SUM(count) FROM master_bookcounts NATURAL JOIN catalog WHERE wordid = 118 GROUP BY date_year, department;

8624 rows. 1 min 2 seconds in MySQL (Not fair because we've always grouped on integer keys, not text keys for department).
3.78 seconds in DuckDB (Still grouping on text keys!!)

bmschmidt · 2021-09-10T14:40:41Z

Pull request here:

#146

bmschmidt mentioned this issue Sep 10, 2021

Drop pandas dependency #147

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Drop MySQL dependency #145

Drop MySQL dependency #145

bmschmidt commented Apr 23, 2021

bmschmidt commented Sep 10, 2021

Drop MySQL dependency #145

Drop MySQL dependency #145

Comments

bmschmidt commented Apr 23, 2021

bmschmidt commented Sep 10, 2021