This is a big one that I put here partly just to see who's still interested in this repo!
I've been playing around over the last couple of days with DuckDB, a columnar database that's the heir to MonetDB, which I had considered for this project but never used.
DuckDB is much lighter than anything else out there except SQLite, but unlike SQLite, it uses a column-oriented store that's more appropriate for the queries here, because related blocks of memory end up close together on disk.
The purpose of MySQL for this project has always been to:

1. Handle the details of JOIN queries.
2. Build massive B-tree indices that allow reasonably fast access to wordid pages, and put all the bookids for a given wordid contiguously on disk so that you don't have to seek to a million different places on the hard drive.
DuckDB can do the first fine; and for the second, the built-in BRIN-style min/max indexes (zonemaps) turn out to be faster than MySQL, if you can handle the fairly difficult work of sorting a billion or so records before loading them into DuckDB. Having managed to sort the Rate My Professors bookworm's master_bookcounts data in Apache Arrow feather format (that's a whole different story--sorting can take days in MySQL for a trillion words, but I think I've got a decent O(N log N) multi-pass on-disk sort going; a sketch follows), it performs better than MySQL on some standard queries.
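For concreteness, here is a minimal sketch of the kind of multi-pass on-disk sort I mean, in Python with pyarrow (assuming a recent version, for `Table.sort_by` and `RecordBatch.filter`). This isn't the actual code; the bucket count, the wordid upper bound, and the column names (`wordid`, `bookid`) are all assumptions for illustration. Pass 1 streams the feather file and partitions rows into wordid-range buckets on disk; pass 2 sorts each bucket in memory and appends them in bucket order, which yields a globally sorted file.

```python
import os
import pyarrow as pa
import pyarrow.compute as pc

N_BUCKETS = 256                       # assumed; tune so one bucket fits in RAM
MAX_WORDID = 1 << 24                  # assumed upper bound on wordid
BUCKET_WIDTH = MAX_WORDID // N_BUCKETS

def external_sort(src_path, dst_path, tmp_dir="sort_tmp"):
    os.makedirs(tmp_dir, exist_ok=True)
    # Feather v2 is just the Arrow IPC file format, so the IPC reader
    # lets us stream it one record batch at a time.
    reader = pa.ipc.open_file(pa.memory_map(src_path))
    schema = reader.schema
    wordid_idx = schema.get_field_index("wordid")

    # Pass 1: partition the input into wordid-range buckets on disk.
    sinks = [pa.OSFile(os.path.join(tmp_dir, f"bucket_{b}.arrows"), "wb")
             for b in range(N_BUCKETS)]
    writers = [pa.ipc.new_stream(s, schema) for s in sinks]
    for i in range(reader.num_record_batches):
        batch = reader.get_batch(i)
        # Integer division maps each wordid to its bucket number.
        bucket_ids = pc.divide(batch.column(wordid_idx), BUCKET_WIDTH)
        for b in range(N_BUCKETS):
            part = batch.filter(pc.equal(bucket_ids, b))
            if part.num_rows:
                writers[b].write_batch(part)
    for w, s in zip(writers, sinks):
        w.close()
        s.close()

    # Pass 2: sort each bucket in memory and append it to the output in
    # bucket order, which yields a globally sorted file.
    with pa.OSFile(dst_path, "wb") as out_sink:
        writer = pa.ipc.new_file(out_sink, schema)
        for b in range(N_BUCKETS):
            path = os.path.join(tmp_dir, f"bucket_{b}.arrows")
            with pa.OSFile(path, "rb") as f:
                table = pa.ipc.open_stream(f).read_all()
            writer.write_table(table.sort_by(
                [("wordid", "ascending"), ("bookid", "ascending")]))
        writer.close()
```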
Plus, it doesn't have MySQL's expensive requirement for large in-memory tables; and the actual on-disk files seem to be a bit smaller, even though we're using 4-byte ints instead of 3-byte ints.
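Loading the sorted data into DuckDB from Python is then pleasantly short. Another sketch, with assumed file and table names: DuckDB can scan a registered Arrow table directly, and materializing it as a native table while the rows arrive already sorted by wordid is what lets the per-row-group min/max metadata skip nearly the whole file for a single-word query.

```python
import duckdb
import pyarrow.feather as feather

# Assumed file and database names, for illustration only.
sorted_counts = feather.read_table("master_bookcounts_sorted.feather")

con = duckdb.connect("bookworm.duckdb")
# Register the Arrow table so DuckDB can scan it, then persist it as a
# native table. Because rows arrive sorted by wordid, each row group's
# min/max wordid covers a narrow range, so a WHERE wordid = ... query
# can skip almost every row group.
con.register("sorted_counts", sorted_counts)
con.execute("CREATE TABLE master_bookcounts AS SELECT * FROM sorted_counts")

# Sanity check: a single-word tally through the Python API.
print(con.execute(
    "SELECT SUM(count) FROM master_bookcounts WHERE wordid = 9"
).fetchone())
```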
```sql
SELECT SUM(count), date_year
FROM master_bookcounts NATURAL JOIN fastcat
WHERE wordid = 9
GROUP BY date_year;
```
MySQL: 16 rows in set (5.246 sec)
DuckDB: 3.65 s
```sql
SELECT date_year, department, SUM(count)
FROM master_bookcounts NATURAL JOIN catalog
WHERE wordid = 118
GROUP BY date_year, department;
```
8624 rows.
MySQL: 1 min 2 sec (not a fair comparison, since we've always grouped on integer keys for department, not text keys).
DuckDB: 3.78 sec (still grouping on text keys!!)