- Interesting, I think we might have overshot the goal with the changes in #381. Readaheads are unfortunately our best bet right now, since we're still tied to blocking query execution within the web worker.
- @ankoh Now that DuckDB can persist ART indexes (https://duckdb.org/2022/07/27/art-storage.html), does this change the situation for querying remote parquet (say, from wasm)?
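For anyone skimming, here is a minimal sketch of what the remote-parquet-from-wasm path in this question looks like on the client, following the documented duckdb-wasm bundle setup; the URL, column names, and filter value below are placeholders, not anything from this thread. (As far as I understand, ART indexes are persisted inside DuckDB's own database files, so for an external parquet file the reader would still rely on row-group metadata for pruning.)

```ts
import * as duckdb from '@duckdb/duckdb-wasm';

// Sketch only: spin up duckdb-wasm in a worker and run a filtered query
// against a remote parquet file (placeholder URL and columns).
async function queryRemoteParquet(): Promise<void> {
    const bundles = duckdb.getJsDelivrBundles();
    const bundle = await duckdb.selectBundle(bundles);

    // Wrap the worker script so it can be loaded cross-origin.
    const workerUrl = URL.createObjectURL(
        new Blob([`importScripts("${bundle.mainWorker!}");`], { type: 'text/javascript' }),
    );
    const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), new Worker(workerUrl));
    await db.instantiate(bundle.mainModule, bundle.pthreadWorker);

    const conn = await db.connect();
    // DuckDB reads the parquet footer first and then range-requests row groups;
    // how much it reads beyond the pruned minimum is what this thread is about.
    const result = await conn.query(`
        SELECT word0, counts
        FROM 'https://example.org/ngrams.parquet'  -- placeholder
        WHERE word0 = 'duck'
    `);
    console.log(result.toArray());
    await conn.close();
}
```

Whether the bytes actually sent over the wire get close to that pruned minimum is, as the readahead comment above suggests, a separate question.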
- Thanks again for your work on this.
I'm curious if there are strategies for reducing the browser payload on remote parquet files. Putting this in 'discussions' because it's partly incumbent on the person uploading the data, although I wonder if duckdb-wasm can do more.
To test what's possible with partial queries on remote files, I've set up a static site (ngrams dot benschmidt dot org) that queries a 1GB parquet file with two columns: the words and counts for unigrams in one of the Google Ngrams databases (https://books.google.com/ngrams).
Thanks to the recent fixes, it runs pretty fast and caches parts of the database locally.
But the payload being sent to the browser is significantly larger than I'd expect. I'm searching by `word0`, which is sorted alphabetically so that the parquet metadata can indicate which chunks need to be read, and I've written the data in parquet with 130KB chunks to try to minimize transfer needs. But almost every transfer ends with a 16MB chunk; running three or four queries takes 60 MB or so.

I don't fully understand parquet metadata, but I thought it should be possible to check whether chunks are relevant without reading so many of them in. Back of the envelope, all that really needs to be read for this query is probably 1MB or so of range indexes and at most 4 parquet chunks of actual data. Any thoughts on how to get down towards that floor?
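As a hedged sketch of how one might measure that floor from the publisher's side (assuming an already-open DuckDB connection `conn` in Node, the CLI, or wasm; table and file names are placeholders): sort on the filter column, write explicit row groups, and then read back the per-row-group min/max statistics that a remote reader can prune on before fetching any data pages.

```ts
// Sketch only: `conn` is assumed to be an open DuckDB connection with a
// promise-based query() (e.g. a duckdb-wasm AsyncDuckDBConnection); the
// table name and file path are placeholders.
async function checkRowGroupStats(conn: { query: (sql: string) => Promise<any> }): Promise<void> {
    // 1. Write the data sorted on word0 with explicit row groups. Note that
    //    ROW_GROUP_SIZE is a row count, not a byte size, so it has to be
    //    tuned against the actual row width.
    await conn.query(`
        COPY (SELECT * FROM ngrams ORDER BY word0)
        TO 'ngrams.parquet' (FORMAT PARQUET, ROW_GROUP_SIZE 100000)
    `);

    // 2. Read back what a remote client can learn from the footer alone:
    //    one min/max pair and a compressed size per row group for word0.
    const stats = await conn.query(`
        SELECT row_group_id, stats_min_value, stats_max_value, total_compressed_size
        FROM parquet_metadata('ngrams.parquet')
        WHERE path_in_schema = 'word0'
        ORDER BY row_group_id
    `);
    console.log(stats.toArray());
}
```

If the min/max ranges come back tight and non-overlapping (which sorting on `word0` should give), a point filter should only need the footer plus a handful of row groups, which is roughly the 1MB-of-metadata-plus-a-few-chunks floor described above.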