- Interesting, I think we might have overshot the goal with the changes in #381. Readaheads are unfortunately our best bet right now, since we're still tied to blocking query execution within the web worker.
- @ankoh Now that DuckDB can persist ART indexes (https://duckdb.org/2022/07/27/art-storage.html), does this change the situation for querying remote parquet (say, from wasm)?
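For anyone skimming, here is a minimal sketch of what the remote-parquet-from-wasm path in this question looks like on the client, following the documented duckdb-wasm bundle setup; the URL, column names, and filter value below are placeholders, not anything from this thread. (As far as I understand, ART indexes are persisted inside DuckDB's own database files, so for an external parquet file the reader would still rely on row-group metadata for pruning.)

```ts
import * as duckdb from '@duckdb/duckdb-wasm';

// Sketch only: spin up duckdb-wasm in a worker and run a filtered query
// against a remote parquet file (placeholder URL and columns).
async function queryRemoteParquet(): Promise<void> {
    const bundles = duckdb.getJsDelivrBundles();
    const bundle = await duckdb.selectBundle(bundles);

    // Wrap the worker script so it can be loaded cross-origin.
    const workerUrl = URL.createObjectURL(
        new Blob([`importScripts("${bundle.mainWorker!}");`], { type: 'text/javascript' }),
    );
    const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), new Worker(workerUrl));
    await db.instantiate(bundle.mainModule, bundle.pthreadWorker);

    const conn = await db.connect();
    // DuckDB reads the parquet footer first and then range-requests row groups;
    // how much it reads beyond the pruned minimum is what this thread is about.
    const result = await conn.query(`
        SELECT word0, counts
        FROM 'https://example.org/ngrams.parquet'  -- placeholder
        WHERE word0 = 'duck'
    `);
    console.log(result.toArray());
    await conn.close();
}
```

Whether the bytes actually sent over the wire get close to that pruned minimum is, as the readahead comment above suggests, a separate question.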
- Thanks again for your work on this.
I'm curious if there are strategies for reducing the browser payload on remote parquet files. Putting this in 'discussions' because it's partly incumbent on the person uploading the data, although I wonder if duckdb-wasm can do more.
To test what's possible with partial queries on remote files, I've set up a static site (ngrams dot benschmidt dot org) that queries a 1GB parquet file with two columns: the words and counts for unigrams in one of the Google Ngrams databases (https://books.google.com/ngrams).
Thanks to the recent fixes, it runs pretty fast and caches parts of the database locally.
But the payload being sent to the browser is significantly larger than I'd expect. I'm searching by `word0`, which is sorted alphabetically so that the parquet metadata can indicate which chunks need to be read, and I've written the data in parquet with 130KB chunks to try to minimize transfer needs. But almost every transfer ends with a 16MB chunk; running three or four queries takes 60 MB or so.

I don't fully understand parquet metadata, but I thought it should be possible to check whether chunks are relevant without reading so many of them in. Back of the envelope, all that really needs to be read for this query is probably 1MB or so of range indexes and at most 4 parquet chunks of actual data. Any thoughts on how to get down towards that floor?
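As a hedged sketch of how one might measure that floor from the publisher's side (assuming an already-open DuckDB connection `conn` in Node, the CLI, or wasm; table and file names are placeholders): sort on the filter column, write explicit row groups, and then read back the per-row-group min/max statistics that a remote reader can prune on before fetching any data pages.

```ts
// Sketch only: `conn` is assumed to be an open DuckDB connection with a
// promise-based query() (e.g. a duckdb-wasm AsyncDuckDBConnection); the
// table name and file path are placeholders.
async function checkRowGroupStats(conn: { query: (sql: string) => Promise<any> }): Promise<void> {
    // 1. Write the data sorted on word0 with explicit row groups. Note that
    //    ROW_GROUP_SIZE is a row count, not a byte size, so it has to be
    //    tuned against the actual row width.
    await conn.query(`
        COPY (SELECT * FROM ngrams ORDER BY word0)
        TO 'ngrams.parquet' (FORMAT PARQUET, ROW_GROUP_SIZE 100000)
    `);

    // 2. Read back what a remote client can learn from the footer alone:
    //    one min/max pair and a compressed size per row group for word0.
    const stats = await conn.query(`
        SELECT row_group_id, stats_min_value, stats_max_value, total_compressed_size
        FROM parquet_metadata('ngrams.parquet')
        WHERE path_in_schema = 'word0'
        ORDER BY row_group_id
    `);
    console.log(stats.toArray());
}
```

If the min/max ranges come back tight and non-overlapping (which sorting on `word0` should give), a point filter should only need the footer plus a handful of row groups, which is roughly the 1MB-of-metadata-plus-a-few-chunks floor described above.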