When using GEO_ID filtering, we see longer load times than without it. This is counterintuitive, as we would expect the filtered query to require a smaller read from remote storage.
Some benchmarks
Without geo filtering
Query plan
```
SELECT [col("B17021_E006"), col("GEO_ID")] FROM
  Parquet SCAN https://popgetter.blob.core.windows.net/popgetter-cli-test/tracts_2019_fiveYear.parquet
  PROJECT */25318 COLUMNS
```
```
Benchmark 1: ./target/release/popgetter_cli
  Time (mean ± σ):      3.164 s ±  0.284 s    [User: 0.407 s, System: 0.159 s]
  Range (min … max):    2.684 s …  3.447 s    10 runs
```
With geo filtering
Query plan
```
FILTER col("GEO_ID").is_in([Series[geo_ids]]) FROM
  SELECT [col("B17021_E006"), col("GEO_ID")] FROM
    Parquet SCAN https://popgetter.blob.core.windows.net/popgetter-cli-test/tracts_2019_fiveYear.parquet
    PROJECT */25318 COLUMNS
```
```
Benchmark 1: ./target/release/popgetter_cli
  Time (mean ± σ):      7.296 s ±  0.312 s    [User: 4.364 s, System: 0.182 s]
  Range (min … max):    6.866 s …  8.064 s    10 runs
```
This is a bit odd, and I wonder whether the issue is the large header for this file (which has roughly 25,000 columns). Perhaps revisit this once we have the data split into multiple smaller Parquet files.
Questions
- Is the network IO the same for both commands? If so, Polars is likely fetching both the GEO_ID column and the required columns and then filtering locally. If that is what's happening, the question becomes why filtering 84,000 rows adds approximately 4 seconds to the execution.
- If the header read is slow and is for some reason happening multiple times when we filter rows, we might see an improvement with multiple Parquet files.
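On the first question, a back-of-envelope check (pure Python, fabricated GEO_IDs) suggests that a hash-set membership filter over 84,000 rows is millisecond-scale work, so the ~4 s overhead almost certainly comes from IO or plan execution rather than the filter itself:

```python
import time

rows = [f"1400000US{i:011d}" for i in range(84_000)]  # fake GEO_IDs
wanted = set(rows[::10])  # keep every tenth row

start = time.perf_counter()
kept = [r for r in rows if r in wanted]
elapsed = time.perf_counter() - start

print(f"filtered {len(rows)} rows down to {len(kept)} in {elapsed*1000:.1f} ms")
```

This points back at the header-read and network-IO hypotheses rather than the local filtering step.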