Explore issues with performance when using geo filtering with metrics #17

Open
stuartlynn opened this issue Apr 26, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@stuartlynn
Contributor

When using geoid filtering, we see longer load times than without it. This is a bit counterintuitive, as we would expect filtering to require a smaller read from the remote storage.

Some benchmarks

Without geo filtering

Query plan

 SELECT [col("B17021_E006"), col("GEO_ID")] FROM

    Parquet SCAN https://popgetter.blob.core.windows.net/popgetter-cli-test/tracts_2019_fiveYear.parquet
    PROJECT */25318 COLUMNS
Benchmark 1: ./target/release/popgetter_cli
  Time (mean ± σ):      3.164 s ±  0.284 s    [User: 0.407 s, System: 0.159 s]
  Range (min … max):    2.684 s …  3.447 s    10 runs

With geo filtering

Query plan

FILTER col("GEO_ID").is_in([Series[geo_ids]]) FROM
 SELECT [col("B17021_E006"), col("GEO_ID")] FROM

    Parquet SCAN https://popgetter.blob.core.windows.net/popgetter-cli-test/tracts_2019_fiveYear.parquet
    PROJECT */25318 COLUMNS
Benchmark 1: ./target/release/popgetter_cli
  Time (mean ± σ):      7.296 s ±  0.312 s    [User: 4.364 s, System: 0.182 s]
  Range (min … max):    6.866 s …  8.064 s    10 runs

This is a bit weird, and I am wondering if the issue is the large header for this file (which has about 25,000 columns). Perhaps revisit this once we have the data split into multiple smaller parquet files.
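One way to test the "large header" hypothesis is to measure how big the file metadata actually is. In the Parquet format the file metadata sits at the end of the file, followed by a 4-byte little-endian length and the `PAR1` magic bytes. The sketch below reads that length from a local copy of the file. The helper names are hypothetical, and it assumes an unencrypted file that has already been downloaded.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"


def footer_length_from_tail(tail: bytes) -> int:
    """Return the metadata size in bytes, given the last 8 bytes of a Parquet file."""
    if len(tail) != 8:
        raise ValueError("expected exactly the last 8 bytes of the file")
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("trailing PAR1 magic not found")
    # The 4 bytes before the magic hold the metadata length (little-endian uint32).
    (length,) = struct.unpack("<I", tail[:4])
    return length


def footer_length(path: str) -> int:
    """Read the metadata size from a local Parquet file (hypothetical helper)."""
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        return footer_length_from_tail(f.read(8))
```

If the reported size turns out to be many megabytes, reading and decoding the metadata could plausibly account for a meaningful share of the load time, though this alone would not explain why filtering makes it worse.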

Questions

  • Is the network IO the same for both commands? If so, polars is likely fetching both the geoid column and the required columns, then filtering locally. If that is what's happening, why is filtering 84,000 rows adding approx. 4 seconds to the execution?

  • If the header read is slow and for some reason this is happening multiple times when we are filtering rows, we might see an improvement with multiple parquet files.
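As a first pass at the network IO question, the benchmark figures above already give a hint. The sketch below only subtracts the mean timings reported in this issue. It is arithmetic on those numbers, not a new measurement.

```python
# Mean timings in seconds, copied from the two benchmark runs in this issue.
unfiltered = {"wall": 3.164, "user": 0.407, "system": 0.159}
filtered = {"wall": 7.296, "user": 4.364, "system": 0.182}

# Extra time attributable to adding the GEO_ID filter.
delta = {key: round(filtered[key] - unfiltered[key], 3) for key in unfiltered}

# Extra user CPU time relative to extra wall time.
user_share = delta["user"] / delta["wall"]

print(delta)
print(f"extra user CPU time relative to extra wall time: {user_share:.0%}")
```

The added user CPU time (about 3.96 s) is close to the added wall time (about 4.13 s), while system time barely moves. That suggests the extra cost is mostly local computation rather than waiting on the network, but it is only a hint: user time is summed across threads, so it does not map one-to-one onto wall time.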

Projects
Status: Backlog