Explore issues with performance when using geo filtering with metrics #17

stuartlynn · 2024-04-26T11:02:17Z

When filtering by GEO_ID, load times are longer than without filtering. This is counterintuitive, since we would expect filtering to require a smaller read from remote storage.

Some benchmarks

Without geo filtering

Query plan

 SELECT [col("B17021_E006"), col("GEO_ID")] FROM

    Parquet SCAN https://popgetter.blob.core.windows.net/popgetter-cli-test/tracts_2019_fiveYear.parquet
    PROJECT */25318 COLUMNS

Benchmark 1: ./target/release/popgetter_cli
  Time (mean ± σ):      3.164 s ±  0.284 s    [User: 0.407 s, System: 0.159 s]
  Range (min … max):    2.684 s …  3.447 s    10 runs

With geo filtering

Query plan

FILTER col("GEO_ID").is_in([Series[geo_ids]]) FROM
 SELECT [col("B17021_E006"), col("GEO_ID")] FROM

    Parquet SCAN https://popgetter.blob.core.windows.net/popgetter-cli-test/tracts_2019_fiveYear.parquet
    PROJECT */25318 COLUMNS

Benchmark 1: ./target/release/popgetter_cli
  Time (mean ± σ):      7.296 s ±  0.312 s    [User: 4.364 s, System: 0.182 s]
  Range (min … max):    6.866 s …  8.064 s    10 runs
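
As a rough check on where the extra time goes, the sketch below subtracts the mean timings of the two runs above. It only uses the numbers already reported; the interpretation is an assumption, since user time is summed across threads and need not map one-to-one onto wall time.

```python
# Mean timings copied from the two benchmark runs above (seconds).
without_filter = {"wall": 3.164, "user": 0.407, "system": 0.159}
with_filter = {"wall": 7.296, "user": 4.364, "system": 0.182}

# Extra time per category when the GEO_ID filter is applied.
delta = {k: round(with_filter[k] - without_filter[k], 3) for k in without_filter}

# Added user CPU time relative to added wall time.
user_share = delta["user"] / delta["wall"]

print(delta)
print(f"added user CPU time / added wall time: {user_share:.0%}")
```

Almost all of the added wall time shows up as added user CPU time, while system time barely moves. That would be consistent with the slowdown being local compute rather than extra waiting on the network, though it does not prove it.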

This is unexpected, and I wonder whether the cause is the large header (metadata) for this file, which has about 25,000 columns. We should revisit this once the data is split into multiple smaller parquet files.

Questions

Is the network IO the same for both commands? If so, polars is likely fetching both the GEO_ID column and the requested columns, then filtering locally. In that case, why does filtering 84,000 rows add approximately 4 seconds to the execution time?
If the header read is slow and for some reason happens multiple times when filtering rows, we might see an improvement with multiple parquet files.
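
For a sense of scale on the first question, here is a minimal baseline in plain Python: a hash-set membership filter over 84,000 synthetic rows. Everything in it is a stand-in (the ID strings, the metric values, and the choice of every 100th ID as the filter list are all hypothetical), and it says nothing about how polars implements `is_in`. It only gives a rough idea of what a local filter of this size might cost.

```python
import time

# Hypothetical stand-in data: 84,000 synthetic GEO_ID strings and one metric column.
N_ROWS = 84_000
geo_ids = [f"1400000US{i:011d}" for i in range(N_ROWS)]
values = list(range(N_ROWS))

# Hypothetical filter list: every 100th id, stored in a set for O(1) lookups.
wanted = set(geo_ids[::100])

start = time.perf_counter()
kept = [(g, v) for g, v in zip(geo_ids, values) if g in wanted]
elapsed = time.perf_counter() - start

print(f"kept {len(kept)} of {N_ROWS} rows in {elapsed:.4f} s")
```

If a baseline like this runs in a small fraction of a second, the row filter on its own is unlikely to explain the extra seconds, and something else in the filtered query path would be worth profiling.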

stuartlynn added the enhancement New feature or request label Apr 26, 2024

stuartlynn mentioned this issue Apr 26, 2024

Functions for reading metrics remotely from Azure #16

Merged

3 tasks

andrewphilipsmith added this to Popgetter Aug 8, 2024

andrewphilipsmith moved this to Import in Popgetter Aug 8, 2024
