scan_parquet panics when the file is bigger than 2^32 rows but the materialized query isn't; DuckDB and PyArrow can run the query. #20777

deanm0000 opened this issue Jan 17, 2025 · 0 comments
Labels
A-io (reading and writing data) · A-io-cloud (reading/writing to cloud storage) · A-io-parquet (reading/writing Parquet files) · A-panic (code that results in panic exceptions) · bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars) · rust (Related to Rust Polars)

deanm0000 commented Jan 17, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

use polars::prelude::*;

fn main() {
    let cloud_path = // some path
    let df = LazyFrame::scan_parquet(cloud_path, ScanArgsParquet::default())
        .unwrap()
        .filter(col("node_id").eq(lit(1)))
        .collect()
        .unwrap();
    eprintln!("{}", df);
}

The same error occurs from Python, of course:

df = pl.scan_parquet(cloud_path).filter(pl.col("node_id") == 1).collect()

Log output

Async thread count: 4
async download_chunk_size: 67108864
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
reading of 1/1 file...
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
parquet file can be skipped, the statistics were sufficient to apply the predicate.
parquet row group can be skipped, the statistics were sufficient to apply the predicate.
parquet file can be skipped, the statistics were sufficient to apply the predicate.
parquet row group can be skipped, the statistics were sufficient to apply the predicate.
[there are 15k row groups that get skipped]
POLARS ROW_GROUP PREFETCH_SIZE: 128
parquet scan with parallel = RowGroups
thread 'main' panicked at src/main.rs:8:20:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("Parquet file produces more than pow(2, 32) rows; consider compiling with polars-bigidx feature (polars-u64-idx package on python), or set 'streaming'"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Issue description

I ran it again with

let args = ScanArgsParquet {
    parallel: ParallelStrategy::Prefiltered,
    ..Default::default()
};
let df = LazyFrame::scan_parquet(cloud_path, args)
    .unwrap()
    .filter(col("node_id").eq(lit(1)))
    .collect()
    .unwrap();

but the log still says `parquet scan with parallel = RowGroups`. The same happens when I use `ParallelStrategy::None`.

Expected behavior

The row group is well under 2^32 rows, so Polars should be able to materialize it. PyArrow's dataset filter and DuckDB can each fetch the row group just fine.

I can sort of read a row group with

let async_reader = ParquetAsyncReader::from_uri(cloud_path, None, None).await.unwrap();
// I just hard-coded the first row group's slice bounds.
let df = async_reader.with_slice(Some((0, 710_196))).finish().await.unwrap();
eprintln!("{}", df);

but I get this weird panic after it prints the df.

thread 'tokio-runtime-worker' panicked at /home/dean/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/polars-io-0.45.1/src/cloud/polars_object_store.rs:257:29:
assertion `left == right` failed
  left: 14204928
 right: 18907009

It's weird because the left/right numbers change every time I run this.

When I compiled in release mode, I didn't get the tokio panic.

Installed versions

polars = { version = "0.45.1", features = [
    "json", "temporal", "timezones", "dtype-datetime", "strings", "dtype-date",
    "lazy", "parquet", "simd", "performant", "azure", "dtype-u8", "offset_by",
    "streaming", "partition_by", "is_in",
] }
polars-core = { version = "0.45.1" }
polars-io = { version = "0.45.1" }
polars-plan = { version = "0.45.1" }
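The panic message also points at the `bigidx` feature. A hedged Cargo.toml tweak to try it (feature list trimmed for brevity; whether it changes the outcome here is untested):

```toml
# bigidx switches Polars' row index from u32 to u64, lifting the 2^32 ceiling.
polars = { version = "0.45.1", features = ["lazy", "parquet", "bigidx"] }
```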
