scan_parquet panics when the file is bigger than 2^32 rows but the materialized query isn't; DuckDB and PyArrow can run the query. #20777

deanm0000 opened this issue Jan 17, 2025 · 0 comments
Labels
A-io (reading and writing data) · A-io-cloud (reading/writing to cloud storage) · A-io-parquet (reading/writing Parquet files) · A-panic (code that results in panic exceptions) · bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars) · rust (Related to Rust Polars)

deanm0000 commented Jan 17, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

use polars::prelude::*;

fn main() {
    let cloud_path = // some path
    let df = LazyFrame::scan_parquet(cloud_path, ScanArgsParquet::default())
        .unwrap()
        .filter(col("node_id").eq(lit(1)))
        .collect()
        .unwrap();
    eprintln!("{}", df);
}

The same error occurs from Python, of course:

df = pl.scan_parquet(cloud_path).filter(pl.col("node_id") == 1).collect()

Log output

Async thread count: 4
async download_chunk_size: 67108864
POLARS PREFETCH_SIZE: 32
querying metadata of 1/1 files...
reading of 1/1 file...
parquet file must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
parquet file can be skipped, the statistics were sufficient to apply the predicate.
parquet row group can be skipped, the statistics were sufficient to apply the predicate.
parquet file can be skipped, the statistics were sufficient to apply the predicate.
parquet row group can be skipped, the statistics were sufficient to apply the predicate.
[there are 15k row groups that get skipped]
POLARS ROW_GROUP PREFETCH_SIZE: 128
parquet scan with parallel = RowGroups
thread 'main' panicked at src/main.rs:8:20:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("Parquet file produces more than pow(2, 32) rows; consider compiling with polars-bigidx feature (polars-u64-idx package on python), or set 'streaming'"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Issue description

I ran it again with

let args = ScanArgsParquet {
    parallel: ParallelStrategy::Prefiltered,
    ..Default::default()
};
let df = LazyFrame::scan_parquet(cloud_path, args)
    .unwrap()
    .filter(col("node_id").eq(lit(1)))
    .collect()
    .unwrap();

but the log still says `parquet scan with parallel = RowGroups`. The same happens when I use `ParallelStrategy::None`.

Expected behavior

The row group is well under 2^32 rows, so Polars should be able to materialize it. PyArrow's dataset filter and DuckDB can each fetch the row group just fine.

I can sort of read a row group with

let async_reader = ParquetAsyncReader::from_uri(cloud_path, None, None).await.unwrap();
// I just hard-coded the first row group's slice bounds.
let df = async_reader.with_slice(Some((0, 710_196))).finish().await.unwrap();
eprintln!("{}", df);

but I get this weird panic after it prints the df.

thread 'tokio-runtime-worker' panicked at /home/dean/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/polars-io-0.45.1/src/cloud/polars_object_store.rs:257:29:
assertion `left == right` failed
  left: 14204928
 right: 18907009

It's weird because the left/right numbers change every time I run this.

When I compiled in release mode, I didn't get the tokio panic.

Installed versions

polars = { version = "0.45.1", features = [
    "json", "temporal", "timezones", "dtype-datetime", "strings", "dtype-date",
    "lazy", "parquet", "simd", "performant", "azure", "dtype-u8", "offset_by",
    "streaming", "partition_by", "is_in",
] }
polars-core = { version = "0.45.1" }
polars-io = { version = "0.45.1" }
polars-plan = { version = "0.45.1" }
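The panic message also points at the `bigidx` feature. A hedged Cargo.toml tweak to try it (feature list trimmed for brevity; whether it changes the outcome here is untested):

```toml
# bigidx switches Polars' row index from u32 to u64, lifting the 2^32 ceiling.
polars = { version = "0.45.1", features = ["lazy", "parquet", "bigidx"] }
```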
