How to properly validate a polars.LazyFrame? #1776

Open
csubhodeep opened this issue Aug 4, 2024 · 2 comments
Labels
question Further information is requested

Comments

csubhodeep commented Aug 4, 2024

Question about pandera

Hello pandera community, I am trying out pandera to validate a normal polars.LazyFrame as described in the first example in the docs.

If I understand the docs correctly, calling the validate method on a LazyFrame only checks the schema, by design. I have the following questions:

  1. What extra benefit does the user get from declaring a pandera.DataFrameSchema when they could just use the == operator to compare the LazyFrame's schema with a predefined polars.Schema object?
  2. If we want in-depth data validation on the LazyFrame, we have to call the collect method on it first. But suppose the frame has 50 columns and the pandera.DataFrameSchema declares only 3 of them: does it make sense to pull the remaining 47 columns into memory?

Would it make more sense to control this behaviour inside the validate method? That way pandera could add a projection selecting only the columns defined in the pandera.DataFrameSchema, run the validation checks, and finally call collect internally, instead of asking the user to call collect before validating.

csubhodeep added the question label Aug 4, 2024

butterlyn commented

For example (2), can't you just select the columns you want to validate before collecting?

csubhodeep commented Aug 6, 2024

> For example (2), can't you just select the columns you want to validate before collecting?

@butterlyn do we do the same for pandas? If not, then I am not sure why we need to make an exception in usage only for polars.
