
[BUGFIX] [PYSPARK] Avoid running nullable checks if nullable=True #1403

Merged

Conversation

filipeo2-mck
Contributor

@filipeo2-mck commented Nov 2, 2023

The nullable check for PySpark columns is currently triggered every time, both when the column does not set the nullable property (False by default) and when it is explicitly set to True.
When nullable=True, the check should not run at all: each run relies on a .count() PySpark action, which is very costly in time and CPU usage.

In my test scenario, with a complex Spark DAG and 42 nullable=True columns out of 92 fields in the schema, validation took 1h20m without this fix and 41m with it.

Added a corresponding test case for it.
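A minimal sketch of the skip logic this PR introduces (function and argument names are illustrative, not pandera's actual internals):

from pyspark.sql import DataFrame

# Illustrative guard: only scan the data when nulls are actually forbidden.
def column_has_no_nulls(df: DataFrame, column_name: str, nullable: bool) -> bool:
    if nullable:
        # nullable=True means nulls are allowed, so there is nothing to verify
        # and the costly .count() action can be skipped entirely.
        return True
    # Only a nullable=False expectation needs to trigger a Spark action.
    return df.filter(df[column_name].isNull()).count() == 0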


codecov bot commented Nov 2, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (37c24d9) 94.23% compared to head (4f6ce43) 94.23%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1403   +/-   ##
=======================================
  Coverage   94.23%   94.23%           
=======================================
  Files          91       91           
  Lines        6975     6976    +1     
=======================================
+ Hits         6573     6574    +1     
  Misses        402      402           
| Files | Coverage Δ |
| --- | --- |
| pandera/backends/pyspark/column.py | 69.41% <100.00%> (+0.36%) ⬆️ |


@filipeo2-mck changed the title from "[BUGFIX] [PYSPARK] Avoid running nullable checks if nullable=true" to "[BUGFIX] [PYSPARK] Avoid running nullable checks if nullable==True" Nov 2, 2023
@filipeo2-mck changed the title from "[BUGFIX] [PYSPARK] Avoid running nullable checks if nullable==True" to "[BUGFIX] [PYSPARK] Avoid running nullable checks if nullable=True" Nov 2, 2023
@filipeo2-mck marked this pull request as ready for review November 2, 2023 12:53
@filipeo2-mck marked this pull request as draft November 2, 2023 12:57
@filipeo2-mck marked this pull request as ready for review November 2, 2023 15:09
@filipeo2-mck
Contributor Author

@NeerajMalhotra-QB

@NeerajMalhotra-QB self-requested a review November 3, 2023 20:04
@NeerajMalhotra-QB
Collaborator

NeerajMalhotra-QB commented Nov 3, 2023

> The nullable check for PySpark columns is currently triggered every time, both when the column does not set the nullable property (False by default) and when it is explicitly set to True. When nullable=True, the check should not run at all: each run relies on a .count() PySpark action, which is very costly in time and CPU usage.

.count() is not necessarily always costly. If the source data is in an optimized format like Parquet, metadata has already been generated for it, and actions such as count, max, and min can often be answered from that metadata instead of being recomputed every time.

@NeerajMalhotra-QB
Collaborator

NeerajMalhotra-QB commented Nov 3, 2023

@filipeo2-mck good catch.. :)
Overall looks good to me, will approve once above change is added.

@filipeo2-mck
Contributor Author

> The nullable check for PySpark columns is currently triggered every time, both when the column does not set the nullable property (False by default) and when it is explicitly set to True. When nullable=True, the check should not run at all: each run relies on a .count() PySpark action, which is very costly in time and CPU usage.

> .count() is not necessarily always costly. If the source data is in an optimized format like Parquet, metadata has already been generated for it, and actions such as count, max, and min can often be answered from that metadata instead of being recomputed every time.

Agreed, but unfortunately that does not apply to checking for the presence of nulls :(

About:

> ...will approve once above change is added

Which change are you referring to?

@maxispeicher
Contributor

One question that came to mind while looking at the code: right now the check_nullable check is treated as a schema check; however, the way it is implemented, it is actually a data check.
Wouldn't it make more sense for the schema-level check_nullable to "just" check the column's dtype/nullability and see if it matches the nullable definition from the pandera type?
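A rough sketch of what such a schema-only comparison could look like (the helper below is illustrative, not pandera's internals):

from pyspark.sql import DataFrame

# Compare the Spark StructField's nullable flag with the nullability declared
# in the pandera schema, without touching the data.
def nullable_schema_ok(df: DataFrame, column_name: str, pandera_nullable: bool) -> bool:
    spark_nullable = df.schema[column_name].nullable
    if pandera_nullable:
        return True            # nulls allowed by pandera: nothing to enforce
    return not spark_nullable  # nulls forbidden: only guaranteed if Spark also forbids them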

@kasperjanehag

Looking good on my end! However, I agree with @maxispeicher's point. I believe a user who hasn't used Pandera before wouldn't expect pandera to actually check for null entries under ValidationScope.SCHEMA. @filipeo2-mck @NeerajMalhotra-QB WDYT?

@filipeo2-mck
Contributor Author

Hmm, interesting discussion, @maxispeicher...

My current understanding is that, as with unique, nullable is a schema-related property that needs the current column values in order to be checked. Both rely on the data to be verified, but they are not essentially a "data" validation; rather, they describe expected behavior at the column level. It's the schema that allows or disallows nulls for a column.

I did some research about this:

  • DBML is a format for defining database schemas only (it does not contain data-specific checks like we have in Pandera), and it has the nullable and unique properties in its column definitions:

    DBML (Database Markup Language) is an open-source DSL language designed to define and document database schemas and structures

  • In Spark, this attribute belongs to a StructField, which is used to build a schema object for Row definitions (StructType).

  • PyArrow follows the same approach as Spark, exposing it as a Field property inside a Schema definition.

  • On the other hand, SQL Server classifies it as a Constraint, which places it closer to the data.

I just followed the existing behavior when adding this bugfix. As we only have schema or data validation types to fit into, I don't disagree with the current implementation; we don't have a third option ("constraints"?) in the code to make use of.

Happy to see other opinions or other sources with additional info :)

@maxispeicher
Contributor

@filipeo2-mck I'm also fine with the current implementation from a logic perspective, but as you've also discovered these null-checks can become quite costly if you have a high number of columns.
However, to the best of my knowledge, pyspark already handles these null checks when creating DataFrames, which means it should be enough to compare the schema and be way faster. MWE:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([{"c1": 1}, {"c1": None}], schema="c1 int not null")

will raise ValueError: field c1: This field is not nullable, but got None

@filipeo2-mck
Contributor Author

It would be a good solution to avoid some of these count()s by comparing both schemas 👍

I tried it, but if I try to create the df under this invalid condition, the df object is not instantiated and we don't have a DataFrame to get a schema from:
(screenshot: createDataFrame raises ValueError: "field c1: This field is not nullable, but got None", and no DataFrame is returned)

Since the above condition is not valid for DataFrame creation, to create a valid one we need to allow nulls, ending up with a DataFrame definition that always has to be checked anyway:
(screenshot: the same data loaded successfully once the column is declared nullable)
If a PySpark column allows null values, we still need to check whether all the nullable=False fields defined in Pandera are met.

As Pandera validation happens after the creation of a valid df, I don't think this approach is feasible :(

Just let me know if I misinterpreted your idea/suggestion.
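For reference, both situations can be reproduced with a local SparkSession (the ValueError message is the one quoted above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With a non-nullable column, the DataFrame is never created from data that
# contains nulls, so pandera has nothing to validate in this scenario.
try:
    spark.createDataFrame([{"c1": 1}, {"c1": None}], schema="c1 int not null")
except ValueError as exc:
    print(exc)  # field c1: This field is not nullable, but got None

# To get a usable DataFrame the column has to allow nulls, so a pandera
# nullable=False expectation still has to be verified against the data.
df = spark.createDataFrame([{"c1": 1}, {"c1": None}], schema="c1 int")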

@kasperjanehag

Great discussion! :)

I guess it all comes down to whether nullable is a schema-related property or not:

  • In PySpark, StructField includes a nullable property that specifies whether a column can contain null values, reflecting this attribute as part of the schema definition.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Define a schema with nullable properties
schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True)
])
  • Similarly, PyArrow treats nullability as an attribute of a Field within a Schema object, aligning with PySpark's approach and further emphasizing that nullability is a schema-level constraint​​.
import pyarrow as pa

# Define a schema with nullable properties
schema = pa.schema([
    pa.field('id', pa.int32(), nullable=False),
    pa.field('name', pa.string(), nullable=True)
])
  • On the contrary, SQLServer categorizes nullability as a constraint, which is a rule enforced on the data, suggesting a closer relationship with data validation rather than just schema definition​​.
CREATE TABLE Persons (
    PersonID int NOT NULL,
    LastName varchar(255) NOT NULL,
    FirstName varchar(255),
    Address varchar(255),
    City varchar(255)
);

So take the example where PySpark loads Parquet files (or similar) into a DataFrame and infers the schema from the Parquet metadata. In this scenario I would expect a pandera schema check to simply inspect the schema (i.e. check which schema attributes were inferred when loading the data), while a combined schema-and-data check would both verify the schema and run the checks against the data.
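For illustration, the inferred nullable flags are already available on the DataFrame's schema in that scenario (the Parquet path below is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# When Spark loads Parquet, every column in the inferred schema carries a
# nullable flag that a schema-only check could compare against the pandera
# definition without scanning the data.
df = spark.read.parquet("/path/to/data.parquet")  # hypothetical path
for field in df.schema.fields:
    print(field.name, field.dataType, field.nullable)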

@kasperjanehag

@filipeo2-mck, on your comment:

> If a PySpark column allows null values, we still need to check whether all the nullable=False fields defined in Pandera are met.

> As Pandera validation happens after the creation of a valid df, I don't think this approach is feasible :(

Is your reasoning that, if we have a PySpark DataFrame created with the following code,

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

# Initialize Spark Session
spark = SparkSession.builder.appName("Example").getOrCreate()

# Define a schema with StructType and StructField
schema = StructType([
    StructField("c1", IntegerType(), nullable=True) 
])

df = spark.createDataFrame([{"c1": 1}, {"c1": None}], schema=schema)

then we still need to be able to check the c1 column for nullable=False in pandera's .validate()? I.e., even if a passed DataFrame has nullable=True, we still need the option to enforce nullable=False on that DataFrame in pandera?
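For example, something along these lines, assuming pandera's pyspark API (pandera.pyspark.DataFrameSchema / Column; the exact call is illustrative):

import pandera.pyspark as pa
from pyspark.sql.types import IntegerType

# Pandera forbids nulls even though the Spark column is declared nullable,
# so the data-level check still has to look at the values in c1.
pandera_schema = pa.DataFrameSchema({"c1": pa.Column(IntegerType(), nullable=False)})
validated = pandera_schema.validate(df)  # df from the snippet above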

@kasperjanehag

Sorry if I'm being slow, but to me that's still a schema check. 😅

@kasperjanehag

Additionally, coming back to:

> If a PySpark column allows null values, we still need to check whether all the nullable=False fields defined in Pandera are met.

Looking at the docs, I would argue that when you have a PySpark DataFrame that allows null values but your Pandera schema specifies nullable=False, a schema-level validation shouldn't perform any data-level checks beyond schema conformance. Or am I missing something, @NeerajMalhotra-QB @cosmicBboy @maxispeicher?

@filipeo2-mck
Contributor Author

filipeo2-mck commented Nov 7, 2023

I believe it's better if we work with the conditions in this table:

| underlying data | pyspark df's schema | pandera's schema | result |
| --- | --- | --- | --- |
| no nulls | nullable=False | nullable=False | Pandera checks for null data (as is, costly) |
| no nulls | nullable=False | nullable=True | Pandera was checking for nulls; this PR fixes it by skipping it 🆕 |
| no nulls | nullable=True | nullable=False | Pandera checks for null data (as is, costly) |
| no nulls | nullable=True | nullable=True | Pandera was checking for nulls; this PR fixes it by skipping it 🆕 |
| nulls | nullable=False | nullable=False | Impossible, a PySpark DF cannot be loaded in this scenario (as previously evidenced) |
| nulls | nullable=False | nullable=True | Impossible, a PySpark DF cannot be loaded in this scenario (as previously evidenced) |
| nulls | nullable=True | nullable=False | Pandera checks for null data (as is, costly) |
| nulls | nullable=True | nullable=True | Pandera was checking for nulls; this PR fixes it by skipping it 🆕 |

As I see it, the only possible improvement over the current state of this PR would be the first row (no nulls | nullable=False | nullable=False): the Pandera validation could check the nullable status of the DataFrame and skip the data check. Was this your idea, @maxispeicher?

@NeerajMalhotra-QB
Collaborator

> Additionally, coming back to:
>
> If a PySpark column allows null values, we still need to check whether all the nullable=False fields defined in Pandera are met.
>
> Looking at the docs, I would argue that when you have a PySpark DataFrame that allows null values but your Pandera schema specifies nullable=False, a schema-level validation shouldn't perform any data-level checks beyond schema conformance. Or am I missing something, @NeerajMalhotra-QB @cosmicBboy @maxispeicher?

Great discussion, I really liked it :)

My take on this would be to keep it simple. We don't need to overcomplicate things trying to solve every use case. Hence my recommendation, siding with @maxispeicher / @kasperjanehag, would be to use pandera's null check as a schema-level check, which is much more efficient than a data-level check.

And if a user wants a data-level validation for nulls, they can create a custom check function, register it, and use it in their applications for their specific needs.
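A rough sketch of that idea, using pandera's register_check_method extension point (the exact shape of the pyspark check argument is an assumption here and may differ from pandera's documented API):

from pandera.extensions import register_check_method
from pyspark.sql.functions import col

# Assumed interface: pandera passes an object exposing the DataFrame and the
# name of the column under validation; returning False marks the check failed.
@register_check_method
def has_no_nulls(pyspark_obj) -> bool:
    null_rows = pyspark_obj.dataframe.filter(col(pyspark_obj.column_name).isNull())
    return null_rows.limit(1).count() == 0

Once registered, such a check should be usable from a column definition like any other registered check.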

@filipeo2-mck
Contributor Author

filipeo2-mck commented Nov 7, 2023

> Is your reasoning that, if we have a PySpark DataFrame created with the following code, then we still need to be able to check the c1 column for nullable=False in pandera's .validate()?

Exactly. A Pandera schema is about expectations over the format/quality of the incoming data ("no null values can pass", in this case): if the DataFrame has nullable=True, the column may contain null values, and the Pandera schema, as an expectation, must evaluate it.

As mentioned in my previous comment, the only remaining situation where we could add a "skip" condition is when the DataFrame has nullable=False, as PySpark itself ensures that no null value will be present.

@maxispeicher
Contributor

@filipeo2-mck Yes, in this table I see the first row as a very common use case, indeed the default use case for non-null columns.
I know it's not a very representative example, but one case where this is very evident is unit tests: a DataFrame with very few rows but many columns, roughly 10x40, where all of the columns should be non-null. It would be nice to also use pandera directly in the tests to ensure this, but the test run time increases from ~2 s to ~55 s just to perform checks that could be reduced to a schema check.

I would also actually expect pandera to raise an error if the pyspark schema allows nulls while the pandera schema prohibits them, even if the actual data does not contain any nulls (third row in the table).

@filipeo2-mck
Contributor Author

> @filipeo2-mck Yes, in this table I see the first row as a very common use case, indeed the default use case for non-null columns.

Agreed. I'm implementing a new condition to cover this first row and avoid spending time running the check there. As soon as I have new timings, I'll push it and you can take a look, ok?

After all this great discussion, I now understand the different approach you were proposing: transforming the current data-level nullable check into a schema-level check only.
I don't want to implement it in this PR for a few reasons:

  • this change would replace the current behavior of the nullable check (which users already rely on) with a new one. It should require at least a minor version release, as it's not backwards compatible.
  • if the current field-level nullable changes its behavior to do only schema-level validation, and not the data validation it does today, how is the user supposed to do the latter? I'm not keen on leaving the burden of implementing a custom check for a simple nullable check to the end user. I believe a more appropriate solution would be a nullable check that adjusts the depth of its check according to the ValidationDepth value (SCHEMA_ONLY, DATA_ONLY or SCHEMA_AND_DATA), and that change will require a deeper update. @NeerajMalhotra-QB, WDYT about this second solution?
  • this PR is a small contribution to speed up the long processing time of the PySpark integration, as described in the issue I opened yesterday. For now, I'm just trying to optimize the processing time.

check schema-level information from both pyspark df and pandera schema before applying nullable check

Signed-off-by: Filipe Oliveira <[email protected]>
@filipeo2-mck
Contributor Author

@maxispeicher
Your suggestion regarding the first row of the table has been implemented, to avoid running the check when both are False. Can you take a look, please?

Updated table:

| underlying data | pyspark df's schema | pandera's schema | result |
| --- | --- | --- | --- |
| no nulls | nullable=False | nullable=False | Pandera was checking for nulls; this PR fixes it by skipping it 🆕 |
| no nulls | nullable=False | nullable=True | Pandera was checking for nulls; this PR fixes it by skipping it 🆕 |
| no nulls | nullable=True | nullable=False | Pandera checks for null data (as is, costly) |
| no nulls | nullable=True | nullable=True | Pandera was checking for nulls; this PR fixes it by skipping it 🆕 |
| nulls | nullable=False | nullable=False | Impossible, a PySpark DF cannot be loaded in this scenario (as previously evidenced) |
| nulls | nullable=False | nullable=True | Impossible, a PySpark DF cannot be loaded in this scenario (as previously evidenced) |
| nulls | nullable=True | nullable=False | Pandera checks for null data (as is, costly) |
| nulls | nullable=True | nullable=True | Pandera was checking for nulls; this PR fixes it by skipping it 🆕 |
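A condensed sketch of the resulting skip condition after this update (illustrative function, not pandera's internals):

# Skip the data-level null scan when pandera allows nulls, or when Spark's
# own schema already guarantees that no nulls can be present.
def should_check_nulls(spark_nullable: bool, pandera_nullable: bool) -> bool:
    if pandera_nullable:
        return False   # nulls allowed by pandera: nothing to verify
    if not spark_nullable:
        return False   # Spark forbids nulls: guaranteed by the DataFrame schema
    return True        # Spark allows nulls, pandera forbids them: scan the data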

@maxispeicher
Contributor

@filipeo2-mck Yes looks good to me. Thanks for including it :)

@kasperjanehag

@filipeo2-mck thanks for providing a great overview. I feel like we're reaching a common understanding! :)

The "different approach ... to transform the current data-level nullable check into a schema-level check only" is exactly what I've been after as well. Essentially it's a matter of expectations for a schema-level check: should it actually process the data, or only check the schema of the DataFrame?

Anyhow, to @filipeo2-mck's point:

  • I fully agree that this change would alter the current behavior of the nullable check and should require a minor version release, i.e. let's not do it as part of this PR.

  • I also believe a more appropriate solution would be a nullable check that adjusts the depth of its check according to the ValidationDepth value (SCHEMA_ONLY, DATA_ONLY or SCHEMA_AND_DATA). That's what I've meant all along but haven't been good at explaining 😅. For nullable that would mean (see the sketch after this list):

    • SCHEMA_ONLY: only verify that the DataFrame schema and the Pandera schema have the same setting for nullable=True/False.
    • DATA_ONLY: verify the data regardless of whether the schema is correct
    • SCHEMA_AND_DATA: do both
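A sketch of such a depth-aware check, assuming pandera's ValidationDepth enum from pandera.config (the helper itself is illustrative):

from pandera.config import ValidationDepth  # SCHEMA_ONLY, DATA_ONLY, SCHEMA_AND_DATA
from pyspark.sql import DataFrame

def check_nullable(df: DataFrame, column_name: str, pandera_nullable: bool,
                   depth: ValidationDepth) -> bool:
    schema_ok, data_ok = True, True
    if depth in (ValidationDepth.SCHEMA_ONLY, ValidationDepth.SCHEMA_AND_DATA):
        # Schema level: the Spark nullable flag must match the pandera setting.
        schema_ok = df.schema[column_name].nullable == pandera_nullable
    if depth in (ValidationDepth.DATA_ONLY, ValidationDepth.SCHEMA_AND_DATA):
        # Data level: scan for nulls only when they are forbidden.
        if not pandera_nullable:
            data_ok = df.filter(df[column_name].isNull()).count() == 0
    return schema_ok and data_ok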

Let's get this merged as is, and then work on a separate issue describing a potential change to this behaviour in the future?


@kasperjanehag left a comment


LGTM!

@filipeo2-mck
Contributor Author

> For nullable that would mean:
>
> • SCHEMA_ONLY: only verify that the DataFrame schema and the Pandera schema have the same setting for nullable=True/False.
> • DATA_ONLY: verify the data regardless of whether the schema is correct
> • SCHEMA_AND_DATA: do both

Agree! Thank you for the approval! :D

@cosmicBboy merged commit 58a3309 into unionai-oss:main Nov 10, 2023
@filipeo2-mck
Contributor Author

filipeo2-mck commented Nov 10, 2023

@maxispeicher, would you mind evaluating #1414 too, please? I couldn't tag you there. Thank you.

max-raphael pushed a commit to max-raphael/pandera that referenced this pull request Jan 24, 2025
[BUGFIX] [PYSPARK] Avoid running nullable checks if nullable=True (unionai-oss#1403)

* avoid running nullable checks if nullable=true

Signed-off-by: Filipe Oliveira <[email protected]>

* add corresponding test cases for nullable fields

Signed-off-by: Filipe Oliveira <[email protected]>

* check schema-level information from both pyspark df and pandera schema before applying nullable check

Signed-off-by: Filipe Oliveira <[email protected]>

---------

Signed-off-by: Filipe Oliveira <[email protected]>