[BUGFIX] [PYSPARK] Avoid running nullable checks if nullable=True
#1403
Conversation
Signed-off-by: Filipe Oliveira <[email protected]>
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff            @@
##              main    #1403   +/-   ##
=========================================
  Coverage    94.23%   94.23%
=========================================
  Files           91       91
  Lines         6975     6976     +1
=========================================
+ Hits          6573     6574     +1
  Misses         402      402
```

☔ View full report in Codecov by Sentry.
nullable==True
nullable=True
Signed-off-by: Filipe Oliveira <[email protected]>
Not necessarily that
@filipeo2-mck good catch! :)
Agree, unfortunately not for the presence of nulls :( About:
Which change are you referring to?
One question came to my mind while looking at the code: right now the
Looking good on my end! However, I agree with @maxispeicher's point. I believe a user who hasn't used Pandera before wouldn't expect pandera to actually check for null entries when
Hmm, interesting discussion @maxispeicher... My current understanding is that we should treat this the same way as we do for the other checks. I did some research about this:
I just followed it when adding this bugfix. As we only have schema or data validation types to fit into, I don't disagree with the current implementation. We don't have a third option ("constraints"?) in the code to make use of. Happy to see other opinions or other sources with additional info :)
@filipeo2-mck I'm also fine with the current implementation from a logic perspective, but as you've also discovered, these null checks can become quite costly if you have a high number of columns.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([{"c1": 1}, {"c1": None}], schema="c1 int not null")
```

will already raise an error when the DataFrame is created, because the schema forbids nulls.
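For context, here is a minimal sketch (assuming a locally running Spark session; the column name `c1` is taken from the example above) contrasting the cheap schema-level inspection with the data-level count that this discussion is trying to avoid:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([{"c1": 1}, {"c1": 2}], schema="c1 int not null")

# Schema-level inspection: reads DataFrame metadata only, no Spark job is launched.
print(df.schema["c1"].nullable)  # False

# Data-level inspection: triggers a Spark action, which is what gets expensive
# when it is repeated for many columns.
print(df.filter(df["c1"].isNull()).count())  # 0
```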
Great discussion! :) I guess it all comes down to whether nullable is treated as a schema-related property or not. In PySpark:

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Define a schema with nullable properties
schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True)
])
```

In PyArrow:

```python
import pyarrow as pa

# Define a schema with nullable properties
schema = pa.schema([
    pa.field('id', pa.int32(), nullable=False),
    pa.field('name', pa.string(), nullable=True)
])
```

In SQL:

```sql
CREATE TABLE Persons (
    PersonID int NOT NULL,
    LastName varchar(255) NOT NULL,
    FirstName varchar(255),
    Address varchar(255),
    City varchar(255)
);
```

So, for example, when PySpark loads Parquet files (or similar) into a DataFrame, it infers the schema from the Parquet metadata. In this scenario I would expect a schema-only pandera check to simply verify the schema attributes that were inferred while loading the data, whereas a combined schema and data check would both verify the schema and run the data-level checks.
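As an illustration of that Parquet scenario, a small sketch (the file path is hypothetical) showing how the nullability flags inferred from the file metadata can be inspected without scanning any data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical path, used only to illustrate schema inference from Parquet metadata.
df = spark.read.parquet("/tmp/persons.parquet")

# Inspect the inferred schema: name, type and nullable flag per column.
for field in df.schema.fields:
    print(field.name, field.dataType, field.nullable)
```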
@filipeo2-mck on your comment
Is your reasoning that, if we have a PySpark DataFrame created with the following code,

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

# Initialize Spark session
spark = SparkSession.builder.appName("Example").getOrCreate()

# Define a schema with StructType and StructField
schema = StructType([
    StructField("c1", IntegerType(), nullable=True)
])

df = spark.createDataFrame([{"c1": 1}, {"c1": None}], schema=schema)
```

then we still need to be able to check the
Sorry if I'm being slow, but to me that's still a schema check. 😅
Additionally, coming back to:
Looking at the docs, I would argue that in the case where you have a PySpark DataFrame which allows null values, but your Pandera validation specified
I believe it's better if we work with the conditions in this table:
As I see it, the only possible improvement over the existing state of this PR would be the first row (
Great discussion, I really liked it :) My take on this would be to keep it simple: we don't need to overcomplicate things trying to solve for every use case. Hence my recommendation, siding with @maxispeicher / @kasperjanehag, would be to treat nullable as a schema-level check. And if the user wants a data-level validation for
Exactly. A Pandera schema is about expectations over the format/quality of the data that is coming in ("no nulls in this column", for instance). As mentioned in my previous comment, the only possible improvement would be to add a "skip" condition when the dataframe's own PySpark schema already marks the column as non-nullable, since Spark then guarantees it contains no nulls.
@filipeo2-mck Yes, in this table I see the first row as a very common use case, or rather the default use case, for non-null columns. Also, I would actually expect pandera to raise an error if the PySpark schema allows nulls while the pandera schema prohibits them, even if the actual data does not contain any NaNs (third row in the table).
After all this great discussion I understood the different approach that you were proposing: to transform the current data-level nullable check into a schema-level check only.
check schema-level information from both pyspark df and pandera schema before applying nullable check

Signed-off-by: Filipe Oliveira <[email protected]>
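A rough sketch of the idea behind this follow-up commit (an illustrative helper, not pandera's actual implementation; the function name and signature are made up): consult both the pandera schema and the DataFrame's own PySpark schema before paying for a data-level null count.

```python
from pyspark.sql import DataFrame


def needs_null_count(df: DataFrame, column: str, pandera_nullable: bool) -> bool:
    """Hypothetical helper: decide whether a data-level null check is needed."""
    if pandera_nullable:
        # The pandera schema allows nulls, so there is nothing to verify.
        return False
    if not df.schema[column].nullable:
        # The PySpark schema already forbids nulls, so Spark guarantees the
        # column contains none; the costly .count() can be skipped.
        return False
    # Only in this case do we need the expensive data-level scan.
    return True
```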
@maxispeicher Updated table:
@filipeo2-mck Yes, looks good to me. Thanks for including it :)
@filipeo2-mck thanks for providing a great overview. I feel like we're reaching a common understanding! :) "The different approach that you were proposing to transform the current data-level nullable check into a schema-level check only" is what I've been after as well. Essentially it's a matter of expectations for a schema-level check (should it actually process the data or only check the schema of the DataFrame?). Anyhow, to @filipeo2-mck's point:
Let's get this merged as is, and then we can open a separate issue describing a potential change to this behaviour in the future?
LGTM!
Agree! Thank you for the approval! :D
@maxispeicher, would you mind evaluating #1414 too, please? I couldn't tag you there. Thank you.
[BUGFIX] [PYSPARK] Avoid running nullable checks if nullable=True (unionai-oss#1403)

* avoid running nullable checks if nullable=true
* add corresponding test cases for nullable fields
* check schema-level information from both pyspark df and pandera schema before applying nullable check

Signed-off-by: Filipe Oliveira <[email protected]>
The check for `nullable` columns in PySpark is being triggered every time: both when the column does not have the `nullable` property set (`False` by default) and when it is set to `True`. When it is set to `True`, it shouldn't run, because each run relies on a `.count()` PySpark action, which is very costly in terms of time and CPU usage.

In my test scenario, I have a complex Spark DAG with 42 `nullable=True` columns in a schema of 92 `Field`s in total. Without this fix the run took 1h20m; with this fix it decreased to 41m.

Added a corresponding test case for it.
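For illustration, a minimal sketch of the guard this PR describes (a hypothetical helper, not the actual pandera code): the `.count()`-based scan only runs when the pandera schema says the column must not contain nulls.

```python
from pyspark.sql import SparkSession


def column_satisfies_nullable(df, column: str, nullable: bool) -> bool:
    """Hypothetical helper mirroring the behaviour described above."""
    if nullable:
        # nullable=True: nulls are allowed, skip the costly .count() action entirely.
        return True
    # nullable=False: a single aggregation job counts the nulls in the column.
    return df.filter(df[column].isNull()).count() == 0


spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (None,)], schema="c1 int")
print(column_satisfies_nullable(df, "c1", nullable=True))   # True, no Spark job triggered
print(column_satisfies_nullable(df, "c1", nullable=False))  # False, one .count() action
```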