
[BUGFIX] [PYSPARK] Avoid running nullable checks if nullable=True #1403

Merged

Conversation

filipeo2-mck
Contributor

@filipeo2-mck commented Nov 2, 2023

The nullable check for PySpark columns is currently triggered every time, both when the column does not set the nullable property (False by default) and when it is explicitly set to True.
When nullable=True, the check should not run at all: each run relies on a .count() PySpark action, which is very costly in time and CPU usage.

In my test scenario, with a complex Spark DAG and 42 nullable=True columns out of 92 fields in the schema, validation took 1h20m without this fix and 41m with it.

Added a corresponding test case for it.
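A minimal sketch of the skip logic this PR introduces (function and argument names are illustrative, not pandera's actual internals):

from pyspark.sql import DataFrame

# Illustrative guard: only scan the data when nulls are actually forbidden.
def column_has_no_nulls(df: DataFrame, column_name: str, nullable: bool) -> bool:
    if nullable:
        # nullable=True means nulls are allowed, so there is nothing to verify
        # and the costly .count() action can be skipped entirely.
        return True
    # Only a nullable=False expectation needs to trigger a Spark action.
    return df.filter(df[column_name].isNull()).count() == 0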


codecov bot commented Nov 2, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (37c24d9) 94.23% compared to head (4f6ce43) 94.23%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1403   +/-   ##
=======================================
  Coverage   94.23%   94.23%           
=======================================
  Files          91       91           
  Lines        6975     6976    +1     
=======================================
+ Hits         6573     6574    +1     
  Misses        402      402           
| Files | Coverage Δ |
| --- | --- |
| pandera/backends/pyspark/column.py | 69.41% <100.00%> (+0.36%) ⬆️ |


@filipeo2-mck changed the title from "[BUGFIX] [PYSPARK] Avoid running nullable checks if nullable=true" to "[BUGFIX] [PYSPARK] Avoid running nullable checks if nullable==True" Nov 2, 2023
@filipeo2-mck changed the title from "[BUGFIX] [PYSPARK] Avoid running nullable checks if nullable==True" to "[BUGFIX] [PYSPARK] Avoid running nullable checks if nullable=True" Nov 2, 2023
@filipeo2-mck marked this pull request as ready for review November 2, 2023 12:53
@filipeo2-mck marked this pull request as draft November 2, 2023 12:57
@filipeo2-mck marked this pull request as ready for review November 2, 2023 15:09
@filipeo2-mck
Contributor Author

@NeerajMalhotra-QB

@NeerajMalhotra-QB self-requested a review November 3, 2023 20:04
@NeerajMalhotra-QB
Collaborator

NeerajMalhotra-QB commented Nov 3, 2023

> The nullable check for PySpark columns is currently triggered every time, both when the column does not set the nullable property (False by default) and when it is explicitly set to True. When nullable=True, the check should not run at all: each run relies on a .count() PySpark action, which is very costly in time and CPU usage.

.count() is not necessarily always costly. If the source data is in an optimized format like Parquet, metadata has already been generated for it, and actions such as count, max, and min can often be answered from that metadata instead of being recomputed every time.

@NeerajMalhotra-QB
Collaborator

NeerajMalhotra-QB commented Nov 3, 2023

@filipeo2-mck good catch.. :)
Overall looks good to me, will approve once above change is added.

@filipeo2-mck
Contributor Author

> The nullable check for PySpark columns is currently triggered every time, both when the column does not set the nullable property (False by default) and when it is explicitly set to True. When nullable=True, the check should not run at all: each run relies on a .count() PySpark action, which is very costly in time and CPU usage.

> .count() is not necessarily always costly. If the source data is in an optimized format like Parquet, metadata has already been generated for it, and actions such as count, max, and min can often be answered from that metadata instead of being recomputed every time.

Agreed, but unfortunately that does not apply to checking for the presence of nulls :(

About:

> ...will approve once above change is added

Which change are you referring to?

@maxispeicher
Contributor

One question that came to mind while looking at the code: right now the check_nullable check is treated as a schema check; however, the way it is implemented, it is actually a data check.
Wouldn't it make more sense for the schema-level check_nullable to "just" check the column's dtype/nullability and see if it matches the nullable definition from the pandera type?
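A rough sketch of what such a schema-only comparison could look like (the helper below is illustrative, not pandera's internals):

from pyspark.sql import DataFrame

# Compare the Spark StructField's nullable flag with the nullability declared
# in the pandera schema, without touching the data.
def nullable_schema_ok(df: DataFrame, column_name: str, pandera_nullable: bool) -> bool:
    spark_nullable = df.schema[column_name].nullable
    if pandera_nullable:
        return True            # nulls allowed by pandera: nothing to enforce
    return not spark_nullable  # nulls forbidden: only guaranteed if Spark also forbids them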

@kasperjanehag

Looking good on my end! However, I agree with @maxispeicher's point. I believe a user who hasn't used Pandera before wouldn't expect pandera to actually check for null entries under ValidationScope.SCHEMA. @filipeo2-mck @NeerajMalhotra-QB WDYT?

@filipeo2-mck
Contributor Author

Hmm, interesting discussion, @maxispeicher...

My current understanding is that, as with unique, nullable is a schema-related property that needs the current column values in order to be checked. Both rely on the data to be verified, but they are not essentially a "data" validation; rather, they describe expected behavior at the column level. It's the schema that allows or disallows nulls for a column.

I did some research about this:

  • DBML is a format for defining database schemas only (it does not contain data-specific checks like we have in Pandera), and it has the nullable and unique properties in its column definitions:

    DBML (Database Markup Language) is an open-source DSL language designed to define and document database schemas and structures

  • In Spark, this attribute belongs to a StructField, which is used to build a schema object for Row definitions (StructType).

  • PyArrow follows the same approach as Spark, exposing it as a Field property inside a Schema definition.

  • On the other hand, SQL Server classifies it as a Constraint, which places it closer to the data.

I just followed the existing behavior when adding this bugfix. As we only have schema or data validation types to fit into, I don't disagree with the current implementation; we don't have a third option ("constraints"?) in the code to make use of.

Happy to see other opinions or other sources with additional info :)

@maxispeicher
Contributor

@filipeo2-mck I'm also fine with the current implementation from a logic perspective, but as you've also discovered these null-checks can become quite costly if you have a high number of columns.
However, to the best of my knowledge, pyspark already handles these null checks when creating DataFrames, which means it should be enough to compare the schema and be way faster. MWE:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([{"c1": 1}, {"c1": None}], schema="c1 int not null")

will raise ValueError: field c1: This field is not nullable, but got None

@filipeo2-mck
Contributor Author

It would be a good solution to avoid some of these count()s by comparing both schemas 👍

I tried it, but if I try to create the df under this invalid condition, the df object is not instantiated and we don't have a DataFrame to get a schema from:
(screenshot: createDataFrame raises ValueError: "field c1: This field is not nullable, but got None", and no DataFrame is returned)

Since the above condition is not valid for DataFrame creation, to create a valid one we need to allow nulls, ending up with a DataFrame definition that always has to be checked anyway:
(screenshot: the same data loaded successfully once the column is declared nullable)
If a PySpark column allows null values, we still need to check whether all the nullable=False fields defined in Pandera are met.

As Pandera validation happens after the creation of a valid df, I don't think this approach is feasible :(

Just let me know if I misinterpreted your idea/suggestion.
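For reference, both situations can be reproduced with a local SparkSession (the ValueError message is the one quoted above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With a non-nullable column, the DataFrame is never created from data that
# contains nulls, so pandera has nothing to validate in this scenario.
try:
    spark.createDataFrame([{"c1": 1}, {"c1": None}], schema="c1 int not null")
except ValueError as exc:
    print(exc)  # field c1: This field is not nullable, but got None

# To get a usable DataFrame the column has to allow nulls, so a pandera
# nullable=False expectation still has to be verified against the data.
df = spark.createDataFrame([{"c1": 1}, {"c1": None}], schema="c1 int")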

@kasperjanehag

Great discussion! :)

I guess it all comes down to whether nullable is a schema-related property or not:

  • In PySpark, StructField includes a nullable property that specifies whether a column can contain null values, reflecting this attribute as part of the schema definition.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Define a schema with nullable properties
schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True)
])
  • Similarly, PyArrow treats nullability as an attribute of a Field within a Schema object, aligning with PySpark's approach and further emphasizing that nullability is a schema-level constraint​​.
import pyarrow as pa

# Define a schema with nullable properties
schema = pa.schema([
    pa.field('id', pa.int32(), nullable=False),
    pa.field('name', pa.string(), nullable=True)
])
  • On the contrary, SQLServer categorizes nullability as a constraint, which is a rule enforced on the data, suggesting a closer relationship with data validation rather than just schema definition​​.
CREATE TABLE Persons (
    PersonID int NOT NULL,
    LastName varchar(255) NOT NULL,
    FirstName varchar(255),
    Address varchar(255),
    City varchar(255)
);

So take the example where PySpark loads Parquet files (or similar) into a DataFrame and infers the schema from the Parquet metadata. In this scenario I would expect a pandera schema check to simply inspect the schema (i.e. check which schema attributes were inferred when loading the data), while a combined schema-and-data check would both verify the schema and run the checks against the data.
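For illustration, the inferred nullable flags are already available on the DataFrame's schema in that scenario (the Parquet path below is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# When Spark loads Parquet, every column in the inferred schema carries a
# nullable flag that a schema-only check could compare against the pandera
# definition without scanning the data.
df = spark.read.parquet("/path/to/data.parquet")  # hypothetical path
for field in df.schema.fields:
    print(field.name, field.dataType, field.nullable)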

@kasperjanehag

@filipeo2-mck, on your comment:

> If a PySpark column allows null values, we still need to check whether all the nullable=False fields defined in Pandera are met.

> As Pandera validation happens after the creation of a valid df, I don't think this approach is feasible :(

Is your reasoning that, if we have a PySpark DataFrame created with the following code,

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

# Initialize Spark Session
spark = SparkSession.builder.appName("Example").getOrCreate()

# Define a schema with StructType and StructField
schema = StructType([
    StructField("c1", IntegerType(), nullable=True) 
])

df = spark.createDataFrame([{"c1": 1}, {"c1": None}], schema=schema)

then we still need to be able to check the c1 column for nullable=False in pandera's .validate()? I.e., even if a passed DataFrame has nullable=True, we still need the option to enforce nullable=False on that DataFrame in pandera?
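For example, something along these lines, assuming pandera's pyspark API (pandera.pyspark.DataFrameSchema / Column; the exact call is illustrative):

import pandera.pyspark as pa
from pyspark.sql.types import IntegerType

# Pandera forbids nulls even though the Spark column is declared nullable,
# so the data-level check still has to look at the values in c1.
pandera_schema = pa.DataFrameSchema({"c1": pa.Column(IntegerType(), nullable=False)})
validated = pandera_schema.validate(df)  # df from the snippet above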

@kasperjanehag

Sorry if I'm being slow, but to me that's still a schema check. 😅

@kasperjanehag

Additionally, coming back to:

> If a PySpark column allows null values, we still need to check whether all the nullable=False fields defined in Pandera are met.

Looking at the docs, I would argue that when you have a PySpark DataFrame that allows null values but your Pandera schema specifies nullable=False, a schema-level validation shouldn't perform any data-level checks beyond schema conformance. Or am I missing something, @NeerajMalhotra-QB @cosmicBboy @maxispeicher?

@filipeo2-mck
Contributor Author

filipeo2-mck commented Nov 7, 2023

I believe it's better if we work with the conditions in this table:

| underlying data | pyspark df's schema | pandera's schema | result |
| --- | --- | --- | --- |
| no nulls | nullable=False | nullable=False | Pandera checks for null data (as is, costly) |
| no nulls | nullable=False | nullable=True | Pandera was checking for nulls; this PR fixes it by skipping it 🆕 |
| no nulls | nullable=True | nullable=False | Pandera checks for null data (as is, costly) |
| no nulls | nullable=True | nullable=True | Pandera was checking for nulls; this PR fixes it by skipping it 🆕 |
| nulls | nullable=False | nullable=False | Impossible, a PySpark DF cannot be loaded in this scenario (as previously evidenced) |
| nulls | nullable=False | nullable=True | Impossible, a PySpark DF cannot be loaded in this scenario (as previously evidenced) |
| nulls | nullable=True | nullable=False | Pandera checks for null data (as is, costly) |
| nulls | nullable=True | nullable=True | Pandera was checking for nulls; this PR fixes it by skipping it 🆕 |

As I see it, the only possible improvement over the current state of this PR would be the first row (no nulls | nullable=False | nullable=False): the Pandera validation could check the nullable status of the DataFrame and skip the data check. Was this your idea, @maxispeicher?

@NeerajMalhotra-QB
Collaborator

> Additionally, coming back to:
>
> If a PySpark column allows null values, we still need to check whether all the nullable=False fields defined in Pandera are met.
>
> Looking at the docs, I would argue that when you have a PySpark DataFrame that allows null values but your Pandera schema specifies nullable=False, a schema-level validation shouldn't perform any data-level checks beyond schema conformance. Or am I missing something, @NeerajMalhotra-QB @cosmicBboy @maxispeicher?

Great discussion, I really liked it :)

My take on this would be to keep it simple. We don't need to overcomplicate things trying to solve every use case. Hence my recommendation, siding with @maxispeicher / @kasperjanehag, would be to use pandera's null check as a schema-level check, which is much more efficient than a data-level check.

And if a user wants a data-level validation for nulls, they can create a custom check function, register it, and use it in their applications for their specific needs.
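A rough sketch of that idea, using pandera's register_check_method extension point (the exact shape of the pyspark check argument is an assumption here and may differ from pandera's documented API):

from pandera.extensions import register_check_method
from pyspark.sql.functions import col

# Assumed interface: pandera passes an object exposing the DataFrame and the
# name of the column under validation; returning False marks the check failed.
@register_check_method
def has_no_nulls(pyspark_obj) -> bool:
    null_rows = pyspark_obj.dataframe.filter(col(pyspark_obj.column_name).isNull())
    return null_rows.limit(1).count() == 0

Once registered, such a check should be usable from a column definition like any other registered check.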

@filipeo2-mck
Contributor Author

filipeo2-mck commented Nov 7, 2023

> Is your reasoning that, if we have a PySpark DataFrame created with the following code, then we still need to be able to check the c1 column for nullable=False in pandera's .validate()?

Exactly. A Pandera schema is about expectations over the format/quality of the incoming data ("no null values can pass", in this case): if the DataFrame has nullable=True, the column may contain null values, and the Pandera schema, as an expectation, must evaluate it.

As mentioned in my previous comment, the only remaining situation where we could add a "skip" condition is when the DataFrame has nullable=False, as PySpark itself ensures that no null value will be present.

@maxispeicher
Contributor

@filipeo2-mck Yes, in this table I see the first row as a very common use case, indeed the default use case for non-null columns.
I know it's not a very representative example, but one case where this is very evident is unit tests: a DataFrame with very few rows but many columns, roughly 10x40, where all of the columns should be non-null. It would be nice to also use pandera directly in the tests to ensure this, but the test run time increases from ~2 s to ~55 s just to perform checks that could be reduced to a schema check.

I would also actually expect pandera to raise an error if the pyspark schema allows nulls while the pandera schema prohibits them, even if the actual data does not contain any nulls (third row in the table).

@filipeo2-mck
Contributor Author

> @filipeo2-mck Yes, in this table I see the first row as a very common use case, indeed the default use case for non-null columns.

Agreed. I'm implementing a new condition to cover this first row and avoid spending time running the check there. As soon as I have new timings, I'll push it and you can take a look, ok?

After all this great discussion, I now understand the different approach you were proposing: transforming the current data-level nullable check into a schema-level check only.
I don't want to implement it in this PR for a few reasons:

  • this change would replace the current behavior of the nullable check (which users already rely on) with a new one. It should require at least a minor version release, as it's not backwards compatible.
  • if the current field-level nullable changes its behavior to do only schema-level validation, and not the data validation it does today, how is the user supposed to do the latter? I'm not keen on leaving the burden of implementing a custom check for a simple nullable check to the end user. I believe a more appropriate solution would be a nullable check that adjusts the depth of its check according to the ValidationDepth value (SCHEMA_ONLY, DATA_ONLY or SCHEMA_AND_DATA), and that change will require a deeper update. @NeerajMalhotra-QB, WDYT about this second solution?
  • this PR is a small contribution to speed up the long processing time of the PySpark integration, as described in the issue I opened yesterday. For now, I'm just trying to optimize the processing time.

check schema-level information from both pyspark df and pandera schema before applying nullable check

Signed-off-by: Filipe Oliveira <[email protected]>
@filipeo2-mck
Contributor Author

@maxispeicher
Your suggestion regarding the first row of the table has been implemented, to avoid running the check when both are False. Can you take a look, please?

Updated table:

| underlying data | pyspark df's schema | pandera's schema | result |
| --- | --- | --- | --- |
| no nulls | nullable=False | nullable=False | Pandera was checking for nulls; this PR fixes it by skipping it 🆕 |
| no nulls | nullable=False | nullable=True | Pandera was checking for nulls; this PR fixes it by skipping it 🆕 |
| no nulls | nullable=True | nullable=False | Pandera checks for null data (as is, costly) |
| no nulls | nullable=True | nullable=True | Pandera was checking for nulls; this PR fixes it by skipping it 🆕 |
| nulls | nullable=False | nullable=False | Impossible, a PySpark DF cannot be loaded in this scenario (as previously evidenced) |
| nulls | nullable=False | nullable=True | Impossible, a PySpark DF cannot be loaded in this scenario (as previously evidenced) |
| nulls | nullable=True | nullable=False | Pandera checks for null data (as is, costly) |
| nulls | nullable=True | nullable=True | Pandera was checking for nulls; this PR fixes it by skipping it 🆕 |
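A condensed sketch of the resulting skip condition after this update (illustrative function, not pandera's internals):

# Skip the data-level null scan when pandera allows nulls, or when Spark's
# own schema already guarantees that no nulls can be present.
def should_check_nulls(spark_nullable: bool, pandera_nullable: bool) -> bool:
    if pandera_nullable:
        return False   # nulls allowed by pandera: nothing to verify
    if not spark_nullable:
        return False   # Spark forbids nulls: guaranteed by the DataFrame schema
    return True        # Spark allows nulls, pandera forbids them: scan the data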

@maxispeicher
Contributor

@filipeo2-mck Yes looks good to me. Thanks for including it :)

@kasperjanehag

@filipeo2-mck thanks for providing a great overview. I feel like we're reaching a common understanding! :)

The "different approach ... to transform the current data-level nullable check into a schema-level check only" is exactly what I've been after as well. Essentially it's a matter of expectations for a schema-level check: should it actually process the data, or only check the schema of the DataFrame?

Anyhow, to @filipeo2-mck's point:

  • I fully agree that this change would alter the current behavior of the nullable check and should require a minor version release, i.e. let's not do it as part of this PR.

  • I also believe a more appropriate solution would be a nullable check that adjusts the depth of its check according to the ValidationDepth value (SCHEMA_ONLY, DATA_ONLY or SCHEMA_AND_DATA). That's what I've meant all along but haven't been good at explaining 😅. For nullable that would mean (see the sketch after this list):

    • SCHEMA_ONLY: only verify that the DataFrame schema and the Pandera schema have the same setting for nullable=True/False.
    • DATA_ONLY: verify the data regardless of whether the schema is correct
    • SCHEMA_AND_DATA: do both
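A sketch of such a depth-aware check, assuming pandera's ValidationDepth enum from pandera.config (the helper itself is illustrative):

from pandera.config import ValidationDepth  # SCHEMA_ONLY, DATA_ONLY, SCHEMA_AND_DATA
from pyspark.sql import DataFrame

def check_nullable(df: DataFrame, column_name: str, pandera_nullable: bool,
                   depth: ValidationDepth) -> bool:
    schema_ok, data_ok = True, True
    if depth in (ValidationDepth.SCHEMA_ONLY, ValidationDepth.SCHEMA_AND_DATA):
        # Schema level: the Spark nullable flag must match the pandera setting.
        schema_ok = df.schema[column_name].nullable == pandera_nullable
    if depth in (ValidationDepth.DATA_ONLY, ValidationDepth.SCHEMA_AND_DATA):
        # Data level: scan for nulls only when they are forbidden.
        if not pandera_nullable:
            data_ok = df.filter(df[column_name].isNull()).count() == 0
    return schema_ok and data_ok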

Let's get this merged as is, and then work on a separate issue describing a potential change to this behaviour in the future?


@kasperjanehag left a comment


LGTM!

@filipeo2-mck
Contributor Author

> For nullable that would mean:
>
> • SCHEMA_ONLY: only verify that the DataFrame schema and the Pandera schema have the same setting for nullable=True/False.
> • DATA_ONLY: verify the data regardless of whether the schema is correct
> • SCHEMA_AND_DATA: do both

Agree! Thank you for the approval! :D

@cosmicBboy merged commit 58a3309 into unionai-oss:main Nov 10, 2023
@filipeo2-mck
Contributor Author

filipeo2-mck commented Nov 10, 2023

@maxispeicher, would you mind evaluating #1414 too, please? I couldn't tag you there. Thank you.

max-raphael pushed a commit to max-raphael/pandera that referenced this pull request Jan 24, 2025
[BUGFIX] [PYSPARK] Avoid running nullable checks if nullable=True (unionai-oss#1403)

* avoid running nullable checks if nullable=true

Signed-off-by: Filipe Oliveira <[email protected]>

* add corresponding test cases for nullable fields

Signed-off-by: Filipe Oliveira <[email protected]>

* check schema-level information from both pyspark df and pandera schema before applying nullable check

Signed-off-by: Filipe Oliveira <[email protected]>

---------

Signed-off-by: Filipe Oliveira <[email protected]>