Support for multi type (Unions) in schemas and validation #1152
Comments
@vianmixtkz Great writeup. This is something that would be great for Pandera to support.
Thanks @vianmixtkz this is an interesting use case: the way pandas handles mixed-type columns is to represent the data in an One thing we should clarify in the semantics of this feature is the following: we can interpret
Do we need special syntax to differentiate between these two cases, or is that something that we leave to the pandera type engine to handle? I.e.:
Here, what I described matches case 2. That is, in a given column I'll have, for example, str on some rows and floats on other rows. With something like:

Case 1

class InputSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)
    comment: Union[Series[str], Series[float]] = pa.Field()  # comment is either only str or only float in a given DataFrame

Case 2

class InputSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)
    comment: Series[Union[str, float]] = pa.Field()  # comment is a column containing str on some rows and float on other rows

And yeah, I think the behavior you are describing is what users would expect.
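To make the two interpretations concrete, here is a minimal plain-Python sketch of the semantics only. It is not pandera's implementation, and the function names `is_single_type_column` and `is_mixed_type_column` are hypothetical, chosen for illustration.

```python
from typing import Iterable, Sequence


def is_single_type_column(values: Iterable[object], allowed: Sequence[type]) -> bool:
    """Case 1: the whole column is of ONE of the allowed types."""
    items = list(values)
    # At least one allowed type must match every value in the column.
    return any(all(isinstance(v, t) for v in items) for t in allowed)


def is_mixed_type_column(values: Iterable[object], allowed: Sequence[type]) -> bool:
    """Case 2: each value may independently be ANY of the allowed types."""
    return all(isinstance(v, tuple(allowed)) for v in values)


only_str = ["a", "b"]
mixed = ["a", 1.5]

print(is_single_type_column(only_str, [str, float]))  # homogeneous column passes case 1
print(is_single_type_column(mixed, [str, float]))     # mixed column fails case 1
print(is_mixed_type_column(mixed, [str, float]))      # mixed column passes case 2
```

Note that any column accepted under case 1 is also accepted under case 2, so the open question is mainly whether users need the stricter case 1 as a separate option.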
fix: unionai-oss#1152 I would like pandera to support Union Type. That is the validation of a Series/Column should allow multiple types. 1. Add a new PythonUnion type. 2. Add a new test for the new UnionType. Signed-off-by: karajan1001 <[email protected]>
Just bumping this thread. Any consensus on how to proceed? It seems like #1227 is stale.
Revisiting this issue and thinking about it a little bit, here's another proposal for this issue:

import pandera as pa
from pandera.engines.pandas_engine import Object
from typing import Annotated, Union

class Model(pa.DataFrameModel):
    union_column: Union[str, float]  # the column data type must be either a str or float
    object_column: Object = pa.Field(dtype_kwargs={"allowable_types": [str, float]})
    # or use the annotated types
    object_column: Annotated[Object, [str, float]]

This syntax is less ambiguous as to what the actual type of the column is vs. what the values within it are. However, it does require importing a special I'm still open to the more ambiguous behavior where
Re: this proposal: #1152 (comment) Unfortunately
I'm not a fan of this case
I am pretty sure I need this as well. I'm trying to create a DataFrameModel that expects a timestamp in the index. My ideal would be to validate that it has a timezone, but not specify which timezone. Another acceptable type hint would be to accept any timestamp, tz-naive or tz-aware, and then add custom checks around timezone manually. But right now, if I use
A proposal that goes beyond just checking for valid type(s): if a value doesn't fall into the set data type, allow a set of acceptable values instead. For example, a float field that can also contain "NA". I can't think of a personal use case for allowing all strings, but this would cover cases where data wasn't provided or was invalid. I still want to check all the values in the column, but I don't want to have to edit all the strings to null or 0.
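The "float, but the field can contain NA" idea can be sketched in plain Python as follows. This is only an illustration of the proposed semantics, not an existing pandera feature; `DEFAULT_SENTINELS`, `is_float_or_sentinel`, and `find_invalid` are hypothetical names, and treating integers as invalid is an assumption made to keep the check strict.

```python
from typing import Iterable, List, Tuple

# Hypothetical default; the comment above only mentions "NA" as an example.
DEFAULT_SENTINELS = frozenset({"NA"})


def is_float_or_sentinel(value: object, sentinels=DEFAULT_SENTINELS) -> bool:
    """Accept a float, or a string that is one of the allowed sentinel values."""
    if isinstance(value, float):
        return True
    return isinstance(value, str) and value in sentinels


def find_invalid(values: Iterable[object], sentinels=DEFAULT_SENTINELS) -> List[Tuple[int, object]]:
    """Return (position, value) pairs that are neither floats nor sentinels."""
    return [(i, v) for i, v in enumerate(values) if not is_float_or_sentinel(v, sentinels)]


column = [1.5, "NA", "n/a", 2.0, 3]
print(find_invalid(column))  # [(2, 'n/a'), (4, 3)]
```

Every value in the column is still checked, and the sentinel strings pass without having to be rewritten to null or 0 first.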
Is your feature request related to a problem? Please describe.
I would like pandera to support Union Type. That is the validation of a Series/Column should allow multiple types.
Pydantic allows it.
Here is an example of my issue
Describe the solution you'd like
I think it is the desired behavior for now to not allow Unions. But could you consider an option to allow it in the future?
Describe alternatives you've considered
Split the Union columns into multiple columns, one for each type, but this is not really something that I can control. Cf. next section.
Additional context
I have a valid use case for this. I am using pandas to handle CSVs where some columns contain hybrid data types.
I am using pandas for the preprocessing and pydantic for the validation, and I would like to use pandera to make this process (processing + validation) more robust.