feat: add pandera.io.to_pyarrow_schema
#1047
base: main
I didn't run those The other thing to consider is that this may all be moot. I see the PR for
import pyarrow
pyarrow.Schema.from_pandas(dataframe_schema.empty()) |
What's the status of this PR? I have a use-case that requires this. Is there a different supported way or are we still waiting on this? |
will need to resolve the merge conflicts and probably rebase this onto the current @the-matt-morris not sure if you want to pick this up again. I do think leveraging an empty method would make sense to fulfill this use case. However, the PR that implements the I do think a workaround for this would be:
import pandas as pd
import pandera as pa
import pyarrow
schema = pa.DataFrameSchema(..., coerce=True)
empty_df = schema.coerce_dtype(pd.DataFrame(columns=[*schema.columns]))
pyarrow.Schema.from_pandas(empty_df) |
@cosmicBboy this doesn't work for me. The schema infers all types as type
class TodoList(pa.DataFrameModel):
    int16: Series[pdt.Int16] = pa.Field()
    int_list: Series[list[int]] = pa.Field()
    str_list: Series[list[str]] = pa.Field()
    int16_list: Series[list[pdt.Int16]] = pa.Field()
    int16_List: Series[List[pdt.Int16]] = pa.Field()
def test_to_arrow():
    import pandas as pd
    import pyarrow
    schema = TodoList.to_schema()
    empty_df = schema.coerce_dtype(pd.DataFrame(columns=[*schema.columns]))
    schema = pyarrow.Schema.from_pandas(empty_df)
    logger.info(schema)
Output:
|
yeah, tried this out and I think the approach in this PR (i.e. a dedicated happy to review this or another PR that takes a crack at this, not sure if you want to continue tackling this @the-matt-morris |
FYI - I have a local copy of this where I am modifying it to work for my use-case. I probably need some guidance though as I had to do some custom reflection to handle the Happy to contribute this back if @the-matt-morris isn't available to finish this PR. |
It wasn't clear to me if I am supposed to use
class TodoItem(NamedTuple):
    name: str
    priority: int
    pd_uint8: pdt.UInt8
I am instead using reflection and mapping based on the Python types. I see this in the original PR:
pandas_types = {
    pd.BooleanDtype(): pa.bool_(),
    pd.Int8Dtype(): pa.int8(),
    pd.Int16Dtype(): pa.int16(),
    pd.Int32Dtype(): pa.int32(),
    pd.Int64Dtype(): pa.int64(),
    pd.UInt8Dtype(): pa.uint8(),
    pd.UInt16Dtype(): pa.uint16(),
    pd.UInt32Dtype(): pa.uint32(),
    pd.UInt64Dtype(): pa.uint64(),
    pd.Float32Dtype(): pa.float32(),  # type: ignore[attr-defined]
    pd.Float64Dtype(): pa.float64(),  # type: ignore[attr-defined]
    pd.StringDtype(): pa.string(),
}
I am just doing this:
elif python_type is pdt.UInt8:
    return pa.uint8()
elif python_type is pdt.UInt16:
    return pa.uint16()
elif python_type is pdt.UInt32:
    return pa.uint32()
elif python_type is pdt.UInt64:
    return pa.uint64()
elif python_type is pdt.Int8:
    return pa.int8()
elif python_type is pdt.Int16:
    return pa.int16()
elif python_type is pdt.Int32:
    return pa.int32()
elif python_type is pdt.Int64:
    return pa.int64()
elif python_type is pdt.Float32:
    return pa.float32()
elif python_type is pdt.Float64:
    return pa.float64()
elif python_type is pdt.String:
    return pa.string()
elif python_type is pdt.Bool:
Not sure what the trade-offs are. |
The mapping approach is faster and simpler (it's O(1) since it's a lookup table). This would probably work for most of the simple types. For things like lists and namedtuple types you'll have to use the if statements. In any case, feel free to create a new PR and we can iterate there. |
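The trade-off described above can be sketched as a hybrid: a dictionary handles the flat scalar types in one lookup, and a small recursive function falls back to reflection for lists and NamedTuples. This is only a minimal sketch, not the PR's implementation. The names `SCALAR_TYPES` and `to_arrow_type_name` are hypothetical, it covers plain Python types rather than the pandera ones, and plain strings stand in for the pyarrow data types so that it runs with the standard library alone.

```python
from typing import Any, List, NamedTuple, get_args, get_origin, get_type_hints

# Hypothetical lookup table. The string values are stand-ins for pyarrow
# types (for example "int64" in place of pyarrow.int64()).
SCALAR_TYPES = {
    bool: "bool",
    int: "int64",
    float: "float64",
    str: "string",
}


def to_arrow_type_name(python_type: Any) -> str:
    """Return a stand-in Arrow type name for a Python type annotation."""
    # Fast path: flat scalar types are a single dictionary lookup.
    if python_type in SCALAR_TYPES:
        return SCALAR_TYPES[python_type]

    # Parameterized lists (list[int], List[str]) need reflection on the
    # annotation, so they cannot live in the lookup table.
    if get_origin(python_type) is list:
        (item_type,) = get_args(python_type)
        return f"list<{to_arrow_type_name(item_type)}>"

    # NamedTuple classes become a struct with one field per annotated attribute.
    if (
        isinstance(python_type, type)
        and issubclass(python_type, tuple)
        and hasattr(python_type, "_fields")
    ):
        hints = get_type_hints(python_type)
        fields = ", ".join(
            f"{name}: {to_arrow_type_name(hints[name])}"
            for name in python_type._fields
        )
        return f"struct<{fields}>"

    # Anything else is rejected rather than silently coerced.
    raise TypeError(f"unsupported type: {python_type!r}")


class TodoItem(NamedTuple):
    name: str
    priority: int
```

With these definitions, `to_arrow_type_name(List[TodoItem])` returns `"list<struct<name: string, priority: int64>>"`. Swapping the strings for real pyarrow constructors would keep the same shape, with the recursive branches building list and struct types instead of formatting names.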
Hey @cosmicBboy I am sorry about this one. It has been a long time and I have a new GitHub account (yes, the name is nearly exactly the same :) Anyway, I can take a look at this one again, rebase and get the tests to pass. I must have gotten distracted, but it looks like there is at least some interest in getting this working. |
@the-matt-morris i recently forked and continued this work and have tested in production. Just didn't find the time to contribute it back. I'd be happy to open a PR or share a gist here with where I landed |
Oh that's great, thanks for picking it up! If you're nearly there I will stay out of your way, but let me know if you want me to contribute to it at all or have any questions on the approach I was taking. |
@sam-goodwin Any pointers you could share on your approach? |
hey @sam-goodwin friendly ping! would you mind sharing the changes you made on your fork?
|
Here's what I ended up with. There are known problems but it did work for our use-cases. https://gist.github.com/sam-goodwin/9b5ae19cc59f1349f362823454e31376 |
closes #689
Introduces pandera.io.to_pyarrow_schema
@cosmicBboy, one thing I was unsure about was this type hint. mypy correctly identifies that if conflicts with this type hint. However, I'm not sure whether there are any circumstances in which a key in DataFrameSchema.columns is not a string. I'm making the assumption in this function that it is always a string. Tests pass, but perhaps there is a situation in which this would be a problem that isn't covered by the unit tests?
Other assumptions:
pyarrow.date64() type is used when the pandera date data type cannot be inferred by pyarrow
DataFrameSchema with field(s) that are not typed. I suppose we could potentially force those to something, say pyarrow.string(), but I don't like the feel of doing something like that.
geopandas Geometries
Float128. We could potentially implement this, would just have to make an assumption about the precision and roll with it
Complex64, Complex128, Complex256
preserve_index to pandera.io.to_pyarrow_schema functions similarly to the preserve_index argument to pyarrow.Schema.from_pandas
pyarrow.list_(pyarrow.float64()).
Let me know if you feel like I missed any use cases in the unit tests.