[Feature request] coerce_number_to_str-like optional flag while reading data to handle known datatype inconsistencies #238

Open
DataEnggNerd opened this issue Oct 1, 2024 · 5 comments

@DataEnggNerd
DataEnggNerd commented Oct 1, 2024

While fetching data with find_polars_all, find_pandas_all, or find_arrow_all from pymongoarrow.api, the schema is inferred from the first document. If the same key holds a different data type in a later document, that value is returned as null.

MongoDB documents

[
    {
        "name": "test",
        "code": "1"
    },
    {
        "name": "test",
        "code": 1
    }
]

Current implementation

from pymongoarrow.api import find_polars_all

query_result_df = find_polars_all(
    collection=client,
    query=query,
)
query_result_df
# Schema([('_id', Binary), ('name', String), ('code', String)]), Shape ==> (2, 3)
# shape: (2, 3)
# ┌─────────────────────────────────┬──────┬──────┐
# │ _id                             ┆ name ┆ code │
# │ ---                             ┆ ---  ┆ ---  │
# │ binary                          ┆ str  ┆ str  │
# ╞═════════════════════════════════╪══════╪══════╡
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1    │
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ null │
# └─────────────────────────────────┴──────┴──────┘

For known discrepancies like this, where the first document holds a pyarrow.string() value and subsequent documents hold pyarrow.int*() values, the field could be read as pyarrow.string() throughout by adding an optional parameter, coerce_number_to_str, to all find_* APIs.
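To make the requested semantics concrete, here is a minimal pure-Python sketch of what such a coercion might do to the sample documents. The helper name coerce_numbers_to_str is hypothetical and is not part of pymongoarrow; it only illustrates the intended result, not how the library would implement it.

```python
def coerce_numbers_to_str(docs, field):
    """Return copies of docs with numeric values of `field` converted to str.

    Hypothetical helper for illustration only; not a pymongoarrow API.
    """
    coerced = []
    for doc in docs:
        value = doc.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            doc = {**doc, field: str(value)}
        coerced.append(doc)
    return coerced


docs = [
    {"name": "test", "code": "1"},
    {"name": "test", "code": 1},
]
print(coerce_numbers_to_str(docs, "code"))
# [{'name': 'test', 'code': '1'}, {'name': 'test', 'code': '1'}]
```

The input documents are left unmodified; only the returned copies carry the string values.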

Expected implementation

from pymongoarrow.api import find_polars_all

query_result_df = find_polars_all(
    collection=client,
    query=query,
    coerce_number_to_str=True,
)
query_result_df
# Schema([('_id', Binary), ('name', String), ('code', String)]), Shape ==> (2, 3)
# shape: (2, 3)
# ┌─────────────────────────────────┬──────┬──────┐
# │ _id                             ┆ name ┆ code │
# │ ---                             ┆ ---  ┆ ---  │
# │ binary                          ┆ str  ┆ str  │
# ╞═════════════════════════════════╪══════╪══════╡
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1    │
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1    │
# └─────────────────────────────────┴──────┴──────┘

Reference - coerce_numbers_to_str in https://docs.pydantic.dev/latest/api/fields/#pydantic.fields.Field
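Until something like this exists, a possible workaround is to cast the field on the server with MongoDB's $toString aggregation operator and read the result through aggregate_polars_all. The sketch below is an assumption about how that could look, not something discussed in this issue; the helper names to_string_stage and find_polars_with_str_fields are made up for illustration.

```python
def to_string_stage(fields):
    """Build an $addFields stage that casts each listed field with $toString."""
    return {"$addFields": {name: {"$toString": f"${name}"} for name in fields}}


def find_polars_with_str_fields(collection, query, fields):
    """Run the query as an aggregation so the cast happens before Arrow conversion."""
    # Imported lazily so the stage builder above works without pymongoarrow installed.
    from pymongoarrow.api import aggregate_polars_all

    pipeline = [{"$match": query}, to_string_stage(fields)]
    return aggregate_polars_all(collection, pipeline)


print(to_string_stage(["code"]))
# {'$addFields': {'code': {'$toString': '$code'}}}
```

Because every document leaves the pipeline with a string in code, schema inference from the first document no longer disagrees with later documents for that field.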

@aclark4life
Contributor

Thank you! Tracking in JIRA https://jira.mongodb.org/browse/ARROW-252

@DataEnggNerd
Author

@aclark4life I have seen the comment in the attached Jira ticket. Shall we discuss the proposed change here?

@aclark4life
Contributor

> @aclark4life I have seen the comment in the attached Jira ticket. Shall we discuss the proposed change here?

Yes! Are you able to send a PR with the proposed changes?

@DataEnggNerd
Author

@aclark4life I would like to discuss the design before getting into the implementation.
In Jira I saw a suggestion to add a new data type, which I am fine with.
But with such an implementation, I would expect to pass a schema only for the affected field. Also, how would a schema be passed for nested keys?

Any help is appreciated.
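One way the nested-key question could be approached is to address fields by MongoDB-style dotted paths such as detail.code. The following pure-Python sketch only illustrates that idea on a plain dict; coerce_path_to_str and the detail sub-document are hypothetical and do not reflect any existing or planned pymongoarrow API.

```python
import copy


def coerce_path_to_str(doc, dotted_path):
    """Return a copy of doc with the number at dotted_path converted to str.

    Hypothetical helper for illustration only; not a pymongoarrow API.
    """
    result = copy.deepcopy(doc)
    *parents, leaf = dotted_path.split(".")
    node = result
    for key in parents:
        node = node.get(key)
        if not isinstance(node, dict):
            return result  # path is absent or does not lead to a sub-document
    value = node.get(leaf)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        node[leaf] = str(value)
    return result


doc = {"name": "test", "detail": {"code": 1}}
print(coerce_path_to_str(doc, "detail.code"))
# {'name': 'test', 'detail': {'code': '1'}}
```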

@aclark4life
Contributor

No problem! Does this help at all? https://mongo-arrow.readthedocs.io/en/1.3.0/schemas.html#nested-data-with-schema I believe we're in agreement that we could support adding a new field type StrToIntField or IntToStrField as @ShaneHarvey suggested.
