[Feature request] `coerce_number_to_str`-like optional flag while reading data to handle known datatype inconsistencies #238

DataEnggNerd · 2024-10-01T14:27:24Z

When fetching data with find_polars_all, find_pandas_all, or find_arrow_all from pymongoarrow.api, the schema is inferred from the first document. If the same key holds a different data type in a later document, that value is read as null.

MongoDB documents

[
    {
        "name": "test",
        "code": "1"
    },
    {
        "name": "test",
        "code": 1
    }
]

Current implementation

from pymongoarrow.api import find_polars_all

query_result_df = find_polars_all(
    collection=client,
    query=query
)
query_result_df
# Schema([('_id', Binary), ('name', String), ('code', String)]), Shape ==> (2, 3)
# shape: (2, 3)
# ┌─────────────────────────────────┬──────┬──────┐
# │ _id                             ┆ name ┆ code │
# │ ---                             ┆ ---  ┆ ---  │
# │ binary                          ┆ str  ┆ str  │
# ╞═════════════════════════════════╪══════╪══════╡
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1    │
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ null │
# └─────────────────────────────────┴──────┴──────┘

For known discrepancies like this, where the first document holds a string (pyarrow.string()) and subsequent documents hold integers (pyarrow.int*()), every value could be read as pyarrow.string() by adding an optional parameter coerce_number_to_str to all find_* APIs.

Expected implementation

from pymongoarrow.api import find_polars_all

query_result_df = find_polars_all(
    collection=client,
    query=query,
    coerce_number_to_str=True  # proposed parameter, does not exist yet
)
query_result_df
# Schema([('_id', Binary), ('name', String), ('code', String)]), Shape ==> (2, 3)
# shape: (2, 3)
# ┌─────────────────────────────────┬──────┬──────┐
# │ _id                             ┆ name ┆ code │
# │ ---                             ┆ ---  ┆ ---  │
# │ binary                          ┆ str  ┆ str  │
# ╞═════════════════════════════════╪══════╪══════╡
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1    │
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1    │
# └─────────────────────────────────┴──────┴──────┘

Reference - coerce_numbers_to_str in https://docs.pydantic.dev/latest/api/fields/#pydantic.fields.Field
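
Until such a flag exists, the requested behavior can be approximated on the client side. The following is only a minimal sketch in plain Python: `coerce_numbers_to_str` here is a hypothetical helper, not part of pymongoarrow, and it works on documents that have already been fetched as dicts (for example with a regular pymongo `find`).

```python
from numbers import Number


def coerce_numbers_to_str(docs, keys):
    """Return copies of docs with numeric values under `keys` cast to str.

    Hypothetical helper for illustration; not part of pymongoarrow.
    """
    coerced = []
    for doc in docs:
        new_doc = dict(doc)  # shallow copy so the input is left untouched
        for key in keys:
            value = new_doc.get(key)
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, Number) and not isinstance(value, bool):
                new_doc[key] = str(value)
        coerced.append(new_doc)
    return coerced


# The two documents from the example above.
docs = [
    {"name": "test", "code": "1"},
    {"name": "test", "code": 1},
]
result = coerce_numbers_to_str(docs, keys=["code"])
print(result)
```

The coerced list could then be handed to something like `pyarrow.Table.from_pylist`. The trade-off is that this gives up the direct BSON-to-Arrow path that the find_* functions provide, which is why a built-in option would still be preferable.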


aclark4life · 2024-10-01T19:06:14Z

Thank you! Tracking in JIRA https://jira.mongodb.org/browse/ARROW-252

DataEnggNerd · 2024-10-22T07:57:06Z

@aclark4life I have seen the comment in the attached Jira ticket. Shall we discuss the proposed change here?

aclark4life · 2024-10-22T16:17:54Z

> @aclark4life I have seen the comment in the attached Jira ticket. Shall we discuss the proposed change here?

Yes! Are you able to send a PR with the proposed changes?

DataEnggNerd · 2024-10-24T07:06:21Z

@aclark4life I would like to discuss the design before getting into the implementation.
In Jira I saw the suggestion of a new data type, which I am fine with.
But with that approach, I would expect to pass a schema only for the affected field. And how would a schema be passed for nested keys?

Any help is appreciated.

aclark4life · 2024-10-25T16:19:36Z

No problem! Does this help at all? https://mongo-arrow.readthedocs.io/en/1.3.0/schemas.html#nested-data-with-schema I believe we're in agreement that we could support adding a new field type StrToIntField or IntToStrField as @ShaneHarvey suggested.
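
As an interim workaround that also covers nested keys, the conversion could be pushed to the server with MongoDB's `$toString` aggregation operator. The sketch below only builds the pipeline in plain Python. `to_string_stage`, the nested field name `details.code`, and the example filter are assumptions for illustration and do not come from this thread or from pymongoarrow.

```python
def to_string_stage(fields):
    """Build an $addFields stage that casts each listed field to a string.

    Hypothetical helper; dotted names such as "details.code" address
    nested keys.
    """
    return {"$addFields": {name: {"$toString": f"${name}"} for name in fields}}


query = {"name": "test"}  # example filter, assumed for illustration
pipeline = [
    {"$match": query},
    to_string_stage(["code", "details.code"]),
]
print(pipeline)
```

A pipeline like this could be passed to `aggregate_polars_all` from pymongoarrow.api in place of the `find_polars_all` call, so that every `code` value already arrives as a string when the schema is inferred. This costs an extra stage on the server, and it does not replace the proposed field type.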

keanamo added the linked-to-jira label Oct 11, 2024
