Deactivating extension types or making ObjectId mapped type polars-compliant? #236

Open
Bonnevie opened this issue Sep 26, 2024 · 3 comments

Comments

@Bonnevie

I have a use-case where I want to extract data as an arrow table, save as parquet, and then later load it with polars.
My problem is that I cannot figure out how to avoid extension types, in particular for ObjectId. Currently the parquet file gets stored with the extension type, which polars cannot read. pymongoarrow functions like find_polars_all somehow manage this cast, but can it be achieved with find_arrow_all?

Is it possible to specify a datatype for ObjectId fields in the pymongoarrow schema, so that it gets recorded in the arrow table and parquet file as a binary data type that polars can read out of the box?

@Bonnevie Bonnevie changed the title Deactivating extension types or making ObjectId mapped type polars-compliant?? Deactivating extension types or making ObjectId mapped type polars-compliant? Sep 26, 2024
@ShaneHarvey
Collaborator

ShaneHarvey commented Sep 26, 2024

Here are the casts we do internally:

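# Imports needed to run this excerpt outside of pymongoarrow itself.
import pyarrow as pa
from pymongoarrow.api import find_arrow_all

try:
    import polars as pl
except ImportError:
    pl = None


def _cast_away_extension_type(field: pa.Field) -> pa.Field: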
    if isinstance(field.type, pa.ExtensionType):
        field_without_extension = pa.field(field.name, field.type.storage_type)
    elif isinstance(field.type, pa.StructType):
        field_without_extension = pa.field(
            field.name,
            pa.struct([_cast_away_extension_type(nested_field) for nested_field in field.type]),
        )
    elif isinstance(field.type, pa.ListType):
        field_without_extension = pa.field(
            field.name, pa.list_(_cast_away_extension_type(field.type.value_field))
        )
    else:
        field_without_extension = field

    return field_without_extension

def _arrow_to_polars(arrow_table: pa.Table):
    """Helper function that converts an Arrow Table to a Polars DataFrame.

    Note: Polars lacks ExtensionTypes. We cast them to their base arrow classes.
    """
    if pl is None:
        msg = "polars is not installed. Try pip install polars."
        raise ValueError(msg)

    schema_without_extensions = pa.schema(
        [_cast_away_extension_type(field) for field in arrow_table.schema]
    )
    arrow_table_without_extensions = arrow_table.cast(schema_without_extensions)

    return pl.from_arrow(arrow_table_without_extensions)

def find_polars_all(collection, query, *, schema=None, **kwargs):
    return _arrow_to_polars(find_arrow_all(collection, query, schema=schema, **kwargs))

We could offer a feature to customize the BSON type conversion, so that a cast wouldn't be needed.

@Bonnevie
Author

Bonnevie commented Sep 27, 2024

Hi @ShaneHarvey, amazing, thanks for the snippet!
I just checked what field.type.storage_type returns for an ObjectId extension type and, as far as I can tell, it is pa.binary(12). But if I try to use a schema with {"_id": pa.binary(12)}, I get the following error:

Unsupported data type in schema for field "_id" of type "fixed_size_binary[12]"

Is that expected? Should I not be able to specify a schema that casts an ObjectId field directly to its storage type?

This might be what you were getting at with your final comment.

@ShaneHarvey
Collaborator

That is expected, as we have not yet implemented direct support for pa.binary() (https://jira.mongodb.org/browse/ARROW-214). Looking into it more, I believe it should be more straightforward to implement support for fixed-size binary, so I opened a new ticket here: https://jira.mongodb.org/browse/ARROW-251
