Two parameters would be useful to make user-configurable in the mongo CollectionLoaders:
A projection parameter for find_raw_batches/find, which optionally limits the data exported from mongo (e.g. to drop columns with sensitive data at the source, or to reduce data size when not all columns are needed).
A pymongoarrow_schema, to be used in PyMongoArrowContext to enforce a schema in process_bson_stream when needed. Instead of the current call, which sets the schema to None (context = PyMongoArrowContext.from_schema(None, codec_options=self.collection.codec_options)), one would be able to pass a pymongoarrow schema.
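As a rough sketch of how both options might be surfaced, the snippet below uses a hypothetical `LoaderOptions` container (not part of dlt; the class and its `find_kwargs` method are assumptions made for illustration). It normalizes a projection given either as a list of field names or as a mapping, the two forms pymongo's `find` accepts, and carries an optional schema object through unchanged. The field names `amount` and `ssn` are made up.

```python
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class LoaderOptions:
    """Hypothetical container for the two proposed options (not a dlt class)."""

    # Either a list of field names to include, or a pymongo-style mapping.
    projection: Optional[Any] = None
    # Would hold a pymongoarrow schema; None keeps today's behaviour (inference).
    pymongoarrow_schema: Optional[Any] = None

    def find_kwargs(self) -> Dict[str, Any]:
        """Keyword arguments that could be forwarded to find / find_raw_batches."""
        if self.projection is None:
            return {}  # no projection: every field is exported, as today
        if isinstance(self.projection, Mapping):
            return {"projection": dict(self.projection)}
        # A list of names is read as "include only these fields".
        return {"projection": {name: 1 for name in self.projection}}


include_only = LoaderOptions(projection=["_id", "amount"])
print(include_only.find_kwargs())  # {'projection': {'_id': 1, 'amount': 1}}

exclude_sensitive = LoaderOptions(projection={"ssn": 0})
print(exclude_sensitive.find_kwargs())  # {'projection': {'ssn': 0}}
```

In a loader, the result would be unpacked into the existing call, for example `collection.find(filter_, **options.find_kwargs())`, and `options.pymongoarrow_schema` would take the place of the `None` currently passed to `PyMongoArrowContext.from_schema`.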
Without this schema, one of our use cases fails with data_item_format = "arrow" and the error: extraction of resource transaction in generator collection_documents caused an exception: value too large to convert to int32_t. The error occurs because the type is wrongly inferred as int32; setting the pyarrow type pa.float64() in the pymongoarrow_schema makes things work as expected.
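The failure mode can be illustrated with the standard library alone. This is a simplified model, not pymongoarrow's actual inference or conversion code: it only shows that a value above the signed 32-bit maximum cannot be encoded as int32, while the same value fits in a 64-bit float. The helper `can_pack` and the sample values are assumptions for illustration.

```python
import struct


def can_pack(fmt: str, value) -> bool:
    """Return True if `value` can be encoded with the given struct format."""
    try:
        struct.pack(fmt, value)
        return True
    except (struct.error, OverflowError):
        return False


small = 42
large = 3_000_000_000  # exceeds the int32 maximum of 2_147_483_647

print(can_pack("<i", small))         # True: fits in a signed 32-bit integer
print(can_pack("<i", large))         # False: too large for int32
print(can_pack("<d", float(large)))  # True: a 64-bit float can hold it
```

If the type is fixed as int32 from early values like `small`, a later value like `large` has nowhere to go, which is consistent with the error quoted above; declaring the column as float64 up front avoids that.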
Are you a dlt user?
I'm considering using dlt, but this bug is preventing this.
Are you ready to contribute this extension?
Yes, I'm ready.
dlt destination
duckdb/s3
Additional information
No response
Careful about using pymongoarrow: we have had a few problems trying to use it, particularly with the translation of ObjectId and arrays of ObjectIds. There is at least the documented problem with nested extension types.
We tried to write pyarrow dataframes using DuckDB's import from Apache Arrow, but it threw an error saying that the type was not supported. We ended up writing pyarrow dataframes straight to parquet files using the pyarrow.parquet.write_table() function, which translates ObjectId to blob (and arrays of blobs respectively), which we then cast using DuckDB's HEX Blob function (which is currently missing from the documentation) to get the id as a string.
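To make the last step concrete: an ObjectId is a 12-byte value, so once it lands as a blob, turning it into the familiar string form is a plain bytes-to-hex conversion. The sketch below shows that conversion in Python only; it does not reproduce the DuckDB cast, and the sample id and the helper `blobs_to_hex` are made up for illustration.

```python
from typing import List


def blobs_to_hex(blobs: List[bytes]) -> List[str]:
    """Convert raw 12-byte ObjectId blobs into their hex string form."""
    return [blob.hex() for blob in blobs]


# A made-up ObjectId value, stored as raw bytes the way it would appear as a blob.
raw_id = bytes.fromhex("507f1f77bcf86cd799439011")

print(len(raw_id))             # 12
print(raw_id.hex())            # 507f1f77bcf86cd799439011
print(blobs_to_hex([raw_id]))  # ['507f1f77bcf86cd799439011']
```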
@esciara thanks for the heads up! We have indeed noticed similar type-related issues in the past (e.g. mongodb-labs/mongo-arrow#236 (comment)).
pymongoarrow is, however, still very beneficial to us in terms of performance in several use cases.
For ObjectId columns like _id, we are able to use dlt to move data from mongo to duckdb as follows:
We define a pymongoarrow_schema (as explained in the issue description) where ObjectId columns have type pymongoarrow.types.ObjectIdType()
@esciara we have not tried this, as it's not needed for our current use case. However, based on the limitations of pymongoarrow we both encountered, it might not work out of the box.
Source name
mongodb