Open
Description
When fetching data with find_polars_all, find_pandas_all, or find_arrow_all from pymongoarrow.api, the schema is inferred from the first document. If the same key has a different data type in a later document, that document's value is returned as null.
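The behaviour described above can be modelled with a toy sketch in plain Python. This is not pymongoarrow's actual implementation; the helpers `infer_schema` and `apply_schema` are hypothetical names used only for illustration, and Python types stand in for Arrow types.

```python
def infer_schema(first_doc):
    """Map each field name to the Python type seen in the first document."""
    return {key: type(value) for key, value in first_doc.items()}


def apply_schema(docs, schema):
    """Keep values matching the inferred type; replace mismatches with None."""
    return [
        {
            key: doc.get(key) if isinstance(doc.get(key), expected) else None
            for key, expected in schema.items()
        }
        for doc in docs
    ]


docs = [
    {"name": "test", "code": "1"},  # first document: code is a string
    {"name": "test", "code": 1},    # later document: code is an integer
]

# The schema comes from docs[0] only, so the integer in docs[1] does not fit.
rows = apply_schema(docs, infer_schema(docs[0]))
```

In this model the second row's `code` ends up as `None`, mirroring the null shown in the output under "Current implementation".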
MongoDB documents
[
{
"name": "test",
"code": "1"
},
{
"name": "test",
"code": 1
}
]
Current implementation
from pymongoarrow.api import find_polars_all
query_result_df = find_polars_all(
collection=client,
query=query
)
query_result_df
# Schema([('_id', Binary), ('name', String), ('code', String)]), Shape ==> (2, 3)
# shape: (2, 3)
# ┌─────────────────────────────────┬──────┬──────┐
# │ _id ┆ name ┆ code │
# │ --- ┆ --- ┆ --- │
# │ binary ┆ str ┆ str │
# ╞═════════════════════════════════╪══════╪══════╡
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1 │
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ null │
# └─────────────────────────────────┴──────┴──────┘
For known discrepancies like this, where the first document holds a pyarrow.string() value and subsequent documents hold pyarrow.int*() values, the column could be inferred as pyarrow.string() by adding an optional parameter coerce_number_to_str to all find_* APIs.
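The intended coercion semantics can be sketched in plain Python. This is only an illustration of the requested behaviour under stated assumptions: `coerce_numbers_in_docs` is a hypothetical helper, not part of pymongoarrow, and it operates on already-fetched dictionaries rather than on Arrow arrays.

```python
from numbers import Number


def coerce_numbers_in_docs(docs, fields):
    """Return copies of docs with numeric values in `fields` converted to str."""
    coerced = []
    for doc in docs:
        new_doc = dict(doc)  # shallow copy so the input is left untouched
        for field in fields:
            value = new_doc.get(field)
            # bool is a subclass of int, so exclude it explicitly
            if isinstance(value, Number) and not isinstance(value, bool):
                new_doc[field] = str(value)
        coerced.append(new_doc)
    return coerced


docs = [
    {"name": "test", "code": "1"},
    {"name": "test", "code": 1},
]

# Both documents now carry a string in "code", so a string column fits all rows.
coerced = coerce_numbers_in_docs(docs, fields=["code"])
```

With every `code` value normalised to a string before the schema is applied, no row would need to fall back to null, which is the outcome the proposed parameter aims for.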
Expected implementation
from pymongoarrow.api import find_polars_all
query_result_df = find_polars_all(
collection=client,
query=query,
coerce_number_to_str=True
)
query_result_df
# Schema([('_id', Binary), ('name', String), ('code', String)]), Shape ==> (2, 3)
# shape: (2, 3)
# ┌─────────────────────────────────┬──────┬──────┐
# │ _id ┆ name ┆ code │
# │ --- ┆ --- ┆ --- │
# │ binary ┆ str ┆ str │
# ╞═════════════════════════════════╪══════╪══════╡
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1 │
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1 │
# └─────────────────────────────────┴──────┴──────┘
Reference: coerce_numbers_to_str in https://docs.pydantic.dev/latest/api/fields/#pydantic.fields.Field