CloseChoice commented Oct 8, 2025

Fixes #7765

The problem here is that polars exports string columns (here, the image paths) to pyarrow as large_string, while pandas and others use the plain string type. This PR handles the large_string case and adds a test.

import polars as pl
from datasets import Dataset
import pandas as pd
import pyarrow as pa
from pathlib import Path

shared_datadir = Path("tests/features/data")
image_path = str(shared_datadir / "test_image_rgb.jpg")

# Load via polars
df_polars = pl.DataFrame({"image_path": [image_path]})
dataset_polars = Dataset.from_polars(df_polars)
print("Polars DF is large string:", pa.types.is_large_string(df_polars.to_arrow().schema[0].type))
print("Polars DF is string:", pa.types.is_string(df_polars.to_arrow().schema[0].type))

# Load via pandas
df_pandas = pd.DataFrame({"image_path": [image_path]})
dataset_pandas = Dataset.from_pandas(df_pandas)
arrow_table_pd = pa.Table.from_pandas(df_pandas)
print("Pandas DF is large string", pa.types.is_large_string(arrow_table_pd.schema[0].type))
print("Pandas DF is string", pa.types.is_string(arrow_table_pd.schema[0].type))

Outputs:

Polars DF is large string: True
Polars DF is string: False
Pandas DF is large string False
Pandas DF is string True

CloseChoice marked this pull request as ready for review October 8, 2025 10:02

lhoestq (Member) commented Oct 10, 2025

The Image() type is defined with string storage for "path", not large_string. So while your change does make the conversion work, it can create issues in other places. For example, I'm pretty sure you wouldn't be able to concatenate the resulting dataset with a dataset whose Image() column uses string.

Maybe we can convert large_string data to string somehow to make this work?

CloseChoice (Author) commented Oct 12, 2025

@lhoestq thanks for the review. Just to be thorough I checked the concat example and this seems to work:

import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))

import pandas as pd
import polars as pl
from datasets import Dataset, Image, concatenate_datasets
import pyarrow as pa

image_path = "tests/features/data/test_image_rgb.jpg"


df_pl = pl.DataFrame({"image": [image_path]})
dset_pl = Dataset.from_polars(df_pl).cast_column("image", Image())


df_pd = pd.DataFrame({"image": [image_path]})
dset_pd = Dataset.from_pandas(df_pd).cast_column("image", Image())


concatenated = concatenate_datasets([dset_pl, dset_pd])
print(concatenated._data)

Outputs:

ConcatenationTable
image: struct<bytes: binary, path: string>
  child 0, bytes: binary
  child 1, path: string
----
image: [
  -- is_valid: all not null
  -- child 0 type: binary
[null]
  -- child 1 type: string
["tests/features/data/test_image_rgb.jpg"],
  -- is_valid: all not null
  -- child 0 type: binary
[null]
  -- child 1 type: string
["tests/features/data/test_image_rgb.jpg"]]

(Not quite sure though if this is really what you meant.) I agree that there could probably be a lot of problems if we rely on implicit conversion, so I updated the PR. I also checked the exception handling locally and it works. I am unsure though whether we want to create such large objects in the CI; if desired, I can add a test for that.
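
For context on why such a test would need large objects: Arrow's string type stores offsets as 32-bit integers, whereas large_string uses 64-bit offsets, so a single string array can address at most 2**31 - 1 bytes of character data. The sketch below only illustrates that limit. It assumes the exception mentioned above is triggered when the data exceeds it, and the helper names are hypothetical (they exist in neither `datasets` nor `pyarrow`):

```python
# Largest value representable by a signed 32-bit offset.
INT32_MAX = 2**31 - 1


def total_utf8_bytes(values):
    """Sum the UTF-8 encoded sizes of the non-null strings."""
    return sum(len(v.encode("utf-8")) for v in values if v is not None)


def fits_string_offsets(total_bytes: int) -> bool:
    """Return True if data of this size could be stored with 32-bit offsets."""
    return total_bytes <= INT32_MAX


paths = ["tests/features/data/test_image_rgb.jpg", None]
print(fits_string_offsets(total_utf8_bytes(paths)))  # True
```

Triggering the failure path for real would take roughly 2 GiB of string data in one array, which is the cost being weighed for CI.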


Development

Successfully merging this pull request may close these issues.

polars dataset cannot cast column to Image/Audio/Video
