fix: failure Loading json columns into an arrow table #2273

Closed
wants to merge 2 commits into from


neuromantik33
Contributor

@neuromantik33 neuromantik33 commented Feb 7, 2025

Issue Summary

When loading a table with json columns into an Arrow table, the process fails due to nullable json payloads. Since the dataset is sparse, the first row in the result set is often None, leading to py_type being inferred as NoneType. As a result, the column is not converted into a JSON string array, causing the following error: Expected bytes, got a 'dict' object.

Steps to Reproduce

  • Load a SQL table containing a json column into an Arrow table.
  • Ensure the first row of the result set contains None in the json column.
  • Observe that the Arrow conversion fails with the above error message.

Expected Behavior

  • The json column should be properly converted to a json string array, even if the first value in the dataset is NULL.
  • The inference mechanism should recognize the json type instead of defaulting to NoneType.
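To make the failure mode concrete, here is a minimal sketch using only the standard library. It is not dlt's actual code: the helper names (infer_py_type_first_row, infer_py_type_first_non_null, json_column_to_strings) are hypothetical and only illustrate why inspecting the first row is fragile for sparse json columns, and what a conversion to JSON strings could look like.

```python
import json
from typing import Any, List, Optional, Sequence


def infer_py_type_first_row(values: Sequence[Any]) -> type:
    """Hypothetical: infer the Python type from the first row only (the fragile approach)."""
    return type(values[0]) if values else type(None)


def infer_py_type_first_non_null(values: Sequence[Any]) -> type:
    """Hypothetical: skip leading NULLs and infer from the first real value."""
    for value in values:
        if value is not None:
            return type(value)
    # Every value is NULL, so there is nothing to infer from
    return type(None)


def json_column_to_strings(values: Sequence[Any]) -> List[Optional[str]]:
    """Serialize each non-NULL payload to a JSON string, keeping NULLs as None."""
    return [None if value is None else json.dumps(value) for value in values]


# A sparse json column whose first row is NULL, as described in the issue summary
column = [None, {"a": 1}, None, [1, 2]]

print(infer_py_type_first_row(column))       # NoneType: the json conversion would be skipped
print(infer_py_type_first_non_null(column))  # dict: the json conversion would be applied
print(json_column_to_strings(column))
```

With the first-row approach the column looks like NoneType, so the dict payloads are passed through unconverted and the Arrow builder later rejects them.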

sql_database workaround

A query adapter can be used to explicitly cast JSON columns to TEXT:

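import sqlalchemy as sa
from sqlalchemy.sql import Select, sqltypes

def json_to_text(_: Select, table: sa.Table) -> Select:
    """FIXME Workaround to convert all JSON types to text until pyarrow adapter is fixed"""
    columns = [
        # Cast JSON columns to TEXT under their original name; leave other columns untouched
        sa.cast(col, sqltypes.Text).label(col.name) if isinstance(col.type, sqltypes.JSON) else col
        for col in table.columns
    ]
    # The incoming query is ignored; a fresh SELECT over the table's columns is returned
    return sa.select(*columns)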
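
To show what the adapter produces, here is a small self-contained sketch that repeats the function and applies it to a made-up table. The table name events and its columns are assumptions for illustration only, and SQLAlchemy 1.4 or later is assumed because of the sa.select(*columns) call style.

```python
import sqlalchemy as sa
from sqlalchemy.sql import Select, sqltypes


def json_to_text(_: Select, table: sa.Table) -> Select:
    """Cast every JSON column of the table to TEXT, keeping the column names."""
    columns = [
        sa.cast(col, sqltypes.Text).label(col.name) if isinstance(col.type, sqltypes.JSON) else col
        for col in table.columns
    ]
    return sa.select(*columns)


# Hypothetical table with one json column, used only to inspect the generated SQL
metadata = sa.MetaData()
events = sa.Table(
    "events",
    metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("payload", sa.JSON),
)

# The adapter receives the original query and the table, and returns the rewritten query
query = json_to_text(sa.select(events), events)
sql = str(query)
print(sql)
```

The printed statement selects id unchanged and wraps payload in a CAST to TEXT, so the backend returns strings instead of dicts and the Arrow conversion never sees a Python dict.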

But seeing as I also use this module in my pg_replication source plugin I preferred to correct it at the source, no pun intended 😉


netlify bot commented Feb 7, 2025

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit 0f57b92
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/67a55a0dbe50aa00082f3cc2

@rudolfix
Collaborator

@neuromantik33 FYI, we are working on it in #2295.
@zilto please look at the test case in this PR. This is the thing I mentioned in the PR. The current idea is to launch the conversion only when we get an exception from Arrow, so that we avoid looping over Python rows in most cases.

@neuromantik33
Contributor Author

Thanks for tackling this issue. Feel free to close this whenever you have something, and if you need a beta tester... 😉

@sh-rp
Collaborator

sh-rp commented Feb 24, 2025

@zilto was this fixed in your PR?

@zilto
Collaborator

zilto commented Feb 24, 2025

> @zilto was this fixed in your PR?

The arrow-related code no longer depends on inspecting the first row of the data and should be more robust. It should have fixed this issue, but it's hard to say 100% because it can depend on the SQL backend.

@neuromantik33 you could try the most recent devel branch and see if it fixes the problem?

pip install git+https://github.com/dlt-hub/dlt.git@devel

The arrow improvements will be available in the next release!

@neuromantik33
Contributor Author

@zilto thanks I'll give it a test run

@zilto
Collaborator

zilto commented Mar 5, 2025

The changes are now available in dlt==1.8.0. Let us know if it solves your issue :)

@neuromantik33
Contributor Author

Sorry I didn't get back to you. I tested it on multiple batches small enough to finally get a NULL json and it seems to have worked :) You can close this PR, and if I encounter another issue (I did when testing it with my replication source) I'll create a new one. Thanks! 🙏

@zilto zilto closed this Mar 11, 2025