fix: failure Loading json columns into an arrow table #2273
@neuromantik33 FYI, we are on it in #2295
Thanks for tackling this issue. Feel free to close this whenever you have something, and if you need a beta tester ... 😉
@zilto was this fixed in your PR?
The arrow-related code no longer depends on inspecting the first row of the data and should be more robust. It should have fixed this issue, but it's hard to say 100% because it can depend on the SQL backend. @neuromantik33 you could try the most recent
The arrow improvements will be available in the next release!
@zilto thanks, I'll give it a test run
The changes are now available in
Sorry I didn't get back to you. I tested it on multiple batches small enough to finally get a NULL json, and it seems to have worked :) You can close this PR, and if I encounter another issue (I did when testing it with my replication source) I'll create a new issue. Thanks! 🙏
Issue Summary
When loading a table with `json` columns into an Arrow table, the process fails due to nullable `json` payloads. Since the dataset is sparse, the first row in the result set is often `None`, leading to `py_type` being inferred as `NoneType`. As a result, the column is not converted into a JSON string array, causing the following error: `Expected bytes, got a 'dict' object`.
Steps to Reproduce
- Load a table with a `json` column into an Arrow table.
- Make sure the first row of the result set is `None`.
Expected Behavior
- The `json` column should be properly converted to a `json` string array, even if the first value in the dataset is NULL.
- The column type should be inferred as the `json` type instead of defaulting to `NoneType`.
`sql_database` workaround
A query adapter can be used to explicitly cast JSON columns to TEXT:
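As a rough, library-agnostic sketch of that idea (this is not dlt's query adapter API), the hypothetical helper `build_select_with_text_cast` below builds a SELECT that wraps every column declared as JSON in `CAST(... AS TEXT)`. It runs against an in-memory SQLite database so that only the standard library is needed; the `events` table and its columns are made up for illustration.

```python
import sqlite3


def build_select_with_text_cast(conn: sqlite3.Connection, table: str) -> str:
    """Build a SELECT that casts every column declared as JSON to TEXT.

    Hypothetical helper for illustration only; it is not part of dlt.
    """
    # PRAGMA table_info yields (cid, name, declared_type, notnull, default, pk).
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    select_items = []
    for _cid, name, declared_type, *_rest in columns:
        if declared_type.upper() in ("JSON", "JSONB"):
            # Cast in SQL so the client receives a string, never a dict.
            select_items.append(f'CAST("{name}" AS TEXT) AS "{name}"')
        else:
            select_items.append(f'"{name}"')
    return f'SELECT {", ".join(select_items)} FROM "{table}"'


# Assumed toy schema: a sparse JSON column whose first row is NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload JSON)")
conn.executemany(
    "INSERT INTO events (id, payload) VALUES (?, ?)",
    [(1, None), (2, '{"a": 1}')],
)

query = build_select_with_text_cast(conn, "events")
rows = conn.execute(query + " ORDER BY id").fetchall()
```

With the cast applied in the query, every non-null payload arrives as a plain string and NULL stays `None`, so the first row no longer influences how the column is interpreted downstream.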
But seeing as I also use this module in my `pg_replication` source plugin, I preferred to correct it at the source, no pun intended 😉
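To illustrate what correcting it at the source could look like, here is a minimal standard-library sketch. It is not the code from this PR or from #2295, and both function names are hypothetical. Instead of inferring the Python type from the first row, it scans the column for the first non-null value, then serializes `dict`/`list` payloads to JSON strings while leaving NULLs as `None`.

```python
import json
from typing import Any, List, Sequence


def infer_py_type(values: Sequence[Any]) -> type:
    """Return the type of the first non-null value, or NoneType if all are null."""
    for value in values:
        if value is not None:
            return type(value)
    return type(None)


def json_column_to_strings(values: Sequence[Any]) -> List[Any]:
    """Serialize dict/list payloads to JSON strings, keeping NULLs as None."""
    if infer_py_type(values) not in (dict, list):
        # Not a JSON-like column (or entirely NULL): return it unchanged.
        return list(values)
    return [None if v is None else json.dumps(v) for v in values]


# A sparse column whose first rows are NULL, as described in the issue summary.
sparse_column = [None, None, {"a": 1}, None, {"b": [1, 2]}]
first_row_type = type(sparse_column[0])  # what a first-row check would see
converted = json_column_to_strings(sparse_column)
```

Here `first_row_type` is `NoneType` even though the column holds JSON payloads, which is the situation described in the issue summary. The `converted` list contains only strings and `None`, the shape a string array expects.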