Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Change default .arrow compression to "uncompressed" #656

Merged
merged 2 commits into from
Dec 31, 2024

Conversation

dangotbanned
Copy link
Member

@dangotbanned dangotbanned commented Dec 31, 2024

Fixes #655

The file size of flights-200k.arrow has a 7.2x 2.4x increase.
However, this is still roughly half 0.16x the size of flights-200k.json.

Also, adds support for overriding any write_options on a per-Spec basis.
This may be helpful if we later run into issues with vega/vega#3961

Fixes #655

Adds support for overriding any defaults on a per-`Spec` basis.
This may be helpful if we later run into issues with vega/vega#3961

**The file size of `flights-200k.arrow` has a 7.2x increase.**
However, this is still roughly half the size of `flights-200k.json`.
@dangotbanned dangotbanned marked this pull request as ready for review December 31, 2024 12:18
@dangotbanned
Copy link
Member Author

@jheer

I'm not sure how to reproduce this check on the buffer you have in flechette:
https://github.com/uwdata/flechette/blob/0b1b7a1a2facc89746cf541fca37b06878c316f3/src/decode/record-batch.js#L11-L20

I think this PR should resolve the issue.
However, I'd feel more confident merging if you were able to verify the new version can be decoded on your end 👍

@jheer
Copy link
Member

jheer commented Dec 31, 2024

@jheer

I'm not sure how to reproduce this check on the buffer you have in flechette: https://github.com/uwdata/flechette/blob/0b1b7a1a2facc89746cf541fca37b06878c316f3/src/decode/record-batch.js#L11-L20

I think this PR should resolve the issue. However, I'd feel more confident merging if you were able to verify the new version can be decoded on your end 👍

* https://github.com/vega/vega-datasets/raw/refs/heads/remove-arrow-compression/data/flights-200k.arrow

I can confirm that the file parses successfully with both Arrow-JS and Flechette. However, the internal format has changed, which may affect downstream use:

  • The column data types have changed. In particular, 64-bit integers are now being used for the delay and distance fields. Previously I believe 16-bit integers were used. This change (1) bloats the file size significantly and (2) might break or complicate downstream use for some users as Arrow-JS will now return BigInt values rather than number values for those columns.
  • (Minor) The previous version consisted of multiple record batches. It appears the new version consists of a single record batch. If any tools were leveraging that (e.g., to test multi-batch decoding), that no longer holds. However, I don't think this is much of an issue.

I would think that the simplest fix would be to revert to how the file was originally constructed. Why make changes if they are not needed?

@dangotbanned dangotbanned marked this pull request as draft December 31, 2024 18:16
@dangotbanned
Copy link
Member Author

#656 (comment)

Thanks for testing it out @jheer

I'm going to take a deeper look at the previous version to try and more faithfully recreate it.

I think the datatypes should be easy to fix

@dangotbanned dangotbanned marked this pull request as ready for review December 31, 2024 18:38
@dangotbanned dangotbanned merged commit 7c2e67f into main Dec 31, 2024
3 checks passed
@dangotbanned dangotbanned deleted the remove-arrow-compression branch December 31, 2024 21:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Updated Arrow files unreadable by JS libs
3 participants