Parquet row_group_size has confusing units #46650
Hi @ThermodynamicBeta, thanks for starting this discussion. The units are in terms of rows and I believe the docstring was written with that knowledge and intent. You can see this by looking at #34280 and #34435.
Do you say that "Mi" suggests bytes because Mi is a base 2 prefix and we don't tend to use base 2 when counting? I want to make sure I understand your take on this since others are likely to be confused too. I think its current use here is correct and would be even more correct if we used
Let me know on the above question and, additionally, if you'd be interested in filing a PR with an improvement. Otherwise, I'd be happy to have a go.
For parquet file writing, there is a row_group_size parameter: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html
The usage and first sentence of the docstring seem to imply that it's supposed to be the number of rows:


However, it also mentions 1024 * 1024 and 64Mi, which suggests its units are bytes. From a user perspective, I don't know how big my rows are, so it would be much more convenient to put a number of bytes here. Can someone explain this decision, and confirm whether it's supposed to be the number of rows?
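For anyone who wants to think in bytes while the parameter counts rows, one possible workaround is to convert a byte target into a row count using the average row size. The sketch below is only that: rows_for_target_bytes is a hypothetical helper, not part of pyarrow, and the sizes in the example are invented. It also uses in-memory size as a stand-in, which can differ a lot from the encoded and compressed size on disk.

```python
def rows_for_target_bytes(total_bytes: int, num_rows: int, target_bytes: int) -> int:
    """Estimate a row count whose in-memory size is roughly target_bytes.

    Hypothetical helper (not a pyarrow API). Assumes rows are of roughly
    uniform size, so the average row size is total_bytes / num_rows.
    """
    if total_bytes <= 0 or num_rows <= 0 or target_bytes <= 0:
        raise ValueError("all arguments must be positive")
    # Integer arithmetic avoids floating-point rounding surprises.
    return max(1, target_bytes * num_rows // total_bytes)


# Invented example: 1,000,000 rows occupying 256 MiB in memory,
# aiming for roughly 64 MiB per row group.
estimate = rows_for_target_bytes(256 * 1024 * 1024, 1_000_000, 64 * 1024 * 1024)
print(estimate)  # 250000
```

With a pyarrow Table, total_bytes and num_rows could come from table.nbytes and table.num_rows, and the estimate could then be passed as row_group_size.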