Sorting and encoding of dictionary pages #8778
I am working on a schema for storing multi-dimensional signal data (float32/float64) from a mass spectrometer in Parquet, using Rust for the prototype implementation. In some cases, dictionary encoding has helped, because one dimension's values are repeated many times as another dimension varies. Sometimes that dictionary can be quite big, and it may not compress well on its own if stored with the plain encoding, based on experiments where I collected the set of all unique values and simply compressed them with Zstd. If I sorted them and byte shuffled them (as in the …

So far as I can tell, this implementation doesn't support writing out a sorted dictionary, or using any encoding other than PLAIN …

If I understand what'd be involved, sorting the dictionary page wouldn't force new behavior on a reader, but using an encoding other than PLAIN …
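The sort-then-byte-shuffle experiment described above can be sketched in plain Rust. This is only an illustration under assumptions: the function names `sort_and_byte_shuffle` and `byte_unshuffle` are hypothetical, the values are assumed to be float32, and the layout (all first bytes, then all second bytes, and so on) is the same kind of byte-plane split that Parquet's BYTE_STREAM_SPLIT encoding produces. It is not code from the parquet crate, and the Zstd step is left out so the example needs only the standard library.

```rust
/// Sort a set of unique f32 values and split them into byte planes
/// (all first bytes, then all second bytes, and so on).
/// Hypothetical helper for illustration; not part of any Parquet API.
fn sort_and_byte_shuffle(values: &[f32]) -> Vec<u8> {
    let mut sorted = values.to_vec();
    // total_cmp gives a total order over floats, including NaN.
    sorted.sort_by(|a, b| a.total_cmp(b));

    let n = sorted.len();
    let mut out = vec![0u8; n * 4];
    for (i, v) in sorted.iter().enumerate() {
        for (k, byte) in v.to_le_bytes().iter().enumerate() {
            // Byte k of value i lands in plane k at offset i.
            out[k * n + i] = *byte;
        }
    }
    out
}

/// Inverse of the shuffle: rebuild the sorted f32 values from the byte planes.
fn byte_unshuffle(bytes: &[u8]) -> Vec<f32> {
    let n = bytes.len() / 4;
    (0..n)
        .map(|i| {
            f32::from_le_bytes([bytes[i], bytes[n + i], bytes[2 * n + i], bytes[3 * n + i]])
        })
        .collect()
}

fn main() {
    let unique = [3.5f32, 1.0, 2.25, 1.5];
    let shuffled = sort_and_byte_shuffle(&unique);
    let restored = byte_unshuffle(&shuffled);
    assert_eq!(restored, vec![1.0, 1.5, 2.25, 3.5]);
    println!("{} values -> {} shuffled bytes", restored.len(), shuffled.len());
}
```

After sorting, neighbouring values tend to share their high-order bytes, so each byte plane contains long runs of similar bytes, which is the property a general-purpose compressor such as Zstd can exploit. The shuffled buffer would be passed to the compressor in place of the plain value bytes.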
Replies: 1 comment 1 reply
Dictionary pages must be encoded using PLAIN encoding (ref). The Parquet community is exploring new encoding options right now (https://lists.apache.org/thread/djnbbcnft0fqm9ldby2q96nbtrwqz477); you might want to ask on the Parquet dev list whether there's a desire to expand the possibilities for dictionary encoding.

Being able to sort the dictionary would be a nice addition, but I'm not sure what the level of effort would be. Please feel free to create an issue to request this feature.
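To make the sorted-dictionary idea concrete, here is a minimal sketch in plain Rust of what a writer would have to do: build the dictionary, sort it, and emit indices into the sorted order. The names `encode_sorted_dictionary` and `decode_dictionary` are hypothetical and this is not how the parquet crate is implemented; it only illustrates that the reader side stays an ordinary index lookup, so sorting changes the indices the writer emits rather than the decoding logic.

```rust
/// Dictionary-encode f32 values using a sorted dictionary.
/// Returns (dictionary, indices) where dictionary[indices[i]] has the same bits as values[i].
/// Hypothetical helper for illustration; not the parquet crate's API.
fn encode_sorted_dictionary(values: &[f32]) -> (Vec<f32>, Vec<u32>) {
    let mut dict = values.to_vec();
    // total_cmp orders floats totally and treats two values as equal
    // only when their bit patterns match.
    dict.sort_by(|a, b| a.total_cmp(b));
    dict.dedup_by(|a, b| a.to_bits() == b.to_bits());

    let indices = values
        .iter()
        .map(|v| {
            dict.binary_search_by(|probe| probe.total_cmp(v))
                .expect("every value is present in the dictionary") as u32
        })
        .collect();
    (dict, indices)
}

/// Reader side: a plain lookup, identical for sorted and unsorted dictionaries.
fn decode_dictionary(dict: &[f32], indices: &[u32]) -> Vec<f32> {
    indices.iter().map(|&i| dict[i as usize]).collect()
}

fn main() {
    let values = [2.5f32, 1.0, 2.5, 0.5, 1.0];
    let (dict, indices) = encode_sorted_dictionary(&values);
    assert_eq!(dict, vec![0.5, 1.0, 2.5]);
    assert_eq!(indices, vec![2, 1, 2, 0, 1]);
    assert_eq!(decode_dictionary(&dict, &indices), values.to_vec());
}
```

One cost this sketch makes visible: the full set of distinct values has to be known before any index can be written, since an index depends on the value's final position in the sorted dictionary.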