-
Notifications
You must be signed in to change notification settings - Fork 1.5k
GH-3344: Adaptive compression for v2 page #3368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| definitionLevels, | ||
| data, | ||
| statistics, | ||
| 10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not make sense to have a threshold greater than 1.0, does it?
In any case, the data of a column full of zeros is extremely compressible (you might want to disable dictionary encoding?), so the default value should work IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to 1.0
| assertEquals( | ||
| "Data should be stored uncompressed when compression ratio exceeds threshold", | ||
| uncompressedSize, | ||
| compressedSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps you want to read the data back and check it's as expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
| * @param threshold the compression ratio threshold, default is {@value #DEFAULT_V2_PAGE_COMPRESS_THRESHOLD} | ||
| * @return this builder for method chaining | ||
| */ | ||
| public Builder withV2PageCompressThreshold(double threshold) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it makes sense to put "V2" in the API name? For users it doesn't seem be relevant (just keep it mentioned in the docstrings).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, updated
| converter.getEncoding(headerV2.getEncoding()), | ||
| BytesInput.from(pageLoad), | ||
| headerV2.is_compressed, | ||
| compressed, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed a potential bug
| public static final boolean DEFAULT_SIZE_STATISTICS_ENABLED = true; | ||
|
|
||
| public static final boolean DEFAULT_PAGE_WRITE_CHECKSUM_ENABLED = true; | ||
| public static final double DEFAULT_PAGE_COMPRESS_THRESHOLD = 0.98; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this magic number? Is it better to use a smaller number like 0.9 or 0.85?
| } | ||
| } | ||
|
|
||
| public static Builder build(BytesInputCompressor compressor, MessageType schema, ByteBufferAllocator allocator) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it better to use a separate issue to implement this, though I'm fine to keep it as is.
Rationale for this change
Closes #3344
What changes are included in this PR?
Are these changes tested?
Yes, added UT.
Are there any user-facing changes?
Yes, add new configuration. No breaking changes.