Conversation

Contributor

ConeyLiu commented Dec 8, 2025

Rationale for this change

Closes #3344

What changes are included in this PR?

Are these changes tested?

Yes, unit tests were added.

Are there any user-facing changes?

Yes, this adds a new configuration option. No breaking changes.

Contributor Author

ConeyLiu commented Dec 8, 2025

cc @pitrou @alamb @mapleFU

definitionLevels,
data,
statistics,
10);
Member

It does not make sense to have a threshold greater than 1.0, does it?
In any case, the data of a column full of zeros is extremely compressible (you might want to disable dictionary encoding?), so the default value should work IMHO.
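For illustration, the decision under discussion can be sketched as a simple ratio check. This is a minimal sketch and not the PR's implementation: the class and method names are hypothetical, and it assumes the ratio is compressed size divided by uncompressed size, with the page kept compressed when the ratio is at or below the threshold. Under that assumption a threshold above 1.0 would only ever admit pages that compression made larger, which is why the sketch rejects it.

```java
final class AdaptivePageCompression {
    // Same value as the default that appears in this PR's diff.
    static final double DEFAULT_PAGE_COMPRESS_THRESHOLD = 0.98;

    /**
     * Hypothetical helper: returns true if the compressed bytes should be stored,
     * false if the page should be written uncompressed.
     */
    static boolean shouldStoreCompressed(long uncompressedSize, long compressedSize, double threshold) {
        // Assumption: valid thresholds lie in (0, 1]; NaN is rejected as well.
        if (!(threshold > 0.0 && threshold <= 1.0)) {
            throw new IllegalArgumentException("threshold must be in (0, 1]: " + threshold);
        }
        if (uncompressedSize <= 0) {
            return false; // nothing to gain from compressing an empty page
        }
        double ratio = (double) compressedSize / (double) uncompressedSize;
        return ratio <= threshold;
    }

    public static void main(String[] args) {
        // A page that halves in size is kept compressed.
        System.out.println(shouldStoreCompressed(1000, 500, DEFAULT_PAGE_COMPRESS_THRESHOLD));
        // A page that shrinks by only 1% is stored uncompressed.
        System.out.println(shouldStoreCompressed(1000, 990, DEFAULT_PAGE_COMPRESS_THRESHOLD));
    }
}
```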

Contributor Author

Changed to 1.0

assertEquals(
"Data should be stored uncompressed when compression ratio exceeds threshold",
uncompressedSize,
compressedSize);
Member

Perhaps you want to read the data back and check it's as expected?
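A read-back check of the kind suggested here could look roughly like the following. It is a self-contained sketch that uses `java.util.zip` in place of the Parquet codecs; `StoredPage`, `write`, and `read` are hypothetical names, not parquet-java APIs, and the ratio rule is the same assumption as elsewhere in this thread (keep the compressed bytes only when compressed/uncompressed is at or below the threshold).

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class PageRoundTrip {
    /** Hypothetical container for the stored bytes plus the "is compressed" flag. */
    static final class StoredPage {
        final byte[] bytes;
        final boolean compressed;
        final int uncompressedSize;

        StoredPage(byte[] bytes, boolean compressed, int uncompressedSize) {
            this.bytes = bytes;
            this.compressed = compressed;
            this.uncompressedSize = uncompressedSize;
        }
    }

    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] input, int expectedSize) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        byte[] result = new byte[expectedSize];
        int off = 0;
        try {
            while (off < expectedSize && !inflater.finished()) {
                int n = inflater.inflate(result, off, expectedSize - off);
                if (n == 0) break; // no progress: truncated or corrupt input
                off += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt page", e);
        } finally {
            inflater.end();
        }
        if (off != expectedSize) throw new IllegalStateException("short page: " + off);
        return result;
    }

    /** Keeps the compressed bytes only if they are small enough (assumed rule). */
    static StoredPage write(byte[] page, double threshold) {
        byte[] c = compress(page);
        boolean keep = page.length > 0 && (double) c.length / page.length <= threshold;
        return new StoredPage(keep ? c : page, keep, page.length);
    }

    /** The reader must honor the flag, otherwise uncompressed pages get garbled. */
    static byte[] read(StoredPage p) {
        return p.compressed ? decompress(p.bytes, p.uncompressedSize) : p.bytes;
    }

    public static void main(String[] args) {
        byte[] random = new byte[4096];
        new Random(42).nextBytes(random); // effectively incompressible
        StoredPage stored = write(random, 0.98);
        System.out.println("compressed=" + stored.compressed
                + " roundTripOk=" + Arrays.equals(random, read(stored)));
    }
}
```

Asserting on the bytes read back, and not only on the stored size, also exercises the reader's handling of the compressed flag.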

Contributor Author

Added

* @param threshold the compression ratio threshold, default is {@value #DEFAULT_V2_PAGE_COMPRESS_THRESHOLD}
* @return this builder for method chaining
*/
public Builder withV2PageCompressThreshold(double threshold) {
Member

I'm not sure it makes sense to put "V2" in the API name. For users it doesn't seem to be relevant (just keep it mentioned in the docstrings).
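One possible shape for this suggestion is sketched below: the setter name drops "V2" while the Javadoc still states where the setting applies. `WriterSettings` and `withPageCompressThreshold` are hypothetical names chosen for illustration, and the statement that the setting affects only V2 data pages is an assumption taken from the linked issue title.

```java
final class WriterSettings {
    public static final double DEFAULT_PAGE_COMPRESS_THRESHOLD = 0.98;

    private final double pageCompressThreshold;

    private WriterSettings(Builder builder) {
        this.pageCompressThreshold = builder.pageCompressThreshold;
    }

    double getPageCompressThreshold() {
        return pageCompressThreshold;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        private double pageCompressThreshold = DEFAULT_PAGE_COMPRESS_THRESHOLD;

        /**
         * Sets the compression ratio threshold used to decide whether a page is
         * stored compressed. Assumed to apply to V2 data pages only.
         *
         * @param threshold the compression ratio threshold, default is
         *     {@value WriterSettings#DEFAULT_PAGE_COMPRESS_THRESHOLD}
         * @return this builder for method chaining
         */
        Builder withPageCompressThreshold(double threshold) {
            this.pageCompressThreshold = threshold;
            return this;
        }

        WriterSettings build() {
            return new WriterSettings(this);
        }
    }

    public static void main(String[] args) {
        WriterSettings settings = WriterSettings.builder()
                .withPageCompressThreshold(0.9)
                .build();
        System.out.println(settings.getPageCompressThreshold());
    }
}
```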

Contributor Author

Good point, updated

Member

pitrou commented Dec 8, 2025

I'll let Java experts review this @Fokko @wgtmac

converter.getEncoding(headerV2.getEncoding()),
BytesInput.from(pageLoad),
headerV2.is_compressed,
compressed,
Contributor Author

Fixed a potential bug

public static final boolean DEFAULT_SIZE_STATISTICS_ENABLED = true;

public static final boolean DEFAULT_PAGE_WRITE_CHECKSUM_ENABLED = true;
public static final double DEFAULT_PAGE_COMPRESS_THRESHOLD = 0.98;
Member

Why this magic number? Would a smaller value like 0.9 or 0.85 be better?
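To make the trade-off between these candidate values concrete, the sketch below computes how many bytes compression must save before a page is kept compressed. The 1 MiB page size is only an illustrative assumption, the helper names are hypothetical, and the rule assumed is again "keep compressed when compressed/uncompressed is at or below the threshold".

```java
final class ThresholdTradeoff {
    /** Largest compressed size (in bytes) that would still be kept compressed. */
    static long maxCompressedSizeKept(long uncompressedSize, double threshold) {
        return (long) Math.floor(uncompressedSize * threshold);
    }

    /** Minimum number of bytes compression must save for the page to stay compressed. */
    static long minBytesSaved(long uncompressedSize, double threshold) {
        return uncompressedSize - maxCompressedSizeKept(uncompressedSize, threshold);
    }

    public static void main(String[] args) {
        long pageSize = 1024L * 1024L; // illustrative 1 MiB page
        for (double threshold : new double[] {0.98, 0.9, 0.85}) {
            System.out.println("threshold " + threshold + ": must save at least "
                    + minBytesSaved(pageSize, threshold) + " bytes");
        }
    }
}
```

A lower threshold skips decompression work on more pages at the cost of larger files, so the right value depends on how the reader's CPU cost is weighed against storage.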

}
}

public static Builder build(BytesInputCompressor compressor, MessageType schema, ByteBufferAllocator allocator) {
Member

Would it be better to implement this under a separate issue? That said, I'm fine with keeping it as is.

Development

Successfully merging this pull request may close these issues.

Make compression adaptive with V2 data pages

3 participants