-
-
Notifications
You must be signed in to change notification settings - Fork 452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ZLIB compression support #4080
base: master
Are you sure you want to change the base?
Conversation
Congrats on this PR, I definitely think it's useful to have zlib in encode/decode string functions. On the topic of compression, what would be your opinion on giving servers the option of serving their resources in 1 or more zips that get decompressed by the client? This would save bandwidth and reduce download times, it would likely be a more efficient method of caching server resources. |
External web server with gzip support is the most efficient solution in my opinion. Internal http server is old and inefficient. If we want to implement compression of client side cache files on server level then I recommend creating pre-compressed versions (.gz) on resource check and serving it via internal http server (it already supports gzip / deflate decompression). |
Shared/sdk/SharedUtil.Crypto.h
Outdated
if (input[0] == 0x78) | ||
windowBits = (int)ZLibFormat::ZLIB; | ||
else if (input[0] == 0x1F) | ||
windowBits = (int)ZLibFormat::GZIP; | ||
else | ||
windowBits = (int)ZLibFormat::ZRAW; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do you have these constants from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://www.zlib.net/manual.html#:~:text=of%20the%20library.-,The%20windowBits,-parameter%20is%20the
zlib wrapper uses 15 bits by default (this version can use values in range of 9-15), to get gzip we need to add 16 (so 31), negative value is used for raw deflate (no wrapper).
First bytes I googled and then validated myself during testing. zlib will return error if something is wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it makes sense to allow format option to also accept numbers. This would give more low-level access to tune compression algorithm. But as far as I understand positive values lower than 15 will result in lower compression rate (and the same with raw format) so not really useful in most cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no need to trust random StackOverflow comments or observations here. There actually are formal and authoritative RFCs for both the Zlib and Gzip formats that nicely detail the overall structure of such files:
- https://datatracker.ietf.org/doc/html/rfc1950#page-4 (section 2.2)
- https://datatracker.ietf.org/doc/html/rfc1952#page-5 (section 2.2)
From such standards, it stands out that:
- Gzip streams can be easily and reliably identified by the presence of two signature ID1 and ID2 bytes in the beginning, 0x1F and 0x8B. (Checking both would be a good idea to reduce the possibility of false positives.)
- Zlib streams do not have dedicated signature bytes, so in theory any file that has reasonable values for the CMF and FLG fields can be a Zlib stream. Therefore, it makes sense to check for Gzip first, as its identification is better defined, to reduce false positives. (Raw DEFLATE should be checked last, because it's a bit-based format that tends to have bytes with even more "random" values.)
Edit: for identifying Zlib streams, the current code only works well for data compressed with DEFLATE using a 32 KiB window (CM = 0b1000 = 8, CINFO = 0b0111 = 7). Supporting lower CINFO values would be smart to also handle streams with smaller window sizes, which are sometimes seen in the wild.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no need to trust random StackOverflow comments or observations here
This is why I verified this information (I actually didn't find it on StackOverflow originally, it was just faster to send this link instead of looking at comments in zlib files which I originally used).
Zlib streams do not have dedicated signature bytes
The library that we use always writes a wrapper for zlib even if windowBits is lower than 15. But you are correct.
I will slightly modify this signature check. Automatic detection is not supposed to be reliable anyway. The only reason this part of code exists is because information from zlib documentation turned out to be wrong (it's mentioned there that if we pass windowBits = 0 then zlib will try to automatically detect the wrapper but it doesn't work like expected).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The library that we use always writes a wrapper for zlib even if windowBits is lower than 15. But you are correct.
The value your code is checking for identifying Zlib streams, 0x78
, matches a pretty common (but not unique) CMF byte value, which is composed of the CM and CINFO fields I described in the edit for my comment above. So if you have verified that this code works for identifying Zlib streams generated by this library, then no additional wrapper that's undefined in that RFC was added 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit: for identifying Zlib streams, the current code only works well for data compressed with DEFLATE using a 32 KiB window (CM = 0b1000 = 8, CINFO = 0b0111 = 7). Supporting lower CINFO values would be smart to also handle streams with smaller window sizes, which are sometimes seen in the wild.
I checked again the output with window size set from 9 to 14 and for some data the header is indeed different. But it should be reliable for default window size (15).
So I think the best solution will be allowing to set specific window size (format option passed as number). And we should keep default values for generic use. So format will be either string (gzip, zlib, raw) or a number in range from 9..31, -9..-15.
This pull request adds zlib data compression and decompression support.
Functions (shared): encodeString / decodeString
New algorithm: zlib
Optional arguments:
gzip
,zlib
,raw
or window size number (default = gzip);default
,filtered
,huffman
,rle
,fixed
.Returns:
Encoded / decoded string or zlib error code.
If no arguments are passed then encodeString will use
gzip
with highest compression level (9) and decodeString will try to automatically determine compressed format (wrapper).For most use cases there is no reason to change compression (the size difference is negligible; 0 = no compression).
Strategy is a low-level option which based on documentation might tune compression algorithm to get better results in some cases. I tested it on different data types and the size difference is not significant or might be worse than default option.
Note
Please note that the header (wrapper) size for
gzip
is 18 bytes and only 6 bytes forzlib
(so it's more compact).raw
(deflate) has no wrapping at all and usually used by programs to read / write .zip files.Tip
Quick testing (runcode):
Test resource: zlib.zip