Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ZLIB compression support #4080

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

samr46
Copy link
Contributor

@samr46 samr46 commented Mar 7, 2025

This pull request adds zlib data compression and decompression support.

Functions (shared): encodeString / decodeString
New algorithm: zlib

Optional arguments:

  • compression: 0-9 (default = 9);
  • format: gzip, zlib, raw or window size number (default = gzip);
  • strategy: default, filtered, huffman, rle, fixed.

Returns:

Encoded / decoded string or zlib error code.

If no arguments are passed then encodeString will use gzip with highest compression level (9) and decodeString will try to automatically determine compressed format (wrapper).

For most use cases there is no reason to change compression (the size difference is negligible; 0 = no compression).

Strategy is a low-level option which based on documentation might tune compression algorithm to get better results in some cases. I tested it on different data types and the size difference is not significant or might be worse than default option.

Note

Please note that the header (wrapper) size for gzip is 18 bytes and only 6 bytes for zlib (so it's more compact).
raw (deflate) has no wrapping at all and usually used by programs to read / write .zip files.

Tip

Quick testing (runcode):

decodeString("zlib", encodeString("zlib", "Hello MTA:SA World"))

decodeString("zlib", encodeString("zlib", "Hello MTA:SA World", {format = "gzip", compression = 9, strategy = "default"}))

Test resource: zlib.zip

@Fernando-A-Rocha
Copy link
Contributor

Fernando-A-Rocha commented Mar 7, 2025

Congrats on this PR, I definitely think it's useful to have zlib in encode/decode string functions.

On the topic of compression, what would be your opinion on giving servers the option of serving their resources in 1 or more zips that get decompressed by the client? This would save bandwidth and reduce download times, it would likely be a more efficient method of caching server resources.

@samr46
Copy link
Contributor Author

samr46 commented Mar 7, 2025

On the topic of compression, what would be your opinion on giving servers the option of serving their resources in 1 or more zips that get decompressed by the client? This would save bandwidth and reduce download times, it would likely be a more efficient method of caching server resources.

External web server with gzip support is the most efficient solution in my opinion. Internal http server is old and inefficient.
I recommend using nginx server for this. By default it uses on-the-fly gzip compression (so CPU resources will be used to compress content before sending it to client). It also supports static gzip mode (in this mode it's possible to serve pre-compressed versions of files to save CPU resources).

If we want to implement compression of client side cache files on server level then I recommend creating pre-compressed versions (.gz) on resource check and serving it via internal http server (it already supports gzip / deflate decompression).
It's actually fairly easy with newly added functions (although for most efficiency I'd create another internal FileCopy function that would copy file from resources folder into cache and simultaneously create compressed version of this file if it doesn't exist or was modified).
It doesn't make sense to send all files in compressed form because for images (i.e. png which already uses zlib, etc.) and some other data types compression is nearly useless (or even worse in terms of bandwidth and required CPU resources to decompress). So creating .zip version of all resource files to serve isn't a good idea. And it's still easier and better to send all files separately despite the number of download requests.

Comment on lines 238 to 243
if (input[0] == 0x78)
windowBits = (int)ZLibFormat::ZLIB;
else if (input[0] == 0x1F)
windowBits = (int)ZLibFormat::GZIP;
else
windowBits = (int)ZLibFormat::ZRAW;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do you have these constants from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://www.zlib.net/manual.html#:~:text=of%20the%20library.-,The%20windowBits,-parameter%20is%20the
zlib wrapper uses 15 bits by default (this version can use values in range of 9-15), to get gzip we need to add 16 (so 31), negative value is used for raw deflate (no wrapper).
First bytes I googled and then validated myself during testing. zlib will return error if something is wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it makes sense to allow format option to also accept numbers. This would give more low-level access to tune compression algorithm. But as far as I understand positive values lower than 15 will result in lower compression rate (and the same with raw format) so not really useful in most cases.

Copy link
Member

@AlexTMjugador AlexTMjugador Mar 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to trust random StackOverflow comments or observations here. There actually are formal and authoritative RFCs for both the Zlib and Gzip formats that nicely detail the overall structure of such files:

From such standards, it stands out that:

  • Gzip streams can be easily and reliably identified by the presence of two signature ID1 and ID2 bytes in the beginning, 0x1F and 0x8B. (Checking both would be a good idea to reduce the possibility of false positives.)
  • Zlib streams do not have dedicated signature bytes, so in theory any file that has reasonable values for the CMF and FLG fields can be a Zlib stream. Therefore, it makes sense to check for Gzip first, as its identification is better defined, to reduce false positives. (Raw DEFLATE should be checked last, because it's a bit-based format that tends to have bytes with even more "random" values.)

Edit: for identifying Zlib streams, the current code only works well for data compressed with DEFLATE using a 32 KiB window (CM = 0b1000 = 8, CINFO = 0b0111 = 7). Supporting lower CINFO values would be smart to also handle streams with smaller window sizes, which are sometimes seen in the wild.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to trust random StackOverflow comments or observations here

This is why I verified this information (I actually didn't find it on StackOverflow originally, it was just faster to send this link instead of looking at comments in zlib files which I originally used).

Zlib streams do not have dedicated signature bytes

The library that we use always writes a wrapper for zlib even if windowBits is lower than 15. But you are correct.

I will slightly modify this signature check. Automatic detection is not supposed to be reliable anyway. The only reason this part of code exists is because information from zlib documentation turned out to be wrong (it's mentioned there that if we pass windowBits = 0 then zlib will try to automatically detect the wrapper but it doesn't work like expected).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The library that we use always writes a wrapper for zlib even if windowBits is lower than 15. But you are correct.

The value your code is checking for identifying Zlib streams, 0x78, matches a pretty common (but not unique) CMF byte value, which is composed of the CM and CINFO fields I described in the edit for my comment above. So if you have verified that this code works for identifying Zlib streams generated by this library, then no additional wrapper that's undefined in that RFC was added 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit: for identifying Zlib streams, the current code only works well for data compressed with DEFLATE using a 32 KiB window (CM = 0b1000 = 8, CINFO = 0b0111 = 7). Supporting lower CINFO values would be smart to also handle streams with smaller window sizes, which are sometimes seen in the wild.

I checked again the output with window size set from 9 to 14 and for some data the header is indeed different. But it should be reliable for default window size (15).
So I think the best solution will be allowing to set specific window size (format option passed as number). And we should keep default values for generic use. So format will be either string (gzip, zlib, raw) or a number in range from 9..31, -9..-15.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants