2024-05-11 imghdr parity updates #75
This reverts commit 18e48f3.
and tidied a python line
Mmmm, why do we have some weirdness with missing bytes in the .json when I did not touch those? Need to double-check before committing. 🤔 Fixed it! Not sure what happened there; ready for the pull.
Awesome!! I stumbled with this for 30 minutes today before realizing that this repo's active branch is
NebularNerd wants to merge 9 commits into cdgriffith:master from NebularNerd:2024-05-11-IMGHDR-Parity It is unintuitive but |
Normally @cdgriffith moves them over when he's ready to look at them. If he prefers, I can move it to develop when creating the pull.
For #76 or #76.nextgen, I would be interested in creating parameterized tests just by reading Creating |
The magic.json may die in the future based on @cdgriffith's musings in #70, so at this point it may not be worth the time creating tests for something that may not exist in the future. The .json is pretty straightforward if we take a line: If there is a matching section in
This will then look for every pattern listed at the given byte offset. The offset can be a positive number, or a negative number to work backwards from the end of the file (see -128 for Once all matching is done, the confidence scores are generated from the results list. As I was asking in #76, if you are trying to test the strings as an exact match they may fail. @cdgriffith's original goal (I assume) with the confidence method is to ensure that the best real-world match is given. From my personal approach to the PRs, I make use of official specs and test files as far as possible, which is why my PRs can sometimes be a bit wordy when explaining choices and the reasons behind them. For example, while TIFF uses |
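The offset rule described above can be sketched roughly as follows. This is only a minimal illustration, not PureMagic's actual code: `matches_at` and the sample bytes are hypothetical names invented for this example.

```python
def matches_at(data: bytes, signature: bytes, offset: int) -> bool:
    """Return True if `signature` appears in `data` at `offset`.

    A negative offset counts backwards from the end of the data,
    mirroring the negative offsets described for magic.json.
    (Hypothetical helper, not part of PureMagic.)
    """
    start = offset if offset >= 0 else len(data) + offset
    if start < 0:
        # The data is too short for this negative offset to land inside it.
        return False
    return data[start:start + len(signature)] == signature


# Made-up sample: a 4-byte header, padding, and a 4-byte trailer.
sample = b"HEAD" + b"\x00" * 8 + b"TAIL"

print(matches_at(sample, b"HEAD", 0))     # positive offset from the start
print(matches_at(sample, b"TAIL", -4))    # negative offset from the end
print(matches_at(sample, b"TAIL", -128))  # offset reaches before the start
```

Guarding against a negative start matters because Python slicing would otherwise silently wrap around on short files.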
Thank you for all this hard work @NebularNerd ! Feel free to set it directly to develop, as that should be the latest and what will be in the next release. Otherwise when I switch it may force you to do refactoring, and I don't want double work! @cclauss Sorry for no clear documentation starting out for the magic data, never assumed anyone would actually work on this repo but me 😆 I still see commits coming in, @NebularNerd; when you are complete let me know and I can merge! |
I'll call it done for now. The main thing was to get parity with imghdr. 😎 |
You might have to reopen #2 afterwards; I did not know GitHub looks at the title and uses things in there to close issues. One of my commits has a title that I cannot edit that may close it. |
The readthedocs has been decommed for a while, thank you, merging! |
- Adding #72 #75 #76 #81 `.what()` to be a drop in replacement for `imghdr.what()` (thanks to Christian Clauss and Andy - NebularNerd)
- Adding #67 Test on Python 3.13 beta (thanks to Christian Clauss)
- Adding #77 from __future__ import annotations (thanks to Christian Clauss)
- Fixing #66 Confidence sorting (thanks to Andy - NebularNerd)

---------

Co-authored-by: Andy <[email protected]>
Co-authored-by: Christian Clauss <[email protected]>
IMGHDR Parity update
Closes #68
These updates ensure PureMagic can match anything imghdr could, and match it as well as imghdr did, if not better in most cases.
.jpg (No changes):
`b'\xff\xd8\xff\xdb'` / `0xffd8ffdb`, which is a 1:1 match with PureMagic
`b'JFIF'` and `b'Exif'` at a fixed location. This is not present in every file due to headers, thumbnails, etc.
There are improvements we could make to .jpg, such as combining the `JFIF` and `EXIF` matches in regexes, but that can wait until post v2.0 to create more detailed/higher-confidence matches.
.png (No changes):
`b'\211PNG\r\n\032\n'` / `0x89504e470d0a1a0a`
Nothing to change, this matches PureMagic. All PNGs will have this header.
.gif (No changes):
`b'GIF87a'` / `0x474946383761` and `b'GIF89a'` / `0x474946383961`
Nothing to change, this matches PureMagic. All GIFs will have one or the other header.
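As a quick sanity check, the byte literals that imghdr uses and the hex strings quoted above can be compared directly. The snippet below is just an illustrative sketch; `SIGNATURES` and `literal_matches_hex` are names made up for this example and are not part of either library.

```python
# Each entry pairs the Python bytes literal with the hex form quoted above.
SIGNATURES = {
    "jpg": (b"\xff\xd8\xff\xdb", "ffd8ffdb"),
    "png": (b"\211PNG\r\n\032\n", "89504e470d0a1a0a"),
    "gif87a": (b"GIF87a", "474946383761"),
    "gif89a": (b"GIF89a", "474946383961"),
}


def literal_matches_hex(literal: bytes, hex_string: str) -> bool:
    """Return True when the bytes literal and the hex string are the same bytes."""
    return literal == bytes.fromhex(hex_string)


for name, (literal, hex_string) in SIGNATURES.items():
    print(name, literal_matches_hex(literal, hex_string))
```

Note that `\211` and `\032` in the PNG literal are octal escapes, which is why they show up as `89` and `1a` in the hex form.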
.tiff / .tif (Tidying):
`b'MM'` / `0x4d4d` (Motorola format) and `b'II'` / `0x4949` (Intel format)
PureMagic already uses better matches with `0x49492a00` and `0x4d4d002a`, which pretty much ensure it's a TIFF. There are actually loads of duplicate TIFF entries; I have removed the extraneous longer matches and duplicates. There seems to be another TIFF header of `0x492049` which is in PureMagic and loads of other file ID lists; however, it's not in the official spec. More investigation is needed on that before potential removal as a duff entry.
.rgb SGI image (Enhanced):
`b'\001\332'` / `0x01da`, which is quite small. PureMagic uses `0x01da01010003`, which is too specific.
The PureMagic match is specifically for an SGI RGB Image with the following properties: RLE Compressed, 1 bpc, Multiple 2D Images. As mentioned in #68, this would be a great format for rule-based matching as the header contains a lot of information and has a long 404 dummy bytes chunk at the end. For now, I shall use basic matches similar to my PCX/MP3 work to ensure we get a good baseline to build on for the future.
.pbm / .pgm / .ppm (Improvements):
and
's would make them overly specific.
UPDATED
I've added better descriptions to allow for ASCII or BINARY variants, and also added multi-matches for SPACE, TAB and WIN/NIX newlines, repeating them for those also followed by a #. This will improve matches, but they will always have a lowish confidence. I also removed a rogue PGM match that was floating around by itself.
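The idea of pairing the two-byte magic with the whitespace that follows it can be sketched like this. It is a simplified illustration under my own assumptions (the standard Netpbm `P1` to `P6` magic numbers and a plain prefix check), not the multi-match entries added to the .json; `identify_netpbm` is a hypothetical helper.

```python
# Standard Netpbm magic numbers: P1-P3 are ASCII, P4-P6 are binary.
NETPBM_TYPES = {
    b"P1": "PBM (ASCII)",
    b"P2": "PGM (ASCII)",
    b"P3": "PPM (ASCII)",
    b"P4": "PBM (binary)",
    b"P5": "PGM (binary)",
    b"P6": "PPM (binary)",
}

# Separators allowed directly after the magic: space, tab, NIX and WIN newlines.
SEPARATORS = (b" ", b"\t", b"\n", b"\r\n")


def identify_netpbm(data: bytes):
    """Return a description if data starts with a Netpbm magic plus whitespace."""
    kind = NETPBM_TYPES.get(data[:2])
    if kind is None:
        return None
    if not data[2:].startswith(SEPARATORS):
        return None  # "P6" followed by something else is probably not Netpbm
    return kind


print(identify_netpbm(b"P6\n640 480\n255\n"))
print(identify_netpbm(b"P1\r\n# a comment line\n2 2\n"))
print(identify_netpbm(b"P5x"))
```

Requiring the separator lengthens a very short signature a little, which is why such matches can only ever reach a lowish confidence.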
.sun Sun Raster (Enhanced):
`b'\x59\xA6\x6A\x95'` / `0x59a66a95`, this is a 1:1 match with PureMagic
The header above is in every Sun raster file. I have added some variant-specific info to improve confidence and provide better info about the flavour of the file. Again, in the future there is a little more we could describe about the file if we wanted to.
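As a rough idea of what variant-specific info is available, the sketch below decodes a Sun raster header. It assumes the conventional layout of eight big-endian 32-bit fields (magic, width, height, depth, length, type, maptype, maplength); `describe_sun_raster` and `RAS_TYPES` are hypothetical names and this is not how the .json entries are implemented.

```python
import struct

RAS_MAGIC = 0x59A66A95

# Type field values as commonly documented for the rasterfile header.
RAS_TYPES = {
    0: "old",
    1: "standard",
    2: "byte-encoded (RLE)",
    3: "RGB",
    4: "TIFF",
    5: "IFF",
    0xFFFF: "experimental",
}


def describe_sun_raster(header: bytes):
    """Decode the 32-byte Sun raster header, or return None if it is not one."""
    if len(header) < 32:
        return None
    fields = struct.unpack(">8I", header[:32])
    magic, width, height, depth, _length, ras_type, _maptype, _maplength = fields
    if magic != RAS_MAGIC:
        return None
    return {
        "width": width,
        "height": height,
        "depth": depth,
        "type": RAS_TYPES.get(ras_type, "unknown"),
    }


# Made-up header for a 4x2, 24-bit, standard raster.
example = struct.pack(">8I", RAS_MAGIC, 4, 2, 24, 24, 1, 0, 0)
print(describe_sun_raster(example))
```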
.xbm X Bitmap (New!):
`b'#define '` / `0x23646566696e6520`
A new format for PureMagic; every file uses the above header. Oddly, despite Wikipedia giving it a MIME type, it's not listed at IANA. Improvements could be possible later by regex-ing `width` and `height`.
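A possible shape for that later regex improvement is sketched below. It assumes the usual XBM convention of `#define <name>_width N` and `#define <name>_height N` lines; the pattern and `xbm_dimensions` are illustrative guesses, not something PureMagic currently does.

```python
import re

# Match "#define <anything>width <n>" or "#define <anything>height <n>".
XBM_DIMENSION = re.compile(rb"#define\s+\S*(width|height)\s+(\d+)")


def xbm_dimensions(data: bytes):
    """Return (width, height) if both defines are found, otherwise None."""
    found = {
        match.group(1).decode(): int(match.group(2))
        for match in XBM_DIMENSION.finditer(data)
    }
    if "width" in found and "height" in found:
        return found["width"], found["height"]
    return None


# Made-up XBM content for illustration.
sample_xbm = (
    b"#define test_width 16\n"
    b"#define test_height 8\n"
    b"static char test_bits[] = {0x00};\n"
)
print(xbm_dimensions(sample_xbm))
print(xbm_dimensions(b"#define FOO 1\n"))
```

Requiring both defines would help separate real XBM files from ordinary C headers, which also start with `#define `.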
.bmp and variants (No change to BMP yet, added some other headers):
`b'BM'` / `0x424d`, the same as PureMagic
That's a tiny header, but improving matches appears to require reading the DIB header, then converting that into more readable data from the DWORD32 string.
I've added some other headers from the format spec while I was there but again, they are small and need the more detailed matching to gain higher confidence.
Without looking into it too much right now, I believe we would need to do something as shown here
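For a feel of what reading the DIB header involves, here is a minimal sketch. It assumes the standard BMP layout: a 14-byte file header (`BM`, file size, two reserved words, pixel data offset) followed by a little-endian DWORD giving the DIB header size, which identifies the header variant. `describe_bmp` and `DIB_HEADERS` are hypothetical names for this example.

```python
import struct

# DIB header size -> header variant (common sizes only).
DIB_HEADERS = {
    12: "BITMAPCOREHEADER",
    40: "BITMAPINFOHEADER",
    108: "BITMAPV4HEADER",
    124: "BITMAPV5HEADER",
}


def describe_bmp(data: bytes):
    """Decode the file header and the DIB header size of a BMP file."""
    if len(data) < 18 or data[:2] != b"BM":
        return None
    # Little-endian: file size, reserved x2, pixel offset, DIB header size.
    file_size, _res1, _res2, pixel_offset, dib_size = struct.unpack(
        "<IHHII", data[2:18]
    )
    return {
        "file_size": file_size,
        "pixel_offset": pixel_offset,
        "dib_header": DIB_HEADERS.get(dib_size, f"unknown ({dib_size} bytes)"),
    }


# Made-up header: 70-byte file, pixels at offset 54, 40-byte DIB header.
example_bmp = b"BM" + struct.pack("<IHHII", 70, 0, 0, 54, 40)
print(describe_bmp(example_bmp))
```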
.webp (Enhanced and Tidied):
`b'RIFF'` / `0x52494646`, then follows up with `b'WEBP'` / `0x57454250`
The RIFF header is used by all manner of files (the acronym standing for Resource Interchange File Format). PureMagic supports .webp but with a mix of headers on their own: one for RIFF, one for WEBP and another which would have only matched the file it came from. The fix for this is to split the match into `RIFF`, then multi-match `WEBPVP8` for lossy, `WEBPVP8L` for lossless, `WEBPVP8X` for extended and `WEBP` for fringe cases.
Moving forward we can look to improve all RIFF-based files in a similar way with multi-matches and other potential v2.0 enhancements.
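The RIFF-then-WEBP split can be pictured with the small sketch below. It is an illustration only, assuming the RIFF layout of a 4-byte `RIFF` tag, a 4-byte size, a 4-byte form type and then the first chunk's FourCC; note that in the WebP container the lossy FourCC is `VP8 ` with a trailing space. `identify_webp` is a hypothetical helper, not the PureMagic implementation.

```python
# First-chunk FourCC -> WebP flavour.
WEBP_VARIANTS = {
    b"VP8 ": "lossy",
    b"VP8L": "lossless",
    b"VP8X": "extended",
}


def identify_webp(data: bytes):
    """Return the WebP flavour, or None if this is not a RIFF/WEBP file."""
    if len(data) < 12 or data[:4] != b"RIFF" or data[8:12] != b"WEBP":
        return None
    # Fall back to a generic answer for fringe cases with another first chunk.
    return WEBP_VARIANTS.get(data[12:16], "unknown variant")


riff_size = (0).to_bytes(4, "little")  # size field is ignored in this sketch
print(identify_webp(b"RIFF" + riff_size + b"WEBPVP8L"))
print(identify_webp(b"RIFF" + riff_size + b"WEBPVP8 "))
print(identify_webp(b"RIFF" + riff_size + b"WAVEfmt "))
```

The same two-step check (container tag first, form type second) is what would extend naturally to other RIFF-based formats.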
.exr OpenEXR (Added mimetype):
`b'\x76\x2f\x31\x01'` / `0x762f3101`, which is 1:1 with PureMagic
Nothing to change, but I added the mimetype from Wikipedia. There is potential in the future to expand details such as variants and versions (there seem to be at least 1.7 and 2.0 from an initial scan of the info).
BONUS Types:
A PR for just a couple of tweaks is no fun, let's add some more...
Quite OK Image Format .qoi:
A lightweight image format for games. Found it while looking at a port of Wipeout to various platforms
Quite OK Audio Format .qoa:
A lightweight audio format for games. Found it while looking at a port of Wipeout to various platforms
SimCity 2000 .sc2 maps:
I've been playing with the SC2KRender, always loved SimCity 2000 so why not add it. This is an IFF file so it uses the `FORM` we know and love that started the whole multi-match upgrade. Like `RIFF` above, the IFF `FORM` format has a lot of sub-variants that will benefit from multi-matching. This should understand Mac, Amiga and PC created maps.
TZX Cassette image .tzx:
Primarily a ZX Spectrum emulator format, it's now used by a variety of 8bit emulators as the de-facto proper way to archive a tape.
PFM, Augmented PFM and PAM:
A couple of extra formats added while fixing PBM/PGM/PPM; these are extensions of the NETPBM format. PAM could be improved later with a regex for `ENDHDR`, which would always be present but not at a fixed byte position.
Links: