
.wav files detected as audio/wave when maybe they should be audio/wav #104

Open
simonw opened this issue Nov 8, 2024 · 3 comments

@simonw

simonw commented Nov 8, 2024

As far as I can tell, the "correct" type to return for a .wav file (one starting with the bytes 52 49 46 46 xx xx xx xx 57 41 56 45 66 6d 74 20) is audio/wav - but this library returns audio/wave.
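To make the byte pattern above concrete, here is a minimal sketch of reading a RIFF header by hand. The function name `riff_form_type` is hypothetical and is not part of puremagic; the sample bytes reuse the pattern above, with the four `xx` size bytes filled in from the example output later in this thread.

```python
import struct


def riff_form_type(header: bytes):
    """Return the 4-byte RIFF form type (e.g. b"WAVE"), or None if not RIFF."""
    if len(header) < 12 or header[:4] != b"RIFF":
        return None
    # Bytes 4-7: chunk size as a little-endian uint32; bytes 8-11: form type.
    _size, form = struct.unpack_from("<I4s", header, 4)
    return form


# "RIFF", size, "WAVE", "fmt " (the 16 bytes described above)
sample = bytes.fromhex("52494646" "48e00200" "57415645" "666d7420")
print(riff_form_type(sample))  # b'WAVE'
```

The check only identifies the container and its form type; it says nothing about which MIME string should be reported for it.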

I got very confused looking through the code because I came across these two lines:

["57415645", 8, ".wav", "audio/wave", "Waveform Audio File Format"],

["52494646", 0, ".wav", "audio/wav", "Resource Interchange File Format"],

I've found it hard to research the correct resolution though, as both audio/wav and audio/wave are entirely missing from what I thought was the official registry for these: https://www.iana.org/assignments/media-types/media-types.xhtml#audio

MDN lists audio/wav: https://developer.mozilla.org/en-US/docs/Web/HTTP/MIME_types/Common_types

I'm not sure there is a correct answer to this question.

@simonw
Author

simonw commented Nov 8, 2024

Tried this:

python -c 'import puremagic, pprint, sys; pprint.pprint(puremagic.magic_stream(open(sys.argv[-1], "rb")))' output.wav

And got:

[PureMagicWithConfidence(byte_match=b'RIFFH\xe0\x02\x00WAVE', offset=8, extension='.wav', mime_type='audio/wave', name='Waveform Audio File Format', confidence=0.8),
 PureMagicWithConfidence(byte_match=b'WAVEfmt ', offset=8, extension='.wav', mime_type='audio/x-wav', name='Windows audio file ', confidence=0.8),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.4xm', mime_type='', name='4X Movie video', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.cdr', mime_type='', name='CorelDraw document', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.avi', mime_type='video/avi', name='Resource Interchange File Format', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.cda', mime_type='', name='Resource Interchange File Format', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.qcp', mime_type='audio/vnd.qcelp', name='Resource Interchange File Format', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.rmi', mime_type='audio/mid', name='Resource Interchange File Format', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.wav', mime_type='audio/wav', name='Resource Interchange File Format', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.ds4', mime_type='', name='Micrografx Designer graphic', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.ani', mime_type='application/x-navi-animation', name='Windows animated cursor', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.dat', mime_type='video/mpeg', name='Video CD MPEG movie', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.cmx', mime_type='', name='Corel Presentation Exchange metadata', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.webp', mime_type='image/webp', name='RIFF WebP', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'WAVE', offset=8, extension='.wav', mime_type='audio/x-wav', name='WAV audio', confidence=0.4)]
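Until the library settles on one spelling, a caller could normalize the result on their side. The following is only a sketch under stated assumptions: `Match` is a hypothetical stand-in that mirrors three of the fields printed above (it is not imported from puremagic), and treating audio/wav as the canonical spelling is an arbitrary choice for illustration, not something this thread has decided.

```python
from collections import namedtuple

# Hypothetical stand-in for the result objects printed above.
Match = namedtuple("Match", "extension mime_type confidence")

# Spellings seen in this thread; which one is canonical is an assumption.
WAV_ALIASES = {"audio/wave", "audio/x-wav"}


def normalize_mime(mime_type, canonical="audio/wav"):
    """Map known WAV aliases onto one spelling; leave other types alone."""
    return canonical if mime_type in WAV_ALIASES else mime_type


def best_mime(matches):
    """Take the highest-confidence match and normalize its MIME type."""
    best = max(matches, key=lambda m: m.confidence)
    return normalize_mime(best.mime_type)


matches = [
    Match(".wav", "audio/wave", 0.8),
    Match(".wav", "audio/x-wav", 0.8),
    Match(".avi", "video/avi", 0.4),
]
print(best_mime(matches))  # audio/wav
```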

@simonw
Author

simonw commented Nov 8, 2024

I had a similar issue on llm-gemini where puremagic was returning audio/mpeg for MP3 files but the Gemini AI wanted audio/mp3.

It turned out in that case puremagic was correct and Gemini was wrong - the official mimetype for MP3 is indeed audio/mpeg.

@NebularNerd
Contributor

Tried this:

python -c 'import puremagic, pprint, sys; pprint.pprint(puremagic.magic_stream(open(sys.argv[-1], "rb")))' output.wav

And got.....

.wav files are part of the RIFF family, so their detection is currently split across the initial RIFF at byte 0 and then the WAVE at byte 8 (IFF- and PK-based (aka Zip) formats are similar). Because of the size of the .json, some duplication and occasional hiccups creep in. @cdgriffith is working on a v2 update which will improve detection, and collectively we should be able to remove the duplication. Looking at the results, next time I do a PR I would personally look to roll the standalone WAVEfmt entry into the RIFF family to reduce duplication but, as you say, which MIME type should we go for?

The MIME types are a nightmare. According to the Library of Congress .wav page, both are valid, and they even mention a few others.
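The kind of duplication described in this comment can be surfaced with a small script. This is a sketch only: the two rows are the entries quoted at the top of this issue, the row layout (hex signature, offset, extension, MIME type, name) is inferred from those two lines, and `conflicting_mime_types` is a hypothetical helper, not part of puremagic.

```python
from collections import defaultdict

# The two entries quoted at the top of this issue.
ROWS = [
    ["57415645", 8, ".wav", "audio/wave", "Waveform Audio File Format"],
    ["52494646", 0, ".wav", "audio/wav", "Resource Interchange File Format"],
]


def conflicting_mime_types(rows):
    """Return extensions that are mapped to more than one MIME type."""
    by_ext = defaultdict(set)
    for _signature, _offset, ext, mime, _name in rows:
        if mime:  # skip entries with an empty MIME type
            by_ext[ext].add(mime)
    return {ext: sorted(m) for ext, m in by_ext.items() if len(m) > 1}


print(conflicting_mime_types(ROWS))  # {'.wav': ['audio/wav', 'audio/wave']}
```

Run over the full data file, a report like this would list every extension that needs a decision, not just .wav.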
