audio/wave .wav files not supported #603

Closed
NightMachinery opened this issue Nov 3, 2024 · 11 comments
Labels
attachments bug Something isn't working

Comments


NightMachinery commented Nov 3, 2024

I'm recording audio from my microphone using sox and saving the recordings as .wav files. When I try to attach these files to the gemini-1.5-flash-8b-latest model, I receive this error:

Error: This model does not support attachments of type 'audio/wave', only application/pdf, image/png, image/jpeg, image/webp, image/heic, image/heif, audio/wav, audio/mp3, audio/aiff, audio/aac, audio/ogg, audio/flac, audio/mpeg, video/mp4, video/mpeg, video/mov, video/avi, video/x-flv, video/mpg, video/webm, video/wmv, video/3gpp

I suspect the issue is simply that llm doesn't recognize that audio/wave and audio/wav are actually the same MIME type. Is this correct?

@simonw simonw added bug Something isn't working attachments labels Nov 6, 2024
Owner

simonw commented Nov 8, 2024

Yup, that's a bug - thanks. You can work around it with the --at option, which lets you specify the type directly:

 llm -m gemini-1.5-flash-latest --at output.wav audio/wav transcribe

Thanks for the tip about sox by the way, this worked for me on macOS:

brew install sox
sox -d output.wav
# Hit Ctrl+C when done

Owner

simonw commented Nov 8, 2024

It looks like audio/wav is indeed the correct content type here. It's not clear where audio/wave came from, but the library I'm using for content type detection, puremagic (https://pypi.org/project/puremagic/), apparently supports both wave https://github.com/cdgriffith/puremagic/blob/763349ec4d02ba930fb1142c6eb684afdf06c6ab/puremagic/magic_data.json#L103 and wav https://github.com/cdgriffith/puremagic/blob/763349ec4d02ba930fb1142c6eb684afdf06c6ab/puremagic/magic_data.json#L1118, and for some reason it detects audio/wave in preference.

Owner

simonw commented Nov 8, 2024

puremagic uses data from https://www.garykessler.net/library/file_sigs.html, which lists two byte sequences for WAV:

[Screenshot: the two WAV signature entries from file_sigs.html]

The first of those matches the puremagic definition of audio/wave, the second matches its audio/wav.

Owner

simonw commented Nov 8, 2024

Interesting, the output.wav file I created using sox looks like this:

hexdump -C output.wav | head -n 4
00000000  52 49 46 46 48 e0 02 00  57 41 56 45 66 6d 74 20  |RIFFH...WAVEfmt |
00000010  28 00 00 00 fe ff 01 00  44 ac 00 00 10 b1 02 00  |(.......D.......|
00000020  04 00 20 00 16 00 20 00  04 00 00 00 01 00 00 00  |.. ... .........|
00000030  00 00 10 00 80 00 00 aa  00 38 9b 71 66 61 63 74  |.........8.qfact|

That header matches BOTH of the byte sequences listed in file_sigs.html, so maybe I misinterpreted that page and there is only one audio/wave file format, and this is it?

In which case, why does puremagic have those two sequences listed separately in their magic_data.json file?
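
To make the overlap concrete, here is a minimal sketch that decodes the first 36 bytes of the hexdump above with Python's struct module. The function name parse_riff_wave_header and the dictionary keys are illustrative choices, not part of puremagic or llm; the field layout follows the standard RIFF/WAVE header.

```python
import struct

# First 36 bytes of output.wav, copied from the hexdump above.
HEADER = bytes.fromhex(
    "52494646 48e00200 57415645 666d7420"
    "28000000 feff0100 44ac0000 10b10200"
    "04002000"
)


def parse_riff_wave_header(data: bytes) -> dict:
    """Decode the RIFF container header and the start of the 'fmt ' chunk."""
    # Bytes 0-11: "RIFF", little-endian payload size, form type ("WAVE").
    riff_id, riff_size, form_type = struct.unpack_from("<4sI4s", data, 0)
    # Bytes 12-19: first chunk id ("fmt ") and its size.
    chunk_id, chunk_size = struct.unpack_from("<4sI", data, 12)
    # Bytes 20-35: the fixed leading fields of the fmt chunk.
    (format_tag, channels, sample_rate,
     byte_rate, block_align, bits_per_sample) = struct.unpack_from("<HHIIHH", data, 20)
    return {
        "riff_id": riff_id,
        "riff_size": riff_size,
        "form_type": form_type,
        "chunk_id": chunk_id,
        "chunk_size": chunk_size,
        "format_tag": format_tag,
        "channels": channels,
        "sample_rate": sample_rate,
        "byte_rate": byte_rate,
        "block_align": block_align,
        "bits_per_sample": bits_per_sample,
    }


info = parse_riff_wave_header(HEADER)

# Both signatures discussed above are present in the same header:
has_riff_wave = HEADER[0:4] == b"RIFF" and HEADER[8:12] == b"WAVE"
has_wave_fmt = HEADER[8:16] == b"WAVEfmt "
print(has_riff_wave, has_wave_fmt, info["sample_rate"])
```

Read this way, "RIFF....WAVE" and "WAVEfmt " are two views of one header rather than two formats: the first spans bytes 0-11 and the second bytes 8-15, so any ordinary WAV file whose first chunk is "fmt " matches both.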

Owner

simonw commented Nov 8, 2024

This file in the puremagic tests has the same header: https://github.com/cdgriffith/puremagic/blob/master/test/resources/audio/test.wav

That's one of four audio files in the tests https://github.com/cdgriffith/puremagic/tree/master/test/resources/audio - and the only assertion it runs is that the file extension .wav is correctly determined: https://github.com/cdgriffith/puremagic/blob/763349ec4d02ba930fb1142c6eb684afdf06c6ab/test/test_common_extensions.py#L43-L49

Owner

simonw commented Nov 8, 2024

Filed an issue here:

But seeing as IANA doesn't list either audio/wav or audio/wave on https://www.iana.org/assignments/media-types/media-types.xhtml#audio it's not clear that there IS a correct answer here!

Owner

simonw commented Nov 8, 2024

Also relevant:

python -c 'import puremagic, pprint, sys; pprint.pprint(puremagic.magic_stream(open(sys.argv[-1], "rb")))' output.wav
[PureMagicWithConfidence(byte_match=b'RIFFH\xe0\x02\x00WAVE', offset=8, extension='.wav', mime_type='audio/wave', name='Waveform Audio File Format', confidence=0.8),
 PureMagicWithConfidence(byte_match=b'WAVEfmt ', offset=8, extension='.wav', mime_type='audio/x-wav', name='Windows audio file ', confidence=0.8),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.4xm', mime_type='', name='4X Movie video', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.cdr', mime_type='', name='CorelDraw document', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.avi', mime_type='video/avi', name='Resource Interchange File Format', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.cda', mime_type='', name='Resource Interchange File Format', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.qcp', mime_type='audio/vnd.qcelp', name='Resource Interchange File Format', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.rmi', mime_type='audio/mid', name='Resource Interchange File Format', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.wav', mime_type='audio/wav', name='Resource Interchange File Format', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.ds4', mime_type='', name='Micrografx Designer graphic', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.ani', mime_type='application/x-navi-animation', name='Windows animated cursor', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.dat', mime_type='video/mpeg', name='Video CD MPEG movie', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.cmx', mime_type='', name='Corel Presentation Exchange metadata', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'RIFF', offset=0, extension='.webp', mime_type='image/webp', name='RIFF WebP', confidence=0.4),
 PureMagicWithConfidence(byte_match=b'WAVE', offset=8, extension='.wav', mime_type='audio/x-wav', name='WAV audio', confidence=0.4)]
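
One plausible reading of this output, sketched below: if the caller simply takes the highest-confidence entry and ties are broken by list order, audio/wave wins, and audio/wav only shows up on the bare RIFF match at confidence 0.4. This is an assumption about how the result is chosen, not a description of puremagic's internals; the Candidate type and pick_mime_type function are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical stand-in for one row of the output above."""
    byte_match: bytes
    mime_type: str
    confidence: float


# Three of the candidates from the output above, in the order printed.
candidates = [
    Candidate(b"RIFFH\xe0\x02\x00WAVE", "audio/wave", 0.8),
    Candidate(b"WAVEfmt ", "audio/x-wav", 0.8),
    Candidate(b"RIFF", "audio/wav", 0.4),
]


def pick_mime_type(matches: list) -> str:
    # max() returns the first item among equal maxima, so when two
    # candidates share the top confidence, list order breaks the tie.
    best = max(matches, key=lambda c: c.confidence)
    return best.mime_type


print(pick_mime_type(candidates))
```

Under that assumption, audio/wav could never be selected for this file, because both 0.8 entries outrank it.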

Owner

simonw commented Nov 8, 2024

For the moment I'm going to take the opinion that audio/wav is correct and have LLM treat audio/wave as audio/wav in core. I'll change that if it turns out to be a mistake in the future.
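
A minimal sketch of what treating audio/wave as audio/wav could look like. This is not the code from the actual fix; MIME_TYPE_ALIASES and normalize_mime_type are hypothetical names used only for illustration.

```python
# Hypothetical alias table: maps a detected MIME type to the spelling
# that models list in their accepted attachment types.
MIME_TYPE_ALIASES = {
    "audio/wave": "audio/wav",
}


def normalize_mime_type(mime_type: str) -> str:
    """Return the canonical spelling of a detected MIME type."""
    cleaned = mime_type.strip().lower()
    # Unknown types pass through unchanged.
    return MIME_TYPE_ALIASES.get(cleaned, cleaned)


print(normalize_mime_type("audio/wave"))
print(normalize_mime_type("image/png"))
```

Keeping the mapping in one table means the decision can be reversed or extended in a single place if audio/wav later turns out to be the wrong choice.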

@simonw simonw closed this as completed in 5d1d723 Nov 8, 2024
Owner

simonw commented Nov 8, 2024

This works:

llm -m gemini-1.5-flash-latest -a output.wav transcribe

This is a quick test that I'm doing

@NightMachinery
Author

Thanks! ❤️ So llm detects the MIME type and hardcodes it for the API call? How does llm know whether the API accepts a given MIME type?

Owner

simonw commented Nov 9, 2024

Each plugin defines the list of accepted MIME types like this:

self.attachment_types = set()
if vision:
    self.attachment_types.update(
        {
            "image/png",
            "image/jpeg",
            "image/webp",
            "image/gif",
        }
    )
if audio:
    self.attachment_types.update(
        {
            "audio/wave",
            "audio/mpeg",
        }
    )

Full docs here: https://llm.datasette.io/en/stable/plugins/advanced-model-plugins.html#attachments-for-multi-modal-models
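
As a rough illustration of how such a set can be used, the sketch below checks a detected type against the accepted set and raises an error shaped like the one in the original report. check_attachment and ACCEPTED_TYPES are hypothetical, and the accepted set is a small illustrative subset of the types in that error message; llm's real check may differ.

```python
# Illustrative subset of the accepted types listed in the error message
# at the top of this issue.
ACCEPTED_TYPES = {"image/png", "image/jpeg", "audio/wav", "audio/mpeg"}


def check_attachment(mime_type: str, attachment_types: set) -> None:
    """Raise ValueError if the model does not accept this MIME type."""
    if mime_type not in attachment_types:
        # Sorting here is only for a stable, readable message.
        supported = ", ".join(sorted(attachment_types))
        raise ValueError(
            "This model does not support attachments of type "
            f"'{mime_type}', only {supported}"
        )


check_attachment("audio/wav", ACCEPTED_TYPES)  # accepted, no error

try:
    check_attachment("audio/wave", ACCEPTED_TYPES)
except ValueError as error:
    print(error)
```

Because the check is a plain set membership test, two spellings of the same format are treated as unrelated types unless one is mapped to the other before the check.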

simonw added a commit that referenced this issue Nov 13, 2024
simonw added a commit that referenced this issue Nov 13, 2024
simonw added a commit that referenced this issue Nov 14, 2024
…els (#613)

- #507 (comment)

* register_model is now async aware

Refs #507 (comment)

* Refactor Chat and AsyncChat to use _Shared base class

Refs #507 (comment)

* fixed function name

* Fix for infinite loop

* Applied Black

* Ran cog

* Applied Black

* Add Response.from_row() classmethod back again

It does not matter that this is a blocking call, since it is a classmethod

* Made mypy happy with llm/models.py

* mypy fixes for openai_models.py

I am unhappy with this, had to duplicate some code.

* First test for AsyncModel

* Still have not quite got this working

* Fix for not loading plugins during tests, refs #626

* audio/wav not audio/wave, refs #603

* Black and mypy and ruff all happy

* Refactor to avoid generics

* Removed obsolete response() method

* Support text = await async_mock_model.prompt("hello")

* Initial docs for llm.get_async_model() and await model.prompt()

Refs #507

* Initial async model plugin creation docs

* duration_ms ANY to pass test

* llm models --async option

Refs #613 (comment)

* Removed obsolete TypeVars

* Expanded register_models() docs for async

* await model.prompt() now returns AsyncResponse

Refs #613 (comment)

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
simonw added a commit that referenced this issue Nov 14, 2024
simonw added a commit that referenced this issue Nov 17, 2024
simonw added a commit that referenced this issue Nov 18, 2024