Hi, I was wondering if there would be any speed gain and size reduction from quantizing Distil-Whisper?
e.g. bitsandbytes, ONNX, GPTQ
There is a gain from quantizing the Whisper model itself without much quality loss; see here:
https://medium.com/@daniel-klitzke/quantizing-openais-whisper-with-the-huggingface-optimum-library-30-faster-inference-64-36d9815190e0
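For context, here is a minimal sketch of that approach applied to Distil-Whisper, using Optimum's dynamic INT8 quantization. The checkpoint ID, directory names, and ONNX file names are assumptions on my part and may vary across Optimum versions:

```python
# Sketch: dynamic INT8 quantization of Distil-Whisper with Hugging Face Optimum.
# Assumes `pip install optimum[onnxruntime]`; the checkpoint ID, directory
# names, and ONNX file names are illustrative and may differ by version.
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_id = "distil-whisper/distil-large-v2"  # hypothetical checkpoint choice
onnx_dir = "distil-whisper-onnx"
quant_dir = "distil-whisper-onnx-int8"

# 1. Export the PyTorch checkpoint to ONNX (encoder + decoder sub-models).
model = ORTModelForSpeechSeq2Seq.from_pretrained(model_id, export=True)
model.save_pretrained(onnx_dir)

# 2. Dynamic INT8 quantization: weights are quantized ahead of time,
#    activations at runtime, so no calibration dataset is needed.
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
for file_name in ("encoder_model.onnx", "decoder_model.onnx",
                  "decoder_with_past_model.onnx"):
    quantizer = ORTQuantizer.from_pretrained(onnx_dir, file_name=file_name)
    quantizer.quantize(save_dir=quant_dir, quantization_config=qconfig)
```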
You may wonder why quantize: I am running several models simultaneously in an AI assistant that uses an LLM (a quantized OpenChat), a multimodal vision model (LLaVA or Moondream), and a wakeword model (openwakeword). It runs on my device with 24 GB of VRAM, but I want to share it with as many users as possible, so I'd like to keep VRAM usage low.
I was looking to quantize the large-v3 model, since it has the lowest word error rate and is the second fastest, or perhaps the medium.en model.
Can anyone point me in the direction of a quantised version of Distil-Whisper, or to how I can generate one and use it for inference?
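In case it helps, a hedged sketch of what inference with such a quantized export might look like, loading the `*_quantized.onnx` files produced by the step above (all file and directory names are assumptions carried over from that sketch):

```python
# Sketch: inference with the quantized ONNX export via Optimum.
# File and directory names follow the quantization sketch above and are
# assumptions; adjust them to whatever your Optimum version actually emits.
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import AutoProcessor, pipeline

quant_dir = "distil-whisper-onnx-int8"
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    quant_dir,
    encoder_file_name="encoder_model_quantized.onnx",
    decoder_file_name="decoder_model_quantized.onnx",
    decoder_with_past_file_name="decoder_with_past_model_quantized.onnx",
)
processor = AutoProcessor.from_pretrained("distil-whisper/distil-large-v2")

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)
print(asr("sample.wav")["text"])
```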