feat: add external providers for audio transcription and LLM-based post-processing (post-processing is similar to PR #355) #466
Conversation
shortcut.rs (1 warning)
- Unused import: APPLE_INTELLIGENCE_DEFAULT_MODEL_ID — removed from imports

signal_handle.rs (6 warnings)
- Unused import: crate::actions::ACTION_MAP — added #[cfg(unix)]
- Unused import: crate::ManagedToggleState — added #[cfg(unix)]
- Unused imports: debug, info, warn from log — added #[cfg(unix)]
- Unused import: std::thread — added #[cfg(unix)]
- Unused imports: AppHandle, Manager from tauri — added #[cfg(unix)]

(These were all only used in unix-specific code, so adding #[cfg(unix)] prevents the warnings on Windows.)

settings.rs (1 warning)
- Unused mut: providers variable — added #[allow(unused_mut)] (it's only mutated on macOS aarch64)

lib.rs (1 warning)
- Unused mut: builder variable — added #[allow(unused_mut)] (it's only mutated on macOS with the nspanel plugin)
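The cfg-gating pattern used for these fixes can be sketched as follows. The function names here are hypothetical stand-ins for illustration, not the actual Handy code:

```rust
// Sketch of the #[cfg(unix)] gating pattern described above: imports and
// helpers used only by unix-specific code are compiled out on other targets,
// so Windows builds produce no unused-import warnings.
// The names below are illustrative, not the real Handy items.

#[cfg(unix)]
use std::thread;

#[cfg(unix)]
fn spawn_signal_listener() -> String {
    // On unix targets the gated import above is actually used here.
    let handle = thread::spawn(|| "signal listener (unix)".to_string());
    handle.join().unwrap()
}

#[cfg(not(unix))]
fn spawn_signal_listener() -> String {
    // No-op stub so callers compile unchanged on Windows.
    "signal listener (stub)".to_string()
}

fn main() {
    println!("{}", spawn_signal_listener());
}
```

Gating the import together with its only call site is what silences the warning without an `#[allow]`, whereas `providers` and `builder` keep their `mut` on all targets and therefore need `#[allow(unused_mut)]` instead.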
---
To be honest with you, I'm not exactly sure what to say. There have been many discussions around this topic and why it hasn't been approved before. Have you read them? What makes this PR different? I actually think you've done the best job of implementing the UI so far, but that doesn't really change my overall stance on API models. I am not particularly interested in having them in Handy for a variety of reasons at the moment, and you can go ahead and read the existing lengthy conversations on this. You're welcome to continue to challenge my opinion and position, as well as collect feedback from the community. Clear and obvious feedback, support, and discussion from the community on the best way forward would probably sway my opinion.

I may consider this PR if you remove the part with API models for transcription; let's just consider the post-processing part for now. Also, if you are implementing API models, you should use the support in transcribe-rs.

Unfortunately, I have to be a bit defensive about what makes it into the codebase, and I really need someone with a strong enough opinion or community support for things to make it into the repo. Not every PR can be accepted; otherwise the entire app turns into an unmaintainable mess, and it's already close to that in my opinion.
---
You gotta clean this PR up. It's not in any state to be merged. There are a bunch of breaking changes; things like changing tauri.conf.json and other stuff are not gonna fly. Review every file for changes, and only submit what is absolutely necessary. I skimmed the PR and don't have time to review a bunch of random changes to files that don't seem meaningful.
….tsx formatting changes
- Reverted shortcut.rs extra blank line
- Reverted tauri.conf.json to restore the Windows signing command and original formatting
---
Hi @cjpais, Thank you for the detailed and thoughtful feedback! I apologize for not reading through the previous discussions (#77, #279, #222, Discussion #168) before submitting. I've now gone through them and understand your position on keeping Handy primarily a local transcription app. What I've done with this PR:
Regarding transcribe-rs and future API support: I checked transcribe-rs and noticed its
I'll raise a PR to transcribe-rs to:
Once that's merged, I can create a proper PR for Handy that uses transcribe-rs for API transcription, following your preferred approach.

Why this matters to me (and others): I have a low-end PC where local transcription is quite slow. Providers like Groq and SambaNova offer extremely fast transcription (often under 1 second) via their OpenAI-compatible APIs, which makes Handy usable for people like me who don't have powerful hardware. I believe there are others in the community with similar needs (like the user in #77).

About AI assistance: Yes, I used AI tools (Claude Opus 4.5) to help implement this. My intent was to:
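As a rough illustration of what "OpenAI-compatible" buys here: the transcription endpoint is simply a well-known path appended to whichever base URL the provider exposes (the Groq base URL below is an example; the `/audio/transcriptions` path follows the OpenAI API convention). A minimal sketch, leaving out the actual HTTP request:

```rust
// Builds the transcription endpoint URL for an OpenAI-compatible provider.
// Only URL construction is shown; the real request would POST a multipart
// form with the audio file and model name, plus a Bearer token header.
fn transcription_endpoint(base_url: &str) -> String {
    // Trim any trailing slash so we never produce "…//audio/transcriptions".
    format!("{}/audio/transcriptions", base_url.trim_end_matches('/'))
}

fn main() {
    // Groq exposes an OpenAI-compatible API under /openai/v1.
    let url = transcription_endpoint("https://api.groq.com/openai/v1/");
    assert_eq!(url, "https://api.groq.com/openai/v1/audio/transcriptions");
    println!("{}", url);
}
```

Because every such provider (Groq, SambaNova, a self-hosted server) shares this shape, one code path can serve them all by swapping the base URL and key.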
I should have been more thoughtful about how this fits with the project's architecture and your vision.

Here is the updated UI for post-processing. Post-processing disabled:
Post-processing enabled:

Post-processing tab:

The history remains the same as above.
This is fundamentally the same thing as the earlier discussion. My stance has not significantly changed. Please do not submit a PR for it without gathering significant support in Discussions, as well as coming up with a way forward that does not require someone to use an API key, or that hides this functionality in a nice way which enables power users to use it.

As for this PR: please describe what you have done, exactly, in your own language. No AI descriptions. How does this improve Handy? I can see some details in the screenshot, but I want you to be explicit about it, so I can validate what the code is doing based on what you wanted to do.

Post-processing is not a 'general' feature right now. There's a reason it's in debug, and I will be moving it to a new menu when I think the feature is good enough for prime time. I'm happy to accept other UI/UX improvements to the feature itself in the meantime.
---
It's okay. Since I need this feature desperately, I'll modify and build it for my use only. I will close this PR, as there is not much left without the online API support. Post-processing is already in a debug state, so there is not much point in implementing it.
---
Sounds good, and if you do want to get the API support into the mainline build, please go collect support so we can have an open discussion there!
---
A good use case would be to use https://github.com/speaches-ai/speaches locally. Parakeet models are really not that good, and support for other Whisper models is currently broken. The ability to specify an API endpoint would let the user run any desired model locally and free you from having to support each new model yourself.
---
Here are some other use cases: a small company with thin-client hardware could purchase one beefy box equipped with a GPU, enabling all clients to use it for transcription; or a family with one GPU server running, everyone accessing it via Tailscale wherever they are.
---
You mentioned in another thread that you envisioned users sharing their GPU capacity with others in order to improve transcriptions. To my knowledge, sharing that node in a tailnet is by far the easiest and most secure way to accomplish that. All we would need is the ability to specify an API endpoint in Handy.
---
Handy is not going to support STT API providers. People are welcome to fork it if they want that support. It will eventually be a local provider itself.
---
Hi @User-3090, I've created Babbl. Babbl is a fork of Handy, extended with API integration and post-processing features. Please check it out and let me know your thoughts.



Summary
This PR adds support for external AI providers for both audio transcription and post-processing, enabling users to leverage cloud-based services for faster and more accurate speech-to-text conversion. This speeds up transcription and post-processing significantly when using a provider like Groq or Cerebras, even on my potato PC.
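A minimal sketch of how such provider settings might be modeled. These type and field names are hypothetical illustrations, not the PR's actual data model:

```rust
// Hypothetical sketch of a provider configuration for external transcription
// and post-processing services; not the PR's actual types.
#[derive(Debug, Clone, PartialEq)]
enum ProviderKind {
    Transcription,
    PostProcessing,
}

#[derive(Debug, Clone)]
struct ProviderConfig {
    kind: ProviderKind,
    /// Base URL of an OpenAI-compatible endpoint, e.g. Groq or a local server.
    base_url: String,
    /// Model identifier understood by the provider.
    model: String,
    /// API key; empty for unauthenticated local servers.
    api_key: String,
}

impl ProviderConfig {
    /// Heuristic: treat loopback hosts as local, key-free providers.
    fn is_local(&self) -> bool {
        self.base_url.contains("localhost") || self.base_url.contains("127.0.0.1")
    }
}

fn main() {
    let cfg = ProviderConfig {
        kind: ProviderKind::Transcription,
        base_url: "http://localhost:8000/v1".to_string(),
        model: "whisper-large-v3".to_string(),
        api_key: String::new(),
    };
    assert!(cfg.is_local());
    println!("{:?} provider at {}", cfg.kind, cfg.base_url);
}
```

Keeping one config shape for both transcription and post-processing providers is what lets a single settings UI drive both features.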
Features
🎤 Online Audio Transcription Providers
📝 Post-Processing with LLM Providers
🔧 Backend Changes
UI Changes
Related
Similar to PR #355, but focused on settings-based configuration without hotkey integration.
Screenshots