
feat: add external providers for audio transcription and LLM-based post-processing [post-processing is similar to PR #355] #466

Closed
avijitbhuin21 wants to merge 13 commits into cjpais:main from avijitbhuin21:main

Conversation

@avijitbhuin21

@avijitbhuin21 avijitbhuin21 commented Dec 16, 2025

Summary

This PR adds support for external AI providers for both audio transcription and post-processing, enabling users to leverage cloud-based services for faster and more accurate speech-to-text conversion. This significantly speeds up both transcription and post-processing when using a provider like Groq or Cerebras, even on my potato PC.

Features

🎤 Online Audio Transcription Providers

  • Added support for multiple external transcription providers:
    • OpenAI (Whisper)
    • Groq (Whisper Large V3 / V3 Turbo)
    • Gemini (2.0/2.5 Flash, Flash Lite, Pro models)
    • SambaNova (Whisper Large V3)
  • Per-provider API key storage with secure masking
  • Per-provider model selection with persistence
  • Toggle to switch between local and online providers
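The per-provider key storage with masked display could be sketched roughly like this (the types and method names here are hypothetical illustrations, not Handy's actual settings code):

```rust
use std::collections::HashMap;

// Sketch (hypothetical types): per-provider API key storage with masked
// display, as described above. Handy's real settings code may differ.
struct ProviderKeys {
    keys: HashMap<String, String>,
}

impl ProviderKeys {
    fn new() -> Self {
        ProviderKeys { keys: HashMap::new() }
    }

    fn set(&mut self, provider: &str, key: &str) {
        self.keys.insert(provider.to_string(), key.to_string());
    }

    // Show only the last four characters in the settings UI.
    fn masked(&self, provider: &str) -> Option<String> {
        self.keys.get(provider).map(|k| {
            let tail = &k[k.len().saturating_sub(4)..];
            format!("****{}", tail)
        })
    }
}

fn main() {
    let mut store = ProviderKeys::new();
    store.set("groq", "gsk_example123456");
    println!("{}", store.masked("groq").unwrap()); // prints "****3456"
}
```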

📝 Post-Processing with LLM Providers

  • Extended post-processing to support multiple LLM providers:
    • OpenAI, OpenRouter, Gemini, Groq, Cerebras, SambaNova, and Custom endpoints
  • Configurable base URL for self-hosted/custom endpoints
  • Model fetching and selection with refresh capability
  • Custom prompt management (create, edit, delete prompts)
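The configurable base URL works because all of these providers expose OpenAI-compatible paths; a minimal sketch of the idea, with a hypothetical helper name (not Handy's actual code):

```rust
// Sketch (hypothetical helper): joining a user-configured base URL with
// the OpenAI-compatible transcription path. Only an illustration of the
// configurable-endpoint idea, not Handy's actual implementation.
fn transcription_url(base_url: &str) -> String {
    // Trim any trailing slash so "/v1" and "/v1/" behave the same.
    format!("{}/audio/transcriptions", base_url.trim_end_matches('/'))
}

fn main() {
    // A hosted provider and a self-hosted endpoint share the same path.
    println!("{}", transcription_url("https://api.groq.com/openai/v1"));
    println!("{}", transcription_url("http://localhost:8000/v1/"));
}
```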

🔧 Backend Changes

  • New Tauri commands for online provider settings management
  • API key and model storage per provider
  • Settings persistence for both transcription and post-processing providers

UI Changes

  • New "Online Providers" settings panel with provider/model/API key configuration
  • Enhanced "Post Processing" settings with multi-provider support
  • Consistent UI patterns with the existing settings design

Related

Similar to PR #355 but focuses on settings-based configuration without hotkey integration.

Screenshots

(screenshots attached)

avijitbhuin21 and others added 6 commits December 16, 2025 05:58
shortcut.rs (1 warning)
  • Unused import: APPLE_INTELLIGENCE_DEFAULT_MODEL_ID — removed from imports

signal_handle.rs (6 warnings)
  • Unused import: crate::actions::ACTION_MAP — added #[cfg(unix)]
  • Unused import: crate::ManagedToggleState — added #[cfg(unix)]
  • Unused imports: debug, info, warn from log — added #[cfg(unix)]
  • Unused import: std::thread — added #[cfg(unix)]
  • Unused imports: AppHandle, Manager from tauri — added #[cfg(unix)]

(These were all only used in unix-specific code, so adding #[cfg(unix)] prevents the warnings on Windows.)

settings.rs (1 warning)
  • Unused mut: providers variable — added #[allow(unused_mut)] (it's only mutated on macOS aarch64)

lib.rs (1 warning)
  • Unused mut: builder variable — added #[allow(unused_mut)] (it's only mutated on macOS with the nspanel plugin)
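Both fixes follow standard Rust patterns for target-specific code; a self-contained sketch (the surrounding functions and provider names are invented for illustration):

```rust
// Sketch of the two warning fixes described in the commit notes above.

// 1. Gate an import behind #[cfg(unix)] when only unix-specific code
//    uses it, so Windows builds don't emit unused-import warnings.
#[cfg(unix)]
use std::thread;

#[cfg(unix)]
#[allow(dead_code)]
fn spawn_signal_listener() {
    // Unix-only: a real handler would watch for signals here.
    thread::spawn(|| {});
}

// 2. Allow an unused `mut` when the variable is only mutated on one
//    target, as with the `providers` list on macOS aarch64.
fn provider_list() -> Vec<&'static str> {
    #[allow(unused_mut)]
    let mut providers = vec!["openai", "groq"];
    #[cfg(all(target_os = "macos", target_arch = "aarch64"))]
    providers.push("apple-intelligence");
    providers
}

fn main() {
    println!("{:?}", provider_list());
}
```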
@cjpais
Owner

cjpais commented Dec 17, 2025

To be honest with you, I'm not exactly sure what to say. There's been many discussions around this topic and why it hasn't been approved before. Have you read them? What makes this PR different? I actually think you've done the best job of implementing the UI so far, but that doesn't really change my overall stance on API models. I am not particularly interested in having them in Handy for a variety of reasons at the moment and you can go ahead and read the existing lengthy conversations on this. You're welcome to continue to challenge my opinion and position as well as collect feedback from the community. Clear and obvious feedback, support, and discussion from the community on the best way forward would probably sway my opinion.

I may consider this PR if you remove the part with API models for transcription and let's just consider the post-processing part for now.

Also, if you are implementing API models you should use the support transcribe-rs already has for it, and if that is not sufficient we should add more support there. I know this is almost certainly a vibe-coded PR, so this probably wasn't thought of. If it was vibe coded it would probably be helpful to know the prompts, because at least that shows a clear intent and purpose; a bunch of LLM-generated text as the PR description is something I will basically always skip reading, because I want to know a human's intention, not a machine's intention.

Unfortunately I have to be a bit defensive of what makes it into the codebase and I really need to have someone who has a strong enough opinion or community support to have things make it into the repo. Not every PR can be accepted otherwise the entire app turns into an unmaintainable mess and it's already close to that in my opinion.

@cjpais
Owner

cjpais commented Dec 17, 2025

You gotta clean this PR up. It's not in any state to be merged. There's a bunch of breaking changes, things like changing tauri.conf.json and other stuff is not gonna fly.

Review every file for the changes, and only submit what is absolutely necessary. I skimmed the PR and don't have time to review a bunch of random changes to files that don't seem meaningful.

….tsx formatting changes - Reverted shortcut.rs extra blank line - Reverted tauri.conf.json to restore Windows signing command and original formatting
@avijitbhuin21
Author

Hi @cjpais,

Thank you for the detailed and thoughtful feedback! I apologize for not reading through the previous discussions (#77, #279, #222, Discussion #168) before submitting. I've now gone through them and understand your position on keeping Handy primarily a local transcription app.

What I've done with this PR:

  • ✅ Removed the API transcription feature as you requested
  • ✅ Kept only the LLM post-processing feature, which seems to have more community interest

Regarding transcribe-rs and future API support:

I checked transcribe-rs and noticed its openai feature currently uses a hardcoded enum that only supports 3 OpenAI models (whisper-1, gpt-4o-mini-transcribe, gpt-4o-transcribe). This means providers like Groq and SambaNova won't work out of the box since they use different model names (whisper-large-v3, whisper-large-v3-turbo).

I'll raise a PR to transcribe-rs to:

  • Allow passing model names as strings instead of the fixed enum
  • Add helper configurations for providers like Groq, SambaNova, and others with OpenAI-compatible endpoints

Once that's merged, I can create a proper PR for Handy that uses transcribe-rs for API transcription, following your preferred approach.
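The mismatch can be sketched in a few lines. The types below are hypothetical stand-ins, not transcribe-rs's actual API; they only illustrate why a closed OpenAI-only enum can't express Groq or SambaNova model names:

```rust
// Sketch (hypothetical types): a fixed model enum vs. a string-based
// model field. A closed enum cannot represent other providers' names
// like "whisper-large-v3", while a plain string can.
#[allow(dead_code)]
enum OpenAiModel {
    Whisper1,
    Gpt4oMiniTranscribe,
    Gpt4oTranscribe,
}

// A string accepts any OpenAI-compatible provider's model name.
struct TranscribeParams {
    model: String,
}

fn groq_params() -> TranscribeParams {
    TranscribeParams { model: "whisper-large-v3".to_string() }
}

fn main() {
    println!("{}", groq_params().model);
}
```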

Why this matters to me (and others):

I have a low-end PC where local transcription is quite slow. Providers like Groq and SambaNova offer extremely fast transcription (often under 1 second) via their OpenAI-compatible APIs, which makes Handy usable for people like me who don't have powerful hardware. I believe there are others in the community with similar needs (like the user in #77).

About AI assistance:

Yes, I used AI tools (Claude opus 4.5) to help implement this. My intent was to:

  • Enable fast transcription for users with slow hardware
  • Add flexible LLM post-processing with multiple providers

I should have been more thoughtful about how this fits with the project's architecture and your vision.

Here is the updated UI for post-processing.

Post-processing disabled:

image

Post-processing enabled:

image

Post-processing tab:

image

The history remains the same as above.

@cjpais
Owner

cjpais commented Dec 18, 2025

I have a low-end PC where local transcription is quite slow. Providers like Groq and SambaNova offer extremely fast transcription (often under 1 second) via their OpenAI-compatible APIs, which makes Handy usable for people like me who don't have powerful hardware. I believe there are others in the community with similar needs (like the user in #77).

This is fundamentally the same thing as the earlier discussion. My stance has not significantly changed. Please do not submit a PR for it without gathering significant support in Discussions, and without coming up with a way forward that does not require someone to use an API key, or that hides this functionality in a nice way which still enables power users to use it.

As for this PR. Please describe what you have done exactly in your own language. No AI descriptions. How does this improve Handy? I can see some details in the screenshot but I want you to be explicit about it, so I can validate what the code is doing based on what you wanted to do. Post processing is not a 'general' feature right now. There's a reason it's in debug and I will be moving it to a new menu when I think the feature is good enough for prime time. I'm happy to accept other UI/UX improvements to the feature itself in the meantime

@avijitbhuin21
Author

It's okay. Since I need this feature desperately, I'll modify and build it for my use only. I will close this PR as there is not much left without the online API support; post-processing is already in a debug state, so there is not much point in implementing it here.
Thanks for the feedback man, appreciate your time.

@cjpais
Owner

cjpais commented Dec 18, 2025

Sounds good, and if you do want to get the API support in the mainline build please go collect support so we can have an open discussion there!

@User-3090

User-3090 commented Jan 20, 2026

A good use case would be to use https://github.com/speaches-ai/speaches locally. Parakeet models are really not that good, and support for other Whisper models is currently broken. The ability to specify an API endpoint would give the user the possibility to run any desired model locally and frees you up from supporting whatever new models yourself.

@User-3090

Here are some other use cases:

A small company with thin client hardware could purchase one beefy box equipped with a GPU, enabling all clients to utilize it for transcriptions.

A family with one GPU server running, everyone accessing it via Tailscale wherever they are.

@User-3090

You mentioned in another thread that you envisioned users sharing their GPU capabilities to others in order to improve transcriptions.

To my knowledge, sharing that node in a tailnet is by far the easiest and most secure way to accomplish that.

All we would need is the ability to specify an API endpoint in Handy.
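One possible shape for that (a hypothetical helper, not Handy's code): a user-supplied endpoint override that falls back to a provider default when unset.

```rust
// Sketch (hypothetical helper): resolve the effective API endpoint,
// preferring a user-configured override over the provider default.
fn effective_endpoint(custom: Option<&str>, provider_default: &str) -> String {
    custom.unwrap_or(provider_default).trim_end_matches('/').to_string()
}

fn main() {
    // A GPU box shared over a tailnet vs. the hosted default.
    println!("{}", effective_endpoint(Some("http://my-gpu-box:8000/v1/"), "https://api.openai.com/v1"));
    println!("{}", effective_endpoint(None, "https://api.openai.com/v1"));
}
```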

@cjpais
Owner

cjpais commented Jan 20, 2026

Handy is not going to support STT API providers. People are welcome to fork if they want that support. It will eventually be a local provider itself.

@avijitbhuin21
Author

Hi @User-3090 ,

I’ve created Babbl:
https://github.com/avijitbhuin21/Babbl/releases/tag/v0.1.3

Babbl is a fork of Handy, extended with API integration and post-processing features.
So far, my friends and I have been using it with Groq for super-fast processing, and it has been working very well.

Please check it out and let me know your thoughts.

