Merge pull request #5 from csheaff/dev

csheaff · web-flow · commit f67aefd90c8b · 2026-02-14T13:56:59.000-08:00
Add config file, server auto-start, and bug fixes
diff --git a/Makefile b/Makefile
@@ -12,7 +12,7 @@ install: deps venv
 
 # Install system dependencies (requires sudo)
 deps:
-	sudo apt install -y ydotool pipewire libnotify-bin python3-venv socat
+	sudo apt install -y ydotool ffmpeg pipewire libnotify-bin python3-venv socat
 
 # Create Python venv with faster-whisper (default backend)
 venv: .venv/.done
diff --git a/README.md b/README.md
@@ -1,23 +1,28 @@
 # talktype
 
-Push-to-talk speech-to-text for Linux. Bind a keyboard shortcut, press it to
-start recording, press it again to transcribe and type the text wherever your
-cursor is.
+Push-to-talk speech-to-text for Linux. Press a hotkey to start recording, press
+it again to transcribe and type the text wherever your cursor is. No GUI, no
+app to keep running — just a keyboard shortcut.
 
-Transcription is pluggable — ships with
-[faster-whisper](https://github.com/SYSTRAN/faster-whisper) by default, but you
-can swap in any model or tool that reads audio and prints text.
+- **Pluggable backends** — swap transcription models without changing anything else
+- **Works everywhere** — GNOME, Sway, Hyprland, i3, X11
+- **~100 lines of bash** — easy to read, easy to hack on
+
+Ships with [faster-whisper](https://github.com/SYSTRAN/faster-whisper) by
+default, plus optional [Parakeet](https://huggingface.co/nvidia/parakeet-ctc-1.1b)
+and [Moonshine](https://huggingface.co/UsefulSensors/moonshine-base) backends.
+Or bring your own — anything that reads a WAV and prints text works.
 
 > **Note:** This project is in early development — expect rough edges. If you
 > run into issues, please [open a bug](https://github.com/csheaff/talktype/issues).
 
 ## Requirements
 
 - Linux (Wayland or X11)
-- PipeWire (default on most modern distros)
+- Audio recorder: [ffmpeg](https://ffmpeg.org/) (preferred) or PipeWire (`pw-record`)
 - [ydotool](https://github.com/ReimuNotMoe/ydotool) for typing text
   (user must be in the `input` group — see Install)
-- [socat](https://linux.die.net/man/1/socat) (only needed for server mode)
+- [socat](https://linux.die.net/man/1/socat) (for server-backed transcription)
 
 For the default backend (faster-whisper):
 - NVIDIA GPU with CUDA (or use CPU mode — see Whisper backend options)
@@ -53,6 +58,22 @@ Then **reboot** for the group change to take effect.
 make model
 ```
 
+## Configuration
+
+talktype reads `~/.config/talktype/config` on startup (follows `$XDG_CONFIG_HOME`).
+This works everywhere — GNOME shortcuts, terminals, Sway, cron — no need to set
+environment variables in each context.
+
+```bash
+mkdir -p ~/.config/talktype
+cat > ~/.config/talktype/config << 'EOF'
+TALKTYPE_CMD="/path/to/talktype/transcribe-server transcribe"
+EOF
+```
+
+Any `TALKTYPE_*` variable can go in this file. It is sourced at startup, so a plain
+assignment here overrides the environment; use `VAR="${VAR:-value}"` to let the environment win.
+
 ## Setup
 
 Bind `talktype` to a keyboard shortcut:
@@ -75,21 +96,19 @@ bindsym $mod+d exec talktype
 
 ## Backends
 
-Three backends are included. Each has a one-shot script (loads model per
-invocation) and a server mode (loads model once, keeps it in memory).
+Three backends are included. Server backends auto-start on first use — the
+model loads once and stays in memory for fast subsequent transcriptions.
 
 ### Whisper (default)
 
-The default backend uses [faster-whisper](https://github.com/SYSTRAN/faster-whisper).
-Best with a GPU.
+[faster-whisper](https://github.com/SYSTRAN/faster-whisper). Best with a GPU.
+Works out of the box after `make install` with no config needed.
 
-```bash
-# One-shot (default, no extra setup needed)
-talktype
+For faster repeated use, switch to server mode in your config:
 
-# Server mode (faster — model stays in memory)
-./transcribe-server start
-export TALKTYPE_CMD="/path/to/talktype/transcribe-server transcribe"
+```bash
+# ~/.config/talktype/config
+TALKTYPE_CMD="/path/to/talktype/transcribe-server transcribe"
 ```
 
 | Variable | Default | Description |
@@ -99,17 +118,19 @@ export TALKTYPE_CMD="/path/to/talktype/transcribe-server transcribe"
 | `WHISPER_DEVICE` | `cuda` | `cuda` or `cpu` |
 | `WHISPER_COMPUTE` | `float16` | `float16` (GPU), `int8` or `float32` (CPU) |
 
-### Parakeet (GPU, best accuracy)
+### Parakeet (GPU, best word accuracy)
 
 [NVIDIA Parakeet CTC 1.1B](https://huggingface.co/nvidia/parakeet-ctc-1.1b)
-via HuggingFace Transformers. 1.1B params, excellent accuracy.
+via HuggingFace Transformers. 1.1B params, excellent word accuracy.
+Note: CTC model — outputs lowercase text without punctuation.
 
 ```bash
 make parakeet
+```
 
-# Server mode (recommended — 4.2GB model)
-./backends/parakeet-server start
-export TALKTYPE_CMD="/path/to/talktype/backends/parakeet-server transcribe"
+```bash
+# ~/.config/talktype/config
+TALKTYPE_CMD="/path/to/talktype/backends/parakeet-server transcribe"
 ```
 
 ### Moonshine (CPU, lightweight)
@@ -119,25 +140,34 @@ Sensors. 61.5M params, purpose-built for CPU/edge inference.
 
 ```bash
 make moonshine
+```
 
-# One-shot (fine for this small model)
-export TALKTYPE_CMD="/path/to/talktype/backends/moonshine"
-
-# Or server mode
-./backends/moonshine-server start
-export TALKTYPE_CMD="/path/to/talktype/backends/moonshine-server transcribe"
+```bash
+# ~/.config/talktype/config
+TALKTYPE_CMD="/path/to/talktype/backends/moonshine-server transcribe"
 ```
 
 Set `MOONSHINE_MODEL=UsefulSensors/moonshine-tiny` for an even smaller 27M
 param model.
 
+### Manual server management
+
+The server starts automatically on first transcription. You can also manage
+it directly:
+
+```bash
+./backends/parakeet-server start   # start manually
+./backends/parakeet-server stop    # stop the server
+```
+
 ### Custom backends
 
 Set `TALKTYPE_CMD` to any command that takes a WAV file path as its last
 argument and prints text to stdout:
 
 ```bash
-export TALKTYPE_CMD="/path/to/my-transcriber"
+# ~/.config/talktype/config
+TALKTYPE_CMD="/path/to/my-transcriber"
 ```
 
 Your command will be called as: `$TALKTYPE_CMD /path/to/recording.wav`
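
As a sketch of the calling convention above, a custom backend can be as small as the script below. The names `my_transcriber` and `recognize` are hypothetical, and `recognize` is only a stub standing in for a real speech-to-text command. The only parts taken from the diff are that `talktype` passes the WAV path as the last argument and captures stdout.

```bash
#!/usr/bin/env bash
# Hypothetical custom backend: my-transcriber [options...] /path/to/recording.wav

recognize() {
    # Placeholder for a real model call; it only reports the file size.
    local wav="$1"
    printf 'placeholder transcript (%s bytes of audio)\n' "$(wc -c < "$wav" | tr -d ' ')"
}

my_transcriber() {
    if [ "$#" -eq 0 ]; then
        echo "usage: my-transcriber [options...] FILE.wav" >&2
        return 2
    fi
    # talktype appends the WAV path, so it is always the last argument.
    local wav="${!#}"
    if [ ! -r "$wav" ]; then
        echo "cannot read audio file: $wav" >&2
        return 1
    fi
    # Diagnostics go to stderr; only the transcript should reach stdout,
    # because talktype captures stdout and types it.
    recognize "$wav"
}

# Demo: run the backend on a throwaway 4-byte file.
demo_wav=$(mktemp)
printf 'abcd' > "$demo_wav"
my_transcriber "$demo_wav"    # prints: placeholder transcript (4 bytes of audio)
rm -f "$demo_wav"
```

In a real script the demo lines would be replaced by `my_transcriber "$@"`, and `TALKTYPE_CMD` in the config would point at the script's path.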
diff --git a/backends/moonshine-server b/backends/moonshine-server
@@ -19,6 +19,11 @@ case "${1:-}" in
             echo "Already running (PID $(cat "$PIDFILE"))"
             exit 0
         fi
+        if [ ! -x "$VENV/bin/python3" ]; then
+            echo "Moonshine backend not installed. Run: make moonshine" >&2
+            exit 1
+        fi
+        rm -f "$PIDFILE" "$SOCK"
         echo "Starting moonshine server (loading $MODEL)..."
         "$VENV/bin/python3" "$SCRIPT_DIR/moonshine-daemon.py" "$SOCK" "$MODEL" &
         PID=$!
@@ -46,8 +51,7 @@ case "${1:-}" in
         ;;
     transcribe)
         if [ ! -S "$SOCK" ]; then
-            echo "Moonshine server not running. Start it with: backends/moonshine-server start" >&2
-            exit 1
+            "$0" start >&2 || exit 1
         fi
         echo "$2" | socat - UNIX-CONNECT:"$SOCK"
         ;;
diff --git a/backends/parakeet-daemon.py b/backends/parakeet-daemon.py
@@ -3,6 +3,7 @@
 import sys
 import socket
 import signal
+import torch
 import soundfile as sf
 from transformers import AutoProcessor, AutoModelForCTC
 
@@ -17,11 +18,13 @@
 
 def transcribe(audio_path):
     audio, sr = sf.read(audio_path)
-    inputs = processor(audio, sampling_rate=sr)
-    inputs.to(model.device, dtype=model.dtype)
-    predicted_ids = model.generate(**inputs)
-    texts = processor.batch_decode(predicted_ids, skip_special_tokens=True)
-    return texts[0].strip() if texts else ""
+    inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
+    inputs = inputs.to(model.device, dtype=model.dtype)
+    with torch.no_grad():
+        logits = model(**inputs).logits
+    predicted_ids = torch.argmax(logits, dim=-1)
+    text = processor.batch_decode(predicted_ids, skip_special_tokens=True)
+    return text[0].strip() if text else ""
 
 
 def cleanup(*_):
diff --git a/backends/parakeet-server b/backends/parakeet-server
@@ -18,12 +18,16 @@ case "${1:-}" in
             echo "Already running (PID $(cat "$PIDFILE"))"
             exit 0
         fi
+        if [ ! -x "$VENV/bin/python3" ]; then
+            echo "Parakeet backend not installed. Run: make parakeet" >&2
+            exit 1
+        fi
+        rm -f "$PIDFILE" "$SOCK"
         echo "Starting parakeet server (loading model)..."
         "$VENV/bin/python3" "$SCRIPT_DIR/parakeet-daemon.py" "$SOCK" &
         PID=$!
         disown "$PID"
         echo "$PID" > "$PIDFILE"
-        # Wait for socket to appear
         for i in $(seq 1 60); do
             [ -S "$SOCK" ] && break
             sleep 1
@@ -45,10 +49,8 @@ case "${1:-}" in
         fi
         ;;
     transcribe)
-        # Called by talktype — sends audio path to the server, prints result
         if [ ! -S "$SOCK" ]; then
-            echo "Parakeet server not running. Start it with: backends/parakeet-server start" >&2
-            exit 1
+            "$0" start >&2 || exit 1
         fi
         echo "$2" | socat - UNIX-CONNECT:"$SOCK"
         ;;
diff --git a/talktype b/talktype
@@ -13,9 +13,15 @@
 #
 set -euo pipefail
 
+# ── Load user config (works from GNOME shortcuts, cron, etc.) ──
+TALKTYPE_CONFIG="${TALKTYPE_CONFIG:-${XDG_CONFIG_HOME:-$HOME/.config}/talktype/config}"
+# shellcheck disable=SC1090
+[ -f "$TALKTYPE_CONFIG" ] && source "$TALKTYPE_CONFIG"
+
 TALKTYPE_DIR="${TALKTYPE_DIR:-${XDG_RUNTIME_DIR:-/tmp}/talktype}"
 PIDFILE="$TALKTYPE_DIR/rec.pid"
 AUDIOFILE="$TALKTYPE_DIR/rec.wav"
+NOTIFYFILE="$TALKTYPE_DIR/notify.id"
 
 mkdir -p "$TALKTYPE_DIR"
 
@@ -35,16 +41,33 @@ if [ -z "${TALKTYPE_CMD:-}" ]; then
     TALKTYPE_CMD="$VENV_DIR/bin/python3 $SCRIPT_DIR/transcribe $WHISPER_MODEL $WHISPER_LANG $WHISPER_DEVICE $WHISPER_COMPUTE"
 fi
 
+# ── Notification helper ──
+notify() {
+    local icon="$1" msg="$2"
+    local -a args=(-a TalkType -u critical -i "$icon" -p "TalkType" "$msg")
+    if [ -f "$NOTIFYFILE" ]; then
+        args+=(-r "$(cat "$NOTIFYFILE")")
+    fi
+    notify-send "${args[@]}" 2>/dev/null | head -1 > "$NOTIFYFILE" || true
+}
+
+notify_close() {
+    if [ -f "$NOTIFYFILE" ]; then
+        notify-send -a TalkType -r "$(cat "$NOTIFYFILE")" -e "TalkType" "" 2>/dev/null || true
+        rm -f "$NOTIFYFILE"
+    fi
+}
+
 # ── Check core dependencies ──
 check_deps() {
     local missing=()
     command -v ydotool    &>/dev/null || missing+=(ydotool)
-    command -v pw-record  &>/dev/null || missing+=(pipewire)
+    command -v ffmpeg &>/dev/null || command -v pw-record &>/dev/null || missing+=("ffmpeg or pipewire")
     command -v notify-send &>/dev/null || missing+=(libnotify-bin)
 
     if [ ${#missing[@]} -gt 0 ]; then
         echo "Missing: ${missing[*]}" >&2
-        notify-send -h string:x-canonical-private-synchronous:talktype -t 3000 -i dialog-error "TalkType" "Missing: ${missing[*]}" 2>/dev/null || true
+        notify-send -t 3000 -i dialog-error "TalkType" "Missing: ${missing[*]}" 2>/dev/null || true
         exit 1
     fi
 }
@@ -55,21 +78,26 @@ check_deps
 if [ -f "$PIDFILE" ]; then
     PID=$(cat "$PIDFILE")
     kill "$PID" 2>/dev/null || true
-    wait "$PID" 2>/dev/null || true
+    # Wait for recorder to finalize the file (not a child, so wait(1) won't work)
+    while kill -0 "$PID" 2>/dev/null; do sleep 0.05; done
     rm -f "$PIDFILE"
 
+    notify process-working "Transcribing..."
+
     # Run the transcription command with the audio file as last arg
     TEXT=$($TALKTYPE_CMD "$AUDIOFILE")
 
     rm -f "$AUDIOFILE"
 
     if [ -z "$TEXT" ]; then
-        notify-send -h string:x-canonical-private-synchronous:talktype -t 1500 -i dialog-warning "TalkType" "No speech detected" 2>/dev/null || true
+        notify dialog-warning "No speech detected"
         exit 0
     fi
 
-    # Type text at cursor via ydotool (works on any Wayland compositor)
-    ydotool type -- "$TEXT"
+    notify_close
+
+    # Type text at cursor via ydotool
+    ydotool type --key-delay 50 -- "$TEXT"
 
 # ── Otherwise → start recording ──
 else
@@ -84,5 +112,5 @@ else
     PID=$!
     disown "$PID"
     echo "$PID" > "$PIDFILE"
-    notify-send -h string:x-canonical-private-synchronous:talktype -t 1500 -i audio-input-microphone "TalkType" "Listening..." 2>/dev/null || true
+    notify audio-input-microphone "Listening..."
 fi
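
The polling loop in the hunk above has no upper bound, so a recorder that ignores the signal would leave the hotkey invocation waiting indefinitely. One possible variation, which is not part of this commit, adds a timeout. The helper name `wait_for_exit` and its 5-second default are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# wait_for_exit PID [TIMEOUT_SECONDS]
# Hypothetical helper: polls with `kill -0` until PID is gone, like the loop
# in talktype, but gives up once the timeout (default 5 s) has elapsed.
# Returns 0 if the process exited, 1 if it was still alive at the deadline.
wait_for_exit() {
    local pid="$1" timeout="${2:-5}"
    local deadline=$((SECONDS + timeout))
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$SECONDS" -ge "$deadline" ]; then
            return 1
        fi
        sleep 0.05
    done
    return 0
}

# Demo: a background sleep stands in for the recorder process.
sleep 30 &
rec_pid=$!
disown "$rec_pid"
kill "$rec_pid" 2>/dev/null || true
if wait_for_exit "$rec_pid" 2; then
    echo "recorder exited"
else
    echo "recorder still running after timeout" >&2
fi
```

A caller could fall back to `kill -9` or show an error notification when the helper returns 1; which of those fits talktype is left open here.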
diff --git a/test/server.bats b/test/server.bats
@@ -73,12 +73,12 @@ start_mock_daemon() {
 
 # ── Server wrapper logic ──
 
-@test "transcribe fails with helpful message when server not running" {
-    # Test each server script's transcribe command without a running server
+@test "transcribe auto-start fails gracefully when backend not installed" {
+    # With no venv installed, transcribe should attempt auto-start and fail
     for server in transcribe-server backends/parakeet-server backends/moonshine-server; do
         run "$REPO_DIR/$server" transcribe /tmp/test.wav
         [ "$status" -eq 1 ]
-        [[ "$output" == *"not running"* ]]
+        [[ "$output" == *"not installed"* ]]
     done
 }
 
diff --git a/test/talktype.bats b/test/talktype.bats
@@ -5,6 +5,7 @@
 # with simple mocks so we can test the control flow in isolation.
 
 setup() {
+    export TALKTYPE_CONFIG="/dev/null"
     export TALKTYPE_DIR="$BATS_TEST_TMPDIR/talktype"
     export TALKTYPE_CMD="$BATS_TEST_DIRNAME/mock-transcribe"
 
diff --git a/transcribe-server b/transcribe-server
@@ -22,6 +22,11 @@ case "${1:-}" in
             echo "Already running (PID $(cat "$PIDFILE"))"
             exit 0
         fi
+        if [ ! -x "$VENV/bin/python3" ]; then
+            echo "Whisper backend not installed. Run: make install" >&2
+            exit 1
+        fi
+        rm -f "$PIDFILE" "$SOCK"
         echo "Starting whisper server (loading $WHISPER_MODEL model)..."
         "$VENV/bin/python3" "$SCRIPT_DIR/whisper-daemon.py" "$SOCK" "$WHISPER_MODEL" "$WHISPER_LANG" "$WHISPER_DEVICE" "$WHISPER_COMPUTE" &
         PID=$!
@@ -49,8 +54,7 @@ case "${1:-}" in
         ;;
     transcribe)
         if [ ! -S "$SOCK" ]; then
-            echo "Whisper server not running. Start it with: transcribe-server start" >&2
-            exit 1
+            "$0" start >&2 || exit 1
         fi
         echo "$2" | socat - UNIX-CONNECT:"$SOCK"
         ;;
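
A note on precedence: the `talktype` hunk above loads the config with `source`, so ordinary shell assignment rules apply. The sketch below uses a hypothetical variable `DEMO_CMD` and a temporary file (neither is part of talktype) to show how a plain assignment and a `${VAR:-default}` assignment behave when the variable is already set in the environment.

```bash
#!/usr/bin/env bash
# DEMO_CMD and the temp file are illustrative stand-ins, not talktype names.
cfg=$(mktemp)

# Plain assignment: the sourced value replaces the environment value.
echo 'DEMO_CMD="from-config"' > "$cfg"
plain=$(DEMO_CMD="from-env" bash -c 'source "$1"; echo "$DEMO_CMD"' _ "$cfg")

# Default-style assignment: the environment value is kept when it is set.
echo 'DEMO_CMD="${DEMO_CMD:-from-config}"' > "$cfg"
guarded=$(DEMO_CMD="from-env" bash -c 'source "$1"; echo "$DEMO_CMD"' _ "$cfg")

rm -f "$cfg"
echo "plain=$plain guarded=$guarded"
# prints: plain=from-config guarded=from-env
```

Under this reading, a config line such as `TALKTYPE_CMD="..."` wins over an exported `TALKTYPE_CMD`, while the guarded form lets a one-off environment override through.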