From 12fa085d0cd853ff143c6f319724d56cc4763711 Mon Sep 17 00:00:00 2001
From: kimjin8 <26262616+kimjin8@users.noreply.github.com>
Date: Wed, 20 May 2026 18:53:39 +0000
Subject: [PATCH 1/2] feat: add comprehensive guide for AI transcription tool with Sapat and Daytona

Closes daytonaio/content#13

- Full step-by-step guide for building a multi-provider AI transcription tool using Sapat (OpenAI Whisper, Groq, Azure OpenAI) in a Daytona workspace
- Covers devcontainer setup, API configuration, CLI usage, provider comparison
- Includes section on extending Sapat with new providers (AssemblyAI example)
- Troubleshooting section for common issues
- ~2,500 words, Daytona-focused throughout
---
 ...anscription_tool_with_sapat_and_daytona.md | 439 ++++++++++++++++++
 1 file changed, 439 insertions(+)
 create mode 100644 guides/20260520_guide_ai_transcription_tool_with_sapat_and_daytona.md

diff --git a/guides/20260520_guide_ai_transcription_tool_with_sapat_and_daytona.md b/guides/20260520_guide_ai_transcription_tool_with_sapat_and_daytona.md
new file mode 100644
index 00000000..4ba723fe
--- /dev/null
+++ b/guides/20260520_guide_ai_transcription_tool_with_sapat_and_daytona.md
+---
+title: "Building an AI-Powered Video Transcription Tool with Sapat and Daytona"
+description: "A comprehensive guide to building and running a multi-provider AI transcription tool using Sapat (OpenAI Whisper, Groq, and Azure OpenAI) inside a reproducible Daytona workspace."
+date: 2026-05-20
+author: "Kim Jin"
+tags: ["ai", "transcription", "whisper", "openai", "groq", "azure", "daytona", "python", "devcontainer"]
+---
+
+# Building an AI-Powered Video Transcription Tool with Sapat and Daytona
+
+Transcribing audio and video content manually is time-consuming and error-prone. Modern AI-powered speech recognition APIs, including OpenAI Whisper, Groq's ultra-fast inference, and Azure OpenAI, have made automated transcription fast, accurate, and affordable.
+But setting up a consistent, reproducible development environment for these tools can still be a challenge.
+
+This guide walks you through building and running [Sapat](https://github.com/nibzard/sapat), a Python-based multi-provider video transcription tool, inside a [Daytona](https://www.daytona.io/) workspace. By the end, you will have a fully configured, portable development environment that can transcribe video files using any of three major AI providers with a single command.
+
+## TL;DR
+
+- What Sapat is and how its multi-provider architecture works
+- Setting up a Daytona workspace with a pre-configured devcontainer
+- Configuring API credentials for OpenAI, Groq, and Azure OpenAI
+- Transcribing video files and entire directories with Sapat
+- Comparing the three providers: speed, cost, and accuracy
+- Extending Sapat with additional transcription providers
+
+## Prerequisites
+
+To follow this guide, you will need:
+
+- A basic understanding of [Python](../definitions/20240820_defintion_python.md) and command-line tools
+- A [Daytona](https://www.daytona.io/docs/installation/installation/) installation (latest version)
+- [Docker](https://www.docker.com/) installed and running
+- An API key for at least one of the following services:
+  - [OpenAI](https://platform.openai.com/) (for Whisper API)
+  - [Groq Cloud](https://console.groq.com/) (for Whisper Large v3 Turbo; free tier available)
+  - [Azure OpenAI](https://azure.microsoft.com/en-us/products/ai-services/openai-service) (for Azure-hosted Whisper)
+- A video file to transcribe (`.mp4`, `.mov`, `.avi`, or similar)
+
+> **Note:** Groq offers a generous free tier with fast inference, making it the easiest provider to get started with at no cost.
+
+## Understanding Sapat's Architecture
+
+Before diving into setup, it helps to understand how Sapat is structured.
+The tool is built around a clean provider abstraction: a `TranscriptionBase` class defines the interface, and each provider (OpenAI, Groq, Azure) implements it independently.
+
+```
+src/sapat/
+├── script.py            # CLI entry point (Click-based)
+└── transcription/
+    ├── base.py          # Abstract base class
+    ├── openai.py        # OpenAI Whisper implementation
+    ├── groq.py          # Groq Cloud implementation
+    └── azure.py         # Azure OpenAI implementation
+```
+
+The `script.py` CLI accepts a video file or directory, converts it to MP3 using `ffmpeg`, sends it to the selected provider's API, and saves the transcription as a `.txt` file alongside the source video. Temporary MP3 files are cleaned up automatically.
+
+This architecture makes it straightforward to add new providers, which we will cover at the end of this guide.
+
+## Setting Up the Project Template
+
+### Step 1: Clone the Sapat Repository
+
+Start by cloning the Sapat repository to your local machine:
+
+```bash
+git clone https://github.com/nibzard/sapat.git
+cd sapat
+```
+
+### Step 2: Review the Existing devcontainer Configuration
+
+Sapat already includes a `.devcontainer/devcontainer.json` file.
+Open it to see what it configures:
+
+```json
+{
+  "name": "Video Transcription Tool",
+  "image": "mcr.microsoft.com/devcontainers/python:3.12",
+  "customizations": {
+    "vscode": {
+      "settings": {
+        "python.defaultInterpreterPath": "/usr/local/bin/python",
+        "python.linting.enabled": true,
+        "python.linting.pylintEnabled": true
+      },
+      "extensions": [
+        "ms-python.python",
+        "ms-python.vscode-pylance",
+        "njpwerner.autodocstring"
+      ]
+    }
+  },
+  "postCreateCommand": {
+    "ffmpeg": "sudo apt install ffmpeg -y",
+    "requirements": "pip install -r requirements.txt"
+  }
+}
+```
+
+This configuration:
+- Uses a Python 3.12 devcontainer image
+- Installs `ffmpeg` automatically (required for video-to-audio conversion)
+- Installs all Python dependencies from `requirements.txt`
+- Configures VS Code with Python linting and autocomplete
+
+### Step 3: Configure Your API Credentials
+
+Create a `.env` file in the project root by copying the provided example:
+
+```bash
+cp .env.example .env
+```
+
+Now edit `.env` and fill in your credentials.
+You only need to configure the provider(s) you plan to use:
+
+```bash
+# ── OpenAI ──────────────────────────────────────────────────────────────────
+OPENAI_API_KEY=sk-your-openai-api-key-here
+OPENAI_MODEL=whisper-1
+OPENAI_API_ENDPOINT=https://api.openai.com/v1/audio/transcriptions
+OPENAI_MODEL_NAME_CHAT=gpt-4o
+
+# ── Groq Cloud ───────────────────────────────────────────────────────────────
+GROQCLOUD_API_KEY=gsk_your-groq-api-key-here
+GROQCLOUD_MODEL=whisper-large-v3-turbo
+GROQCLOUD_API_ENDPOINT=https://api.groq.com/openai/v1/audio/transcriptions
+GROQCLOUD_MODEL_NAME_CHAT=llama3-8b-8192
+
+# ── Azure OpenAI ─────────────────────────────────────────────────────────────
+AZURE_OPENAI_API_KEY=your-azure-api-key-here
+AZURE_OPENAI_ENDPOINT=https://YOUR-DEPLOYMENT.openai.azure.com
+AZURE_OPENAI_DEPLOYMENT_NAME_WHISPER=whisper
+AZURE_OPENAI_API_VERSION_WHISPER=2024-06-01
+AZURE_OPENAI_DEPLOYMENT_NAME_CHAT=gpt-4o
+AZURE_OPENAI_API_VERSION_CHAT=2023-03-15-preview
+```
+
+> **Important:** Never commit your `.env` file to version control. The `.gitignore` in the Sapat repository already excludes it, but double-check before pushing any changes.
+
+## Creating a Daytona Workspace
+
+With the project template ready, you can now open it as a Daytona workspace. Daytona will automatically detect the `.devcontainer/devcontainer.json` configuration and provision a fully isolated, reproducible environment.
+
+### Step 4: Create the Workspace
+
+Run the following command to create a new Daytona workspace from the Sapat repository:
+
+```bash
+daytona create https://github.com/nibzard/sapat
+```
+
+Daytona will:
+1. Clone the repository into a new workspace
+2. Build the devcontainer image (Python 3.12 + ffmpeg + dependencies)
+3. Open the workspace in your configured IDE (VS Code by default)
+
+> **Tip:** If you have already cloned the repository locally, you can also create a workspace from a local path: `daytona create /path/to/sapat`
+
+### Step 5: Verify the Environment
+
+Once the workspace is open, verify that all dependencies are installed correctly by opening the integrated terminal and running:
+
+```bash
+# Verify ffmpeg is available
+ffmpeg -version | head -1
+
+# Verify Python packages
+pip show openai groq python-dotenv click
+```
+
+You should see version information for both `ffmpeg` and the Python packages. If anything is missing, the `postCreateCommand` in `devcontainer.json` can be re-run manually:
+
+```bash
+sudo apt install ffmpeg -y && pip install -r requirements.txt
+```
+
+Also confirm that `.env` exists in the workspace. The file is not committed, so a workspace created from the GitHub URL will not contain the `.env` you created in Step 3; create it again inside the workspace if it is missing.
+
+### Step 6: Install Sapat as a Package
+
+Install Sapat in development mode so the `sapat` CLI command is available:
+
+```bash
+pip install -e .
+```
+
+Verify the installation:
+
+```bash
+sapat --help
+```
+
+You should see output similar to:
+
+```
+Usage: sapat [OPTIONS] INPUT_PATH
+
+  Transcribe video files using AI transcription services.
+
+Options:
+  --provider [openai|groq|azure]  Transcription provider to use.
+  --temperature FLOAT             Temperature for transcription.
+  --response-format TEXT          Response format (json or text).
+  --help                          Show this message and exit.
+```
+
+## Transcribing Your First Video
+
+### Step 7: Transcribe a Single File
+
+Place a video file in your workspace (e.g., `demo.mp4`) and run:
+
+```bash
+# Using Groq (fastest, free tier available)
+sapat demo.mp4 --provider groq
+
+# Using OpenAI Whisper
+sapat demo.mp4 --provider openai
+
+# Using Azure OpenAI
+sapat demo.mp4 --provider azure
+```
+
+Sapat will:
+1. Convert `demo.mp4` to a temporary `demo.mp3` using `ffmpeg`
+2. Send the audio to the selected provider's API
+3. Save the transcription as `demo.txt` in the same directory
+4. Delete the temporary `demo.mp3`
+
+### Step 8: Transcribe an Entire Directory
+
+If you have a folder of video files to process in bulk, pass the directory path instead:
+
+```bash
+sapat ./videos/ --provider groq
+```
+
+Sapat will process every video file in the directory sequentially, creating a corresponding `.txt` transcription file for each one.
+
+### Step 9: Customize Temperature and Response Format
+
+The `--temperature` and `--response-format` options allow fine-grained control over the transcription output:
+
+```bash
+# Get verbose JSON output with timestamps
+sapat demo.mp4 --provider openai --response-format verbose_json
+
+# Lower temperature for more deterministic output
+sapat demo.mp4 --provider groq --temperature 0.0
+```
+
+The `verbose_json` format includes segment-level timestamps by default (OpenAI's API returns word-level timestamps only when `timestamp_granularities` is also requested), which is useful for generating subtitles or syncing transcriptions with video playback.
+
+## Comparing the Three Providers
+
+Each provider has different strengths depending on your use case:
+
+| Feature | OpenAI Whisper | Groq Cloud | Azure OpenAI |
+|---|---|---|---|
+| **Model** | `whisper-1` | `whisper-large-v3-turbo` | `whisper` (Azure-hosted) |
+| **Speed** | Moderate | Very fast (LPU inference) | Moderate |
+| **Free Tier** | No | Yes (generous limits) | No |
+| **Cost** | $0.006/min | Free tier + paid | Pay-as-you-go |
+| **Max File Size** | 25 MB | 25 MB | 25 MB |
+| **Best For** | General use | Development, prototyping | Enterprise, compliance |
+| **Verbose JSON** | Yes | Yes | Yes |
+
+> **Recommendation for development:** Use Groq during development and testing, since it is the fastest of the three and has a free tier. Switch to OpenAI or Azure for production workloads that require enterprise SLAs or compliance guarantees.
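
The figures in the comparison table can be turned into a quick pre-flight check before a batch run. The following is a minimal sketch, not part of Sapat: the helper names (`estimate_openai_cost`, `fits_upload_limit`, `suggest_provider`) are hypothetical, the price and size limit are copied from the table above and may change, and 25 MB is assumed to mean 25 × 1024 × 1024 bytes.

```python
# Hypothetical helpers built on the comparison table above; not part of Sapat.
# The constants are a snapshot of the table and should be re-checked against
# each provider's current pricing and limits.
OPENAI_PRICE_PER_MINUTE_USD = 0.006
MAX_UPLOAD_MB = 25  # assumed to mean 25 * 1024 * 1024 bytes


def estimate_openai_cost(duration_seconds: float) -> float:
    """Estimate the OpenAI Whisper cost in USD for one recording."""
    minutes = duration_seconds / 60
    return round(minutes * OPENAI_PRICE_PER_MINUTE_USD, 4)


def fits_upload_limit(size_bytes: int, limit_mb: int = MAX_UPLOAD_MB) -> bool:
    """Return True if a file of this size is within the upload limit."""
    return size_bytes <= limit_mb * 1024 * 1024


def suggest_provider(stage: str) -> str:
    """Map a project stage to the provider listed under 'Best For'."""
    recommendations = {
        "development": "groq",
        "general": "openai",
        "enterprise": "azure",
    }
    return recommendations[stage]


if __name__ == "__main__":
    # A 10-minute recording whose extracted MP3 is about 9 MB
    print(estimate_openai_cost(600))
    print(fits_upload_limit(9 * 1024 * 1024))
    print(suggest_provider("development"))
```

The size check applies to the audio that is uploaded, so it is the extracted MP3, not the source video, that needs to stay under the limit.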
+
+## Understanding the Code: How Sapat Works Internally
+
+To extend Sapat with new providers, it helps to understand the base class:
+
+```python
+# src/sapat/transcription/base.py (simplified)
+from abc import ABC, abstractmethod
+
+class TranscriptionBase(ABC):
+    max_file_size_mb: int = 25
+
+    def _validate_audio_file(self, audio_file: str):
+        """Checks file exists and is under the size limit."""
+        ...
+
+    @abstractmethod
+    def transcribe_audio(self, audio_file: str, **kwargs) -> dict | str:
+        """Transcribes an audio file and returns the result."""
+        ...
+```
+
+Each provider subclass implements `transcribe_audio` by calling its respective API. The Groq implementation, for example, uses the official `groq` Python SDK:
+
+```python
+# src/sapat/transcription/groq.py (simplified)
+from groq import Groq
+
+from .base import TranscriptionBase
+
+class GroqCloudTranscription(TranscriptionBase):
+    # api_key, model, temperature, and response_format are set in __init__ (omitted here)
+    def transcribe_audio(self, audio_file: str, **kwargs):
+        self._validate_audio_file(audio_file)
+        client = Groq(api_key=self.api_key)
+        with open(audio_file, "rb") as f:
+            response = client.audio.transcriptions.create(
+                file=f,
+                model=self.model,
+                temperature=self.temperature,
+                response_format=self.response_format,
+            )
+        return response
+```
+
+## Extending Sapat: Adding a New Provider
+
+The bounty for this issue specifically asks contributors to add support for additional APIs.
+Here is how to add a new provider, using [AssemblyAI](https://www.assemblyai.com/) as an example:
+
+### Step 10: Create the Provider Class
+
+Create a new file `src/sapat/transcription/assemblyai.py`:
+
+```python
+import os
+import time
+
+import requests
+from dotenv import load_dotenv
+
+from .base import TranscriptionBase
+
+load_dotenv(".env")
+
+class AssemblyAITranscription(TranscriptionBase):
+    """AssemblyAI implementation for transcription."""
+
+    def __init__(self, temperature: float = 0.0, response_format: str = "json"):
+        self.api_key = os.getenv('ASSEMBLYAI_API_KEY')
+        self.endpoint = "https://api.assemblyai.com/v2"
+        # Stored for parity with the other providers; not sent to AssemblyAI below
+        self.temperature = temperature
+        self.response_format = response_format
+        self.max_file_size_mb = 100  # AssemblyAI supports larger files
+
+    def transcribe_audio(self, audio_file: str, **kwargs) -> dict:
+        self._validate_audio_file(audio_file)
+        headers = {"authorization": self.api_key}
+
+        # Step 1: Upload the audio file
+        with open(audio_file, "rb") as f:
+            upload_response = requests.post(
+                f"{self.endpoint}/upload",
+                headers=headers,
+                data=f,
+            )
+        upload_response.raise_for_status()  # Fail early on a bad key or rejected upload
+        upload_url = upload_response.json()["upload_url"]
+
+        # Step 2: Request transcription
+        transcript_response = requests.post(
+            f"{self.endpoint}/transcript",
+            headers=headers,
+            json={"audio_url": upload_url},
+        )
+        transcript_response.raise_for_status()
+        transcript_id = transcript_response.json()["id"]
+
+        # Step 3: Poll for completion
+        while True:
+            result = requests.get(
+                f"{self.endpoint}/transcript/{transcript_id}",
+                headers=headers,
+            ).json()
+            if result["status"] == "completed":
+                return result
+            elif result["status"] == "error":
+                raise RuntimeError(f"AssemblyAI transcription failed: {result['error']}")
+            time.sleep(2)
+```
+
+### Step 11: Register the Provider in the CLI
+
+Update `src/sapat/script.py` to include the new provider in the `--provider` option:
+
+```python
+# Add to the provider choices
+@click.option(
+    "--provider",
+    type=click.Choice(["openai", "groq", "azure", "assemblyai"]),
+    default="openai",
+)
+```
+
+And add the import and instantiation logic:
+
+```python
+elif provider == "assemblyai":
+    from sapat.transcription.assemblyai import AssemblyAITranscription
+    transcriber = AssemblyAITranscription(temperature=temperature, response_format=response_format)
+```
+
+### Step 12: Add the Environment Variable
+
+Add the new API key to `.env.example`:
+
+```bash
+# AssemblyAI
+ASSEMBLYAI_API_KEY=your-assemblyai-api-key-here
+```
+
+The class above calls AssemblyAI's REST API through `requests` rather than the `assemblyai` SDK, so make sure `requests` is listed in `requirements.txt`:
+
+```
+requests
+```
+
+## Troubleshooting Common Issues
+
+**`ffmpeg: command not found`**
+The `postCreateCommand` in `devcontainer.json` installs ffmpeg automatically. If it is missing, run `sudo apt install ffmpeg -y` manually inside the workspace terminal.
+
+**`File size exceeds maximum limit`**
+All three providers have a 25 MB audio limit. For longer videos, split them first using ffmpeg:
+```bash
+ffmpeg -i long_video.mp4 -t 600 -c copy part1.mp4  # First 10 minutes
+```
+
+**`API key not found` errors**
+Ensure your `.env` file is in the project root (the same directory as `requirements.txt`), not inside `src/`. Sapat uses `load_dotenv(".env")`, which looks for the file relative to the current working directory, so run `sapat` from the project root.
+
+**Groq rate limit errors**
+Groq's free tier has rate limits. If you hit them, add a short delay between files when processing directories, or upgrade to a paid Groq plan.
+
+## Conclusion
+
+In this guide, you have:
+
+- Set up a reproducible Daytona workspace for the Sapat transcription tool
+- Configured API credentials for OpenAI Whisper, Groq Cloud, and Azure OpenAI
+- Transcribed single video files and entire directories using the `sapat` CLI
+- Compared the three providers across speed, cost, and features
+- Extended Sapat with a new provider (AssemblyAI) following the existing architecture
+
+Daytona's devcontainer integration makes it easy to share this environment with teammates or reproduce it on any machine, with no manual dependency installation required.
+The combination of Sapat's clean provider abstraction and Daytona's reproducible workspaces creates a solid foundation for building production-grade transcription pipelines.
+
+## References
+
+- [Sapat GitHub Repository](https://github.com/nibzard/sapat)
+- [Daytona Documentation](https://www.daytona.io/docs)
+- [OpenAI Whisper API Reference](https://platform.openai.com/docs/api-reference/audio)
+- [Groq Cloud API Documentation](https://console.groq.com/docs/speech-text)
+- [Azure OpenAI Whisper Documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/whisper-quickstart)
+- [AssemblyAI API Documentation](https://www.assemblyai.com/docs)

From c832e671e9f181e982a1aa4939c53f5f375f4ced Mon Sep 17 00:00:00 2001
From: kimjin8 <26262616+kimjin8@users.noreply.github.com>
Date: Wed, 20 May 2026 19:23:16 +0000
Subject: [PATCH 2/2] feat: add guide for running Omni and Claude Engineers inside Daytona

---
 ...mni_and_claude_engineers_inside_daytona.md | 143 ++++++++++++++++++
 1 file changed, 143 insertions(+)
 create mode 100644 guides/20260520_guide_omni_and_claude_engineers_inside_daytona.md

diff --git a/guides/20260520_guide_omni_and_claude_engineers_inside_daytona.md b/guides/20260520_guide_omni_and_claude_engineers_inside_daytona.md
new file mode 100644
index 00000000..9f65a24b
--- /dev/null
+++ b/guides/20260520_guide_omni_and_claude_engineers_inside_daytona.md
+---
+title: "How to Run Omni and Claude Engineers Inside Daytona"
+description: "A step-by-step guide to running powerful AI coding assistants like Omni Engineer and Claude Engineer inside Daytona workspaces using devcontainers."
+author: "Manus AI"
+date: "2026-05-20"
+tags: ["AI", "DevEnvironment", "Tutorial", "Python"]
+---
+
+# How to Run Omni and Claude Engineers Inside Daytona
+
+AI coding assistants have evolved from simple autocomplete tools into autonomous agents capable of managing files, running commands, and executing complex workflows.
+Two well-known open-source agents of this kind are **Omni Engineer** and **Claude Engineer**, both created by Doriandarko.
+
+However, running autonomous AI agents on your local machine can be risky. They execute code, modify files, and install dependencies. This is where **Daytona** helps. By running these AI engineers inside a Daytona workspace, you get an isolated, preconfigured environment without manual setup.
+
+In this guide, we will show you step-by-step how to run both Omni Engineer and Claude Engineer inside Daytona using `devcontainer.json`.
+
+---
+
+## Why Run AI Engineers in Daytona?
+
+1. **Security & Isolation**: AI agents can make mistakes. Running them in a Daytona workspace means that if an agent accidentally deletes a directory or installs a malicious package, the damage is confined to the workspace rather than your local machine.
+2. **Zero Setup**: With the `devcontainer.json` configuration, Daytona starts from a Python image and installs all required dependencies (`requirements.txt`) before you even open the terminal.
+3. **Consistent Environments**: No more "it works on my machine" issues. The AI gets the exact same environment every time.
+
+---
+
+## Step 1: The Devcontainer Configuration
+
+To make these repositories instantly runnable in Daytona, we need a `devcontainer.json` file. We have contributed these configurations to both the [Omni Engineer](https://github.com/Doriandarko/omni-engineer) and [Claude Engineer](https://github.com/Doriandarko/claude-engineer) repositories.
+
+Here is the configuration used for Omni Engineer:
+
+```json
+{
+  "name": "Omni Engineer",
+  "image": "mcr.microsoft.com/devcontainers/python:1-3.11-bullseye",
+  "features": {
+    "ghcr.io/devcontainers/features/git:1": {}
+  },
+  "customizations": {
+    "vscode": {
+      "settings": {
+        "python.defaultInterpreterPath": "/usr/local/bin/python"
+      },
+      "extensions": [
+        "ms-python.python",
+        "ms-python.vscode-pylance"
+      ]
+    }
+  },
+  "postCreateCommand": "pip install -r requirements.txt",
+  "remoteEnv": {
+    "OPENAI_API_KEY": "${localEnv:OPENAI_API_KEY}",
+    "ANTHROPIC_API_KEY": "${localEnv:ANTHROPIC_API_KEY}",
+    "TAVILY_API_KEY": "${localEnv:TAVILY_API_KEY}"
+  }
+}
+```
+
+### What this does:
+- Uses a stable Python 3.11 image.
+- Automatically installs all dependencies via `postCreateCommand`.
+- Forwards the API keys from your local environment into the Daytona workspace via `remoteEnv`, so they do not need to be written into the repository.
+
+---
+
+## Step 2: Running Omni Engineer in Daytona
+
+Omni Engineer is the newer, more versatile tool that supports OpenAI's o1 models alongside Anthropic's Claude.
+
+### 1. Create the Workspace
+Run the following command in your terminal to create a new Daytona workspace from the repository:
+
+```bash
+daytona create https://github.com/Doriandarko/omni-engineer
+```
+
+### 2. Set Your API Keys
+Omni Engineer requires API keys to function. You can set these in Daytona so they are automatically injected into your workspace:
+
+```bash
+daytona env set OPENAI_API_KEY="your-openai-key"
+daytona env set ANTHROPIC_API_KEY="your-anthropic-key"
+daytona env set TAVILY_API_KEY="your-tavily-key"
+```
+
+### 3. Start the Engineer
+Open the workspace in your preferred IDE (e.g., VS Code) using Daytona:
+
+```bash
+daytona code
+```
+
+Once inside the terminal, simply run:
+
+```bash
+python main.py
+```
+
+You will be greeted by the Omni Engineer console.
+You can now ask it to create a new React application, analyze a Python script, or debug an error, all contained within your Daytona workspace.
+
+---
+
+## Step 3: Running Claude Engineer in Daytona
+
+Claude Engineer is specifically optimized for Anthropic's Claude 3.5 Sonnet and Haiku models, featuring a polished terminal UI and specialized tools.
+
+### 1. Create the Workspace
+```bash
+daytona create https://github.com/Doriandarko/claude-engineer
+```
+
+### 2. Set Your API Keys
+```bash
+daytona env set ANTHROPIC_API_KEY="your-anthropic-key"
+daytona env set TAVILY_API_KEY="your-tavily-key"
+```
+
+### 3. Start the Engineer
+Open the workspace:
+```bash
+daytona code
+```
+
+Run the application:
+```bash
+python main.py
+```
+
+### Example Use Case: Building a Web Scraper
+Once Claude Engineer is running inside Daytona, try giving it this prompt:
+
+> *"Create a Python script that scrapes the front page of Hacker News and saves the top 10 article titles and links to a CSV file. Run the script to test it."*
+
+Because it is running in Daytona, Claude Engineer can create the file, install `requests` and `beautifulsoup4`, execute the script, and save the CSV inside the workspace, without you ever leaving the chat interface.
+
+---
+
+## Conclusion
+
+By combining the autonomous coding capabilities of Omni and Claude Engineers with the isolated, reproducible environments of Daytona, you get a practical workflow for AI-assisted development. You can give the agents more freedom, knowing that their changes are confined to the workspace and that the environment is set up the same way every time.
+
+Try it out today by running `daytona create https://github.com/Doriandarko/omni-engineer`!
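
If either engineer exits at startup with an authentication error, it can help to confirm that the keys from the "Set Your API Keys" steps are actually visible inside the workspace. The following is a minimal sketch under stated assumptions: the script name `check_keys.py` and the `missing_keys` helper are hypothetical, and the variable names are taken from the `daytona env set` commands in this guide rather than from either project's source.

```python
# check_keys.py: hypothetical helper, not shipped with either project.
# Variable names mirror the `daytona env set` commands shown in this guide.
import os

REQUIRED_KEYS = {
    "omni-engineer": ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "TAVILY_API_KEY"],
    "claude-engineer": ["ANTHROPIC_API_KEY", "TAVILY_API_KEY"],
}


def missing_keys(project, env=None):
    """Return the required variables that are unset or empty for a project."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS[project] if not env.get(name)]


if __name__ == "__main__":
    # Report per project without printing any key values
    for project in REQUIRED_KEYS:
        missing = missing_keys(project)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{project}: {status}")
```

Run it with `python check_keys.py` in the workspace terminal before `python main.py`. It only reports which names are missing and never echoes the key values themselves.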