Tool to automatically generate text descriptions (captions) for images using Ollama vision models (LLaVA, Qwen3-VL, Llama Vision). Available as a web application (recommended) or CLI.
Key feature: Fully customizable system prompts to precisely control the output format. Includes built-in presets optimized for Stable Diffusion (with parentheses weight syntax), Z-Image/Midjourney (detailed structured descriptions), Flux, and more.
Perfect for AI image generation training! This tool is designed to help you create caption files for training LoRA (Low-Rank Adaptation) models on image generation AI like Stable Diffusion, Flux, Z-Image, or other diffusion models.
When training a LoRA, each image in your dataset needs an accompanying .txt file with a description. This tool automates that process by:
- Analyzing each image with a vision AI model, guided by your own natural-language instructions
- Generating detailed, consistent descriptions tailored to your target model
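Conceptually, each caption comes from a single chat call to a vision model. Here is a minimal sketch of that idea using the official `ollama` Python client; the prompt text and image path are illustrative placeholders, not the tool's actual internals:

```python
import ollama

# Ask a vision model to caption a single image (illustrative sketch only)
response = ollama.chat(
    model="qwen3-vl:4b",
    messages=[{
        "role": "user",
        "content": "Describe this image as a training caption.",
        "images": ["my_images/photo_001.jpg"],  # hypothetical path
    }],
)

# The tool saves text like this next to the image, e.g. as photo_001.txt
print(response["message"]["content"])
```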
Prerequisites:

- Git - Version control
- uv - Python package manager (handles Python installation automatically)
- Ollama - Must be installed separately and running
Check if Git is already installed:
```bash
git --version
```

If not installed, download it from git-scm.com and follow the installation instructions for your OS.
uv is a fast Python package manager. First, check if you already have it installed:
```bash
uv --version
```

If not installed, run one of these commands:

```bash
# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
```

After installation, restart your terminal to ensure uv is available in your PATH.
Download and install Ollama from ollama.com, then pull a vision model:
```bash
ollama pull qwen3-vl:4b
```

```bash
# Clone the repository
git clone https://github.com/hydropix/ollama-image-describer.git
cd ollama-image-describer

# Install Python dependencies with uv
uv sync
```

Note: `uv sync` installs the Python dependencies (including the `ollama` Python client library). The Ollama server itself must be installed separately as described above.
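To sanity-check the setup, you can confirm the vision model is pulled and that the project environment resolves the `ollama` client:

```bash
# List locally available models (qwen3-vl:4b should appear)
ollama list

# Verify the Python dependencies installed by uv sync
uv run python -c "import ollama; print('ollama client OK')"
```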
Create a .env file from the example:
```bash
cp config/.env.example .env
```

Configure your Ollama server:

```
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=qwen3-vl:4b
```

The config/config.yaml file contains prompt presets and default settings. The system prompt is fully customizable, allowing you to precisely control the output format to match your specific needs.
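For reference, the host and model settings map directly onto the `ollama` Python client. A minimal sketch, assuming the values from the `.env` example above:

```python
from ollama import Client

# Point the client at the server configured via OLLAMA_HOST
client = Client(host="http://localhost:11434")

# Any call now targets that server, e.g. listing pulled models
print(client.list())
```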
The tool includes presets optimized for different image generation models:
| Preset | Target Model | Description |
|---|---|---|
| Z-Image | Z-Image | Very detailed, structured descriptions with markdown formatting. Focuses on composition, lighting, textures, and atmosphere with poetic precision. |
| Stable Diffusion | SD, SDXL, Forge | Tag-based prompts with weight syntax (element:1.2). Uses parentheses for emphasis and quality boosters like (masterpiece:1.2), (best quality). |
| Simple | General use | Concise, straightforward descriptions without special formatting. |
Example output with the Stable Diffusion preset:

```text
(masterpiece:1.2), (best quality), 1girl, long flowing red hair, (emerald green eyes:1.3),
elegant black dress, standing in flower field, soft golden hour lighting, (bokeh:1.1),
depth of field, vibrant colors, digital painting style, highly detailed
```
Example output with the Z-Image preset:

```markdown
## Subject
**Young woman** with flowing auburn hair, wearing a vintage emerald dress

## Composition & Setting
Wide shot capturing a sunlit meadow with wildflowers in the foreground

## Lighting & Atmosphere
*Golden hour lighting* casting warm shadows, *soft diffused glow* from the setting sun
```

Add your own presets in config/config.yaml to tailor outputs for your specific workflow:
```yaml
prompts:
  # Your custom preset
  flux:
    name: "Flux"
    markdownFormat: false
    prompt: |
      Generate a natural language description optimized for Flux models.
      Focus on clear, descriptive sentences without weight syntax.
      Describe the scene as if telling a story...

  my_custom:
    name: "My Custom Style"
    markdownFormat: false
    prompt: |
      Your custom instructions here...
      Be specific about the output format you want.

defaults:
  temperature: 0.7
  model: "qwen3-vl:4b"
```

The markdownFormat option controls whether the output uses markdown styling (headers, bold, italics) or plain text.
Important: All commands must be run from the project directory (`ollama-image-describer`).
Launch the web interface for an easier experience:
```bash
cd ollama-image-describer
uv run python -m image_describer --web
```

CLI usage:

```bash
uv run python -m image_describer <image_folder> [options]
```

| Option | Short | Description |
|---|---|---|
| `--web` | `-w` | Launch the web interface |
| `--config` | `-c` | Path to YAML config file |
| `--prefix` | `-p` | Text prepended to each description |
| `--suffix` | `-s` | Text appended to each description (e.g., ", By Artist") |
| `--overwrite/--no-overwrite` | | Overwrite existing .txt files (default: overwrite) |
| `--verbose` | `-v` | Verbose mode |
```bash
# Launch web interface
uv run python -m image_describer --web

# Basic CLI usage
uv run python -m image_describer ./my_images

# With prefix and suffix
uv run python -m image_describer ./my_images --prefix "A photo of " --suffix ", By Kristof"

# Verbose mode with custom config
uv run python -m image_describer ./my_images -v -c custom_config.yaml
```
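With the prefix/suffix example above, each generated `.txt` file ends up looking something like this (the caption itself is hypothetical output, shown only to illustrate where the prefix and suffix land):

```text
A photo of a young woman in an emerald dress standing in a sunlit meadow, By Kristof
```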
⚠️ Important: This tool requires a vision model capable of analyzing images. Standard text-only models (like `llama3`, `mistral`, etc.) will not work.

Browse all available vision models: Ollama Vision Models
Recommended vision models:
- Qwen3-VL (recommended): `qwen3-vl:4b`
- LLaVA: `llava`, `llava:13b`
- Llama Vision: `llama3.2-vision`
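Any of these can be pulled the same way and selected via `OLLAMA_MODEL` in your `.env`, for example:

```bash
ollama pull llava:13b
```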
License: MIT
