Skip to content

Latest commit

 

History

History
107 lines (78 loc) · 3.08 KB

README.md

File metadata and controls

107 lines (78 loc) · 3.08 KB

promptfoo-evals-starter

A simple repo that helps you get started with AI evals using promptfoo.

Every wondered how to run evals on your own prompts? Then this repo gets you from 0 to hero in no time.

promptfoo-evals-starter

If you find this repo useful, please give it a star and follow me on X (@s_streichsbier) for more content like this.

Getting started

Install promptfoo

npm install -g promptfoo@latest

Set up your API keys

Note: You can remove providers you don't care about in the promptfooconfig.yaml file.

  • OpenAI: export OPENAI_API_KEY=<your-key>
  • Anthropic: export ANTHROPIC_API_KEY=<your-key>
  • Google: export GOOGLE_API_KEY=<your-key>
  • DeepSeek: export DEEPSEEK_API_KEY=<your-key>
  • Groq: export GROQ_API_KEY=<your-key>
  • xai: export XAI_API_KEY=<your-key>
  • OpenRouter: export OPEN_ROUTER_API_KEY=<your-key>

Run the evals

# Navigate to the directory of the eval you want to run
cd counting_characters_in_words

# Run the eval
promptfoo eval

# Run the eval without cache
promptfoo eval --no-cache

View the results

# Default port is 15500,
# -y opens the browser automatically
promptfoo view -y 

# Specify a different port
promptfoo view -p 1337 -y

Directory structure

.
├── counting_characters_in_words
│   ├── promptfooconfig.yaml
│   ├── system_instructions.md
│   ├── test-r-in-raspberry.yaml
│   └── test-r-in-strawberry.yaml
  • counting_characters_in_words: The directory containing the eval
  • promptfooconfig.yaml: The configuration file for promptfoo containing the prompts and models to use
  • system_instructions.md: The system instructions for the eval
  • prompt.json: contains the prompt format for the eval
  • test-*.yaml: The test files for the eval

Best practices

  • Use XML for the prompt format, it works best for LLMs. See this video for details BEST Prompt Format: Markdown, XML, or Raw? for an example.
  • Use JSON responses for the eval, this makes it easier to parse the results and run assertions.
  • Add a defaultTest section to the promptfooconfig.yaml file with the expected JSON schema.

Models

OpenAI

  • gpt-3.5-turbo
  • gpt-4o-mini
  • gpt-4o

the o1 models don't support system messages at the moment.

Anthropic

  • claude-3-5-haiku-20241022
  • claude-3-5-sonnet-20241022

Google

  • gemini-1.5-flash-002
  • gemini-1.5-flash-8b
  • gemini-1.5-pro
  • gemini-2.0-flash-exp
  • gemini-2.0-flash-thinking-exp-1219
  • gemini-exp-1206

DeepSeek

  • deepseek-v3

Groq

  • llama-3.3-70b-versatile

xai

  • grok-2

OpenRouter

Also check out OpenRouter for more models.