This guide explains how to evaluate the Visual Search system's retrieval performance.
The evaluation module measures how well the system retrieves relevant images for given queries. It computes standard information retrieval metrics against ground truth data.
Create a JSON file with the following structure:

```json
{
  "queries": [
    {
      "query_id": "q001",
      "image_path": "data/queries/query_001.jpg",
      "ground_truth": ["img_001", "img_002", "img_003"]
    },
    {
      "query_id": "q002",
      "image_path": "data/queries/query_002.jpg",
      "ground_truth": ["img_010", "img_011"]
    }
  ]
}
```

Fields:

| Field | Type | Description |
|---|---|---|
| `query_id` | string | Unique identifier for the query |
| `image_path` | string | Path to query image (optional if using precomputed embeddings) |
| `ground_truth` | list[string] | List of relevant image IDs |
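
Before running an evaluation, it can help to sanity-check a dataset file against the fields above. The following is a minimal sketch using only the standard library; it is not part of `visual_search`. The helper name `validate_dataset` is hypothetical, and the rules it enforces (unique `query_id`, non-empty `ground_truth`) are assumptions about what a well-formed file should contain.

```python
import json


def validate_dataset(data: dict) -> list[str]:
    """Return a list of problems found; an empty list means the data looks OK."""
    problems = []
    seen_ids = set()
    queries = data.get("queries") or []
    if not queries:
        problems.append("no queries found")
    for i, query in enumerate(queries):
        query_id = query.get("query_id")
        if not isinstance(query_id, str) or not query_id:
            problems.append(f"queries[{i}]: missing or non-string query_id")
        elif query_id in seen_ids:
            problems.append(f"queries[{i}]: duplicate query_id {query_id!r}")
        else:
            seen_ids.add(query_id)
        # Assumption: every query needs at least one relevant image ID.
        ground_truth = query.get("ground_truth")
        if not isinstance(ground_truth, list) or not ground_truth:
            problems.append(f"queries[{i}]: ground_truth must be a non-empty list")
    return problems


raw = """
{
  "queries": [
    {"query_id": "q001", "image_path": "data/queries/query_001.jpg",
     "ground_truth": ["img_001", "img_002", "img_003"]},
    {"query_id": "q002", "ground_truth": []}
  ]
}
"""
example = json.loads(raw)
for problem in validate_dataset(example):
    print(problem)
```

In this example the second query is reported because its `ground_truth` list is empty; the missing `image_path` is not flagged, since that field is optional.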

Run an evaluation with `ModelEvaluator`:

```python
from visual_search.evaluation.evaluator import ModelEvaluator, EvaluationDataset

# Load dataset
dataset = EvaluationDataset.from_json("evaluation_data.json")

# Create evaluator (search_service must already be initialized)
evaluator = ModelEvaluator(search_service=search_service)

# Run evaluation
result = evaluator.evaluate(
    dataset=dataset,
    k_values=[1, 5, 10, 20],
    output_path="results.json",
)

# Print summary
print(result.summary())
```

For faster evaluation without recomputing embeddings:

```python
import numpy as np

# Precomputed query embeddings, keyed by query_id
precomputed = {
    "q001": np.array([...]),  # 512-dim vector
    "q002": np.array([...]),
}

result = evaluator.evaluate(
    dataset=dataset,
    k_values=[1, 5, 10],
    precomputed_embeddings=precomputed,
)
```

Alternatively, call `evaluate_model` with the dataset path directly:

```python
from visual_search.evaluation.evaluator import evaluate_model

result = evaluate_model(
    search_service=search_service,
    dataset_path="evaluation_data.json",
    k_values=[1, 5, 10],
    output_path="results.json",
)
```

Recall@K is the proportion of relevant items found in the top-K results.
Example:
- Ground truth: [img1, img2, img3]
- Retrieved@5: [img1, img4, img2, img5, img6]
- Recall@5 = 2/3 = 0.667

Precision@K is the proportion of the top-K results that are relevant.
Example:
- Ground truth: [img1, img2, img3]
- Retrieved@5: [img1, img4, img2, img5, img6]
- Precision@5 = 2/5 = 0.4
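
The two worked examples above can be reproduced with a short sketch. The function names `recall_at_k` and `precision_at_k` are illustrative and are not taken from the `visual_search` package; the sketch also assumes the retrieved list contains no duplicate IDs.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k slots that hold a relevant item (denominator is k)."""
    if k <= 0:
        return 0.0
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / k


ground_truth = {"img1", "img2", "img3"}
retrieved = ["img1", "img4", "img2", "img5", "img6"]

print(round(recall_at_k(retrieved, ground_truth, 5), 3))     # 0.667
print(round(precision_at_k(retrieved, ground_truth, 5), 3))  # 0.4
```

The two metrics share a numerator (hits in the top K) and differ only in the denominator: the size of the ground truth for recall, and K for precision.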

Mean Average Precision (mAP) is the mean of the per-query Average Precision (AP) scores, where AP rewards rankings that place relevant items earlier.
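
To make AP concrete, here is a minimal sketch of one common definition: the precision at each rank where a relevant item appears, summed and divided by the total number of relevant items. This normalization is an assumption; the evaluator in this project may instead divide by the number of relevant items actually retrieved. The function names are illustrative.

```python
def average_precision(retrieved: list[str], relevant: set[str]) -> float:
    """AP for one query: sum of precision at each relevant hit, over |relevant|."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(runs: list[tuple[list[str], set[str]]]) -> float:
    """Mean of AP over (retrieved, relevant) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, g) for r, g in runs) / len(runs)


runs = [
    (["img1", "img4", "img2", "img5", "img6"], {"img1", "img2", "img3"}),
    (["img10"], {"img10"}),
]
print(round(average_precision(*runs[0]), 3))    # 0.556
print(round(mean_average_precision(runs), 3))   # 0.778
```

For the first query, relevant items appear at ranks 1 and 3, giving (1/1 + 2/3) / 3 ≈ 0.556; the missing third relevant item contributes zero.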

Mean Reciprocal Rank (MRR) is the average, across queries, of the reciprocal rank of the first relevant result.
Example:
- Query 1: First relevant at rank 1 → RR = 1.0
- Query 2: First relevant at rank 3 → RR = 0.333
- MRR = (1.0 + 0.333) / 2 = 0.667
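
The MRR example above can be checked with the following sketch. The function names are illustrative, and scoring a query with no relevant result as 0 is an assumption rather than documented behavior of the evaluator.

```python
def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Mean of reciprocal ranks over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)


runs = [
    (["a1", "x", "y"], {"a1"}),   # first relevant at rank 1
    (["x", "y", "b1"], {"b1"}),   # first relevant at rank 3
]
print(round(mean_reciprocal_rank(runs), 3))  # 0.667
```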

Normalized Discounted Cumulative Gain (NDCG@K) measures ranking quality with position-weighted relevance.
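
As an illustration of position-weighted relevance, the sketch below computes NDCG@K under two assumptions that the text does not confirm: relevance is binary (an image is either in `ground_truth` or not), and the discount is the conventional `1 / log2(rank + 1)`. The function names are illustrative.

```python
import math


def dcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Discounted cumulative gain with binary relevance and a log2 discount."""
    return sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(retrieved[:k], start=1)
        if item in relevant
    )


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """DCG divided by the DCG of an ideal ranking (all relevant items first)."""
    ideal_hits = min(len(relevant), k)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    if ideal == 0:
        return 0.0
    return dcg_at_k(retrieved, relevant, k) / ideal


ground_truth = {"img1", "img2", "img3"}
retrieved = ["img1", "img4", "img2", "img5", "img6"]
print(round(ndcg_at_k(retrieved, ground_truth, 5), 3))  # 0.704
```

Unlike Precision@K, this score drops when a relevant item moves from rank 2 to rank 3, even though the set of top-5 results is unchanged.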
Evaluation results are saved as JSON:

```json
{
  "num_queries": 100,
  "k_values": [1, 5, 10, 20],
  "recall_at_1": 0.45,
  "recall_at_5": 0.72,
  "recall_at_10": 0.85,
  "recall_at_20": 0.93,
  "precision_at_1": 0.45,
  "precision_at_5": 0.28,
  "precision_at_10": 0.18,
  "precision_at_20": 0.12,
  "map": 0.52,
  "mrr": 0.61,
  "per_query_results": [
    {
      "query_id": "q001",
      "recall_at_k": {"1": 1.0, "5": 1.0, "10": 1.0},
      "precision_at_k": {"1": 1.0, "5": 0.6, "10": 0.3},
      "average_precision": 0.85,
      "reciprocal_rank": 1.0
    }
  ],
  "timestamp": "2024-01-15T10:30:00Z",
  "model_name": "clip-ViT-B-32"
}
```

To build a ground truth dataset:

- Select diverse query images
- For each query, identify all relevant images in the index
- Create JSON file with ground truth

Pseudo ground truth can also be generated from high-confidence retrievals:

```python
# Use high-confidence retrievals as pseudo ground truth
def create_pseudo_ground_truth(search_service, query_images, threshold=0.8):
    """Build a dataset dict from results whose score meets the threshold."""
    dataset = {"queries": []}
    for query_id, query_vec in query_images.items():
        results = search_service.search(query_vec, query_id, k=100)
        # Filter by similarity threshold
        relevant = [r.image_id for r in results.results if r.score >= threshold]
        dataset["queries"].append({
            "query_id": query_id,
            "ground_truth": relevant,
        })
    return dataset
```

Dataset recommendations:

- Use at least 50-100 queries for reliable metrics
- Diverse query categories
- Balanced ground truth sizes
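
One way to check the size and balance recommendations above is to summarize how many relevant images each query has. This is a rough sketch using the standard library; the helper name `ground_truth_stats` is hypothetical, and what counts as "balanced" is left to the reader.

```python
from statistics import mean, median


def ground_truth_stats(data: dict) -> dict:
    """Summarize the number of relevant images per query in a dataset dict."""
    sizes = [len(q["ground_truth"]) for q in data["queries"]]
    if not sizes:
        return {"num_queries": 0}
    return {
        "num_queries": len(sizes),
        "min": min(sizes),
        "max": max(sizes),
        "mean": mean(sizes),
        "median": median(sizes),
    }


dataset_dict = {
    "queries": [
        {"query_id": "q001", "ground_truth": ["img_001", "img_002", "img_003"]},
        {"query_id": "q002", "ground_truth": ["img_010", "img_011"]},
    ]
}
print(ground_truth_stats(dataset_dict))
```

A large gap between `min` and `max` suggests that a few queries with many relevant images may dominate recall-style metrics.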
Choose K values relevant to your use case:
- User-facing: K=10, 20 (typical page sizes)
- Analysis: K=1, 5, 10, 50, 100
Always compare against baselines:
- Random retrieval
- Previous model version
- Simpler embeddings (e.g., color histograms)
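
For the random-retrieval baseline, a simulation like the one below gives a floor to compare against. It is a sketch under stated assumptions: `index_ids` stands for the full list of image IDs in the index, and the function name `random_baseline_recall` is illustrative rather than part of `visual_search`.

```python
import random


def random_baseline_recall(
    ground_truth: list[str],
    index_ids: list[str],
    k: int,
    trials: int = 1000,
    seed: int = 0,
) -> float:
    """Estimate Recall@k when k images are drawn uniformly at random from the index."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    relevant = set(ground_truth)
    if not relevant:
        return 0.0
    total = 0.0
    for _ in range(trials):
        sample = rng.sample(index_ids, k)  # k distinct IDs, no replacement
        total += sum(1 for image_id in sample if image_id in relevant) / len(relevant)
    return total / trials


index_ids = [f"img_{i:03d}" for i in range(100)]
baseline = random_baseline_recall(["img_001", "img_002", "img_003"], index_ids, k=10)
print(round(baseline, 3))
```

With uniform sampling, the expected Recall@K is K divided by the index size (0.1 in this example), so a model whose recall is close to that value is doing no better than chance.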

For more robust estimates, evaluate over multiple folds of the query set:

```python
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True)
for train_idx, test_idx in kfold.split(dataset.queries):
    # Evaluate on test fold
    ...
```

Complete example script (`evaluate.py`):

```python
#!/usr/bin/env python
"""Evaluate visual search model."""
import argparse

from visual_search.evaluation.evaluator import evaluate_model
from visual_search.prediction.nearest_neighbor import NearestNeighborSearch
from visual_search.indexing.index_table import VectorIndex


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--index", required=True, help="Path to FAISS index")
    parser.add_argument("--dataset", required=True, help="Evaluation dataset JSON")
    parser.add_argument("--output", required=True, help="Output path for results")
    parser.add_argument("--k", nargs="+", type=int, default=[1, 5, 10, 20])
    args = parser.parse_args()

    # Load index
    index = VectorIndex(dimension=512)
    index.load(args.index)

    # Create search service
    search_service = NearestNeighborSearch(index=index)

    # Run evaluation
    result = evaluate_model(
        search_service=search_service,
        dataset_path=args.dataset,
        k_values=args.k,
        output_path=args.output,
    )
    print(result.summary())


if __name__ == "__main__":
    main()
```

Run with:

```bash
python evaluate.py --index data/index --dataset eval.json --output results.json
```