Model Evaluation Guide

This guide explains how to evaluate the Visual Search system's retrieval performance.

Overview

The evaluation module measures how well the system retrieves relevant images for given queries. It computes standard information retrieval metrics against ground truth data.

Evaluation Dataset Format

Create a JSON file with the following structure:

{
  "queries": [
    {
      "query_id": "q001",
      "image_path": "data/queries/query_001.jpg",
      "ground_truth": ["img_001", "img_002", "img_003"]
    },
    {
      "query_id": "q002",
      "image_path": "data/queries/query_002.jpg",
      "ground_truth": ["img_010", "img_011"]
    }
  ]
}

Fields:

| Field | Type | Description |
|---|---|---|
| query_id | string | Unique identifier for the query |
| image_path | string | Path to the query image (optional when using precomputed embeddings) |
| ground_truth | list[string] | IDs of the indexed images that are relevant to the query |
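Checking a dataset file against the field table before a long evaluation run can catch malformed entries early. The following is a minimal sketch: `validate_dataset` is a hypothetical helper, not part of `visual_search`, and it assumes `image_path` may be omitted, as the table states.

```python
import json


def validate_dataset(data: dict) -> list[str]:
    """Return a list of problems found in an evaluation dataset dict.

    Hypothetical helper for illustration; not part of visual_search.
    """
    problems = []
    seen_ids = set()
    for i, query in enumerate(data.get("queries", [])):
        query_id = query.get("query_id")
        if not isinstance(query_id, str) or not query_id:
            problems.append(f"queries[{i}]: query_id must be a non-empty string")
        elif query_id in seen_ids:
            problems.append(f"queries[{i}]: duplicate query_id {query_id!r}")
        else:
            seen_ids.add(query_id)

        # image_path is optional (precomputed embeddings may replace it)
        if "image_path" in query and not isinstance(query["image_path"], str):
            problems.append(f"queries[{i}]: image_path must be a string")

        truth = query.get("ground_truth")
        if not isinstance(truth, list) or not all(isinstance(t, str) for t in truth):
            problems.append(f"queries[{i}]: ground_truth must be a list of strings")
        elif not truth:
            problems.append(f"queries[{i}]: ground_truth is empty")
    return problems


example = json.loads("""
{"queries": [
  {"query_id": "q001", "image_path": "data/queries/query_001.jpg",
   "ground_truth": ["img_001", "img_002", "img_003"]},
  {"query_id": "q001", "ground_truth": []}
]}
""")
for problem in validate_dataset(example):
    print(problem)
```

The second entry is deliberately broken (duplicate ID, empty ground truth) to show what the helper reports.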

Running Evaluation

Basic Usage

from visual_search.evaluation.evaluator import ModelEvaluator, EvaluationDataset

# Load dataset
dataset = EvaluationDataset.from_json("evaluation_data.json")

# Create evaluator (search_service is assumed to be an already-initialized search service)
evaluator = ModelEvaluator(search_service=search_service)

# Run evaluation
result = evaluator.evaluate(
    dataset=dataset,
    k_values=[1, 5, 10, 20],
    output_path="results.json",
)

# Print summary
print(result.summary())

With Precomputed Embeddings

To skip embedding computation during evaluation, pass precomputed query embeddings keyed by query_id:

import numpy as np

# Precomputed query embeddings
precomputed = {
    "q001": np.array([...]),  # 512-dim vector
    "q002": np.array([...]),
}

result = evaluator.evaluate(
    dataset=dataset,
    k_values=[1, 5, 10],
    precomputed_embeddings=precomputed,
)

Using the Convenience Function

from visual_search.evaluation.evaluator import evaluate_model

result = evaluate_model(
    search_service=search_service,
    dataset_path="evaluation_data.json",
    k_values=[1, 5, 10],
    output_path="results.json",
)

Metrics

Recall@K

Proportion of all relevant items that appear in the top-K results.

$$\text{Recall@K} = \frac{|\text{Retrieved}_K \cap \text{Relevant}|}{|\text{Relevant}|}$$

Example:

  • Ground truth: [img1, img2, img3]
  • Retrieved@5: [img1, img4, img2, img5, img6]
  • Recall@5 = 2/3 = 0.667
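The example above can be reproduced with a few lines of Python. This is a minimal sketch of the formula, not the evaluator's actual implementation; `recall_at_k` is a name chosen for illustration, and returning 0.0 for an empty ground truth is an assumed convention.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k retrieved items."""
    if not relevant:
        return 0.0  # assumed convention for queries without ground truth
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


relevant = {"img1", "img2", "img3"}
retrieved = ["img1", "img4", "img2", "img5", "img6"]
print(round(recall_at_k(retrieved, relevant, k=5), 3))  # 0.667
```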

Precision@K

Proportion of top-K results that are relevant.

$$\text{Precision@K} = \frac{|\text{Retrieved}_K \cap \text{Relevant}|}{K}$$

Example:

  • Ground truth: [img1, img2, img3]
  • Retrieved@5: [img1, img4, img2, img5, img6]
  • Precision@5 = 2/5 = 0.4
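The same example, written as code. This is a sketch of the formula under the assumption that the denominator is always K, even when fewer than K results are returned; `precision_at_k` is an illustrative name, not the module's API.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k  # divides by k, matching the formula above


relevant = {"img1", "img2", "img3"}
retrieved = ["img1", "img4", "img2", "img5", "img6"]
print(precision_at_k(retrieved, relevant, k=5))  # 0.4
```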

Mean Average Precision (mAP)

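Mean of the Average Precision (AP) scores across all queries. AP rewards rankings that place relevant items early: it sums the precision at every rank where a relevant item appears and divides by the number of relevant items.

$$\text{AP} = \frac{1}{|\text{Relevant}|} \sum_{k=1}^{K} P(k) \cdot \text{rel}(k)$$

Here P(k) is the precision at rank k, and rel(k) is 1 if the item at rank k is relevant and 0 otherwise.

$$\text{mAP} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \text{AP}_i$$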
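A short worked example may make the AP formula more concrete. The sketch below follows the formula literally; the function names are illustrative rather than taken from the evaluator, and it assumes the retrieved list contains no duplicate IDs.

```python
def average_precision(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Sum of precision at each relevant rank within the top k, over |relevant|."""
    if not relevant:
        return 0.0  # assumed convention for queries without ground truth
    hits = 0
    precision_sum = 0.0
    for rank, image_id in enumerate(retrieved[:k], start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Mean of AP over (retrieved, relevant) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel, k) for r, rel in runs) / len(runs)


runs = [
    (["img1", "img4", "img2", "img5", "img6"], {"img1", "img2", "img3"}),
    (["img9", "img10", "img11"], {"img10", "img11"}),
]
# Query 1: (1/1 + 2/3) / 3 = 0.556; query 2: (1/2 + 2/3) / 2 = 0.583
print(round(mean_average_precision(runs, k=5), 3))  # 0.569
```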

Mean Reciprocal Rank (MRR)

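Average, over all queries, of the reciprocal rank of the first relevant result. In the formula below, rank_i is the position of the first relevant result for query i.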

$$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$$

Example:

  • Query 1: First relevant at rank 1 → RR = 1.0
  • Query 2: First relevant at rank 3 → RR = 0.333
  • MRR = (1.0 + 0.333) / 2 = 0.667
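The two-query example can be checked in code. This is a sketch of the formula with illustrative function names; scoring a query with no relevant result as 0 is an assumed convention that the guide does not specify.

```python
def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, image_id in enumerate(retrieved, start=1):
        if image_id in relevant:
            return 1.0 / rank
    return 0.0  # assumed convention when nothing relevant is found


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Mean reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["a", "b", "c"], {"a"}),  # first relevant at rank 1
    (["x", "y", "z"], {"z"}),  # first relevant at rank 3
]
print(round(mean_reciprocal_rank(runs), 3))  # 0.667
```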

NDCG@K (Normalized Discounted Cumulative Gain)

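Measures ranking quality with position-weighted relevance: a relevant item contributes less the lower it is ranked, and the total is normalized by the score of an ideal ranking so that values fall between 0 and 1.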
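The guide does not state which NDCG variant the module computes. The sketch below assumes the common binary-relevance form, where each relevant item at rank i contributes 1 / log2(i + 1) and the ideal ranking places all relevant items first; `ndcg_at_k` is an illustrative name.

```python
import math


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG@k (assumed variant, see lead-in)."""
    if not relevant or k <= 0:
        return 0.0
    # Discounted gain of the actual ranking
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, image_id in enumerate(retrieved[:k], start=1)
        if image_id in relevant
    )
    # Ideal ranking: all relevant items occupy the first positions
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg


relevant = {"img1", "img2", "img3"}
retrieved = ["img1", "img4", "img2", "img5", "img6"]
# DCG = 1/log2(2) + 1/log2(4) = 1.5; IDCG = 1 + 1/log2(3) + 0.5
print(round(ndcg_at_k(retrieved, relevant, k=5), 3))  # 0.704
```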

Output Format

Evaluation results are saved as JSON:

{
  "num_queries": 100,
  "k_values": [1, 5, 10, 20],
  "recall_at_1": 0.45,
  "recall_at_5": 0.72,
  "recall_at_10": 0.85,
  "recall_at_20": 0.93,
  "precision_at_1": 0.45,
  "precision_at_5": 0.28,
  "precision_at_10": 0.18,
  "precision_at_20": 0.12,
  "map": 0.52,
  "mrr": 0.61,
  "per_query_results": [
    {
      "query_id": "q001",
      "recall_at_k": {"1": 0.333, "5": 1.0, "10": 1.0, "20": 1.0},
      "precision_at_k": {"1": 1.0, "5": 0.6, "10": 0.3, "20": 0.15},
      "average_precision": 0.85,
      "reciprocal_rank": 1.0
    }
  ],
  "timestamp": "2024-01-15T10:30:00Z",
  "model_name": "clip-ViT-B-32"
}
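Because the aggregate metrics are stored under flat keys such as recall_at_5, it can be convenient to regroup them by K when reading a results file. A minimal sketch, assuming the key naming shown above; `metrics_by_k` is a hypothetical helper and the inline JSON is a shortened copy of the sample output.

```python
import json


def metrics_by_k(results: dict) -> dict[int, dict[str, float]]:
    """Group the flat recall_at_K / precision_at_K keys by K."""
    table = {}
    for k in results["k_values"]:
        table[k] = {
            "recall": results[f"recall_at_{k}"],
            "precision": results[f"precision_at_{k}"],
        }
    return table


# Shortened copy of the sample output above; in practice, load results.json
results = json.loads("""
{"k_values": [1, 5],
 "recall_at_1": 0.45, "recall_at_5": 0.72,
 "precision_at_1": 0.45, "precision_at_5": 0.28}
""")
for k, row in metrics_by_k(results).items():
    print(f"K={k:<3} recall={row['recall']:.2f} precision={row['precision']:.2f}")
```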

Creating Evaluation Datasets

Manual Annotation

  1. Select diverse query images
  2. For each query, identify all relevant images in the index
  3. Create JSON file with ground truth

Semi-Automated

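# Use high-confidence retrievals as pseudo ground truth.
# Caveat: labels produced by the model under evaluation bias the metrics in its
# favor, so review them manually before relying on the resulting scores.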
def create_pseudo_ground_truth(search_service, query_images, threshold=0.8):
    dataset = {"queries": []}
    
    for query_id, query_vec in query_images.items():
        results = search_service.search(query_vec, query_id, k=100)
        
        # Filter by similarity threshold
        relevant = [r.image_id for r in results.results if r.score >= threshold]
        
        dataset["queries"].append({
            "query_id": query_id,
            "ground_truth": relevant,
        })
    
    return dataset

Best Practices

Dataset Size

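  • At least 50-100 queries; with fewer, aggregate metrics shift noticeably when a single query changes
  • Diverse query categories
  • Balanced ground truth sizes, so that queries with very large relevant sets do not dominate the averages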
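One way to judge whether a dataset is large enough is to put a confidence interval around a metric. The sketch below uses a percentile bootstrap over per-query scores; it is an illustration rather than part of the evaluation module, `bootstrap_ci` is a hypothetical helper, and the per-query values are synthetic.

```python
import random


def bootstrap_ci(values: list[float], n_resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample n queries with replacement and record the mean score
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower_idx = max(0, int((alpha / 2) * n_resamples))
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lower_idx], means[upper_idx]


# Synthetic per-query Recall@10 values, for illustration only
per_query_recall = [1.0, 0.5, 0.0, 1.0, 0.67, 1.0, 0.33, 1.0, 0.5, 1.0]
low, high = bootstrap_ci(per_query_recall)
print(f"mean={sum(per_query_recall) / len(per_query_recall):.2f} "
      f"95% CI=({low:.2f}, {high:.2f})")
```

A wide interval suggests that more queries are needed before small differences between models can be trusted.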

K Values

Choose K values relevant to your use case:

  • User-facing: K=10, 20 (typical page sizes)
  • Analysis: K=1, 5, 10, 50, 100

Baseline Comparison

Always compare against baselines:

  • Random retrieval
  • Previous model version
  • Simpler embeddings (e.g., color histograms)
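For the random-retrieval baseline, the expected scores can be computed without running a search. If K distinct images are drawn uniformly from an index of N images, of which R are relevant, the expected number of hits is K·R/N. The sketch below applies that fact; `random_baseline` is a hypothetical helper and the index size is an example value.

```python
def random_baseline(num_indexed: int, num_relevant: int, k: int) -> dict[str, float]:
    """Expected Precision@k and Recall@k when k distinct items are drawn at random."""
    if not 0 < k <= num_indexed:
        raise ValueError("k must be between 1 and the index size")
    if not 0 < num_relevant <= num_indexed:
        raise ValueError("num_relevant must be between 1 and the index size")
    expected_hits = k * num_relevant / num_indexed
    return {
        "precision": expected_hits / k,          # equals R / N
        "recall": expected_hits / num_relevant,  # equals K / N
    }


# Example values: 10,000 indexed images, 3 relevant images, K = 10
baseline = random_baseline(num_indexed=10_000, num_relevant=3, k=10)
print(baseline)
```

Any model worth deploying should exceed these values by a wide margin, which makes them a useful sanity check on the evaluation pipeline itself.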

Cross-Validation

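To see how much the metrics vary with the choice of queries, evaluate on several folds of the query set and compare the per-fold results:

from sklearn.model_selection import KFold

# Fix random_state so the folds are reproducible across runs
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# No training happens during evaluation, so the train indices are unused
for _, test_idx in kfold.split(dataset.queries):
    # Queries belonging to this fold (assumes dataset.queries is indexable)
    fold_queries = [dataset.queries[i] for i in test_idx]
    # Evaluate on fold_queries and record the per-fold metrics
    ...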

Example Evaluation Script

#!/usr/bin/env python
"""Evaluate visual search model."""

import argparse
from visual_search.evaluation.evaluator import evaluate_model
from visual_search.prediction.nearest_neighbor import NearestNeighborSearch
from visual_search.indexing.index_table import VectorIndex

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--index", required=True, help="Path to FAISS index")
    parser.add_argument("--dataset", required=True, help="Evaluation dataset JSON")
    parser.add_argument("--output", required=True, help="Output path for results")
    parser.add_argument("--k", nargs="+", type=int, default=[1, 5, 10, 20])
    args = parser.parse_args()

    # Load index
    index = VectorIndex(dimension=512)
    index.load(args.index)

    # Create search service
    search_service = NearestNeighborSearch(index=index)

    # Run evaluation
    result = evaluate_model(
        search_service=search_service,
        dataset_path=args.dataset,
        k_values=args.k,
        output_path=args.output,
    )

    print(result.summary())

if __name__ == "__main__":
    main()

Run with:

python evaluate.py --index data/index --dataset eval.json --output results.json