A comprehensive framework for evaluating Large Language Models' ability to identify other LLMs based on their responses. This project tests "situational awareness" in AI systems by examining whether models can recognize the distinctive patterns and characteristics of different AI families.
This framework evaluates how well LLMs can identify (see the illustrative mapping sketched after this list):
- Model Families (GPT, Claude, Gemini, DeepSeek, Qwen, Llama)
- Exact Models (GPT-4o, Claude-3.5-Sonnet, etc.)
- Response Patterns across different task types
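For concreteness, the sketch below shows the kind of model-to-family mapping these two granularities reduce to. The entries and the helper function are illustrative examples, not the label set shipped in configs.py.

```python
# Illustrative mapping from exact model names to model families.
# These names are examples; the actual label set is defined in configs.py.
MODEL_FAMILIES = {
    "gpt-4o": "GPT",
    "claude-3-5-sonnet": "Claude",
    "claude-3-7-sonnet": "Claude",
    "gemini-1.5-pro": "Gemini",          # example entry
    "deepseek-v3": "DeepSeek",
    "qwen-2.5-72b-instruct": "Qwen",     # example entry
    "llama-3.1-70b-instruct": "Llama",   # example entry
}

def family_of(model_name: str) -> str:
    """Map an exact model name to its family label ('Unknown' if unmapped)."""
    return MODEL_FAMILIES.get(model_name, "Unknown")
```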
The evaluation covers multiple domains:
- Code - Programming tasks and completions
- Math - Mathematical problem solving
- ChatBot Arena - Comparative evaluations
- Jailbreaking - Safety and robustness testing
pip install anthropic openai google-generativeai datasets pandas numpy scikit-learn matplotlib seaborn tqdm torch transformers together

Create `api_keys.json`:
{
"anthropic": "your-anthropic-api-key",
"openai": "your-openai-api-key",
"gemini": "your-google-api-key",
"deepseek": "your-deepseek-api-key",
"together": "your-together-api-key"
}
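The scripts read these credentials before building their API clients. Below is a minimal sketch of loading the file; the load_api_keys helper is illustrative, not a function exported by this repo.

```python
import json

def load_api_keys(path: str = "api_keys.json") -> dict:
    """Load provider API keys from the JSON file shown above."""
    with open(path) as f:
        return json.load(f)

keys = load_api_keys()
print(sorted(keys))  # -> ['anthropic', 'deepseek', 'gemini', 'openai', 'together']
```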
# Generate responses for CCP dataset
python unified_response_generator.py --dataset_type ccp --target_model claude-3-7-sonnet --num_samples 100
# Generate responses for code tasks
python unified_response_generator.py --dataset_type code --target_model gpt-4o --num_samples 50
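To generate responses for several target models in one pass, the CLI can be driven from a small wrapper script. This sketch reuses the flags shown above; the model list and the loop itself are illustrative and not part of the repo.

```python
import subprocess

# Example target models; adjust to whatever is configured in configs.py.
TARGET_MODELS = ["gpt-4o", "claude-3-7-sonnet", "deepseek-v3"]

for model in TARGET_MODELS:
    # Same flags as the unified_response_generator.py examples above.
    subprocess.run(
        [
            "python", "unified_response_generator.py",
            "--dataset_type", "code",
            "--target_model", model,
            "--num_samples", "50",
        ],
        check=True,
    )
```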
# Evaluate model identification accuracy
python unified_evaluation.py --dataset_type ccp --identifier_model claude-3-7-sonnet --target_model gpt-4o
# Cross-model evaluation
python unified_evaluation.py --dataset_type math --identifier_model deepseek-v3 --target_model claude-3-5-haiku

├── base_inference.py         # Base classes and API clients
├── evaluation_utils.py       # Utility functions for metrics
├── unified_evaluation.py     # Main evaluation script
├── configs.py                # Model configurations
├── prompts.py                # Prompt templates
└── sampled_data_indicies.py  # Data sampling utilities
- `api_keys.json` - API credentials (not in repo)
- `configs.py` - Model configurations and endpoints (an illustrative entry is sketched below)
- `prompts.py` - Evaluation prompts and templates
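As a rough illustration of what a configuration entry might contain, here is a hypothetical shape for configs.py entries; the actual fields, endpoints, and model identifiers are defined in the repo itself.

```python
# Hypothetical shape of a model configuration entry; the real fields live in configs.py.
MODEL_CONFIGS = {
    "gpt-4o": {
        "provider": "openai",     # which API client to use (assumed field)
        "family": "GPT",          # family label used when scoring identifications (assumed field)
    },
    "claude-3-7-sonnet": {
        "provider": "anthropic",
        "family": "Claude",
    },
}
```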
from base_inference import BaseLLMIdentityInference
from evaluation_utils import compute_accuracy_metrics
# Initialize framework
evaluator = BaseLLMIdentityInference()
# Generate response from target model
response = await evaluator.generate_response(
model_name="claude-3-7-sonnet",
messages=[{"role": "user", "content": "Explain quantum computing"}]
)
# Use identifier model to classify the response
classification = await evaluator.generate_response(
model_name="gpt-4o",
messages=[{"role": "user", "content": f"Identify this response: {response}"}]
)

from unified_evaluation import UnifiedEvaluator
evaluator = UnifiedEvaluator(
dataset_type="ccp",
identifier_model="claude-3-7-sonnet",
target_model="gpt-4o"
)
results = await evaluator.run_inference()
print(f"Accuracy: {results['metrics']['family_accuracy']['accuracy']:.3f}")from evaluation_utils import compute_multilabel_auc, plot_confusion_matrix
# Compute detailed metrics from the collected ground-truth labels (y_true) and predictions (y_pred, y_pred_proba)
auc_results = compute_multilabel_auc(y_true, y_pred_proba)
accuracy_results = compute_accuracy_metrics(y_true, y_pred)
# Visualize results
plot_confusion_matrix(
accuracy_results['confusion_matrix'],
accuracy_results['labels'],
title="Model Identification Results"
)
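Assuming plot_confusion_matrix draws onto the active matplotlib figure (an assumption about the helper, not documented above), the plot can be saved afterwards:

```python
import matplotlib.pyplot as plt

# Assumes the helper drew onto the current matplotlib figure.
plt.savefig("model_identification_confusion_matrix.png", dpi=200, bbox_inches="tight")
```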