A comprehensive memory testing framework for evaluating conversational memory agents using the LoCoMo (Long-term Conversational Memory) benchmark.
This system evaluates an AI agent's ability to:
- **Process conversational sessions**: extract and store character memories from multi-turn conversations
- **Answer questions**: respond to questions about the conversations using stored memories
- **Evaluate accuracy**: assess answer correctness across different question categories
- **Analyze errors**: generate detailed error reports for incorrect answers
The system uses three specialized agents:
- **MemAgent**: manages memory creation and updates from conversation sessions
- **ResponseAgent**: retrieves relevant memories and generates answers to questions
- **EvaluateAgent**: evaluates answer accuracy and performs error analysis
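Conceptually, one evaluation pass wires these agents together as follows. This is only an illustrative sketch with stub logic and hypothetical class/method names; the real agents in `locomo_test.py` are LLM-backed:

```python
# Conceptual data flow only -- stub logic, hypothetical names; the real
# MemAgent/ResponseAgent/EvaluateAgent call an LLM at each step.

class MemAgent:
    """Extracts and stores memories from conversation sessions."""
    def __init__(self):
        self.memories = []

    def process_session(self, session):
        # Real agent: extract salient per-character facts via an LLM.
        self.memories.extend(session)

class ResponseAgent:
    """Answers questions using stored memories."""
    def __init__(self, memories):
        self.memories = memories

    def answer(self, question):
        # Real agent: retrieve relevant memories and prompt an LLM.
        words = question.lower().split()
        hits = [m for m in self.memories if any(w in m.lower() for w in words)]
        return hits[0] if hits else "unknown"

class EvaluateAgent:
    """Judges a generated answer against the reference answer."""
    def judge(self, generated, reference):
        # Real agent: LLM-based correctness judgment.
        return reference.lower() in generated.lower()

mem = MemAgent()
mem.process_session(["Alice moved to Berlin in May.", "Alice adopted a cat."])
resp = ResponseAgent(mem.memories)
answer = resp.answer("Where did Alice move?")
print(answer, EvaluateAgent().judge(answer, "Berlin"))
```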
## Requirements

- Python 3.8+
- OpenAI API access
## Installation

Install the dependencies:

```bash
pip install openai python-dotenv
```

Create a `.env` file in the project root:

```
# OpenAI API Configuration
OPENAI_API_KEY=your_openai_api_key_here

# Optional: custom OpenAI endpoint
# OPENAI_BASE_URL=https://your-custom-endpoint.com/v1
```

## Test Data

Place your LoCoMo test data file in the `data/` directory:

```
data/locomo10.json
```
The data file should contain conversation sessions and QA pairs in the LoCoMo format.
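For orientation, a LoCoMo sample has roughly the shape below. This is an abridged, approximate sketch; field names and values are illustrative, so consult the official benchmark release for the exact schema:

```json
{
  "sample_id": "conv-1",
  "conversation": {
    "speaker_a": "Caroline",
    "speaker_b": "Melanie",
    "session_1_date_time": "1:56 pm on 8 May, 2023",
    "session_1": [
      {"speaker": "Caroline", "text": "..."},
      {"speaker": "Melanie", "text": "..."}
    ]
  },
  "qa": [
    {"question": "...", "answer": "...", "category": 2, "evidence": ["D1:1"]}
  ]
}
```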
## Usage

Run the test with default settings:

```bash
python locomo_test.py
```

Test with a specific model:

```bash
python locomo_test.py --chat-deployment gpt-4o
```

Test only the first 3 samples:

```bash
python locomo_test.py --sample-use "3"
```

Test specific samples by index:

```bash
python locomo_test.py --sample-use "[0, 2, 5]"
```

Filter by question category:

```bash
python locomo_test.py --category "1,2,3"
```

Force-regenerate memories:

```bash
python locomo_test.py --force-resum
```

Use profile information in answers:

```bash
python locomo_test.py --use-profile search
```

Skip evaluation (only process sessions):

```bash
python locomo_test.py --no-eval
```

Analyze all questions (not just errors):
```bash
python locomo_test.py --analyze-on all
```

## Command-Line Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| `--data-file` | str | `data/locomo10.json` | Path to test data file |
| `--sample-use` | str | None (all) | Sample indices: a count (e.g., `"5"`) or a list (e.g., `"[0,1,3]"`) |
| `--memory-dir` | str | `memory` | Directory for storing memory files |
| `--chat-deployment` | str | `gpt-4o-mini` | OpenAI model name |
| `--max-workers` | int | 5 | Number of parallel workers for QA processing |
| `--category` | str | None (all) | Question categories to test (e.g., `"1"` or `"0,2,3"`) |
| `--use-image` | bool | True | Include image captions in conversations |
| `--use-profile` | str | `none` | Profile usage: `none`, `search`, or `all` |
| `--force-resum` | flag | False | Force regenerate all memories from scratch |
| `--no-eval` | flag | False | Skip evaluation (only process sessions) |
| `--analyze-on` | str | `wrong` | Detailed analysis: `all`, `wrong`, or `none` |
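For reference, these flags correspond to a standard `argparse` interface along the following lines. This is a sketch of how such an interface might be declared, not necessarily how `locomo_test.py` organizes it:

```python
import argparse

def str2bool(s):
    # argparse's type=bool treats any non-empty string as True, so parse explicitly
    return str(s).lower() in ("1", "true", "yes")

parser = argparse.ArgumentParser(description="LoCoMo memory test")
parser.add_argument("--data-file", default="data/locomo10.json")
parser.add_argument("--sample-use", default=None,
                    help='count ("5") or index list ("[0,1,3]")')
parser.add_argument("--memory-dir", default="memory")
parser.add_argument("--chat-deployment", default="gpt-4o-mini")
parser.add_argument("--max-workers", type=int, default=5)
parser.add_argument("--category", default=None)
parser.add_argument("--use-image", type=str2bool, default=True)
parser.add_argument("--use-profile", choices=["none", "search", "all"], default="none")
parser.add_argument("--force-resum", action="store_true")
parser.add_argument("--no-eval", action="store_true")
parser.add_argument("--analyze-on", choices=["all", "wrong", "none"], default="wrong")
args = parser.parse_args()
```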
### `--sample-use`

Controls which samples to process:

- Integer: `"5"` processes the first 5 samples
- List: `"[0, 2, 5, 8]"` processes the samples at those indices
- Omit: processes all samples
### `--category`

Filters questions by category:

- Single: `"1"` tests only category 1 questions
- Multiple: `"0,2,3"` tests categories 0, 2, and 3
- Omit: tests all categories
### `--use-profile`

Controls how profile information is used:

- `none`: don't use profile information (default)
- `search`: search profiles when relevant
- `all`: always include profile information
### `--analyze-on`

Controls detailed error analysis:

- `wrong`: analyze only incorrect answers (default)
- `all`: analyze all answers
- `none`: skip detailed analysis
## Output Files

The system generates several output files.

### Test Results

File: `enhanced_memory_test_results_TIMESTAMP.json`
Contains:
- Complete test configuration and arguments
- Summary statistics (accuracy, timing, etc.)
- Category-wise performance breakdown
- Sample-wise detailed results
### Error Log

File: `qa_error_log_TIMESTAMP.txt`

For each incorrect answer, the log records:

- The question text
- The generated answer vs. the reference answer
- The retrieved memory content
- Evidence from the conversations
- A basic evaluation explanation
- A comprehensive error analysis (if enabled)
### Per-Category Error Logs

Files: `qa_error_log_TIMESTAMP_CAT_<category>.txt`

Separate error logs are written for each question category, making it easier to analyze category-specific issues.
### Memory Files

Directory: `memory/`

Contains per-character memory files:

- `<Character>_profile.txt`: character profile information
- `<Character>_events.txt`: extracted event memories
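For a two-speaker sample, the directory might look like this (speaker names are illustrative):

```
memory/
├── Caroline_profile.txt
├── Caroline_events.txt
├── Melanie_profile.txt
└── Melanie_events.txt
```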
## Results Format

The test generates a comprehensive JSON file (`enhanced_memory_test_results_TIMESTAMP.json`) with the following structure:

```json
{
"args": {
"data_file": "data/locomo10.json",
"sample_use": null,
"memory_dir": "memory",
"chat_deployment": "gpt-4o-mini",
"max_workers": 20,
"category": null,
"use_image": true,
"use_profile": "none",
"force_resum": false,
"no_eval": false,
"analyze_on": "wrong"
},
"script": "locomo_test.py --memory-dir=memory --max-workers=20",
"success": true,
"summary": {
"total_samples": 10,
"successful_samples": 10,
"total_sessions": 272,
"sessions_processed": 272,
"sessions_skipped": 0,
"total_questions": 1986,
"total_correct": 1754,
"overall_accuracy": 0.8831,
"category_stats": {
"1": {"total": 282, "correct": 249},
"2": {"total": 321, "correct": 297},
"3": {"total": 96, "correct": 74},
"4": {"total": 841, "correct": 798},
"5": {"total": 2, "correct": 2}
},
"category_accuracies": {
"1": 0.8830,
"2": 0.9252,
"3": 0.7708,
"4": 0.9489,
"5": 1.0
},
"total_time": 2866.21,
"avg_time_per_sample": 286.62
},
"sample_wise_stats": {
"1": {
"1": {"total": 32, "correct": 30, "accuracy": 0.9375},
"2": {"total": 37, "correct": 37, "accuracy": 1.0},
"3": {"total": 13, "correct": 13, "accuracy": 1.0},
"4": {"total": 70, "correct": 69, "accuracy": 0.9857},
"5": {"total": 2, "correct": 2, "accuracy": 1.0}
},
...
},
"detailed_results": [...]
}
```

## Console Output

During execution, you'll see:
- **Real-time progress**: sample and QA processing status
- **Real-time statistics**: accuracy updates after each sample
- **Final summary**: complete results with a category breakdown
Example console output:

```
ENHANCED MEMORY TEST RESULTS - UNIFIED MEMAGENT
================================================================
Samples processed: 10/10
Total sessions: 272
Sessions processed: 272
Total questions: 1986
Total correct: 1754
Overall accuracy: 88.31%

CATEGORY-WISE ACCURACY
================================================================
Category 1   249/282 (88.3%)   [33 errors]
Category 2   297/321 (92.5%)   [24 errors]
Category 3    74/ 96 (77.1%)   [22 errors]
Category 4   798/841 (94.9%)   [43 errors]
Category 5     2/  2 (100.0%)  [0 errors]
```
- **Overall Accuracy**: percentage of correctly answered questions
- **Category-wise Accuracy**: breakdown by question type
- **Error Count**: number of incorrect answers per category
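All of these can be recomputed offline from the results JSON. A minimal sketch, assuming the structure shown above:

```python
import json

# Substitute the actual timestamped filename produced by your run.
with open("enhanced_memory_test_results_TIMESTAMP.json") as f:
    summary = json.load(f)["summary"]

print(f"Overall accuracy: {summary['overall_accuracy']:.2%}")
for cat, stats in sorted(summary["category_stats"].items()):
    acc = stats["correct"] / stats["total"]
    errors = stats["total"] - stats["correct"]
    print(f"Category {cat}: {stats['correct']}/{stats['total']} ({acc:.1%}) [{errors} errors]")
```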
## Troubleshooting

**Error: OpenAI API key not configured**

Solution: ensure `OPENAI_API_KEY` is set in your `.env` file.

**Memory files already exist for characters: [...]**

Solution: use `--force-resum` to regenerate the memories, or delete the memory files manually.

**Rate limits**

If you encounter rate limits, reduce `--max-workers`:

```bash
python locomo_test.py --max-workers 1
```

**FileNotFoundError: data/locomo10.json**

Solution: ensure the data file exists at the specified path, or use `--data-file` to point to the correct location.
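If lowering `--max-workers` is not enough, wrapping the chat call in client-side exponential backoff also helps. A sketch assuming the `openai` v1 Python SDK (adapt it to wherever the completion call is made):

```python
import time
from openai import OpenAI, RateLimitError

client = OpenAI()  # reads OPENAI_API_KEY (and OPENAI_BASE_URL) from the environment

def chat_with_backoff(messages, model="gpt-4o-mini", max_retries=5):
    """Retry a chat completion with exponential backoff on rate limits."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError:
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s, ...
    raise RuntimeError(f"still rate-limited after {max_retries} retries")
```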
## Examples

A full run with all options specified:

```bash
python locomo_test.py \
    --data-file data/locomo10.json \
    --sample-use "5" \
    --chat-deployment gpt-4o \
    --memory-dir custom_memory \
    --max-workers 3 \
    --category "1,2,3" \
    --use-profile search \
    --analyze-on all
```

Common shorter invocations:

```bash
# Test only on category 0 (factual questions)
python locomo_test.py --category "0"

# Quick test on the first sample with detailed analysis
python locomo_test.py --sample-use "1" --analyze-on all

# Regenerate memories and test without evaluation
python locomo_test.py --force-resum --no-eval
```

## Performance Tips

- **Parallel processing**: increase `--max-workers` (3-10) for faster processing on powerful machines
- **Sample selection**: use `--sample-use` to test on a subset during development
- **Category focus**: use `--category` to focus on specific question types
- **Skip evaluation**: use `--no-eval` when only updating memories
## Citation

If you use this testing framework, please cite the LoCoMo benchmark paper:

```bibtex
@inproceedings{maharana2024locomo,
title = {Evaluating Very Long-Term Conversational Memory of LLM Agents},
author = {Adyasha Maharana and Dong-Ho Lee and Sergey Tulyakov and Mohit Bansal and Francesco Barbieri and Yuwei Fang},
booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)},
pages = {13851--13870},
year = {2024},
address = {Bangkok, Thailand},
publisher = {Association for Computational Linguistics},
doi = {10.18653/v1/2024.acl-long.747},
url = {https://aclanthology.org/2024.acl-long.747/}
}
```