Merged
32 commits
1217838
fix(bot): align auth with other endpoints using get_request_context
chenjw Mar 30, 2026
69768cb
refactor(memory): improve URI handling and memory extraction
chenjw Mar 31, 2026
0b66e9f
feat(session): add created_at support for add_message
chenjw Mar 31, 2026
0a2b54d
fix(telemetry): add span ended check and AsyncioInstrumentor for back…
chenjw Apr 1, 2026
821c802
update
chenjw Apr 1, 2026
a4e73a7
chore: remove unused openviking/telemetry/langfuse.py
chenjw Apr 1, 2026
d775c6d
rebase
chenjw Apr 2, 2026
3d9071b
feat: print trace_id in import_to_ov.py for tracing
chenjw Apr 2, 2026
5359acd
style: format Python files with ruff
chenjw Apr 2, 2026
9006225
fix: read API key from a local file instead of hardcoding it
chenjw Apr 2, 2026
2e56fed
fix: resolve lint errors in tracer.py
chenjw Apr 2, 2026
2a5f9c3
feat: merge _apply_write into _apply_edit and add content_template rendering logs
chenjw Apr 2, 2026
5593631
fix: fix lint errors
chenjw Apr 2, 2026
ede6192
update
chenjw Apr 2, 2026
79eabe3
Revert "update"
chenjw Apr 2, 2026
d108af5
update
chenjw Apr 2, 2026
45ac1b7
format
chenjw Apr 2, 2026
d66b14b
feat: add soul and identity memory templates with init_value initialization support
chenjw Apr 2, 2026
81a0c8e
fix: correct create_default_registry import path
chenjw Apr 2, 2026
a3b41ae
feat: add time-context injection for LoCoMo evaluation
chenjw Apr 3, 2026
bcc0fa7
chore: memory-related optimizations
chenjw Apr 3, 2026
896da17
refactor: unify memory file write logic and fix template rendering issues
chenjw Apr 4, 2026
d940db4
update
chenjw Apr 4, 2026
f48bf35
feat: memory performance optimizations and template fixes
chenjw Apr 5, 2026
47779b8
feat: run_eval.py uses question_id as session_id for fully independent parallel runs
chenjw Apr 5, 2026
6dadce8
refactor: change the evaluation scripts' default path to the benchmark/locomo/data directory
chenjw Apr 5, 2026
2593824
rebase
chenjw Apr 5, 2026
15e5fcc
rebase
chenjw Apr 5, 2026
f7514e3
feat: add --skip-import switch to the evaluation scripts and fix is_invalid matching logic
chenjw Apr 6, 2026
61ad650
refactor: improve memory prompt templates and entities handling
chenjw Apr 6, 2026
dd10df5
docs: update LoCoMo evaluation docs and add single-question testing instructions
chenjw Apr 6, 2026
8c9b504
fix
chenjw Apr 6, 2026
41 changes: 32 additions & 9 deletions benchmark/locomo/README.md
@@ -10,9 +10,10 @@ benchmark/locomo/
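│ ├── run_eval.py # Run QA evaluation
│ ├── judge.py # LLM judge scoring
│ ├── import_to_ov.py # Import data into OpenViking
-│ ├── stat_judge_result.py # Summarize scoring results
-│ ├── run_full_eval.sh # Run the full evaluation pipeline in one command
-│ ├── test_data/ # Test data directory
+│ ├── import_and_eval_one.sh # Single-question / batch test script
+│ ├── stat_judge_result.py # Summarize scoring results
+│ ├── run_full_eval.sh # Run the full evaluation pipeline in one command
+│ ├── data/ # Test data directory
│ └── result/ # Evaluation results directory
└── openclaw/ # OpenClaw evaluation scripts
└── eval.py # OpenClaw evaluation script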
@@ -28,11 +29,33 @@ benchmark/locomo/

```bash
cd benchmark/locomo/vikingbot
-bash run_full_eval.sh
+bash run_full_eval.sh # Full pipeline
+bash run_full_eval.sh --skip-import # Skip import, evaluate only
```

The script runs the following four steps in order:

+### Single-question / batch testing
+
+Use `import_and_eval_one.sh` to quickly test a single question, or to batch-test every question in one sample:
+
+```bash
+cd benchmark/locomo/vikingbot
+```
+
+**Single-question test:**
+```bash
+./import_and_eval_one.sh 0 2 # sample index 0, question 2
+./import_and_eval_one.sh conv-26 2 # sample_id conv-26, question 2
+./import_and_eval_one.sh conv-26 2 --skip-import # skip import
+```
+
+**Batch test for a single sample:**
+```bash
+./import_and_eval_one.sh conv-26 # all questions in conv-26
+./import_and_eval_one.sh conv-26 --skip-import
+```
+
### Step-by-step usage

#### Step 1: Import conversation data
@@ -44,7 +67,7 @@ python import_to_ov.py --input <data file path> [options]
```

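**Parameters:**
-- `--input`: Input file path (JSON or TXT format), default `./test_data/locomo10.json`
+- `--input`: Input file path (JSON or TXT format), default `./data/locomo10.json`
- `--sample`: Sample index (0-based); defaults to processing all samples
- `--sessions`: Session range, e.g. `1-4` or `3`; defaults to all sessions
- `--parallel`: Number of concurrent imports, default 5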
@@ -55,10 +78,10 @@ python import_to_ov.py --input <data file path> [options]
**Examples:**
```bash
# Import sessions 1-4 of the first sample
-python import_to_ov.py --input ./test_data/locomo10.json --sample 0 --sessions 1-4
+python import_to_ov.py --input ./data/locomo10.json --sample 0 --sessions 1-4

# Force re-import of all data
-python import_to_ov.py --input ./test_data/locomo10.json --force-ingest
+python import_to_ov.py --input ./data/locomo10.json --force-ingest
```

#### Step 2: Run QA evaluation
@@ -70,7 +93,7 @@ python run_eval.py <input data> [options]
```

**Parameters:**
-- `input`: Input JSON/CSV file path, default `./test_data/locomo10.json`
+- `input`: Input JSON/CSV file path, default `./data/locomo10.json`
- `--output`: Output CSV file path, default `./result/locomo_qa_result.csv`
- `--sample`: Sample index
- `--count`: Number of QA questions to run, default all
@@ -82,7 +105,7 @@ python run_eval.py <input data> [options]
python run_eval.py

# Specify input and output files, using 20 threads
-python run_eval.py ./test_data/locomo_qa_1528.csv --output ./result/my_result.csv --threads 20
+python run_eval.py ./data/locomo_qa_1528.csv --output ./result/my_result.csv --threads 20
```

#### Step 3: LLM judge scoring
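The README examples pass either a numeric sample index (`0`) or a sample_id (`conv-26`) as the first argument. A minimal sketch of how the two forms can be told apart, assuming the same regular expression that `import_and_eval_one.sh` uses; the function name `classify_sample` is hypothetical and is not part of this PR:

```bash
#!/bin/bash
# Hypothetical helper (not in the PR): classify the first positional argument.
classify_sample() {
    local sample="$1"
    # Same pattern as the script: an optional minus sign followed only by digits.
    if [[ "$sample" =~ ^-?[0-9]+$ ]]; then
        echo "index"
    else
        echo "sample_id"
    fi
}

classify_sample 0        # prints "index"
classify_sample conv-26  # prints "sample_id"
```

Anything that is not purely numeric is treated as a sample_id, so a mistyped index such as `0a` would go through the sample_id lookup and be reported as not found.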
217 changes: 217 additions & 0 deletions benchmark/locomo/vikingbot/import_and_eval_one.sh
@@ -0,0 +1,217 @@
#!/bin/bash
# Single-question / batch test script: import conversations, then ask questions to verify
#
# Usage:
# ./import_and_eval_one.sh 0 2 # sample 0, question 2 (single question)
# ./import_and_eval_one.sh conv-26 2 # sample_id conv-26, question 2 (single question)
# ./import_and_eval_one.sh conv-26 # sample_id conv-26, all questions (batch)
# ./import_and_eval_one.sh conv-26 2 --skip-import # skip import, evaluate directly
# ./import_and_eval_one.sh conv-26 --skip-import # skip import, batch evaluation

set -e

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
SKIP_IMPORT=false

# Parse arguments
for arg in "$@"; do
    if [ "$arg" = "--skip-import" ]; then
        SKIP_IMPORT=true
    fi
done

# Filter out --skip-import to get the positional arguments
ARGS=()
for arg in "$@"; do
    if [ "$arg" != "--skip-import" ]; then
        ARGS+=("$arg")
    fi
done

SAMPLE=${ARGS[0]}
QUESTION_INDEX=${ARGS[1]}
INPUT_FILE="$SCRIPT_DIR/../data/locomo10.json"

if [ -z "$SAMPLE" ]; then
    echo "Usage: $0 <sample_index|sample_id> [question_index] [--skip-import]"
    echo " sample_index: numeric index (0,1,2...) or sample_id (conv-26)"
    echo " question_index: question index (optional); if omitted, all questions of the sample are tested"
    echo " --skip-import: skip the import step and evaluate against already-imported data"
    exit 1
fi

# Determine whether SAMPLE is a numeric index or a sample_id
if [[ "$SAMPLE" =~ ^-?[0-9]+$ ]]; then
    SAMPLE_INDEX=$SAMPLE
    SAMPLE_ID_FOR_CMD=$SAMPLE_INDEX
    echo "Using sample index: $SAMPLE_INDEX"
else
    # Look up the index by sample_id
    SAMPLE_INDEX=$(python3 -c "
import json
data = json.load(open('$INPUT_FILE'))
for i, s in enumerate(data):
    if s.get('sample_id') == '$SAMPLE':
        print(i)
        break
else:
    print('NOT_FOUND')
")
    if [ "$SAMPLE_INDEX" = "NOT_FOUND" ]; then
        echo "Error: sample_id '$SAMPLE' not found"
        exit 1
    fi
    SAMPLE_ID_FOR_CMD=$SAMPLE
    echo "Using sample_id: $SAMPLE (index: $SAMPLE_INDEX)"
fi

# Decide between single-question mode and batch mode
if [ -n "$QUESTION_INDEX" ]; then
    # ========== Single-question mode ==========
    echo "=== Single-question mode: sample $SAMPLE, question $QUESTION_INDEX ==="

    # Import conversations (only the sessions relevant to the question)
    if [ "$SKIP_IMPORT" = "true" ]; then
        echo "[1/3] Skipping import (--skip-import)"
    else
        echo "[1/3] Importing sample $SAMPLE_INDEX, question $QUESTION_INDEX..."
        # Resolve the script via SCRIPT_DIR so this also works when run from
        # benchmark/locomo/vikingbot, as the README instructs
        python "$SCRIPT_DIR/import_to_ov.py" \
            --input "$INPUT_FILE" \
            --sample "$SAMPLE_INDEX" \
            --question-index "$QUESTION_INDEX" \
            --force-ingest

        echo "Waiting for data processing..."
        sleep 3
    fi

    # Run evaluation
    if [ "$SKIP_IMPORT" = "true" ]; then
        echo "[1/2] Running evaluation (skip-import mode)..."
    else
        echo "[2/3] Running evaluation..."
    fi
    if [[ "$SAMPLE" =~ ^-?[0-9]+$ ]]; then
        # Numeric index: use the default output file
        OUTPUT_FILE=./result/locomo_qa_result.csv
        python "$SCRIPT_DIR/run_eval.py" \
            "$INPUT_FILE" \
            --sample "$SAMPLE_ID_FOR_CMD" \
            --question-index "$QUESTION_INDEX" \
            --count 1
    else
        # sample_id mode: update the batch result file in place
        OUTPUT_FILE="./result/locomo_${SAMPLE}_result.csv"
        python "$SCRIPT_DIR/run_eval.py" \
            "$INPUT_FILE" \
            --sample "$SAMPLE_ID_FOR_CMD" \
            --question-index "$QUESTION_INDEX" \
            --count 1 \
            --output "$OUTPUT_FILE" \
            --update-mode
    fi

    # Run judge scoring
    if [ "$SKIP_IMPORT" = "true" ]; then
        echo "[2/2] Running judge..."
    else
        echo "[3/3] Running judge..."
    fi
    python "$SCRIPT_DIR/judge.py" --input "$OUTPUT_FILE" --parallel 1

    # Print the result
    echo ""
    echo "=== Evaluation result ==="
    python3 -c "
import csv
import json

question_index = $QUESTION_INDEX

with open('$OUTPUT_FILE') as f:
    reader = csv.DictReader(f)
    rows = list(reader)

# Find the row for the requested question_index
row = None
for r in rows:
    if int(r.get('question_index', -1)) == question_index:
        row = r
        break

if row is None:
    # Fall back to the last row if no match was found
    row = rows[-1]

# Parse evidence_text
evidence_text = json.loads(row.get('evidence_text', '[]'))
evidence_str = '\\n'.join(evidence_text) if evidence_text else ''

print(f\"Question: {row['question']}\")
print(f\"Expected answer: {row['answer']}\")
print(f\"Model response: {row['response']}\")
print(f\"Evidence text:\\n{evidence_str}\")
print(f\"Result: {row.get('result', 'N/A')}\")
print(f\"Reasoning: {row.get('reasoning', 'N/A')}\")
"

else
    # ========== Batch mode ==========
    echo "=== Batch mode: sample $SAMPLE, all questions ==="

    # Count the questions in this sample
    QUESTION_COUNT=$(python3 -c "
import json
data = json.load(open('$INPUT_FILE'))
sample = data[$SAMPLE_INDEX]
print(len(sample.get('qa', [])))
")
    echo "Found $QUESTION_COUNT questions for sample $SAMPLE"

    # Import all sessions
    if [ "$SKIP_IMPORT" = "true" ]; then
        echo "[1/4] Skipping import (--skip-import)"
    else
        echo "[1/4] Importing all sessions for sample $SAMPLE_INDEX..."
        # Resolve the script via SCRIPT_DIR so this works from any working directory
        python "$SCRIPT_DIR/import_to_ov.py" \
            --input "$INPUT_FILE" \
            --sample "$SAMPLE_INDEX" \
            --force-ingest

        echo "Waiting for data processing..."
        sleep 10
    fi

    # Run evaluation (all questions)
    if [ "$SKIP_IMPORT" = "true" ]; then
        echo "[1/3] Running evaluation for all questions (skip-import mode)..."
    else
        echo "[2/4] Running evaluation for all questions..."
    fi
    OUTPUT_FILE="./result/locomo_${SAMPLE}_result.csv"
    python "$SCRIPT_DIR/run_eval.py" \
        "$INPUT_FILE" \
        --sample "$SAMPLE_ID_FOR_CMD" \
        --output "$OUTPUT_FILE" \
        --threads 5

    # Run judge scoring
    if [ "$SKIP_IMPORT" = "true" ]; then
        echo "[2/3] Running judge..."
    else
        echo "[3/4] Running judge..."
    fi
    python "$SCRIPT_DIR/judge.py" --input "$OUTPUT_FILE" --parallel 5

    # Print summary statistics
    if [ "$SKIP_IMPORT" = "true" ]; then
        echo "[3/3] Calculating statistics..."
    else
        echo "[4/4] Calculating statistics..."
    fi
    python "$SCRIPT_DIR/stat_judge_result.py" --input "$OUTPUT_FILE"

    echo ""
    echo "=== Batch evaluation complete ==="
    echo "Result file: $OUTPUT_FILE"
fi
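
The script walks `"$@"` twice: once to detect `--skip-import` and once to collect the remaining positional arguments. Both can be done in a single pass. The sketch below is a possible refactor, not code from this PR; the function name `parse_args` is hypothetical, while `SKIP_IMPORT` and `ARGS` mirror the variables the script already uses:

```bash
#!/bin/bash
# Hypothetical refactor (not in the PR): detect the flag and collect
# positional arguments in one loop instead of two.
parse_args() {
    SKIP_IMPORT=false
    ARGS=()
    local arg
    for arg in "$@"; do
        case "$arg" in
            --skip-import) SKIP_IMPORT=true ;;  # flag may appear at any position
            *) ARGS+=("$arg") ;;                # everything else is positional
        esac
    done
}

parse_args conv-26 2 --skip-import
echo "sample=${ARGS[0]} question=${ARGS[1]:-} skip_import=$SKIP_IMPORT"
```

A `case` statement also leaves room for further flags later without adding another loop per flag.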