FlagAI-Open
diff --git a/evaluation/qwen_eval/README.md b/evaluation/qwen_eval/README.md
Lines changed: 17 additions & 3 deletions
diff --git a/evaluation/qwen_eval/__pycache__/evaluate_final.cpython-310.pyc b/evaluation/qwen_eval/__pycache__/evaluate_final.cpython-310.pyc
8.8 KB
diff --git a/evaluation/qwen_eval/__pycache__/grader.cpython-310.pyc b/evaluation/qwen_eval/__pycache__/grader.cpython-310.pyc
6.73 KB
diff --git a/evaluation/qwen_eval/__pycache__/model_utils.cpython-310.pyc b/evaluation/qwen_eval/__pycache__/model_utils.cpython-310.pyc
5.95 KB
diff --git a/evaluation/qwen_eval/__pycache__/parser.cpython-310.pyc b/evaluation/qwen_eval/__pycache__/parser.cpython-310.pyc
12.7 KB
diff --git a/evaluation/qwen_eval/__pycache__/python_executor.cpython-310.pyc b/evaluation/qwen_eval/__pycache__/python_executor.cpython-310.pyc
6.56 KB
diff --git a/evaluation/qwen_eval/__pycache__/trajectory.cpython-310.pyc b/evaluation/qwen_eval/__pycache__/trajectory.cpython-310.pyc
5.24 KB
diff --git a/evaluation/qwen_eval/__pycache__/utils.cpython-310.pyc b/evaluation/qwen_eval/__pycache__/utils.cpython-310.pyc
9.41 KB
diff --git a/evaluation/qwen_eval/data/aime24/test.jsonl b/evaluation/qwen_eval/data/aime24/test.jsonl
Lines changed: 30 additions & 0 deletions
diff --git a/evaluation/qwen_eval/data/amc23/test.jsonl b/evaluation/qwen_eval/data/amc23/test.jsonl
Lines changed: 40 additions & 0 deletions
@@ -9,6 +9,20 @@ pip install -r requirements.txt
 
 
 ## Usage
+### Generation & Evaluation
+
+To run the evaluation:
+
+1. **Configure the prompt type**: Add your custom prompt type in `utils.py`, then select it in `sh/run_evaluate.sh`.
+
+2. **Set the model path**: Set the `MODEL_NAME_OR_PATH` variable in `sh/run_evaluate.sh` to your model's path.
+
+3. **Run evaluation**: Execute the following command to generate predictions and evaluate results:
+   ```bash
+   bash sh/run_evaluate.sh
+   ```
+
+### Leaderboard
 ```bash
 BASE_DIR=./eval_example/
 
@@ -17,11 +31,11 @@ python evaluate_final.py --eval_path $BASE_DIR
 `BASE_DIR` is the directory that contains one folder per dataset. Its structure is as follows:
 ```bash
 BASE_DIR
-├── math500
+├── dataset1
 │   ├── example.jsonl
-├── minerva_math
+├── dataset2
 │   ├── example.jsonl
-├── olympiadbench
+├── dataset3
 │   ├── example.jsonl
 ```
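
Before running the leaderboard step, it can help to confirm that a directory actually follows the layout shown above. The following is a minimal sketch, not part of `evaluate_final.py`: the function name `summarize_eval_dir` is hypothetical, and it assumes only what the README states, namely that each dataset folder sits directly under `BASE_DIR` and holds one or more `.jsonl` files.

```python
import json
from pathlib import Path


def summarize_eval_dir(base_dir):
    """Map each dataset folder under base_dir to {jsonl file name: record count}.

    Hypothetical helper for checking the directory layout; folders that
    contain no .jsonl files are omitted from the result.
    """
    summary = {}
    for dataset_dir in sorted(Path(base_dir).iterdir()):
        if not dataset_dir.is_dir():
            continue  # ignore stray files next to the dataset folders
        counts = {}
        for jsonl_path in sorted(dataset_dir.glob("*.jsonl")):
            records = 0
            with jsonl_path.open(encoding="utf-8") as fh:
                for line in fh:
                    if not line.strip():
                        continue  # skip blank lines
                    json.loads(line)  # raises ValueError on malformed JSON
                    records += 1
            counts[jsonl_path.name] = records
        if counts:
            summary[dataset_dir.name] = counts
    return summary
```

Calling `summarize_eval_dir("./eval_example/")` returns a nested dictionary keyed by dataset folder name, so a missing dataset or an empty prediction file is easy to spot before scores are computed.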