151 changes: 151 additions & 0 deletions openseek/competition/pz/yuanboyang/README.md
@@ -0,0 +1,151 @@
# Final-Round Code (Fully Runnable Codebase)

## File Structure
```
project_root/
├── README.md                                  # usage guide (this file)
├── requirementsverl.txt                       # verl training environment dependencies
├── requirementstest.txt                       # test/evaluation environment dependencies
├── download.py                                # training-set download and preprocessing
└── verl/                                      # modified verl source
    ├── verl/utils/reward_score/geo3k.py       # modified reward function
    └── verl/examples/data_preprocess/gsm8k.py # validation-set download and preprocessing
```

## 1. Data Download and Processing
- Training set: `download.py`
- Validation set: `verl/examples/data_preprocess/gsm8k.py`

## 2. Code Changes
### Changes to the [verl](https://github.com/volcengine/verl) source
- Main changes:
- Data source and prompt changes:
- examples/data_preprocess/gsm8k.py:
- Before:
```python
import datasets
...
data_source = "openai/gsm8k"
dataset = datasets.load_dataset(data_source, "main")
train_dataset = dataset["train"]
test_dataset = dataset["test"]
```
- After:
```python
from modelscope.msdatasets import MsDataset
...
data_source = "hiyouga/geometry3k" # note: this source name is likely a typo; the loading code itself targets modelscope/gsm8k
train_dataset = MsDataset.load('modelscope/gsm8k', subset_name='main', split='train', trust_remote_code=True)
test_dataset = MsDataset.load('modelscope/gsm8k', subset_name='main', split='test', trust_remote_code=True)
```
- Before:
```python
instruction_following = 'Let\'s think step by step and output the final answer after "####".'
question = question_raw + " " + instruction_following
```
- After:
```python
instruction_following = instruction = r'Please reason step by step,and must put your final answer within \boxed{}.Question:'
question = instruction + " " + question_raw
```
- Changes forcing trust_remote_code=True:
- verl/model_merger/base_model_merger.py:
- Before:
```python
with init_empty_weights():
model = auto_model_class.from_config(
self.model_config, torch_dtype=torch.bfloat16, trust_remote_code=self.config.trust_remote_code
)
```
- After:
```python
with init_empty_weights():
model = auto_model_class.from_config(
self.model_config, torch_dtype=torch.bfloat16, trust_remote_code=True
)
```
- verl/trainer/main_ppo.py:
- Before:
```python
trust_remote_code = config.data.get("trust_remote_code", False)
```
- After:
```python
trust_remote_code = True
```
- verl/workers/fsdp_workers.py:
- Before:
```python
trust_remote_code=trust_remote_code
```
- After:
```python
trust_remote_code=True
```

- Modified the reward function in `verl/utils/reward_score/geo3k.py`:
- Before:
```python
pattern = re.compile(r"<think>.*</think>.*\\boxed\{.*\}.*", re.DOTALL)
```
- After:
```python
pattern = re.compile(r".*\\boxed\{.*\}.*", re.DOTALL)
```
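The effect of relaxing this pattern can be checked in isolation. A minimal sketch, where the pattern names and the sample response are mine (only the two regex strings come from the diff above):

```python
import re

# Old pattern requires a <think>...</think> block before \boxed{...};
# the new pattern only requires \boxed{...} somewhere in the response.
old_pattern = re.compile(r"<think>.*</think>.*\\boxed\{.*\}.*", re.DOTALL)
new_pattern = re.compile(r".*\\boxed\{.*\}.*", re.DOTALL)

response = r"Step 1: add the angles. Final answer: \boxed{42}"
print(old_pattern.fullmatch(response) is None)      # True: no <think> block, old pattern rejects it
print(new_pattern.fullmatch(response) is not None)  # True: relaxed pattern accepts it
```

So responses that skip the `<think>` scaffolding but still emit a boxed answer now receive format credit.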

### Changes to the [transformers](https://github.com/huggingface/transformers) source
- Modified file:
- `/root/miniconda3/envs/verl/lib/python3.10/site-packages/transformers/configuration_utils.py`
- Change:
- Changed line 917 to:
```python
json.dumps(config_dict, indent=2, sort_keys=False) + "\n"
```
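The visible effect of `sort_keys=False` is that the serialized config keeps its insertion order instead of alphabetizing keys. A small standalone check, where the config dict is a made-up example (not an actual transformers config):

```python
import json

# Made-up config dict: "model_type" is inserted before "architectures".
config_dict = {"model_type": "qwen2", "architectures": ["Qwen2ForCausalLM"]}

ordered = json.dumps(config_dict, indent=2, sort_keys=False) + "\n"
alphabetized = json.dumps(config_dict, indent=2, sort_keys=True) + "\n"

print(ordered.splitlines()[1].strip().startswith('"model_type"'))          # True: insertion order kept
print(alphabetized.splitlines()[1].strip().startswith('"architectures"'))  # True: keys sorted
```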
> **Review comment (severity: high), on lines +95 to +102:** Directly modifying library files inside site-packages is bad practice. It makes the environment fragile and hard to reproduce: anyone else setting up this project may forget to apply the patch by hand, leading to inconsistent behavior or errors. A better approach is to fork the transformers repository, apply your change there, and install from your fork, or to ship a .patch file plus a script that applies it.

## 3. Environment Dependencies
```bash
# verl environment
pip install -r requirementsverl.txt

# test/evaluation environment
pip install -r requirementstest.txt
```
## 4. Run Command
```bash
# Change data.train_files and actor_rollout_ref.model.path below
# to your own paths. (Inline comments after a trailing backslash would
# break the line continuation, so they are collected here.)
nohup env PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=/usr/train3.parquet \
data.train_batch_size=264 \
data.max_prompt_length=2048 \
data.max_response_length=512 \
actor_rollout_ref.model.path=/root/.cache/modelscope/hub/models/BAAI/OpenSeek-Small-v1-SFT \
actor_rollout_ref.actor.optim.lr=1e-5 \
actor_rollout_ref.actor.ppo_mini_batch_size=72 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.name=vllm \
+actor_rollout_ref.actor.fsdp_config.model_dtype=bf16 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
trainer.logger=tensorboard \
trainer.val_before_train=True \
trainer.n_gpus_per_node=6 \
trainer.nnodes=1 \
trainer.save_freq=200 \
trainer.test_freq=10 \
trainer.total_epochs=15 \
data.val_files=$HOME/data/gsm8k/test.parquet \
actor_rollout_ref.rollout.n=6 \
> train.log 2>&1 &
```

> **Review comment (severity: medium), on lines +114 to +139:** The run command contains several hard-coded paths (e.g. /usr/train3.parquet, /root/.cache/...), which makes the script hard to run in other environments. Consider passing these paths via environment variables or command-line arguments to improve portability. For example:
>
> ```bash
> # set environment variables in your script or shell
> export TRAIN_FILES=/path/to/your/train3.parquet
> export MODEL_PATH=/path/to/your/model
>
> # then use them in the command
> nohup env PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
>   data.train_files=$TRAIN_FILES \
>   actor_rollout_ref.model.path=$MODEL_PATH \
>   ...
> ```
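As a hedged sanity check of the batch-size arithmetic in the command above (the exact divisibility constraints are verl internals, so treat the rules asserted here as assumptions): 264 prompts with 6 rollouts each yield 1584 responses per step, which splits evenly into PPO mini-batches of 72.

```python
# Sanity-check the batch sizes used in the run command above.
# The divisibility rules are assumed, not taken from verl's docs.
train_batch_size = 264      # data.train_batch_size
rollout_n = 6               # actor_rollout_ref.rollout.n
ppo_mini_batch_size = 72    # actor_rollout_ref.actor.ppo_mini_batch_size
n_gpus_per_node = 6         # trainer.n_gpus_per_node

total_responses = train_batch_size * rollout_n
print(total_responses)                             # 1584
print(total_responses % ppo_mini_batch_size == 0)  # True
print(train_batch_size % n_gpus_per_node == 0)     # True
```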
## 5. Model Merging and Evaluation
### Model Merging
```bash
python3 -m verl.model_merger merge \
--backend fsdp \
--local_dir /usr/checkpoints/verl_examples/gsm8k/global_step_8000/actor \
--target_dir /usr/checkpoints/verl_examples/gsm8k/global_step_8000/actor/huggingface
```
### Evaluation
- Use the official script `/OpenSeek/evaluation/qwen_eval/sh/run_evaluate.sh`.
- All model paths above must be adjusted to your own locations.
89 changes: 89 additions & 0 deletions openseek/competition/pz/yuanboyang/download.py
@@ -0,0 +1,89 @@
import argparse
import os
from modelscope.msdatasets import MsDataset

def main():
    """
    Main entry point: load the dataset from ModelScope, process it,
    and save it as a Parquet file.
    """
    parser = argparse.ArgumentParser(description="Convert Big-Math dataset from ModelScope to a verl-compatible PARQUET format.")
    # Keep the output_file argument so the output path can be specified
    parser.add_argument("--output_file", type=str, required=True, help="Path for the output PARQUET file (e.g., train.parquet).")
    args = parser.parse_args()

    # Dataset information
    dataset_name = 'open-r1/Big-Math-RL-Verified-Processed'
    subset_name = 'all'
    split = 'train'
    data_source_name = "Big-Math"  # used to tag the data's origin
> **Review comment (severity: medium):** The variable data_source_name is defined but never used in the code. Consider removing unused variables to keep the code clean.
    print(f"Loading dataset '{dataset_name}' from ModelScope...")

    # 1. Load the dataset directly with MsDataset.load;
    #    this already yields a structured dataset object
    dataset = MsDataset.load(dataset_name, subset_name=subset_name, split=split)

    print(f"Loaded {len(dataset)} records. Starting preprocessing...")

    # 2. Define the processing function mapping raw records to the target format;
    #    it is applied to every record via .map()
    def process_fn(example, idx):
        # Extract the required fields from the raw record.
        # Note: the key names ('prompt', 'solution', etc.) must match the actual
        # column names of 'open-r1/Big-Math-RL-Verified-Processed'; adjust as needed.
        problem_raw = example.get("prompt", "")
        answer_clean = example.get("solution", "")
        domain = example.get("domain", [])
        solve_rate = example.get("llama8b_solve_rate", None)

        # Build the prompt content
        instruction = r'Please reason step by step,and must put your final answer within \boxed{}.Question:'
        prompt_content = instruction + " " + problem_raw

        # Build the reward_model field
        reward_model_data = {
            "style": "rule",
            "ground_truth": str(answer_clean)  # ensure it is a string
        }

        # Assemble the final data structure
        processed_data = {
            "data_source": 'hiyouga/geometry3k',
> **Review comment (severity: medium):** data_source is hard-coded to 'hiyouga/geometry3k', but the dataset being loaded is 'open-r1/Big-Math-RL-Verified-Processed'. This looks inconsistent and is likely a mistake, as already noted in the README.md. For clarity and to avoid potential errors, manage this value as a parameter or constant and keep it consistent with the processing logic.
>
> Suggested change:
> ```python
> -            "data_source": 'hiyouga/geometry3k',
> +            "data_source": data_source_name,
> ```
            "prompt": [
                {
                    "role": "user",
                    "content": prompt_content,
                }
            ],
            "ability": "math",
            "reward_model": reward_model_data,
            "extra_info": {
                "index": idx,
                "original_problem": problem_raw,
                "domain": domain,
                "llama8b_solve_rate": solve_rate,
            },
        }
        return processed_data

    # 3. Apply the processing function via .map();
    #    MsDataset's .map() implementation is generally very robust
    processed_dataset = dataset.map(function=process_fn, with_indices=True)

    print("Preprocessing complete.")

    # Make sure the output directory exists
    output_dir = os.path.dirname(args.output_file)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)

    # 4. Save the processed dataset directly to a Parquet file
    print(f"Saving output to '{args.output_file}'...")
    processed_dataset.to_parquet(args.output_file)
    # processed_dataset.to_json(args.output_file, lines=True, force_ascii=False)

    print("Conversion finished successfully!")


if __name__ == "__main__":
    main()
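The record mapping performed by process_fn can be exercised without ModelScope. A minimal sketch, where the sample record is invented and its keys mirror the ones the script assumes:

```python
# Standalone sketch of the process_fn mapping, applied to an invented record.
sample = {
    "prompt": "What is 2 + 2?",
    "solution": "4",
    "domain": ["arithmetic"],
    "llama8b_solve_rate": 0.99,
}
instruction = r'Please reason step by step,and must put your final answer within \boxed{}.Question:'
record = {
    "data_source": "Big-Math",  # the script hard-codes 'hiyouga/geometry3k' here instead
    "prompt": [{"role": "user", "content": instruction + " " + sample["prompt"]}],
    "ability": "math",
    "reward_model": {"style": "rule", "ground_truth": str(sample["solution"])},
    "extra_info": {
        "index": 0,
        "original_problem": sample["prompt"],
        "domain": sample["domain"],
        "llama8b_solve_rate": sample["llama8b_solve_rate"],
    },
}
print(record["reward_model"]["ground_truth"])                      # 4
print(record["prompt"][0]["content"].endswith("What is 2 + 2?"))   # True
```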