272 changes: 272 additions & 0 deletions openseek/competition/pz/losercheems/final/README.md
@@ -0,0 +1,272 @@
# OpenSeek KTO Alignment – Technical Report

This document provides a detailed, end-to-end technical description of the KTO (Kahneman–Tversky Optimization) preference / safety / alignment fine-tuning pipeline implemented under the `final/` directory. The workflow has four major stages:

1) Asset & dataset acquisition (`scripts/download.py`): download the SFT base model + tokenizer and pull the raw dataset.
2) Dataset transformation (`scripts/kto_datasets_process.py`): convert the raw dataset into a KTO-compatible preference format.
3) Alignment training (`trainer/kto.py`) using TRL's `KTOTrainer` with DeepSpeed ZeRO-2 (`recipes/accelerate_configs/zero2.yaml`) and training hyperparameters (`recipes/openseek/config.yaml`), launched by `train.sh`. Checkpoints saved every 1,000 steps.
4) Evaluation (`eval_example/`): contains benchmark outputs and aggregate metrics for the final checkpoint.

---
## Public Checkpoint

The KTO alignment checkpoint is released at: `JingzeShi/OpenSeek-1.4B-A0.4B-KTO`.

Typical load snippet:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "JingzeShi/OpenSeek-1.4B-A0.4B-KTO"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```
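
For a quick smoke test, a minimal generation sketch (it assumes the released tokenizer ships a chat template; the sampling parameters are illustrative only):

```python
import torch

# Chat-formatted prompt (assumes the tokenizer provides a chat template).
messages = [{"role": "user", "content": "Solve: 2x + 3 = 7"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.6)

# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```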
---

## Directory Overview

```
final/
  recipes/
    accelerate_configs/
      zero2.yaml              # Accelerate + DeepSpeed ZeRO Stage 2 configuration
    openseek/
      config.yaml             # KTO training hyperparameters (Trainer-compatible YAML)
  scripts/
    download.py               # Download model/tokenizer + raw dataset (NuminaMath-CoT)
    kto_datasets_process.py   # Transform dataset → KTO preference format
  trainer/
    kto.py                    # Main training entry (KTOTrainer)
  eval_example/               # Example evaluation results for final checkpoint
      final_result.json       # Aggregated metrics summary
    <benchmark_name>/         # Per-benchmark JSONL + metrics
  README.md                   # (This report)
```

**Review comment on lines +39 to +40 (medium):** The indentation of `final_result.json` in the directory overview seems incorrect. It appears to be a file inside the `<benchmark_name>` directory, but based on the description it should likely sit at the same level as `<benchmark_name>`.

Suggested change:

```
  eval_example/               # Example evaluation results for final checkpoint
    final_result.json         # Aggregated metrics summary
    <benchmark_name>/         # Per-benchmark JSONL + metrics
```

---
## Stage 1: Download Base Assets (`scripts/download.py`)

Key actions:
- Downloads tokenizer & model from `BAAI/OpenSeek-Small-v1-SFT` (already SFT-prepared base for alignment).
- Saves them locally under `./models/OpenSeek-Small-v1-SFT` for reproducible offline reuse.
- Loads the raw dataset: `AI-MO/NuminaMath-CoT` from Hugging Face Hub.
- Persists dataset to disk: `./datasets/AI-MO/NuminaMath-CoT` (Arrow + metadata) to avoid repeated network fetches.

Environment variables (optional but recommended):
- `HF_ENDPOINT=https://hf-mirror.com` (for regional mirrors)
- `XDG_CACHE_HOME=./cache` (centralize HF cache)

Execution:
```bash
python scripts/download.py
```

Outputs:
- `./models/OpenSeek-Small-v1-SFT/` (model weights, tokenizer files, config)
- `./datasets/AI-MO/NuminaMath-CoT/` (train/validation splits as provided by source dataset)
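
Later stages consume these local copies directly; a minimal sketch of offline reuse (paths as produced above):

```python
from datasets import load_from_disk
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload everything from disk instead of re-fetching from the Hub.
tokenizer = AutoTokenizer.from_pretrained("./models/OpenSeek-Small-v1-SFT", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("./models/OpenSeek-Small-v1-SFT", trust_remote_code=True)
raw_dataset = load_from_disk("./datasets/AI-MO/NuminaMath-CoT")
print(raw_dataset)
```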

---
## Stage 2: Dataset Transformation for KTO (`scripts/kto_datasets_process.py`)

Objective:
Convert the original multi-turn / message-style math reasoning dataset (`NuminaMath-CoT`) into a simplified preference alignment format required by KTO: each example should expose a prompt, a completion, and a binary label.

Implementation specifics:
- Loads the previously saved raw dataset from disk.
- For each sample, extracts the first two entries in `messages`:
- `messages[0]` → becomes a single-element list assigned to `prompt`.
- `messages[1]` → becomes a single-element list assigned to `completion`.
- Assigns `label = True` for all entries (i.e., all are treated as preferred / positive examples).
- Selects only the columns `["prompt", "completion", "label"]`.
- Saves the processed dataset to: `./datasets/AI-MO/NuminaMath-CoT-preference`.

Command:
```bash
python scripts/kto_datasets_process.py
```

Resulting dataset schema (per split):
```
{
"prompt": List[Any] # list-wrapped message dict(s) or text segment(s)
"completion": List[Any] # list-wrapped assistant answer
"label": bool # True → preferred sample
}
```

Example (illustrative, not verbatim):
```json
{
"prompt": [ {"role": "user", "content": "Solve: 2x + 3 = 7"} ],
"completion": [ {"role": "assistant", "content": "x = 2"} ],
"label": true
}
```

Notes & Considerations:
- Current processing creates only positive (`True`) labels. KTO can also leverage implicit negatives or additional heuristics; if extending, introduce negative variants (e.g., alternative incorrect completions) with `label=False`. A minimal sketch follows this list.
- Left vs. right padding is handled later in tokenizer setup; generation-oriented alignment training usually uses left padding so that the most recent tokens stay right-aligned within each padded batch.
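
A hypothetical sketch of generating cheap negatives (the mismatch heuristic, output path, and helper names here are illustrative and are not part of the current pipeline):

```python
from datasets import load_from_disk, concatenate_datasets

def make_negative(example, idx, completions):
    # Pair each prompt with the completion of a *different* example and mark it non-preferred.
    wrong = completions[(idx + 1) % len(completions)]
    return {"prompt": example["prompt"], "completion": wrong, "label": False}

dataset = load_from_disk("./datasets/AI-MO/NuminaMath-CoT-preference")
train = dataset["train"]
completions = train["completion"]  # materialize the positive completions once

negatives = train.map(make_negative, with_indices=True, fn_kwargs={"completions": completions})
mixed = concatenate_datasets([train, negatives]).shuffle(seed=233)
mixed.save_to_disk("./datasets/AI-MO/NuminaMath-CoT-preference-with-negatives")
```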

---
## Stage 3: Alignment Training (KTO) – `trainer/kto.py`

### 3.1 Launch Mechanism
Training is launched via Accelerate + DeepSpeed ZeRO Stage 2 for memory efficiency and multi-GPU scaling. The Slurm / shell entry is encapsulated in `train.sh`:
```bash
ACCELERATE_LOG_LEVEL=info accelerate launch \
--config_file recipes/accelerate_configs/zero2.yaml \
./trainer/kto.py --config recipes/openseek/config.yaml
```

### 3.2 Accelerate + DeepSpeed Config (`zero2.yaml`)
Key parameters:
- `distributed_type: DEEPSPEED` & `zero_stage: 2` → ZeRO-2 shards optimizer states and gradients (parameters are not partitioned as in ZeRO-3, but overhead is lower).
- `mixed_precision: bf16` → uses BF16 if supported (Ampere+); more numerically stable than FP16 on many math-heavy workloads.
- `num_processes: 8` → should match the number of visible GPUs (adjust to your cluster allocation).
- No optimizer or parameter CPU offload (`offload_*: none`) to reduce PCIe pressure (requires enough GPU RAM).

### 3.3 Training Hyperparameters (`config.yaml`)
Extracted key fields:
- Logging & checkpointing: `logging_steps: 1`, `save_steps: 1000`, `save_total_limit: 1` (keeps only the latest checkpoint to save disk).
- Model source: `model_name_or_path: ./models/OpenSeek-Small-v1-SFT` (the SFT base from Stage 1).
- Attention backend: `attn_implementation: flash_attention_2` (ensure FlashAttention v2 build compatibility).
- Data: `dataset_name: /workspace/datasets/AI-MO/NuminaMath-CoT-preference` (adjust to your actual path if different); `max_length: 4096`.
- Optimization:
- `learning_rate: 2e-5`
- Scheduler: `cosine_with_min_lr` + `min_lr_rate: 0.1` (final LR = base_lr * 0.1 at tail).
- `warmup_ratio: 0.1`
- `weight_decay: 0.01`
- `gradient_accumulation_steps: 2` (effective batch = per_device * GPUs * accum).
- `gradient_checkpointing: true` + `use_reentrant: false` (saves memory at cost of extra compute).
- `max_grad_norm: 1.0` (gradient clipping).
- `bf16: True` (reinforces BF16 usage in Trainer config).
- Custom flags: `use_liger_kernel`, `use_liger_loss` (fused Liger kernels / fused loss; ensure the required extension, e.g. the `liger-kernel` package, is installed).
- Epoch count: `num_train_epochs: 4`.
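
For reference, with the shipped values (`per_device_train_batch_size: 8`, `num_processes: 8` in `zero2.yaml`, `gradient_accumulation_steps: 2`) the effective global batch size is 8 × 8 × 2 = 128 sequences per optimizer step.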

### 3.4 Tokenizer & Padding (`kto.py`)
- The tokenizer uses left padding (`tokenizer.padding_side = "left"`), typical for generation-focused alignment: completions stay right-aligned within each padded batch, which some attention kernels handle more efficiently.
- If tokenizer lacks a `pad_token`, it falls back to `eos_token`.
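
A minimal sketch of that tokenizer setup (the authoritative code lives in `trainer/kto.py`):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "./models/OpenSeek-Small-v1-SFT", trust_remote_code=True
)
tokenizer.padding_side = "left"      # left padding for generation-style batching
if tokenizer.pad_token is None:      # fall back to EOS when no dedicated pad token exists
    tokenizer.pad_token = tokenizer.eos_token
```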

### 3.5 Model & Reference Model
- Both `model` and `ref_model` are loaded from the same base. KTO uses the reference model to compute relative preference signals / calibration. Keeping them identical at initialization is standard.
- Quantization hooks (via `get_quantization_config`) are available but not explicitly set in the provided configs (would allow 4/8-bit experiments if desired).
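
A sketch of loading the policy / reference pair (dtype and attention backend follow `config.yaml`; treat this as an approximation of `kto.py`, not a verbatim copy):

```python
import torch
from transformers import AutoModelForCausalLM

model_kwargs = dict(
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # switch to "sdpa" if FlashAttention 2 is unavailable
)
model = AutoModelForCausalLM.from_pretrained("./models/OpenSeek-Small-v1-SFT", **model_kwargs)
ref_model = AutoModelForCausalLM.from_pretrained("./models/OpenSeek-Small-v1-SFT", **model_kwargs)
```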

### 3.6 Trainer Initialization
- Uses TRL `KTOTrainer` with:
- `train_dataset=dataset[script_args.dataset_train_split]` (defaults typically `train`)
- Optional eval dataset only if `eval_strategy != "no"` (currently disabled for speed).
- `peft_config=get_peft_config(model_args)` (enables LoRA or other parameter-efficient fine-tuning if configured in `ModelConfig`). If PEFT is not configured, training falls back to full fine-tuning.
- `use_cache` is disabled during training if gradient checkpointing is on.
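
Putting the pieces together, the trainer construction looks roughly as follows (in the real entry point the `KTOConfig` is parsed from `recipes/openseek/config.yaml`; newer TRL releases accept `processing_class=`, older ones `tokenizer=`):

```python
from datasets import load_from_disk
from trl import KTOConfig, KTOTrainer

training_args = KTOConfig(
    output_dir="data/OpenSeek-1.4B-A0.4B-KTO",  # values normally come from config.yaml
    bf16=True,
    gradient_checkpointing=True,
)
dataset = load_from_disk("./datasets/AI-MO/NuminaMath-CoT-preference")

trainer = KTOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=dataset["train"],
    processing_class=tokenizer,
    # peft_config=get_peft_config(model_args),  # optional LoRA / PEFT setup
)
trainer.train()
trainer.save_model()
```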

### 3.7 Checkpoint Artifacts
At each save step (every 1,000 steps):
- Model weights (BF16)
- Trainer state (optimizer and scheduler state, subject to how DeepSpeed ZeRO-2 partitions them)
- RNG states for reproducibility
Because `save_total_limit: 1`, only the latest checkpoint directory is retained (rolling deletion of older ones). If you intend to run model soup or regression comparisons, increase this limit.
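
If a run is interrupted, standard `Trainer` resumption applies; a minimal sketch:

```python
# Resumes from the most recent checkpoint-* directory under output_dir.
trainer.train(resume_from_checkpoint=True)
```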

### 3.8 Performance & Memory Tips
- If encountering OOM:
- Lower `per_device_train_batch_size`
- Increase `gradient_accumulation_steps`
- Reduce `max_length`
- Enable quantization (4-bit/8-bit) if latency acceptable
- If throughput is low:
- Ensure FlashAttention 2 is correctly installed (or switch to `sdpa` fallback)
  - Disable unnecessary logging (`logging_steps: 1` is useful during early debugging; raise it later)

---
## Stage 4: Evaluation (`eval_example/`)

This stage reports benchmark results using a unified Chain-of-Thought prompting configuration.

```
PROMPT_TYPE="cot"
aime24: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072
amc23: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072
gsm8k: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072
math500: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072
minerva_math: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072
olympiadbench: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072
```

Directory layout (unchanged):
```
eval_example/
final_result.json # Aggregated metrics
<benchmark>/
<run_id>_metrics.json # Summary metrics
<run_id>_result.jsonl # Raw generations
```

`final_result.json` consolidates the per-benchmark metrics produced under the above consistent decoding / prompting setup.
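
A minimal way to inspect the consolidated numbers (assumes only that the file is valid JSON):

```python
import json

with open("eval_example/final_result.json") as f:
    final_metrics = json.load(f)
print(json.dumps(final_metrics, indent=2))
```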

---
## End-to-End Execution Summary

```bash
# 1. Download base model + raw dataset
python scripts/download.py

# 2. Transform dataset into KTO preference format
python scripts/kto_datasets_process.py

# 3. Launch KTO alignment training (DeepSpeed ZeRO-2)
sbatch train.sh # or run the accelerate command directly if not using Slurm

# 4. (After training) Evaluate checkpoint(s)
# (Use your evaluation tooling; results stored under eval_example/)
```

---
## Reproducibility

| Aspect | Mechanism | Notes |
|--------|-----------|-------|
| Random Seeds | `seed: 233` + `set_seed()` | Multi-worker data map & packing can still introduce slight nondeterminism. |
| Checkpointing | Every 1,000 steps | Only last retained unless `save_total_limit` increased. |
| Determinism | Not fully enforced | For stricter determinism: set CUDA deterministic flags (may degrade performance). |

Recommendations:
- Pin versions of `transformers`, `datasets`, `trl`, `accelerate`, `torch`.
- Archive `zero2.yaml` + `config.yaml` with final model for auditability.

---
## Extending / Modifying the Pipeline

| Goal | Change |
|------|--------|
| Introduce negative preferences | Modify `kto_datasets_process.py` to generate paired positive/negative samples (set some `label=False`). |
| Multi-reference completions | Convert single-element lists to multiple alternatives in `completion`; adjust trainer consumption. |
| Curriculum alignment | Stage multiple processed datasets; fine-tune sequentially. |
| Longer context | Increase `max_length`; ensure GPU memory headroom and FlashAttention support. |
| Retain multiple checkpoints | Increase `save_total_limit`; optionally add model averaging or smoothing post-training. |
| Parameter-efficient tuning | Configure LoRA / prefix tuning in the `ModelConfig` (passed via KTOTrainer); see the sketch below. |
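
A hypothetical PEFT sketch (LoRA hyperparameters and target module names are illustrative and must be matched to the actual model architecture):

```python
from peft import LoraConfig

# Illustrative values only; tune r/alpha/dropout and target modules for the model.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names, verify against the checkpoint
    task_type="CAUSAL_LM",
)

trainer = KTOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=dataset["train"],
    processing_class=tokenizer,
    peft_config=peft_config,
)
```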

---
## Troubleshooting

| Issue | Symptoms | Mitigation |
|-------|----------|------------|
| OOM (GPU) | CUDA out of memory | Reduce batch size / seq length, enable gradient checkpointing (already on), use quantization. |
| Divergent loss | Loss spikes or NaNs | Lower LR, disable exotic kernels, check BF16 support, verify data labels. |
| Slow startup | Long dataset load | Ensure dataset is saved locally (disk I/O), increase `num_proc` during preprocessing only (not in training). |
| FlashAttention errors | Kernel build failures | Switch `attn_implementation` to `sdpa` or install matching CUDA toolkit & driver. |
| Checkpoints not saving | Missing directories | Verify write permissions & disk quota; ensure `output_dir` exists & not readonly. |
| No evaluation | Metrics absent | Set `do_eval: True` & `eval_strategy: steps` or `epoch`; supply proper eval split. |

---
## Security & Integrity Notes
- Trust only vetted model repos when `trust_remote_code=True`.
- Validate dataset integrity (hash or size) before large-scale training.
- For multi-tenant clusters, restrict write paths and use namespace-isolated caches.

---
## License & Attribution
- Code headers indicate Apache 2.0 licensing.
- Base model: `BAAI/OpenSeek-Small-v1-SFT` (refer to its upstream license & usage terms).
- Dataset: `AI-MO/NuminaMath-CoT` (comply with its license and any redistribution constraints).

---
## Summary
This `final/` pipeline layers preference-style alignment (KTO) atop an SFT base using a memory-efficient ZeRO-2 BF16 stack. It emphasizes reproducible dataset transformation, lean checkpoint retention, and modular extensibility for future preference formats, negative sampling, or evaluation automation.

Happy aligning.
Binary file not shown.
21 changes: 21 additions & 0 deletions openseek/competition/pz/losercheems/final/recipes/accelerate_configs/zero2.yaml
@@ -0,0 +1,21 @@
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

**Review comment (medium):** It's good practice to end files with a newline character. Some tools might have issues with files that don't end with a newline.

Suggested change: keep `use_cpu: false` as the last line and add a trailing newline after it.

49 changes: 49 additions & 0 deletions openseek/competition/pz/losercheems/final/recipes/openseek/config.yaml
@@ -0,0 +1,49 @@
# Logging and Output arguments
log_level: info
logging_strategy: steps
logging_steps: 1
report_to:
- tensorboard
save_strategy: steps
save_steps: 1000
save_total_limit: 1
output_dir: data/OpenSeek-1.4B-A0.4B-KTO
overwrite_output_dir: true

# Model arguments
model_name_or_path: ./models/OpenSeek-Small-v1-SFT
model_revision: main
trust_remote_code: True
torch_dtype: bfloat16
attn_implementation: flash_attention_2

# Data training arguments
dataset_name: /workspace/datasets/AI-MO/NuminaMath-CoT-preference

**Review comment (high):** The `dataset_name` is set to an absolute path `/workspace/datasets/AI-MO/NuminaMath-CoT-preference`. This is not portable and will cause the script to fail on other machines. It's better to use a relative path or make it configurable through an environment variable.

Suggested change:

```yaml
dataset_name: ./datasets/AI-MO/NuminaMath-CoT-preference
```

dataset_config: default
dataset_num_proc: 8
max_length: 4096

# KTO Trainer arguments
seed: 233
do_train: True
num_train_epochs: 4
per_device_train_batch_size: 8
do_eval: False
eval_strategy: 'no'
eval_steps: 100
per_device_eval_batch_size: 1
optim: adamw_torch
learning_rate: 2.0e-5
lr_scheduler_type: cosine_with_min_lr
lr_scheduler_kwargs:
  min_lr_rate: 0.1
warmup_ratio: 0.1
weight_decay: 0.01
gradient_accumulation_steps: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
max_grad_norm: 1.0
bf16: True
use_liger_kernel: True
use_liger_loss: True
16 changes: 16 additions & 0 deletions openseek/competition/pz/losercheems/final/scripts/download.py
@@ -0,0 +1,16 @@
from datasets import load_dataset, load_from_disk, concatenate_datasets, Dataset, DatasetDict
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch


# export HF_ENDPOINT=https://hf-mirror.com
# export XDG_CACHE_HOME=./cache

tokenizer = AutoTokenizer.from_pretrained("BAAI/OpenSeek-Small-v1-SFT", trust_remote_code=True)
tokenizer.save_pretrained("./models/OpenSeek-Small-v1-SFT")
model = AutoModelForCausalLM.from_pretrained("BAAI/OpenSeek-Small-v1-SFT", trust_remote_code=True).to(torch.bfloat16)
model.save_pretrained("./models/OpenSeek-Small-v1-SFT")

numina_math_cot = load_dataset("AI-MO/NuminaMath-CoT", num_proc=4)
print(numina_math_cot)
numina_math_cot.save_to_disk("./datasets/AI-MO/NuminaMath-CoT")
21 changes: 21 additions & 0 deletions openseek/competition/pz/losercheems/final/scripts/kto_datasets_process.py
@@ -0,0 +1,21 @@

from datasets import load_dataset, DatasetDict, concatenate_datasets, load_from_disk

def process(example):
    # kto
    example["prompt"] = [
        example["messages"][0]
    ]
    example["completion"] = [
        example["messages"][1]
    ]

    example["label"] = True
    return example

numina_math_cot = load_from_disk("/root/code/small-doge/datasets/AI-MO/NuminaMath-CoT")
print(numina_math_cot)
numina_math_cot = numina_math_cot.map(process, num_proc=4).select_columns(["prompt", "completion", "label"])
print(numina_math_cot)
print(numina_math_cot["train"][0])
numina_math_cot.save_to_disk("./datasets/AI-MO/NuminaMath-CoT-preference")

**Review comment on lines +16 to +21 (critical):** The script contains a hardcoded absolute path `/root/code/small-doge/datasets/AI-MO/NuminaMath-CoT` for loading the dataset and a hardcoded relative path for saving. This severely impacts portability and reproducibility. These paths should be parameterized using command-line arguments (e.g., with `argparse`) to make the script reusable.

Suggested change (replacing the hardcoded-path block at the bottom of the script):

```python
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Process dataset for KTO training.")
    parser.add_argument("--input_path", type=str, required=True, help="Path to the input dataset.")
    parser.add_argument("--output_path", type=str, required=True, help="Path to save the processed dataset.")
    args = parser.parse_args()

    numina_math_cot = load_from_disk(args.input_path)
    print(numina_math_cot)
    numina_math_cot = numina_math_cot.map(process, num_proc=4).select_columns(["prompt", "completion", "label"])
    print(numina_math_cot)
    print(numina_math_cot["train"][0])
    numina_math_cot.save_to_disk(args.output_path)
```

10 changes: 10 additions & 0 deletions openseek/competition/pz/losercheems/final/train.sh
@@ -0,0 +1,10 @@
#!/bin/bash

export HF_ENDPOINT=https://hf-mirror.com
export XDG_CACHE_HOME=cache
export WANDB_OFFLINE=true

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml ./trainer/kto.py --config recipes/openseek/config.yaml

# tmux new -s openseek
# tmux attach -t openseek