[PZ COMPETITION] code submission by losercheems #165
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,272 @@ | ||
| # OpenSeek KTO Alignment – Technical Report | ||
|
|
||
| This document describes, end to end, the KTO (Kahneman–Tversky Optimization) alignment fine-tuning pipeline implemented under the `final/` directory. The workflow has four major stages: | ||
|
|
||
| 1) Asset & dataset acquisition (`scripts/download.py`): download the SFT base model + tokenizer and pull the raw dataset. | ||
| 2) Dataset transformation (`scripts/kto_datasets_process.py`): convert the raw dataset into a KTO-compatible preference format. | ||
| 3) Alignment training (`trainer/kto.py`) using TRL's `KTOTrainer` with DeepSpeed ZeRO-2 (`recipes/accelerate_configs/zero2.yaml`) and training hyperparameters (`recipes/openseek/config.yaml`), launched by `train.sh`. Checkpoints are saved every 1,000 steps. | ||
| 4) Evaluation (`eval_example/`): contains benchmark outputs and aggregate metrics for the final checkpoint. | ||
|
|
||
| --- | ||
| ## Public Checkpoint | ||
|
|
||
| The KTO alignment checkpoint is released at: `JingzeShi/OpenSeek-1.4B-A0.4B-KTO`. | ||
|
|
||
| Typical load snippet: | ||
| ```python | ||
| from transformers import AutoModelForCausalLM, AutoTokenizer | ||
| model_id = "JingzeShi/OpenSeek-1.4B-A0.4B-KTO" | ||
| tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) | ||
| model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True) | ||
| ``` | ||
| --- | ||
|
|
||
| ## Directory Overview | ||
|
|
||
| ``` | ||
| final/ | ||
|   recipes/ | ||
|     accelerate_configs/ | ||
|       zero2.yaml               # Accelerate + DeepSpeed ZeRO Stage 2 configuration | ||
|     openseek/ | ||
|       config.yaml              # KTO training hyperparameters (Trainer-compatible YAML) | ||
|   scripts/ | ||
|     download.py                # Download model/tokenizer + raw dataset (NuminaMath-CoT) | ||
|     kto_datasets_process.py    # Transform dataset → KTO preference format | ||
|   trainer/ | ||
|     kto.py                     # Main training entry (KTOTrainer) | ||
|   eval_example/                # Example evaluation results for final checkpoint | ||
|     final_result.json          # Aggregated metrics summary | ||
|     <benchmark_name>/          # Per-benchmark JSONL + metrics | ||
|   README.md                    # (This report) | ||
| ``` | ||
|
|
||
| --- | ||
| ## Stage 1: Download Base Assets (`scripts/download.py`) | ||
|
|
||
| Key actions: | ||
| - Downloads the tokenizer and model from `BAAI/OpenSeek-Small-v1-SFT` (an SFT checkpoint that serves as the base for alignment). | ||
| - Saves them locally under `./models/OpenSeek-Small-v1-SFT` for reproducible offline reuse. | ||
| - Loads the raw dataset: `AI-MO/NuminaMath-CoT` from Hugging Face Hub. | ||
| - Persists dataset to disk: `./datasets/AI-MO/NuminaMath-CoT` (Arrow + metadata) to avoid repeated network fetches. | ||
|
|
||
| Environment variables (optional but recommended): | ||
| - `HF_ENDPOINT=https://hf-mirror.com` (for regional mirrors) | ||
| - `XDG_CACHE_HOME=./cache` (centralize HF cache) | ||
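These variables are typically read when the Hugging Face libraries are first imported, so they need to be in place before that point. A minimal sketch of setting them from Python instead of the shell; the helper name `configure_hf_env` is hypothetical and not part of the submitted scripts, and the default values simply mirror the ones listed above.

```python
import os


def configure_hf_env(endpoint="https://hf-mirror.com", cache_dir="./cache"):
    """Set the mirror endpoint and cache root unless the shell already set them."""
    # setdefault keeps any value exported by the caller's shell.
    os.environ.setdefault("HF_ENDPOINT", endpoint)
    os.environ.setdefault("XDG_CACHE_HOME", cache_dir)
    return {key: os.environ[key] for key in ("HF_ENDPOINT", "XDG_CACHE_HOME")}


if __name__ == "__main__":
    # Call this before importing `datasets` or `transformers`.
    print(configure_hf_env())
```

Using `setdefault` means an explicit `export` in the shell still wins, which keeps the behaviour consistent with `train.sh`.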
|
|
||
| Execution: | ||
| ```bash | ||
| python scripts/download.py | ||
| ``` | ||
|
|
||
| Outputs: | ||
| - `./models/OpenSeek-Small-v1-SFT/` (model weights, tokenizer files, config) | ||
| - `./datasets/AI-MO/NuminaMath-CoT/` (splits as provided by the source dataset) | ||
|
|
||
| --- | ||
| ## Stage 2: Dataset Transformation for KTO (`scripts/kto_datasets_process.py`) | ||
|
|
||
| Objective: | ||
| Convert the original multi-turn / message-style math reasoning dataset (`NuminaMath-CoT`) into a simplified preference alignment format required by KTO: each example should expose a prompt, a completion, and a binary label. | ||
|
|
||
| Implementation specifics: | ||
| - Loads the previously saved raw dataset from disk. | ||
| - For each sample, extracts the first two entries in `messages`: | ||
|   - `messages[0]` → becomes a single-element list assigned to `prompt`. | ||
|   - `messages[1]` → becomes a single-element list assigned to `completion`. | ||
| - Assigns `label = True` for all entries (i.e., all are treated as preferred / positive examples). | ||
| - Selects only the columns `["prompt", "completion", "label"]`. | ||
| - Saves the processed dataset to: `./datasets/AI-MO/NuminaMath-CoT-preference`. | ||
|
|
||
| Command: | ||
| ```bash | ||
| python scripts/kto_datasets_process.py | ||
| ``` | ||
|
|
||
| Resulting dataset schema (per split): | ||
| ``` | ||
| { | ||
| "prompt": List[Any] # list-wrapped message dict(s) or text segment(s) | ||
| "completion": List[Any] # list-wrapped assistant answer | ||
| "label": bool # True → preferred sample | ||
| } | ||
| ``` | ||
|
|
||
| Example (illustrative, not verbatim): | ||
| ```json | ||
| { | ||
| "prompt": [ {"role": "user", "content": "Solve: 2x + 3 = 7"} ], | ||
| "completion": [ {"role": "assistant", "content": "x = 2"} ], | ||
| "label": true | ||
| } | ||
| ``` | ||
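The mapping from a raw record to this schema can be sketched on a single in-memory sample, without loading the dataset. The function name `to_kto_example` and the length check are illustrative additions; the submitted `process` function performs the same field assignment without the guard.

```python
def to_kto_example(sample):
    """Map one message-style record to the prompt/completion/label schema."""
    messages = sample["messages"]
    # Guard added for illustration: the transformation needs a user turn
    # followed by an assistant turn.
    if len(messages) < 2:
        raise ValueError("expected at least a user turn and an assistant turn")
    return {
        "prompt": [messages[0]],      # list-wrapped first message
        "completion": [messages[1]],  # list-wrapped second message
        "label": True,                # every sample is treated as preferred
    }


raw = {
    "messages": [
        {"role": "user", "content": "Solve: 2x + 3 = 7"},
        {"role": "assistant", "content": "x = 2"},
    ]
}
example = to_kto_example(raw)
print(example)
```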
|
|
||
| Notes & Considerations: | ||
| - Current processing creates only positive (`True`) labels. KTO is designed for unpaired data that mixes desirable and undesirable examples, so with positives alone the undesirable branch of the loss is never exercised. If extending, introduce negative variants (e.g., alternative incorrect completions) with `label=False`. | ||
| - Left vs right padding: handled later in tokenizer setup (alignment models usually benefit from left padding in generation-oriented training to keep latest tokens aligned in GPU compute). | ||
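One possible way to introduce `label=False` rows is to pair each prompt with a completion taken from a different sample. This is an assumed heuristic for illustration only, not something the submitted pipeline does; the function name `add_mismatched_negatives` and the `shift` parameter are hypothetical. Mismatched answers are easy negatives, and two prompts that happen to share an answer would produce a wrongly labeled row, so real use would need a correctness check.

```python
def add_mismatched_negatives(examples, shift=1):
    """Return the positives plus one mismatched negative per prompt."""
    n = len(examples)
    if n < 2:
        raise ValueError("need at least two examples to build mismatched pairs")
    if shift % n == 0:
        raise ValueError("shift must not map an example onto itself")
    negatives = []
    for i, ex in enumerate(examples):
        donor = examples[(i + shift) % n]  # completion from another sample
        negatives.append(
            {"prompt": ex["prompt"], "completion": donor["completion"], "label": False}
        )
    return list(examples) + negatives


positives = [
    {"prompt": [{"role": "user", "content": f"q{i}"}],
     "completion": [{"role": "assistant", "content": f"a{i}"}],
     "label": True}
    for i in range(3)
]
mixed = add_mismatched_negatives(positives)
print(len(mixed), sum(not ex["label"] for ex in mixed))
```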
|
|
||
| --- | ||
| ## Stage 3: Alignment Training (KTO) – `trainer/kto.py` | ||
|
|
||
| ### 3.1 Launch Mechanism | ||
| Training is launched via Accelerate + DeepSpeed ZeRO Stage 2 for memory efficiency and multi-GPU scaling. The Slurm / shell entry is encapsulated in `train.sh`: | ||
| ```bash | ||
| ACCELERATE_LOG_LEVEL=info accelerate launch \ | ||
| --config_file recipes/accelerate_configs/zero2.yaml \ | ||
| ./trainer/kto.py --config recipes/openseek/config.yaml | ||
| ``` | ||
|
|
||
| ### 3.2 Accelerate + DeepSpeed Config (`zero2.yaml`) | ||
| Key parameters: | ||
| - `distributed_type: DEEPSPEED` & `zero_stage: 2` → ZeRO-2 shards optimizer states and gradients across ranks; parameters stay replicated on every GPU (unlike ZeRO-3), which keeps communication overhead lower. | ||
| - `mixed_precision: bf16` → uses BF16 where supported (Ampere or newer); its wider exponent range makes it more numerically stable than FP16 on many math-heavy workloads. | ||
| - `num_processes: 8` → should match the number of visible GPUs (adjust to your cluster allocation). | ||
| - No optimizer or parameter CPU offload (`offload_*: none`) to reduce PCIe pressure (requires enough GPU RAM). | ||
|
|
||
| ### 3.3 Training Hyperparameters (`config.yaml`) | ||
| Extracted key fields: | ||
| - Logging & checkpointing: `logging_steps: 1`, `save_steps: 1000`, `save_total_limit: 1` (keeps only the latest checkpoint to save disk). | ||
| - Model source: `model_name_or_path: ./models/OpenSeek-Small-v1-SFT` (the SFT base from Stage 1). | ||
| - Attention backend: `attn_implementation: flash_attention_2` (ensure FlashAttention v2 build compatibility). | ||
| - Data: `dataset_name: /workspace/datasets/AI-MO/NuminaMath-CoT-preference` (adjust to your actual path if different); `max_length: 4096`. | ||
| - Optimization: | ||
| - `learning_rate: 2e-5` | ||
| - Scheduler: `cosine_with_min_lr` + `min_lr_rate: 0.1` (final LR = base_lr * 0.1 at tail). | ||
| - `warmup_ratio: 0.1` | ||
| - `weight_decay: 0.01` | ||
| - `gradient_accumulation_steps: 2` (effective batch = per_device * GPUs * accum). | ||
| - `gradient_checkpointing: true` + `use_reentrant: false` (saves memory at cost of extra compute). | ||
| - `max_grad_norm: 1.0` (gradient clipping). | ||
| - `bf16: True` (reinforces BF16 usage in Trainer config). | ||
| - Liger flags: `use_liger_kernel`, `use_liger_loss` (enable fused kernels and a fused loss implementation; make sure the `liger-kernel` package is installed). | ||
| - Epoch count: `num_train_epochs: 4`. | ||
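The arithmetic behind two of these settings can be checked in a few lines. The sketch below computes the effective batch size from `per_device_train_batch_size: 8`, `num_processes: 8`, and `gradient_accumulation_steps: 2`, and approximates the shape of a cosine schedule with linear warmup and a floor of `min_lr_rate * learning_rate`. The function names are hypothetical, `total_steps=1000` is an arbitrary illustrative value, and the schedule is a simplified re-derivation rather than the library's exact implementation (rounding of warmup steps may differ).

```python
import math


def effective_batch_size(per_device, num_gpus, grad_accum):
    """Samples contributing to one optimizer step."""
    return per_device * num_gpus * grad_accum


def cosine_with_min_lr(step, total_steps, base_lr=2e-5,
                       warmup_ratio=0.1, min_lr_rate=0.1):
    """Linear warmup, then cosine decay from base_lr down to base_lr * min_lr_rate."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Rescale the cosine factor from [0, 1] to [min_lr_rate, 1].
    return base_lr * (min_lr_rate + (1.0 - min_lr_rate) * cosine)


print(effective_batch_size(8, 8, 2))  # 128
for step in (0, 100, 550, 1000):
    print(step, cosine_with_min_lr(step, total_steps=1000))
```

With these settings the learning rate peaks at 2e-5 once warmup ends and finishes near 2e-6.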
|
|
||
| ### 3.4 Tokenizer & Padding (`kto.py`) | ||
| - Tokenizer uses left padding (`tokenizer.padding_side = "left"`), the usual convention for decoder-only models in generation-focused setups, so that the most recent tokens of every sequence sit at the right edge of the batch. | ||
| - If tokenizer lacks a `pad_token`, it falls back to `eos_token`. | ||
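What left padding does to a batch can be shown without a tokenizer. In this sketch the token IDs are made up, `left_pad` is a hypothetical helper, and `pad_id` stands in for whatever ID the tokenizer ends up using (the EOS ID when no pad token exists, per the fallback above).

```python
def left_pad(sequences, pad_id):
    """Pad variable-length ID lists on the left and build attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = width - len(seq)
        input_ids.append([pad_id] * pad + list(seq))
        # 0 marks padding positions, 1 marks real tokens.
        attention_mask.append([0] * pad + [1] * len(seq))
    return input_ids, attention_mask


ids, mask = left_pad([[5, 6, 7], [8]], pad_id=0)
print(ids)   # [[5, 6, 7], [0, 0, 8]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```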
|
|
||
| ### 3.5 Model & Reference Model | ||
| - Both `model` and `ref_model` are loaded from the same base. KTO compares the policy's log-probabilities against those of the frozen reference model; these log-ratios define the implicit reward and the KL baseline in the loss. Keeping the two identical at initialization is standard. | ||
| - Quantization hooks (via `get_quantization_config`) are available but not explicitly set in the provided configs (would allow 4/8-bit experiments if desired). | ||
|
|
||
| ### 3.6 Trainer Initialization | ||
| - Uses TRL `KTOTrainer` with: | ||
|   - `train_dataset=dataset[script_args.dataset_train_split]` (the split name typically defaults to `train`) | ||
|   - Optional eval dataset only if `eval_strategy != "no"` (currently disabled for speed). | ||
|   - `peft_config=get_peft_config(model_args)` (enables LoRA-style parameter-efficient fine-tuning if configured in `ModelConfig`). When PEFT is not enabled, `get_peft_config` returns `None` and training is full fine-tuning. | ||
| - `use_cache` is disabled during training if gradient checkpointing is on. | ||
|
|
||
| ### 3.7 Checkpoint Artifacts | ||
| At each save step (every 1,000 steps): | ||
| - Model weights (BF16) | ||
| - Trainer state (optimizer and scheduler; under DeepSpeed ZeRO the optimizer state is written as per-rank shards) | ||
| - RNG states for reproducibility | ||
| Because `save_total_limit: 1`, only the latest checkpoint directory is retained (rolling deletion of older ones). If you intend to run model soup or regression comparisons, increase this limit. | ||
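The effect of the rolling limit can be previewed with a small simulation. It assumes the Trainer's usual `checkpoint-<step>` directory naming; `retained_checkpoints` is a hypothetical helper and `total_steps=3500` is an arbitrary illustrative value, since the actual step count depends on dataset size and batch settings.

```python
def retained_checkpoints(total_steps, save_steps=1000, save_total_limit=1):
    """List checkpoint directories that would remain after rolling deletion."""
    saved = [f"checkpoint-{step}"
             for step in range(save_steps, total_steps + 1, save_steps)]
    # No positive limit means nothing is rotated out.
    if save_total_limit is None or save_total_limit <= 0:
        return saved
    return saved[-save_total_limit:]


print(retained_checkpoints(3500))                      # ['checkpoint-3000']
print(retained_checkpoints(3500, save_total_limit=3))  # all three step checkpoints
```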
|
|
||
| ### 3.8 Performance & Memory Tips | ||
| - If encountering OOM: | ||
|   - Lower `per_device_train_batch_size` | ||
|   - Increase `gradient_accumulation_steps` | ||
|   - Reduce `max_length` | ||
|   - Enable quantization (4-bit/8-bit) if latency acceptable | ||
| - If throughput is low: | ||
|   - Ensure FlashAttention 2 is correctly installed (or switch to `sdpa` fallback) | ||
|   - Disable unnecessary logging (though `logging_steps: 1` is useful during early debugging, raise later) | ||
|
|
||
| --- | ||
| ## Stage 4: Evaluation (`eval_example/`) | ||
|
|
||
| This stage reports benchmark results using a unified Chain-of-Thought prompting configuration. | ||
|
|
||
| ``` | ||
| PROMPT_TYPE="cot" | ||
| aime24: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072 | ||
| amc23: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072 | ||
| gsm8k: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072 | ||
| math500: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072 | ||
| minerva_math: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072 | ||
| olympiadbench: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072 | ||
| ``` | ||
|
|
||
| Directory layout (unchanged): | ||
| ``` | ||
| eval_example/ | ||
|   final_result.json           # Aggregated metrics | ||
|   <benchmark>/ | ||
|     <run_id>_metrics.json     # Summary metrics | ||
|     <run_id>_result.jsonl     # Raw generations | ||
| ``` | ||
|
|
||
| `final_result.json` consolidates the per-benchmark metrics produced under the above consistent decoding / prompting setup. | ||
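Given the layout above, the per-benchmark summaries can be gathered with a short script. This is a sketch under assumptions: `collect_metrics` is a hypothetical helper, each `*_metrics.json` is assumed to be a single JSON object, and the keys inside those files are not specified by this report. If a benchmark directory holds several runs, the last one in sorted order wins.

```python
import json
from pathlib import Path


def collect_metrics(root="eval_example"):
    """Map each benchmark directory name to the contents of its metrics file."""
    summary = {}
    for path in sorted(Path(root).glob("*/*_metrics.json")):
        with path.open(encoding="utf-8") as fh:
            summary[path.parent.name] = json.load(fh)
    return summary


if __name__ == "__main__":
    for benchmark, metrics in collect_metrics().items():
        print(benchmark, metrics)
```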
|
|
||
| --- | ||
| ## End-to-End Execution Summary | ||
|
|
||
| ```bash | ||
| # 1. Download base model + raw dataset | ||
| python scripts/download.py | ||
|
|
||
| # 2. Transform dataset into KTO preference format | ||
| python scripts/kto_datasets_process.py | ||
|
|
||
| # 3. Launch KTO alignment training (DeepSpeed ZeRO-2) | ||
| sbatch train.sh # or run the accelerate command directly if not using Slurm | ||
|
|
||
| # 4. (After training) Evaluate checkpoint(s) | ||
| # (Use your evaluation tooling; results stored under eval_example/) | ||
| ``` | ||
|
|
||
| --- | ||
| ## Reproducibility | ||
|
|
||
| | Aspect | Mechanism | Notes | | ||
| |--------|-----------|-------| | ||
| | Random Seeds | `seed: 233` + `set_seed()` | Multi-worker data map & packing can still introduce slight nondeterminism. | | ||
| | Checkpointing | Every 1,000 steps | Only last retained unless `save_total_limit` increased. | | ||
| | Determinism | Not fully enforced | For stricter determinism: set CUDA deterministic flags (may degrade performance). | | ||
|
|
||
| Recommendations: | ||
| - Pin versions of `transformers`, `datasets`, `trl`, `accelerate`, `torch`. | ||
| - Archive `zero2.yaml` + `config.yaml` with final model for auditability. | ||
|
|
||
| --- | ||
| ## Extending / Modifying the Pipeline | ||
|
|
||
| | Goal | Change | | ||
| |------|--------| | ||
| | Introduce negative preferences | Modify `kto_datasets_process.py` to generate paired positive/negative samples (set some `label=False`). | | ||
| | Multi-reference completions | Convert single-element lists to multiple alternatives in `completion`; adjust trainer consumption. | | ||
| | Curriculum alignment | Stage multiple processed datasets; fine-tune sequentially. | | ||
| | Longer context | Increase `max_length`; ensure GPU memory headroom and FlashAttention support. | | ||
| | Retain multiple checkpoints | Increase `save_total_limit`; optionally add model averaging or smoothing post-training. | | ||
| | Parameter-efficient tuning | Configure LoRA in `ModelConfig` (for example `use_peft: true` with `lora_r` / `lora_alpha`); `get_peft_config` converts it into the `peft_config` passed to `KTOTrainer`. | | ||
|
|
||
| --- | ||
| ## Troubleshooting | ||
|
|
||
| | Issue | Symptoms | Mitigation | | ||
| |-------|----------|------------| | ||
| | OOM (GPU) | CUDA out of memory | Reduce batch size / seq length, enable gradient checkpointing (already on), use quantization. | | ||
| | Divergent loss | Loss spikes or NaNs | Lower LR, disable exotic kernels, check BF16 support, verify data labels. | | ||
| | Slow startup | Long dataset load | Ensure dataset is saved locally (disk I/O), increase `num_proc` during preprocessing only (not in training). | | ||
| | FlashAttention errors | Kernel build failures | Switch `attn_implementation` to `sdpa` or install matching CUDA toolkit & driver. | | ||
| | Checkpoints not saving | Missing directories | Verify write permissions & disk quota; ensure `output_dir` exists & not readonly. | | ||
| | No evaluation | Metrics absent | Set `do_eval: True` & `eval_strategy: steps` or `epoch`; supply proper eval split. | | ||
|
|
||
| --- | ||
| ## Security & Integrity Notes | ||
| - Trust only vetted model repos when `trust_remote_code=True`. | ||
| - Validate dataset integrity (hash or size) before large-scale training. | ||
| - For multi-tenant clusters, restrict write paths and use namespace-isolated caches. | ||
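Dataset integrity can be checked by hashing the saved directory and comparing the digest against a recorded value. The sketch below is one way to do that with the standard library; `directory_digest` is a hypothetical helper, and where the reference digest is stored is left to the reader.

```python
import hashlib
from pathlib import Path


def directory_digest(root):
    """SHA-256 over relative file names and contents, in sorted order."""
    root = Path(root)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so renames change the digest too.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    target = Path("./datasets/AI-MO/NuminaMath-CoT-preference")
    if target.exists():
        print(directory_digest(target))
```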
|
|
||
| --- | ||
| ## License & Attribution | ||
| - Code headers indicate Apache 2.0 licensing. | ||
| - Base model: `BAAI/OpenSeek-Small-v1-SFT` (refer to its upstream license & usage terms). | ||
| - Dataset: `AI-MO/NuminaMath-CoT` (comply with its license and any redistribution constraints). | ||
|
|
||
| --- | ||
| ## Summary | ||
| This `final/` pipeline layers preference-style alignment (KTO) atop an SFT base using a memory-efficient ZeRO-2 BF16 stack. It emphasizes reproducible dataset transformation, lean checkpoint retention, and modular extensibility for future preference formats, negative sampling, or evaluation automation. | ||
|
|
||
| Happy aligning. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| compute_environment: LOCAL_MACHINE | ||
| debug: false | ||
| deepspeed_config: | ||
|   deepspeed_multinode_launcher: standard | ||
|   offload_optimizer_device: none | ||
|   offload_param_device: none | ||
|   zero3_init_flag: false | ||
|   zero_stage: 2 | ||
| distributed_type: DEEPSPEED | ||
| downcast_bf16: 'no' | ||
| machine_rank: 0 | ||
| main_training_function: main | ||
| mixed_precision: bf16 | ||
| num_machines: 1 | ||
| num_processes: 8 | ||
| rdzv_backend: static | ||
| same_network: true | ||
| tpu_env: [] | ||
| tpu_use_cluster: false | ||
| tpu_use_sudo: false | ||
| use_cpu: false | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,49 @@ | ||
| # Logging and Output arguments | ||
| log_level: info | ||
| logging_strategy: steps | ||
| logging_steps: 1 | ||
| report_to: | ||
| - tensorboard | ||
| save_strategy: steps | ||
| save_steps: 1000 | ||
| save_total_limit: 1 | ||
| output_dir: data/OpenSeek-1.4B-A0.4B-KTO | ||
| overwrite_output_dir: true | ||
|
|
||
| # Model arguments | ||
| model_name_or_path: ./models/OpenSeek-Small-v1-SFT | ||
| model_revision: main | ||
| trust_remote_code: True | ||
| torch_dtype: bfloat16 | ||
| attn_implementation: flash_attention_2 | ||
|
|
||
| # Data training arguments | ||
| dataset_name: /workspace/datasets/AI-MO/NuminaMath-CoT-preference | ||
|
Review comment: The dataset_name: ./datasets/AI-MO/NuminaMath-CoT-preference |
||
| dataset_config: default | ||
| dataset_num_proc: 8 | ||
| max_length: 4096 | ||
|
|
||
| # KTO Trainer arguments | ||
| seed: 233 | ||
| do_train: True | ||
| num_train_epochs: 4 | ||
| per_device_train_batch_size: 8 | ||
| do_eval: False | ||
| eval_strategy: 'no' | ||
| eval_steps: 100 | ||
| per_device_eval_batch_size: 1 | ||
| optim: adamw_torch | ||
| learning_rate: 2.0e-5 | ||
| lr_scheduler_type: cosine_with_min_lr | ||
| lr_scheduler_kwargs: | ||
|   min_lr_rate: 0.1 | ||
| warmup_ratio: 0.1 | ||
| weight_decay: 0.01 | ||
| gradient_accumulation_steps: 2 | ||
| gradient_checkpointing: true | ||
| gradient_checkpointing_kwargs: | ||
|   use_reentrant: false | ||
| max_grad_norm: 1.0 | ||
| bf16: True | ||
| use_liger_kernel: True | ||
| use_liger_loss: True | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| from datasets import load_dataset | ||
| from transformers import AutoTokenizer, AutoModelForCausalLM | ||
| import torch | ||
|
|
||
|
|
||
| # export HF_ENDPOINT=https://hf-mirror.com | ||
| # export XDG_CACHE_HOME=./cache | ||
|
|
||
| tokenizer = AutoTokenizer.from_pretrained("BAAI/OpenSeek-Small-v1-SFT", trust_remote_code=True) | ||
| tokenizer.save_pretrained("./models/OpenSeek-Small-v1-SFT") | ||
| model = AutoModelForCausalLM.from_pretrained("BAAI/OpenSeek-Small-v1-SFT", trust_remote_code=True).to(torch.bfloat16) | ||
| model.save_pretrained("./models/OpenSeek-Small-v1-SFT") | ||
|
|
||
| numina_math_cot = load_dataset("AI-MO/NuminaMath-CoT", num_proc=4) | ||
| print(numina_math_cot) | ||
| numina_math_cot.save_to_disk("./datasets/AI-MO/NuminaMath-CoT") |
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,21 @@ | ||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||
| from datasets import load_from_disk | ||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||
| def process(example): | ||||||||||||||||||||||||||||||||||||||||||
|     # KTO format: wrap the first user turn and the first assistant turn in lists. | ||||||||||||||||||||||||||||||||||||||||||
|     example["prompt"] = [ | ||||||||||||||||||||||||||||||||||||||||||
|         example["messages"][0] | ||||||||||||||||||||||||||||||||||||||||||
|     ] | ||||||||||||||||||||||||||||||||||||||||||
|     example["completion"] = [ | ||||||||||||||||||||||||||||||||||||||||||
|         example["messages"][1] | ||||||||||||||||||||||||||||||||||||||||||
|     ] | ||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||
|     # Every sample is treated as a preferred (desirable) example. | ||||||||||||||||||||||||||||||||||||||||||
|     example["label"] = True | ||||||||||||||||||||||||||||||||||||||||||
|     return example | ||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||
| numina_math_cot = load_from_disk("./datasets/AI-MO/NuminaMath-CoT") | ||||||||||||||||||||||||||||||||||||||||||
| print(numina_math_cot) | ||||||||||||||||||||||||||||||||||||||||||
| numina_math_cot = numina_math_cot.map(process, num_proc=4).select_columns(["prompt", "completion", "label"]) | ||||||||||||||||||||||||||||||||||||||||||
| print(numina_math_cot) | ||||||||||||||||||||||||||||||||||||||||||
| print(numina_math_cot["train"][0]) | ||||||||||||||||||||||||||||||||||||||||||
| numina_math_cot.save_to_disk("./datasets/AI-MO/NuminaMath-CoT-preference") | ||||||||||||||||||||||||||||||||||||||||||
|
Review comment on lines +16 to +21: The script contains a hardcoded absolute path
|
||||||||||||||||||||||||||||||||||||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| #!/bin/bash | ||
|
|
||
| export HF_ENDPOINT=https://hf-mirror.com | ||
| export XDG_CACHE_HOME=cache | ||
| export WANDB_OFFLINE=true | ||
|
|
||
| ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml ./trainer/kto.py --config recipes/openseek/config.yaml | ||
|
|
||
| # tmux new -s openseek | ||
| # tmux attach -t openseek |
Review comment: The indentation for `final_result.json` in the directory overview seems incorrect. It appears to be a file inside the `<benchmark_name>` directory, but based on the description, it should likely be at the same level as `<benchmark_name>`.