diff --git a/openseek/competition/pz/losercheems/final/README.md b/openseek/competition/pz/losercheems/final/README.md new file mode 100644 index 0000000..3055e77 --- /dev/null +++ b/openseek/competition/pz/losercheems/final/README.md @@ -0,0 +1,272 @@ +# OpenSeek KTO Alignment – Technical Report + +This document provides a detailed, end-to-end technical description of the KTO (Kahneman–Tversky Optimization style preference / safety / alignment fine-tuning) pipeline implemented under the `final/` directory. The workflow has four major stages: + +1) Asset & dataset acquisition (`scripts/download.py`): download the SFT base model + tokenizer and pull the raw dataset. +2) Dataset transformation (`scripts/kto_datasets_process.py`): convert the raw dataset into a KTO-compatible preference format. +3) Alignment training (`trainer/kto.py`) using TRL's `KTOTrainer` with DeepSpeed ZeRO-2 (`recipes/accelerate_configs/zero2.yaml`) and training hyperparameters (`recipes/openseek/config.yaml`), launched by `train.sh`. Checkpoints saved every 1,000 steps. +4) Evaluation (`eval_example/`): contains benchmark outputs and aggregate metrics for the final checkpoint. + +--- +## Public Checkpoint + +The KTO alignment checkpoint is released at: `JingzeShi/OpenSeek-1.4B-A0.4B-KTO`. 
+ +Typical load snippet: +```python +from transformers import AutoModelForCausalLM, AutoTokenizer +model_id = "JingzeShi/OpenSeek-1.4B-A0.4B-KTO" +tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) +model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True) +``` +--- + +## Directory Overview + +``` +final/ + recipes/ + accelerate_configs/ + zero2.yaml # Accelerate + DeepSpeed ZeRO Stage 2 configuration + openseek/ + config.yaml # KTO training hyperparameters (Trainer-compatible YAML) + scripts/ + download.py # Download model/tokenizer + raw dataset (NuminaMath-CoT) + kto_datasets_process.py # Transform dataset → KTO preference format + trainer/ + kto.py # Main training entry (KTOTrainer) + eval_example/ # Example evaluation results for final checkpoint + final_result.json # Aggregated metrics summary + / # Per-benchmark JSONL + metrics + README.md # (This report) +``` + +--- +## Stage 1: Download Base Assets (`scripts/download.py`) + +Key actions: +- Downloads tokenizer & model from `BAAI/OpenSeek-Small-v1-SFT` (already SFT-prepared base for alignment). +- Saves them locally under `./models/OpenSeek-Small-v1-SFT` for reproducible offline reuse. +- Loads the raw dataset: `AI-MO/NuminaMath-CoT` from Hugging Face Hub. +- Persists dataset to disk: `./datasets/AI-MO/NuminaMath-CoT` (Arrow + metadata) to avoid repeated network fetches. 
+ +Environment variables (optional but recommended): +- `HF_ENDPOINT=https://hf-mirror.com` (for regional mirrors) +- `XDG_CACHE_HOME=./cache` (centralize HF cache) + +Execution: +```bash +python scripts/download.py +``` + +Outputs: +- `./models/OpenSeek-Small-v1-SFT/` (model weights, tokenizer files, config) +- `./datasets/AI-MO/NuminaMath-CoT/` (train/validation splits as provided by source dataset) + +--- +## Stage 2: Dataset Transformation for KTO (`scripts/kto_datasets_process.py`) + +Objective: +Convert the original multi-turn / message-style math reasoning dataset (`NuminaMath-CoT`) into a simplified preference alignment format required by KTO: each example should expose a prompt, a completion, and a binary label. + +Implementation specifics: +- Loads the previously saved raw dataset from disk. +- For each sample, extracts the first two entries in `messages`: + - `messages[0]` → becomes a single-element list assigned to `prompt`. + - `messages[1]` → becomes a single-element list assigned to `completion`. +- Assigns `label = True` for all entries (i.e., all are treated as preferred / positive examples). +- Selects only the columns `["prompt", "completion", "label"]`. +- Saves the processed dataset to: `./datasets/AI-MO/NuminaMath-CoT-preference`. + +Command: +```bash +python scripts/kto_datasets_process.py +``` + +Resulting dataset schema (per split): +``` +{ + "prompt": List[Any] # list-wrapped message dict(s) or text segment(s) + "completion": List[Any] # list-wrapped assistant answer + "label": bool # True → preferred sample +} +``` + +Example (illustrative, not verbatim): +```json +{ + "prompt": [ {"role": "user", "content": "Solve: 2x + 3 = 7"} ], + "completion": [ {"role": "assistant", "content": "x = 2"} ], + "label": true +} +``` + +Notes & Considerations: +- Current processing creates only positive (True) labels. KTO can also leverage implicit negatives or additional heuristics. 
If extending, introduce negative variants (e.g., alternative incorrect completions) with `label=False`. +- Left vs right padding: handled later in tokenizer setup (alignment models usually benefit from left padding in generation-oriented training to keep latest tokens aligned in GPU compute). + +--- +## Stage 3: Alignment Training (KTO) – `trainer/kto.py` + +### 3.1 Launch Mechanism +Training is launched via Accelerate + DeepSpeed ZeRO Stage 2 for memory efficiency and multi-GPU scaling. The Slurm / shell entry is encapsulated in `train.sh`: +```bash +ACCELERATE_LOG_LEVEL=info accelerate launch \ + --config_file recipes/accelerate_configs/zero2.yaml \ + ./trainer/kto.py --config recipes/openseek/config.yaml +``` + +### 3.2 Accelerate + DeepSpeed Config (`zero2.yaml`) +Key parameters: +- `distributed_type: DEEPSPEED` & `zero_stage: 2` → ZeRO-2 sharding optimizer states + gradients (parameter partition not as full as ZeRO-3, but lower overhead). +- `mixed_precision: bf16` → uses BF16 if supported (Ampere+); stable vs FP16 on many math-heavy workloads. +- `num_processes: 8` → should match the number of visible GPUs (adjust to your cluster allocation). +- No optimizer or parameter CPU offload (`offload_*: none`) to reduce PCIe pressure (requires enough GPU RAM). + +### 3.3 Training Hyperparameters (`config.yaml`) +Extracted key fields: +- Logging & checkpointing: `logging_steps: 1`, `save_steps: 1000`, `save_total_limit: 1` (keeps only the latest checkpoint to save disk). +- Model source: `model_name_or_path: ./models/OpenSeek-Small-v1-SFT` (the SFT base from Stage 1). +- Attention backend: `attn_implementation: flash_attention_2` (ensure FlashAttention v2 build compatibility). +- Data: `dataset_name: /workspace/datasets/AI-MO/NuminaMath-CoT-preference` (adjust to your actual path if different); `max_length: 4096`. +- Optimization: + - `learning_rate: 2e-5` + - Scheduler: `cosine_with_min_lr` + `min_lr_rate: 0.1` (final LR = base_lr * 0.1 at tail). 
+ - `warmup_ratio: 0.1` + - `weight_decay: 0.01` + - `gradient_accumulation_steps: 2` (effective batch = per_device * GPUs * accum). + - `gradient_checkpointing: true` + `use_reentrant: false` (saves memory at cost of extra compute). + - `max_grad_norm: 1.0` (gradient clipping). + - `bf16: True` (reinforces BF16 usage in Trainer config). + - Custom flags: `use_liger_kernel`, `use_liger_loss` (implies specialized fused ops or custom objective—ensure installed extensions if required). +- Epoch count: `num_train_epochs: 4`. + +### 3.4 Tokenizer & Padding (`kto.py`) +- Tokenizer uses left padding (`tokenizer.padding_side = "left"`), typical for generation-focused alignment so most recent tokens align along the right edge in attention windows, improving efficiency for some kernels. +- If tokenizer lacks a `pad_token`, it falls back to `eos_token`. + +### 3.5 Model & Reference Model +- Both `model` and `ref_model` are loaded from the same base. KTO uses the reference model to compute relative preference signals / calibration. Keeping them identical at initialization is standard. +- Quantization hooks (via `get_quantization_config`) are available but not explicitly set in the provided configs (would allow 4/8-bit experiments if desired). + +### 3.6 Trainer Initialization +- Uses TRL `KTOTrainer` with: + - `train_dataset=dataset[script_args.dataset_train_split]` (defaults typically `train`) + - Optional eval dataset only if `eval_strategy != "no"` (currently disabled for speed). + - `peft_config=get_peft_config(model_args)` (enables LoRA/other parameter-efficient fine-tuning if configured in `ModelConfig`). If PEFT is not explicitly configured, it may default to full fine-tuning. +- `use_cache` is disabled during training if gradient checkpointing is on. 
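
For intuition, the per-example objective that `KTOTrainer` optimizes can be written in scalar form. The sketch below is illustrative, not TRL's batched implementation: `beta` corresponds to `KTOConfig.beta` (default 0.1), and `lam_d` / `lam_u` to `desirable_weight` / `undesirable_weight` (default 1.0); `ref_kl` stands in for the batch-level KL estimate that TRL computes and detaches as the reference point.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def kto_loss(policy_logratio: float, ref_kl: float, label: bool,
             beta: float = 0.1, lam_d: float = 1.0, lam_u: float = 1.0) -> float:
    """Scalar sketch of the KTO objective.

    policy_logratio: log pi(y|x) - log pi_ref(y|x) for one completion.
    ref_kl: batch-level KL(pi || pi_ref) estimate used as the reference point
            (detached from the computation graph in the real implementation).
    """
    if label:  # desirable (label=True): push the log-ratio above the KL reference
        return lam_d * (1.0 - sigmoid(beta * (policy_logratio - ref_kl)))
    # undesirable (label=False): push the log-ratio below the KL reference
    return lam_u * (1.0 - sigmoid(beta * (ref_kl - policy_logratio)))
```

Since Stage 2 assigns `label=True` everywhere, every example takes the desirable branch; the `ref_model` exists to supply the log-ratio baseline and the KL reference term.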
+ +### 3.7 Checkpoint Artifacts +At each save step (every 1,000 steps): +- Model weights (BF16) +- Trainer state (optimizer, scheduler unless limited by DeepSpeed stage boundary) +- RNG states for reproducibility +Because `save_total_limit: 1`, only the latest checkpoint directory is retained (rolling deletion of older ones). If you intend to run model soup or regression comparisons, increase this limit. + +### 3.8 Performance & Memory Tips +- If encountering OOM: + - Lower `per_device_train_batch_size` + - Increase `gradient_accumulation_steps` + - Reduce `max_length` + - Enable quantization (4-bit/8-bit) if latency acceptable +- If throughput is low: + - Ensure FlashAttention 2 is correctly installed (or switch to `sdpa` fallback) + - Disable unnecessary logging (though `logging_steps: 1` is useful during early debugging, raise later) + +--- +## Stage 4: Evaluation (`eval_example/`) + +This stage reports benchmark results using a unified Chain-of-Thought prompting configuration. + +``` +PROMPT_TYPE="cot" +aime24: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072 +amc23: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072 +gsm8k: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072 +math500: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072 +minerva_math: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072 +olympiadbench: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072 +``` + +Directory layout (unchanged): +``` +eval_example/ + final_result.json # Aggregated metrics + / + _metrics.json # Summary metrics + _result.jsonl # Raw generations +``` + +`final_result.json` consolidates the per-benchmark metrics produced under the above consistent decoding / prompting setup. + +--- +## End-to-End Execution Summary + +```bash +# 1. Download base model + raw dataset +python scripts/download.py + +# 2. Transform dataset into KTO preference format +python scripts/kto_datasets_process.py + +# 3. 
Launch KTO alignment training (DeepSpeed ZeRO-2) +sbatch train.sh # or run the accelerate command directly if not using Slurm + +# 4. (After training) Evaluate checkpoint(s) +# (Use your evaluation tooling; results stored under eval_example/) +``` + +--- +## Reproducibility + +| Aspect | Mechanism | Notes | +|--------|-----------|-------| +| Random Seeds | `seed: 233` + `set_seed()` | Multi-worker data map & packing can still introduce slight nondeterminism. | +| Checkpointing | Every 1,000 steps | Only last retained unless `save_total_limit` increased. | +| Determinism | Not fully enforced | For stricter determinism: set CUDA deterministic flags (may degrade performance). | + +Recommendations: +- Pin versions of `transformers`, `datasets`, `trl`, `accelerate`, `torch`. +- Archive `zero2.yaml` + `config.yaml` with final model for auditability. + +--- +## Extending / Modifying the Pipeline + +| Goal | Change | +|------|--------| +| Introduce negative preferences | Modify `kto_datasets_process.py` to generate paired positive/negative samples (set some `label=False`). | +| Multi-reference completions | Convert single-element lists to multiple alternatives in `completion`; adjust trainer consumption. | +| Curriculum alignment | Stage multiple processed datasets; fine-tune sequentially. | +| Longer context | Increase `max_length`; ensure GPU memory headroom and FlashAttention support. | +| Retain multiple checkpoints | Increase `save_total_limit`; optionally add model averaging or smoothing post-training. | +| Parameter-efficient tuning | Configure LoRA / prefix tuning in the `ModelConfig` (passed via KTOTrainer). | + +--- +## Troubleshooting + +| Issue | Symptoms | Mitigation | +|-------|----------|------------| +| OOM (GPU) | CUDA out of memory | Reduce batch size / seq length, enable gradient checkpointing (already on), use quantization. | +| Divergent loss | Loss spikes or NaNs | Lower LR, disable exotic kernels, check BF16 support, verify data labels. 
| +| Slow startup | Long dataset load | Ensure dataset is saved locally (disk I/O), increase `num_proc` during preprocessing only (not in training). | +| FlashAttention errors | Kernel build failures | Switch `attn_implementation` to `sdpa` or install matching CUDA toolkit & driver. | +| Checkpoints not saving | Missing directories | Verify write permissions & disk quota; ensure `output_dir` exists & not readonly. | +| No evaluation | Metrics absent | Set `do_eval: True` & `eval_strategy: steps` or `epoch`; supply proper eval split. | + +--- +## Security & Integrity Notes +- Trust only vetted model repos when `trust_remote_code=True`. +- Validate dataset integrity (hash or size) before large-scale training. +- For multi-tenant clusters, restrict write paths and use namespace-isolated caches. + +--- +## License & Attribution +- Code headers indicate Apache 2.0 licensing. +- Base model: `BAAI/OpenSeek-Small-v1-SFT` (refer to its upstream license & usage terms). +- Dataset: `AI-MO/NuminaMath-CoT` (comply with its license and any redistribution constraints). + +--- +## Summary +This `final/` pipeline layers preference-style alignment (KTO) atop an SFT base using a memory-efficient ZeRO-2 BF16 stack. It emphasizes reproducible dataset transformation, lean checkpoint retention, and modular extensibility for future preference formats, negative sampling, or evaluation automation. + +Happy aligning. 
diff --git a/openseek/competition/pz/losercheems/final/losercheems.pdf b/openseek/competition/pz/losercheems/final/losercheems.pdf new file mode 100644 index 0000000..fbdbf68 Binary files /dev/null and b/openseek/competition/pz/losercheems/final/losercheems.pdf differ diff --git a/openseek/competition/pz/losercheems/final/recipes/accelerate_configs/zero2.yaml b/openseek/competition/pz/losercheems/final/recipes/accelerate_configs/zero2.yaml new file mode 100644 index 0000000..f4df9d4 --- /dev/null +++ b/openseek/competition/pz/losercheems/final/recipes/accelerate_configs/zero2.yaml @@ -0,0 +1,21 @@ +compute_environment: LOCAL_MACHINE +debug: false +deepspeed_config: + deepspeed_multinode_launcher: standard + offload_optimizer_device: none + offload_param_device: none + zero3_init_flag: false + zero_stage: 2 +distributed_type: DEEPSPEED +downcast_bf16: 'no' +machine_rank: 0 +main_training_function: main +mixed_precision: bf16 +num_machines: 1 +num_processes: 8 +rdzv_backend: static +same_network: true +tpu_env: [] +tpu_use_cluster: false +tpu_use_sudo: false +use_cpu: false \ No newline at end of file diff --git a/openseek/competition/pz/losercheems/final/recipes/openseek/config.yaml b/openseek/competition/pz/losercheems/final/recipes/openseek/config.yaml new file mode 100644 index 0000000..2808fac --- /dev/null +++ b/openseek/competition/pz/losercheems/final/recipes/openseek/config.yaml @@ -0,0 +1,49 @@ +# Logging and Output arguments +log_level: info +logging_strategy: steps +logging_steps: 1 +report_to: +- tensorboard +save_strategy: steps +save_steps: 1000 +save_total_limit: 1 +output_dir: data/OpenSeek-1.4B-A0.4B-KTO +overwrite_output_dir: true + +# Model arguments +model_name_or_path: ./models/OpenSeek-Small-v1-SFT +model_revision: main +trust_remote_code: True +torch_dtype: bfloat16 +attn_implementation: flash_attention_2 + +# Data training arguments +dataset_name: /workspace/datasets/AI-MO/NuminaMath-CoT-preference +dataset_config: default +dataset_num_proc: 
8 +max_length: 4096 + +# KTO Trainer arguments +seed: 233 +do_train: True +num_train_epochs: 4 +per_device_train_batch_size: 8 +do_eval: False +eval_strategy: 'no' +eval_steps: 100 +per_device_eval_batch_size: 1 +optim: adamw_torch +learning_rate: 2.0e-5 +lr_scheduler_type: cosine_with_min_lr +lr_scheduler_kwargs: + min_lr_rate: 0.1 +warmup_ratio: 0.1 +weight_decay: 0.01 +gradient_accumulation_steps: 2 +gradient_checkpointing: true +gradient_checkpointing_kwargs: + use_reentrant: false +max_grad_norm: 1.0 +bf16: True +use_liger_kernel: True +use_liger_loss: True \ No newline at end of file diff --git a/openseek/competition/pz/losercheems/final/scripts/download.py b/openseek/competition/pz/losercheems/final/scripts/download.py new file mode 100644 index 0000000..ed3ace7 --- /dev/null +++ b/openseek/competition/pz/losercheems/final/scripts/download.py @@ -0,0 +1,16 @@ +from datasets import load_dataset, load_from_disk, concatenate_datasets, Dataset, DatasetDict +from transformers import AutoTokenizer, AutoModelForCausalLM +import torch + + +# export HF_ENDPOINT=https://hf-mirror.com +# export XDG_CACHE_HOME=./cache + +tokenizer = AutoTokenizer.from_pretrained("BAAI/OpenSeek-Small-v1-SFT", trust_remote_code=True) +tokenizer.save_pretrained("./models/OpenSeek-Small-v1-SFT") +model = AutoModelForCausalLM.from_pretrained("BAAI/OpenSeek-Small-v1-SFT", trust_remote_code=True).to(torch.bfloat16) +model.save_pretrained("./models/OpenSeek-Small-v1-SFT") + +numina_math_cot = load_dataset("AI-MO/NuminaMath-CoT", num_proc=4) +print(numina_math_cot) +numina_math_cot.save_to_disk("./datasets/AI-MO/NuminaMath-CoT") \ No newline at end of file diff --git a/openseek/competition/pz/losercheems/final/scripts/kto_datasets_process.py b/openseek/competition/pz/losercheems/final/scripts/kto_datasets_process.py new file mode 100644 index 0000000..90f4522 --- /dev/null +++ b/openseek/competition/pz/losercheems/final/scripts/kto_datasets_process.py @@ -0,0 +1,21 @@ + +from datasets import 
load_dataset, DatasetDict, concatenate_datasets, load_from_disk + +def process(example): + # kto + example["prompt"] = [ + example["messages"][0] + ] + example["completion"] = [ + example["messages"][1] + ] + + example["label"] = True + return example + +numina_math_cot = load_from_disk("/root/code/small-doge/datasets/AI-MO/NuminaMath-CoT") +print(numina_math_cot) +numina_math_cot = numina_math_cot.map(process, num_proc=4).select_columns(["prompt", "completion", "label"]) +print(numina_math_cot) +print(numina_math_cot["train"][0]) +numina_math_cot.save_to_disk("./datasets/AI-MO/NuminaMath-CoT-preference") \ No newline at end of file diff --git a/openseek/competition/pz/losercheems/final/train.sh b/openseek/competition/pz/losercheems/final/train.sh new file mode 100644 index 0000000..12038cb --- /dev/null +++ b/openseek/competition/pz/losercheems/final/train.sh @@ -0,0 +1,10 @@ +#!/bin/bash + +export HF_ENDPOINT=https://hf-mirror.com +export XDG_CACHE_HOME=cache +export WANDB_OFFLINE=true + +ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml ./trainer/kto.py --config recipes/openseek/config.yaml + +# tmux new -s openseek +# tmux attach -t openseek \ No newline at end of file diff --git a/openseek/competition/pz/losercheems/final/trainer/kto.py b/openseek/competition/pz/losercheems/final/trainer/kto.py new file mode 100644 index 0000000..76aaad1 --- /dev/null +++ b/openseek/competition/pz/losercheems/final/trainer/kto.py @@ -0,0 +1,176 @@ +import logging +import os +import sys + +import datasets +import torch +import transformers +from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed +from transformers.trainer_utils import get_last_checkpoint + +from trl import ( + ModelConfig, + ScriptArguments, + KTOConfig, + KTOTrainer, + TrlParser, + get_kbit_device_map, + get_peft_config, + get_quantization_config, +) + + +logger = logging.getLogger(__name__) + + +def main( + script_args: ScriptArguments, + 
training_args: KTOConfig, + model_args: ModelConfig, +): + # Set seed for reproducibility + set_seed(training_args.seed) + + ############### + # Setup logging + ############### + logging.basicConfig( + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%Y-%m-%d %H:%M:%S", + handlers=[logging.StreamHandler(sys.stdout)], + ) + log_level = training_args.get_process_log_level() + logger.setLevel(log_level) + datasets.utils.logging.set_verbosity(log_level) + transformers.utils.logging.set_verbosity(log_level) + transformers.utils.logging.enable_default_handler() + transformers.utils.logging.enable_explicit_format() + + # Log on each process a small summary + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" + + f" distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + logger.info(f"Model parameters {model_args}") + logger.info(f"Script parameters {script_args}") + logger.info(f"Data parameters {training_args}") + + ################ + # Load tokenizer + ################ + tokenizer = AutoTokenizer.from_pretrained( + model_args.model_name_or_path, + revision=model_args.model_revision, + trust_remote_code=model_args.trust_remote_code, + use_fast=True + ) + tokenizer.padding_side = "left" + if tokenizer.pad_token is None: + tokenizer.pad_token = tokenizer.eos_token + + ############### + # Load datasets + ############### + logger.info("Using processor for dataset mixing and processing") + dataset = datasets.load_from_disk(script_args.dataset_name) + + ################### + # Model init kwargs + ################### + logger.info("*** Initializing model kwargs ***") + torch_dtype = ( + model_args.torch_dtype if model_args.torch_dtype in ["auto", None] else getattr(torch, model_args.torch_dtype) + ) + quantization_config = get_quantization_config(model_args) + model_kwargs = dict( + revision=model_args.model_revision, + 
trust_remote_code=model_args.trust_remote_code, + attn_implementation=model_args.attn_implementation, + torch_dtype=torch_dtype, + use_cache=False if training_args.gradient_checkpointing else True, + device_map=get_kbit_device_map() if quantization_config is not None else None, + quantization_config=quantization_config, + ) + training_args.model_init_kwargs = model_kwargs + model = AutoModelForCausalLM.from_pretrained( + model_args.model_name_or_path, + **model_kwargs, + ) + ref_model = AutoModelForCausalLM.from_pretrained( + model_args.model_name_or_path, + **model_kwargs, + ) + + # Check for last checkpoint + last_checkpoint = None + if os.path.isdir(training_args.output_dir): + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info(f"Checkpoint detected, resuming training at {last_checkpoint=}.") + else: + logger.info("No checkpoint found, starting training from scratch.") + + ############################ + # Initialize the KTO Trainer + ############################ + training_args.model_init_kwargs = None + trainer = KTOTrainer( + model=model, + ref_model=ref_model, + args=training_args, + train_dataset=dataset[script_args.dataset_train_split], + eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None, + processing_class=tokenizer, + peft_config=get_peft_config(model_args), + ) + + ############### + # Training loop + ############### + logger.info("*** Start training... 
***") + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + metrics["train_samples"] = len(dataset[script_args.dataset_train_split]) + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + ################################## + # Save model and create model card + ################################## + logger.info("*** Saving model... ***") + trainer.save_model(training_args.output_dir) + logger.info(f"Model saved to {training_args.output_dir}") + + # Save everything else on main process + if trainer.accelerator.is_main_process: + trainer.create_model_card() + # Restore k,v cache for fast inference + trainer.model.config.use_cache = True + trainer.model.config.save_pretrained(training_args.output_dir) + + logger.info("*** Training complete ***") + + ########## + # Evaluate + ########## + if training_args.do_eval: + logger.info("*** Start evaluation... ***") + metrics = trainer.evaluate() + metrics["eval_samples"] = len(dataset[script_args.dataset_test_split]) + trainer.log_metrics("eval", metrics) + trainer.save_metrics("eval", metrics) + logger.info("*** Evaluation complete ***") + + logger.info("*** Training finished! 
***") + + +if __name__ == "__main__": + parser = TrlParser((ScriptArguments, KTOConfig, ModelConfig)) + script_args, training_args, model_args = parser.parse_args_and_config(fail_with_unknown_args=False) + main(script_args, training_args, model_args) diff --git a/openseek/competition/pz/losercheems/preliminary/README.md b/openseek/competition/pz/losercheems/preliminary/README.md new file mode 100644 index 0000000..34df6db --- /dev/null +++ b/openseek/competition/pz/losercheems/preliminary/README.md @@ -0,0 +1,207 @@ +# OpenSeek Continue Pretraining – Technical Report + +This document describes the three-stage pretraining workflow in this repository, focused on Ubuntu-based HPC clusters: + +1) Asset download and dataset building: `scripts/download.py` downloads the tokenizer and base model checkpoints, mixes multiple datasets by ratios, and performs sequence packing. +2) Distributed training: `recipes/accelerate_configs/ddp.yaml` (Accelerate DDP), `recipes/openseek/config.yaml` (training config), and `trainer/pt_dpsk.py` (training entry) are driven by `train.sh` on a cluster. Checkpoints are saved every 1,000 steps. +3) Model soup: `scripts/merge.py` merges the seven checkpoints with the base model using the average strategy. + +--- +## Public Checkpoint + +The continue pretraining checkpoint is released at: `JingzeShi/OpenSeek-1.4B-A0.4B`. + +Typical load snippet: +```python +from transformers import AutoModelForCausalLM, AutoTokenizer +model_id = "JingzeShi/OpenSeek-1.4B-A0.4B" +tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) +model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True) +``` +--- + +## Key Technical Points + +- Dataset mixing ratios must sum to 1.0 (validated) to meet the target total sample size after packing/truncation. +- Tokenizer guardrails: right padding; if no `pad_token`, fall back to `eos_token` to avoid shape issues. 
+- Packing vs truncation: + - Packing uses TRL `pack_dataset` to concatenate tokenized sequences up to `max_length` (higher throughput, fewer padding tokens). + - Truncation uses TRL `truncate_dataset` when packing is disabled. +- Reproducibility: set random seeds for dataset shuffling and training; minor variance is possible due to multi-process packing. +- DDP with bf16: use Accelerate `MULTI_GPU` + `mixed_precision: bf16` on GPUs that support BF16 (Ampere+). Adjust `num_processes` to the actual GPU count per node. +- Checkpointing: save every 1,000 steps (`save_steps: 1000`), with `save_total_limit` to bound disk usage. +- Model soup: `average` parameter-wise mean (only matching parameter names are merged). +- IO and performance: persist processed datasets to disk (Arrow) to reduce training-time overhead. + +--- + +## Directory Map and Key Files + +- `scripts/download.py` + - Downloads and saves base model/tokenizer to `./models/OpenSeek_Small_v1`. + - Mixes many dataset shards by ratios, tokenizes, and packs to `./datasets/OpenSeek-Pretrain-30B`. +- `processor/pt_datasets_process.py` + - Dataset ratio validation, optional formatting, tokenization, packing/truncation (`trl.pack_dataset`/`trl.truncate_dataset`), and parallel mapping. +- `recipes/accelerate_configs/ddp.yaml` + - Accelerate DDP configuration (single-node multi-GPU in the template; adapt to your cluster). +- `recipes/openseek/config.yaml` + - Trainer configuration (logging, save frequency, output dir, dtype, attn implementation, etc.). +- `trainer/pt_dpsk.py` + - Loads tokenizer/model, builds datasets via `mix_pt_datasets`, initializes `Trainer`, runs training/evaluation. +- `train.sh` + - Example Slurm batch script that sets CUDA/cuDNN modules/conda, mirrors, and launches training via Accelerate. +- `scripts/merge.py` + - Builds a model soup from base model and checkpoints using the `average` strategy. 
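
The `average` strategy named above is a parameter-wise mean. A minimal sketch over plain Python dicts (the real `scripts/merge.py` operates on torch state dicts, but the key-matching rule is the same):

```python
def average_soup(state_dicts):
    """Parameter-wise mean across checkpoints.

    Only parameter names present in every state dict are merged;
    non-matching keys are skipped, as in scripts/merge.py.
    """
    common = set(state_dicts[0]).intersection(*map(set, state_dicts[1:]))
    return {k: sum(sd[k] for sd in state_dicts) / len(state_dicts)
            for k in sorted(common)}
```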
+ +--- + +## Environment and Dependencies + +- OS: Ubuntu 20.04/22.04 +- GPU: NVIDIA A800 +- CUDA/cuDNN: match your cluster modules (example uses CUDA 12.2, cuDNN 9.8) +- Python: 3.10+ +- Core packages: + - torch (CUDA build), transformers, datasets, accelerate, trl + - sentencepiece (if tokenizer requires), safetensors (recommended) +- Optional environment variables: + - `HF_ENDPOINT=https://hf-mirror.com` (if you use a mirror) + - `XDG_CACHE_HOME=cache` (to consolidate cache location) + +Example (adapt to your module system/conda env): + +```bash +# create and activate a conda env (if not pre-provisioned by your cluster) +conda create -y -n train python=3.10 && conda activate train + +# install essentials (pin versions per your infra policies) +pip install --upgrade pip +pip install torch --index-url https://download.pytorch.org/whl/cu124 # example; match your CUDA +pip install transformers datasets accelerate trl sentencepiece safetensors +``` + +--- + +## Step 1: Download Base Model/Tokenizer and Build Mixed Packed Dataset + +Script: `scripts/download.py` + +- Loads `BAAI/OpenSeek-Small-v1`, saves model/tokenizer to `./models/OpenSeek_Small_v1`. +- Defines `datasets_and_ratios` (a list of `{dataset_name: ratio}`) summing to 1.0. +- Sets `total_sample_size=7_500_000`, `max_length=4096`, `packing=True`, `dataset_num_proc=4`, `seed=233`. +- Calls `mix_pt_datasets` (aliased from `pt_datasets_process.py`) to: + - Load each dataset shard (Hub path or local `load_from_disk` path). + - Optionally apply a `formatting_func`. + - Tokenize with right padding (fallback `pad_token=eos_token` if missing). + - Pack or truncate sequences based on `max_length`. + - Sample and concatenate by ratio to hit the target size. +- Saves the processed dataset to `./datasets/OpenSeek-Pretrain-30B` (Arrow format). 
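
The mixing contract can be made concrete. `samples_per_dataset` below is an illustrative reading of how a `total_sample_size` of 7,500,000 is apportioned across shards; the ratio check mirrors `validate_dataset_ratios` in `processor/pt_datasets_process.py`, while the exact rounding inside `mix_pt_datasets` may differ:

```python
def samples_per_dataset(total_sample_size, datasets_and_ratios):
    """Split a target sample count across datasets by ratio.

    datasets_and_ratios follows the download.py convention:
    a list of single-entry dicts, e.g. [{"ds_a": 0.6}, {"ds_b": 0.4}].
    """
    total_ratio = sum(r for d in datasets_and_ratios for r in d.values())
    if abs(total_ratio - 1.0) > 1e-6:
        raise ValueError(f"Total ratio must be 1.0, got {total_ratio}")
    return {name: round(total_sample_size * ratio)
            for d in datasets_and_ratios for name, ratio in d.items()}
```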
+ +Run (on Ubuntu cluster login/compute node): + +```bash +python scripts/download.py +``` + +Notes: +- With `packing=True`, the “sample size” refers to packed sequences, not raw documents. +- Adjust `max_length` and downstream batch size/grad-accumulation for available VRAM. +- Ensure sufficient disk space for cached and processed datasets. + +--- + +## Step 2: Distributed Training (DDP + Accelerate + Trainer) + +Core configs/scripts: +- `recipes/accelerate_configs/ddp.yaml` + - `distributed_type: MULTI_GPU`, `mixed_precision: bf16`, `gpu_ids: all`. + - Set `num_processes` to the actual GPU count per node (e.g., 4 or 8). +- `recipes/openseek/config.yaml` + - Logging: `logging_steps: 1`, `report_to: [tensorboard]`. + - Saving: `save_strategy: steps`, `save_steps: 1000`, `save_total_limit: 10`. + - Output: `output_dir: data/OpenSeek-1.4B-A0.4B`, `overwrite_output_dir: true`. + - Model: `model_name_or_path: ./models/OpenSeek_Small_v1`, `torch_dtype: bfloat16`, `trust_remote_code: true`. + - Attention: `attn_implementation: sdpa` (ensure compatibility with your CUDA/cuDNN stack). +- `trainer/pt_dpsk.py` + - Loads tokenizer, builds dataset via `mix_pt_datasets`, sets up `Trainer`, trains and evaluates. +- `train.sh` + - Example Slurm script that configures CUDA modules/conda and launches Accelerate. + +Submit training job (Slurm): + +```bash +sbatch train.sh +``` + +Outputs: +- Checkpoints at `data/OpenSeek-1.4B-A0.4B/checkpoint-1000`, `...-2000`, ..., `...-7000`. +- TensorBoard logs (if enabled) for monitoring. + +Recommendations: +- Make `num_processes` in `ddp.yaml` match `--gpus` allocated by Slurm. +- Ensure `output_dir` has sufficient free space; prune old checkpoints with `save_total_limit`. +- Tune effective batch size (global batch = per-device batch × num GPUs × grad-accum) to stabilize training. 
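
To make the effective-batch formula above concrete (the per-device batch and grad-accum values here are hypothetical examples, not read from the preliminary config):

```python
import math

def global_batch(per_device_batch: int, num_gpus: int, grad_accum: int) -> int:
    # global batch = per-device batch x num GPUs x grad-accum
    return per_device_batch * num_gpus * grad_accum

def optimizer_steps_per_epoch(num_samples: int, gbs: int) -> int:
    # a final partial batch still triggers an optimizer step
    return math.ceil(num_samples / gbs)

# e.g. per-device batch 8 on 8 GPUs with grad-accum 2 -> global batch 128,
# so one epoch over 7,500,000 packed sequences takes ceil(7.5e6 / 128) steps
gbs = global_batch(8, 8, 2)
steps = optimizer_steps_per_epoch(7_500_000, gbs)
```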
+ +--- + +## Step 3: Model Soup (Checkpoint Merging) + +Script: `scripts/merge.py` + +- Merges base model `./models/OpenSeek_Small_v1` with checkpoints `data/OpenSeek-1.4B-A0.4B/checkpoint-1000..7000`. +- Strategy: + - `average`: unweighted parameter-wise mean across all checkpoints. +- Only matching parameter keys are merged; others are skipped. +- Saves merged model, tokenizer, config, and `merge_info.json` to output directory (default `./models/OpenSeek_Small_v1-merged-average`). + +Run: + +```bash +python scripts/merge.py +``` + +Notes: +- BF16 saves reduce disk footprint; convert to FP32 if you need full precision checkpoints for downstream tasks. +- Inspect merge logs for matched key counts to verify architecture compatibility. + +--- + +## End-to-End Repro (Commands) + +1) Build datasets and cache base assets: +```bash +python scripts/download.py +``` +2) Launch distributed training (Slurm): +```bash +sbatch train.sh +``` +3) Merge checkpoints into a model soup: +```bash +python scripts/merge.py +``` + +--- + +## Troubleshooting + +- Ratios do not sum to 1.0: + - Fix `datasets_and_ratios`; `download.py` prints total ratio for verification. +- OOM (GPU memory): + - Reduce `max_length`/`per_device_train_batch_size`, increase `gradient_accumulation_steps`, enable gradient checkpointing. +- DDP process count mismatch: + - Align `ddp.yaml:num_processes` with the actual GPU count allocated by Slurm. +- BF16 not supported: + - Switch to FP16 or FP32 and adjust configs accordingly (expect perf/throughput changes). +- Slow data loading: + - Persist datasets to disk (Arrow), increase `dataset_num_proc`, and ensure fast storage. + +--- + +## Reproducibility + +- Seeds are set for both dataset preparation and training. Minor nondeterminism may remain due to multi-processing and packing. +- Pin package versions and keep hardware/software constant to improve reproducibility. 
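For reference, the Step 3 `average` strategy reduces to a parameter-wise mean with a per-key contribution count. A toy sketch over plain floats (the real `scripts/merge.py` applies the same idea to `state_dict` tensors):

```python
# Parameter-wise "model soup" averaging: each key is divided by the
# number of checkpoints that actually contained it, so keys missing
# from some checkpoints are averaged over fewer contributors rather
# than being dragged toward zero.
from collections import OrderedDict


def soup_average(state_dicts):
    sums, counts = OrderedDict(), {}
    for sd in state_dicts:
        for key, value in sd.items():
            sums[key] = sums.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1
    return OrderedDict((k, sums[k] / counts[k]) for k in sums)


ckpts = [{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 2.0}, {"w": 5.0, "b": 4.0}]
print(dict(soup_average(ckpts)))  # {'w': 3.0, 'b': 2.0}
```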
+ +--- + diff --git a/openseek/competition/pz/losercheems/preliminary/losercheems.pdf b/openseek/competition/pz/losercheems/preliminary/losercheems.pdf new file mode 100644 index 0000000..db2385e Binary files /dev/null and b/openseek/competition/pz/losercheems/preliminary/losercheems.pdf differ diff --git a/openseek/competition/pz/losercheems/preliminary/processor/pt_datasets_process.py b/openseek/competition/pz/losercheems/preliminary/processor/pt_datasets_process.py new file mode 100644 index 0000000..f212c5a --- /dev/null +++ b/openseek/competition/pz/losercheems/preliminary/processor/pt_datasets_process.py @@ -0,0 +1,346 @@ +from typing import List, Dict, Optional, Union, Callable +import json +import logging +import warnings +import re +from datasets import Dataset, IterableDataset, DatasetDict, load_dataset, load_from_disk, concatenate_datasets +from transformers import AutoTokenizer, PreTrainedTokenizerBase +from trl.data_utils import pack_dataset, truncate_dataset +from argparse import ArgumentParser + + +# Configure logger +logging.basicConfig( + level=logging.INFO, + format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', + datefmt='%Y-%m-%d %H:%M:%S' +) +logger = logging.getLogger(__name__) + + +def validate_dataset_ratios(datasets_and_ratios: List[Dict[str, float]]) -> None: + """Validate that dataset ratios are properly formatted and sum to 1.0.""" + if not datasets_and_ratios: + raise ValueError("datasets_and_ratios cannot be empty") + + total_ratio = 0.0 + for dataset_dict in datasets_and_ratios: + if not isinstance(dataset_dict, dict) or len(dataset_dict) != 1: + raise ValueError("Each item in datasets_and_ratios must be a dictionary with exactly one key-value pair") + + ratio = list(dataset_dict.values())[0] + if not isinstance(ratio, (int, float)) or ratio <= 0: + raise ValueError(f"Ratio must be a positive number, got {ratio}") + + total_ratio += ratio + + if abs(total_ratio - 1.0) > 1e-6: + raise ValueError(f"Total ratio must be 1.0, but 
got {total_ratio}. Please check your ratios.")
+
+
+def validate_tokenizer(tokenizer: PreTrainedTokenizerBase) -> None:
+    """Validate tokenizer configuration."""
+    if tokenizer.pad_token is None and tokenizer.eos_token is not None:
+        logger.warning("Tokenizer has no pad_token, using eos_token as pad_token")
+        tokenizer.pad_token = tokenizer.eos_token
+
+
+def prepare_dataset(
+    dataset: Union[Dataset, IterableDataset],
+    dataset_name: str,
+    dataset_text_field: str,
+    processing_class: Union[PreTrainedTokenizerBase],
+    max_length: Optional[int],
+    packing: Optional[bool],
+    formatting_func: Optional[Callable[[dict], str]],
+    dataset_num_proc: Optional[int],
+) -> Union[Dataset, IterableDataset]:
+
+    # If the dataset is already preprocessed, skip the processing step
+    column_names = list(next(iter(dataset)).keys())
+    is_processed = "input_ids" in column_names
+
+    # Build the kwargs for the `map` function
+    map_kwargs = {}
+    if isinstance(dataset, Dataset): # IterableDataset does not support num_proc
+        map_kwargs["num_proc"] = dataset_num_proc
+
+    # Apply the formatting function if any
+    if formatting_func is not None and is_processed:
+        warnings.warn(
+            "You passed a dataset that is already processed (contains an `input_ids` field) together with a "
+            "formatting function. Therefore `formatting_func` will be ignored. 
Either remove the " + "`formatting_func` or pass a dataset that is not already processed.", + UserWarning, + ) + + if formatting_func is not None and not is_processed: + if isinstance(dataset, Dataset): # `IterableDataset.map` does not support `desc` + map_kwargs["desc"] = f"Applying formatting function to {dataset_name} dataset" + + batched = isinstance(formatting_func(next(iter(dataset))), list) + + def _func(example): + return {"text": formatting_func(example)} + + dataset = dataset.map(_func, batched=batched, **map_kwargs) + + + if not is_processed: + + # Tokenize the dataset if needed + if isinstance(dataset, Dataset): # `IterableDataset.map` does not support `desc` + map_kwargs["desc"] = f"Tokenizing {dataset_name} dataset" + + def tokenize(example, processing_class, dataset_text_field): + try: + processed = processing_class(text=example[dataset_text_field]) + if ( + processing_class.eos_token_id is not None + and len(processed["input_ids"]) > 0 + and processed["input_ids"][-1] != processing_class.eos_token_id + ): + processed["input_ids"] = processed["input_ids"] + [processing_class.eos_token_id] + processed["attention_mask"] = processed["attention_mask"] + [1] + return processed + except Exception as e: + logger.error(f"Error tokenizing example: {e}") + # Return empty tokenization on error + return { + "input_ids": [processing_class.eos_token_id] if processing_class.eos_token_id is not None else [], + "attention_mask": [1] if processing_class.eos_token_id is not None else [] + } + + dataset = dataset.map( + tokenize, + fn_kwargs={"processing_class": processing_class, "dataset_text_field": dataset_text_field}, + **map_kwargs, + ) + + # Pack or truncate + if packing: + if max_length is None: + raise ValueError("When packing is enabled, `max_length` can't be `None`.") + if isinstance(dataset, Dataset): # `IterableDataset.map` does not support `desc` + map_kwargs["desc"] = f"Packing {dataset_name} dataset" + dataset = dataset.select_columns("input_ids") + 
dataset = pack_dataset(dataset, seq_length=max_length, strategy="bfd", map_kwargs=map_kwargs) + elif max_length is not None: + if isinstance(dataset, Dataset): # `IterableDataset.map` does not support `desc` + map_kwargs["desc"] = f"Truncating {dataset_name} dataset" + dataset = truncate_dataset(dataset, max_length, map_kwargs) + return dataset + + +def mix_datasets_by_ratio( + datasets_and_ratios: List[Dict[str, float]], + total_sample_size: int, + dataset_text_field: str, + processing_class: Union[PreTrainedTokenizerBase], + max_length: Optional[int], + packing: Optional[bool], + formatting_func: Optional[Callable[[dict], str]], + dataset_num_proc: Optional[int], + seed: Optional[int] = None, + cache_dir: Optional[str] = None, +): + """ + Mix multiple datasets by ratio. + + Args: + datasets_and_ratios: List of dictionaries, each containing a dataset and its ratio. + Each dictionary contains one key-value pair where key is the dataset name and value is the mixing ratio. + total_sample_size: Total sample size for the mixed training dataset. + dataset_text_field: Name of the field in the dataset that contains the text. + processing_class: Tokenizer class used for processing the text. + max_length: Maximum length of processed sequences. Set to None for no limit. + packing: Whether to pack sequences for efficiency. + formatting_func: Optional formatting function to convert dataset entries to the desired text format. + dataset_num_proc: Number of processes to use for dataset processing. + seed: Random seed for dataset shuffling to ensure reproducibility. + cache_dir: Directory to cache the datasets. + + Returns: + DatasetDict: A dictionary containing all mixed and processed dataset splits. 
+ + Example: + ```python + from transformers import AutoTokenizer + + # Define datasets and their mixing ratios + datasets_and_ratios = [ + {"SmallDoge/MiniCorpus:web-en": 0.5}, + {"SmallDoge/MiniCorpus:web-zh": 0.2}, + {"SmallDoge/MiniCorpus:textbook-en": 0.15}, + {"SmallDoge/MiniCorpus:textbook-zh": 0.05}, + {"SmallDoge/MiniCorpus:code": 0.05}, + {"SmallDoge/MiniCorpus:math": 0.05}, + ] + + # Create tokenizer + tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-tokenizer") + + # Mix datasets + mixed_dataset = mix_datasets_by_ratio( + datasets_and_ratios=datasets_and_ratios, + total_sample_size=10000, + dataset_text_field="text", + processing_class=tokenizer, + max_length=2048, + packing=True, + formatting_func=None, + dataset_num_proc=4, + seed=42, + cache_dir="./cache", + ) + print(mixed_dataset) + ```""" + # Validate input parameters + validate_dataset_ratios(datasets_and_ratios) + + # Check if the dataset ratios sum to 1.0 (redundant but kept for backwards compatibility) + total_ratio = sum([list(dataset.values())[0] for dataset in datasets_and_ratios]) + if abs(total_ratio - 1.0) > 1e-6: + raise ValueError(f"Total ratio must be 1.0, but got {total_ratio}. 
Please check your ratios.") + + final_mixed_dataset = {} + + for dataset_and_ratio in datasets_and_ratios: + dataset_name, ratio = dataset_and_ratio.popitem() + + # Check subset name + windows_drive_pattern = r'^[a-zA-Z]:.*' + is_windows_path = bool(re.match(windows_drive_pattern, dataset_name)) + if ":" in dataset_name and not is_windows_path: + dataset_name, subset_name = dataset_name.split(":") + else: + subset_name = None + + # If the dataset is a string, load it from the hub or disk + if isinstance(dataset_name, str): + if re.match(r"^[^/]+/[^/]+$", dataset_name): + dataset = load_dataset(dataset_name, name=subset_name, cache_dir=cache_dir) + else: + if subset_name is not None: + warnings.warn( + f"You passed a local dataset path, subsetting is not supported, ignoring subset name: {subset_name}", + UserWarning, + ) + dataset = load_from_disk(dataset_name) + + # Process each split of the dataset + for split_name, split_dataset in dataset.items(): + split_dataset = split_dataset.select_columns(["input_ids"]) + logger.info(f"Original dataset size for {dataset_name}: {subset_name}: {split_name}: {len(split_dataset)}") + # Process the dataset from text to input_ids + split_dataset = prepare_dataset( + split_dataset, + dataset_name=f"{dataset_name}: {split_name}" if subset_name is None else f"{dataset_name}: {subset_name}: {split_name}", + dataset_text_field=dataset_text_field, + processing_class=processing_class, + max_length=max_length, + packing=packing, + formatting_func=formatting_func, + dataset_num_proc=dataset_num_proc, + ) + + # Calculate the target size for the dataset + if total_sample_size == -1: + target_size = len(split_dataset) + else: + target_size = int(total_sample_size * ratio) if split_name == "train" else len(split_dataset) + current_size = len(split_dataset) + logger.info(f"Processed dataset size for {dataset_name}: {split_name}: {current_size}") + logger.info(f"Target size for {dataset_name}: {split_name}: {target_size}") + + # If the dataset 
is smaller than the target size, repeat it + if current_size < target_size: + logger.warning( + f"Dataset {dataset_name}: {split_name} is smaller than the target size. " + f"Repeating the dataset to reach the target size." + ) + indices = [] + full_copies = target_size // current_size + remainder = target_size % current_size + + for _ in range(full_copies): + indices.extend(range(current_size)) + if remainder > 0: + indices.extend(range(remainder)) + + split_dataset = split_dataset.select(indices) + else: + logger.warning( + f"Dataset {dataset_name}: {split_name} is larger than the target size. " + f"Truncating the dataset to reach the target size." + ) + split_dataset = split_dataset.select(range(target_size)) + + # Concatenate the split dataset with the final mixed dataset + if split_name in final_mixed_dataset: + final_mixed_dataset[split_name] = concatenate_datasets( + [final_mixed_dataset[split_name], split_dataset] + ) + else: + final_mixed_dataset[split_name] = split_dataset + + # Shuffle the train dataset + if "train" in final_mixed_dataset: + final_mixed_dataset["train"] = final_mixed_dataset["train"].shuffle(seed) + + # Create a DatasetDict with the merged datasets + final_dataset = DatasetDict(final_mixed_dataset) + return final_dataset + + +def main(args): + # Load the tokenizer + tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name_or_path) + validate_tokenizer(tokenizer) + + # Mix datasets + mixed_dataset = mix_datasets_by_ratio( + datasets_and_ratios=args.datasets_and_ratios, + total_sample_size=args.total_sample_size, + dataset_text_field=args.dataset_text_field, + processing_class=tokenizer, + max_length=args.max_length, + packing=args.packing, + formatting_func=None, + dataset_num_proc=args.dataset_num_proc, + seed=args.seed, + cache_dir=args.cache_dir, + ) + + # Save the mixed dataset + mixed_dataset.save_to_disk(args.dataset_save_path) + print(f"Mixed dataset saved to {args.dataset_save_path}") + +if __name__ == "__main__": + argparser 
= ArgumentParser() + argparser.add_argument("--datasets_and_ratios", type=str, required=True, + help="JSON string of list of dictionaries with dataset names and mixing ratios") + argparser.add_argument("--dataset_save_path", type=str, required=True, + help="Path to save the mixed dataset") + argparser.add_argument("--total_sample_size", type=int, required=True, + help="Total sample size for the mixed training dataset") + argparser.add_argument("--dataset_text_field", type=str, required=True, + help="Name of the field in the dataset that contains the text") + argparser.add_argument("--tokenizer_name_or_path", type=str, required=True, + help="Tokenizer name or path") + argparser.add_argument("--max_length", type=int, default=2048, + help="Maximum length of processed sequences") + argparser.add_argument("--packing", action="store_true", + help="Whether to pack sequences for efficiency") + argparser.add_argument("--dataset_num_proc", type=int, default=4, + help="Number of processes for dataset processing") + argparser.add_argument("--seed", type=int, default=42, + help="Random seed for reproducibility") + argparser.add_argument("--cache_dir", type=str, default="./cache", + help="Directory to cache datasets") + args = argparser.parse_args() + + # Parse datasets_and_ratios from JSON string + args.datasets_and_ratios = json.loads(args.datasets_and_ratios) + + main(args) diff --git a/openseek/competition/pz/losercheems/preliminary/recipes/accelerate_configs/ddp.yaml b/openseek/competition/pz/losercheems/preliminary/recipes/accelerate_configs/ddp.yaml new file mode 100644 index 0000000..4f05571 --- /dev/null +++ b/openseek/competition/pz/losercheems/preliminary/recipes/accelerate_configs/ddp.yaml @@ -0,0 +1,16 @@ +compute_environment: LOCAL_MACHINE +debug: false +distributed_type: MULTI_GPU +downcast_bf16: 'no' +gpu_ids: all +machine_rank: 0 +main_training_function: main +mixed_precision: bf16 +num_machines: 1 +num_processes: 8 +rdzv_backend: static +same_network: true 
+tpu_env: [] +tpu_use_cluster: false +tpu_use_sudo: false +use_cpu: false diff --git a/openseek/competition/pz/losercheems/preliminary/recipes/openseek/config.yaml b/openseek/competition/pz/losercheems/preliminary/recipes/openseek/config.yaml new file mode 100644 index 0000000..93e11e3 --- /dev/null +++ b/openseek/competition/pz/losercheems/preliminary/recipes/openseek/config.yaml @@ -0,0 +1,189 @@ +# Logging and Output arguments +log_level: info +logging_strategy: steps +logging_steps: 1 +report_to: +- tensorboard +save_strategy: steps +save_steps: 1000 +save_total_limit: 10 +output_dir: data/OpenSeek-1.4B-A0.4B +overwrite_output_dir: true + +# Model arguments +model_config: + # Basic model configuration + vocab_size: 151851 + hidden_size: 1280 + intermediate_size: 7168 + num_hidden_layers: 6 + hidden_act: "silu" + use_cache: true + tie_word_embeddings: true + max_position_embeddings: 4096 + + # Attention configuration + attention_bias: false + attention_dropout: 0.0 + num_attention_heads: 10 + num_key_value_heads: 10 + # head_dim: 128 + qk_nope_head_dim: 128 + qk_rope_head_dim: 64 + v_head_dim: 128 + kv_lora_rank: 512 + q_lora_rank: null + + # RoPE configuration + rope_theta: 1000000 + rope_scaling: null + rope_interleave: true + + # MoE configuration + moe_intermediate_size: 896 + n_routed_experts: 64 + n_shared_experts: 2 + num_experts_per_tok: 6 + first_k_dense_replace: 1 + norm_topk_prob: true + topk_group: 1 + routed_scaling_factor: 2.446 + n_group: 1 + + # Other configuration + rms_norm_eps: 1.0e-6 + initializer_range: 0.006 + bos_token_id: 0 + eos_token_id: 1 + pretraining_tp: 1 + attn_implementation: sdpa + +model_name_or_path: ./models/OpenSeek_Small_v1 +torch_dtype: bfloat16 +trust_remote_code: True + +# # Data training arguments +# datasets_and_ratios: +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-high_part_142_text_document: 0.011068 +# - 
/workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-low_part_62_text_document: 0.003577 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-mid_part_189_text_document: 0.007775 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-high_part_76_text_document: 0.002859 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-low_part_124_text_document: 0.001672 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-mid_part_29_text_document: 0.002339 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-high_part_244_text_document: 0.005397 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-low_part_150_text_document: 0.004064 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-mid_part_444_text_document: 0.005005 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-high_part_498_text_document: 0.004616 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-low_part_10_text_document: 0.00067 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-mid_part_144_text_document: 0.003429 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-high_part_86_text_document: 0.00261 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-low_part_133_text_document: 0.001824 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-mid_part_139_text_document: 0.002313 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-high_part_47_text_document: 0.008237 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-low_part_11_text_document: 0.002866 +# - 
/workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-mid_part_97_text_document: 0.00667 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-high_part_43_text_document: 0.004657 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-low_part_10_text_document: 0.002005 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-mid_part_164_text_document: 0.004317 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-high_part_92_text_document: 0.011397 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-low_part_113_text_document: 0.006782 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-mid_part_563_text_document: 0.009175 +# - /workspace/datasets/OpenSeek-Pretrain-100B/arxiv_007_00000_text_document: 0.006414 +# - /workspace/datasets/OpenSeek-Pretrain-100B/books_016_00007_text_document: 0.004696 +# - /workspace/datasets/OpenSeek-Pretrain-100B/code-high_part_13_text_document: 0.010102 +# - /workspace/datasets/OpenSeek-Pretrain-100B/code-low_part_36_text_document: 0.011403 +# - /workspace/datasets/OpenSeek-Pretrain-100B/code-mid_part_37_text_document: 0.009674 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-high_23_text_document: 0.003755 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-low_51_text_document: 0.000499 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_118_text_document: 0.003608 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_176_text_document: 0.003623 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_256_text_document: 0.003704 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_320_text_document: 0.003733 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_32_text_document: 0.003631 +# - 
/workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-high_1_text_document: 0.002573 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-low_2_text_document: 0.001638 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-mid_3_text_document: 0.003251 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-high_2_text_document: 0.060237 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-low_1_text_document: 0.089063 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-mid_2_text_document: 0.101376 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_code-high_4_text_document: 0.004598 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_code-low_6_text_document: 0.006857 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_code-mid_23_text_document: 0.00899 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_math-high_12_text_document: 0.013135 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_math-low_3_text_document: 0.01653 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_math-mid_5_text_document: 0.003536 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_wiki-high_5_text_document: 0.006314 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_wiki-low_5_text_document: 0.005978 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_wiki-mid_4_text_document: 0.007909 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_CC-high_74_text_document: 0.002225 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_CC-low_54_text_document: 0.001797 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_CC-mid_275_text_document: 0.002042 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-high_4_text_document: 0.004081 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-low_2_text_document: 0.001659 +# - 
/workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-mid_6_text_document: 0.012828 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_arxiv-high_2_text_document: 0.0568 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_arxiv-low_1_text_document: 0.074907 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_arxiv-mid_1_text_document: 0.089359 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_code-high_13_text_document: 0.007663 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_code-low_9_text_document: 0.004052 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_code-mid_6_text_document: 0.001916 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_math-high_11_text_document: 0.005074 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_math-low_11_text_document: 0.006437 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_math-mid_29_text_document: 0.006406 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_wiki-high_4_text_document: 0.004 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_wiki-low_6_text_document: 0.003564 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_wiki-mid_3_text_document: 0.005768 +# - /workspace/datasets/OpenSeek-Pretrain-100B/math-high_part_04_text_document: 0.018165 +# - /workspace/datasets/OpenSeek-Pretrain-100B/math-low_part_10_text_document: 0.01694 +# - /workspace/datasets/OpenSeek-Pretrain-100B/math-mid_part_07_text_document: 0.016311 +# - /workspace/datasets/OpenSeek-Pretrain-100B/pes2o-full-train_train-0041-of-0136_text_document: 0.00687 +# - /workspace/datasets/OpenSeek-Pretrain-100B/pes2o-full-train_train-0125-of-0136_text_document: 0.007387 +# - /workspace/datasets/OpenSeek-Pretrain-100B/pes2o-full-val_valid-0034-of-0060_text_document: 0.000143 +# - /workspace/datasets/OpenSeek-Pretrain-100B/pes2o_pubmedcentral_3_text_document: 0.061982 +# - 
/workspace/datasets/OpenSeek-Pretrain-100B/stack_018_00000_text_document: 0.004229 +# - /workspace/datasets/OpenSeek-Pretrain-100B/wiki_012_00000_text_document: 0.004202 +# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-high-loss0_part_28_text_document: 0.018171 +# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-high-loss1_part_59_text_document: 0.009776 +# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-high-loss2_part_16_text_document: 0.003725 +# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-medidum-loss0_part_192_text_document: 0.009492 +# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-medidum-loss1_part_550_text_document: 0.009236 +# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-medidum-loss2_part_71_text_document: 0.010643 + +datasets_and_ratios: + - /workspace/datasets/OpenSeek-Pretrain-30B: 1.0 + +total_sample_size: 7500000 +dataset_text_field: "text" +max_length: 4096 +packing: true +dataset_num_proc: 4 +cache_dir: "./cache" + +# PT trainer arguments +seed: 233 +do_train: True +max_steps: 7000 +per_device_train_batch_size: 1 +do_eval: False +eval_strategy: 'no' +eval_steps: 100 +per_device_eval_batch_size: 1 +optim: adamw_torch_fused +adam_beta1: 0.9 +adam_beta2: 0.95 +adam_epsilon: 1.0e-8 +learning_rate: 2.0e-5 +lr_scheduler_type: warmup_stable_decay +lr_scheduler_kwargs: + warmup_type: linear + decay_type: linear + num_decay_steps: 0 + min_lr_ratio: 0.0 +warmup_steps: 0 +weight_decay: 0.01 +gradient_accumulation_steps: 128 +gradient_checkpointing: false +gradient_checkpointing_kwargs: + use_reentrant: false +max_grad_norm: 1.0 +bf16: True diff --git a/openseek/competition/pz/losercheems/preliminary/scripts/download.py b/openseek/competition/pz/losercheems/preliminary/scripts/download.py new file mode 100644 index 0000000..97c0b22 --- /dev/null +++ b/openseek/competition/pz/losercheems/preliminary/scripts/download.py @@ -0,0 +1,105 @@ +from ..processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets +from 
transformers import AutoTokenizer, AutoModelForCausalLM +import torch + +# export HF_ENDPOINT=https://hf-mirror.com +# export XDG_CACHE_HOME=cache + + + +if __name__ == "__main__": + + datasets_and_ratios = [ + + {"JingzeShi/OpenSeek-Pretrain-100B:Nemotron-CC-high-actual-actual-high_part_142_text_document": 0.02242}, + {"JingzeShi/OpenSeek-Pretrain-100B:Nemotron-CC-high-synthetic-distill-high_part_76_text_document": 0.00687}, + {"JingzeShi/OpenSeek-Pretrain-100B:Nemotron-CC-high-synthetic-diverse_qa_pairs-high_part_244_text_document": 0.014466}, + {"JingzeShi/OpenSeek-Pretrain-100B:Nemotron-CC-high-synthetic-extract_knowledge-high_part_498_text_document": 0.008715}, + {"JingzeShi/OpenSeek-Pretrain-100B:Nemotron-CC-high-synthetic-knowledge_list-high_part_86_text_document": 0.006747}, + {"JingzeShi/OpenSeek-Pretrain-100B:Nemotron-CC-high-synthetic-wrap_medium-high_part_47_text_document": 0.017773}, + {"JingzeShi/OpenSeek-Pretrain-100B:Nemotron-CC-low-synthetic-wrap_medium-high_part_43_text_document": 0.010979}, + {"JingzeShi/OpenSeek-Pretrain-100B:Nemotron-CC-medium-actual-actual-high_part_92_text_document": 0.027354}, + + + {"JingzeShi/OpenSeek-Pretrain-100B:arxiv_007_00000_text_document": 0.006414}, + {"JingzeShi/OpenSeek-Pretrain-100B:books_016_00007_text_document": 0.004696}, + {"JingzeShi/OpenSeek-Pretrain-100B:code-high_part_13_text_document": 0.031179}, + + + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis2_CC-high_23_text_document": 0.022553}, + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis2_OpenSource-high_1_text_document": 0.007462}, + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis2_arxiv-high_2_text_document": 0.250676}, + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis2_code-high_4_text_document": 0.020445}, + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis2_math-high_12_text_document": 0.033201}, + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis2_wiki-high_5_text_document": 0.020201}, + 
{"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis_CC-high_74_text_document": 0.006064}, + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis_OpenSource-high_4_text_document": 0.018568}, + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis_arxiv-high_2_text_document": 0.221066}, + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis_code-high_13_text_document": 0.013631}, + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis_math-high_11_text_document": 0.017917}, + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis_wiki-high_4_text_document": 0.017917}, + + + {"JingzeShi/OpenSeek-Pretrain-100B:math-high_part_04_text_document": 0.051416}, + + + {"JingzeShi/OpenSeek-Pretrain-100B:pes2o-full-train_train-0041-of-0136_text_document": 0.00687}, + {"JingzeShi/OpenSeek-Pretrain-100B:pes2o-full-train_train-0125-of-0136_text_document": 0.007387}, + {"JingzeShi/OpenSeek-Pretrain-100B:pes2o-full-val_valid-0034-of-0060_text_document": 0.000143}, + {"JingzeShi/OpenSeek-Pretrain-100B:pes2o_pubmedcentral_3_text_document": 0.061982}, + {"JingzeShi/OpenSeek-Pretrain-100B:stack_018_00000_text_document": 0.004229}, + {"JingzeShi/OpenSeek-Pretrain-100B:wiki_012_00000_text_document": 0.004202}, + + + {"JingzeShi/OpenSeek-Pretrain-100B:zh_cc-high-loss0_part_28_text_document": 0.031672}, + {"JingzeShi/OpenSeek-Pretrain-100B:zh_cc-high-loss1_part_59_text_document": 0.029371}, + + + ] + + def calculate_total_ratio(datasets_and_ratios): + return sum(item for item in datasets_and_ratios.values()) + + total_ratio = sum(calculate_total_ratio(dataset) for dataset in datasets_and_ratios) + print(f"Total ratio: {total_ratio}") + + total_sample_size = 7_500_000 + dataset_text_field = "text" + max_length = 4096 + packing = True + dataset_num_proc = 4 + cache_dir = "./cache" + seed = 233 + model_name_or_path = "BAAI/OpenSeek-Small-v1" + + tokenizer = AutoTokenizer.from_pretrained( + model_name_or_path, + trust_remote_code=True, + use_fast=True + ) + tokenizer.padding_side = "right" + if tokenizer.pad_token is 
None:
+        tokenizer.pad_token = tokenizer.eos_token
+
+    model = AutoModelForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True)
+    model = model.to(torch.bfloat16)
+    print(model)
+    model.save_pretrained(f"./models/OpenSeek_Small_v1")
+    tokenizer.save_pretrained(f"./models/OpenSeek_Small_v1")
+
+    dataset = mix_pt_datasets(
+        datasets_and_ratios=datasets_and_ratios,
+        total_sample_size=total_sample_size,
+        dataset_text_field=dataset_text_field,
+        processing_class=tokenizer,
+        max_length=max_length,
+        packing=packing,
+        formatting_func=None,
+        dataset_num_proc=dataset_num_proc,
+        seed=seed,
+        # cache_dir=cache_dir,
+    )
+    dataset = dataset.select_columns(["input_ids"])
+    print(dataset)
+
+    dataset.save_to_disk("./datasets/OpenSeek-Pretrain-30B", num_proc=8)
\ No newline at end of file
diff --git a/openseek/competition/pz/losercheems/preliminary/scripts/merge.py b/openseek/competition/pz/losercheems/preliminary/scripts/merge.py
new file mode 100644
index 0000000..f73ca26
--- /dev/null
+++ b/openseek/competition/pz/losercheems/preliminary/scripts/merge.py
@@ -0,0 +1,218 @@
+from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, GenerationConfig, TextStreamer
+import time
+import torch
+import json
+import os
+from datetime import datetime
+from collections import OrderedDict
+
+torch.manual_seed(0)
+
+checkpoint_paths = []
+
+checkpoint_paths.append(r"./models/OpenSeek_Small_v1")
+checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-1000")
+checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-2000")
+checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-3000")
+checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-4000")
+checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-5000")
+checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-6000")
+checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-7000")
+
+
+# Merge these checkpoints
+print("Merging models...")
+
+def merge_checkpoints(checkpoint_paths, output_path, merge_method="average"):
+    """
+    Merge multiple checkpoints.
+
+    Args:
+        checkpoint_paths: list of checkpoint paths to merge
+        output_path: where to save the merged model
+        merge_method: merge strategy; one of "average", "last", "weighted_average"
+    """
+    print(f"Using merge method: {merge_method}")
+
+    # Load the first checkpoint as the base model
+    base_model = AutoModelForCausalLM.from_pretrained(checkpoint_paths[0], trust_remote_code=True)
+    base_config = AutoConfig.from_pretrained(checkpoint_paths[0], trust_remote_code=True)
+    base_tokenizer = AutoTokenizer.from_pretrained(checkpoint_paths[0], trust_remote_code=True)
+
+    if merge_method == "average":
+        # Uniformly average all checkpoints
+        print("Averaging weights from all checkpoints...")
+        merged_state_dict = OrderedDict()
+        weight_counts = OrderedDict()  # how many checkpoints contributed to each key
+
+        # Initialize merged_state_dict and weight_counts
+        for key in base_model.state_dict():
+            merged_state_dict[key] = torch.zeros_like(base_model.state_dict()[key])
+            weight_counts[key] = 0
+
+        # Accumulate the weights of every checkpoint
+        for i, checkpoint_path in enumerate(checkpoint_paths):
+            print(f"Processing checkpoint {i+1}/{len(checkpoint_paths)}: {checkpoint_path}")
+
+            model = AutoModelForCausalLM.from_pretrained(checkpoint_path, trust_remote_code=True)
+            model_state_dict = model.state_dict()
+
+            # Only merge keys present in both state dicts
+            matched_keys = 0
+            total_keys = len(model_state_dict)
+            for key in base_model.state_dict():
+                if key in model_state_dict:
+                    # Check that the shapes match
+                    if merged_state_dict[key].shape == model_state_dict[key].shape:
+                        merged_state_dict[key] += model_state_dict[key]
+                        weight_counts[key] += 1
+                        matched_keys += 1
+                    else:
+                        print(f"Warning: Shape mismatch for key {key}: base {merged_state_dict[key].shape} vs model {model_state_dict[key].shape}")
+                else:
+                    print(f"Warning: Key {key} not found in model {checkpoint_path}")
+
+            print(f"  Matched {matched_keys}/{len(base_model.state_dict())} keys from base model")
+            print(f"  Model has {total_keys} total keys")
+
+            # Free memory
+            del model
+            if torch.cuda.is_available():
+                torch.cuda.empty_cache()
+
+        # Divide by the contribution count (only keys that were actually merged)
+        for key in merged_state_dict:
+            if weight_counts[key] > 0:
+                merged_state_dict[key] /= weight_counts[key]
+                print(f"Key {key}: averaged over {weight_counts[key]} models")
+            else:
+                # Fall back to the base model's weight if nothing matched
+                merged_state_dict[key] = base_model.state_dict()[key].clone()
+                print(f"Key {key}: using base model weight (no matches found)")
+
+    elif merge_method == "weighted_average":
+        # Weighted average (later checkpoints get larger weights)
+        print("Weighted averaging weights from all checkpoints...")
+        merged_state_dict = OrderedDict()
+        weight_sums = OrderedDict()  # total weight accumulated for each key
+
+        # Initialize merged_state_dict and weight_sums
+        for key in base_model.state_dict():
+            merged_state_dict[key] = torch.zeros_like(base_model.state_dict()[key])
+            weight_sums[key] = 0.0
+
+        # Linearly increasing weights, normalized to sum to 1
+        weights = [i+1 for i in range(len(checkpoint_paths))]
+        total_weight = sum(weights)
+        weights = [w/total_weight for w in weights]
+
+        print(f"Weights: {weights}")
+
+        # Weighted accumulation over all checkpoints
+        for i, (checkpoint_path, weight) in enumerate(zip(checkpoint_paths, weights)):
+            print(f"Processing checkpoint {i+1}/{len(checkpoint_paths)}: {checkpoint_path} (weight: {weight:.3f})")
+
+            model = AutoModelForCausalLM.from_pretrained(checkpoint_path, trust_remote_code=True)
+            model_state_dict = model.state_dict()
+
+            # Only merge keys present in both state dicts
+            matched_keys = 0
+            total_keys = len(model_state_dict)
+            for key in base_model.state_dict():
+                if key in model_state_dict:
+                    # Check that the shapes match
+                    if merged_state_dict[key].shape == model_state_dict[key].shape:
+                        merged_state_dict[key] += model_state_dict[key] * weight
+                        weight_sums[key] += weight
+                        matched_keys += 1
+                    else:
+                        print(f"Warning: Shape mismatch for key {key}: base {merged_state_dict[key].shape} vs model {model_state_dict[key].shape}")
+                else:
+                    print(f"Warning: Key {key} not found in model {checkpoint_path}")
+
+            print(f"  Matched {matched_keys}/{len(base_model.state_dict())} keys from base model")
+            print(f"  Model has {total_keys} total keys")
+
+            # Free memory
+            del model
+            if torch.cuda.is_available():
+                torch.cuda.empty_cache()
+
+        # Normalize by the accumulated weight (a no-op when every checkpoint matched)
+        for key in merged_state_dict:
+            if weight_sums[key] > 0:
+                merged_state_dict[key] /= weight_sums[key]
+                print(f"Key {key}: weighted averaged over {weight_sums[key]:.3f} total weight")
+            else:
+                merged_state_dict[key] = base_model.state_dict()[key].clone()
+                print(f"Key {key}: using base model weight (no matches found)")
+
+    elif merge_method == "last":
+        # Just use the last checkpoint
+        print("Using the last checkpoint...")
+        last_model = AutoModelForCausalLM.from_pretrained(checkpoint_paths[-1], trust_remote_code=True)
+        merged_state_dict = last_model.state_dict()
+        del last_model
+
+    else:
+        raise ValueError(f"Unknown merge method: {merge_method}")
+
+    # Load the merged weights into the base model
+    base_model.load_state_dict(merged_state_dict)
+
+    # Save the merged model
+    print(f"Saving merged model to {output_path}")
+    os.makedirs(output_path, exist_ok=True)
+    base_model = base_model.to(torch.bfloat16)
+    base_model.save_pretrained(output_path)
+    base_tokenizer.save_pretrained(output_path)
+    base_config.save_pretrained(output_path)
+
+    # Record merge metadata alongside the model
+    merge_info = {
+        "merge_method": merge_method,
+        "merged_checkpoints": checkpoint_paths,
+        "merge_time": datetime.now().isoformat(),
+        "num_checkpoints": len(checkpoint_paths)
+    }
+
+    with open(os.path.join(output_path, "merge_info.json"), "w", encoding="utf-8") as f:
+        json.dump(merge_info, f, indent=2, ensure_ascii=False)
+
+    print(f"Merge completed! Model saved to {output_path}")
+    return base_model, base_tokenizer
+
+# Run the merge
+if checkpoint_paths:
+    print(f"Total checkpoints to merge: {len(checkpoint_paths)}")
+
+    # Method 1: uniform average
+    output_path_avg = "./models/OpenSeek_Small_v1-merged-average"
+    merged_model_avg, merged_tokenizer_avg = merge_checkpoints(
+        checkpoint_paths,
+        output_path_avg,
+        merge_method="average"
+    )
+
+    # # Method 2: weighted average
+    # output_path_weighted = "./models/OpenSeek_Small_v1-merged-weighted"
+    # merged_model_weighted, merged_tokenizer_weighted = merge_checkpoints(
+    #     checkpoint_paths,
+    #     output_path_weighted,
+    #     merge_method="weighted_average"
+    # )
+
+    # # Method 3: just take the last checkpoint
+    # output_path_last = "./models/OpenSeek_Small_v1-merged-last"
+    # merged_model_last, merged_tokenizer_last = merge_checkpoints(
+    #     checkpoint_paths,
+    #     output_path_last,
+    #     merge_method="last"
+    # )
+
+    # print("All merge methods completed!")
+    # print(f"Average merge: {output_path_avg}")
+    # print(f"Weighted average merge: {output_path_weighted}")
+    # print(f"Last checkpoint: {output_path_last}")
+
+else:
+    print("No checkpoints found to merge!")
\ No newline at end of file
diff --git a/openseek/competition/pz/losercheems/preliminary/train.sh b/openseek/competition/pz/losercheems/preliminary/train.sh
new file mode 100644
index 0000000..0fded40
--- /dev/null
+++ b/openseek/competition/pz/losercheems/preliminary/train.sh
@@ -0,0 +1,21 @@
+#!/bin/bash
+#SBATCH -N 1
+#SBATCH -n 32
+#SBATCH -t 114514
+#SBATCH -J openseek
+#SBATCH -o out.log
+#SBATCH --gpus=4
+
+module unload compilers/cuda
+module unload cudnn
+module load compilers/cuda/12.2
+module load cudnn/9.8.0.87_cuda12.x
+conda activate train
+conda init
+
+export HF_ENDPOINT=https://hf-mirror.com
+export XDG_CACHE_HOME=cache
+
+accelerate launch --config_file recipes/accelerate_configs/ddp.yaml ./trainer/pt_dpsk.py --config recipes/openseek/config.yaml
+
+# sbatch train.sh
diff --git a/openseek/competition/pz/losercheems/preliminary/trainer/pt_dpsk.py b/openseek/competition/pz/losercheems/preliminary/trainer/pt_dpsk.py
new file mode 100644
index 0000000..06a4d51
--- /dev/null
+++ b/openseek/competition/pz/losercheems/preliminary/trainer/pt_dpsk.py
@@ -0,0 +1,211 @@
+import logging
+import os
+import sys
+from argparse import ArgumentParser
+
+import yaml
+import datasets
+import torch
+import transformers
+from transformers import (
+    AutoConfig,
+    AutoModelForCausalLM,
+    AutoTokenizer,
+    DataCollatorForLanguageModeling,
+    Trainer,
+    set_seed,
+)
+from transformers.trainer_utils import get_last_checkpoint
+from utils.training_args_configs import PTConfig
+from processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets  # absolute import; the script is launched directly (see train.sh)
+
+from trl import ModelConfig, ScriptArguments, TrlParser
+
+
+logger = logging.getLogger(__name__)
+
+
+def main(
+    script_args: ScriptArguments,
+    training_args: PTConfig,
+    model_args: ModelConfig,
+    model_config: dict,
+):
+    # Set seed for reproducibility
+    set_seed(training_args.seed)
+
+    ###############
+    # Setup logging
+    ###############
+    logging.basicConfig(
+        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
+        datefmt="%Y-%m-%d %H:%M:%S",
+        handlers=[logging.StreamHandler(sys.stdout)],
+    )
+    log_level = training_args.get_process_log_level()
+    logger.setLevel(log_level)
+    datasets.utils.logging.set_verbosity(log_level)
+    transformers.utils.logging.set_verbosity(log_level)
+    transformers.utils.logging.enable_default_handler()
+    transformers.utils.logging.enable_explicit_format()
+
+    # Log a small summary on each process
+    logger.warning(
+        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
+        + f" distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
+    )
+    logger.info(f"Model parameters {model_args}")
+    logger.info(f"Script parameters {script_args}")
+    logger.info(f"Data parameters {training_args}")
+
+    # Get model classes
+    config_class = AutoConfig
+    causal_lm_class = AutoModelForCausalLM
+
+    ################
+    # Load tokenizer
+    ################
+    tokenizer = AutoTokenizer.from_pretrained(
+        model_args.model_name_or_path,
+        revision=model_args.model_revision,
+        trust_remote_code=model_args.trust_remote_code,
+        use_fast=True
+    )
+    tokenizer.padding_side = "right"
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
+
+    ###############
+    # Load datasets
+    ###############
+    logger.info("Using processor for dataset mixing and processing")
+    dataset = mix_pt_datasets(
+        datasets_and_ratios=training_args.datasets_and_ratios,
+        total_sample_size=training_args.total_sample_size,
+        dataset_text_field=training_args.dataset_text_field,
+        processing_class=tokenizer,
+        max_length=training_args.max_length,
+        packing=training_args.packing,
+        formatting_func=None,
+        dataset_num_proc=training_args.dataset_num_proc,
+        seed=training_args.seed,
+        cache_dir=training_args.cache_dir,
+    )
+
+    ###################
+    # Model init kwargs
+    ###################
+    logger.info("*** Initializing model kwargs ***")
+    torch_dtype = (
+        model_args.torch_dtype if model_args.torch_dtype in ["auto", None] else getattr(torch, model_args.torch_dtype)
+    )
+    model_kwargs = dict(
+        revision=model_args.model_revision,
+        trust_remote_code=model_args.trust_remote_code,
+        attn_implementation=model_args.attn_implementation,
+        torch_dtype=torch_dtype,
+        use_cache=False if training_args.gradient_checkpointing else True,
+    )
+    training_args.model_init_kwargs = model_kwargs
+
+    ##################
+    # Initialize model
+    ##################
+    logger.info("Initializing model")
+    config = config_class(**model_config)
+    model = causal_lm_class.from_pretrained(
+        model_args.model_name_or_path,
+        config=config,
+    ).to(torch_dtype)
+    # if model_args.model_name_or_path is not None and model_args.model_name_or_path.endswith("checkpoint") else causal_lm_class(config=config).to(torch_dtype)
+
+    model_num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
+    logger.info(f"Model structure: {model}")
+    logger.info(f"Model parameters: {model_num_params}")
+
+    # Check for a last checkpoint
+    last_checkpoint = None
+    if os.path.isdir(training_args.output_dir):
+        last_checkpoint = get_last_checkpoint(training_args.output_dir)
+    if last_checkpoint is not None and training_args.resume_from_checkpoint is None:
+        logger.info(f"Checkpoint detected, resuming training at {last_checkpoint=}.")
+    else:
+        logger.info("No checkpoint found, starting training from scratch.")
+
+    ###########################
+    # Initialize the PT trainer
+    ###########################
+    data_collator = DataCollatorForLanguageModeling(
+        tokenizer=tokenizer,
+        mlm=False,
+    )
+    trainer = Trainer(
+        model=model,
+        args=training_args,
+        train_dataset=dataset[script_args.dataset_train_split],
+        eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
+        processing_class=tokenizer,
+        data_collator=data_collator,
+    )
+
+    ###############
+    # Training loop
+    ###############
+    logger.info("*** Start training... ***")
+    checkpoint = None
+    if training_args.resume_from_checkpoint is not None:
+        checkpoint = training_args.resume_from_checkpoint
+    elif last_checkpoint is not None:
+        checkpoint = last_checkpoint
+    train_result = trainer.train(resume_from_checkpoint=checkpoint)
+    metrics = train_result.metrics
+    metrics["train_samples"] = len(dataset[script_args.dataset_train_split])
+    trainer.log_metrics("train", metrics)
+    trainer.save_metrics("train", metrics)
+    trainer.save_state()
+
+    ##################################
+    # Save model and create model card
+    ##################################
+    logger.info("*** Saving model... ***")
+    trainer.save_model(training_args.output_dir)
+    logger.info(f"Model saved to {training_args.output_dir}")
+
+    # Save everything else on the main process
+    if trainer.accelerator.is_main_process:
+        trainer.create_model_card()
+        # Restore the KV cache for fast inference
+        trainer.model.config.use_cache = True
+        trainer.model.config.save_pretrained(training_args.output_dir)
+
+    logger.info("*** Training complete ***")
+
+    ##########
+    # Evaluate
+    ##########
+    if training_args.do_eval:
+        logger.info("*** Start evaluation... ***")
+        metrics = trainer.evaluate()
+        metrics["eval_samples"] = len(dataset[script_args.dataset_test_split])
+        trainer.log_metrics("eval", metrics)
+        trainer.save_metrics("eval", metrics)
+        logger.info("*** Evaluation complete ***")
+
+    logger.info("*** Training finished! ***")
+
+
+if __name__ == "__main__":
+    model_config_parser = ArgumentParser()
+    model_config_parser.add_argument(
+        "--config", type=str, default="./recipes/config_full.yaml", help="path to yaml config file of PT"
+    )
+
+    parser = TrlParser((ScriptArguments, PTConfig, ModelConfig))
+    script_args, training_args, model_args = parser.parse_args_and_config(fail_with_unknown_args=False)
+
+    config_path = model_config_parser.parse_args().config
+    with open(config_path, "r", encoding="utf-8") as f:
+        model_config = yaml.load(f, Loader=yaml.FullLoader)["model_config"]
+
+    main(script_args, training_args, model_args, model_config)
diff --git a/openseek/competition/pz/losercheems/preliminary/utils/training_args_configs.py b/openseek/competition/pz/losercheems/preliminary/utils/training_args_configs.py
new file mode 100644
index 0000000..4430708
--- /dev/null
+++ b/openseek/competition/pz/losercheems/preliminary/utils/training_args_configs.py
@@ -0,0 +1,41 @@
+from dataclasses import dataclass, field
+from typing import Optional, List, Dict
+
+from transformers import TrainingArguments
+
+
+@dataclass
+class PTConfig(TrainingArguments):
+    """
+    Configuration for Pre-Training (PT).
+    """
+
+    # Dataset mixing parameters
+    datasets_and_ratios: Optional[List[Dict[str, float]]] = field(
+        default=None,
+        metadata={"help": "List of datasets and their mixing ratios. Format: [{'dataset_name': ratio}, ...]"}
+    )
+    total_sample_size: Optional[int] = field(
+        default=None,
+        metadata={"help": "Total number of samples to use from mixed datasets"}
+    )
+    dataset_text_field: str = field(
+        default="text",
+        metadata={"help": "The field name containing text data in the dataset"}
+    )
+    max_length: int = field(
+        default=2048,
+        metadata={"help": "Maximum sequence length for tokenization"}
+    )
+    packing: bool = field(
+        default=True,
+        metadata={"help": "Whether to pack sequences for efficient training"}
+    )
+    dataset_num_proc: int = field(
+        default=4,
+        metadata={"help": "Number of processes for dataset processing"}
+    )
+    cache_dir: Optional[str] = field(
+        default=None,
+        metadata={"help": "Directory to cache processed datasets"}
+    )
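The uniform `average` merge in `merge.py` reduces to a per-key accumulate-and-divide with a fallback to the base weights when a key never matches. A minimal self-contained sketch of that counting scheme, using plain Python lists in place of tensors (the key names and values below are toy examples, not taken from any real checkpoint):

```python
from collections import OrderedDict

def average_state_dicts(state_dicts):
    """Uniformly average matching keys across checkpoints, mirroring the
    counting scheme in merge.py: keys that never match fall back to base."""
    base = state_dicts[0]
    merged = OrderedDict((k, [0.0] * len(v)) for k, v in base.items())
    counts = OrderedDict((k, 0) for k in base)
    for sd in state_dicts:
        for key in base:
            # Only merge keys present in both, with matching "shape" (length)
            if key in sd and len(sd[key]) == len(merged[key]):
                merged[key] = [m + v for m, v in zip(merged[key], sd[key])]
                counts[key] += 1
    for key in merged:
        if counts[key] > 0:
            merged[key] = [m / counts[key] for m in merged[key]]
        else:
            merged[key] = list(base[key])  # no contribution: keep base weight
    return merged

# Two toy "checkpoints" with a shared parameter layout
ckpt_a = {"w": [1.0, 2.0], "b": [0.0]}
ckpt_b = {"w": [3.0, 4.0], "b": [2.0]}
merged = average_state_dicts([ckpt_a, ckpt_b])
print(merged["w"])  # [2.0, 3.0]
print(merged["b"])  # [1.0]
```

Replacing the unit counts with normalized per-checkpoint weights, and dividing by the accumulated weight instead of the count, yields the `weighted_average` variant of the same loop.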