diff --git a/openseek/competition/pz/losercheems/final/README.md b/openseek/competition/pz/losercheems/final/README.md new file mode 100644 index 0000000..3055e77 --- /dev/null +++ b/openseek/competition/pz/losercheems/final/README.md @@ -0,0 +1,272 @@ +# OpenSeek KTO Alignment – Technical Report + +This document provides a detailed, end-to-end technical description of the KTO (Kahneman–Tversky Optimization style preference / safety / alignment fine-tuning) pipeline implemented under the `final/` directory. The workflow has four major stages: + +1) Asset & dataset acquisition (`scripts/download.py`): download the SFT base model + tokenizer and pull the raw dataset. +2) Dataset transformation (`scripts/kto_datasets_process.py`): convert the raw dataset into a KTO-compatible preference format. +3) Alignment training (`trainer/kto.py`) using TRL's `KTOTrainer` with DeepSpeed ZeRO-2 (`recipes/accelerate_configs/zero2.yaml`) and training hyperparameters (`recipes/openseek/config.yaml`), launched by `train.sh`. Checkpoints saved every 1,000 steps. +4) Evaluation (`eval_example/`): contains benchmark outputs and aggregate metrics for the final checkpoint. + +--- +## Public Checkpoint + +The KTO alignment checkpoint is released at: `JingzeShi/OpenSeek-1.4B-A0.4B-KTO`. 
+ +Typical load snippet: +```python +from transformers import AutoModelForCausalLM, AutoTokenizer +model_id = "JingzeShi/OpenSeek-1.4B-A0.4B-KTO" +tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) +model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True) +``` +--- + +## Directory Overview + +``` +final/ + recipes/ + accelerate_configs/ + zero2.yaml # Accelerate + DeepSpeed ZeRO Stage 2 configuration + openseek/ + config.yaml # KTO training hyperparameters (Trainer-compatible YAML) + scripts/ + download.py # Download model/tokenizer + raw dataset (NuminaMath-CoT) + kto_datasets_process.py # Transform dataset → KTO preference format + trainer/ + kto.py # Main training entry (KTOTrainer) + eval_example/ # Example evaluation results for final checkpoint + final_result.json # Aggregated metrics summary + / # Per-benchmark JSONL + metrics + README.md # (This report) +``` + +--- +## Stage 1: Download Base Assets (`scripts/download.py`) + +Key actions: +- Downloads tokenizer & model from `BAAI/OpenSeek-Small-v1-SFT` (already SFT-prepared base for alignment). +- Saves them locally under `./models/OpenSeek-Small-v1-SFT` for reproducible offline reuse. +- Loads the raw dataset: `AI-MO/NuminaMath-CoT` from Hugging Face Hub. +- Persists dataset to disk: `./datasets/AI-MO/NuminaMath-CoT` (Arrow + metadata) to avoid repeated network fetches. 
+ +Environment variables (optional but recommended): +- `HF_ENDPOINT=https://hf-mirror.com` (for regional mirrors) +- `XDG_CACHE_HOME=./cache` (centralize HF cache) + +Execution: +```bash +python scripts/download.py +``` + +Outputs: +- `./models/OpenSeek-Small-v1-SFT/` (model weights, tokenizer files, config) +- `./datasets/AI-MO/NuminaMath-CoT/` (train/validation splits as provided by source dataset) + +--- +## Stage 2: Dataset Transformation for KTO (`scripts/kto_datasets_process.py`) + +Objective: +Convert the original multi-turn / message-style math reasoning dataset (`NuminaMath-CoT`) into a simplified preference alignment format required by KTO: each example should expose a prompt, a completion, and a binary label. + +Implementation specifics: +- Loads the previously saved raw dataset from disk. +- For each sample, extracts the first two entries in `messages`: + - `messages[0]` → becomes a single-element list assigned to `prompt`. + - `messages[1]` → becomes a single-element list assigned to `completion`. +- Assigns `label = True` for all entries (i.e., all are treated as preferred / positive examples). +- Selects only the columns `["prompt", "completion", "label"]`. +- Saves the processed dataset to: `./datasets/AI-MO/NuminaMath-CoT-preference`. + +Command: +```bash +python scripts/kto_datasets_process.py +``` + +Resulting dataset schema (per split): +``` +{ + "prompt": List[Any] # list-wrapped message dict(s) or text segment(s) + "completion": List[Any] # list-wrapped assistant answer + "label": bool # True → preferred sample +} +``` + +Example (illustrative, not verbatim): +```json +{ + "prompt": [ {"role": "user", "content": "Solve: 2x + 3 = 7"} ], + "completion": [ {"role": "assistant", "content": "x = 2"} ], + "label": true +} +``` + +Notes & Considerations: +- Current processing creates only positive (True) labels. KTO can also leverage implicit negatives or additional heuristics. 
If extending, introduce negative variants (e.g., alternative incorrect completions) with `label=False`. +- Left vs right padding: handled later in tokenizer setup (alignment models usually benefit from left padding in generation-oriented training to keep latest tokens aligned in GPU compute). + +--- +## Stage 3: Alignment Training (KTO) – `trainer/kto.py` + +### 3.1 Launch Mechanism +Training is launched via Accelerate + DeepSpeed ZeRO Stage 2 for memory efficiency and multi-GPU scaling. The Slurm / shell entry is encapsulated in `train.sh`: +```bash +ACCELERATE_LOG_LEVEL=info accelerate launch \ + --config_file recipes/accelerate_configs/zero2.yaml \ + ./trainer/kto.py --config recipes/openseek/config.yaml +``` + +### 3.2 Accelerate + DeepSpeed Config (`zero2.yaml`) +Key parameters: +- `distributed_type: DEEPSPEED` & `zero_stage: 2` → ZeRO-2 sharding optimizer states + gradients (parameter partition not as full as ZeRO-3, but lower overhead). +- `mixed_precision: bf16` → uses BF16 if supported (Ampere+); stable vs FP16 on many math-heavy workloads. +- `num_processes: 8` → should match the number of visible GPUs (adjust to your cluster allocation). +- No optimizer or parameter CPU offload (`offload_*: none`) to reduce PCIe pressure (requires enough GPU RAM). + +### 3.3 Training Hyperparameters (`config.yaml`) +Extracted key fields: +- Logging & checkpointing: `logging_steps: 1`, `save_steps: 1000`, `save_total_limit: 1` (keeps only the latest checkpoint to save disk). +- Model source: `model_name_or_path: ./models/OpenSeek-Small-v1-SFT` (the SFT base from Stage 1). +- Attention backend: `attn_implementation: flash_attention_2` (ensure FlashAttention v2 build compatibility). +- Data: `dataset_name: /workspace/datasets/AI-MO/NuminaMath-CoT-preference` (adjust to your actual path if different); `max_length: 4096`. +- Optimization: + - `learning_rate: 2e-5` + - Scheduler: `cosine_with_min_lr` + `min_lr_rate: 0.1` (final LR = base_lr * 0.1 at tail). 
+ - `warmup_ratio: 0.1` + - `weight_decay: 0.01` + - `gradient_accumulation_steps: 2` (effective batch = per_device * GPUs * accum). + - `gradient_checkpointing: true` + `use_reentrant: false` (saves memory at cost of extra compute). + - `max_grad_norm: 1.0` (gradient clipping). + - `bf16: True` (reinforces BF16 usage in Trainer config). + - Custom flags: `use_liger_kernel`, `use_liger_loss` (implies specialized fused ops or custom objective—ensure installed extensions if required). +- Epoch count: `num_train_epochs: 4`. + +### 3.4 Tokenizer & Padding (`kto.py`) +- Tokenizer uses left padding (`tokenizer.padding_side = "left"`), typical for generation-focused alignment so most recent tokens align along the right edge in attention windows, improving efficiency for some kernels. +- If tokenizer lacks a `pad_token`, it falls back to `eos_token`. + +### 3.5 Model & Reference Model +- Both `model` and `ref_model` are loaded from the same base. KTO uses the reference model to compute relative preference signals / calibration. Keeping them identical at initialization is standard. +- Quantization hooks (via `get_quantization_config`) are available but not explicitly set in the provided configs (would allow 4/8-bit experiments if desired). + +### 3.6 Trainer Initialization +- Uses TRL `KTOTrainer` with: + - `train_dataset=dataset[script_args.dataset_train_split]` (defaults typically `train`) + - Optional eval dataset only if `eval_strategy != "no"` (currently disabled for speed). + - `peft_config=get_peft_config(model_args)` (enables LoRA/other parameter-efficient fine-tuning if configured in `ModelConfig`). If PEFT is not explicitly configured, it may default to full fine-tuning. +- `use_cache` is disabled during training if gradient checkpointing is on. 
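
For intuition, the per-example objective that `KTOTrainer` optimizes can be written in scalar form. The sketch below is illustrative, not TRL's batched implementation: `beta` corresponds to `KTOConfig.beta` (default 0.1), and `lam_d` / `lam_u` to `desirable_weight` / `undesirable_weight` (default 1.0); `ref_kl` stands in for the batch-level KL estimate that TRL computes and detaches as the reference point.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def kto_loss(policy_logratio: float, ref_kl: float, label: bool,
             beta: float = 0.1, lam_d: float = 1.0, lam_u: float = 1.0) -> float:
    """Scalar sketch of the KTO objective.

    policy_logratio: log pi(y|x) - log pi_ref(y|x) for one completion.
    ref_kl: batch-level KL(pi || pi_ref) estimate used as the reference point
            (detached from the computation graph in the real implementation).
    """
    if label:  # desirable (label=True): push the log-ratio above the KL reference
        return lam_d * (1.0 - sigmoid(beta * (policy_logratio - ref_kl)))
    # undesirable (label=False): push the log-ratio below the KL reference
    return lam_u * (1.0 - sigmoid(beta * (ref_kl - policy_logratio)))
```

Since Stage 2 assigns `label=True` everywhere, every example takes the desirable branch; the `ref_model` exists to supply the log-ratio baseline and the KL reference term.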
+ +### 3.7 Checkpoint Artifacts +At each save step (every 1,000 steps): +- Model weights (BF16) +- Trainer state (optimizer, scheduler unless limited by DeepSpeed stage boundary) +- RNG states for reproducibility +Because `save_total_limit: 1`, only the latest checkpoint directory is retained (rolling deletion of older ones). If you intend to run model soup or regression comparisons, increase this limit. + +### 3.8 Performance & Memory Tips +- If encountering OOM: + - Lower `per_device_train_batch_size` + - Increase `gradient_accumulation_steps` + - Reduce `max_length` + - Enable quantization (4-bit/8-bit) if latency acceptable +- If throughput is low: + - Ensure FlashAttention 2 is correctly installed (or switch to `sdpa` fallback) + - Disable unnecessary logging (though `logging_steps: 1` is useful during early debugging, raise later) + +--- +## Stage 4: Evaluation (`eval_example/`) + +This stage reports benchmark results using a unified Chain-of-Thought prompting configuration. + +``` +PROMPT_TYPE="cot" +aime24: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072 +amc23: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072 +gsm8k: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072 +math500: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072 +minerva_math: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072 +olympiadbench: seed 1, temperature 0.6, n_sampling 1, max_tokens_per_call 3072 +``` + +Directory layout (unchanged): +``` +eval_example/ + final_result.json # Aggregated metrics + / + _metrics.json # Summary metrics + _result.jsonl # Raw generations +``` + +`final_result.json` consolidates the per-benchmark metrics produced under the above consistent decoding / prompting setup. + +--- +## End-to-End Execution Summary + +```bash +# 1. Download base model + raw dataset +python scripts/download.py + +# 2. Transform dataset into KTO preference format +python scripts/kto_datasets_process.py + +# 3. 
Launch KTO alignment training (DeepSpeed ZeRO-2) +sbatch train.sh # or run the accelerate command directly if not using Slurm + +# 4. (After training) Evaluate checkpoint(s) +# (Use your evaluation tooling; results stored under eval_example/) +``` + +--- +## Reproducibility + +| Aspect | Mechanism | Notes | +|--------|-----------|-------| +| Random Seeds | `seed: 233` + `set_seed()` | Multi-worker data map & packing can still introduce slight nondeterminism. | +| Checkpointing | Every 1,000 steps | Only last retained unless `save_total_limit` increased. | +| Determinism | Not fully enforced | For stricter determinism: set CUDA deterministic flags (may degrade performance). | + +Recommendations: +- Pin versions of `transformers`, `datasets`, `trl`, `accelerate`, `torch`. +- Archive `zero2.yaml` + `config.yaml` with final model for auditability. + +--- +## Extending / Modifying the Pipeline + +| Goal | Change | +|------|--------| +| Introduce negative preferences | Modify `kto_datasets_process.py` to generate paired positive/negative samples (set some `label=False`). | +| Multi-reference completions | Convert single-element lists to multiple alternatives in `completion`; adjust trainer consumption. | +| Curriculum alignment | Stage multiple processed datasets; fine-tune sequentially. | +| Longer context | Increase `max_length`; ensure GPU memory headroom and FlashAttention support. | +| Retain multiple checkpoints | Increase `save_total_limit`; optionally add model averaging or smoothing post-training. | +| Parameter-efficient tuning | Configure LoRA / prefix tuning in the `ModelConfig` (passed via KTOTrainer). | + +--- +## Troubleshooting + +| Issue | Symptoms | Mitigation | +|-------|----------|------------| +| OOM (GPU) | CUDA out of memory | Reduce batch size / seq length, enable gradient checkpointing (already on), use quantization. | +| Divergent loss | Loss spikes or NaNs | Lower LR, disable exotic kernels, check BF16 support, verify data labels. 
| +| Slow startup | Long dataset load | Ensure dataset is saved locally (disk I/O), increase `num_proc` during preprocessing only (not in training). | +| FlashAttention errors | Kernel build failures | Switch `attn_implementation` to `sdpa` or install matching CUDA toolkit & driver. | +| Checkpoints not saving | Missing directories | Verify write permissions & disk quota; ensure `output_dir` exists & not readonly. | +| No evaluation | Metrics absent | Set `do_eval: True` & `eval_strategy: steps` or `epoch`; supply proper eval split. | + +--- +## Security & Integrity Notes +- Trust only vetted model repos when `trust_remote_code=True`. +- Validate dataset integrity (hash or size) before large-scale training. +- For multi-tenant clusters, restrict write paths and use namespace-isolated caches. + +--- +## License & Attribution +- Code headers indicate Apache 2.0 licensing. +- Base model: `BAAI/OpenSeek-Small-v1-SFT` (refer to its upstream license & usage terms). +- Dataset: `AI-MO/NuminaMath-CoT` (comply with its license and any redistribution constraints). + +--- +## Summary +This `final/` pipeline layers preference-style alignment (KTO) atop an SFT base using a memory-efficient ZeRO-2 BF16 stack. It emphasizes reproducible dataset transformation, lean checkpoint retention, and modular extensibility for future preference formats, negative sampling, or evaluation automation. + +Happy aligning. 
diff --git a/openseek/competition/pz/losercheems/final/losercheems.pdf b/openseek/competition/pz/losercheems/final/losercheems.pdf new file mode 100644 index 0000000..fbdbf68 Binary files /dev/null and b/openseek/competition/pz/losercheems/final/losercheems.pdf differ diff --git a/openseek/competition/pz/losercheems/final/recipes/accelerate_configs/zero2.yaml b/openseek/competition/pz/losercheems/final/recipes/accelerate_configs/zero2.yaml new file mode 100644 index 0000000..f4df9d4 --- /dev/null +++ b/openseek/competition/pz/losercheems/final/recipes/accelerate_configs/zero2.yaml @@ -0,0 +1,21 @@ +compute_environment: LOCAL_MACHINE +debug: false +deepspeed_config: + deepspeed_multinode_launcher: standard + offload_optimizer_device: none + offload_param_device: none + zero3_init_flag: false + zero_stage: 2 +distributed_type: DEEPSPEED +downcast_bf16: 'no' +machine_rank: 0 +main_training_function: main +mixed_precision: bf16 +num_machines: 1 +num_processes: 8 +rdzv_backend: static +same_network: true +tpu_env: [] +tpu_use_cluster: false +tpu_use_sudo: false +use_cpu: false \ No newline at end of file diff --git a/openseek/competition/pz/losercheems/final/recipes/openseek/config.yaml b/openseek/competition/pz/losercheems/final/recipes/openseek/config.yaml new file mode 100644 index 0000000..2808fac --- /dev/null +++ b/openseek/competition/pz/losercheems/final/recipes/openseek/config.yaml @@ -0,0 +1,49 @@ +# Logging and Output arguments +log_level: info +logging_strategy: steps +logging_steps: 1 +report_to: +- tensorboard +save_strategy: steps +save_steps: 1000 +save_total_limit: 1 +output_dir: data/OpenSeek-1.4B-A0.4B-KTO +overwrite_output_dir: true + +# Model arguments +model_name_or_path: ./models/OpenSeek-Small-v1-SFT +model_revision: main +trust_remote_code: True +torch_dtype: bfloat16 +attn_implementation: flash_attention_2 + +# Data training arguments +dataset_name: /workspace/datasets/AI-MO/NuminaMath-CoT-preference +dataset_config: default +dataset_num_proc: 
8 +max_length: 4096 + +# KTO Trainer arguments +seed: 233 +do_train: True +num_train_epochs: 4 +per_device_train_batch_size: 8 +do_eval: False +eval_strategy: 'no' +eval_steps: 100 +per_device_eval_batch_size: 1 +optim: adamw_torch +learning_rate: 2.0e-5 +lr_scheduler_type: cosine_with_min_lr +lr_scheduler_kwargs: + min_lr_rate: 0.1 +warmup_ratio: 0.1 +weight_decay: 0.01 +gradient_accumulation_steps: 2 +gradient_checkpointing: true +gradient_checkpointing_kwargs: + use_reentrant: false +max_grad_norm: 1.0 +bf16: True +use_liger_kernel: True +use_liger_loss: True \ No newline at end of file diff --git a/openseek/competition/pz/losercheems/final/scripts/download.py b/openseek/competition/pz/losercheems/final/scripts/download.py new file mode 100644 index 0000000..ed3ace7 --- /dev/null +++ b/openseek/competition/pz/losercheems/final/scripts/download.py @@ -0,0 +1,16 @@ +from datasets import load_dataset, load_from_disk, concatenate_datasets, Dataset, DatasetDict +from transformers import AutoTokenizer, AutoModelForCausalLM +import torch + + +# export HF_ENDPOINT=https://hf-mirror.com +# export XDG_CACHE_HOME=./cache + +tokenizer = AutoTokenizer.from_pretrained("BAAI/OpenSeek-Small-v1-SFT", trust_remote_code=True) +tokenizer.save_pretrained("./models/OpenSeek-Small-v1-SFT") +model = AutoModelForCausalLM.from_pretrained("BAAI/OpenSeek-Small-v1-SFT", trust_remote_code=True).to(torch.bfloat16) +model.save_pretrained("./models/OpenSeek-Small-v1-SFT") + +numina_math_cot = load_dataset("AI-MO/NuminaMath-CoT", num_proc=4) +print(numina_math_cot) +numina_math_cot.save_to_disk("./datasets/AI-MO/NuminaMath-CoT") \ No newline at end of file diff --git a/openseek/competition/pz/losercheems/final/scripts/kto_datasets_process.py b/openseek/competition/pz/losercheems/final/scripts/kto_datasets_process.py new file mode 100644 index 0000000..90f4522 --- /dev/null +++ b/openseek/competition/pz/losercheems/final/scripts/kto_datasets_process.py @@ -0,0 +1,21 @@ + +from datasets import 
load_dataset, DatasetDict, concatenate_datasets, load_from_disk + +def process(example): + # kto + example["prompt"] = [ + example["messages"][0] + ] + example["completion"] = [ + example["messages"][1] + ] + + example["label"] = True + return example + +numina_math_cot = load_from_disk("/root/code/small-doge/datasets/AI-MO/NuminaMath-CoT") +print(numina_math_cot) +numina_math_cot = numina_math_cot.map(process, num_proc=4).select_columns(["prompt", "completion", "label"]) +print(numina_math_cot) +print(numina_math_cot["train"][0]) +numina_math_cot.save_to_disk("./datasets/AI-MO/NuminaMath-CoT-preference") \ No newline at end of file diff --git a/openseek/competition/pz/losercheems/final/train.sh b/openseek/competition/pz/losercheems/final/train.sh new file mode 100644 index 0000000..12038cb --- /dev/null +++ b/openseek/competition/pz/losercheems/final/train.sh @@ -0,0 +1,10 @@ +#!/bin/bash + +export HF_ENDPOINT=https://hf-mirror.com +export XDG_CACHE_HOME=cache +export WANDB_OFFLINE=true + +ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml ./trainer/kto.py --config recipes/openseek/config.yaml + +# tmux new -s openseek +# tmux attach -t openseek \ No newline at end of file diff --git a/openseek/competition/pz/losercheems/final/trainer/kto.py b/openseek/competition/pz/losercheems/final/trainer/kto.py new file mode 100644 index 0000000..76aaad1 --- /dev/null +++ b/openseek/competition/pz/losercheems/final/trainer/kto.py @@ -0,0 +1,176 @@ +import logging +import os +import sys + +import datasets +import torch +import transformers +from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed +from transformers.trainer_utils import get_last_checkpoint + +from trl import ( + ModelConfig, + ScriptArguments, + KTOConfig, + KTOTrainer, + TrlParser, + get_kbit_device_map, + get_peft_config, + get_quantization_config, +) + + +logger = logging.getLogger(__name__) + + +def main( + script_args: ScriptArguments, + 
training_args: KTOConfig, + model_args: ModelConfig, +): + # Set seed for reproducibility + set_seed(training_args.seed) + + ############### + # Setup logging + ############### + logging.basicConfig( + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%Y-%m-%d %H:%M:%S", + handlers=[logging.StreamHandler(sys.stdout)], + ) + log_level = training_args.get_process_log_level() + logger.setLevel(log_level) + datasets.utils.logging.set_verbosity(log_level) + transformers.utils.logging.set_verbosity(log_level) + transformers.utils.logging.enable_default_handler() + transformers.utils.logging.enable_explicit_format() + + # Log on each process a small summary + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" + + f" distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + logger.info(f"Model parameters {model_args}") + logger.info(f"Script parameters {script_args}") + logger.info(f"Data parameters {training_args}") + + ################ + # Load tokenizer + ################ + tokenizer = AutoTokenizer.from_pretrained( + model_args.model_name_or_path, + revision=model_args.model_revision, + trust_remote_code=model_args.trust_remote_code, + use_fast=True + ) + tokenizer.padding_side = "left" + if tokenizer.pad_token is None: + tokenizer.pad_token = tokenizer.eos_token + + ############### + # Load datasets + ############### + logger.info("Using processor for dataset mixing and processing") + dataset = datasets.load_from_disk(script_args.dataset_name) + + ################### + # Model init kwargs + ################### + logger.info("*** Initializing model kwargs ***") + torch_dtype = ( + model_args.torch_dtype if model_args.torch_dtype in ["auto", None] else getattr(torch, model_args.torch_dtype) + ) + quantization_config = get_quantization_config(model_args) + model_kwargs = dict( + revision=model_args.model_revision, + 
trust_remote_code=model_args.trust_remote_code, + attn_implementation=model_args.attn_implementation, + torch_dtype=torch_dtype, + use_cache=False if training_args.gradient_checkpointing else True, + device_map=get_kbit_device_map() if quantization_config is not None else None, + quantization_config=quantization_config, + ) + training_args.model_init_kwargs = model_kwargs + model = AutoModelForCausalLM.from_pretrained( + model_args.model_name_or_path, + **model_kwargs, + ) + ref_model = AutoModelForCausalLM.from_pretrained( + model_args.model_name_or_path, + **model_kwargs, + ) + + # Check for last checkpoint + last_checkpoint = None + if os.path.isdir(training_args.output_dir): + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info(f"Checkpoint detected, resuming training at {last_checkpoint=}.") + else: + logger.info("No checkpoint found, starting training from scratch.") + + ############################ + # Initialize the KTO Trainer + ############################ + training_args.model_init_kwargs = None + trainer = KTOTrainer( + model=model, + ref_model=ref_model, + args=training_args, + train_dataset=dataset[script_args.dataset_train_split], + eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None, + processing_class=tokenizer, + peft_config=get_peft_config(model_args), + ) + + ############### + # Training loop + ############### + logger.info("*** Start training... 
***") + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + metrics["train_samples"] = len(dataset[script_args.dataset_train_split]) + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + ################################## + # Save model and create model card + ################################## + logger.info("*** Saving model... ***") + trainer.save_model(training_args.output_dir) + logger.info(f"Model saved to {training_args.output_dir}") + + # Save everything else on main process + if trainer.accelerator.is_main_process: + trainer.create_model_card() + # Restore k,v cache for fast inference + trainer.model.config.use_cache = True + trainer.model.config.save_pretrained(training_args.output_dir) + + logger.info("*** Training complete ***") + + ########## + # Evaluate + ########## + if training_args.do_eval: + logger.info("*** Start evaluation... ***") + metrics = trainer.evaluate() + metrics["eval_samples"] = len(dataset[script_args.dataset_test_split]) + trainer.log_metrics("eval", metrics) + trainer.save_metrics("eval", metrics) + logger.info("*** Evaluation complete ***") + + logger.info("*** Training finished! 
***") + + +if __name__ == "__main__": + parser = TrlParser((ScriptArguments, KTOConfig, ModelConfig)) + script_args, training_args, model_args = parser.parse_args_and_config(fail_with_unknown_args=False) + main(script_args, training_args, model_args) diff --git a/openseek/competition/pz/losercheems/preliminary/README.md b/openseek/competition/pz/losercheems/preliminary/README.md new file mode 100644 index 0000000..34df6db --- /dev/null +++ b/openseek/competition/pz/losercheems/preliminary/README.md @@ -0,0 +1,207 @@ +# OpenSeek Continue Pretraining – Technical Report + +This document describes the three-stage pretraining workflow in this repository, focused on Ubuntu-based HPC clusters: + +1) Asset download and dataset building: `scripts/download.py` downloads the tokenizer and base model checkpoints, mixes multiple datasets by ratios, and performs sequence packing. +2) Distributed training: `recipes/accelerate_configs/ddp.yaml` (Accelerate DDP), `recipes/openseek/config.yaml` (training config), and `trainer/pt_dpsk.py` (training entry) are driven by `train.sh` on a cluster. Checkpoints are saved every 1,000 steps. +3) Model soup: `scripts/merge.py` merges the seven checkpoints with the base model using the average strategy. + +--- +## Public Checkpoint + +The continue pretraining checkpoint is released at: `JingzeShi/OpenSeek-1.4B-A0.4B`. + +Typical load snippet: +```python +from transformers import AutoModelForCausalLM, AutoTokenizer +model_id = "JingzeShi/OpenSeek-1.4B-A0.4B" +tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) +model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True) +``` +--- + +## Key Technical Points + +- Dataset mixing ratios must sum to 1.0 (validated) to meet the target total sample size after packing/truncation. +- Tokenizer guardrails: right padding; if no `pad_token`, fall back to `eos_token` to avoid shape issues. 
+- Packing vs truncation: + - Packing uses TRL `pack_dataset` to concatenate tokenized sequences up to `max_length` (higher throughput, fewer padding tokens). + - Truncation uses TRL `truncate_dataset` when packing is disabled. +- Reproducibility: set random seeds for dataset shuffling and training; minor variance is possible due to multi-process packing. +- DDP with bf16: use Accelerate `MULTI_GPU` + `mixed_precision: bf16` on GPUs that support BF16 (Ampere+). Adjust `num_processes` to the actual GPU count per node. +- Checkpointing: save every 1,000 steps (`save_steps: 1000`), with `save_total_limit` to bound disk usage. +- Model soup: `average` parameter-wise mean (only matching parameter names are merged). +- IO and performance: persist processed datasets to disk (Arrow) to reduce training-time overhead. + +--- + +## Directory Map and Key Files + +- `scripts/download.py` + - Downloads and saves base model/tokenizer to `./models/OpenSeek_Small_v1`. + - Mixes many dataset shards by ratios, tokenizes, and packs to `./datasets/OpenSeek-Pretrain-30B`. +- `processor/pt_datasets_process.py` + - Dataset ratio validation, optional formatting, tokenization, packing/truncation (`trl.pack_dataset`/`trl.truncate_dataset`), and parallel mapping. +- `recipes/accelerate_configs/ddp.yaml` + - Accelerate DDP configuration (single-node multi-GPU in the template; adapt to your cluster). +- `recipes/openseek/config.yaml` + - Trainer configuration (logging, save frequency, output dir, dtype, attn implementation, etc.). +- `trainer/pt_dpsk.py` + - Loads tokenizer/model, builds datasets via `mix_pt_datasets`, initializes `Trainer`, runs training/evaluation. +- `train.sh` + - Example Slurm batch script that sets CUDA/cuDNN modules/conda, mirrors, and launches training via Accelerate. +- `scripts/merge.py` + - Builds a model soup from base model and checkpoints using the `average` strategy. 
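
The `average` strategy named above is a parameter-wise mean. A minimal sketch over plain Python dicts (the real `scripts/merge.py` operates on torch state dicts, but the key-matching rule is the same):

```python
def average_soup(state_dicts):
    """Parameter-wise mean across checkpoints.

    Only parameter names present in every state dict are merged;
    non-matching keys are skipped, as in scripts/merge.py.
    """
    common = set(state_dicts[0]).intersection(*map(set, state_dicts[1:]))
    return {k: sum(sd[k] for sd in state_dicts) / len(state_dicts)
            for k in sorted(common)}
```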
+ +--- + +## Environment and Dependencies + +- OS: Ubuntu 20.04/22.04 +- GPU: NVIDIA A800 +- CUDA/cuDNN: match your cluster modules (example uses CUDA 12.2, cuDNN 9.8) +- Python: 3.10+ +- Core packages: + - torch (CUDA build), transformers, datasets, accelerate, trl + - sentencepiece (if tokenizer requires), safetensors (recommended) +- Optional environment variables: + - `HF_ENDPOINT=https://hf-mirror.com` (if you use a mirror) + - `XDG_CACHE_HOME=cache` (to consolidate cache location) + +Example (adapt to your module system/conda env): + +```bash +# create and activate a conda env (if not pre-provisioned by your cluster) +conda create -y -n train python=3.10 && conda activate train + +# install essentials (pin versions per your infra policies) +pip install --upgrade pip +pip install torch --index-url https://download.pytorch.org/whl/cu124 # example; match your CUDA +pip install transformers datasets accelerate trl sentencepiece safetensors +``` + +--- + +## Step 1: Download Base Model/Tokenizer and Build Mixed Packed Dataset + +Script: `scripts/download.py` + +- Loads `BAAI/OpenSeek-Small-v1`, saves model/tokenizer to `./models/OpenSeek_Small_v1`. +- Defines `datasets_and_ratios` (a list of `{dataset_name: ratio}`) summing to 1.0. +- Sets `total_sample_size=7_500_000`, `max_length=4096`, `packing=True`, `dataset_num_proc=4`, `seed=233`. +- Calls `mix_pt_datasets` (aliased from `pt_datasets_process.py`) to: + - Load each dataset shard (Hub path or local `load_from_disk` path). + - Optionally apply a `formatting_func`. + - Tokenize with right padding (fallback `pad_token=eos_token` if missing). + - Pack or truncate sequences based on `max_length`. + - Sample and concatenate by ratio to hit the target size. +- Saves the processed dataset to `./datasets/OpenSeek-Pretrain-30B` (Arrow format). 
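
The mixing contract can be made concrete. `samples_per_dataset` below is an illustrative reading of how a `total_sample_size` of 7,500,000 is apportioned across shards; the ratio check mirrors `validate_dataset_ratios` in `processor/pt_datasets_process.py`, while the exact rounding inside `mix_pt_datasets` may differ:

```python
def samples_per_dataset(total_sample_size, datasets_and_ratios):
    """Split a target sample count across datasets by ratio.

    datasets_and_ratios follows the download.py convention:
    a list of single-entry dicts, e.g. [{"ds_a": 0.6}, {"ds_b": 0.4}].
    """
    total_ratio = sum(r for d in datasets_and_ratios for r in d.values())
    if abs(total_ratio - 1.0) > 1e-6:
        raise ValueError(f"Total ratio must be 1.0, got {total_ratio}")
    return {name: round(total_sample_size * ratio)
            for d in datasets_and_ratios for name, ratio in d.items()}
```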
+ +Run (on Ubuntu cluster login/compute node): + +```bash +python scripts/download.py +``` + +Notes: +- With `packing=True`, the “sample size” refers to packed sequences, not raw documents. +- Adjust `max_length` and downstream batch size/grad-accumulation for available VRAM. +- Ensure sufficient disk space for cached and processed datasets. + +--- + +## Step 2: Distributed Training (DDP + Accelerate + Trainer) + +Core configs/scripts: +- `recipes/accelerate_configs/ddp.yaml` + - `distributed_type: MULTI_GPU`, `mixed_precision: bf16`, `gpu_ids: all`. + - Set `num_processes` to the actual GPU count per node (e.g., 4 or 8). +- `recipes/openseek/config.yaml` + - Logging: `logging_steps: 1`, `report_to: [tensorboard]`. + - Saving: `save_strategy: steps`, `save_steps: 1000`, `save_total_limit: 10`. + - Output: `output_dir: data/OpenSeek-1.4B-A0.4B`, `overwrite_output_dir: true`. + - Model: `model_name_or_path: ./models/OpenSeek_Small_v1`, `torch_dtype: bfloat16`, `trust_remote_code: true`. + - Attention: `attn_implementation: sdpa` (ensure compatibility with your CUDA/cuDNN stack). +- `trainer/pt_dpsk.py` + - Loads tokenizer, builds dataset via `mix_pt_datasets`, sets up `Trainer`, trains and evaluates. +- `train.sh` + - Example Slurm script that configures CUDA modules/conda and launches Accelerate. + +Submit training job (Slurm): + +```bash +sbatch train.sh +``` + +Outputs: +- Checkpoints at `data/OpenSeek-1.4B-A0.4B/checkpoint-1000`, `...-2000`, ..., `...-7000`. +- TensorBoard logs (if enabled) for monitoring. + +Recommendations: +- Make `num_processes` in `ddp.yaml` match `--gpus` allocated by Slurm. +- Ensure `output_dir` has sufficient free space; prune old checkpoints with `save_total_limit`. +- Tune effective batch size (global batch = per-device batch × num GPUs × grad-accum) to stabilize training. 
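
To make the effective-batch formula above concrete (the per-device batch and grad-accum values here are hypothetical examples, not read from the preliminary config):

```python
import math

def global_batch(per_device_batch: int, num_gpus: int, grad_accum: int) -> int:
    # global batch = per-device batch x num GPUs x grad-accum
    return per_device_batch * num_gpus * grad_accum

def optimizer_steps_per_epoch(num_samples: int, gbs: int) -> int:
    # a final partial batch still triggers an optimizer step
    return math.ceil(num_samples / gbs)

# e.g. per-device batch 8 on 8 GPUs with grad-accum 2 -> global batch 128,
# so one epoch over 7,500,000 packed sequences takes ceil(7.5e6 / 128) steps
gbs = global_batch(8, 8, 2)
steps = optimizer_steps_per_epoch(7_500_000, gbs)
```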
+ +--- + +## Step 3: Model Soup (Checkpoint Merging) + +Script: `scripts/merge.py` + +- Merges base model `./models/OpenSeek_Small_v1` with checkpoints `data/OpenSeek-1.4B-A0.4B/checkpoint-1000..7000`. +- Strategy: + - `average`: unweighted parameter-wise mean across all checkpoints. +- Only matching parameter keys are merged; others are skipped. +- Saves merged model, tokenizer, config, and `merge_info.json` to output directory (default `./models/OpenSeek_Small_v1-merged-average`). + +Run: + +```bash +python scripts/merge.py +``` + +Notes: +- BF16 saves reduce disk footprint; convert to FP32 if you need full precision checkpoints for downstream tasks. +- Inspect merge logs for matched key counts to verify architecture compatibility. + +--- + +## End-to-End Repro (Commands) + +1) Build datasets and cache base assets: +```bash +python scripts/download.py +``` +2) Launch distributed training (Slurm): +```bash +sbatch train.sh +``` +3) Merge checkpoints into a model soup: +```bash +python scripts/merge.py +``` + +--- + +## Troubleshooting + +- Ratios do not sum to 1.0: + - Fix `datasets_and_ratios`; `download.py` prints total ratio for verification. +- OOM (GPU memory): + - Reduce `max_length`/`per_device_train_batch_size`, increase `gradient_accumulation_steps`, enable gradient checkpointing. +- DDP process count mismatch: + - Align `ddp.yaml:num_processes` with the actual GPU count allocated by Slurm. +- BF16 not supported: + - Switch to FP16 or FP32 and adjust configs accordingly (expect perf/throughput changes). +- Slow data loading: + - Persist datasets to disk (Arrow), increase `dataset_num_proc`, and ensure fast storage. + +--- + +## Reproducibility + +- Seeds are set for both dataset preparation and training. Minor nondeterminism may remain due to multi-processing and packing. +- Pin package versions and keep hardware/software constant to improve reproducibility. 
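For reference, the Step 3 `average` strategy reduces to a parameter-wise mean with a per-key contribution count. A toy sketch over plain floats (the real `scripts/merge.py` applies the same idea to `state_dict` tensors):

```python
# Parameter-wise "model soup" averaging: each key is divided by the
# number of checkpoints that actually contained it, so keys missing
# from some checkpoints are averaged over fewer contributors rather
# than being dragged toward zero.
from collections import OrderedDict


def soup_average(state_dicts):
    sums, counts = OrderedDict(), {}
    for sd in state_dicts:
        for key, value in sd.items():
            sums[key] = sums.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1
    return OrderedDict((k, sums[k] / counts[k]) for k in sums)


ckpts = [{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 2.0}, {"w": 5.0, "b": 4.0}]
print(dict(soup_average(ckpts)))  # {'w': 3.0, 'b': 2.0}
```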
+ +--- + diff --git a/openseek/competition/pz/losercheems/preliminary/losercheems.pdf b/openseek/competition/pz/losercheems/preliminary/losercheems.pdf new file mode 100644 index 0000000..db2385e Binary files /dev/null and b/openseek/competition/pz/losercheems/preliminary/losercheems.pdf differ diff --git a/openseek/competition/pz/losercheems/preliminary/processor/pt_datasets_process.py b/openseek/competition/pz/losercheems/preliminary/processor/pt_datasets_process.py new file mode 100644 index 0000000..f212c5a --- /dev/null +++ b/openseek/competition/pz/losercheems/preliminary/processor/pt_datasets_process.py @@ -0,0 +1,346 @@ +from typing import List, Dict, Optional, Union, Callable +import json +import logging +import warnings +import re +from datasets import Dataset, IterableDataset, DatasetDict, load_dataset, load_from_disk, concatenate_datasets +from transformers import AutoTokenizer, PreTrainedTokenizerBase +from trl.data_utils import pack_dataset, truncate_dataset +from argparse import ArgumentParser + + +# Configure logger +logging.basicConfig( + level=logging.INFO, + format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', + datefmt='%Y-%m-%d %H:%M:%S' +) +logger = logging.getLogger(__name__) + + +def validate_dataset_ratios(datasets_and_ratios: List[Dict[str, float]]) -> None: + """Validate that dataset ratios are properly formatted and sum to 1.0.""" + if not datasets_and_ratios: + raise ValueError("datasets_and_ratios cannot be empty") + + total_ratio = 0.0 + for dataset_dict in datasets_and_ratios: + if not isinstance(dataset_dict, dict) or len(dataset_dict) != 1: + raise ValueError("Each item in datasets_and_ratios must be a dictionary with exactly one key-value pair") + + ratio = list(dataset_dict.values())[0] + if not isinstance(ratio, (int, float)) or ratio <= 0: + raise ValueError(f"Ratio must be a positive number, got {ratio}") + + total_ratio += ratio + + if abs(total_ratio - 1.0) > 1e-6: + raise ValueError(f"Total ratio must be 1.0, but 
got {total_ratio}. Please check your ratios.")
+
+
+def validate_tokenizer(tokenizer: PreTrainedTokenizerBase) -> None:
+    """Validate tokenizer configuration."""
+    if tokenizer.pad_token is None and tokenizer.eos_token is not None:
+        logger.warning("Tokenizer has no pad_token, using eos_token as pad_token")
+        tokenizer.pad_token = tokenizer.eos_token
+
+
+def prepare_dataset(
+    dataset: Union[Dataset, IterableDataset],
+    dataset_name: str,
+    dataset_text_field: str,
+    processing_class: Union[PreTrainedTokenizerBase],
+    max_length: Optional[int],
+    packing: Optional[bool],
+    formatting_func: Optional[Callable[[dict], str]],
+    dataset_num_proc: Optional[int],
+) -> Union[Dataset, IterableDataset]:
+
+    # If the dataset is already preprocessed, skip the processing step
+    column_names = list(next(iter(dataset)).keys())
+    is_processed = "input_ids" in column_names
+
+    # Build the kwargs for the `map` function
+    map_kwargs = {}
+    if isinstance(dataset, Dataset): # IterableDataset does not support num_proc
+        map_kwargs["num_proc"] = dataset_num_proc
+
+    # Apply the formatting function if any
+    if formatting_func is not None and is_processed:
+        warnings.warn(
+            "You passed a dataset that is already processed (contains an `input_ids` field) together with a "
+            "formatting function. Therefore `formatting_func` will be ignored. 
Either remove the " + "`formatting_func` or pass a dataset that is not already processed.", + UserWarning, + ) + + if formatting_func is not None and not is_processed: + if isinstance(dataset, Dataset): # `IterableDataset.map` does not support `desc` + map_kwargs["desc"] = f"Applying formatting function to {dataset_name} dataset" + + batched = isinstance(formatting_func(next(iter(dataset))), list) + + def _func(example): + return {"text": formatting_func(example)} + + dataset = dataset.map(_func, batched=batched, **map_kwargs) + + + if not is_processed: + + # Tokenize the dataset if needed + if isinstance(dataset, Dataset): # `IterableDataset.map` does not support `desc` + map_kwargs["desc"] = f"Tokenizing {dataset_name} dataset" + + def tokenize(example, processing_class, dataset_text_field): + try: + processed = processing_class(text=example[dataset_text_field]) + if ( + processing_class.eos_token_id is not None + and len(processed["input_ids"]) > 0 + and processed["input_ids"][-1] != processing_class.eos_token_id + ): + processed["input_ids"] = processed["input_ids"] + [processing_class.eos_token_id] + processed["attention_mask"] = processed["attention_mask"] + [1] + return processed + except Exception as e: + logger.error(f"Error tokenizing example: {e}") + # Return empty tokenization on error + return { + "input_ids": [processing_class.eos_token_id] if processing_class.eos_token_id is not None else [], + "attention_mask": [1] if processing_class.eos_token_id is not None else [] + } + + dataset = dataset.map( + tokenize, + fn_kwargs={"processing_class": processing_class, "dataset_text_field": dataset_text_field}, + **map_kwargs, + ) + + # Pack or truncate + if packing: + if max_length is None: + raise ValueError("When packing is enabled, `max_length` can't be `None`.") + if isinstance(dataset, Dataset): # `IterableDataset.map` does not support `desc` + map_kwargs["desc"] = f"Packing {dataset_name} dataset" + dataset = dataset.select_columns("input_ids") + 
dataset = pack_dataset(dataset, seq_length=max_length, strategy="bfd", map_kwargs=map_kwargs) + elif max_length is not None: + if isinstance(dataset, Dataset): # `IterableDataset.map` does not support `desc` + map_kwargs["desc"] = f"Truncating {dataset_name} dataset" + dataset = truncate_dataset(dataset, max_length, map_kwargs) + return dataset + + +def mix_datasets_by_ratio( + datasets_and_ratios: List[Dict[str, float]], + total_sample_size: int, + dataset_text_field: str, + processing_class: Union[PreTrainedTokenizerBase], + max_length: Optional[int], + packing: Optional[bool], + formatting_func: Optional[Callable[[dict], str]], + dataset_num_proc: Optional[int], + seed: Optional[int] = None, + cache_dir: Optional[str] = None, +): + """ + Mix multiple datasets by ratio. + + Args: + datasets_and_ratios: List of dictionaries, each containing a dataset and its ratio. + Each dictionary contains one key-value pair where key is the dataset name and value is the mixing ratio. + total_sample_size: Total sample size for the mixed training dataset. + dataset_text_field: Name of the field in the dataset that contains the text. + processing_class: Tokenizer class used for processing the text. + max_length: Maximum length of processed sequences. Set to None for no limit. + packing: Whether to pack sequences for efficiency. + formatting_func: Optional formatting function to convert dataset entries to the desired text format. + dataset_num_proc: Number of processes to use for dataset processing. + seed: Random seed for dataset shuffling to ensure reproducibility. + cache_dir: Directory to cache the datasets. + + Returns: + DatasetDict: A dictionary containing all mixed and processed dataset splits. 
+ + Example: + ```python + from transformers import AutoTokenizer + + # Define datasets and their mixing ratios + datasets_and_ratios = [ + {"SmallDoge/MiniCorpus:web-en": 0.5}, + {"SmallDoge/MiniCorpus:web-zh": 0.2}, + {"SmallDoge/MiniCorpus:textbook-en": 0.15}, + {"SmallDoge/MiniCorpus:textbook-zh": 0.05}, + {"SmallDoge/MiniCorpus:code": 0.05}, + {"SmallDoge/MiniCorpus:math": 0.05}, + ] + + # Create tokenizer + tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-tokenizer") + + # Mix datasets + mixed_dataset = mix_datasets_by_ratio( + datasets_and_ratios=datasets_and_ratios, + total_sample_size=10000, + dataset_text_field="text", + processing_class=tokenizer, + max_length=2048, + packing=True, + formatting_func=None, + dataset_num_proc=4, + seed=42, + cache_dir="./cache", + ) + print(mixed_dataset) + ```""" + # Validate input parameters + validate_dataset_ratios(datasets_and_ratios) + + # Check if the dataset ratios sum to 1.0 (redundant but kept for backwards compatibility) + total_ratio = sum([list(dataset.values())[0] for dataset in datasets_and_ratios]) + if abs(total_ratio - 1.0) > 1e-6: + raise ValueError(f"Total ratio must be 1.0, but got {total_ratio}. 
Please check your ratios.") + + final_mixed_dataset = {} + + for dataset_and_ratio in datasets_and_ratios: + dataset_name, ratio = dataset_and_ratio.popitem() + + # Check subset name + windows_drive_pattern = r'^[a-zA-Z]:.*' + is_windows_path = bool(re.match(windows_drive_pattern, dataset_name)) + if ":" in dataset_name and not is_windows_path: + dataset_name, subset_name = dataset_name.split(":") + else: + subset_name = None + + # If the dataset is a string, load it from the hub or disk + if isinstance(dataset_name, str): + if re.match(r"^[^/]+/[^/]+$", dataset_name): + dataset = load_dataset(dataset_name, name=subset_name, cache_dir=cache_dir) + else: + if subset_name is not None: + warnings.warn( + f"You passed a local dataset path, subsetting is not supported, ignoring subset name: {subset_name}", + UserWarning, + ) + dataset = load_from_disk(dataset_name) + + # Process each split of the dataset + for split_name, split_dataset in dataset.items(): + split_dataset = split_dataset.select_columns(["input_ids"]) + logger.info(f"Original dataset size for {dataset_name}: {subset_name}: {split_name}: {len(split_dataset)}") + # Process the dataset from text to input_ids + split_dataset = prepare_dataset( + split_dataset, + dataset_name=f"{dataset_name}: {split_name}" if subset_name is None else f"{dataset_name}: {subset_name}: {split_name}", + dataset_text_field=dataset_text_field, + processing_class=processing_class, + max_length=max_length, + packing=packing, + formatting_func=formatting_func, + dataset_num_proc=dataset_num_proc, + ) + + # Calculate the target size for the dataset + if total_sample_size == -1: + target_size = len(split_dataset) + else: + target_size = int(total_sample_size * ratio) if split_name == "train" else len(split_dataset) + current_size = len(split_dataset) + logger.info(f"Processed dataset size for {dataset_name}: {split_name}: {current_size}") + logger.info(f"Target size for {dataset_name}: {split_name}: {target_size}") + + # If the dataset 
is smaller than the target size, repeat it + if current_size < target_size: + logger.warning( + f"Dataset {dataset_name}: {split_name} is smaller than the target size. " + f"Repeating the dataset to reach the target size." + ) + indices = [] + full_copies = target_size // current_size + remainder = target_size % current_size + + for _ in range(full_copies): + indices.extend(range(current_size)) + if remainder > 0: + indices.extend(range(remainder)) + + split_dataset = split_dataset.select(indices) + else: + logger.warning( + f"Dataset {dataset_name}: {split_name} is larger than the target size. " + f"Truncating the dataset to reach the target size." + ) + split_dataset = split_dataset.select(range(target_size)) + + # Concatenate the split dataset with the final mixed dataset + if split_name in final_mixed_dataset: + final_mixed_dataset[split_name] = concatenate_datasets( + [final_mixed_dataset[split_name], split_dataset] + ) + else: + final_mixed_dataset[split_name] = split_dataset + + # Shuffle the train dataset + if "train" in final_mixed_dataset: + final_mixed_dataset["train"] = final_mixed_dataset["train"].shuffle(seed) + + # Create a DatasetDict with the merged datasets + final_dataset = DatasetDict(final_mixed_dataset) + return final_dataset + + +def main(args): + # Load the tokenizer + tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name_or_path) + validate_tokenizer(tokenizer) + + # Mix datasets + mixed_dataset = mix_datasets_by_ratio( + datasets_and_ratios=args.datasets_and_ratios, + total_sample_size=args.total_sample_size, + dataset_text_field=args.dataset_text_field, + processing_class=tokenizer, + max_length=args.max_length, + packing=args.packing, + formatting_func=None, + dataset_num_proc=args.dataset_num_proc, + seed=args.seed, + cache_dir=args.cache_dir, + ) + + # Save the mixed dataset + mixed_dataset.save_to_disk(args.dataset_save_path) + print(f"Mixed dataset saved to {args.dataset_save_path}") + +if __name__ == "__main__": + argparser 
= ArgumentParser() + argparser.add_argument("--datasets_and_ratios", type=str, required=True, + help="JSON string of list of dictionaries with dataset names and mixing ratios") + argparser.add_argument("--dataset_save_path", type=str, required=True, + help="Path to save the mixed dataset") + argparser.add_argument("--total_sample_size", type=int, required=True, + help="Total sample size for the mixed training dataset") + argparser.add_argument("--dataset_text_field", type=str, required=True, + help="Name of the field in the dataset that contains the text") + argparser.add_argument("--tokenizer_name_or_path", type=str, required=True, + help="Tokenizer name or path") + argparser.add_argument("--max_length", type=int, default=2048, + help="Maximum length of processed sequences") + argparser.add_argument("--packing", action="store_true", + help="Whether to pack sequences for efficiency") + argparser.add_argument("--dataset_num_proc", type=int, default=4, + help="Number of processes for dataset processing") + argparser.add_argument("--seed", type=int, default=42, + help="Random seed for reproducibility") + argparser.add_argument("--cache_dir", type=str, default="./cache", + help="Directory to cache datasets") + args = argparser.parse_args() + + # Parse datasets_and_ratios from JSON string + args.datasets_and_ratios = json.loads(args.datasets_and_ratios) + + main(args) diff --git a/openseek/competition/pz/losercheems/preliminary/recipes/accelerate_configs/ddp.yaml b/openseek/competition/pz/losercheems/preliminary/recipes/accelerate_configs/ddp.yaml new file mode 100644 index 0000000..4f05571 --- /dev/null +++ b/openseek/competition/pz/losercheems/preliminary/recipes/accelerate_configs/ddp.yaml @@ -0,0 +1,16 @@ +compute_environment: LOCAL_MACHINE +debug: false +distributed_type: MULTI_GPU +downcast_bf16: 'no' +gpu_ids: all +machine_rank: 0 +main_training_function: main +mixed_precision: bf16 +num_machines: 1 +num_processes: 8 +rdzv_backend: static +same_network: true 
+tpu_env: [] +tpu_use_cluster: false +tpu_use_sudo: false +use_cpu: false diff --git a/openseek/competition/pz/losercheems/preliminary/recipes/openseek/config.yaml b/openseek/competition/pz/losercheems/preliminary/recipes/openseek/config.yaml new file mode 100644 index 0000000..93e11e3 --- /dev/null +++ b/openseek/competition/pz/losercheems/preliminary/recipes/openseek/config.yaml @@ -0,0 +1,189 @@ +# Logging and Output arguments +log_level: info +logging_strategy: steps +logging_steps: 1 +report_to: +- tensorboard +save_strategy: steps +save_steps: 1000 +save_total_limit: 10 +output_dir: data/OpenSeek-1.4B-A0.4B +overwrite_output_dir: true + +# Model arguments +model_config: + # Basic model configuration + vocab_size: 151851 + hidden_size: 1280 + intermediate_size: 7168 + num_hidden_layers: 6 + hidden_act: "silu" + use_cache: true + tie_word_embeddings: true + max_position_embeddings: 4096 + + # Attention configuration + attention_bias: false + attention_dropout: 0.0 + num_attention_heads: 10 + num_key_value_heads: 10 + # head_dim: 128 + qk_nope_head_dim: 128 + qk_rope_head_dim: 64 + v_head_dim: 128 + kv_lora_rank: 512 + q_lora_rank: null + + # RoPE configuration + rope_theta: 1000000 + rope_scaling: null + rope_interleave: true + + # MoE configuration + moe_intermediate_size: 896 + n_routed_experts: 64 + n_shared_experts: 2 + num_experts_per_tok: 6 + first_k_dense_replace: 1 + norm_topk_prob: true + topk_group: 1 + routed_scaling_factor: 2.446 + n_group: 1 + + # Other configuration + rms_norm_eps: 1.0e-6 + initializer_range: 0.006 + bos_token_id: 0 + eos_token_id: 1 + pretraining_tp: 1 + attn_implementation: sdpa + +model_name_or_path: ./models/OpenSeek_Small_v1 +torch_dtype: bfloat16 +trust_remote_code: True + +# # Data training arguments +# datasets_and_ratios: +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-high_part_142_text_document: 0.011068 +# - 
/workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-low_part_62_text_document: 0.003577 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-mid_part_189_text_document: 0.007775 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-high_part_76_text_document: 0.002859 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-low_part_124_text_document: 0.001672 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-mid_part_29_text_document: 0.002339 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-high_part_244_text_document: 0.005397 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-low_part_150_text_document: 0.004064 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-mid_part_444_text_document: 0.005005 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-high_part_498_text_document: 0.004616 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-low_part_10_text_document: 0.00067 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-mid_part_144_text_document: 0.003429 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-high_part_86_text_document: 0.00261 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-low_part_133_text_document: 0.001824 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-mid_part_139_text_document: 0.002313 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-high_part_47_text_document: 0.008237 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-low_part_11_text_document: 0.002866 +# - 
/workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-mid_part_97_text_document: 0.00667 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-high_part_43_text_document: 0.004657 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-low_part_10_text_document: 0.002005 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-mid_part_164_text_document: 0.004317 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-high_part_92_text_document: 0.011397 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-low_part_113_text_document: 0.006782 +# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-mid_part_563_text_document: 0.009175 +# - /workspace/datasets/OpenSeek-Pretrain-100B/arxiv_007_00000_text_document: 0.006414 +# - /workspace/datasets/OpenSeek-Pretrain-100B/books_016_00007_text_document: 0.004696 +# - /workspace/datasets/OpenSeek-Pretrain-100B/code-high_part_13_text_document: 0.010102 +# - /workspace/datasets/OpenSeek-Pretrain-100B/code-low_part_36_text_document: 0.011403 +# - /workspace/datasets/OpenSeek-Pretrain-100B/code-mid_part_37_text_document: 0.009674 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-high_23_text_document: 0.003755 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-low_51_text_document: 0.000499 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_118_text_document: 0.003608 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_176_text_document: 0.003623 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_256_text_document: 0.003704 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_320_text_document: 0.003733 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_32_text_document: 0.003631 +# - 
/workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-high_1_text_document: 0.002573 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-low_2_text_document: 0.001638 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-mid_3_text_document: 0.003251 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-high_2_text_document: 0.060237 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-low_1_text_document: 0.089063 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-mid_2_text_document: 0.101376 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_code-high_4_text_document: 0.004598 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_code-low_6_text_document: 0.006857 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_code-mid_23_text_document: 0.00899 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_math-high_12_text_document: 0.013135 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_math-low_3_text_document: 0.01653 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_math-mid_5_text_document: 0.003536 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_wiki-high_5_text_document: 0.006314 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_wiki-low_5_text_document: 0.005978 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_wiki-mid_4_text_document: 0.007909 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_CC-high_74_text_document: 0.002225 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_CC-low_54_text_document: 0.001797 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_CC-mid_275_text_document: 0.002042 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-high_4_text_document: 0.004081 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-low_2_text_document: 0.001659 +# - 
/workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-mid_6_text_document: 0.012828 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_arxiv-high_2_text_document: 0.0568 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_arxiv-low_1_text_document: 0.074907 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_arxiv-mid_1_text_document: 0.089359 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_code-high_13_text_document: 0.007663 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_code-low_9_text_document: 0.004052 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_code-mid_6_text_document: 0.001916 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_math-high_11_text_document: 0.005074 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_math-low_11_text_document: 0.006437 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_math-mid_29_text_document: 0.006406 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_wiki-high_4_text_document: 0.004 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_wiki-low_6_text_document: 0.003564 +# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_wiki-mid_3_text_document: 0.005768 +# - /workspace/datasets/OpenSeek-Pretrain-100B/math-high_part_04_text_document: 0.018165 +# - /workspace/datasets/OpenSeek-Pretrain-100B/math-low_part_10_text_document: 0.01694 +# - /workspace/datasets/OpenSeek-Pretrain-100B/math-mid_part_07_text_document: 0.016311 +# - /workspace/datasets/OpenSeek-Pretrain-100B/pes2o-full-train_train-0041-of-0136_text_document: 0.00687 +# - /workspace/datasets/OpenSeek-Pretrain-100B/pes2o-full-train_train-0125-of-0136_text_document: 0.007387 +# - /workspace/datasets/OpenSeek-Pretrain-100B/pes2o-full-val_valid-0034-of-0060_text_document: 0.000143 +# - /workspace/datasets/OpenSeek-Pretrain-100B/pes2o_pubmedcentral_3_text_document: 0.061982 +# - 
/workspace/datasets/OpenSeek-Pretrain-100B/stack_018_00000_text_document: 0.004229 +# - /workspace/datasets/OpenSeek-Pretrain-100B/wiki_012_00000_text_document: 0.004202 +# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-high-loss0_part_28_text_document: 0.018171 +# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-high-loss1_part_59_text_document: 0.009776 +# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-high-loss2_part_16_text_document: 0.003725 +# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-medidum-loss0_part_192_text_document: 0.009492 +# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-medidum-loss1_part_550_text_document: 0.009236 +# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-medidum-loss2_part_71_text_document: 0.010643 + +datasets_and_ratios: + - /workspace/datasets/OpenSeek-Pretrain-30B: 1.0 + +total_sample_size: 7500000 +dataset_text_field: "text" +max_length: 4096 +packing: true +dataset_num_proc: 4 +cache_dir: "./cache" + +# PT trainer arguments +seed: 233 +do_train: True +max_steps: 7000 +per_device_train_batch_size: 1 +do_eval: False +eval_strategy: 'no' +eval_steps: 100 +per_device_eval_batch_size: 1 +optim: adamw_torch_fused +adam_beta1: 0.9 +adam_beta2: 0.95 +adam_epsilon: 1.0e-8 +learning_rate: 2.0e-5 +lr_scheduler_type: warmup_stable_decay +lr_scheduler_kwargs: + warmup_type: linear + decay_type: linear + num_decay_steps: 0 + min_lr_ratio: 0.0 +warmup_steps: 0 +weight_decay: 0.01 +gradient_accumulation_steps: 128 +gradient_checkpointing: false +gradient_checkpointing_kwargs: + use_reentrant: false +max_grad_norm: 1.0 +bf16: True diff --git a/openseek/competition/pz/losercheems/preliminary/scripts/download.py b/openseek/competition/pz/losercheems/preliminary/scripts/download.py new file mode 100644 index 0000000..97c0b22 --- /dev/null +++ b/openseek/competition/pz/losercheems/preliminary/scripts/download.py @@ -0,0 +1,105 @@ +from ..processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets +from 
transformers import AutoTokenizer, AutoModelForCausalLM +import torch + +# export HF_ENDPOINT=https://hf-mirror.com +# export XDG_CACHE_HOME=cache + + + +if __name__ == "__main__": + + datasets_and_ratios = [ + + {"JingzeShi/OpenSeek-Pretrain-100B:Nemotron-CC-high-actual-actual-high_part_142_text_document": 0.02242}, + {"JingzeShi/OpenSeek-Pretrain-100B:Nemotron-CC-high-synthetic-distill-high_part_76_text_document": 0.00687}, + {"JingzeShi/OpenSeek-Pretrain-100B:Nemotron-CC-high-synthetic-diverse_qa_pairs-high_part_244_text_document": 0.014466}, + {"JingzeShi/OpenSeek-Pretrain-100B:Nemotron-CC-high-synthetic-extract_knowledge-high_part_498_text_document": 0.008715}, + {"JingzeShi/OpenSeek-Pretrain-100B:Nemotron-CC-high-synthetic-knowledge_list-high_part_86_text_document": 0.006747}, + {"JingzeShi/OpenSeek-Pretrain-100B:Nemotron-CC-high-synthetic-wrap_medium-high_part_47_text_document": 0.017773}, + {"JingzeShi/OpenSeek-Pretrain-100B:Nemotron-CC-low-synthetic-wrap_medium-high_part_43_text_document": 0.010979}, + {"JingzeShi/OpenSeek-Pretrain-100B:Nemotron-CC-medium-actual-actual-high_part_92_text_document": 0.027354}, + + + {"JingzeShi/OpenSeek-Pretrain-100B:arxiv_007_00000_text_document": 0.006414}, + {"JingzeShi/OpenSeek-Pretrain-100B:books_016_00007_text_document": 0.004696}, + {"JingzeShi/OpenSeek-Pretrain-100B:code-high_part_13_text_document": 0.031179}, + + + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis2_CC-high_23_text_document": 0.022553}, + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis2_OpenSource-high_1_text_document": 0.007462}, + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis2_arxiv-high_2_text_document": 0.250676}, + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis2_code-high_4_text_document": 0.020445}, + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis2_math-high_12_text_document": 0.033201}, + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis2_wiki-high_5_text_document": 0.020201}, + 
{"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis_CC-high_74_text_document": 0.006064}, + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis_OpenSource-high_4_text_document": 0.018568}, + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis_arxiv-high_2_text_document": 0.221066}, + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis_code-high_13_text_document": 0.013631}, + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis_math-high_11_text_document": 0.017917}, + {"JingzeShi/OpenSeek-Pretrain-100B:cot_synthesis_wiki-high_4_text_document": 0.017917}, + + + {"JingzeShi/OpenSeek-Pretrain-100B:math-high_part_04_text_document": 0.051416}, + + + {"JingzeShi/OpenSeek-Pretrain-100B:pes2o-full-train_train-0041-of-0136_text_document": 0.00687}, + {"JingzeShi/OpenSeek-Pretrain-100B:pes2o-full-train_train-0125-of-0136_text_document": 0.007387}, + {"JingzeShi/OpenSeek-Pretrain-100B:pes2o-full-val_valid-0034-of-0060_text_document": 0.000143}, + {"JingzeShi/OpenSeek-Pretrain-100B:pes2o_pubmedcentral_3_text_document": 0.061982}, + {"JingzeShi/OpenSeek-Pretrain-100B:stack_018_00000_text_document": 0.004229}, + {"JingzeShi/OpenSeek-Pretrain-100B:wiki_012_00000_text_document": 0.004202}, + + + {"JingzeShi/OpenSeek-Pretrain-100B:zh_cc-high-loss0_part_28_text_document": 0.031672}, + {"JingzeShi/OpenSeek-Pretrain-100B:zh_cc-high-loss1_part_59_text_document": 0.029371}, + + + ] + + def calculate_total_ratio(datasets_and_ratios): + return sum(item for item in datasets_and_ratios.values()) + + total_ratio = sum(calculate_total_ratio(dataset) for dataset in datasets_and_ratios) + print(f"Total ratio: {total_ratio}") + + total_sample_size = 7_500_000 + dataset_text_field = "text" + max_length = 4096 + packing = True + dataset_num_proc = 4 + cache_dir = "./cache" + seed = 233 + model_name_or_path = "BAAI/OpenSeek-Small-v1" + + tokenizer = AutoTokenizer.from_pretrained( + model_name_or_path, + trust_remote_code=True, + use_fast=True + ) + tokenizer.padding_side = "right" + if tokenizer.pad_token is 
None:
+        tokenizer.pad_token = tokenizer.eos_token
+
+    model = AutoModelForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True)
+    model = model.to(torch.bfloat16)
+    print(model)
+    model.save_pretrained(f"./models/OpenSeek_Small_v1")
+    tokenizer.save_pretrained(f"./models/OpenSeek_Small_v1")
+
+    dataset = mix_pt_datasets(
+        datasets_and_ratios=datasets_and_ratios,
+        total_sample_size=total_sample_size,
+        dataset_text_field=dataset_text_field,
+        processing_class=tokenizer,
+        max_length=max_length,
+        packing=packing,
+        formatting_func=None,
+        dataset_num_proc=dataset_num_proc,
+        seed=seed,
+        # cache_dir=cache_dir,
+    )
+    dataset = dataset.select_columns(["input_ids"])
+    print(dataset)
+
+    dataset.save_to_disk("./datasets/OpenSeek-Pretrain-30B", num_proc=8)
\ No newline at end of file
diff --git a/openseek/competition/pz/losercheems/preliminary/scripts/merge.py b/openseek/competition/pz/losercheems/preliminary/scripts/merge.py
new file mode 100644
index 0000000..f73ca26
--- /dev/null
+++ b/openseek/competition/pz/losercheems/preliminary/scripts/merge.py
@@ -0,0 +1,218 @@
+from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, GenerationConfig, TextStreamer
+import time
+import torch
+import json
+import os
+from datetime import datetime
+from collections import OrderedDict
+
+torch.manual_seed(0)
+
+checkpoint_paths = []
+
+checkpoint_paths.append(r"./models/OpenSeek_Small_v1")
+checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-1000")
+checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-2000")
+checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-3000")
+checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-4000")
+checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-5000")
+checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-6000")
+checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-7000")
+
+
+# Merge these checkpoints
+print("Merging models...")
+
+def merge_checkpoints(checkpoint_paths, output_path, merge_method="average"):
+    """
+    Merge multiple checkpoints.
+
+    Args:
+        checkpoint_paths: list of checkpoint paths to merge
+        output_path: where to save the merged model
+        merge_method: merge strategy; one of "average", "last", "weighted_average"
+    """
+    print(f"Using merge method: {merge_method}")
+
+    # Load the first checkpoint as the base model
+    base_model = AutoModelForCausalLM.from_pretrained(checkpoint_paths[0], trust_remote_code=True)
+    base_config = AutoConfig.from_pretrained(checkpoint_paths[0], trust_remote_code=True)
+    base_tokenizer = AutoTokenizer.from_pretrained(checkpoint_paths[0], trust_remote_code=True)
+
+    if merge_method == "average":
+        # Uniformly average all checkpoints
+        print("Averaging weights from all checkpoints...")
+        merged_state_dict = OrderedDict()
+        weight_counts = OrderedDict()  # how many checkpoints contributed to each key
+
+        # Initialize merged_state_dict and weight_counts
+        for key in base_model.state_dict():
+            merged_state_dict[key] = torch.zeros_like(base_model.state_dict()[key])
+            weight_counts[key] = 0
+
+        # Accumulate the weights of every checkpoint
+        for i, checkpoint_path in enumerate(checkpoint_paths):
+            print(f"Processing checkpoint {i+1}/{len(checkpoint_paths)}: {checkpoint_path}")
+
+            model = AutoModelForCausalLM.from_pretrained(checkpoint_path, trust_remote_code=True)
+            model_state_dict = model.state_dict()
+
+            # Only merge keys present in both state dicts
+            matched_keys = 0
+            total_keys = len(model_state_dict)
+            for key in base_model.state_dict():
+                if key in model_state_dict:
+                    # Check that the shapes match
+                    if merged_state_dict[key].shape == model_state_dict[key].shape:
+                        merged_state_dict[key] += model_state_dict[key]
+                        weight_counts[key] += 1
+                        matched_keys += 1
+                    else:
+                        print(f"Warning: Shape mismatch for key {key}: base {merged_state_dict[key].shape} vs model {model_state_dict[key].shape}")
+                else:
+                    print(f"Warning: Key {key} not found in model {checkpoint_path}")
+
+            print(f"  Matched {matched_keys}/{len(base_model.state_dict())} keys from base model")
+            print(f"  Model has {total_keys} total keys")
+
+            # Free memory
+            del model
+            if torch.cuda.is_available():
+                torch.cuda.empty_cache()
+
+        # Divide by the contribution count (only keys that were actually merged)
+        for key in merged_state_dict:
+            if weight_counts[key] > 0:
+                merged_state_dict[key] /= weight_counts[key]
+                print(f"Key {key}: averaged over {weight_counts[key]} models")
+            else:
+                # Fall back to the base model's weight if nothing matched
+                merged_state_dict[key] = base_model.state_dict()[key].clone()
+                print(f"Key {key}: using base model weight (no matches found)")
+
+    elif merge_method == "weighted_average":
+        # Weighted average (later checkpoints get larger weights)
+        print("Weighted averaging weights from all checkpoints...")
+        merged_state_dict = OrderedDict()
+        weight_sums = OrderedDict()  # total weight accumulated for each key
+
+        # Initialize merged_state_dict and weight_sums
+        for key in base_model.state_dict():
+            merged_state_dict[key] = torch.zeros_like(base_model.state_dict()[key])
+            weight_sums[key] = 0.0
+
+        # Linearly increasing weights, normalized to sum to 1
+        weights = [i+1 for i in range(len(checkpoint_paths))]
+        total_weight = sum(weights)
+        weights = [w/total_weight for w in weights]
+
+        print(f"Weights: {weights}")
+
+        # Weighted accumulation over all checkpoints
+        for i, (checkpoint_path, weight) in enumerate(zip(checkpoint_paths, weights)):
+            print(f"Processing checkpoint {i+1}/{len(checkpoint_paths)}: {checkpoint_path} (weight: {weight:.3f})")
+
+            model = AutoModelForCausalLM.from_pretrained(checkpoint_path, trust_remote_code=True)
+            model_state_dict = model.state_dict()
+
+            # Only merge keys present in both state dicts
+            matched_keys = 0
+            total_keys = len(model_state_dict)
+            for key in base_model.state_dict():
+                if key in model_state_dict:
+                    # Check that the shapes match
+                    if merged_state_dict[key].shape == model_state_dict[key].shape:
+                        merged_state_dict[key] += model_state_dict[key] * weight
+                        weight_sums[key] += weight
+                        matched_keys += 1
+                    else:
+                        print(f"Warning: Shape mismatch for key {key}: base {merged_state_dict[key].shape} vs model {model_state_dict[key].shape}")
+                else:
+                    print(f"Warning: Key {key} not found in model {checkpoint_path}")
+
+            print(f"  Matched {matched_keys}/{len(base_model.state_dict())} keys from base model")
+            print(f"  Model has {total_keys} total keys")
+
+            # Free memory
+            del model
+            if torch.cuda.is_available():
+                torch.cuda.empty_cache()
+
+        # Normalize by the accumulated weight (a no-op when every checkpoint matched)
+        for key in merged_state_dict:
+            if weight_sums[key] > 0:
+                merged_state_dict[key] /= weight_sums[key]
+                print(f"Key {key}: weighted averaged over {weight_sums[key]:.3f} total weight")
+            else:
+                merged_state_dict[key] = base_model.state_dict()[key].clone()
+                print(f"Key {key}: using base model weight (no matches found)")
+
+    elif merge_method == "last":
+        # Just use the last checkpoint
+        print("Using the last checkpoint...")
+        last_model = AutoModelForCausalLM.from_pretrained(checkpoint_paths[-1], trust_remote_code=True)
+        merged_state_dict = last_model.state_dict()
+        del last_model
+
+    else:
+        raise ValueError(f"Unknown merge method: {merge_method}")
+
+    # Load the merged weights into the base model
+    base_model.load_state_dict(merged_state_dict)
+
+    # Save the merged model
+    print(f"Saving merged model to {output_path}")
+    os.makedirs(output_path, exist_ok=True)
+    base_model = base_model.to(torch.bfloat16)
+    base_model.save_pretrained(output_path)
+    base_tokenizer.save_pretrained(output_path)
+    base_config.save_pretrained(output_path)
+
+    # Record merge metadata alongside the model
+    merge_info = {
+        "merge_method": merge_method,
+        "merged_checkpoints": checkpoint_paths,
+        "merge_time": datetime.now().isoformat(),
+        "num_checkpoints": len(checkpoint_paths)
+    }
+
+    with open(os.path.join(output_path, "merge_info.json"), "w", encoding="utf-8") as f:
+        json.dump(merge_info, f, indent=2, ensure_ascii=False)
+
+    print(f"Merge completed! Model saved to {output_path}")
+    return base_model, base_tokenizer
+
+# Run the merge
+if checkpoint_paths:
+    print(f"Total checkpoints to merge: {len(checkpoint_paths)}")
+
+    # Method 1: uniform average
+    output_path_avg = "./models/OpenSeek_Small_v1-merged-average"
+    merged_model_avg, merged_tokenizer_avg = merge_checkpoints(
+        checkpoint_paths,
+        output_path_avg,
+        merge_method="average"
+    )
+
+    # # Method 2: weighted average
+    # output_path_weighted = "./models/OpenSeek_Small_v1-merged-weighted"
+    # merged_model_weighted, merged_tokenizer_weighted = merge_checkpoints(
+    #     checkpoint_paths,
+    #     output_path_weighted,
+    #     merge_method="weighted_average"
+    # )
+
+    # # Method 3: just take the last checkpoint
+    # output_path_last = "./models/OpenSeek_Small_v1-merged-last"
+    # merged_model_last, merged_tokenizer_last = merge_checkpoints(
+    #     checkpoint_paths,
+    #     output_path_last,
+    #     merge_method="last"
+    # )
+
+    # print("All merge methods completed!")
+    # print(f"Average merge: {output_path_avg}")
+    # print(f"Weighted average merge: {output_path_weighted}")
+    # print(f"Last checkpoint: {output_path_last}")
+
+else:
+    print("No checkpoints found to merge!")
\ No newline at end of file
diff --git a/openseek/competition/pz/losercheems/preliminary/train.sh b/openseek/competition/pz/losercheems/preliminary/train.sh
new file mode 100644
index 0000000..0fded40
--- /dev/null
+++ b/openseek/competition/pz/losercheems/preliminary/train.sh
@@ -0,0 +1,21 @@
+#!/bin/bash
+#SBATCH -N 1
+#SBATCH -n 32
+#SBATCH -t 114514
+#SBATCH -J openseek
+#SBATCH -o out.log
+#SBATCH --gpus=4
+
+module unload compilers/cuda
+module unload cudnn
+module load compilers/cuda/12.2
+module load cudnn/9.8.0.87_cuda12.x
+conda activate train
+conda init
+
+export HF_ENDPOINT=https://hf-mirror.com
+export XDG_CACHE_HOME=cache
+
+accelerate launch --config_file recipes/accelerate_configs/ddp.yaml ./trainer/pt_dpsk.py --config recipes/openseek/config.yaml
+
+# sbatch train.sh
diff --git a/openseek/competition/pz/losercheems/preliminary/trainer/pt_dpsk.py b/openseek/competition/pz/losercheems/preliminary/trainer/pt_dpsk.py
new file mode 100644
index 0000000..06a4d51
--- /dev/null
+++ b/openseek/competition/pz/losercheems/preliminary/trainer/pt_dpsk.py
@@ -0,0 +1,211 @@
+import logging
+import os
+import sys
+from argparse import ArgumentParser
+
+import yaml
+import datasets
+import torch
+import transformers
+from transformers import (
+    AutoConfig,
+    AutoModelForCausalLM,
+    AutoTokenizer,
+    DataCollatorForLanguageModeling,
+    Trainer,
+    set_seed,
+)
+from transformers.trainer_utils import get_last_checkpoint
+from utils.training_args_configs import PTConfig
+from processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets  # absolute import; the script is launched directly (see train.sh)
+
+from trl import ModelConfig, ScriptArguments, TrlParser
+
+
+logger = logging.getLogger(__name__)
+
+
+def main(
+    script_args: ScriptArguments,
+    training_args: PTConfig,
+    model_args: ModelConfig,
+    model_config: dict,
+):
+    # Set seed for reproducibility
+    set_seed(training_args.seed)
+
+    ###############
+    # Setup logging
+    ###############
+    logging.basicConfig(
+        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
+        datefmt="%Y-%m-%d %H:%M:%S",
+        handlers=[logging.StreamHandler(sys.stdout)],
+    )
+    log_level = training_args.get_process_log_level()
+    logger.setLevel(log_level)
+    datasets.utils.logging.set_verbosity(log_level)
+    transformers.utils.logging.set_verbosity(log_level)
+    transformers.utils.logging.enable_default_handler()
+    transformers.utils.logging.enable_explicit_format()
+
+    # Log a small summary on each process
+    logger.warning(
+        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
+        + f" distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
+    )
+    logger.info(f"Model parameters {model_args}")
+    logger.info(f"Script parameters {script_args}")
+    logger.info(f"Data parameters {training_args}")
+
+    # Get model classes
+    config_class = AutoConfig
+    causal_lm_class = AutoModelForCausalLM
+
+    ################
+    # Load tokenizer
+    ################
+    tokenizer = AutoTokenizer.from_pretrained(
+        model_args.model_name_or_path,
+        revision=model_args.model_revision,
+        trust_remote_code=model_args.trust_remote_code,
+        use_fast=True
+    )
+    tokenizer.padding_side = "right"
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
+
+    ###############
+    # Load datasets
+    ###############
+    logger.info("Using processor for dataset mixing and processing")
+    dataset = mix_pt_datasets(
+        datasets_and_ratios=training_args.datasets_and_ratios,
+        total_sample_size=training_args.total_sample_size,
+        dataset_text_field=training_args.dataset_text_field,
+        processing_class=tokenizer,
+        max_length=training_args.max_length,
+        packing=training_args.packing,
+        formatting_func=None,
+        dataset_num_proc=training_args.dataset_num_proc,
+        seed=training_args.seed,
+        cache_dir=training_args.cache_dir,
+    )
+
+    ###################
+    # Model init kwargs
+    ###################
+    logger.info("*** Initializing model kwargs ***")
+    torch_dtype = (
+        model_args.torch_dtype if model_args.torch_dtype in ["auto", None] else getattr(torch, model_args.torch_dtype)
+    )
+    model_kwargs = dict(
+        revision=model_args.model_revision,
+        trust_remote_code=model_args.trust_remote_code,
+        attn_implementation=model_args.attn_implementation,
+        torch_dtype=torch_dtype,
+        use_cache=False if training_args.gradient_checkpointing else True,
+    )
+    training_args.model_init_kwargs = model_kwargs
+
+    ##################
+    # Initialize model
+    ##################
+    logger.info("Initializing model")
+    config = config_class(**model_config)
+    model = causal_lm_class.from_pretrained(
+        model_args.model_name_or_path,
+        config=config,
+    ).to(torch_dtype)
+    # if model_args.model_name_or_path is not None and model_args.model_name_or_path.endswith("checkpoint") else causal_lm_class(config=config).to(torch_dtype)
+
+    model_num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
+    logger.info(f"Model structure: {model}")
+    logger.info(f"Model parameters: {model_num_params}")
+
+    # Check for a last checkpoint
+    last_checkpoint = None
+    if os.path.isdir(training_args.output_dir):
+        last_checkpoint = get_last_checkpoint(training_args.output_dir)
+    if last_checkpoint is not None and training_args.resume_from_checkpoint is None:
+        logger.info(f"Checkpoint detected, resuming training at {last_checkpoint=}.")
+    else:
+        logger.info("No checkpoint found, starting training from scratch.")
+
+    ###########################
+    # Initialize the PT trainer
+    ###########################
+    data_collator = DataCollatorForLanguageModeling(
+        tokenizer=tokenizer,
+        mlm=False,
+    )
+    trainer = Trainer(
+        model=model,
+        args=training_args,
+        train_dataset=dataset[script_args.dataset_train_split],
+        eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
+        processing_class=tokenizer,
+        data_collator=data_collator,
+    )
+
+    ###############
+    # Training loop
+    ###############
+    logger.info("*** Start training... ***")
+    checkpoint = None
+    if training_args.resume_from_checkpoint is not None:
+        checkpoint = training_args.resume_from_checkpoint
+    elif last_checkpoint is not None:
+        checkpoint = last_checkpoint
+    train_result = trainer.train(resume_from_checkpoint=checkpoint)
+    metrics = train_result.metrics
+    metrics["train_samples"] = len(dataset[script_args.dataset_train_split])
+    trainer.log_metrics("train", metrics)
+    trainer.save_metrics("train", metrics)
+    trainer.save_state()
+
+    ##################################
+    # Save model and create model card
+    ##################################
+    logger.info("*** Saving model... ***")
+    trainer.save_model(training_args.output_dir)
+    logger.info(f"Model saved to {training_args.output_dir}")
+
+    # Save everything else on the main process
+    if trainer.accelerator.is_main_process:
+        trainer.create_model_card()
+        # Restore the KV cache for fast inference
+        trainer.model.config.use_cache = True
+        trainer.model.config.save_pretrained(training_args.output_dir)
+
+    logger.info("*** Training complete ***")
+
+    ##########
+    # Evaluate
+    ##########
+    if training_args.do_eval:
+        logger.info("*** Start evaluation... ***")
+        metrics = trainer.evaluate()
+        metrics["eval_samples"] = len(dataset[script_args.dataset_test_split])
+        trainer.log_metrics("eval", metrics)
+        trainer.save_metrics("eval", metrics)
+        logger.info("*** Evaluation complete ***")
+
+    logger.info("*** Training finished! ***")
+
+
+if __name__ == "__main__":
+    model_config_parser = ArgumentParser()
+    model_config_parser.add_argument(
+        "--config", type=str, default="./recipes/config_full.yaml", help="path to yaml config file of PT"
+    )
+
+    parser = TrlParser((ScriptArguments, PTConfig, ModelConfig))
+    script_args, training_args, model_args = parser.parse_args_and_config(fail_with_unknown_args=False)
+
+    config_path = model_config_parser.parse_args().config
+    with open(config_path, "r", encoding="utf-8") as f:
+        model_config = yaml.load(f, Loader=yaml.FullLoader)["model_config"]
+
+    main(script_args, training_args, model_args, model_config)
diff --git a/openseek/competition/pz/losercheems/preliminary/utils/training_args_configs.py b/openseek/competition/pz/losercheems/preliminary/utils/training_args_configs.py
new file mode 100644
index 0000000..4430708
--- /dev/null
+++ b/openseek/competition/pz/losercheems/preliminary/utils/training_args_configs.py
@@ -0,0 +1,41 @@
+from dataclasses import dataclass, field
+from typing import Optional, List, Dict
+
+from transformers import TrainingArguments
+
+
+@dataclass
+class PTConfig(TrainingArguments):
+    """
+    Configuration for Pre-Training (PT).
+    """
+
+    # Dataset mixing parameters
+    datasets_and_ratios: Optional[List[Dict[str, float]]] = field(
+        default=None,
+        metadata={"help": "List of datasets and their mixing ratios. Format: [{'dataset_name': ratio}, ...]"}
+    )
+    total_sample_size: Optional[int] = field(
+        default=None,
+        metadata={"help": "Total number of samples to use from mixed datasets"}
+    )
+    dataset_text_field: str = field(
+        default="text",
+        metadata={"help": "The field name containing text data in the dataset"}
+    )
+    max_length: int = field(
+        default=2048,
+        metadata={"help": "Maximum sequence length for tokenization"}
+    )
+    packing: bool = field(
+        default=True,
+        metadata={"help": "Whether to pack sequences for efficient training"}
+    )
+    dataset_num_proc: int = field(
+        default=4,
+        metadata={"help": "Number of processes for dataset processing"}
+    )
+    cache_dir: Optional[str] = field(
+        default=None,
+        metadata={"help": "Directory to cache processed datasets"}
+    )
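The uniform `average` merge in `merge.py` reduces to a per-key accumulate-and-divide with a fallback to the base weights when a key never matches. A minimal self-contained sketch of that counting scheme, using plain Python lists in place of tensors (the key names and values below are toy examples, not taken from any real checkpoint):

```python
from collections import OrderedDict

def average_state_dicts(state_dicts):
    """Uniformly average matching keys across checkpoints, mirroring the
    counting scheme in merge.py: keys that never match fall back to base."""
    base = state_dicts[0]
    merged = OrderedDict((k, [0.0] * len(v)) for k, v in base.items())
    counts = OrderedDict((k, 0) for k in base)
    for sd in state_dicts:
        for key in base:
            # Only merge keys present in both, with matching "shape" (length)
            if key in sd and len(sd[key]) == len(merged[key]):
                merged[key] = [m + v for m, v in zip(merged[key], sd[key])]
                counts[key] += 1
    for key in merged:
        if counts[key] > 0:
            merged[key] = [m / counts[key] for m in merged[key]]
        else:
            merged[key] = list(base[key])  # no contribution: keep base weight
    return merged

# Two toy "checkpoints" with a shared parameter layout
ckpt_a = {"w": [1.0, 2.0], "b": [0.0]}
ckpt_b = {"w": [3.0, 4.0], "b": [2.0]}
merged = average_state_dicts([ckpt_a, ckpt_b])
print(merged["w"])  # [2.0, 3.0]
print(merged["b"])  # [1.0]
```

Replacing the unit counts with normalized per-checkpoint weights, and dividing by the accumulated weight instead of the count, yields the `weighted_average` variant of the same loop.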