Refactor and Optimize DeepSeek Coder Training Script #610

Imadnajam · 2025-01-29T10:17:21Z

Key Changes

1. Improved Tokenization and Data Preprocessing

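We introduced a modular tokenization function, _tokenize_fn, which tokenizes each input string separately and truncates it to tokenizer.model_max_length. Because every string is tokenized on its own, padding="longest" does not add any padding at this step.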

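from typing import Dict, Sequence

import transformers


def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrainedTokenizer) -> Dict:
    # Tokenize each string separately, truncating to the tokenizer's maximum length.
    tokenized_list = [
        tokenizer(
            text,
            return_tensors="pt",
            padding="longest",
            max_length=tokenizer.model_max_length,
            truncation=True,
        )
        for text in strings
    ]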
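
The snippet stops after the list comprehension, so what the function returns is not shown here. As a rough illustration of the per-string pattern only, the sketch below uses a hypothetical ToyTokenizer (a whitespace splitter, not part of transformers) and an assumed return shape of token ids plus their lengths. It is not the PR's implementation.

```python
from typing import Dict, List, Sequence


class ToyTokenizer:
    """Hypothetical whitespace tokenizer standing in for a real tokenizer."""

    def __init__(self, model_max_length: int) -> None:
        self.model_max_length = model_max_length
        self._vocab: Dict[str, int] = {}

    def encode(self, text: str, max_length: int) -> List[int]:
        # Give each unseen word the next free integer id, then truncate.
        ids = [self._vocab.setdefault(word, len(self._vocab) + 1) for word in text.split()]
        return ids[:max_length]


def tokenize_each(strings: Sequence[str], tokenizer: ToyTokenizer) -> Dict[str, list]:
    """Tokenize every string on its own, truncating to the tokenizer's limit."""
    input_ids = [tokenizer.encode(text, tokenizer.model_max_length) for text in strings]
    # No padding happens here: each string is processed alone, so the
    # longest sequence in its "batch" is the string itself.
    return {
        "input_ids": input_ids,                              # assumed key name
        "input_ids_lens": [len(ids) for ids in input_ids],   # assumed key name
    }


result = tokenize_each(
    ["def add(a, b):", "return a + b  # sum of two numbers"],
    ToyTokenizer(model_max_length=4),
)
print(result["input_ids_lens"])
```

The first string is shorter than the limit and keeps its natural length, while the second is cut to four tokens. Padding to a common length would have to happen later, for example in a data collator.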

Imadnajam added 4 commits January 29, 2025 11:07

DeepSeek Coder - AI Programming Assistant Training Script

9d2b0e2

Delete finetune_Programming_Assistant_Training_Script.py

012d197

Create Programming Assistant Training Script

1b6f010

Rename Programming Assistant Training Script to Programming_Assistant_Training.py

46a00fd

vaibhavcybermeru approved these changes Jan 29, 2025
