This repository contains the codebase for the paper "Exploring Design Choices for Building Language-Specific LLMs".
virtualenv -p [PATH to python3.10 binary] tla
source tla/bin/activate
pip install -r requirements.txt
cd lm-evaluation-harness
pip install -e .
cd ../The scripts/ folder contains preprocessing scripts to generate language-speecific vocabularies and CPT data.
The following example command generates 200000 training and 10000 validation examples from the mC4 corpus for Hindi and saves into the data/ folder. Language ISO codes for other languages: (ar, tr, ta). Codes for all other languages from the mC4 dataset can be found here.
python create_mc4_data.py --lang hi --data_size 200000 --split train
python create_mc4_data.py --lang hi --data_size 10000 --split validation
For models that use SentencePiece (Llama-2, Mistral, Gemma, XGLM, etc.), the following script generates the new vocabularies and merges them with that of the original model:
python augment_tokenizers.py --vocab_size 50000 --train_from_scratch True --lang "hi" --model_base meta-llama/Llama-2-7b-hf --dataset_size 300000
For models that use the Tokenizers library (Bloom, Phi-2), the following script generates the new vocabularies and merges them with that of the original model:
python augment_df_tokenizers.py --vocab_size 50000 --lang "tr" --model_base bloom --dataset_size 300000
run.sh contains an example scropts that trains LLaMA-2-7B on Hindi (hi), with augmented vocabulary of 50K tokens on 200K examples. The model will be saved at saved_models/.
See run_gen_eval.sh for an example to evaluation on generation tasks (Flores, XL-Sum). run_lm_bench_eval.sh runs evluations on NLU benchmarks. Example on Hindi (hi):
run_lm_bench_eval.sh hi saved_models/results_Llama2_7b_200k_hi_50kTok
If our work was helpful in your research, please kindly cite us as follows:
@misc{tejaswi2024exploring,
title={Exploring Design Choices for Building Language-Specific LLMs},
author={Atula Tejaswi and Nilesh Gupta and Eunsol Choi},
year={2024},
eprint={2406.14670},
archivePrefix={arXiv}
}
[1] Xue, Linting, et al. "mT5: A massively multilingual pre-trained text-to-text transformer." arXiv preprint arXiv:2010.11934 (2020).
[2] Kudo, Taku, and John Richardson. "Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing." arXiv preprint arXiv:1808.06226 (2018).
[3] Dobler, Konstantin, and Gerard De Melo. "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models." The 2023 Conference on Empirical Methods in Natural Language Processing. 2023.