Updated: 21 August 2024
SEA-LION is a family of open-source language models developed by AI Singapore that better understand the diverse contexts, languages, and cultures of Southeast Asia (SEA). We hope it makes LLMs more accessible and better represents the region's breadth of cultures and languages.
Our first versions of SEA-LION, released in December 2023, were trained from scratch using SEA-LION-PILE (about 1 trillion tokens). Our new versions of SEA-LION are built by continued pre-training of strong open-source models. Versions 2 through 2.x are based on Llama 3. We believe that this approach, i.e., continued pre-training, may be more sustainable in the long run.
We have benefited greatly from the open-source community and believe that efforts to better represent our region will similarly be well served by open-source efforts. SEA-LION will therefore be open and transparent in the following areas:
- Pre-Training data
- Model training code
- Model weights
- Fine-Tuning data
- Evaluation benchmarks
Key features of SEA-LION v2:
- Continued Pre-Trained and Fine-Tuned Llama 3 (with more models to follow)
- Instruction tuned in English, Bahasa Indonesia, Thai, Vietnamese, and Tamil
- Trained with up to 50B tokens from SEA languages
- Outperforms base Llama 3 and other models in both general and SEA capabilities
- Our contributions are open source (under MIT license); data and model licenses are listed on their respective Hugging Face data or model cards
See our Hugging Face page for more detailed model and license information.
SEA-LION models are available for download on Hugging Face at:
Base Models
Instruction-Tuned Models
Quantized Models
To use SEA-LION v2.x:
# Please use transformers==4.43.2
import transformers
import torch

model_id = "aisingapore/llama3-8b-cpt-sealionv2-instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Apa sentimen dari kalimat berikut ini?\nKalimat: Buku ini sangat membosankan.\nJawaban: "},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
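The final `print` in the snippet above emits the whole last message as a dictionary (role plus content). If only the reply text is wanted, a small helper can pull it out. The following is a minimal sketch, assuming the chat-style output shape used above, where `generated_text` is the list of messages ending with the assistant's turn. `extract_reply` is a hypothetical helper name and not part of transformers, and the sample data is made up so the sketch runs without loading a model.

```python
from typing import Any


def extract_reply(outputs: list[dict[str, Any]]) -> str:
    """Return the text of the last message in a chat-style pipeline result.

    Assumes outputs[0]["generated_text"] is a list of {"role", "content"}
    dicts, as returned when the pipeline is called with a list of messages.
    """
    last_message = outputs[0]["generated_text"][-1]
    if last_message.get("role") != "assistant":
        raise ValueError("last message is not an assistant turn")
    return last_message["content"].strip()


# Stand-in for a real pipeline result. The assistant reply here is
# illustrative only and is NOT actual SEA-LION output.
sample_outputs = [
    {
        "generated_text": [
            {"role": "user", "content": "Apa sentimen dari kalimat berikut ini?"},
            {"role": "assistant", "content": " Negatif "},
        ]
    }
]

print(extract_reply(sample_outputs))
```

With a real model loaded, the same helper could be applied directly to the `outputs` variable from the snippet above.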
SEA-LION achieves better or competitive performance on tasks in regional languages while retaining the general performance of Llama 3.
Our leaderboard is here.
We use a holistic approach to evaluation, including not just traditional Natural Language Processing (NLP) benchmarking tasks (such as sentiment analysis and question answering) but also meticulously handcrafted linguistic and cultural diagnostic tests tailored to Southeast Asia.
The benchmark was introduced in "BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models" and is also available on GitHub.
Please refer to serving the SEA-LION model with TGI.
Please refer to serving the SEA-LION model with vLLM.
To run SEA-LION locally with Ollama via the command line:
- Download and install Ollama
- Run and chat with SEA-LION using the following command:
ollama run aisingapore/llama3-8b-cpt-sea-lionv2-instruct
Alternatively, explore SEA-LION with Chainlit and Ollama here.
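Once the model has been pulled with the command above, it can also be queried programmatically. The following is a minimal sketch, assuming Ollama is running locally on its default port (11434) and exposing its `/api/chat` REST endpoint. The helper names `build_chat_request` and `chat` are hypothetical, and the model name is simply copied from the CLI command above.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"
# Same model name as in the `ollama run` command above.
MODEL_NAME = "aisingapore/llama3-8b-cpt-sea-lionv2-instruct"


def build_chat_request(prompt: str, model: str = MODEL_NAME) -> urllib.request.Request:
    """Build (but do not send) a single-turn, non-streaming chat request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to the local Ollama server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.loads(response.read().decode("utf-8"))
    # Non-streaming chat responses carry the reply under "message".
    return body["message"]["content"]
```

With the Ollama server running, `print(chat("Apa sentimen dari kalimat berikut ini?\nKalimat: Buku ini sangat membosankan.\nJawaban: "))` would send the same prompt used in the transformers example. Building the request separately from sending it keeps the payload easy to inspect without a live server.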
We welcome contributions to SEA-LION! Check out the contributing guide to get started.
Some ways to contribute:
- Report bugs and issues
- Enhance the documentation
- Add more model evaluation tasks and metrics
- Train versions of the model in more SEA languages
If you use SEA-LION in your work, please cite it as:
@misc{sea_lion_2024,
title={SEA-LION (Southeast Asian Languages In One Network): A Family of Large Language Models for Southeast Asia},
author={AI Singapore},
year={2024},
howpublished={\url{https://github.com/aisingapore/sealion}}
}
AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore, or the National University of Singapore.
If you have questions, comments, or issues, please open a GitHub issue or contact us via this SEA-LION Inquiry Form.
Key features of SEA-LION v1:
- 3 to 7 billion parameters
- Instruction tuned in English and Bahasa Indonesia
- Trained with 980B tokens of text data from 11 languages spoken across SEA
- Specialized vocabulary and tokenization for optimal performance in SEA languages
- Excels on tasks in regional languages
- Open source under the MIT License for community contribution and adoption
Base Models
Instruction-Tuned Models
Model Details: Please see the model cards on Hugging Face.
Additional information and guides about SEA-LION v1 can be found here.