ZaBantu is a project that trains cross-lingual language models for South African languages using the XLM-RoBERTa architecture, drawing inspiration from the AfriBERTa and XLM-RoBERTa models. The project is currently in beta and is being actively developed to ensure we have sufficient data and resources to benchmark the models.
You can navigate the documentation using the links below:
- Machine setup - Instructions on how to set up your machine to run the code in this repository on a CUDA GPU
- Get the data - Instructions on how to get the data used in this project
- Training the model - Instructions on how to train the model on our data or your own custom dataset
- Experiment tracking - Instructions on how to track your experiments using Comet.ml. This is optional but recommended
- You can quickly get started with training a lightweight model to see how everything works by following these instructions.
- We assume that you already have access to a machine running Ubuntu 20.04 with 1 x NVIDIA Tesla T4 GPU. Any other version of Ubuntu or GPU should work similarly, but we have only tested on this configuration.
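Before running the setup scripts below, you can sanity-check the machine with a short standard-library script. This is a minimal sketch; `nvidia-smi` will only be on the PATH once the NVIDIA drivers are installed.

```python
# Quick environment sanity check (standard library only).
import platform
import shutil

print(f"OS     : {platform.system()} {platform.release()}")
print(f"Python : {platform.python_version()}")

# nvidia-smi ships with the NVIDIA drivers, so "not found" here simply
# means you still need to run scripts/nvidia_setup.sh (next steps).
nvidia_smi = shutil.which("nvidia-smi")
print(f"GPU    : {'nvidia-smi found' if nvidia_smi else 'nvidia-smi not found'}")
```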
- Clone this repository to your local machine:

  ```bash
  git clone https://github.com/ndamulelonemakh/zabantu-beta.git
  cd zabantu-beta
  ```
- Install NVIDIA drivers (if not already installed):

  ```bash
  bash scripts/nvidia_setup.sh
  # This script will reboot your machine.
  # After the reboot, run the same script again (see the next step).
  ```
- Wait for the machine to reboot, then install the CUDA toolkit:

  ```bash
  bash scripts/nvidia_setup.sh
  ```
- Install Python dependencies:

  ```bash
  bash scripts/server-setup.sh
  # If any step fails, try running the individual commands manually
  ```

  Optional: if you intend to use Comet.ml and other optional tools, copy the `env.template` file to `.env` and fill in the required fields.
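Copying the template is a one-liner; the variable names inside `env.template` are whatever the repository defines, so edit the copied file afterwards:

```shell
cp env.template .env
# Open .env in your editor and fill in the required fields
# (e.g. your Comet.ml API key, if you use experiment tracking)
```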
- Verify that your PyTorch installation is aware of your CUDA installation:

  ```bash
  # Optional:
  python -c "import torch; print(torch.cuda.is_available())"
  ```
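For a slightly more detailed check, the sketch below reports the detected device. It assumes the PyTorch install from the previous step and degrades gracefully if PyTorch is missing; the function name is our own, not part of the repository.

```python
def cuda_status():
    """Report whether PyTorch can see a CUDA device on this machine."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False}
    status = {
        "torch_installed": True,
        "cuda_available": torch.cuda.is_available(),
    }
    if status["cuda_available"]:
        # e.g. "Tesla T4" on the reference configuration
        status["device_name"] = torch.cuda.get_device_name(0)
    return status

print(cuda_status())
```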
- Run the sample training pipeline:

  ```bash
  make train_lite
  # If you are using Comet.ml, you should see the training progress on your Comet.ml dashboard
  ```
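If you want to see what the `train_lite` target will do before executing it, `make` can print the commands without running them:

```shell
# Dry run: print the commands the target would execute, without running them
make -n train_lite
```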
- Refer to the CONTRIBUTING.md file for instructions on how to contribute to this project