Language Identification in South African Texts

Problem Statement

South Africa is a diverse, multicultural society, and that diversity shows in its languages. South Africans are free to express themselves in whichever language and culture they are comfortable with. Language plays a crucial role in society, contributing to the country's cultural, economic, political, and social fabric.

As of May 2023, South Africa recognizes 12 official languages, all with equal legal status. Given this variety, it is common for South Africans to speak at least two official languages.

In a country with such a multilingual population, it’s essential for our systems and devices to support multiple languages.

In this challenge, I work with text written in any one of 11 of South Africa's official languages and identify which language it is written in. This task, known as Language Identification, is a fundamental problem in Natural Language Processing (NLP), where the goal is to determine the language of a given piece of text.

Project Description

This project aims to develop a language identification model using various machine learning algorithms. The model will be trained on a dataset containing text in multiple South African languages, allowing it to classify unseen text into the correct language.

Algorithms Used

Support Vector Machine (SVC)
Linear Support Vector Machine (Linear SVC)
Logistic Regression
Multinomial Naive Bayes (MultinomialNB)
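
As a rough illustration of how the four algorithms above could be set up side by side, here is a minimal sketch using scikit-learn. The character n-gram TF-IDF features, the hyperparameters, the helper name build_models, and the tiny toy corpus are all assumptions made for the example; they are not taken from the project's notebook or dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC, LinearSVC

# Toy corpus for illustration only (hypothetical, not the project's data).
texts = [
    "die kat sit op die mat",
    "ek is lief vir jou",
    "the cat sits on the mat",
    "i love you very much",
    "sawubona unjani namhlanje",
    "ngiyakuthanda kakhulu",
]
labels = ["afr", "afr", "eng", "eng", "zul", "zul"]


def build_models():
    """Return one untrained pipeline per algorithm listed above."""
    classifiers = {
        "SVC": SVC(),
        "LinearSVC": LinearSVC(),
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "MultinomialNB": MultinomialNB(),
    }
    # Assumption: character n-grams (1 to 3) as features; each pipeline
    # gets its own vectorizer so the models stay independent.
    return {
        name: make_pipeline(
            TfidfVectorizer(analyzer="char", ngram_range=(1, 3)), clf
        )
        for name, clf in classifiers.items()
    }


models = build_models()
for name, model in models.items():
    model.fit(texts, labels)
    # Predict the language of an unseen sentence.
    print(name, model.predict(["die hond sit op die mat"])[0])
```

Wrapping the vectorizer and the classifier in one pipeline keeps feature extraction and prediction together, so raw strings can be passed directly to fit and predict.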

Model Evaluation

The performance of the models is evaluated using F1 scores, providing insight into their effectiveness in balancing precision and recall. The best-performing model, based on the F1 score, will be selected for language classification tasks.
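
The selection step described above can be sketched as follows. The averaging mode (macro), the model names, and the label and prediction lists are assumptions chosen for illustration; the project may use a different averaging scheme, and these are not its real results.

```python
from sklearn.metrics import f1_score

# Hypothetical true labels for a small validation split.
y_true = ["afr", "afr", "eng", "eng", "zul", "zul"]

# Hypothetical predictions from two candidate models.
predictions = {
    "model_a": ["afr", "eng", "eng", "eng", "zul", "zul"],  # one mistake
    "model_b": ["afr", "afr", "eng", "eng", "zul", "zul"],  # no mistakes
}

# Assumption: macro averaging, which weights every language equally.
scores = {
    name: f1_score(y_true, y_pred, average="macro")
    for name, y_pred in predictions.items()
}

# Pick the model with the highest F1 score.
best_model = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: macro F1 = {score:.4f}")
print("Selected:", best_model)
```

Macro averaging computes F1 per language and then takes the unweighted mean, so a model that does poorly on one language is penalized even if that language has few samples.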
