# Language Recognition Using BERT

This project implements a language recognition model using BERT (Bidirectional Encoder Representations from Transformers). The model is trained on a dataset of text samples in multiple languages and accurately identifies the language of a given input text.
## Dataset
- The dataset used for training is stored in `lang.csv`.
- It contains text samples labeled with their respective languages.
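As a quick sanity check, the dataset can be inspected with Pandas. This is only a sketch: the `Text` and `Language` column names are assumptions, so adjust them to whatever headers `lang.csv` actually uses.

```python
import pandas as pd

# Load the training data. The column names "Text" and "Language" are
# assumptions about lang.csv's headers; adjust to match the real file.
df = pd.read_csv("lang.csv")
print(df.head())
print(df["Language"].value_counts())  # rough class balance per language
```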
## Installation
To run this project, install the necessary dependencies:

```bash
pip install -r requirements.txt
```
Recommended dependencies:
- Python 3.x
- Transformers (Hugging Face)
- PyTorch or TensorFlow
- Pandas
- NumPy
- Scikit-learn
- Jupyter Notebook (for running the `.ipynb` file)
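The project does not pin exact versions, so the following `requirements.txt` is only a sketch covering the list above; the choice of `torch` over `tensorflow` is an assumption, and version pins should be added as needed.

```text
transformers
torch
pandas
numpy
scikit-learn
notebook
```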
## Files
- `language-recognition-using-bert.ipynb` - Jupyter Notebook containing the model implementation.
- `lang.csv` - Dataset used for training and evaluation.
- `README.md` - Documentation for the project.

## Usage
1. Open `language-recognition-using-bert.ipynb` in Jupyter Notebook.
2. Run the cells to preprocess the dataset, train the model, and test its performance.
3. Modify the input text in the inference section to predict the language of any given text (see the inference sketch below).

## Model
The project uses BERT-based embeddings to extract contextual representations of text. A classifier head is trained on top of the embeddings to predict the language; a minimal sketch of this architecture follows.

## Results
The model achieves high accuracy in distinguishing between multiple languages, thanks to the contextual understanding of BERT.
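The architecture described under Model can be sketched as below. This is a minimal illustration, not the notebook's exact code: the checkpoint name, the class name, and the use of the `[CLS]` embedding as the sentence summary are all assumptions.

```python
import torch.nn as nn
from transformers import AutoModel

class BertLanguageClassifier(nn.Module):
    """BERT encoder with a linear classifier head (a sketch, not the notebook's exact code)."""

    def __init__(self, num_languages, model_name="bert-base-uncased"):  # checkpoint name is an assumption
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        # Classifier head trained on top of the contextual embeddings.
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_languages)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token as sentence summary
        return self.classifier(cls_embedding)
```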
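And a hedged inference sketch matching step 3 of Usage; the `label_names` list and the weights path are hypothetical placeholders, in practice both come from the training run in the notebook.

```python
import torch
from transformers import AutoTokenizer

label_names = ["English", "French", "Spanish"]  # hypothetical label set; derive it from lang.csv

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # must match the training checkpoint
model = BertLanguageClassifier(num_languages=len(label_names))
# model.load_state_dict(torch.load("model.pt"))  # hypothetical path to the trained weights
model.eval()

text = "Bonjour, comment allez-vous ?"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(inputs["input_ids"], inputs["attention_mask"])
print(label_names[logits.argmax(dim=-1).item()])
```

Without loading trained weights the prediction is effectively random; the loading line is commented out because the project does not specify a weights file.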
## Future Improvements
- Train on a larger, more diverse dataset.
- Experiment with multilingual BERT (mBERT) for better cross-lingual performance.
- Deploy the model as an API for real-time language detection.

## Author
VIVEK MARRI
## License
This project is licensed under the MIT License.