The goal of this project is to build a model that can automatically identify the language of a given text. Language identification is essential for various applications, including machine translation, multilingual document tracking, and electronic devices (e.g., mobile phones, laptops).
LingualSense is a deep learning project for classifying text languages. This README provides step-by-step instructions from data analysis to deployment.
- Perform EDA to analyze your dataset.
- Check the distribution of languages and clean any irregularities in the dataset.
- Tokenize the text data and pad sequences to a uniform length for model compatibility.
- Save the tokenizer and label encoder for future use in the app.
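The preprocessing steps above can be sketched in plain Python. This is a minimal illustration of checking the language distribution and of tokenizing/padding; the real pipeline would typically use pandas' `value_counts`, Keras' `Tokenizer`/`pad_sequences`, and joblib for persistence, and the `langs`/`texts` sample data below are hypothetical:

```python
from collections import Counter

# --- EDA: distribution of languages (hypothetical sample data) ---
langs = ["English", "French", "English", "Spanish", "English"]
print(Counter(langs))  # reveals class imbalance, if any

# --- Tokenize and pad: a pure-Python stand-in for Keras' Tokenizer
#     and pad_sequences; id 0 is reserved for padding ---
def build_vocab(texts):
    """Map each character to a positive integer id."""
    vocab = {}
    for text in texts:
        for ch in text:
            vocab.setdefault(ch, len(vocab) + 1)
    return vocab

def texts_to_padded(texts, vocab, maxlen):
    """Encode texts as id sequences, then pad/truncate to maxlen."""
    seqs = [[vocab.get(ch, 0) for ch in t][:maxlen] for t in texts]
    return [s + [0] * (maxlen - len(s)) for s in seqs]

texts = ["hello", "bonjour"]
vocab = build_vocab(texts)
padded = texts_to_padded(texts, vocab, maxlen=10)
# every row of `padded` now has length 10, ready for an embedding layer
```

In the actual pipeline the fitted tokenizer and label encoder would then be saved (e.g., with `joblib.dump`) so `app.py` can reuse them at prediction time.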
- Use a GRU-based model for text classification.
- Train the model using tokenized and padded sequences.
- Save the trained model as `gru_model.h5`.
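A GRU classifier of this kind can be sketched with `tensorflow.keras`; the layer sizes, `vocab_size`, `maxlen`, and `num_classes` below are illustrative assumptions, not the project's actual hyperparameters:

```python
import tensorflow as tf

def build_gru_model(vocab_size=5000, maxlen=100, num_classes=17):
    """A minimal GRU text classifier (illustrative hyperparameters)."""
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 64),          # token ids -> vectors
        tf.keras.layers.GRU(64),                            # recurrent encoder
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_gru_model()
# after training on the padded sequences, e.g. model.fit(X, y, ...):
# model.save("gru_model.h5")
```

Training would use the tokenized, padded sequences from the preprocessing step, with integer class labels from the label encoder.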
- Create a Streamlit app for real-time predictions.
- Include an input text area, model loading, and prediction functionality.
- Add a styled user interface for better interaction.
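The app's prediction step ultimately reduces to picking the class with the highest probability and mapping it back to a language name. A minimal sketch of that decoding logic (in `app.py`, the probability vector would come from the loaded `gru_model.h5` and the label names from `label_encoder.joblib`; the label set below is illustrative):

```python
def decode_prediction(probs, labels):
    """Return the label whose predicted probability is highest."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best]

labels = ["English", "French", "Spanish"]           # illustrative label set
print(decode_prediction([0.1, 0.7, 0.2], labels))  # -> French
```

In the Streamlit app this function would be called on the model's output after the user submits text, and the result displayed on the page.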
- Clone the repository:
  ```shell
  git clone https://github.com/Springboard429/LingualSense_Infosys_Internship_Oct2024.git
  cd LingualSense
  ```
- Create a virtual environment:
  - Windows:
    ```shell
    python -m venv lingualsense_env
    lingualsense_env\Scripts\activate
    ```
  - Mac/Linux:
    ```shell
    python -m venv lingualsense_env
    source lingualsense_env/bin/activate
    ```
- Install dependencies:
  ```shell
  pip install -r requirements.txt
  ```
- Place the following files in the project directory: `gru_model.h5`, `tokenizer.joblib`, `label_encoder.joblib`.
- Execute the following command:
  ```shell
  streamlit run app.py
  ```
- Open the local URL (e.g., http://localhost:XXXX) to access the app.
- Input text in the text area.
- Click "Detect Languages" to get the predicted language of the text.