Automatic Sentiment Analysis Tool for Urdu Text on Social Media Platforms

Overview

This project develops a Natural Language Processing (NLP) pipeline for sentiment analysis of Urdu text collected from social media platforms such as Twitter, Facebook, Instagram, and YouTube. The tool classifies each post as positive, negative, or neutral to help brands, influencers, and businesses understand the sentiment of Urdu-speaking users.

Scenario

The scenario is a data scientist at a firm specializing in sentiment analysis who must serve Urdu-speaking users, which means addressing both the complexities of the Urdu language and the noise in social media data. The key task is to preprocess Urdu social media posts and classify their sentiment using a custom NLP pipeline.

Key Features

Text Preprocessing:
- Stopword Removal: Custom stopword list tailored for Urdu.
- Punctuation, Emoji, and Hashtag Removal: Filtering non-informative tokens.
- Diacritics Removal: Removing diacritics like Zabar, Zer, Pesh.
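
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the stopword set, the regular expressions, and the function names (`strip_diacritics`, `keep_char`, `preprocess`) are assumptions made for the example. It relies on the fact that Zabar, Zer, and Pesh are Unicode nonspacing marks, and that punctuation and emoji fall outside the letter and number categories.

```python
import re
import unicodedata

# Hypothetical mini stopword list for illustration only; a real run would
# load a much fuller custom list.
URDU_STOPWORDS = {"ہے", "کی", "کا", "کے", "اور", "یہ", "میں", "سے", "کو"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")


def strip_diacritics(text):
    # Zabar, Zer, Pesh and similar marks have Unicode category "Mn".
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def keep_char(ch):
    # Keep whitespace, letters (L*) and digits (N*); everything else,
    # including punctuation and emoji, is dropped.
    return ch.isspace() or unicodedata.category(ch)[0] in ("L", "N")


def preprocess(text):
    text = URL_RE.sub(" ", text)
    text = strip_diacritics(text)
    text = HASHTAG_RE.sub(" ", text)
    text = "".join(ch if keep_char(ch) else " " for ch in text)
    # Whitespace tokenization, then stopword filtering.
    return [tok for tok in text.split() if tok not in URDU_STOPWORDS]


if __name__ == "__main__":
    sample_post = "یہ فلم بہت اچھی ہے۔ 😊 #فلم https://example.com"
    print(preprocess(sample_post))
```

Diacritics are stripped before hashtags are matched so that a marked-up hashtag is removed whole rather than being cut off at the first mark.
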
Stemming & Lemmatization:
- Implementation of Urdu-specific stemming and lemmatization techniques.
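
As a rough idea of what rule-based stemming and lookup-based lemmatization can look like, the sketch below strips a few common plural suffixes and falls back to a small exception table for irregular forms. The suffix list, the lemma table, the minimum stem length, and the function names are all hypothetical and far smaller than anything usable in practice; they are not the project's actual rules.

```python
# Hypothetical, deliberately tiny suffix list (longest first).
SUFFIXES = ("یاں", "وں", "یں")

# Hypothetical exception table for irregular forms (forms of "to go").
LEMMA_TABLE = {"گئی": "جانا", "گئے": "جانا"}

# Refuse to strip a suffix if fewer than this many characters would remain.
MIN_STEM_LEN = 2


def light_stem(token):
    """Strip the first matching suffix, guarding against over-stemming."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= MIN_STEM_LEN:
            return token[: -len(suffix)]
    return token


def lemmatize(token):
    """Use the exception table when possible, otherwise fall back to stemming."""
    if token in LEMMA_TABLE:
        return LEMMA_TABLE[token]
    return light_stem(token)
```

The minimum-length guard is a common design choice in light stemmers: very short words that happen to end in a suffix-like sequence are left untouched.
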
Feature Extraction:
- Tokenization: Properly segmenting Urdu text.
- TF-IDF Analysis: Extracting relevant terms for sentiment classification.
- Word2Vec: Capturing word relationships based on context.
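
To make the TF-IDF step concrete, here is a small standard-library sketch of the textbook formula (term frequency times the log of inverse document frequency). The function name `tfidf` and the input shape (a list of token lists) are assumptions for illustration. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their scores will differ from these.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights
```

With this unsmoothed formula, a term that occurs in every document gets a weight of zero, which is why very common words contribute little to classification.
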
N-grams Analysis:
- Creation of unigrams, bigrams, and trigrams to identify common word patterns in Urdu text.
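
N-gram extraction and counting can be sketched with the standard library alone. The helper names `ngrams` and `top_ngrams` are assumptions for this example; the input is assumed to be one token list per post, so that n-grams never span two posts.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(token_lists, n, k=10):
    """Count n-grams per post and return the k most frequent with their counts."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)
```

Calling `top_ngrams(posts, 1)`, `top_ngrams(posts, 2)`, and `top_ngrams(posts, 3)` gives the most frequent unigrams, bigrams, and trigrams respectively.
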
Sentiment Classification Model:
- Machine learning models (e.g., Logistic Regression, SVM) to classify sentiment.
- Evaluation Metrics: Accuracy, precision, recall, and F1-score.
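
The evaluation metrics listed above can be computed by hand for a three-class problem as in the sketch below. The function name `evaluate` and the returned dictionary layout are assumptions for illustration; in a scikit-learn workflow the same figures are available from `sklearn.metrics` (for example `accuracy_score` and `classification_report`).

```python
def evaluate(y_true, y_pred, labels):
    """Return accuracy, macro-averaged F1, and per-class precision/recall/F1."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Define a metric as 0.0 when its denominator is zero.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}

    macro_f1 = sum(m["f1"] for m in per_class.values()) / len(labels)
    return {"accuracy": accuracy, "macro_f1": macro_f1, "per_class": per_class}
```

Per-class figures matter here because sentiment datasets are often imbalanced, and accuracy alone can hide poor performance on a minority class.
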

Challenges Addressed

Urdu Text Complexity: Handling the grammatical structure, morphology, and script challenges.
Noisy Social Media Data: Dealing with emojis, spelling variations, URLs, and incomplete sentences.
Limited Language Resources: Development of custom Urdu NLP resources for stemming, tokenization, and sentiment lexicons.

Tools & Libraries

Python for core development
NLTK, spaCy, Urduhack for text processing
Scikit-learn for machine learning models
Gensim for Word2Vec implementation
pandas, matplotlib for data analysis and visualization

Dataset

A publicly available Urdu social media dataset drawn from sources such as Twitter posts or YouTube comments. It consists of raw posts paired with sentiment labels (positive, negative, neutral).

Final Deliverables

Text Preprocessing Results: Cleaned Urdu text after preprocessing.
Feature Extraction Results: Tokenized text, TF-IDF scores, and Word2Vec outputs.
N-gram Analysis: Top unigrams, bigrams, and trigrams.
Sentiment Classification Model: Model performance summary (accuracy, precision, recall, F1-score).
Reflection: Challenges encountered and future optimization possibilities.

Reflection

This project highlights the challenges of performing sentiment analysis on Urdu text, including the handling of complex morphology, noisy data, and limited NLP resources for Urdu. Future improvements could involve incorporating deep learning models like BERT fine-tuned for Urdu, or leveraging additional datasets to improve performance.


babar0081/Multi-class-Urdu-Sentiment-Analysis-System-SentiUrdu-Text-Mining-Tool

Folders and files

Latest commit

History

Repository files navigation

Automatic Sentiment Analysis Tool for Urdu Text on Social Media Platforms

Overview

Scenario

Key Features

Challenges Addressed

Tools & Libraries

Dataset

Final Deliverables

Reflection

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages