DrKenReid/Generalized-Analysis-of-Text-Data

Generalized Analysis of Text Data

A comprehensive reference notebook demonstrating a wide range of NLP and text analysis techniques on the 20 Newsgroups dataset. Designed as both a learning resource and a reusable template for new text analysis projects.

Open In Colab

Techniques Demonstrated

| Category | Details |
|---|---|
| Data Wrangling | 20 Newsgroups ingestion, Pandas dataset construction, text statistics |
| Text Preprocessing | Tokenisation, stopword removal (NLTK + extended list), lemmatisation |
| Exploratory Analysis | Word frequency distributions, category-level box plots |
| Topic Modelling | Latent Dirichlet Allocation (LDA) with scikit-learn |
| Clustering | K-Means on TF-IDF vectors, t-SNE and PCA visualisation |
| Word Embeddings | Word2Vec training, similarity queries, 2-D projection |
| Document Similarity | Cosine similarity on TF-IDF representations |
| NER | spaCy named-entity recognition with entity-type frequency analysis |
| Sentiment Analysis | NLTK VADER and TextBlob, category-level sentiment comparison |
| Text Classification | Logistic Regression on TF-IDF features with accuracy reporting |
| Summarisation | Hugging Face Transformers summarisation pipeline |
| Dependency Parsing | spaCy POS tagging and dependency visualisation |
| Topic Coherence | Gensim coherence scores for LDA evaluation |
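
To give a feel for one row of the list, here is a minimal, dependency-free sketch of document similarity: cosine similarity on TF-IDF vectors. It is an illustration only, not the notebook's code. The helper names `tfidf_vectors` and `cosine_similarity` and the three toy sentences are assumptions made for this example, and tokenisation is plain whitespace splitting. The IDF smoothing used here follows the same formula as scikit-learn's default.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    # Hypothetical helper: lowercases and splits on whitespace only.
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Smoothed IDF, log((1 + n) / (1 + df)) + 1, so no weight is zero.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpus (made up for this sketch): two space-related texts, one about hockey.
docs = [
    "the rocket launch was delayed by weather",
    "weather delayed the shuttle launch",
    "the goalie made a great save in overtime",
]
vectors = tfidf_vectors(docs)
sim_space = cosine_similarity(vectors[0], vectors[1])
sim_mixed = cosine_similarity(vectors[0], vectors[2])
print(f"space vs space:  {sim_space:.3f}")
print(f"space vs hockey: {sim_mixed:.3f}")
```

The two space-related sentences share several terms and so score higher than the space/hockey pair, which overlaps only on "the". On a real corpus, a vectoriser with proper tokenisation and stopword handling would be the more practical choice.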

How to Use

  1. Open the notebook in Google Colab via the badge above.
  2. Run all cells (Runtime → Run all). No data upload is needed — the 20 Newsgroups dataset is fetched automatically.
  3. To analyse your own text data, replace the collect_data() call with a function that returns a list of documents, category labels, and category names in the same format.
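
As a rough sketch of step 3, a replacement loader could look like the following. The README does not show the signature of `collect_data()`, so the return shape here (a list of document strings, a parallel list of integer labels, and a list of category names indexed by those labels) is an assumption modelled on how scikit-learn exposes 20 Newsgroups. The function name `collect_custom_data` and the sample reviews are hypothetical.

```python
def collect_custom_data(labelled_texts):
    """Turn (text, category_name) pairs into documents, labels, and names.

    Assumed return format: documents (list of str), labels (list of int),
    category_names (list of str, where labels index into this list).
    """
    # Sorted unique names give each category a stable integer id.
    category_names = sorted({name for _, name in labelled_texts})
    name_to_id = {name: i for i, name in enumerate(category_names)}
    documents = [text for text, _ in labelled_texts]
    labels = [name_to_id[name] for _, name in labelled_texts]
    return documents, labels, category_names


# Made-up example corpus: customer reviews with a sentiment category.
reviews = [
    ("Battery died after two days.", "negative"),
    ("Exactly what I needed, works great.", "positive"),
    ("Arrived late and the box was crushed.", "negative"),
]
documents, labels, category_names = collect_custom_data(reviews)
print(category_names)
print(labels)
```

If your data lives in a CSV file or a database, the same idea applies: read the rows into `(text, category_name)` pairs first, then pass them through a function like this one. Check the notebook's own `collect_data()` for the exact return order before swapping it out.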

Example Outputs

Top 20 Most Frequent Words | Word Count by Category
LDA Topics | Text Clustering (t-SNE)
Word Embeddings (PCA) | Document Similarity Heatmap
Named Entities | Topic Network
Sentiment by Category | POS Tag Distribution
Dependency Parse | Topic Coherence

A Note on Generality

Every technique in this notebook is deliberately context-agnostic. The 20 Newsgroups dataset is used purely as a convenient, well-understood benchmark — swap it for customer reviews, research abstracts, social media posts, or any other corpus and the analysis pipeline applies unchanged. The real value is in the workflow: start broad with frequency analysis, narrow down with topic modelling and clustering, then layer on entity recognition, sentiment, and classification as the questions demand.
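
The "start broad" step of that workflow can be sketched in a few lines on any corpus. This is a simplified illustration under stated assumptions, not the notebook's implementation: the `top_words` helper, the tiny `STOPWORDS` set, and the sample corpus are all made up for this example, whereas the notebook itself relies on NLTK's stopword list plus an extended list.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; a real run would use a much fuller list.
STOPWORDS = {"the", "a", "an", "and", "is", "in", "of", "to", "it", "was"}


def top_words(documents, n=5):
    """Return the n most frequent non-stopword tokens across all documents."""
    counts = Counter()
    for doc in documents:
        # Lowercase, then keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", doc.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)


# Made-up corpus standing in for customer reviews.
corpus = [
    "The delivery was late and the packaging was damaged.",
    "Late delivery again, but the product is fine.",
    "Product quality is great and delivery was quick.",
]
for word, count in top_words(corpus):
    print(word, count)
```

Even on three sentences, the most frequent terms ("delivery", "late", "product") hint at what the corpus is about, which is the kind of signal that guides the later topic modelling and clustering steps.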

License

This project is licensed under CC BY 4.0.

Author

Ken Reid — Data Scientist, photographer, and avid reader.
