A comprehensive reference notebook demonstrating a wide range of NLP and text analysis techniques on the 20 Newsgroups dataset. Designed as both a learning resource and a reusable template for new text analysis projects.
| Category | Details |
|---|---|
| Data Wrangling | 20 Newsgroups ingestion, Pandas dataset construction, text statistics |
| Text Preprocessing | Tokenisation, stopword removal (NLTK + extended list), lemmatisation |
| Exploratory Analysis | Word frequency distributions, category-level box plots |
| Topic Modelling | Latent Dirichlet Allocation (LDA) with scikit-learn |
| Clustering | K-Means on TF-IDF vectors, t-SNE and PCA visualisation |
| Word Embeddings | Word2Vec training, similarity queries, 2-D projection |
| Document Similarity | Cosine similarity on TF-IDF representations |
| NER | spaCy named-entity recognition with entity-type frequency analysis |
| Sentiment Analysis | NLTK VADER and TextBlob, category-level sentiment comparison |
| Text Classification | Logistic Regression on TF-IDF features with accuracy reporting |
| Summarisation | Hugging Face Transformers summarisation pipeline |
| Dependency Parsing | spaCy POS tagging and dependency visualisation |
| Topic Coherence | Gensim coherence scores for LDA evaluation |
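
As an illustration of the Document Similarity row, here is a minimal sketch of cosine similarity on TF-IDF vectors using scikit-learn. The three sample sentences and the variable names are assumptions made for this example and are not taken from the notebook, which may configure the vectoriser differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus standing in for newsgroup posts.
docs = [
    "The rocket launch was delayed by bad weather.",
    "NASA postponed the rocket launch because of weather.",
    "The goalie made a great save in the hockey game.",
]

# Turn each document into a TF-IDF vector, dropping English stopwords.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Pairwise cosine similarity: entry [i, j] compares document i with document j.
similarity = cosine_similarity(tfidf)

# The two space-related sentences share vocabulary, so they score higher
# with each other than either does with the hockey sentence.
print(similarity.round(2))
```

The same pattern scales to the full corpus, although a dense all-pairs matrix grows quadratically with the number of documents.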
- Open the notebook in Google Colab via the badge above.
- Run all cells (Runtime → Run all). No data upload is needed — the 20 Newsgroups dataset is fetched automatically.
- To analyse your own text data, replace the `collect_data()` call with a function that returns a list of documents, category labels, and category names in the same format.
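
A replacement for `collect_data()` might look like the sketch below. The function name `collect_custom_data`, its input format (pairs of text and category name), and the return order are all assumptions for illustration; check the notebook's own `collect_data()` for the exact shape it returns before swapping it in.

```python
from typing import Iterable, List, Tuple


def collect_custom_data(
    rows: Iterable[Tuple[str, str]],
) -> Tuple[List[str], List[int], List[str]]:
    """Hypothetical stand-in for collect_data().

    Takes (text, category_name) pairs and returns documents, integer
    labels, and the sorted list of category names, mirroring the
    data / target / target_names layout of the 20 Newsgroups loader.
    """
    # Strip whitespace and drop rows whose text is empty.
    cleaned = [
        (text.strip(), category)
        for text, category in rows
        if text and text.strip()
    ]

    # Map each category name to a stable integer index.
    category_names = sorted({category for _, category in cleaned})
    index = {name: i for i, name in enumerate(category_names)}

    documents = [text for text, _ in cleaned]
    labels = [index[category] for _, category in cleaned]
    return documents, labels, category_names


# Example with made-up customer reviews.
sample = [
    ("The battery lasts two days.", "positive"),
    ("Screen cracked within a week.", "negative"),
    ("   ", "negative"),  # blank text is skipped
]
documents, labels, category_names = collect_custom_data(sample)
print(labels, category_names)  # [1, 0] ['negative', 'positive']
```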
Every technique in this notebook is deliberately context-agnostic. The 20 Newsgroups dataset is used purely as a convenient, well-understood benchmark — swap it for customer reviews, research abstracts, social media posts, or any other corpus and the analysis pipeline applies unchanged. The real value is in the workflow: start broad with frequency analysis, narrow down with topic modelling and clustering, then layer on entity recognition, sentiment, and classification as the questions demand.
This project is licensed under CC BY 4.0.
- CNN X-ray Image Classifier — deep learning for medical imaging
- VAE for Molecule Discovery — generative modelling for drug discovery
- kenreid.co.uk/data_science — all projects, publications, and CV
Ken Reid — Data Scientist, photographer, and avid reader.
- kenreid.co.uk — Portfolio & blog
- @kenreid.co.uk — Bluesky
- @DrKenReid — GitHub











