This method helps uncover hidden themes within a collection of text documents, making it a valuable tool for exploring unfamiliar domains. For example, a social scientist analyzing public discussions on social media or academic papers on a particular topic can use this method to identify recurring themes. The method assumes that each document contains a mixture of topics and that each topic comprises a distinct set of words. Given a text file with one document per line, the method produces two key outputs: (i) the probability of each topic appearing in each document, and (ii) the most representative words for each topic, along with their probabilities.
The approach is built on Latent Dirichlet Allocation (LDA) and uses collapsed Gibbs sampling for inference, which improves efficiency and produces a balanced topic distribution while giving users control over the model's internal workings. The sampler follows a Markov chain Monte Carlo approach, starting from a random initial assignment of topics. The method is a vanilla implementation of topic modeling, relying only on basic packages (e.g., numpy, JSON handling, and random number generation) for loading data, so users retain maximum, transparent control over internal decisions and can customize the behavior as needed. It is implemented as a class so that its behavior can be extended easily.
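To make the idea concrete, the sketch below shows a minimal collapsed Gibbs sampling loop for LDA. It is not the notebook's exact code: the function name, hyperparameter values, and variable names are illustrative assumptions, but the count-based resampling update is the standard collapsed Gibbs formulation the method relies on.

```python
import numpy as np

# Minimal, illustrative collapsed Gibbs sampler for LDA (not the notebook's exact code).
# docs: list of documents, each a list of integer word ids; V: vocabulary size.
def collapsed_gibbs_lda(docs, V, K=3, alpha=0.1, beta=0.01, iterations=100, seed=0):
    rng = np.random.default_rng(seed)

    n_dk = np.zeros((len(docs), K))  # topic counts per document
    n_kw = np.zeros((K, V))          # word counts per topic
    n_k = np.zeros(K)                # total words assigned to each topic
    z = []                           # current topic assignment of every word

    # Random initialization: the Markov chain's random starting state
    for d, doc in enumerate(docs):
        z_d = rng.integers(0, K, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the word's current assignment from the counts
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1

                # Conditional distribution over topics given all other assignments
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                p /= p.sum()

                # Resample the topic and restore the counts
                k = rng.choice(K, p=p)
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    return n_dk, n_kw, n_k
```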
topic modeling, Latent Dirichlet Allocation, document analysis
A social scientist wants to examine the dynamics of political poll reviews to gain nuanced insights into voter interests.
- data/
  - input.csv (BBC article headlines, using only the first 100 for the demo), generated by prepare-data.ipynb
- scripts/
  - LDA-collapsed-gibbs-sampling.ipynb
- LICENSE.txt
- requirements.txt (not strictly needed, since only basic libraries are used)
This is a vanilla implementation of the Latent Dirichlet Allocation technique, with everything built from scratch; therefore only basic libraries, i.e., numpy, pandas, random, and string, are needed to read the data and generate random numbers.
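As a rough illustration of what this involves, the snippet below loads the one-document-per-line input with pandas and converts it to integer word ids for the sampler. The column layout and whitespace tokenization are assumptions, not the notebooks' exact preprocessing (which also has random and string available).

```python
import pandas as pd

# Illustrative loading and integer encoding; the exact steps in the notebooks may differ.
df = pd.read_csv("data/input.csv")        # one document (headline) per line
documents = df.iloc[:, 0].astype(str)     # first column assumed to hold the text

# Simple whitespace tokenization (assumed; punctuation handling may differ in the notebook).
docs_tokens = [doc.split() for doc in documents]

# Integer-encode the vocabulary so the sampler can work with word ids.
vocab = {w: i for i, w in enumerate(sorted({w for doc in docs_tokens for w in doc}))}
id2word = {i: w for w, i in vocab.items()}
docs = [[vocab[w] for w in doc] for doc in docs_tokens]
```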
- Update config.json, which stores the method configuration in JSON format, setting the parameters as desired (an illustrative sketch of such a configuration follows this list).
- Set up the environment from requirements.txt with the command `pip install -r requirements.txt`
- Put your data in data/input.csv
- Execute the notebook LDA-collapsed-gibbs-sampling.ipynb to get results
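The exact keys in config.json are defined in the repository; the example below is only a plausible sketch of the kind of parameters a collapsed Gibbs LDA run typically exposes. The key names and values are assumptions, except that the text confirms a JSON format, three topics in the demo output, and 10 top words per topic.

```python
import json

# Plausible config.json contents; key names and values are assumptions,
# not the repository's actual schema.
example_config = {
    "num_topics": 3,                 # number of latent topics (three in the demo output)
    "alpha": 0.1,                    # document-topic Dirichlet prior (assumed)
    "beta": 0.01,                    # topic-word Dirichlet prior (assumed)
    "iterations": 100,               # Gibbs sweeps over the corpus (assumed)
    "top_words": 10,                 # top words reported per topic (the text mentions 10)
    "input_file": "data/input.csv",
}

# Print the JSON that such a configuration file would contain.
print(json.dumps(example_config, indent=2))
```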
Input data
The input can be any text to explore. For demonstration purposes, we use BBC news article headlines as sample documents. Below are 10 example headlines taken from the dataset, which can be found in the file data/input.csv.
Headlines |
---|
India calls for fair trade rules |
Sluggish economy hits German jobs |
Indonesians face fuel price rise |
Court rejects $280bn tobacco case |
Dollar gains on Greenspan speech |
Mixed signals from French economy |
Ask Jeeves tips online ad revival |
Rank 'set to sell off film unit' |
US trade gap hits record in 2004 |
India widens access to telecoms |
Output 1: Word Distribution Per Topic
The latent topics identified are represented by their most significant words and the corresponding probabilities. This is similar to clustering in the sense that words are grouped into topics labeled non-descriptively as topic 0, topic 1, etc. Unlike clustering, however, each word has a probability of relevance to the topic. Using these probabilities, only the top few words (10 in config.json) are used to represent a topic, i.e., the topic-word distribution. For the three topics:
Topic Name | Words and Probabilities |
---|---|
Topic 0 | ('deal', 0.039125056962437336), ('profit', 0.03261506412342946), ('profits', 0.026105071284421584), ('Japanese', 0.019595078445413708), ('takeover', 0.01308508560640583), ('lifts', 0.01308508560640583), ("India's", 0.01308508560640583), ('high', 0.01308508560640583), ('Parmalat', 0.01308508560640583), ('China', 0.01308508560640583) |
Topic 1 | ('economy', 0.04184945338068379), ('hits', 0.03488614998955504), ('fuel', 0.03488614998955504), ('Yukos', 0.02792284659842629), ('growth', 0.02792284659842629), ('Japan', 0.02792284659842629), ('German', 0.020959543207297537), ('$280bn', 0.013996239816168788), ('French', 0.013996239816168788), ('prices', 0.013996239816168788) |
Topic 2 | ('jobs', 0.024660229998155092), ('firm', 0.024660229998155092), ('gets', 0.024660229998155092), ('India', 0.018510546706844596), ('sales', 0.018510546706844596), ('new', 0.018510546706844596), ('oil', 0.018510546706844596), ('BMW', 0.018510546706844596), ('trade', 0.012360863415534098), ('rise', 0.012360863415534098) |
The complete distribution is written to data/output-data/topic-word-distribution.txt
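The topic-word probabilities above can be recovered from the sampler's count matrices. The function below is a hedged sketch of that step, reusing the count arrays and the id2word mapping from the earlier sketches; the beta smoothing and the function name are assumptions rather than the notebook's exact code.

```python
def top_words_per_topic(n_kw, n_k, id2word, beta=0.01, top_n=10):
    # Topic-word distribution: smoothed word counts normalized within each topic.
    V = n_kw.shape[1]
    phi = (n_kw + beta) / (n_k[:, None] + V * beta)

    topics = {}
    for k in range(phi.shape[0]):
        best = phi[k].argsort()[::-1][:top_n]  # indices of the most probable words
        topics[f"topic {k}"] = [(id2word[int(w)], float(phi[k, w])) for w in best]
    return topics
```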
Output 2: Topic Distribution Per Document
Each document is assigned a probability for each topic, based on the topic associations of its words. These probabilities indicate the extent to which the document relates to each topic. For example, a document could be 45% topic 0, 45% topic 1, and 10% topic 2.
A reader interested only in topic 0 can focus on the documents where topic 0 is the dominant topic.
Document | Topic 0 | Topic 1 | Topic 2 |
---|---|---|---|
Document 0 | 0.125 | 0.5 | 0.375 |
Document 1 | 0.375 | 0.25 | 0.375 |
Document 2 | 0.125 | 0.75 | 0.125 |
Document 3 | 0.375 | 0.25 | 0.375 |
Document 4 | 0.7142857142857143 | 0.14285714285714285 | 0.14285714285714285 |
Document 5 | 0.42857142857142855 | 0.2857142857142857 | 0.2857142857142857 |
Document 6 | 0.125 | 0.375 | 0.5 |
Document 7 | 0.25 | 0.5 | 0.25 |
Document 8 | 0.375 | 0.375 | 0.25 |
Document 9 | 0.14285714285714285 | 0.14285714285714285 | 0.7142857142857143 |
... |
The complete distribution is written to data/output-data/document-topic-distribution.txt
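As with the topic-word output, the document-topic proportions can be derived from the per-document topic counts produced by the sampler. The sketch below shows one plausible way to compute them; whether and how the notebook applies alpha smoothing is an assumption, so the default here is plain normalized counts, and the example counts are made up.

```python
import numpy as np

def document_topic_distribution(n_dk, alpha=0.0):
    # Per-document topic proportions from the sampler's document-topic counts.
    # alpha is the Dirichlet prior; it defaults to 0 (plain normalized counts),
    # since the notebook's exact smoothing is an assumption here.
    K = n_dk.shape[1]
    return (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)

# Tiny illustration with made-up counts for one document and three topics.
counts = np.array([[1.0, 4.0, 3.0]])
print(document_topic_distribution(counts))  # rows sum to 1, e.g. [0.125, 0.5, 0.375]
```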
- Put your data in data/input.csv
- Execute the first notebook prepare-data.ipynb to transform the data into integer encoding
- Execute the main notebook LDA-collapsed-gibbs-sampling.ipynb to get results
M. Taimoor Khan ([email protected])