Part of the TopicModeling-ResearchTool Series
This is Part 2 of a topic modeling project. The project originally focused on analyzing protest-related text, categorizing sentiments such as pro-police, pro-protester, anti-police, anti-protester, and neutral.
This updated version builds on that foundation, making the tool more flexible and general-purpose by allowing users to input their own custom categories. It uses Google's Gemini 2.0 Flash-Lite model to automatically classify documents into the selected categories.
- Upload CSV files containing document text.
- Choose a specific column for topic classification.
- Use either:
  - Default protest-related categories, or
  - Your own custom categories (e.g., "sports, politics, entertainment").
- Classify text using Google's Gemini 2.0 Flash-Lite.
- Visualize results with an interactive bar chart.
- Search documents by keyword.
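As an illustration of the custom-category option above, here is a minimal sketch of how a comma-separated string could be turned into a clean category list, falling back to the protest defaults when nothing is entered. The names `DEFAULT_CATEGORIES` and `parse_categories` are hypothetical and are not taken from the repository.

```python
# Hypothetical defaults, based on the protest-related categories named above.
DEFAULT_CATEGORIES = [
    "pro-police", "pro-protester", "anti-police", "anti-protester", "neutral",
]


def parse_categories(raw: str) -> list[str]:
    """Split a comma-separated string into trimmed, de-duplicated category names."""
    seen = set()
    categories = []
    for part in raw.split(","):
        name = part.strip()
        key = name.lower()  # compare case-insensitively, keep the first spelling
        if name and key not in seen:
            seen.add(key)
            categories.append(name)
    # Fall back to the defaults when the input contains no usable names.
    return categories or list(DEFAULT_CATEGORIES)


print(parse_categories("sports, politics, entertainment"))
print(parse_categories("   "))
```

Trimming and de-duplicating up front keeps stray spaces or repeated names from turning into separate bars in the chart later on.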
git clone https://github.com/munas-git/GenAITopicModeling-ResearchTool-2.git
cd GenAITopicModeling-ResearchTool-2
pip install -r requirements.txt
- Go to Google AI Studio
- Log in with your Google account.
- Click on "Create API Key".
- Copy the API key provided.
In the root directory of the project, create a .env file and paste your API key like this:
GOOGLE_GENAI_API_KEY=your_google_genai_api_key
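How the app itself loads this file is not shown here (a package such as python-dotenv is a common choice). As a rough, standard-library-only sketch of the idea, the key could be read like this. The helper names `read_env_file` and `get_api_key` are hypothetical.

```python
import os
from pathlib import Path


def read_env_file(path=".env") -> dict:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_api_key(path=".env") -> str:
    """Return the Gemini API key, or raise a clear error if it is not set."""
    # An already-exported environment variable wins over the file.
    key = os.environ.get("GOOGLE_GENAI_API_KEY") or read_env_file(path).get(
        "GOOGLE_GENAI_API_KEY", ""
    )
    if not key or key == "your_google_genai_api_key":
        raise RuntimeError(
            "GOOGLE_GENAI_API_KEY is missing; add it to .env in the project root."
        )
    return key
```

Calling `get_api_key()` once at startup gives an immediate, readable error if the placeholder value was never replaced, rather than a failed API request later.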
streamlit run app.py
- Upload a CSV file containing text (e.g., tweets, articles, abstracts).
- Choose the column that contains the document text.
- Enter your own categories or use defaults.
- The tool sends batches of documents to Gemini 2.0 Flash-Lite for classification.
- Outputs:
  - Labeled data (document + assigned topic)
  - Bar chart of topic frequency
  - Keyword search for document filtering
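The flow above can be sketched roughly as follows. This is not the repository's code: the prompt wording, the function names, and the `"unknown"` fallback label are all assumptions, and the real Gemini call is replaced by a stand-in function so the example runs offline.

```python
from collections import Counter
from typing import Callable


def build_prompt(categories, documents):
    """Ask for exactly one category per numbered document, one label per line."""
    numbered = "\n".join(f"{i}. {doc}" for i, doc in enumerate(documents, start=1))
    return (
        "Classify each document into exactly one of these categories: "
        + ", ".join(categories)
        + ".\nReply with one category per line, in the same order.\n\n"
        + numbered
    )


def parse_labels(reply, categories, expected):
    """Map reply lines to known categories; unrecognized lines become 'unknown'."""
    lookup = {c.lower(): c for c in categories}
    lines = [ln.strip() for ln in reply.splitlines() if ln.strip()]
    labels = [lookup.get(ln.lower(), "unknown") for ln in lines[:expected]]
    # Pad if the model returned fewer lines than there were documents.
    return labels + ["unknown"] * (expected - len(labels))


def classify(documents, categories, ask_model: Callable[[str], str]):
    """Send one batch to the model and pair each document with its label."""
    reply = ask_model(build_prompt(categories, documents))
    return list(zip(documents, parse_labels(reply, categories, len(documents))))


def keyword_search(labeled, keyword):
    """Case-insensitive substring filter over (document, label) pairs."""
    needle = keyword.lower()
    return [(doc, label) for doc, label in labeled if needle in doc.lower()]


def fake_model(prompt):
    # Stand-in for the real Gemini request (assumption, for offline use only).
    return "sports\npolitics\nsports"


docs = ["The team won the final.", "Parliament passed the bill.", "A record-breaking sprint."]
labeled = classify(docs, ["sports", "politics", "entertainment"], fake_model)
counts = Counter(label for _, label in labeled)  # the data behind the bar chart
print(counts)
print(keyword_search(labeled, "team"))
```

Passing the model call in as a function keeps the prompt and parsing logic testable without an API key. Validating each returned label against the category list guards against replies that drift from the requested format.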
- Classifying news articles by topic
- Analyzing customer reviews for sentiment
- Segmenting research abstracts by domain
- Filtering social media posts by stance or tone
- Ensure your text column does not contain missing values.
- Currently optimized for batch processing (30 documents at a time).
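To make the two notes above concrete, here is a small sketch that drops rows with an empty text column and then splits the remaining documents into groups of 30. It uses only the standard library; `load_texts` and `batches` are hypothetical names, and the app's own preprocessing may differ.

```python
import csv
import io

BATCH_SIZE = 30  # batch size stated in the note above


def load_texts(csv_file, column):
    """Read one column from a CSV, skipping rows where it is empty or missing."""
    texts = []
    for row in csv.DictReader(csv_file):
        value = (row.get(column) or "").strip()
        if value:
            texts.append(value)
    return texts


def batches(items, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


sample = io.StringIO("id,text\n1,first post\n2,\n3,third post\n")
texts = load_texts(sample, "text")
print(texts)  # ['first post', 'third post']
print([len(b) for b in batches(list(range(65)))])  # [30, 30, 5]
```

If the data is already in a pandas DataFrame, `df.dropna(subset=[column])` covers the missing-value step.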
Feature | Part 1 | Part 2 (This repo) |
---|---|---|
Model used | OpenAI (GPT-4/3.5) | Google Gemini 2.0 Flash-Lite (Free, Yay!) |
Input flexibility | Fixed categories (identified from n-grams) | Custom user-defined categories supported |
Preprocessing & Topic Extraction | n-gram frequency filtering | Raw document classification via LLM |
API Key | OPENAI_API_KEY in .env | GOOGLE_GENAI_API_KEY in .env |
This project is licensed under the MIT License.
Created to support researchers, journalists, and analysts in automatically categorizing large bodies of text using LLMs.
- Author: munas-git
- Contributions & feedback welcome via Issues or Pull Requests!