ChatSphere is a chatbot framework written from scratch in Python for college environments, with NLP-based intent classification and contextual dialogue management. It builds on our initial work with the DailyDialog dataset; this final version adds a custom-built institutional dataset and a containerized deployment pipeline.
Key capabilities:
- Intent classification (Naive Bayes, Decision Tree, Random Forest)
- Advanced text preprocessing with MWE recognition and contraction handling
- TF-IDF feature extraction with a custom smoothed vectorizer
- Containerized deployment on Google Cloud with Vercel frontend
- Contextual dialogue management
Text Preprocessing Pipeline
- Multi-Word Expression (MWE) recognition using WordNet
- Comprehensive contraction handling
- Advanced hyphen/punctuation management
- Case normalization and token splitting
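As a rough illustration of the preprocessing steps listed above, the sketch below expands contractions, normalizes case, splits on hyphens and punctuation, and merges two-word expressions. The contraction table and the MWE set are tiny hypothetical stand-ins (the actual pipeline looks MWEs up in WordNet), and `preprocess` is not a ChatSphere function name.

```python
import re

# Hypothetical, deliberately tiny lookup tables (assumptions for illustration).
CONTRACTIONS = {"what's": "what is", "i'm": "i am", "don't": "do not", "can't": "cannot"}
MWES = {("exam", "schedule"), ("time", "table"), ("academic", "block")}


def preprocess(text):
    """Lowercase, expand contractions, split on hyphens/punctuation, merge MWEs."""
    text = text.lower().replace("\u2019", "'")      # case + apostrophe normalization
    words = [CONTRACTIONS.get(w, w) for w in text.split()]
    text = " ".join(words).replace("-", " ")        # hyphens become token boundaries
    tokens = re.findall(r"[a-z0-9]+", text)         # drop remaining punctuation

    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MWES:                            # join a known two-word expression
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

For example, `preprocess("What's the exam-schedule?")` yields `["what", "is", "the", "exam_schedule"]`. A contraction followed directly by punctuation (such as `don't,`) is not expanded by this simplified lookup.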
Feature Extraction
- Custom TF-IDF Vectorizer with smoothing:
  TF-IDF(t,d) = (Count(t,d)/∑Count(t',d)) × [log((N+1)/(DF(t)+1)) + 1]
- Vocabulary construction from tokenized corpus
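The smoothed TF-IDF formula above can be sketched in plain Python as follows. This is a minimal version under two assumptions: the logarithm is the natural log, and the vocabulary is indexed in sorted order. The function names are illustrative and not taken from the ChatSphere code.

```python
import math
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Map each distinct token to a column index, in sorted order."""
    return {t: i for i, t in enumerate(sorted({t for d in tokenized_docs for t in d}))}


def tfidf_matrix(tokenized_docs):
    """Return (vocab, rows) where rows[d][vocab[t]] = TF-IDF(t, d)."""
    vocab = build_vocabulary(tokenized_docs)
    n_docs = len(tokenized_docs)
    # DF(t): number of documents that contain t at least once.
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = sum(counts.values())
        row = [0.0] * len(vocab)
        for term, count in counts.items():
            tf = count / total                                  # Count(t,d) / sum of counts in d
            idf = math.log((n_docs + 1) / (df[term] + 1)) + 1   # smoothed IDF
            row[vocab[term]] = tf * idf
        rows.append(row)
    return vocab, rows
```

With this smoothing, a term that occurs in every document gets an IDF of exactly 1, so its weight is just its term frequency and it is never zeroed out.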
Intent Classifiers
- Multinomial Naive Bayes (90.48% accuracy)
  - Laplace smoothing implementation:
    P(t|c) = (Count(t,c)+1)/(∑Count(t',c)+|V|)
- Decision Tree (89.08% accuracy)
  - Gini Index and Entropy based splitting
- Random Forest (91.11% accuracy - best performing)
  - Ensemble of decision trees with mode aggregation
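The core quantities behind the three classifiers above are small enough to sketch directly. The helpers below are illustrative only (their names are assumptions, not ChatSphere's API) and assume non-empty inputs: the Laplace-smoothed likelihood used by Naive Bayes, the two impurity measures a decision tree can split on, and the mode vote a random forest uses to combine tree predictions.

```python
import math
from collections import Counter


def laplace_likelihood(term, class_counts, vocab_size):
    """P(t|c) = (Count(t,c) + 1) / (sum of Count(t',c) + |V|)."""
    return (class_counts.get(term, 0) + 1) / (sum(class_counts.values()) + vocab_size)


def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mode_vote(predictions):
    """Random-forest style aggregation: return the most common tree prediction."""
    return Counter(predictions).most_common(1)[0][0]
```

Because of the added 1 in the numerator, a term never seen in a class still gets a small non-zero probability, which keeps a single unseen word from driving the whole class score to zero.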
Deployment Architecture
- Containerized using Docker
- Flask backend on Google Cloud
- Express/Node.js frontend on Vercel
- Vertex AI with L4 GPU for training
Place these files in your working directory:
- intents.json - Custom college dataset with tags, patterns and responses
- (Optional) DailyDialog files for baseline comparison
from chatbot import ChatSphere
chatbot = ChatSphere(bot_name="CollegeBot")
chatbot.train(data_file="intents.json")
- Containerization
FROM python:3.9
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "app.py"]
- Google Cloud Deployment
  - Build Docker image: docker build -t chatsphere .
  - Push to Google Container Registry
  - Deploy to Cloud Run with automatic scaling
- Frontend on Vercel
  - Configure frontend with backend API key
  - Deploy using Vercel CLI: vercel --prod
- Training Infrastructure
  - Use College_Chatbot.ipynb on Vertex AI (Colab Enterprise)
  - L4 GPU recommended for training
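One detail worth keeping in mind for the Cloud Run step: Cloud Run tells the container which port to listen on through the PORT environment variable (8080 by default), and the server has to accept connections on all interfaces. A minimal sketch of how app.py might pick this up is shown below; `server_binding` is a hypothetical helper, and the commented usage assumes a Flask object named `app`, which this README does not confirm.

```python
import os


def server_binding(environ=os.environ):
    """Return the (host, port) pair a Cloud Run container should listen on."""
    # Cloud Run passes the listening port in PORT; fall back to its default of 8080.
    port = int(environ.get("PORT", "8080"))
    # Bind to all interfaces so requests from outside the container reach the app.
    return "0.0.0.0", port


# Hypothetical use at the bottom of app.py, assuming a Flask object named `app`:
#     host, port = server_binding()
#     app.run(host=host, port=port)
```

Reading the port from the environment also keeps local runs and the deployed container on the same code path.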
- Academic Entities
chatbot.add_entity_pattern(
    "course_code",
    r"\b[A-Z]{2,4}\s?\d{3}\b"  # e.g., "CS 101"
)
- Location Entities
chatbot.add_entity_pattern(
    "campus_location",
    r"\b(?:library|lab complex|academic block)\b"
)
- Administrative Terms
chatbot.add_entity_pattern(
    "admin_term",
    r"\b(?:registrar|time table|exam schedule)\b"
)
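To check what these patterns actually match before registering them, they can be tried out with Python's standard `re` module. The sketch below is a standalone test harness, not ChatSphere's entity extractor; `extract_entities` is a hypothetical name, and it applies the patterns case-sensitively, which may differ from how the framework applies them.

```python
import re

# Patterns copied from the add_entity_pattern calls above.
ENTITY_PATTERNS = {
    "course_code": r"\b[A-Z]{2,4}\s?\d{3}\b",
    "campus_location": r"\b(?:library|lab complex|academic block)\b",
    "admin_term": r"\b(?:registrar|time table|exam schedule)\b",
}


def extract_entities(text):
    """Return {entity_name: [matched substrings]} for every pattern that matches."""
    found = {}
    for name, pattern in ENTITY_PATTERNS.items():
        matches = re.findall(pattern, text)
        if matches:
            found[name] = matches
    return found
```

Matched case-sensitively like this, "CS 101" and "CS101" are both recognized as course codes, while "Library" with a capital L is not recognized as a campus location. If queries are not lowercased before matching, passing `re.IGNORECASE` for the word-list patterns is one option.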
- With Custom College Data
chatbot.train(
    data_file="intents.json",
    test_size=0.2,
    vectorizer_params={"max_features": 1500}
)
- Performance Tuning
chatbot.set_model_params(
    n_estimators=200,
    max_depth=10,
    min_samples_split=5
)
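For readers unfamiliar with the `test_size` and `max_features` options used above, the sketch below shows one plausible reading of them: hold out a fraction of the samples for evaluation, and cap the vocabulary at the most frequent terms. Both helpers are hypothetical; in particular, selecting features by corpus frequency is an assumption borrowed from how scikit-learn's vectorizers treat a parameter of the same name, and ChatSphere's own behavior may differ.

```python
import random
from collections import Counter


def train_test_split(samples, test_size=0.2, seed=42):
    """Shuffle a copy of samples and hold out a test_size fraction for evaluation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)       # fixed seed for a repeatable split
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


def top_terms(tokenized_docs, max_features=1500):
    """Keep only the max_features most frequent terms across the corpus."""
    counts = Counter(t for doc in tokenized_docs for t in doc)
    return [term for term, _ in counts.most_common(max_features)]
```

With `test_size=0.2`, a dataset of 10 patterns is split into 8 for training and 2 for testing. The tuning parameters `n_estimators`, `max_depth` and `min_samples_split` share their names with scikit-learn's random forest options (number of trees, maximum tree depth, minimum samples needed to split a node), which is presumably the meaning intended here.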
Example Session
You: Where is the CS101 class?
ChatSphere: CS101 is held in Academic Block B, Room 203.
You: When is the exam registration deadline?
ChatSphere: The deadline for exam registration is November 15th.
I detected: date: November 15th
Reset Conversation
chatbot.reset() # Clears dialogue history
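How ChatSphere stores dialogue history internally is not described here, so the following is only a sketch of one simple way such a history and its reset could work: a bounded queue of turns that is cleared on reset. The class and method names are hypothetical.

```python
from collections import deque


class DialogueHistory:
    """Bounded record of (speaker, utterance) turns; a hypothetical helper."""

    def __init__(self, max_turns=10):
        self._turns = deque(maxlen=max_turns)   # oldest turns fall off automatically

    def add(self, speaker, utterance):
        self._turns.append((speaker, utterance))

    def last_user_utterance(self):
        """Return the most recent user turn, or None if there is none."""
        for speaker, utterance in reversed(self._turns):
            if speaker == "user":
                return utterance
        return None

    def reset(self):
        """Forget all stored turns, as chatbot.reset() is described as doing."""
        self._turns.clear()
```

Bounding the history keeps memory use flat in long sessions, at the cost of forgetting context older than `max_turns`.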
- Add New Intents
  Edit intents.json to include:
{
  "tag": "new_intent",
  "patterns": ["sample queries"],
  "responses": ["appropriate responses"]
}
- Modify Classifier
  Options: 'naive_bayes', 'decision_tree', 'random_forest'
chatbot = ChatSphere(model_type="random_forest")
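Editing intents.json by hand works for a few entries; for larger additions, the Add New Intents step above could be scripted along these lines. This sketch assumes the file is a JSON object with a top-level "intents" list, which this README does not state, so adjust the key to match the actual file. `add_intent` is a hypothetical helper, not part of ChatSphere.

```python
import json


def add_intent(path, tag, patterns, responses):
    """Append an intent to the JSON file at path, refusing duplicate tags."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    intents = data["intents"]                      # assumed top-level key
    if any(item["tag"] == tag for item in intents):
        raise ValueError(f"intent tag already exists: {tag}")
    intents.append({"tag": tag, "patterns": patterns, "responses": responses})
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    return len(intents)                            # number of intents after the append
```

After adding intents, retrain with `chatbot.train(data_file="intents.json")` so the classifier sees the new patterns.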
- WordNet Issues
import wn
wn.download("omw-en")
- Deployment Errors
  - Verify API keys in Google Cloud and Vercel
  - Check container logs: gcloud logging read
- Performance Issues
  - Increase TF-IDF max_features
  - Adjust classifier hyperparameters
  - Expand training data in intents.json

- Implement transformer architectures
- Add continuous learning pipeline
- Integrate with college databases
- Develop mobile interface

- Aarav Dawer Bhojwani (Naive Bayes)
- Agam Harpreet Singh & Ishan Shah (Random Forest, Deployment)
- Nirmal Kumar Godara (Decision Tree)
- Mahi Chouhan (Text Preprocessing)