Early identification of learning difficulty is crucial in educational platforms.
Instead of predicting outcomes after a student fails, this project focuses on predicting whether a student is likely to struggle with a topic based on prior learning behavior.
The core idea is that learning difficulty depends not only on topic-level performance, but also on overall learning patterns of the student.
Since no public dataset exactly fits this problem, a synthetic dataset was created to allow controlled experimentation and clear interpretation.
- 50 students
- 10 topics
- 500 total records (student–topic interactions)
Each row represents how a student performed on a particular topic.
- `avg_prev_score` – average performance before the current topic
- `time_ratio` – actual time spent divided by expected time
- `attempts` – number of attempts made
- `topic_order` – position of the topic in the learning sequence
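The exact generation procedure is not specified here, but a minimal sketch of how such a synthetic dataset could be produced (the latent `ability` variable and the noise parameters are assumptions for illustration) looks like:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_students, n_topics = 50, 10

rows = []
for student_id in range(n_students):
    # Hypothetical latent skill level that drives a student's scores.
    ability = rng.normal(70, 10)
    for topic_order in range(1, n_topics + 1):
        rows.append({
            "student_id": student_id,
            "topic_order": topic_order,
            "avg_prev_score": float(np.clip(ability + rng.normal(0, 8), 0, 100)),
            "time_ratio": max(0.3, rng.normal(1.0, 0.3)),
            "attempts": int(rng.integers(1, 5)),
        })

df = pd.DataFrame(rows)
print(df.shape)  # (500, 5): 50 students x 10 topics
```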
This project combines unsupervised learning and supervised learning to build a realistic prediction pipeline.
High-level steps:
- Capture student learning behavior using clustering
- Integrate behavioral context into topic-level data
- Predict learning difficulty using a supervised model
- Validate improvements using a baseline comparison
- Interpret model behavior
Learning behavior is a student-level property, not a topic-level one.
Steps:
- Aggregate topic-level data to create student-level behavior profiles
- Use the following features:
- Mean previous score
- Mean time ratio
- Mean number of attempts
- Apply K-Means clustering
- Perform feature scaling since K-Means is distance-based
The resulting clusters represent different learning styles such as fast learners, average learners, and struggling learners.
At this stage, clusters provide context, not predictions.
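The aggregation and clustering steps above can be sketched as follows (the function and column names are assumptions, not the project's actual code):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

def cluster_students(df: pd.DataFrame, n_clusters: int = 3, seed: int = 42) -> pd.DataFrame:
    # Aggregate topic-level rows into one behavior profile per student.
    profiles = df.groupby("student_id").agg(
        mean_prev_score=("avg_prev_score", "mean"),
        mean_time_ratio=("time_ratio", "mean"),
        mean_attempts=("attempts", "mean"),
    )
    # Scale features first: K-Means is distance-based, so a feature with a
    # larger numeric range (scores 0-100) would otherwise dominate.
    X = StandardScaler().fit_transform(profiles)
    profiles["cluster"] = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(X)
    return profiles.reset_index()[["student_id", "cluster"]]
```

The cluster labels themselves are arbitrary integers; interpreting them as "fast", "average", or "struggling" learners requires inspecting each cluster's mean feature values.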
Clustering is performed at the student level, while predictions are made at the topic level.
To combine both:
- Cluster labels are merged back into the topic-level dataset using a left join on `student_id`
This ensures each topic interaction contains both performance data and student behavioral context.
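In pandas, this join is a one-line `merge`; a small illustration with made-up values:

```python
import pandas as pd

# Hypothetical topic-level interactions and student-level cluster labels.
topics = pd.DataFrame({
    "student_id": [1, 1, 2, 2],
    "topic_order": [1, 2, 1, 2],
    "avg_prev_score": [80, 75, 55, 50],
})
clusters = pd.DataFrame({"student_id": [1, 2], "cluster": [0, 2]})

# Left join keeps every topic interaction and attaches its student's cluster.
merged = topics.merge(clusters, on="student_id", how="left")
print(merged["cluster"].tolist())  # [0, 0, 2, 2]
```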
A realistic and interpretable rule is used to define learning difficulty.
A student is considered to be struggling on a topic if:
- `avg_prev_score` < 65, AND
- `time_ratio` > 1.2
This creates a binary target variable:
- `1` → struggling
- `0` → not struggling
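The labeling rule translates directly into a vectorized expression, for example:

```python
import pandas as pd

def label_struggling(df: pd.DataFrame) -> pd.Series:
    # 1 = struggling: weak prior performance AND over the expected time budget.
    return ((df["avg_prev_score"] < 65) & (df["time_ratio"] > 1.2)).astype(int)

demo = pd.DataFrame({
    "avg_prev_score": [60, 60, 80],
    "time_ratio":     [1.5, 1.0, 1.5],
})
print(label_struggling(demo).tolist())  # [1, 0, 0] — both conditions must hold
```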
To predict learning difficulty, Logistic Regression is used because:
- The target variable is binary
- The model is interpretable
- It aligns with foundational machine learning principles
Training process:
- Train–test split to evaluate generalization
- Feature scaling applied only to input features (to avoid data leakage)
- Model trained using:
- Topic-level performance indicators
- Effort-related features
- Student learning behavior cluster
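A minimal sketch of this training process, assuming the feature and target column names used above (the scaler is fit on the training split only, which is the leakage precaution the list refers to):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Hypothetical feature set: performance, effort, sequence, and cluster context.
FEATURES = ["avg_prev_score", "time_ratio", "attempts", "topic_order", "cluster"]

def train(df: pd.DataFrame):
    X, y = df[FEATURES], df["struggling"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    # Fit the scaler on training data only to avoid leaking test statistics.
    scaler = StandardScaler().fit(X_train)
    model = LogisticRegression().fit(scaler.transform(X_train), y_train)
    return model, scaler, scaler.transform(X_test), y_test
```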
To validate whether clustering added value:
- A baseline model was trained without the cluster feature
- Its performance was compared with the cluster-aware model
- Recall for struggling students remained the same
- Precision improved when behavioral clustering was included
This indicates that learning behavior context reduced false positives without missing at-risk students.
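The comparison can be automated with scikit-learn's metrics; the helper below is an assumed sketch, and the toy predictions merely illustrate the reported pattern (equal recall, higher precision):

```python
from sklearn.metrics import precision_score, recall_score

def compare_models(y_test, pred_baseline, pred_clustered):
    # Recall on the struggling class: did we miss any at-risk students?
    # Precision: of those flagged as struggling, how many truly were?
    return {
        name: {
            "precision": precision_score(y_test, pred),
            "recall": recall_score(y_test, pred),
        }
        for name, pred in [
            ("baseline", pred_baseline),
            ("with_clusters", pred_clustered),
        ]
    }
```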
Logistic regression coefficients were analyzed to understand model behavior.
Key insights:
- A higher time ratio (more time spent than expected) strongly increases predicted difficulty risk
- Strong prior performance reduces difficulty risk
- Learning behavior clusters add meaningful contextual information
These observations confirm that the model learned reasonable and explainable patterns.
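Inspecting the coefficients is straightforward once the model is fitted; a small assumed helper (coefficient magnitudes are comparable across features only because the inputs were scaled):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def coefficient_table(model: LogisticRegression, feature_names) -> pd.Series:
    # Positive weight -> feature pushes the prediction toward "struggling";
    # negative weight -> protective. Sorted from most protective to most risky.
    return pd.Series(model.coef_[0], index=feature_names).sort_values()
```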
- Combining unsupervised and supervised learning improves prediction quality
- Behavioral context helps reduce noisy predictions
- Baseline comparison is essential for validating ideas
- Model interpretation is as important as evaluation metrics
- Probability-based early warning thresholds
- Better handling of class imbalance
- API deployment using FastAPI
- Evaluation on real-world educational datasets