DhaneshKolu/student-learning-clustering


Early Prediction of Student Learning Difficulty Using Behavioral Patterns

Problem Statement

Early identification of learning difficulty is crucial in educational platforms.
Instead of predicting outcomes after a student fails, this project focuses on predicting whether a student is likely to struggle with a topic based on prior learning behavior.

The core idea is that learning difficulty depends not only on topic-level performance, but also on the student's overall learning patterns.


Dataset

Since no public dataset exactly fits this problem, a synthetic dataset was created to allow controlled experimentation and clear interpretation.

  • 50 students
  • 10 topics
  • 500 total records (student–topic interactions)

Each row represents how a student performed on a particular topic.

Features

  • avg_prev_score – average performance before the current topic
  • time_ratio – actual time spent divided by expected time
  • attempts – number of attempts made
  • topic_order – position of the topic in the learning sequence
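The dataset described above can be sketched as follows. The column names and value ranges here are assumptions for illustration; the repository's actual generation code may differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_students, n_topics = 50, 10  # 50 students x 10 topics = 500 records

rows = []
for student_id in range(n_students):
    for topic_order in range(1, n_topics + 1):
        rows.append({
            "student_id": student_id,
            "topic_order": topic_order,
            "avg_prev_score": rng.uniform(40, 95),  # average performance before this topic
            "time_ratio": rng.uniform(0.5, 2.0),    # actual time / expected time
            "attempts": rng.integers(1, 5),         # attempts made on the topic
        })

df = pd.DataFrame(rows)
print(df.shape)  # (500, 5)
```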

Approach Overview

This project combines unsupervised learning and supervised learning to build a realistic prediction pipeline.

High-level steps:

  1. Capture student learning behavior using clustering
  2. Integrate behavioral context into topic-level data
  3. Predict learning difficulty using a supervised model
  4. Validate improvements using a baseline comparison
  5. Interpret model behavior

1. Student-Level Behavioral Clustering

Learning behavior is a student-level property, not a topic-level one.

Steps:

  • Aggregate topic-level data to create student-level behavior profiles
  • Use the following features:
    • Mean previous score
    • Mean time ratio
    • Mean number of attempts
  • Scale the features, since K-Means is distance-based
  • Apply K-Means clustering

The resulting clusters represent different learning styles such as fast learners, average learners, and struggling learners.
At this stage, clusters provide context, not predictions.
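The steps above can be sketched as below. The toy topic-level records stand in for the project's synthetic dataset, and column names such as `avg_prev_score` are assumptions carried over from the feature list.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy topic-level records: one row per student-topic interaction
df = pd.DataFrame({
    "student_id": np.repeat(np.arange(50), 10),
    "avg_prev_score": rng.uniform(40, 95, 500),
    "time_ratio": rng.uniform(0.5, 2.0, 500),
    "attempts": rng.integers(1, 5, 500),
})

# Aggregate topic-level data into student-level behavior profiles
profiles = df.groupby("student_id")[["avg_prev_score", "time_ratio", "attempts"]].mean()

# Scale first: K-Means compares Euclidean distances, so features must share a scale
scaled = StandardScaler().fit_transform(profiles)

# Three clusters, matching the fast / average / struggling interpretation
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
profiles["cluster"] = kmeans.fit_predict(scaled)
```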


2. Integrating Behavioral Context

Clustering is performed at the student level, while predictions are made at the topic level.

To combine both:

  • Cluster labels are merged back into the topic-level dataset using a left join on student_id

This ensures each topic interaction contains both performance data and student behavioral context.
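A minimal sketch of that left join, using small hand-made frames (the `student_id` / `cluster` column names follow the text; the values are illustrative):

```python
import pandas as pd

# Topic-level interactions (one row per student-topic pair)
topic_df = pd.DataFrame({
    "student_id": [1, 1, 2],
    "topic_order": [1, 2, 1],
    "avg_prev_score": [70.0, 60.0, 80.0],
})

# Student-level cluster labels from the K-Means step
student_clusters = pd.DataFrame({"student_id": [1, 2], "cluster": [0, 2]})

# Left join keeps every topic interaction and attaches its student's cluster
merged = topic_df.merge(student_clusters, on="student_id", how="left")
print(merged["cluster"].tolist())  # [0, 0, 2]
```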


3. Defining Learning Difficulty

A realistic and interpretable rule is used to define learning difficulty.

A student is considered to be struggling on a topic if:

  • avg_prev_score < 65, AND
  • time_ratio > 1.2

This creates a binary target variable:

  • 1 → struggling
  • 0 → not struggling
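The rule translates directly into a vectorized expression. The target column name `struggling` is an assumption; the thresholds are the ones stated above.

```python
import pandas as pd

df = pd.DataFrame({
    "avg_prev_score": [60.0, 60.0, 70.0],
    "time_ratio": [1.5, 1.0, 1.5],
})

# struggling = low prior performance AND unusually high time spent
df["struggling"] = ((df["avg_prev_score"] < 65) & (df["time_ratio"] > 1.2)).astype(int)
print(df["struggling"].tolist())  # [1, 0, 0]
```

Note that both conditions must hold: the second and third rows each satisfy only one condition, so they are labeled 0.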

4. Difficulty Prediction (Supervised Learning)

To predict learning difficulty, Logistic Regression is used because:

  • The target variable is binary
  • The model is interpretable
  • It aligns with foundational machine learning principles

Training process:

  • Train–test split to evaluate generalization
  • Feature scaling fitted on the training split only (to avoid data leakage)
  • Model trained using:
    • Topic-level performance indicators
    • Effort-related features
    • Student learning behavior cluster
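The training process above can be sketched as follows, on toy data labeled with the difficulty rule from section 3. Feature and column names are assumptions; note the scaler is fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "avg_prev_score": rng.uniform(40, 95, 500),
    "time_ratio": rng.uniform(0.5, 2.0, 500),
    "attempts": rng.integers(1, 5, 500),
    "cluster": rng.integers(0, 3, 500),  # behavior cluster from the K-Means step
})
df["struggling"] = ((df["avg_prev_score"] < 65) & (df["time_ratio"] > 1.2)).astype(int)

X, y = df.drop(columns="struggling"), df["struggling"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)
acc = model.score(scaler.transform(X_test), y_test)
```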

5. Baseline Comparison

To validate whether clustering added value:

  • A baseline model was trained without the cluster feature
  • Its performance was compared with the cluster-aware model

Observations

  • Recall for struggling students remained the same
  • Precision improved when behavioral clustering was included

This indicates that learning behavior context reduced false positives without missing at-risk students.
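The comparison mechanics can be sketched as below: train twice on the same split, once without and once with the `cluster` feature, and compare precision and recall for the struggling class. The data here is random toy data, so the resulting numbers are not meaningful; only the procedure is.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "avg_prev_score": rng.uniform(40, 95, 500),
    "time_ratio": rng.uniform(0.5, 2.0, 500),
    "attempts": rng.integers(1, 5, 500),
    "cluster": rng.integers(0, 3, 500),
})
y = ((df["avg_prev_score"] < 65) & (df["time_ratio"] > 1.2)).astype(int)

results = {}
for name, cols in {
    "baseline": ["avg_prev_score", "time_ratio", "attempts"],
    "cluster_aware": ["avg_prev_score", "time_ratio", "attempts", "cluster"],
}.items():
    X_tr, X_te, y_tr, y_te = train_test_split(
        df[cols], y, test_size=0.2, stratify=y, random_state=42
    )
    pred = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te)
    results[name] = {
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred, zero_division=0),
    }
```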


6. Model Interpretation

Logistic regression coefficients were analyzed to understand model behavior.

Key insights:

  • A higher time ratio strongly increases the predicted difficulty risk
  • Strong prior performance reduces the predicted difficulty risk
  • Learning behavior clusters add meaningful contextual information

These observations confirm that the model learned reasonable and explainable patterns.
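Coefficient analysis can be sketched as below. On features standardized before fitting, a positive coefficient pushes the prediction toward "struggling". The toy labels follow the difficulty rule from section 3, so the signs should match the insights above; names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = pd.DataFrame({
    "avg_prev_score": rng.uniform(40, 95, 500),
    "time_ratio": rng.uniform(0.5, 2.0, 500),
    "attempts": rng.integers(1, 5, 500).astype(float),
})
y = ((X["avg_prev_score"] < 65) & (X["time_ratio"] > 1.2)).astype(int)

scaler = StandardScaler().fit(X)
model = LogisticRegression().fit(scaler.transform(X), y)

# Positive coefficient -> feature increases the odds of "struggling"
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
print(coefs)
```

Because the features are standardized, the coefficient magnitudes are comparable, which makes this a reasonable first-pass importance view for a linear model.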


Key Takeaways

  • Combining unsupervised and supervised learning improves prediction quality
  • Behavioral context helps reduce noisy predictions
  • Baseline comparison is essential for validating ideas
  • Model interpretation is as important as evaluation metrics

Future Improvements

  • Probability-based early warning thresholds
  • Better handling of class imbalance
  • API deployment using FastAPI
  • Evaluation on real-world educational datasets

About

This project uses unsupervised machine learning to cluster students into different learning patterns. A synthetic dataset of student–topic interactions was created, processed, and scaled. K-Means clustering was applied to identify fast, average, and struggling learners based on learning behavior metrics.
