Seeds Dataset Clustering Analysis

This analysis was done as part of my master's studies.

Introduction

The Seeds Dataset contains measurements of geometrical properties of wheat kernels belonging to three different varieties: Kama, Rosa, and Canadian. Each instance in the dataset consists of 7 features measured from wheat kernel images:

Area
Perimeter
Compactness
Length of kernel
Width of kernel
Asymmetry coefficient
Length of kernel groove

The dataset comprises 210 samples, with 70 samples for each wheat variety.

Clustering Methods

Elbow Method for Determining Optimal Number of Clusters
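
The elbow method computes the within-cluster sum of squares (inertia) for a range of K values and picks the K at which the curve stops dropping sharply. As a minimal sketch of one way to locate that bend automatically, the snippet below picks the point farthest from the straight line joining the first and last points of the curve. The function name `elbow_k` and the inertia values are made up for illustration; they are not results from this project, and reading the bend off a plot is an equally valid approach.

```python
def elbow_k(ks, inertias):
    """Return the k whose (k, inertia) point lies farthest from the chord
    joining the first and last points of the curve (a common heuristic).

    Assumes the inertias decrease from the first to the last k.
    """
    # Rescale both axes to [0, 1] so that k and inertia are comparable.
    k_span = ks[-1] - ks[0]
    w_span = inertias[0] - inertias[-1]
    xs = [(k - ks[0]) / k_span for k in ks]
    ys = [(w - inertias[-1]) / w_span for w in inertias]
    # After rescaling, the chord runs from (0, 1) to (1, 0), i.e. x + y = 1,
    # so the distance of a point to the chord is proportional to |1 - x - y|.
    gaps = [abs(1.0 - x - y) for x, y in zip(xs, ys)]
    return ks[gaps.index(max(gaps))]


# Toy inertia curve (illustrative numbers only, NOT measured on the Seeds data).
ks = [1, 2, 3, 4, 5, 6]
toy_inertias = [100.0, 40.0, 15.0, 12.0, 10.0, 9.0]
best_k = elbow_k(ks, toy_inertias)
print("suggested number of clusters:", best_k)
```

The heuristic is only a convenience: when the curve has no clear bend, the suggested K should be checked against other criteria such as the silhouette score.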

Centroid-based Clustering: K-Means

K-Means is a partitioning clustering algorithm that divides data into K distinct, non-overlapping clusters.
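
To make the partitioning concrete, here is a minimal from-scratch sketch of the standard K-Means iteration (Lloyd's algorithm): assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. The function `kmeans`, its evenly spaced initialisation, and the toy points are assumptions for illustration; a library implementation such as scikit-learn's `KMeans` would normally be used on the real data.

```python
import math


def kmeans(points, k, max_iter=100):
    """Cluster `points` (tuples of floats) into k groups.

    Returns (labels, centroids, inertia), where inertia is the
    within-cluster sum of squared distances.
    """
    # Simplified, deterministic initialisation (an assumption of this sketch):
    # take k evenly spaced points. Real implementations usually use random
    # or k-means++ seeding.
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    inertia = sum(math.dist(p, centroids[lab]) ** 2
                  for p, lab in zip(points, labels))
    return labels, centroids, inertia


# Toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids, inertia = kmeans(points, k=2)
print(labels, round(inertia, 3))
```

The returned inertia is the quantity that the elbow method tracks as K varies.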

Agglomerative Hierarchical Clustering

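Hierarchical clustering builds a tree of clusters by progressively merging or splitting groups. Agglomerative clustering follows a bottom-up approach: each sample starts as its own cluster, and the two closest clusters are merged repeatedly until a single cluster (or the desired number of clusters) remains.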

Different Linkage Methods Analysis

Single Linkage
- Defines the distance between two clusters as the minimum distance between any pair of points, one from each cluster
Complete Linkage
- Defines the distance between two clusters as the maximum distance between any pair of points, one from each cluster
Average Linkage
- Defines the distance between two clusters as the average distance over all pairs of points, one from each cluster
Ward Linkage
- Merges the pair of clusters that gives the smallest increase in the total within-cluster sum of squared distances to the cluster centroids
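
The four linkage definitions above can be written down almost literally. The sketch below is a naive, from-scratch illustration; the helper names (`single_link`, `complete_link`, `average_link`, `ward_cost`, `agglomerate`) and the toy points are hypothetical. The Ward cost uses the standard identity that merging clusters A and B raises the within-cluster sum of squares by |A||B| / (|A| + |B|) times the squared distance between their centroids.

```python
import math
from itertools import product


def single_link(a, b):
    """Minimum distance over all cross-cluster pairs."""
    return min(math.dist(p, q) for p, q in product(a, b))


def complete_link(a, b):
    """Maximum distance over all cross-cluster pairs."""
    return max(math.dist(p, q) for p, q in product(a, b))


def average_link(a, b):
    """Mean distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid(cluster):
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))


def ward_cost(a, b):
    """Increase in within-cluster sum of squares caused by merging a and b."""
    weight = len(a) * len(b) / (len(a) + len(b))
    return weight * math.dist(centroid(a), centroid(b)) ** 2


def agglomerate(points, n_clusters, linkage):
    """Naive bottom-up clustering: merge the closest pair until
    n_clusters remain. Fine for toy data, slow for large inputs."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]],
                                                 clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy clusters to compare the four criteria on the same pair.
a = [(0.0, 0.0), (0.0, 1.0)]
b = [(3.0, 0.0), (3.0, 1.0)]
for name, fn in [("single", single_link), ("complete", complete_link),
                 ("average", average_link), ("ward", ward_cost)]:
    print(name, round(fn(a, b), 3))

# Toy run: four nearby points and one distant point, reduced to 2 clusters.
result = agglomerate(a + b + [(10.0, 10.0)], 2, single_link)
```

Passing a different linkage function to `agglomerate` is enough to compare how the criteria group the same points.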

Dendrogram Analysis

A dendrogram is a tree-like diagram that records the sequence of merges in hierarchical clustering. It visualizes:

The hierarchical relationship between clusters
The distance or dissimilarity between merged clusters (height of the branch)
The order in which clusters are formed

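By cutting the dendrogram horizontally at a chosen height, we can obtain a flat clustering with any desired number of clusters: every merge above the cut is undone, and every merge below it is kept.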
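
As a rough illustration of cutting at a height, the sketch below runs single-linkage merging on toy one-dimensional values and stops as soon as the next merge would exceed the cut height. The function `cut_dendrogram` and the values are hypothetical. With SciPy, the same idea is usually expressed with `scipy.cluster.hierarchy.linkage`, `dendrogram`, and `fcluster`.

```python
def single_linkage(a, b):
    """Minimum absolute difference between two groups of 1-D values."""
    return min(abs(x - y) for x in a for y in b)


def cut_dendrogram(values, height):
    """Merge the closest clusters until the next merge would be above
    `height`. Returns (clusters, merges); each merge is recorded as
    (merge_height, left_cluster, right_cluster), in the order performed."""
    clusters = [[v] for v in values]
    merges = []
    while len(clusters) > 1:
        dist, i, j = min(
            (single_linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if dist > height:
            break  # every remaining merge lies above the cut
        merges.append((dist, clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges


# Toy values: two tight pairs and one outlier.
values = [0.0, 1.0, 5.0, 6.0, 20.0]
for height in (0.5, 2.0, 5.0):
    clusters, merges = cut_dendrogram(values, height)
    print(height, len(clusters), clusters)
```

Raising the cut height keeps more merges and therefore yields fewer clusters; the recorded merge heights are what a dendrogram draws as branch heights.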

Evaluation Metrics

Internal Evaluation: Silhouette Score

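The silhouette score measures how similar an object is to its own cluster compared to other clusters. It ranges from -1 to 1: values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest that points have been assigned to the wrong cluster.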
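
For a single point i, the silhouette value is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other points in its own cluster and b(i) is the smallest mean distance to the points of any other cluster. The sketch below computes the mean of s(i) from scratch on toy points; the function name `mean_silhouette` is hypothetical, and scikit-learn's `silhouette_score` would normally be used instead.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette value over all points.

    Requires at least two clusters. Points in singleton clusters
    get a silhouette of 0, following the usual convention.
    """
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        same = [math.dist(p, q)
                for j, (q, lab) in enumerate(zip(points, labels))
                if lab == own and j != i]
        if not same:
            scores.append(0.0)
            continue
        a = sum(same) / len(same)  # cohesion: mean intra-cluster distance
        other_means = []
        for other in set(labels) - {own}:
            dists = [math.dist(p, q)
                     for q, lab in zip(points, labels) if lab == other]
            other_means.append(sum(dists) / len(dists))
        b = min(other_means)  # separation: nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data: two tight pairs far apart.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
good = mean_silhouette(points, [0, 0, 1, 1])  # sensible grouping
bad = mean_silhouette(points, [0, 1, 0, 1])   # deliberately mixed grouping
print(round(good, 3), round(bad, 3))
```

The sensible grouping scores close to 1, while the mixed grouping scores below 0, which is the behaviour the metric is designed to expose.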

External Evaluation: Purity Score

Purity measures how homogeneous each cluster is with respect to the true classes.
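
Purity is computed by counting, for every cluster, how many of its members belong to that cluster's most common true class, then dividing the total by the number of samples. A minimal sketch follows; the function name `purity_score` and the toy labels are assumptions for illustration, not the project's own code or results.

```python
from collections import Counter, defaultdict


def purity_score(cluster_ids, true_labels):
    """Fraction of samples that belong to the majority class of their cluster."""
    class_counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        class_counts[cluster][label] += 1
    # For each cluster, keep only the size of its largest class.
    majority_total = sum(max(counts.values())
                         for counts in class_counts.values())
    return majority_total / len(true_labels)


# Toy example: two clusters over six samples.
cluster_ids = [0, 0, 0, 1, 1, 1]
true_labels = ["Kama", "Kama", "Rosa", "Rosa", "Rosa", "Canadian"]
purity = purity_score(cluster_ids, true_labels)
print(round(purity, 3))
```

Purity rises as the number of clusters grows and reaches 1 when every sample is its own cluster, so it is only meaningful when comparing clusterings with the same number of clusters.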

Classification Metrics

Since the Seeds dataset includes true labels, we can evaluate clustering as if it were a classification task, once each cluster has been mapped to a class label (cluster numbers themselves are arbitrary).

Clustering + Labels --> Classification

Accuracy
Precision
Recall
F1 Score
Confusion Matrix
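
One way to turn cluster assignments into predictions is to search for the one-to-one mapping from clusters to classes that maximises the number of matches, and then compute the usual classification metrics. The sketch below does this by brute force over permutations, which is adequate for three clusters. The mapping strategy, all function names, and the toy labels are assumptions for illustration; the project may map clusters differently (for example by majority vote), and `scipy.optimize.linear_sum_assignment` is the scalable alternative for many clusters.

```python
from itertools import permutations


def best_mapping(cluster_ids, true_labels):
    """Find the one-to-one cluster -> class mapping with the most matches.

    Brute force over permutations; assumes there are no more clusters
    than classes.
    """
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(true_labels))
    best, best_hits = None, -1
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == t for c, t in zip(cluster_ids, true_labels))
        if hits > best_hits:
            best, best_hits = mapping, hits
    return best


def confusion_matrix(true_labels, predicted, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_metrics(matrix, i):
    """Precision, recall and F1 for the class at row/column i."""
    tp = matrix[i][i]
    predicted_i = sum(row[i] for row in matrix)  # column total
    actual_i = sum(matrix[i])                    # row total
    precision = tp / predicted_i if predicted_i else 0.0
    recall = tp / actual_i if actual_i else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Toy example (illustrative labels, not results from the Seeds data).
true_labels = ["Kama"] * 3 + ["Rosa"] * 3 + ["Canadian"] * 3
cluster_ids = [2, 2, 0, 0, 0, 0, 1, 1, 1]

mapping = best_mapping(cluster_ids, true_labels)
predicted = [mapping[c] for c in cluster_ids]
classes = sorted(set(true_labels))
matrix = confusion_matrix(true_labels, predicted, classes)
accuracy = sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)

print(mapping)
print(matrix, round(accuracy, 3))
for i, name in enumerate(classes):
    print(name, [round(v, 3) for v in per_class_metrics(matrix, i)])
```

A one-to-one mapping keeps two clusters from claiming the same class, which a plain majority vote does not guarantee.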
