This project was completed as part of my master's studies.
The Seeds Dataset contains measurements of geometrical properties of wheat kernels belonging to three different varieties: Kama, Rosa, and Canadian. Each instance in the dataset consists of 7 features measured from wheat kernel images:
- Area
- Perimeter
- Compactness
- Length of kernel
- Width of kernel
- Asymmetry coefficient
- Length of kernel groove
The dataset comprises 210 samples, with 70 samples for each wheat variety.
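K-Means is a partitioning clustering algorithm that divides data into K distinct, non-overlapping clusters. It assigns each point to the cluster with the nearest centroid (mean) and recomputes the centroids until the assignments stop changing, which locally minimizes the within-cluster sum of squared distances.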
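The assign-and-update loop can be sketched in plain Python. This is a minimal illustration, not the implementation used in this project: the function name `kmeans`, the random initialisation from data points, and the parameters `n_iter` and `seed` are assumptions made for the example.

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: initialise from k data points
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids
```

For the Seeds data, `points` would hold the 7-feature rows and `k` would be 3. Because the result depends on the initial centroids, it is common to run several seeds and keep the best result.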
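Hierarchical clustering builds a tree of clusters by progressively merging or splitting groups. Agglomerative clustering follows a bottom-up approach: every point starts as its own cluster, and the two closest clusters are merged repeatedly until only one remains. The linkage criterion defines what "closest" means: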
- Single Linkage
  - Defines the distance between two clusters as the minimum distance between any pair of points, one from each cluster
- Complete Linkage
  - Defines the distance between two clusters as the maximum distance between any pair of points, one from each cluster
- Average Linkage
  - Defines the distance between two clusters as the average distance over all pairs of points, one from each cluster
- Ward Linkage
  - Merges the pair of clusters that gives the smallest increase in the total within-cluster sum of squared distances
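The four criteria can be compared side by side with a small helper. This is a sketch under stated assumptions: `linkage_distance` is a hypothetical name, clusters are lists of numeric tuples, and Euclidean distance is used. For Ward, the value returned is the increase in the within-cluster sum of squares caused by the merge; libraries may report a differently scaled quantity.

```python
import math

def linkage_distance(a, b, method="single"):
    """Distance between clusters a and b (lists of numeric tuples)."""
    pair_dists = [math.dist(p, q) for p in a for q in b]
    if method == "single":
        return min(pair_dists)
    if method == "complete":
        return max(pair_dists)
    if method == "average":
        return sum(pair_dists) / len(pair_dists)
    if method == "ward":
        # Increase in within-cluster sum of squares if a and b are merged:
        # |a||b| / (|a| + |b|) * ||centroid(a) - centroid(b)||^2
        ca = tuple(sum(col) / len(a) for col in zip(*a))
        cb = tuple(sum(col) / len(b) for col in zip(*b))
        return len(a) * len(b) / (len(a) + len(b)) * math.dist(ca, cb) ** 2
    raise ValueError(f"unknown method: {method}")
```

Single linkage tends to produce long, chained clusters, while complete and Ward linkage favour compact ones, so the same data can yield quite different trees.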
A dendrogram is a tree-like diagram that records the sequence of merges in hierarchical clustering. It visualizes:
- The hierarchical relationship between clusters
- The distance or dissimilarity between merged clusters (height of the branch)
- The order in which clusters are formed
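By cutting the dendrogram horizontally at a chosen height, we obtain a flat clustering: each branch crossed by the cut becomes one cluster, so lower cuts yield more clusters and higher cuts yield fewer.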
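To make the idea of cutting concrete, here is a naive sketch that records the merge history of single-linkage agglomerative clustering and then applies only the merges at or below a given height. The names `agglomerate` and `cut` are hypothetical, points are assumed to be numeric tuples, and the quadratic search is written for readability rather than speed.

```python
import math

def agglomerate(points):
    """Single-linkage merges; returns a list of (height, members_a, members_b)."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance.
        h, i, j = min(
            (min(math.dist(points[p], points[q])
                 for p in clusters[i] for q in clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((h, clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return merges

def cut(merges, n_points, height):
    """Apply only the merges at or below `height`; return one label per point."""
    labels = list(range(n_points))
    for h, a, b in merges:
        if h > height:
            break  # single-linkage merge heights never decrease
        for idx in b:
            labels[idx] = labels[a[0]]  # members of a already share one label
    return labels
```

Calling `cut` with increasing heights walks up the dendrogram: the number of distinct labels falls from the number of points down to one.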
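The silhouette score measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It ranges from -1 to 1: values near 1 indicate well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.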
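For a single point, the silhouette coefficient is `(b - a) / max(a, b)`, where `a` is the mean distance to the other members of its own cluster and `b` is the mean distance to the nearest other cluster. The sketch below averages this over all points. It assumes Euclidean distance, numeric tuples as points, and at least two clusters; the convention of scoring singleton clusters as 0 is also an assumption of this example.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so divide by len(own) - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items() if other != l
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Since the score needs no true labels, it can be used to compare different values of K or different linkage criteria on the same data.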
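Purity measures how homogeneous each cluster is with respect to the true classes: each cluster is assigned its most frequent class, and purity is the fraction of all samples that belong to the class assigned to their cluster. It ranges from 0 to 1.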
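Purity can be computed by counting, for each cluster, how many samples belong to its most frequent true class. A minimal sketch, where `purity` is a hypothetical helper name and the inputs are two equal-length label sequences:

```python
from collections import Counter

def purity(true_labels, cluster_labels):
    """Fraction of samples that belong to the majority class of their cluster."""
    by_cluster = {}
    for t, c in zip(true_labels, cluster_labels):
        by_cluster.setdefault(c, Counter())[t] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)
```

Note that purity rises as the number of clusters grows (one cluster per sample gives a purity of 1), so it is only meaningful when comparing results with the same number of clusters.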
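Since the Seeds dataset includes true labels, we can evaluate the clustering as if it were a classification task: each cluster is mapped to a class label (for example, its majority class), and the mapped labels are compared with the true ones.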
Clustering + Labels --> Classification
- Accuracy
- Precision
- Recall
- F1 Score
- Confusion Matrix
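The metrics listed above can be sketched as follows. The mapping rule used here (each cluster takes its majority class) is one possible choice and an assumption of this example; it allows two clusters to map to the same class, whereas a one-to-one matching would avoid that. The function names are hypothetical.

```python
from collections import Counter

def map_clusters_to_classes(true_labels, cluster_labels):
    """Replace each cluster id with the most common true class inside it."""
    votes = {}
    for t, c in zip(true_labels, cluster_labels):
        votes.setdefault(c, Counter())[t] += 1
    mapping = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}
    return [mapping[c] for c in cluster_labels]

def classification_report(true_labels, predicted):
    """Return accuracy, confusion matrix, and per-class precision/recall/F1."""
    classes = sorted(set(true_labels))
    # confusion[t][p] counts samples of true class t predicted as class p.
    confusion = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(true_labels, predicted):
        confusion[t][p] += 1
    accuracy = sum(confusion[c][c] for c in classes) / len(true_labels)
    per_class = {}
    for c in classes:
        tp = confusion[c][c]
        predicted_c = sum(confusion[t][c] for t in classes)  # column total
        actual_c = sum(confusion[c].values())                # row total
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / actual_c if actual_c else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[c] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, confusion, per_class
```

With the majority-class mapping, the resulting accuracy equals the purity of the clustering, while the per-class precision and recall show which varieties are being confused with each other.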