This document describes the end-to-end process followed during exploratory data analysis (EDA) for the Bodyfat Dataset. It includes data cleaning, feature engineering, clustering, visualization, and interactive analysis.
Essential libraries are imported for:
- Data handling:
pandas,polars,numpy - Visualization:
matplotlib,seaborn,plotly - Machine learning:
scikit-learn,umap-learn - System utilities:
os,warnings,pathlib
Environment settings like LOKY_MAX_CPU_COUNT are configured and common warnings are suppressed to keep logs clean.
- The dataset is loaded using Polars for speed and converted to Pandas for compatibility.
- The file path is kept relative for flexibility across environments.
- Suspected inversions (
Height < 1.0,Weight > 100) are swapped.
- Records with extremely low height (< 0.5m) or weight (< 10kg) are removed.
- All rows with missing values are dropped to ensure clean inputs for analysis and modeling.
Four key derived features are created:
- BMI =
Weight / (Height^2) - Forearm to Biceps Ratio =
Forearm / Biceps - Waist to Height Ratio =
Abdomen / Height - Waist to Hip Ratio =
Abdomen / Hip
Additionally:
- BMI outliers (> 100) are identified and dropped to prevent distortion.
- Pairplots: To observe linear/nonlinear relationships.
- Heatmaps: Feature-wise correlation matrices.
- Histograms with KDE: Including Mean, Mode, and Outlier indicators.
- All numeric features are standardized using
StandardScaler.
- K-Means (k=3): Partitioning into 3 clusters.
- DBSCAN: Density-based clustering.
- Silhouette Scores are calculated to evaluate clustering performance.
- t-SNE and UMAP for visualizing high-dimensional data in 2D.
- Hierarchical Clustering using Ward linkage for dendrogram visualization.
- Scatter Plots: BodyFat vs BMI, colored by clusters.
- Box Plots: BMI, Abdomen, and Weight across clusters.
- Age is binned into: 18–30, 31–45, 46–60, and 60+.
- Violin Plots: BodyFat distribution by AgeGroup and Sex.
- Boxplots: Key features across AgeGroups and Sex.
- Sex-specific Correlation Heatmaps for top features.
- Built using Plotly Express.
- An interactive scatter plot: BMI vs BodyFat colored by clusters with hover data.
- Each numeric column is scanned for outliers using the rule:
|x - mean| > 2 * std. - Binary flags are added for each outlier column.
- Age vs BodyFat: Colored by Sex with LOWESS regression.
- BMI vs Abdomen: Cluster-based visual to analyze abdominal obesity patterns.
- BMI and Waist-based ratios are strong body composition indicators.
- Sex and age play crucial roles in body fat distribution.
- Unsupervised clustering highlights natural groupings in biometric data.
- t-SNE, UMAP, and dendrograms confirm the internal structure of the dataset.