This repository contains data preprocessing scripts and unit tests for handling the AI4M Dataset. The project includes:
- Non-encoded preprocessing: Filling missing values with statistical measures.
- One-hot encoding preprocessing: Transforming categorical features into numerical format.
- Unit tests: Ensuring correctness of preprocessing steps.
Key features:
- Handles missing values (mode for categorical, median for numerical).
- Performs one-hot encoding while managing unknown categories.
- Includes unit tests to validate preprocessing correctness.
To run this project locally, follow these steps:
- Clone the repository:
      git clone https://github.com/your-username/your-repo.git
      cd your-repo
- Install dependencies (unittest ships with the Python standard library, so it is not installed with pip):
      pip install pandas numpy scikit-learn
Run the preprocessing scripts on your dataset:
    from preprocessing import preprocess_non_encoded, preprocess_one_hot
    import pandas as pd
    df = pd.read_csv("AI4M Dataset.csv")
    non_encoded_df = preprocess_non_encoded(df)
    one_hot_df = preprocess_one_hot(df)
Ensure that preprocessing functions work correctly:
    python -m unittest test_preprocessing.py
The repository is laid out as follows:
├── preprocessing.py # Preprocessing functions
├── test_preprocessing.py # Unit tests for preprocessing
├── AI4M Dataset.csv # Sample dataset
├── README.md # Project documentation
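As a rough idea of the kind of check a file like test_preprocessing.py could contain, here is a small self-contained sketch. The function `fill_numeric_median` is a hypothetical stand-in defined inline so the example runs on its own; the real tests would presumably import from preprocessing.py instead, and the actual test cases may differ.

```python
import unittest

import pandas as pd


def fill_numeric_median(df):
    """Hypothetical stand-in for a preprocessing function under test."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    # fillna with a Series of medians fills each column with its own median
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    return out


class TestFillNumericMedian(unittest.TestCase):
    def setUp(self):
        # Toy frame made up for the example
        self.df = pd.DataFrame(
            {"x": [1.0, None, 3.0], "label": ["a", "b", None]}
        )

    def test_numeric_gaps_are_filled_with_median(self):
        result = fill_numeric_median(self.df)
        self.assertEqual(result["x"].tolist(), [1.0, 2.0, 3.0])

    def test_input_is_not_mutated(self):
        fill_numeric_median(self.df)
        self.assertTrue(self.df["x"].isna().any())

    def test_non_numeric_columns_are_untouched(self):
        result = fill_numeric_median(self.df)
        self.assertTrue(result["label"].isna().any())
```

Saved as a module, a test case like this is discovered and run by `python -m unittest` in the same way as the command shown earlier.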
Feel free to open an issue or submit a pull request if you have improvements!