Scikit Learn Clone is a custom implementation of various machine learning algorithms and utilities inspired by Scikit-Learn. This project is designed to provide a comprehensive set of tools for data preprocessing, model selection, evaluation, and supervised learning.
The project is organized into several modules, each focusing on a specific aspect of machine learning. Below is the detailed structure:
```
scikit_learn_clone/
├── data/
├── ensemble_methods/
│   ├── __init__.py
│   ├── adaBoost.py
│   ├── bagging.py
│   ├── gradient_boosting.py
│   ├── randomForest.py
│   └── stacking.py
├── metrics_model_evaluation/
│   ├── __init__.py
│   ├── accuracy.py
│   ├── confusion_matrix.py
│   ├── f1_score.py
│   ├── mean_absolute_error.py
│   ├── mean_squared_error.py
│   ├── precision.py
│   ├── r2_score.py
│   └── recall.py
├── model_selection/
│   ├── __init__.py
│   ├── cross_validation.py
│   ├── grid_search.py
│   ├── kfold.py
│   ├── make_scorer.py
│   ├── param_grid.py
│   └── train_test_split.py
├── preprocessing/
│   ├── __init__.py
│   ├── impute_missing_values_mean.py
│   ├── impute_missing_values_median.py
│   ├── impute_missing_values_mode.py
│   ├── labelencoder.py
│   ├── normalize_features.py
│   ├── numerical_categorical_variable.py
│   ├── onehotencoder.py
│   ├── outliers.py
│   ├── scale_features_min_max.py
│   ├── scale_features_standard.py
│   └── select_features.py
├── supervised_learning/
│   ├── __init__.py
│   ├── DecisionTree.py
│   ├── knn.py
│   ├── LinearRegression.py
│   ├── LogisticRegression.py
│   ├── NaiveBayes.py
│   └── NeuralNetwork.py
├── testing/
├── utilities/
│   ├── __init__.py
│   ├── Estimator.py
│   ├── MetaEstimator.py
│   ├── ModelSelector.py
│   ├── Pipeline.py
│   ├── Predictor.py
│   └── Transformer.py
├── .gitignore
├── README.md
└── setup.py
```
To install the package, use pip:

```bash
pip install sktlearn-clone
```
Here are some examples of how to use the various modules in this package.
Training and evaluating a decision tree classifier:

```python
from scikit_learn_clone.supervised_learning.DecisionTree import DecisionTreeClassifier
from scikit_learn_clone.model_selection.train_test_split import train_test_split
from scikit_learn_clone.metrics_model_evaluation.accuracy import accuracy_score

# Sample dataset
X = [[0, 0], [1, 1]]
y = [0, 1]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

# Initialize and train the classifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```
Running k-fold cross-validation with the linear regression model:

```python
from scikit_learn_clone.model_selection.kfold import KFold
from scikit_learn_clone.supervised_learning.LinearRegression import LinearRegression

# Sample dataset
X = [[i] for i in range(10)]
y = [2 * i for i in range(10)]

# Initialize KFold
kf = KFold(n_splits=5)

# Initialize model
model = LinearRegression()

# Perform k-fold cross-validation
for train_index, test_index in kf.split(X):
    X_train, X_test = [X[i] for i in train_index], [X[i] for i in test_index]
    y_train, y_test = [y[i] for i in train_index], [y[i] for i in test_index]
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(f"Fold results: {predictions}")
```
The main features, grouped by module:

Ensemble methods (a usage sketch follows the list):
- AdaBoost: Adaptive boosting, re-weighting training samples toward previous mistakes.
- Bagging: Bootstrap aggregating over resampled subsets of the data.
- Gradient Boosting: Sequentially fits new models to the ensemble's residual errors.
- Random Forest: An ensemble of decision trees.
- Stacking: Stacked generalization, combining base models via a meta-learner.
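As a rough illustration, the random forest might be used as follows. This is a minimal sketch: the class name `RandomForestClassifier` and the `n_estimators` parameter are guesses based on `randomForest.py`, not a confirmed API.

```python
# Hypothetical sketch: the class name and constructor parameter are assumed
# from the file layout in ensemble_methods/, not taken from the actual module.
from scikit_learn_clone.ensemble_methods.randomForest import RandomForestClassifier

X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 1, 0]

forest = RandomForestClassifier(n_estimators=10)  # number of trees (assumed parameter)
forest.fit(X, y)
print(forest.predict([[1, 0]]))
```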
Metrics and model evaluation (see the sketch after this list):
- Accuracy: Fraction of predictions that are correct.
- Confusion Matrix: Tabulates predicted versus actual classes.
- F1 Score: Harmonic mean of precision and recall.
- Mean Absolute Error: Average absolute prediction error (regression).
- Mean Squared Error: Average squared prediction error (regression).
- Precision: Fraction of predicted positives that are truly positive.
- R2 Score: Coefficient of determination.
- Recall: Fraction of actual positives correctly identified.
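For example, the classification metrics can be combined to summarize one set of predictions. `accuracy_score` mirrors the earlier example; `precision_score`, `recall_score`, and `f1_score` are assumed names inferred from the module files.

```python
from scikit_learn_clone.metrics_model_evaluation.accuracy import accuracy_score
# The three imports below use assumed function names inferred from the file names.
from scikit_learn_clone.metrics_model_evaluation.precision import precision_score
from scikit_learn_clone.metrics_model_evaluation.recall import recall_score
from scikit_learn_clone.metrics_model_evaluation.f1_score import f1_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```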
Model selection (see the sketch after this list):
- Cross-Validation: Splits the dataset into k consecutive folds for repeated train/test evaluation.
- Grid Search: Exhaustive search over specified parameter values.
- K-Fold: K-fold cross-validation iterator.
- Make Scorer: Converts metrics into scoring callables.
- Param Grid: Defines the parameter grid for a search.
- Train-Test Split: Splits arrays or matrices into random train and test subsets.
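A grid-search sketch, assuming `grid_search.py` exposes a `GridSearch` class that accepts an estimator, a parameter grid, and a fold count; the class name, constructor signature, `KNNClassifier` name, and `best_params_` attribute are all assumptions for illustration.

```python
# Hypothetical sketch: every name below except the module paths is assumed.
from scikit_learn_clone.model_selection.grid_search import GridSearch
from scikit_learn_clone.supervised_learning.knn import KNNClassifier

X = [[i] for i in range(20)]
y = [i % 2 for i in range(20)]

# Exhaustively try each candidate value of k with 5-fold cross-validation.
search = GridSearch(estimator=KNNClassifier(), param_grid={"k": [1, 3, 5]}, cv=5)
search.fit(X, y)
print(search.best_params_)  # attribute name assumed (scikit-learn convention)
```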
Preprocessing (a usage sketch follows the list):
- Imputation: Fills missing values using the mean, median, or mode.
- Label Encoding: Encodes categorical features as an integer array.
- Normalization: Scales input vectors individually to unit norm.
- One-Hot Encoding: Encodes categorical integer features as a one-hot numeric array.
- Outlier Detection: Identifies and handles outliers in the data.
- Feature Scaling: Standardizes features to zero mean and unit variance, or rescales them to a min-max range.
- Feature Selection: Selects features based on importance or correlation.
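A preprocessing sketch, assuming the imputation and scaling modules expose functions named after their files (`impute_missing_values_mean`, `scale_features_standard`); the exact names and signatures are assumptions.

```python
# Hypothetical sketch: function names are guessed from the file names.
from scikit_learn_clone.preprocessing.impute_missing_values_mean import impute_missing_values_mean
from scikit_learn_clone.preprocessing.scale_features_standard import scale_features_standard

X = [[1.0, 2.0], [None, 4.0], [3.0, None]]

X = impute_missing_values_mean(X)  # fill missing entries with column means
X = scale_features_standard(X)     # standardize to zero mean, unit variance
print(X)
```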
Supervised learning (a usage sketch follows the list):
- Decision Tree: Decision tree classifier.
- k-Nearest Neighbors: Predicts from the k closest training points.
- Linear Regression: Fits a linear model to continuous targets.
- Logistic Regression: Linear classifier with a sigmoid output.
- Naive Bayes: Probabilistic classifier assuming conditional feature independence.
- Neural Network: A basic neural network model.
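Logistic regression can be used the same way as the decision tree above. The `LogisticRegression` class name follows its module file, as `LinearRegression` does in the earlier example, but the hyperparameters shown are assumptions.

```python
from scikit_learn_clone.supervised_learning.LogisticRegression import LogisticRegression

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

# learning_rate and n_iterations are assumed hyperparameter names.
model = LogisticRegression(learning_rate=0.1, n_iterations=1000)
model.fit(X, y)
print(model.predict([[2.5]]))
```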
Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a new branch (`git checkout -b feature/your-feature`).
- Make your changes.
- Commit your changes (`git commit -m 'Add some feature'`).
- Push to the branch (`git push origin feature/your-feature`).
- Open a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.