From: Morten Hjorth-Jensen <[email protected]>
Subject: Logreg
Date: May 29, 2025 at 11:43:36 AM GMT+2
To: Morten Hjorth-Jensen <[email protected]>


Logistic Regression from Scratch

Logistic regression is a foundational binary classification technique that models the probability of a sample belonging to class 1 using the sigmoid (logistic) function. In binary classification it outputs a probability between 0 and 1, typically thresholded at 0.5 to decide the class label. For multiclass problems (more than two classes), we generalize logistic regression via the softmax function (also known as multinomial logistic regression), which produces a probability distribution over all classes. In both cases, training minimizes the cross-entropy loss (also called log-loss) using gradient descent. Accuracy, the fraction of correctly predicted samples, is commonly used to evaluate classification performance, alongside the cross-entropy value as a measure of model fit.
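
For concreteness, the sigmoid used in the binary case is \sigma(z) = 1/(1 + e^{-z}), applied to the linear score z = \mathbf{w}^T\mathbf{x}; a sample is assigned to class 1 when \sigma(z) \ge 0.5. The softmax used in the multiclass case is written out in the design list below.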

Our goal is to build an object-oriented Python implementation of logistic regression without scikit-learn, using only NumPy and Python's built-in csv module. The implementation will support both binary and multiclass modes, use gradient descent for training, and include methods for prediction and evaluation. It will also include synthetic data generation (to test the model) and functionality to export predictions and true labels to CSV.


Design and Features

 • Class Structure: A LogisticRegression class encapsulates the model parameters (weights, bias) and methods for training (fit) and prediction (predict, predict_prob). The design handles binary vs. multiclass internally by checking how many unique target labels are present.
 • Probability Functions: For binary classification we use the sigmoid function. For multiclass, we use the softmax function, defined as \hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}, which produces a valid probability distribution over the classes.
 • Training with Gradient Descent: We implement batch gradient descent on the cross-entropy loss. For binary logistic regression the loss is -\frac{1}{N}\sum_i [y_i\log(p_i) + (1-y_i)\log(1-p_i)], where p_i = \sigma(\mathbf{w}^T\mathbf{x}_i). For multiclass, the loss (categorical cross-entropy) for one-hot true labels y and predicted probabilities \hat{y} is -\frac{1}{N}\sum_i \sum_{k=1}^{C} y_{ik}\log(\hat{y}_{ik}). We compute the gradients of these losses w.r.t. the weights and iteratively update the model (the resulting gradient expressions are written out just after this list).
 • Prediction: The predict_prob method returns probabilities (sigmoid or softmax). The predict method applies a threshold (0.5) for binary or takes the argmax for multiclass.
 • Metrics: We include functions for accuracy and cross-entropy loss. Accuracy is defined as the fraction of correct predictions. We compute the binary or categorical cross-entropy to quantify model performance on the data.
 • Synthetic Data: We provide functions to generate synthetic datasets: (a) binary data with two Gaussian clusters, one per class; (b) multiclass data with several clusters, one per class. This allows testing the model.
 • CSV Export: Using Python's csv module, we can save the true labels and predicted labels (or probabilities) to CSV files for external analysis.
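
For reference, with the intercept column absorbed into the feature matrix X, the batch gradient of the cross-entropy loss takes the same compact form in both cases, and it is exactly the update implemented in fit below:
\nabla_{\mathbf{w}} L = \frac{1}{N} X^T (\mathbf{p} - \mathbf{y}) \quad \text{(binary; } \mathbf{p} \text{ holds the sigmoid probabilities)}
\nabla_{W} L = \frac{1}{N} X^T (\hat{Y} - Y) \quad \text{(multiclass; } \hat{Y} \text{ holds the softmax probabilities, } Y \text{ the one-hot labels)}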


These components together form a clean, modular implementation. The following sections detail the code with explanations.


LogisticRegression Class Implementation

Below is the LogisticRegression class. It initializes the hyperparameters (learning rate, number of epochs, etc.) and defines private methods for adding an intercept term, the sigmoid, and the softmax. The fit method detects binary vs. multiclass and runs gradient descent accordingly, updating the weights. The predict_prob and predict methods compute probabilities and class labels.
import numpy as np

class LogisticRegression:
    """
    Logistic Regression for binary and multiclass classification.
    """
    def __init__(self, lr=0.01, epochs=1000, fit_intercept=True, verbose=False):
        self.lr = lr                        # Learning rate for gradient descent
        self.epochs = epochs                # Number of iterations
        self.fit_intercept = fit_intercept  # Whether to add an intercept (bias)
        self.verbose = verbose              # Print loss during training if True
        self.weights = None
        self.multi_class = False            # Determined at fit time

    def _add_intercept(self, X):
        """Add intercept term (column of ones) to the feature matrix."""
        intercept = np.ones((X.shape[0], 1))
        return np.concatenate((intercept, X), axis=1)

    def _sigmoid(self, z):
        """Sigmoid function for binary logistic regression."""
        return 1 / (1 + np.exp(-z))

    def _softmax(self, Z):
        """Softmax function for multiclass logistic regression."""
        exp_Z = np.exp(Z - np.max(Z, axis=1, keepdims=True))
        return exp_Z / np.sum(exp_Z, axis=1, keepdims=True)

    def fit(self, X, y):
        """
        Train the logistic regression model using gradient descent.
        Supports binary (sigmoid) and multiclass (softmax) based on y.
        """
        X = np.array(X)
        y = np.array(y)
        n_samples, n_features = X.shape

        # Add intercept if needed
        if self.fit_intercept:
            X = self._add_intercept(X)
            n_features += 1

        # Determine classes and mode (binary vs. multiclass)
        unique_classes = np.unique(y)
        self.multi_class = len(unique_classes) > 2

        # ----- Multiclass case -----
        if self.multi_class:
            n_classes = len(unique_classes)
            # Map original labels to 0 ... n_classes-1
            class_to_index = {c: idx for idx, c in enumerate(unique_classes)}
            y_indices = np.array([class_to_index[c] for c in y])
            # Initialize weight matrix (features x classes)
            self.weights = np.zeros((n_features, n_classes))

            # One-hot encode y
            Y_onehot = np.zeros((n_samples, n_classes))
            Y_onehot[np.arange(n_samples), y_indices] = 1

            # Gradient descent
            for epoch in range(self.epochs):
                scores = X.dot(self.weights)   # Linear scores (n_samples x n_classes)
                probs = self._softmax(scores)  # Probabilities (n_samples x n_classes)
                # Compute gradient (features x classes)
                gradient = (1 / n_samples) * X.T.dot(probs - Y_onehot)
                # Update weights
                self.weights -= self.lr * gradient

                if self.verbose and epoch % 100 == 0:
                    # Current loss (categorical cross-entropy)
                    loss = -np.sum(Y_onehot * np.log(probs + 1e-15)) / n_samples
                    print(f"[Epoch {epoch}] Multiclass loss: {loss:.4f}")

        # ----- Binary case -----
        else:
            # Convert y to 0/1 if not already
            if not np.array_equal(unique_classes, [0, 1]):
                # Map the two classes to 0 and 1
                class0, class1 = unique_classes
                y_binary = np.where(y == class1, 1, 0)
            else:
                y_binary = y.copy().astype(int)

            # Initialize weight vector (features,)
            self.weights = np.zeros(n_features)

            # Gradient descent
            for epoch in range(self.epochs):
                linear_model = X.dot(self.weights)   # (n_samples,)
                probs = self._sigmoid(linear_model)  # (n_samples,)
                # Gradient of the binary cross-entropy
                gradient = (1 / n_samples) * X.T.dot(probs - y_binary)
                self.weights -= self.lr * gradient

                if self.verbose and epoch % 100 == 0:
                    # Current binary cross-entropy loss
                    loss = -np.mean(
                        y_binary * np.log(probs + 1e-15) +
                        (1 - y_binary) * np.log(1 - probs + 1e-15)
                    )
                    print(f"[Epoch {epoch}] Binary loss: {loss:.4f}")

    def predict_prob(self, X):
        """
        Compute probability estimates. Returns a 1D array for binary or
        a 2D array (n_samples x n_classes) for multiclass.
        """
        X = np.array(X)
        # Add intercept if the model used it
        if self.fit_intercept:
            X = self._add_intercept(X)
        scores = X.dot(self.weights)
        if self.multi_class:
            return self._softmax(scores)
        else:
            return self._sigmoid(scores)

    def predict(self, X):
        """
        Predict class labels for samples in X.
        Returns integer class labels (0/1 for binary, or 0 ... C-1 for multiclass).
        """
        probs = self.predict_prob(X)
        if self.multi_class:
            # Choose the class with the highest probability
            return np.argmax(probs, axis=1)
        else:
            # Threshold at 0.5 for binary
            return (probs >= 0.5).astype(int)
The class implements the sigmoid and softmax internally. During fit(), we check the number of classes: if there are more than two, we set self.multi_class=True and perform multinomial logistic regression, one-hot encoding the target vector and updating a weight matrix with softmax probabilities. Otherwise, we do standard binary logistic regression, converting labels to 0/1 if needed and updating a weight vector. In both cases we use batch gradient descent on the cross-entropy loss (we add a small epsilon of 1e-15 inside the logarithms for numerical stability). Progress (the loss) can be printed if verbose=True.
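
A possible refinement, not included in the class above: for large negative inputs, np.exp(-z) inside _sigmoid can overflow and trigger runtime warnings (the 1e-15 epsilon only protects the logarithms). A numerically stable drop-in replacement can be sketched as follows; the piecewise formulation is our own addition:
def stable_sigmoid(z):
    """Sigmoid that avoids overflow in np.exp for large |z| (expects an array, as in fit)."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the textbook form is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, rewrite as exp(z) / (1 + exp(z)) so the exponent stays non-positive
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out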


Evaluation Metrics


We define helper functions for accuracy and cross-entropy loss. Accuracy is the fraction of correct predictions. For the loss, we compute the appropriate cross-entropy:
def accuracy_score(y_true, y_pred):
    """Accuracy = (# correct predictions) / (total samples)."""
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    return np.mean(y_true == y_pred)

def binary_cross_entropy(y_true, y_prob):
    """
    Binary cross-entropy loss.
    y_true: true binary labels (0 or 1); y_prob: predicted probabilities for class 1.
    """
    y_true = np.array(y_true)
    y_prob = np.clip(np.array(y_prob), 1e-15, 1 - 1e-15)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

def categorical_cross_entropy(y_true, y_prob):
    """
    Categorical cross-entropy loss for multiclass.
    y_true: true labels (0 ... C-1); y_prob: predicted probabilities (n_samples x C).
    """
    y_true = np.array(y_true, dtype=int)
    y_prob = np.clip(np.array(y_prob), 1e-15, 1 - 1e-15)
    # One-hot encode the true labels
    n_samples, n_classes = y_prob.shape
    one_hot = np.zeros_like(y_prob)
    one_hot[np.arange(n_samples), y_true] = 1
    # Per-sample cross-entropy, then the mean over samples
    loss_vec = -np.sum(one_hot * np.log(y_prob), axis=1)
    return np.mean(loss_vec)
The binary cross-entropy matches the loss formula given in the design section, and the categorical cross-entropy is the standard softmax loss. We clip the probabilities to avoid log(0). Accuracy is straightforward (correct/total).
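
As a quick sanity check, the metrics can be verified on tiny, hand-computable inputs (the numbers below are illustrative and not part of the demo that follows):
# Two of three predictions correct -> accuracy 2/3
print(accuracy_score([0, 1, 1], [0, 0, 1]))        # 0.666...

# Confident and correct probabilities -> low loss: -log(0.9) is about 0.105
print(binary_cross_entropy([1, 0], [0.9, 0.1]))    # about 0.105

# Multiclass: true classes 0 and 2, each predicted with probability 0.7
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.2, 0.7]])
print(categorical_cross_entropy([0, 2], probs))    # -log(0.7), about 0.357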


Synthetic Data Generation


To test the model, we generate synthetic datasets:

 • Binary classification data: Create two Gaussian clusters in 2D. For example, class 0 around mean [-2, -2] and class 1 around [2, 2].
 • Multiclass data: Create several Gaussian clusters (one per class) spread out in feature space.


The code below demonstrates simple generators:
import numpy as np

def generate_binary_data(n_samples=100, n_features=2, random_state=None):
    """
    Generate synthetic binary classification data.
    Returns (X, y) where X is (n_samples x n_features) and y is in {0, 1}.
    """
    rng = np.random.RandomState(random_state)
    # Half of the samples for class 0, half for class 1
    n0 = n_samples // 2
    n1 = n_samples - n0
    # Class 0 around mean -2, class 1 around +2
    mean0 = -2 * np.ones(n_features)
    mean1 = 2 * np.ones(n_features)
    X0 = rng.randn(n0, n_features) + mean0
    X1 = rng.randn(n1, n_features) + mean1
    X = np.vstack((X0, X1))
    y = np.array([0] * n0 + [1] * n1)
    return X, y

def generate_multiclass_data(n_samples=150, n_features=2, n_classes=3, random_state=None):
    """
    Generate synthetic multiclass data with n_classes Gaussian clusters.
    """
    rng = np.random.RandomState(random_state)
    X = []
    y = []
    samples_per_class = n_samples // n_classes
    for cls in range(n_classes):
        # Random cluster center for each class
        center = rng.uniform(-5, 5, size=n_features)
        Xi = rng.randn(samples_per_class, n_features) + center
        yi = [cls] * samples_per_class
        X.append(Xi)
        y.extend(yi)
    X = np.vstack(X)
    y = np.array(y)
    return X, y
These functions use NumPy to create normally distributed points around separated cluster centers: fixed at -2 and +2 for the binary case, and drawn at random from [-5, 5] for each class in the multiclass case. Each class's points occupy a distinct region (for typical draws), making the classification problem learnable by logistic regression.
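
If matplotlib is available (it is not used anywhere else in this implementation), a quick scatter plot is a handy way to confirm that the generated clusters are separated; this is an optional sketch, not part of the model code:
import numpy as np
import matplotlib.pyplot as plt

X_demo, y_demo = generate_multiclass_data(n_samples=150, n_classes=3, random_state=0)
for cls in np.unique(y_demo):
    mask = y_demo == cls
    plt.scatter(X_demo[mask, 0], X_demo[mask, 1], label=f"class {cls}")
plt.xlabel("feature 0")
plt.ylabel("feature 1")
plt.legend()
plt.title("Synthetic multiclass clusters")
plt.show()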


Demo: Training and Evaluation


Finally, we test the implementation on synthetic data. We train on both binary and multiclass cases, then evaluate accuracy and loss, and export the results to CSV.
# Generate and test on binary data
X_bin, y_bin = generate_binary_data(n_samples=200, n_features=2, random_state=42)
model_bin = LogisticRegression(lr=0.1, epochs=1000)
model_bin.fit(X_bin, y_bin)
y_prob_bin = model_bin.predict_prob(X_bin)   # probabilities for class 1
y_pred_bin = model_bin.predict(X_bin)        # predicted classes 0 or 1

acc_bin = accuracy_score(y_bin, y_pred_bin)
loss_bin = binary_cross_entropy(y_bin, y_prob_bin)
print(f"Binary Classification - Accuracy: {acc_bin:.2f}, Cross-Entropy Loss: {loss_bin:.2f}")
For multiclass:
# Generate and test on multiclass data
X_multi, y_multi = generate_multiclass_data(n_samples=300, n_features=2, n_classes=3, random_state=1)
model_multi = LogisticRegression(lr=0.1, epochs=1000)
model_multi.fit(X_multi, y_multi)
y_prob_multi = model_multi.predict_prob(X_multi)   # (n_samples x 3) probabilities
y_pred_multi = model_multi.predict(X_multi)        # predicted labels 0, 1, 2

acc_multi = accuracy_score(y_multi, y_pred_multi)
loss_multi = categorical_cross_entropy(y_multi, y_prob_multi)
print(f"Multiclass Classification - Accuracy: {acc_multi:.2f}, Cross-Entropy Loss: {loss_multi:.2f}")
These print statements show how well the model fits the training data.
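
The numbers above are computed on the training data itself. To get a rough estimate of generalization, one could hold out part of the data; the split logic below is our own addition, not part of the implementation above:
# Shuffle indices and keep 80% for training, 20% for testing
rng = np.random.RandomState(0)
idx = rng.permutation(len(X_bin))
split = int(0.8 * len(X_bin))
train_idx, test_idx = idx[:split], idx[split:]

model_split = LogisticRegression(lr=0.1, epochs=1000)
model_split.fit(X_bin[train_idx], y_bin[train_idx])
y_pred_test = model_split.predict(X_bin[test_idx])
print(f"Held-out accuracy: {accuracy_score(y_bin[test_idx], y_pred_test):.2f}")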


CSV Export


To save predictions and true labels, we use Python's csv module. For example:
import csv

# Export binary results
with open('binary_results.csv', mode='w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["TrueLabel", "PredictedLabel"])
    for true, pred in zip(y_bin, y_pred_bin):
        writer.writerow([true, pred])

# Export multiclass results
with open('multiclass_results.csv', mode='w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["TrueLabel", "PredictedLabel"])
    for true, pred in zip(y_multi, y_pred_multi):
        writer.writerow([true, pred])
This writes two CSV files with columns for the true and predicted labels. One can later analyze or plot these results externally.
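
As a quick check, an exported file can be read back with the same csv module and the accuracy recomputed (a small sketch; the column names match the header written above):
import csv
import numpy as np

with open('binary_results.csv', newline='') as f:
    reader = csv.DictReader(f)   # uses the TrueLabel / PredictedLabel header row
    rows = list(reader)

true_labels = np.array([int(row["TrueLabel"]) for row in rows])
pred_labels = np.array([int(row["PredictedLabel"]) for row in rows])
print(f"Accuracy recomputed from CSV: {accuracy_score(true_labels, pred_labels):.2f}")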


Summary

The implementation above provides a clear, object-oriented logistic regression model built with nothing beyond NumPy and the Python standard library. It handles both binary and multiclass cases by using the sigmoid and softmax functions, respectively. Training uses batch gradient descent to minimize the cross-entropy loss. We include methods for prediction, accuracy calculation, and loss computation, as well as utilities for generating synthetic data and exporting results. This modular design can be extended (e.g., by adding regularization or optimization improvements) but already demonstrates the key mechanics of logistic regression end-to-end.
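
As a sketch of the regularization extension mentioned above (the function name and the reg_lambda parameter are our own, hypothetical additions, not part of the class), an L2 penalty of (reg_lambda / (2 N)) * ||w||^2 would only change the gradient step inside fit:
def l2_regularized_gradient(X, probs, y_binary, weights, reg_lambda, fit_intercept=True):
    """Binary cross-entropy gradient plus the gradient of an L2 penalty."""
    n_samples = X.shape[0]
    # Plain cross-entropy gradient, as in fit()
    gradient = (1 / n_samples) * X.T.dot(probs - y_binary)
    # Derivative of (reg_lambda / (2 * n_samples)) * ||w||^2
    penalty = weights.copy()
    if fit_intercept:
        penalty[0] = 0.0  # conventionally, the intercept is not shrunk
    return gradient + (reg_lambda / n_samples) * penalty
The same idea carries over to the multiclass weight matrix.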
Morten Hjorth-Jensen, Michigan State University and University of Oslo, Norway. http://mhjgit.github.io/info/doc/web