
Commit a093015

committed
update
1 parent f85d5a1 commit a093015

File tree

2 files changed: +702 -0 lines changed

doc/Programs/Regression/Logreg.txt

Lines changed: 313 additions & 0 deletions
From: Morten Hjorth-Jensen <[email protected]>
Subject: Logreg
Date: May 29, 2025 at 11:43:36 AM GMT+2
To: Morten Hjorth-Jensen <[email protected]>


Logistic Regression from Scratch

Logistic regression is a foundational binary classification technique that models the probability of a sample belonging to class 1 using the sigmoid (logistic) function. In binary classification, it outputs a probability between 0 and 1, typically thresholded at 0.5 to decide class labels. For multiclass problems (more than two classes), we generalize logistic regression via the softmax function (also known as multinomial logistic regression), which produces a probability distribution over all classes. In both cases, training involves minimizing the cross-entropy loss (also called log-loss) using gradient descent. Accuracy is commonly used to evaluate classification performance (the fraction of correctly predicted samples), alongside the cross-entropy value to measure model fit.
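As a quick illustration of the binary decision rule (a minimal standalone sketch, independent of the implementation below, with example scores chosen arbitrarily), the sigmoid maps any real-valued score to (0, 1), and thresholding at 0.5 amounts to checking the sign of the score:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

scores = np.array([-2.0, -0.1, 0.0, 0.1, 3.0])
probs = sigmoid(scores)              # roughly [0.12, 0.48, 0.50, 0.52, 0.95]
labels = (probs >= 0.5).astype(int)  # [0, 0, 1, 1, 1]
print(probs, labels)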

Our goal is to build an object-oriented Python implementation of logistic regression without scikit-learn, using only standard libraries like NumPy and CSV. The implementation will support both binary and multiclass modes, use gradient descent for training, and include methods for prediction and evaluation. It will also include synthetic data generation (to test the model) and functionality to export predictions and true labels to CSV.


Design and Features

• Class Structure: A LogisticRegression class encapsulates model parameters (weights, bias) and methods for training (fit) and prediction (predict, predict_prob). The design handles binary vs multiclass internally by checking how many unique target labels are present.
• Probability Functions: For binary classification we use the sigmoid function. For multiclass, we use the softmax function, defined as \hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}, which produces a valid probability distribution over the classes.
• Training with Gradient Descent: We implement batch gradient descent on the cross-entropy loss. For binary logistic regression the loss is
-\frac{1}{N}\sum_i \left[y_i\log(p_i) + (1-y_i)\log(1-p_i)\right],
where p_i=\sigma(\mathbf{w}^T\mathbf{x}_i). For multiclass, the loss (categorical cross-entropy) for one-hot true labels y and predicted probabilities \hat{y} is
-\frac{1}{N}\sum_i \sum_{k=1}^C y_{ik}\log(\hat{y}_{ik}).
We compute gradients of these losses with respect to the weights and iteratively update the model (a short numerical check of the multiclass gradient follows after this list).
• Prediction: The predict_prob method returns probabilities (sigmoid or softmax). The predict method applies a threshold (0.5) for binary or takes the argmax for multiclass.
• Metrics: We include functions for accuracy and cross-entropy loss. Accuracy is defined as the fraction of correct predictions. We compute binary cross-entropy or categorical cross-entropy to quantify model performance on data.
• Synthetic Data: We provide functions to generate synthetic datasets: (a) binary data with two Gaussian clusters, one per class; (b) multiclass data with several clusters, one per class. This allows testing the model.
• CSV Export: Using Python's csv module, we can save the true labels and predicted labels (or probabilities) to CSV files for external analysis.

These components together form a clean, modular implementation. The following sections detail the code with explanations.

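Before turning to the class itself, here is a small numerical check (a standalone sketch with arbitrary toy data, not part of the implementation below) that the analytic gradient used in the training loop, X^T(softmax(XW) - Y_onehot)/N, agrees with a finite-difference estimate of the categorical cross-entropy loss:

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(5, 3)                        # 5 samples, 3 features
W = rng.randn(3, 4) * 0.1                  # weights for 4 classes
Y = np.eye(4)[rng.randint(0, 4, size=5)]   # one-hot labels

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def loss(W):
    P = softmax(X.dot(W))
    return -np.sum(Y * np.log(P + 1e-15)) / X.shape[0]

analytic = X.T.dot(softmax(X.dot(W)) - Y) / X.shape[0]

# Finite-difference estimate for one weight entry
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (loss(W_plus) - loss(W_minus)) / (2 * eps)
print(analytic[1, 2], numeric)   # the two numbers should agree closely
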
LogisticRegression Class Implementation

Below is the LogisticRegression class. It initializes parameters (learning rate, epochs, etc.) and defines private methods for adding an intercept term, the sigmoid, and the softmax. The fit method detects binary vs multiclass and runs gradient descent accordingly, updating the weights. The predict_prob and predict methods compute probabilities and class labels.

import numpy as np

class LogisticRegression:
    """
    Logistic Regression for binary and multiclass classification.
    """
    def __init__(self, lr=0.01, epochs=1000, fit_intercept=True, verbose=False):
        self.lr = lr                  # Learning rate for gradient descent
        self.epochs = epochs          # Number of iterations
        self.fit_intercept = fit_intercept  # Whether to add intercept (bias)
        self.verbose = verbose        # Print loss during training if True
        self.weights = None
        self.multi_class = False      # Will be determined at fit time

    def _add_intercept(self, X):
        """Add intercept term (column of ones) to feature matrix."""
        intercept = np.ones((X.shape[0], 1))
        return np.concatenate((intercept, X), axis=1)

    def _sigmoid(self, z):
        """Sigmoid function for binary logistic."""
        return 1 / (1 + np.exp(-z))

    def _softmax(self, Z):
        """Softmax function for multiclass logistic."""
        exp_Z = np.exp(Z - np.max(Z, axis=1, keepdims=True))
        return exp_Z / np.sum(exp_Z, axis=1, keepdims=True)

    def fit(self, X, y):
        """
        Train the logistic regression model using gradient descent.
        Supports binary (sigmoid) and multiclass (softmax) based on y.
        """
        X = np.array(X)
        y = np.array(y)
        n_samples, n_features = X.shape

        # Add intercept if needed
        if self.fit_intercept:
            X = self._add_intercept(X)
            n_features += 1

        # Determine classes and mode (binary vs multiclass)
        unique_classes = np.unique(y)
        if len(unique_classes) > 2:
            self.multi_class = True
        else:
            self.multi_class = False

        # ----- Multiclass case -----
        if self.multi_class:
            n_classes = len(unique_classes)
            # Map original labels to 0...n_classes-1
            class_to_index = {c: idx for idx, c in enumerate(unique_classes)}
            y_indices = np.array([class_to_index[c] for c in y])
            # Initialize weight matrix (features x classes)
            self.weights = np.zeros((n_features, n_classes))

            # One-hot encode y
            Y_onehot = np.zeros((n_samples, n_classes))
            Y_onehot[np.arange(n_samples), y_indices] = 1

            # Gradient descent
            for epoch in range(self.epochs):
                scores = X.dot(self.weights)         # Linear scores (n_samples x n_classes)
                probs = self._softmax(scores)        # Probabilities (n_samples x n_classes)
                # Compute gradient (features x classes)
                gradient = (1 / n_samples) * X.T.dot(probs - Y_onehot)
                # Update weights
                self.weights -= self.lr * gradient

                if self.verbose and epoch % 100 == 0:
                    # Compute current loss (categorical cross-entropy)
                    loss = -np.sum(Y_onehot * np.log(probs + 1e-15)) / n_samples
                    print(f"[Epoch {epoch}] Multiclass loss: {loss:.4f}")

        # ----- Binary case -----
        else:
            # Convert y to 0/1 if not already
            if not np.array_equal(unique_classes, [0, 1]):
                # Map the two classes to 0 and 1
                class0, class1 = unique_classes
                y_binary = np.where(y == class1, 1, 0)
            else:
                y_binary = y.copy().astype(int)

            # Initialize weights vector (features,)
            self.weights = np.zeros(n_features)

            # Gradient descent
            for epoch in range(self.epochs):
                linear_model = X.dot(self.weights)    # (n_samples,)
                probs = self._sigmoid(linear_model)   # (n_samples,)
                # Gradient for binary cross-entropy
                gradient = (1 / n_samples) * X.T.dot(probs - y_binary)
                self.weights -= self.lr * gradient

                if self.verbose and epoch % 100 == 0:
                    # Compute binary cross-entropy loss
                    loss = -np.mean(
                        y_binary * np.log(probs + 1e-15) +
                        (1 - y_binary) * np.log(1 - probs + 1e-15)
                    )
                    print(f"[Epoch {epoch}] Binary loss: {loss:.4f}")

    def predict_prob(self, X):
        """
        Compute probability estimates. Returns a 1D array for binary or
        a 2D array (n_samples x n_classes) for multiclass.
        """
        X = np.array(X)
        # Add intercept if the model used it
        if self.fit_intercept:
            X = self._add_intercept(X)
        scores = X.dot(self.weights)
        if self.multi_class:
            return self._softmax(scores)
        else:
            return self._sigmoid(scores)

    def predict(self, X):
        """
        Predict class labels for samples in X.
        Returns integer class labels (0, 1 for binary, or 0...C-1 for multiclass).
        """
        probs = self.predict_prob(X)
        if self.multi_class:
            # Choose class with highest probability
            return np.argmax(probs, axis=1)
        else:
            # Threshold at 0.5 for binary
            return (probs >= 0.5).astype(int)

The class implements the sigmoid and softmax internally. During fit(), we check the number of classes: if there are more than 2, we set self.multi_class=True and perform multinomial logistic regression. We one-hot encode the target vector and update a weight matrix with softmax probabilities. Otherwise, we do standard binary logistic regression, converting labels to 0/1 if needed and updating a weight vector. In both cases we use batch gradient descent on the cross-entropy loss (we add a small epsilon of 1e-15 inside the logarithms for numerical stability). Progress (loss) can be printed if verbose=True.

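As a side note on the one-hot encoding step inside fit(), the same fancy-indexing trick can be tried in isolation (a tiny illustrative sketch with made-up class indices):

import numpy as np

y_indices = np.array([2, 0, 1, 2])       # class indices for 4 samples, 3 classes
Y_onehot = np.zeros((4, 3))
Y_onehot[np.arange(4), y_indices] = 1    # set one entry per row
print(Y_onehot)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
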
Evaluation Metrics

We define helper functions for accuracy and cross-entropy loss. Accuracy is the fraction of correct predictions. For the loss, we compute the appropriate cross-entropy:
def accuracy_score(y_true, y_pred):
    """Accuracy = (# correct predictions) / (total samples)."""
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    return np.mean(y_true == y_pred)

def binary_cross_entropy(y_true, y_prob):
    """
    Binary cross-entropy loss.
    y_true: true binary labels (0 or 1), y_prob: predicted probabilities for class 1.
    """
    y_true = np.array(y_true)
    y_prob = np.clip(np.array(y_prob), 1e-15, 1 - 1e-15)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

def categorical_cross_entropy(y_true, y_prob):
    """
    Categorical cross-entropy loss for multiclass.
    y_true: true labels (0...C-1), y_prob: array of predicted probabilities (n_samples x C).
    """
    y_true = np.array(y_true, dtype=int)
    y_prob = np.clip(np.array(y_prob), 1e-15, 1 - 1e-15)
    # One-hot encode true labels
    n_samples, n_classes = y_prob.shape
    one_hot = np.zeros_like(y_prob)
    one_hot[np.arange(n_samples), y_true] = 1
    # Compute cross-entropy
    loss_vec = -np.sum(one_hot * np.log(y_prob), axis=1)
    return np.mean(loss_vec)

The binary cross-entropy matches the formula given earlier, and the categorical cross-entropy aligns with the standard softmax loss. We clip the probabilities to avoid log(0). Accuracy is simply the number of correct predictions divided by the total.

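A quick sanity check of these helpers on hand-computed values (an illustrative snippet, with the numbers chosen only for the example):

y_true = [1, 0, 1]
y_pred = [1, 1, 1]
print(accuracy_score(y_true, y_pred))        # 2 of 3 correct -> about 0.667

y_prob = [0.9, 0.1]
print(binary_cross_entropy([1, 0], y_prob))  # -mean(log 0.9, log 0.9), about 0.105
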
Synthetic Data Generation

To test the model, we generate synthetic datasets:

• Binary classification data: Create two Gaussian clusters in 2D. For example, class 0 around mean [-2, -2] and class 1 around [2, 2].
• Multiclass data: Create several Gaussian clusters (one per class) spread out in feature space.

The code below demonstrates simple generators:
import numpy as np

def generate_binary_data(n_samples=100, n_features=2, random_state=None):
    """
    Generate synthetic binary classification data.
    Returns (X, y) where X is (n_samples x n_features), y in {0,1}.
    """
    rng = np.random.RandomState(random_state)
    # Half samples for class 0, half for class 1
    n0 = n_samples // 2
    n1 = n_samples - n0
    # Class 0 around mean -2, class 1 around +2
    mean0 = -2 * np.ones(n_features)
    mean1 =  2 * np.ones(n_features)
    X0 = rng.randn(n0, n_features) + mean0
    X1 = rng.randn(n1, n_features) + mean1
    X = np.vstack((X0, X1))
    y = np.array([0]*n0 + [1]*n1)
    return X, y

def generate_multiclass_data(n_samples=150, n_features=2, n_classes=3, random_state=None):
    """
    Generate synthetic multiclass data with n_classes Gaussian clusters.
    """
    rng = np.random.RandomState(random_state)
    X = []
    y = []
    samples_per_class = n_samples // n_classes
    for cls in range(n_classes):
        # Random cluster center for each class
        center = rng.uniform(-5, 5, size=n_features)
        Xi = rng.randn(samples_per_class, n_features) + center
        yi = [cls] * samples_per_class
        X.append(Xi)
        y.extend(yi)
    X = np.vstack(X)
    y = np.array(y)
    return X, y

These functions use NumPy to create normally distributed points. In the binary case the cluster centers are fixed at -2 and +2, so the classes are well separated; in the multiclass case each center is drawn uniformly from [-5, 5], so the clusters are usually, though not always, well separated. This makes the classification problem learnable by logistic regression.

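To eyeball the generated data (a small optional check, not needed by the rest of the demo; variable names here are just for illustration), one can print the empirical center of each cluster:

X_demo, y_demo = generate_multiclass_data(n_samples=150, n_classes=3, random_state=0)
for cls in np.unique(y_demo):
    print(cls, X_demo[y_demo == cls].mean(axis=0))   # empirical center of each class
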
Demo: Training and Evaluation

Finally, we test the implementation on synthetic data. We train on both the binary and multiclass cases, then evaluate accuracy and loss, and export the results to CSV.

# Generate and test on binary data
X_bin, y_bin = generate_binary_data(n_samples=200, n_features=2, random_state=42)
model_bin = LogisticRegression(lr=0.1, epochs=1000)
model_bin.fit(X_bin, y_bin)
y_prob_bin = model_bin.predict_prob(X_bin)      # probabilities for class 1
y_pred_bin = model_bin.predict(X_bin)           # predicted classes 0 or 1

acc_bin = accuracy_score(y_bin, y_pred_bin)
loss_bin = binary_cross_entropy(y_bin, y_prob_bin)
print(f"Binary Classification - Accuracy: {acc_bin:.2f}, Cross-Entropy Loss: {loss_bin:.2f}")

For multiclass:

# Generate and test on multiclass data
X_multi, y_multi = generate_multiclass_data(n_samples=300, n_features=2, n_classes=3, random_state=1)
model_multi = LogisticRegression(lr=0.1, epochs=1000)
model_multi.fit(X_multi, y_multi)
y_prob_multi = model_multi.predict_prob(X_multi)     # (n_samples x 3) probabilities
y_pred_multi = model_multi.predict(X_multi)          # predicted labels 0,1,2

acc_multi = accuracy_score(y_multi, y_pred_multi)
loss_multi = categorical_cross_entropy(y_multi, y_prob_multi)
print(f"Multiclass Classification - Accuracy: {acc_multi:.2f}, Cross-Entropy Loss: {loss_multi:.2f}")

These print statements show how well the model fits the training data. Note that we evaluate on the same data we trained on, so the numbers measure fit rather than generalization.

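For a less optimistic estimate one can hold out part of the data before fitting. A minimal sketch of such a split, using only NumPy and an 80/20 ratio chosen arbitrarily for illustration:

rng = np.random.RandomState(0)
idx = rng.permutation(len(X_bin))
split = int(0.8 * len(X_bin))
train_idx, test_idx = idx[:split], idx[split:]

model = LogisticRegression(lr=0.1, epochs=1000)
model.fit(X_bin[train_idx], y_bin[train_idx])
y_pred_test = model.predict(X_bin[test_idx])
print("Held-out accuracy:", accuracy_score(y_bin[test_idx], y_pred_test))
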
CSV Export

To save the predictions and true labels, we use Python's csv module. For example:
import csv

# Export binary results
with open('binary_results.csv', mode='w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["TrueLabel", "PredictedLabel"])
    for true, pred in zip(y_bin, y_pred_bin):
        writer.writerow([true, pred])

# Export multiclass results
with open('multiclass_results.csv', mode='w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["TrueLabel", "PredictedLabel"])
    for true, pred in zip(y_multi, y_pred_multi):
        writer.writerow([true, pred])

This writes two CSV files with columns for the true and predicted labels. One can later analyze or plot these results externally.

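To check such a file, one can read it back and recompute the accuracy from the stored columns (a small sketch using csv.DictReader; both columns are read back as strings, which is fine for an exact label comparison):

with open('binary_results.csv', newline='') as f:
    rows = list(csv.DictReader(f))
correct = sum(r["TrueLabel"] == r["PredictedLabel"] for r in rows)
print("Accuracy from file:", correct / len(rows))
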
Summary

The above implementation provides a clear, object-oriented logistic regression model in pure Python. It handles both binary and multiclass cases by using the sigmoid and softmax functions, respectively. Training uses batch gradient descent to minimize the cross-entropy loss. We include methods for prediction, accuracy calculation, and loss computation, as well as utilities for generating synthetic data and exporting results. This modular design can be extended (e.g., adding regularization or optimization improvements) but already demonstrates the key mechanics of logistic regression end-to-end.

Morten Hjorth-Jensen, Michigan State University and University of Oslo, Norway. http://mhjgit.github.io/info/doc/web
