| ID | $x$ | True $y$ | Pred $y$ | Correct | Confidence |
|---|---|---|---|---|---|
The Sigmoid (Logistic) Function maps any real value to a probability between 0 and 1:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
The sigmoid has several crucial properties: its range is bounded to $(0, 1)$, making outputs interpretable as probabilities; it is smooth and differentiable everywhere, enabling gradient-based optimization; it is monotonic, with large positive values yielding $\sigma(z) \approx 1$ and large negative values yielding $\sigma(z) \approx 0$; and it is symmetric around $z=0$ where $\sigma(0) = 0.5$.
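These properties can be checked directly with a minimal implementation (a sketch, not a library function):

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # 0.5 exactly: the point of symmetry
print(sigmoid(10))   # very close to 1 for large positive z
print(sigmoid(-10))  # very close to 0 for large negative z

# Symmetry around z = 0: sigmoid(-z) = 1 - sigmoid(z)
print(sigmoid(3) + sigmoid(-3))  # 1.0
```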
Logistic Regression Model: The model computes a weighted combination of inputs and passes it through the sigmoid to obtain a probability:
$$h(x) = P(y=1 \mid x) = \sigma(\alpha x + \beta) = \frac{1}{1 + e^{-(\alpha x + \beta)}}$$
To classify, we use a threshold (typically 0.5): predict class 1 if $h(x) \geq 0.5$, otherwise predict class 0. The decision boundary occurs where $\alpha x + \beta = 0$ (where the sigmoid equals 0.5).
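A short sketch of the thresholding rule, using hypothetical parameters $\alpha = 2$, $\beta = -4$ (chosen for illustration, so the decision boundary sits at $x = 2$):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters for illustration: boundary where 2x - 4 = 0, i.e. x = 2
ALPHA, BETA = 2.0, -4.0

def predict_proba(x: float) -> float:
    """P(y=1 | x) under the logistic model."""
    return sigmoid(ALPHA * x + BETA)

def predict(x: float, threshold: float = 0.5) -> int:
    """Class 1 if the probability clears the threshold, else class 0."""
    return 1 if predict_proba(x) >= threshold else 0

print(predict_proba(2.0))  # 0.5: exactly on the decision boundary
print(predict(3.0))        # 1, since alpha*x + beta = 2 > 0
print(predict(1.0))        # 0, since alpha*x + beta = -2 < 0
```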
Loss Functions: Unlike regression, we cannot use mean squared error (MSE) for classification because combining MSE with the sigmoid output creates a non-convex cost surface with multiple local minima. Instead, we use cross-entropy loss (also called log loss or negative log-likelihood), which is designed for probabilistic classification and remains convex:
$$J(\alpha, \beta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h(x^{(i)})) + (1-y^{(i)})\log(1-h(x^{(i)}))\right]$$
Cross-entropy heavily penalizes confident wrong predictions: predicting probability 0.9 for class 1 when the true label is 0 incurs much larger loss than predicting 0.6. This encourages the model to be calibrated and honest about uncertainty.
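The penalty asymmetry, and the full cost $J(\alpha, \beta)$, can be sketched as follows (the dataset and parameters are hypothetical, purely for illustration):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(y_true: int, p: float, eps: float = 1e-12) -> float:
    """Per-example log loss; eps guards against log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# Confident wrong prediction (true label 0, predicted P(y=1) = 0.9)
# vs a mildly wrong one (predicted 0.6):
print(cross_entropy(0, 0.9))  # ~2.30: heavily penalized
print(cross_entropy(0, 0.6))  # ~0.92: much smaller loss

# Mean cost J over a toy dataset, with hypothetical alpha = 2, beta = -4
xs, ys = [1.0, 2.5, 3.0, 0.5], [0, 1, 1, 0]
J = sum(cross_entropy(y, sigmoid(2.0 * x - 4.0)) for x, y in zip(xs, ys)) / len(xs)
print(J)
```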
Practical guidance: Use sigmoid activation for classification to obtain probabilistic outputs and smooth decision boundaries. The sigmoid ensures predictions stay in the valid probability range $(0, 1)$ and provides confidence estimates. Cross-entropy loss is the standard choice for training classification models because it properly handles probabilistic predictions and enables efficient gradient-based optimization.
Developed by Kevin Yu & Panagiotis Angeloudis