The regularized cost function adds a penalty term to the standard mean squared error:
$$J(\boldsymbol{\beta}) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_{\boldsymbol{\beta}}(\mathbf{x}^{(i)}) - y^{(i)} \right)^2 + \lambda \cdot \text{penalty}(\boldsymbol{\beta})$$
where $\lambda \geq 0$ is the regularization parameter controlling penalty strength. When $\lambda = 0$, we recover ordinary least squares (no regularization). As $\lambda$ increases, we constrain parameter magnitudes, favoring simpler models. By convention, the intercept $\beta_0$ is excluded from the penalty, since shrinking it would bias predictions rather than simplify the model.
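As a minimal sketch, the cost function above translates directly into NumPy. The function name `ridge_cost` and the synthetic inputs are illustrative; this version uses the L2 (squared) penalty and skips the intercept, matching the convention noted above.

```python
import numpy as np

def ridge_cost(beta, X, y, lam):
    """Regularized cost: mean-squared-error term plus an L2 penalty.

    Assumes X includes a leading column of ones for the intercept beta[0],
    which is excluded from the penalty by convention.
    """
    m = len(y)
    residuals = X @ beta - y
    mse_term = (residuals @ residuals) / (2 * m)
    penalty = lam * np.sum(beta[1:] ** 2)  # intercept not penalized
    return mse_term + penalty
```

With `lam=0` this reduces to the ordinary least-squares cost; increasing `lam` adds a growing price on coefficient magnitude.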
L2 Regularization (Ridge) adds the sum of squared parameter values: $\lambda \sum_{j=1}^{n} \beta_j^2$. Ridge regression shrinks all coefficients towards zero but never sets any of them exactly to zero, so no features are eliminated entirely. This is particularly effective when many features are relevant but should have modest contributions. Ridge also tends to perform better when features are correlated, as it distributes weight among correlated features rather than arbitrarily selecting one.
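Ridge has a closed-form solution, $\boldsymbol{\beta} = (\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$, which makes the shrinkage easy to see directly. The sketch below fits this closed form on synthetic data (the data and `ridge_fit` helper are illustrative, not from the text); the coefficient norm shrinks monotonically as $\lambda$ grows.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: solve (X^T X + lam*I) beta = X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Synthetic regression problem with three relevant features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

for lam in [0.0, 10.0, 100.0]:
    beta = ridge_fit(X, y, lam)
    print(f"lambda={lam:6.1f}  beta={np.round(beta, 3)}")
```

All three coefficients move towards zero as $\lambda$ increases, but none reach exactly zero.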
L1 Regularization (Lasso) adds the sum of absolute parameter values: $\lambda \sum_{j=1}^{n} |\beta_j|$. The L1 penalty drives some parameters to exactly zero, performing automatic feature selection. Lasso is preferred when we believe only a sparse subset of features is truly relevant, producing interpretable models by eliminating irrelevant features. However, when many features are correlated, Lasso tends to arbitrarily select one and zero out the others.
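The sparsity claim is easy to demonstrate. In this sketch (synthetic data; the `alpha` value is an illustrative choice, and scikit-learn's `alpha` plays the role of $\lambda$ here), only 2 of 10 features drive the target, and the L1 penalty should zero out most of the remaining coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso

# 10 features, but only the first two actually influence y.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0.0))
print("coefficients:", np.round(lasso.coef_, 2))
print("features zeroed out:", n_zero)
```

Unlike ridge, the irrelevant coefficients land at exactly zero, not merely near it, which is what makes the surviving model interpretable.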
Practical guidance: Use Ridge when most features are relevant and should contribute modestly. Use Lasso when you need automatic feature selection and interpretable models with sparse coefficients. Both methods require tuning the regularization strength $\lambda$ (typically via cross-validation). Feature scaling is essential before applying regularization to ensure penalties apply fairly across all features.
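The practical workflow above (scale, then tune $\lambda$ by cross-validation) can be sketched with scikit-learn's pipeline utilities. The data here is synthetic, with deliberately mismatched feature scales; note that scikit-learn names the regularization strength `alpha` rather than $\lambda$.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Feature scales differ by orders of magnitude, so standardize before
# penalizing -- otherwise the penalty hits large-scale features unfairly.
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4)) * np.array([1.0, 100.0, 0.01, 10.0])
y = X @ np.array([1.5, 0.02, 50.0, -0.3]) + 0.5 * rng.normal(size=150)

# Cross-validation selects the regularization strength automatically.
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13)))
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5))
ridge.fit(X, y)
lasso.fit(X, y)

print("ridge selected alpha:", ridge[-1].alpha_)
print("lasso selected alpha:", lasso[-1].alpha_)
```

Keeping the scaler inside the pipeline also ensures cross-validation re-fits the scaling on each training fold, avoiding leakage from the validation fold.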
Developed by Kevin Yu & Panagiotis Angeloudis