This interactive demo shows how gradient descent optimizes linear regression parameters. Unlike the closed-form analytical solution, gradient descent iteratively improves the parameters by following the negative gradient of the cost function.

You'll see the algorithm step-by-step, watching how the regression line evolves and tracking the optimization path on the cost surface. This is the foundation for training machine learning models.
• Click "Step" to perform one gradient descent iteration. Watch the regression line evolve (left panel) and the optimization path on the cost surface (right panel)
• Use "Re-initialize" to start from new random parameters, or load different datasets to explore various optimization landscapes
• Adjust the learning rate η to see different convergence behaviors—too small is slow, too large causes oscillation

What learning rate gives the fastest convergence without overshooting? Try values from 0.01 to 0.5.
• Data & Regression Line (top-left): normalized data points, the current regression line, and residuals (dashed red lines showing prediction errors)
• Cost Surface (top-right): cost-surface contours, the current position (red dot), and the optimization path (yellow line)
• Cost Improvement (bottom): bar chart showing |ΔJ| (the absolute cost reduction) for each iteration; watch how the improvements diminish as the algorithm converges to the optimal solution
• Readouts: current iteration, normalized parameters α and β, cost J, and gradient norm ||∇J||
• Caption: describes what happened in each step (gradient computation, parameter updates, and ΔJ)
• As convergence approaches, both the gradient norm and |ΔJ| decrease toward zero; a minimal programmatic check is sketched below
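
For reference, here is a minimal Python sketch of that stopping condition. The function name and tolerance values are illustrative assumptions, not part of the demo's code:

```python
import numpy as np

def has_converged(grad: np.ndarray, delta_J: float,
                  grad_tol: float = 1e-6, cost_tol: float = 1e-9) -> bool:
    """Illustrative stopping rule (thresholds are assumptions): stop once
    the gradient norm ||∇J|| and the per-step cost change |ΔJ| are both
    near zero."""
    return bool(np.linalg.norm(grad) < grad_tol and abs(delta_J) < cost_tol)
```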
[Interactive figure: panels "Data & Regression Line", "Cost Surface", "Cost Improvement per Step (ΔJ)", and "Optimization History"; initial readouts: Iteration 0, α = −1.300, β = −0.900, Cost J = 2.265.]

Mathematical Foundations

Gradient descent is an iterative optimization algorithm that finds parameter values minimizing the cost function $J(\alpha,\beta) = \frac{1}{2n}\sum_{i=1}^{n}\left(\alpha + \beta x_i - y_i\right)^2$ for our linear regression model $\mu(x) = \alpha + \beta x$. Rather than solving analytically, gradient descent starts with an initial guess for the parameters and repeatedly adjusts them in the direction that most decreases the cost.
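
As a concrete reference, the cost can be computed with a few lines of NumPy. This is a minimal sketch, not the demo's actual implementation; the function name and arguments are illustrative:

```python
import numpy as np

def cost(alpha: float, beta: float, x: np.ndarray, y: np.ndarray) -> float:
    """J(alpha, beta) = (1/2n) * sum_i (alpha + beta*x_i - y_i)^2."""
    residuals = alpha + beta * x - y   # prediction errors mu(x_i) - y_i
    return float(np.sum(residuals ** 2) / (2 * len(x)))
```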

At each iteration, we update the parameters according to $\alpha \leftarrow \alpha - \eta \frac{\partial J}{\partial \alpha}$ and $\beta \leftarrow \beta - \eta \frac{\partial J}{\partial \beta}$, where the gradients are $\frac{\partial J}{\partial \alpha} = \frac{1}{n}\sum_{i=1}^{n}(\alpha + \beta x_i - y_i)$ and $\frac{\partial J}{\partial \beta} = \frac{1}{n}\sum_{i=1}^{n}(\alpha + \beta x_i - y_i)\,x_i$.
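
Continuing the sketch above, one iteration of this update rule could look like the following (again illustrative, not the demo's code):

```python
def gradient_step(alpha: float, beta: float, x: np.ndarray, y: np.ndarray,
                  eta: float) -> tuple[float, float]:
    """Perform one gradient descent update on (alpha, beta)."""
    n = len(x)
    residuals = alpha + beta * x - y       # alpha + beta*x_i - y_i
    grad_alpha = residuals.sum() / n       # dJ/dalpha
    grad_beta = (residuals * x).sum() / n  # dJ/dbeta
    return alpha - eta * grad_alpha, beta - eta * grad_beta
```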

The learning rate η is crucial: too small and the algorithm takes tiny steps requiring many iterations; too large and it may overshoot the minimum and diverge. A moderate η leads to steady decrease in J over iterations. As we approach the minimum, the gradient magnitude decreases, effectively reducing step size even with constant η.
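
To see these regimes numerically, the sketch below runs the update loop for several learning rates, reusing the cost and gradient_step functions above. The dataset, random seed, and η values are illustrative assumptions; only the starting point α = −1.3, β = −0.9 matches the demo's initial readouts:

```python
import numpy as np

# Synthetic, roughly normalized data (an assumption, not the demo's dataset).
rng = np.random.default_rng(0)
x = rng.standard_normal(50)
y = 0.5 + 0.8 * x + 0.1 * rng.standard_normal(50)

for eta in (0.01, 0.1, 0.5, 2.5):   # 2.5 is deliberately too large
    alpha, beta = -1.3, -0.9        # the demo's initial parameters
    for _ in range(100):
        alpha, beta = gradient_step(alpha, beta, x, y, eta)
    print(f"eta = {eta:4.2f} -> J after 100 steps: {cost(alpha, beta, x, y):.4g}")
```

For this data, the smallest rate leaves J well above its minimum after 100 steps, the moderate rates converge, and η = 2.5 overshoots so badly that the iterates grow without bound.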

Developed by Kevin Yu & Panagiotis Angeloudis