K-Means Clustering

K-means is one of the most popular unsupervised machine learning algorithms for clustering. It partitions data into k clusters by finding centroids that minimize the within-cluster sum of squared distances.

This algorithm is widely used in:

  • Market segmentation - grouping customers by behavior
  • Image processing - color quantization and compression
  • Data mining - discovering patterns in large datasets
  • Civil engineering - analyzing spatial patterns in infrastructure data

The algorithm is iterative and guaranteed to converge, though not necessarily to the global optimum.

Mathematical Foundation

K-means minimizes the within-cluster sum of squared distances (WCSS):

$$J = \sum_{i=1}^{k} \sum_{x \in C_i} ||x - \mu_i||^2$$

Where:

  • $k$ = number of clusters
  • $C_i$ = set of points in cluster $i$
  • $\mu_i$ = centroid of cluster $i$
  • $||x - \mu_i||^2$ = squared Euclidean distance
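As a sketch (assuming NumPy), the objective $J$ can be computed directly from the points, their cluster labels, and the centroids:

```python
import numpy as np

def wcss(points, labels, centroids):
    """Within-cluster sum of squared distances:
    J = sum over clusters i of sum over x in C_i of ||x - mu_i||^2."""
    return sum(
        np.sum((points[labels == i] - mu) ** 2)
        for i, mu in enumerate(centroids)
    )

points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
labels = np.array([0, 0, 1])
centroids = np.array([[0.5, 0.0], [10.0, 10.0]])
print(wcss(points, labels, centroids))  # 0.5: each of the first two points is 0.25 away (squared)
```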

Algorithm Steps:

  1. Initialize: Place $k$ centroids randomly: $\mu_1, \mu_2, ..., \mu_k$
  2. Assign: For each point $x_j$, assign to nearest centroid: $$c_j = \arg\min_i ||x_j - \mu_i||^2$$
  3. Update: Recalculate centroids as cluster means: $$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$
  4. Repeat: Steps 2-3 until convergence (centroids stop moving)

Convergence: The algorithm converges when centroids no longer change position between iterations, or the change is below a threshold.
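The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the demo's implementation; the optional `init` argument mimics manually placing centroids (as when dragging them in the demo):

```python
import numpy as np

def kmeans(points, k, init=None, max_iter=100, tol=1e-6, seed=None):
    """Plain k-means on an (n, d) array. Returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # 1. Initialize: use the given centroids, or pick k distinct data points
    if init is None:
        centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    else:
        centroids = np.asarray(init, dtype=float)
    for _ in range(max_iter):
        # 2. Assign: each point joins the cluster of its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: each centroid moves to the mean of its assigned points
        #    (an empty cluster keeps its previous position)
        new = np.array([points[labels == i].mean(axis=0) if np.any(labels == i)
                        else centroids[i] for i in range(k)])
        # 4. Repeat until centroids stop moving (change below tol)
        if np.linalg.norm(new - centroids) < tol:
            return labels, new
        centroids = new
    return labels, centroids
```

For two well-separated blobs, `kmeans(X, 2)` recovers them and places each centroid at its blob's mean.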

How to Use This Demo

Basic Controls:

  • Adjust K slider - Change the number of clusters to see different groupings
  • Step Forward - Run one iteration of the algorithm
  • Final Results - Run until convergence
  • Reset Training - Reset algorithm state (keeps data points)
  • Generate New Data - Create fresh random data points

Interactive Features:

  • Add data points - Enable "Adding Data Points" and click anywhere on the plot
  • Drag centroids - Click and drag the solid circles to manually position centroids
  • Show distance lines - Toggle to always show connections between points and centroids
  • Hover effects - Hover over centroids to see their cluster connections

Visualization Guide:

  • Hollow circles: Data points, colored by cluster assignment
  • Solid circles: Centroids showing cluster centers
  • Colors: Each cluster has a unique color
  • Lines: Connect points to their assigned centroids

Understanding K-Means

Choosing K:

  • Try different values of K to see how cluster quality changes
  • Too few clusters may miss important patterns
  • Too many clusters may overfit to noise
  • In practice, use methods like the "elbow method" or silhouette analysis
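A rough sketch of the elbow method (assuming NumPy; `kmeans_inertia` is an illustrative helper, not part of the demo): run k-means for several values of K and watch where the WCSS curve stops dropping sharply.

```python
import numpy as np

def kmeans_inertia(X, k, n_restarts=5, seed=0):
    """Run k-means a few times from random starts; return the lowest WCSS found."""
    best = np.inf
    for r in range(n_restarts):
        rng = np.random.default_rng(seed + r)
        C = X[rng.choice(len(X), k, replace=False)]
        for _ in range(100):
            labels = np.linalg.norm(X[:, None] - C[None], axis=2).argmin(axis=1)
            newC = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                             else C[i] for i in range(k)])
            if np.allclose(newC, C):
                break
            C = newC
        labels = np.linalg.norm(X[:, None] - C[None], axis=2).argmin(axis=1)
        inertia = sum(np.sum((X[labels == i] - C[i]) ** 2) for i in range(k))
        best = min(best, inertia)
    return best

# Two well-separated blobs: WCSS falls sharply from K=1 to K=2, then flattens,
# so the "elbow" suggests K=2.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
curve = {k: kmeans_inertia(X, k) for k in range(1, 6)}
```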

Algorithm Behavior:

  • Initial centroid placement affects final results
  • K-means assumes roughly spherical clusters of similar size and variance; it struggles with elongated or irregular shapes
  • Sensitive to outliers - they can pull centroids away from natural centers
  • Always converges, but may find local optima

Practical Considerations:

  • Scale your features before applying k-means
  • Run multiple times with different initializations
  • Consider k-means++ for smarter initialization
  • Validate results with domain knowledge
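The k-means++ idea mentioned above can be sketched as follows (assuming NumPy; `kmeans_pp_init` is an illustrative name): pick the first centroid uniformly at random, then pick each subsequent one with probability proportional to its squared distance from the nearest centroid chosen so far, which spreads the initial centroids apart.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """k-means++ seeding: returns k initial centroids drawn from the data,
    each new one biased toward points far from the centroids picked so far."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = np.min(
            np.linalg.norm(X[:, None] - np.array(centroids)[None], axis=2) ** 2,
            axis=1,
        )
        # Sample the next centroid with probability proportional to d2
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```

With two far-apart groups of points, the two seeds almost surely land in different groups, which is exactly what uniform random initialization cannot guarantee.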

Experiment Ideas:

  • Add points in obvious clusters and see if k-means finds them
  • Try dragging centroids to suboptimal positions
  • Create elongated or irregular shaped clusters
  • Observe how outliers affect the clustering