K-Means Clustering

K-means is one of the most popular unsupervised machine learning algorithms for clustering data into groups. It partitions data into k clusters by finding centroids that minimize the within-cluster sum of squared distances.

This algorithm is widely used in:

  • Market segmentation - grouping customers by behavior
  • Image processing - color quantization and compression
  • Data mining - discovering patterns in large datasets
  • Civil engineering - analyzing spatial patterns in infrastructure data

The algorithm is iterative and guaranteed to converge, though not necessarily to the global optimum.

Mathematical Foundation

K-means minimizes the within-cluster sum of squared distances (WCSS):

J = Σᵢ₌₁ᵏ Σ_{x ∈ Cᵢ} ||x − μᵢ||²

Where:

  • k = number of clusters
  • Cᵢ = set of points in cluster i
  • μᵢ = centroid of cluster i
  • ||x − μᵢ||² = squared Euclidean distance between point x and centroid μᵢ
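
Given these definitions, the objective J is simple to compute once points have been assigned to clusters. The sketch below is a pure-Python illustration; the helper names and toy points are made up for this example:

```python
def squared_distance(x, mu):
    """Squared Euclidean distance ||x - mu||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def wcss(clusters, centroids):
    """J: for each cluster C_i with centroid mu_i, add up
    ||x - mu_i||^2 over every point x in C_i."""
    return sum(squared_distance(x, mu)
               for points, mu in zip(clusters, centroids)
               for x in points)

# Toy example: two 2-D clusters (values are illustrative).
clusters = [[(0.0, 0.0), (1.0, 0.0)], [(5.0, 5.0), (6.0, 5.0)]]
centroids = [(0.5, 0.0), (5.5, 5.0)]
print(wcss(clusters, centroids))  # each point is 0.25 away squared -> 1.0
```

Moving a centroid away from its cluster mean can only increase J, which is why the update step below uses the mean.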

Algorithm Steps:

  1. Initialize: Place k centroids randomly: μ₁, μ₂, ..., μₖ
  2. Assign: For each point xⱼ, assign it to the nearest centroid: cⱼ = argminᵢ ||xⱼ − μᵢ||²
  3. Update: Recalculate each centroid as the mean of its cluster: μᵢ = (1/|Cᵢ|) Σ_{x ∈ Cᵢ} x
  4. Repeat: Steps 2-3 until convergence (centroids stop moving)

Convergence: The algorithm converges when centroids no longer change position between iterations, or the change is below a threshold.
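
The four steps above (Lloyd's algorithm) can be sketched in plain Python. The data, seed, and tolerance below are illustrative choices, not values from the demo:

```python
import random

def kmeans(points, k, max_iter=100, seed=0, tol=1e-12):
    """Lloyd's algorithm: assign points to the nearest centroid, then
    recompute each centroid as its cluster mean, until centroids stop moving."""
    rng = random.Random(seed)
    # 1. Initialize: pick k distinct data points as starting centroids.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # 2. Assign: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(x, centroids[i])))
            clusters[nearest].append(x)
        # 3. Update: each centroid becomes the mean of its cluster
        #    (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coords) / len(pts) for coords in zip(*pts)) if pts
            else centroids[i]
            for i, pts in enumerate(clusters)
        ]
        # 4. Stop when the total centroid movement falls below tol.
        shift = sum(sum((a - b) ** 2 for a, b in zip(c_old, c_new))
                    for c_old, c_new in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, clusters

# Two well-separated blobs; the centroids should land near the blob means.
data = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centroids, clusters = kmeans(data, k=2)
```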

How to Use This Demo

Basic Controls:

  • Adjust K slider - Change the number of clusters to see different groupings
  • Step Forward - Run one iteration of the algorithm
  • Final Results - Run until convergence
  • Reset Training - Reset algorithm state (keeps data points)
  • Generate New Data - Create fresh random data points

Interactive Features:

  • Add data points - Enable "Adding Data Points" and click anywhere on the plot
  • Drag centroids - Click and drag the solid circles to manually position centroids
  • Show distance lines - Toggle to always show connections between points and centroids
  • Hover effects - Hover over centroids to see their cluster connections

Visualization Guide:

  • Hollow circles: Data points, colored by cluster assignment
  • Solid circles: Centroids showing cluster centers
  • Colors: Each cluster has a unique color
  • Lines: Connect points to their assigned centroids

Understanding K-Means

Choosing K:

  • Try different values of K to see how cluster quality changes
  • Too few clusters may miss important patterns
  • Too many clusters may overfit to noise
  • In practice, use methods like the "elbow method" or silhouette analysis
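
The elbow method can be sketched as follows: run k-means for several values of K and watch where the WCSS curve stops dropping steeply. This is a pure-Python illustration with a compact k-means and made-up three-blob data; in practice a library implementation would be used:

```python
import random

def run_kmeans(points, k, seed=0, iters=50):
    """Tiny 2-D k-means that returns the final WCSS for a given k."""
    rng = random.Random(seed)
    cents = rng.sample(points, k)
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in points:  # assign each point to its nearest centroid
            j = min(range(k), key=lambda i: (x[0] - cents[i][0]) ** 2
                                            + (x[1] - cents[i][1]) ** 2)
            groups[j].append(x)
        # recompute centroids (empty clusters keep their old centroid)
        cents = [(sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
                 if g else cents[i] for i, g in enumerate(groups)]
    return sum((x[0] - cents[i][0]) ** 2 + (x[1] - cents[i][1]) ** 2
               for i, g in enumerate(groups) for x in g)

def best_wcss(points, k, restarts=10):
    """Lowest WCSS over several random initializations."""
    return min(run_kmeans(points, k, seed=s) for s in range(restarts))

# Three obvious blobs (illustrative): WCSS falls steeply up to K = 3,
# then flattens -- that bend is the "elbow".
data = ([(0, 0), (1, 0), (0, 1)] + [(10, 0), (11, 0), (10, 1)]
        + [(5, 10), (6, 10), (5, 11)])
for k in range(1, 6):
    print(k, round(best_wcss(data, k), 2))
```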

Algorithm Behavior:

  • Initial centroid placement affects final results
  • K-means assumes roughly spherical clusters of similar variance; elongated or irregular shapes tend to be split poorly
  • Sensitive to outliers - they can pull centroids away from natural centers
  • Always converges, but may find local optima

Practical Considerations:

  • Scale your features before applying k-means
  • Run multiple times with different initializations
  • Consider k-means++ for smarter initialization
  • Validate results with domain knowledge
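
k-means++ spreads the initial centroids out: the first is chosen uniformly at random, and each later one is sampled with probability proportional to D(x)², the squared distance from x to the nearest centroid already chosen. A minimal sketch, with an illustrative function name and toy data:

```python
import random

def kmeans_pp_init(points, k, rng=None):
    """k-means++ seeding: sample each new centroid with probability
    proportional to its squared distance D(x)^2 from the nearest
    centroid chosen so far."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # D(x)^2 for every point.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids)
              for x in points]
        # Weighted sampling: walk the cumulative weights until we pass r.
        r = rng.random() * sum(d2)
        acc = 0.0
        for x, w in zip(points, d2):
            acc += w
            if acc >= r:
                centroids.append(x)
                break
    return centroids

# Far-apart points are heavily favored as the second centroid.
data = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans_pp_init(data, 2))
```

Because an already-chosen centroid has D(x)² = 0, it is never re-selected, so the seeds end up spread across the data.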

Experiment Ideas:

  • Add points in obvious clusters and see if k-means finds them
  • Try dragging centroids to suboptimal positions
  • Create elongated or irregular shaped clusters
  • Observe how outliers affect the clustering