K-Means Clustering
K-means is one of the most popular unsupervised machine learning algorithms for clustering data into groups. It partitions data into k clusters by finding centroids that minimize the within-cluster sum of squared distances.
This algorithm is widely used in:
- Market segmentation - grouping customers by behavior
- Image processing - color quantization and compression
- Data mining - discovering patterns in large datasets
- Civil engineering - analyzing spatial patterns in infrastructure data
The algorithm is iterative and guaranteed to converge, though not necessarily to the global optimum.
Mathematical Foundation
K-means minimizes the within-cluster sum of squared distances (WCSS):
$$J = \sum_{i=1}^{k} \sum_{x \in C_i} ||x - \mu_i||^2$$
Where:
- $k$ = number of clusters
- $C_i$ = set of points in cluster $i$
- $\mu_i$ = centroid of cluster $i$
- $||x - \mu_i||^2$ = squared Euclidean distance
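For instance, $J$ can be computed directly from the points, centroids, and cluster assignments. This is a minimal NumPy sketch (independent of the demo's own implementation):

```python
import numpy as np

def wcss(points, centroids, labels):
    """Within-cluster sum of squared distances J."""
    return sum(
        np.sum((points[labels == i] - mu) ** 2)
        for i, mu in enumerate(centroids)
    )

# Two tight pairs of points with a centroid at each pair's midpoint
points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
centroids = np.array([[0.5, 0.0], [10.5, 10.0]])
labels = np.array([0, 0, 1, 1])

print(wcss(points, centroids, labels))  # 1.0 (four terms of 0.25 each)
```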
Algorithm Steps:
1. Initialize: Place $k$ centroids randomly: $\mu_1, \mu_2, \ldots, \mu_k$
2. Assign: For each point $x_j$, assign it to the nearest centroid:
$$c_j = \arg\min_i ||x_j - \mu_i||^2$$
3. Update: Recalculate each centroid as the mean of its cluster:
$$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$
4. Repeat: Steps 2-3 until convergence (centroids stop moving)
Convergence: The algorithm converges when centroids no longer change position between iterations, or the change is below a threshold.
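The steps above can be sketched in a few lines of NumPy. This is a minimal illustration (Lloyd's algorithm), not the demo's actual implementation:

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    """Basic k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids as k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 4: stop once the centroids stop moving (change below tol)
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```

On the four-point example from the objective above, this converges to centroids at the two pair midpoints regardless of which points are drawn as the initial centroids.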
How to Use This Demo
Basic Controls:
- Adjust K slider - Change the number of clusters to see different groupings
- Step Forward - Run one iteration of the algorithm
- Final Results - Run until convergence
- Reset Training - Reset algorithm state (keeps data points)
- Generate New Data - Create fresh random data points
Interactive Features:
- Add data points - Enable "Adding Data Points" and click anywhere on the plot
- Drag centroids - Click and drag the solid circles to manually position centroids
- Show distance lines - Toggle to always show connections between points and centroids
- Hover effects - Hover over centroids to see their cluster connections
Visualization Guide:
- Hollow circles: Data points, colored by cluster assignment
- Solid circles: Centroids showing cluster centers
- Colors: Each cluster has a unique color
- Lines: Connect points to their assigned centroids
Understanding K-Means
Choosing K:
- Try different values of K to see how cluster quality changes
- Too few clusters may miss important patterns
- Too many clusters may overfit to noise
- In practice, use methods like the "elbow method" or silhouette analysis
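The elbow method compares the WCSS (which scikit-learn exposes as `inertia_`) across candidate values of K. A sketch on synthetic data, assuming scikit-learn is available (not part of the demo itself):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in (0.0, 5.0, 10.0)])

inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_  # WCSS for this k
# Plotting inertias vs. k shows a sharp drop up to k=3, then only
# marginal gains - the "elbow" matches the three true blobs here.
```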
Algorithm Behavior:
- Initial centroid placement affects final results
- K-means assumes roughly spherical clusters of similar variance, so elongated or irregular groups are split poorly
- Sensitive to outliers - they can pull centroids away from natural centers
- Always converges, but possibly to a local optimum rather than the global one
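Initialization sensitivity can be demonstrated directly by seeding k-means badly. This scikit-learn sketch (independent of the demo) plants two seeds inside one blob and one seed between the other two, which converges to a poor local optimum:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
blobs = [rng.normal(c, 0.3, size=(30, 2)) for c in (0.0, 5.0, 10.0)]
X = np.vstack(blobs)

# Bad start: two centroids in the first blob, one stuck between the others.
# The middle seed keeps both remaining blobs; the first blob gets split.
bad_init = np.vstack([X[:2], [[7.5, 7.5]]])
bad = KMeans(n_clusters=3, init=bad_init, n_init=1).fit(X)

# Good start: one centroid per blob
good_init = np.array([b[0] for b in blobs])
good = KMeans(n_clusters=3, init=good_init, n_init=1).fit(X)

# bad.inertia_ is far larger than good.inertia_: both runs converged,
# but the bad initialization converged to a local optimum
```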
Practical Considerations:
- Scale your features before applying k-means
- Run multiple times with different initializations
- Consider k-means++ for smarter initialization
- Validate results with domain knowledge
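Feature scaling and k-means++ initialization can be combined in one short scikit-learn pipeline. A sketch with invented data, where two groups differ only in a feature whose raw scale is dwarfed by another:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 100
# Feature 0 has a huge scale and carries no group signal; feature 1
# separates the groups but is tiny by comparison, so raw Euclidean
# distance is dominated by feature 0.
f0 = rng.normal(0.0, 1000.0, size=2 * n)
f1 = np.concatenate([rng.normal(0.0, 1.0, n),     # group A
                     rng.normal(100.0, 1.0, n)])  # group B
X = np.column_stack([f0, f1])

X_scaled = StandardScaler().fit_transform(X)       # zero mean, unit variance
labels = KMeans(n_clusters=2, init="k-means++",    # smarter initialization
                n_init=10, random_state=0).fit_predict(X_scaled)
# After scaling, the clustering recovers the two groups cleanly
```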
Experiment Ideas:
- Add points in obvious clusters and see if k-means finds them
- Try dragging centroids to suboptimal positions
- Create elongated or irregularly shaped clusters
- Observe how outliers affect the clustering