K-Means Clustering
K-means is one of the most popular unsupervised machine learning algorithms. It partitions data into k clusters by finding centroids that minimize the within-cluster sum of squared distances.
This algorithm is widely used in:
- Market segmentation - grouping customers by behavior
- Image processing - color quantization and compression
- Data mining - discovering patterns in large datasets
- Civil engineering - analyzing spatial patterns in infrastructure data
The algorithm is iterative and guaranteed to converge, though not necessarily to the global optimum.
Mathematical Foundation
K-means minimizes the within-cluster sum of squared distances (WCSS):

$$J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

Where:
- $k$ = number of clusters
- $C_i$ = set of points assigned to cluster $i$
- $\mu_i$ = centroid of cluster $i$
- $\lVert x - \mu_i \rVert^2$ = squared Euclidean distance between point $x$ and centroid $\mu_i$
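As a concrete illustration, the WCSS objective can be computed directly with NumPy. This is a minimal sketch; the function name `wcss` and the toy data are ours, not part of the demo:

```python
import numpy as np

def wcss(points, labels, centroids):
    """Within-cluster sum of squared distances (the k-means objective J)."""
    return sum(
        np.sum((points[labels == i] - c) ** 2)
        for i, c in enumerate(centroids)
    )

# Two tiny clusters, each centroid at its cluster's mean
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 0.5], [10.0, 10.5]])
print(wcss(points, labels, centroids))  # each cluster contributes 0.5 -> 1.0
```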
Algorithm Steps:
1. Initialize: Place $k$ centroids $\mu_1, \dots, \mu_k$ at random positions
2. Assign: For each point $x$, assign it to the nearest centroid: $c(x) = \arg\min_i \lVert x - \mu_i \rVert^2$
3. Update: Recalculate each centroid as the mean of its cluster: $\mu_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x$
4. Repeat: Steps 2-3 until convergence (centroids stop moving)
Convergence: The algorithm converges when centroids no longer change position between iterations, or the change is below a threshold.
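The steps above can be sketched in a few lines of NumPy. This is an illustrative from-scratch version, not the demo's actual implementation; names like `kmeans` and the tolerance parameter are our choices:

```python
import numpy as np

def kmeans(points, k, max_iters=100, tol=1e-6, seed=0):
    """Plain k-means: random init, then assign/update until centroids settle."""
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick k distinct data points as starting centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # 2. Assign: each point goes to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # 4. Stop when centroids stop moving (change below tolerance)
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids

# Two tight, well-separated blobs: k-means should recover them exactly
rng = np.random.default_rng(1)
blob_a = rng.normal([0, 0], 0.1, size=(20, 2))
blob_b = rng.normal([5, 5], 0.1, size=(20, 2))
data = np.vstack([blob_a, blob_b])
labels, centroids = kmeans(data, k=2)
```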
How to Use This Demo
Basic Controls:
- Adjust K slider - Change the number of clusters to see different groupings
- Step Forward - Run one iteration of the algorithm
- Final Results - Run until convergence
- Reset Training - Reset algorithm state (keeps data points)
- Generate New Data - Create fresh random data points
Interactive Features:
- Add data points - Enable "Adding Data Points" and click anywhere on the plot
- Drag centroids - Click and drag the solid circles to manually position centroids
- Show distance lines - Toggle to always show connections between points and centroids
- Hover effects - Hover over centroids to see their cluster connections
Visualization Guide:
- Hollow circles: Data points, colored by cluster assignment
- Solid circles: Centroids showing cluster centers
- Colors: Each cluster has a unique color
- Lines: Connect points to their assigned centroids
Understanding K-Means
Choosing K:
- Try different values of K to see how cluster quality changes
- Too few clusters may miss important patterns
- Too many clusters may overfit to noise
- In practice, use methods like the "elbow method" or silhouette analysis
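The elbow method can be sketched with scikit-learn (an assumption; the demo does not specify a library). WCSS always decreases as K grows, so you look for the K where the drop levels off, here three synthetic blobs, so the elbow should appear around K = 3:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic, well-separated blobs
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in ([0, 0], [4, 0], [2, 3])])

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(km.inertia_)  # inertia_ is sklearn's name for WCSS
    print(k, round(km.inertia_, 1))
```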
Algorithm Behavior:
- Initial centroid placement affects final results
- K-means assumes roughly spherical clusters of similar variance, so it struggles with elongated or irregular shapes
- Sensitive to outliers - they can pull centroids away from natural centers
- Always converges, but may find local optima
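The outlier sensitivity is easy to see numerically: because a centroid is the arithmetic mean of its cluster, a single extreme point drags it far from the natural center. A small sketch with made-up data:

```python
import numpy as np

# Four points forming a tight unit square, plus one extreme outlier
cluster = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
outlier = np.array([[100.0, 100.0]])
points = np.vstack([cluster, outlier])

centroid_clean = cluster.mean(axis=0)   # [0.5, 0.5] -- the natural center
centroid_pulled = points.mean(axis=0)   # [20.4, 20.4] -- dragged far away
print(centroid_clean, centroid_pulled)
```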
Practical Considerations:
- Scale your features before applying k-means
- Run multiple times with different initializations
- Consider k-means++ for smarter initialization
- Validate results with domain knowledge
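The first three points can be combined in a few lines with scikit-learn (again an assumption; the demo itself names no library). The dataset below is invented for illustration: income and age live on wildly different scales, so without standardization income would dominate the distance calculation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Three hypothetical customer segments: (income, age) on very different scales
rng = np.random.default_rng(0)
X = np.vstack([
    np.column_stack([rng.normal(30_000, 3_000, 50), rng.normal(25, 3, 50)]),
    np.column_stack([rng.normal(60_000, 3_000, 50), rng.normal(45, 3, 50)]),
    np.column_stack([rng.normal(90_000, 3_000, 50), rng.normal(65, 3, 50)]),
])

# Scale features to zero mean / unit variance before clustering
X_scaled = StandardScaler().fit_transform(X)

# k-means++ init plus multiple restarts guards against bad local optima
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X_scaled)
print(km.labels_[:10])
```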
Experiment Ideas:
- Add points in obvious clusters and see if k-means finds them
- Try dragging centroids to suboptimal positions
- Create elongated or irregular shaped clusters
- Observe how outliers affect the clustering