import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
Segmentation: Creating a K-Means Function
In this post I will briefly explain k-means and attempt to build my own k-means function for use in clustering and segmentation.
1. Background on K-Means and Clustering
K-means is a popular method in unsupervised machine learning used to classify data points into distinct groups. K-means partitions data points into some \(K\) number of distinct, non-overlapping subsets or clusters so as to minimize the variation within each cluster. It is typically used for its simplicity and efficiency in processing large data sets: its computational complexity is generally \(O(nkt)\), where \(n\) is the number of data points, \(k\) is the number of clusters, and \(t\) is the number of iterations. This makes k-means faster than other clustering algorithms such as hierarchical clustering, especially as data size grows.
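To make that concrete: with \(n = 10{,}000\) data points, \(k = 4\) clusters, and \(t = 25\) iterations, \(O(nkt)\) works out to roughly \(10{,}000 \times 4 \times 25 = 1{,}000{,}000\) point-to-centroid distance computations, which modern hardware handles easily.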
K-means is used in a variety of business applications such as market segmentation, document clustering, and image segmentation. However, it has some limitations that need to be taken into account. First, k-means assumes clusters are spherical and equally sized, which can distort cluster assignments in real-world data and change how points are assigned to each centroid. Additionally, it requires specifying \(k\) in advance, which is not always intuitive; however, we can use metrics such as the within-cluster sum of squares and the silhouette score to choose a sensible value of \(k\). Finally, the algorithm is sensitive to the initial selection of centroids, which can lead to different clustering results across runs.
2. Steps in Creating a K-Means Function
To create a k-means algorithm from scratch, we will follow four steps:
- Initialization
- Assignment
- Update
- Iterate and Convergence
1. Initialization: Start by choosing \(k\) initial centroids
The first step in creating a k-means function is initialization. Suppose we have \(n\) points on an x-y plane, where each data point \(i\) corresponds to a specific row in a dataset. We can select \(k\) random points to serve as the initial centroids for the k-means algorithm. Random initialization helps spread the centroids out initially, though it can affect the convergence and the final outcome, making the algorithm sensitive to the initial positions.
2. Assignment: Assign each data point to the nearest centroid
We then assign each data point to the closest centroid based on Euclidean distance. For a point \((x_i, y_i)\) and a centroid \((c_x, c_y)\): \[ \text{Euclidean Distance} = \sqrt{(x_i - c_x)^2 + (y_i - c_y)^2} \]
The points will be clustered around the centroid that minimizes this distance value, forming clusters of points based on proximity. For a dataset with more features than can fit into two dimensions (i.e., x and y), we simply add a term to the distance formula for each additional feature. So, for a dataset with three columns X, Y, and Z, the distance between a point and a centroid \((c_x, c_y, c_z)\) is: \[ \text{Distance} = \sqrt{(x_i - c_x)^2 + (y_i - c_y)^2 + (z_i - c_z)^2} \]
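As a quick sanity check, here is a minimal NumPy sketch (using made-up coordinates) of this distance computation; the same line of code works for any number of dimensions:
import numpy as np

# A single data point and a candidate centroid in 3 dimensions (x, y, z)
point = np.array([1.0, 2.0, 3.0])
centroid = np.array([4.0, 6.0, 3.0])

# Square the coordinate-wise differences, sum them, and take the square root
distance = np.sqrt(((point - centroid) ** 2).sum())
print(distance)  # 5.0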
3. Update: Recalculate centroids as the mean of all points assigned to each cluster
After all points have been assigned to centroids, we recalculate each centroid’s position as the mean (average) of all points assigned to that centroid. This adjustment of centroids to the center of their respective clusters is what drives down the total variance within each cluster.
4. Iterate and Convergence: Repeat the assignment and update steps until the centroids do not change or the changes are below a certain threshold.
After updating the centroids, the shift in their positions (measured as the Euclidean distance between the old and new positions) is calculated. If the maximum shift across all centroids is less than a predetermined tolerance level, the algorithm stops, assuming it has reached convergence. This check prevents endless adjustments when centroids have effectively stabilized.
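The k_means function later in this post uses an exact-equality check between successive centroid positions; a tolerance-based variant could look like the sketch below (has_converged and tol are names I am assuming here, not part of the code that follows):
import numpy as np

def has_converged(old_centroids, new_centroids, tol=1e-4):
    # Maximum Euclidean shift of any single centroid between iterations
    shifts = np.sqrt(((new_centroids - old_centroids) ** 2).sum(axis=1))
    return shifts.max() < tol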
3. Creating a K-Means Function
Initialize Centroids:
def initialize_centroids(data_np, k):
    # Pick k distinct rows at random to serve as the starting centroids
    indices = np.random.choice(data_np.shape[0], size=k, replace=False)
    centroids = data_np[indices]

    plt.scatter(data_np[:, 0], data_np[:, 1], color='gray', alpha=0.5, label='Data Points')
    plt.scatter(centroids[:, 0], centroids[:, 1], color='red', s=200, alpha=0.5, label='Initial Centroids')
    plt.title('Initialization of Centroids')
    plt.legend()
    plt.show()
    return centroids
Here we randomly select some \(k\) number of centroids from our data points and label them on our graph.
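As a quick usage sketch (on made-up 2-D data), seeding NumPy first makes the random draw reproducible:
np.random.seed(42)                 # make the random centroid draw reproducible
toy_data = np.random.rand(100, 2)  # 100 random 2-D points as a stand-in dataset
initial = initialize_centroids(toy_data, k=3)
print(initial.shape)               # (3, 2): one row per centroid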
Assign Clusters:
def assign_clusters(data_np, centroids):
    # Pairwise Euclidean distances between every centroid and every point
    distances = np.sqrt(((data_np - centroids[:, np.newaxis])**2).sum(axis=2))
    # Index of the closest centroid for each point
    nearest_centroids = np.argmin(distances, axis=0)

    plt.scatter(data_np[:, 0], data_np[:, 1], c=nearest_centroids, s=50, cmap='viridis', label='Clusters')
    plt.scatter(centroids[:, 0], centroids[:, 1], s=200, color='red', alpha=0.5, label='Centroids')
    plt.title('Assignment to Nearest Centroid')
    plt.legend()
    plt.show()
    return nearest_centroids
We now assign each point to its nearest centroid, minimizing the Euclidean distance computed above.
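The broadcasting inside assign_clusters can be hard to follow, so here is a small sketch (with toy shapes) showing what each intermediate array looks like:
import numpy as np

data_np = np.random.rand(6, 2)    # 6 points, 2 features
centroids = np.random.rand(3, 2)  # 3 centroids, 2 features

# centroids[:, np.newaxis] has shape (3, 1, 2); subtracting the (6, 2) data
# broadcasts to (3, 6, 2): every centroid paired with every point
diffs = data_np - centroids[:, np.newaxis]
distances = np.sqrt((diffs ** 2).sum(axis=2))  # shape (3, 6)
labels = np.argmin(distances, axis=0)          # shape (6,): nearest centroid per point
print(diffs.shape, distances.shape, labels.shape)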
Update Centroids:
def update_centroids(data_np, nearest_centroids, k):
    # New centroid j = mean of all points currently assigned to cluster j
    new_centroids = np.array([data_np[nearest_centroids == j].mean(axis=0) for j in range(k)])

    plt.scatter(data_np[:, 0], data_np[:, 1], c=nearest_centroids, s=50, cmap='viridis', label='Clusters')
    plt.scatter(new_centroids[:, 0], new_centroids[:, 1], s=200, color='blue', alpha=0.5, label='Updated Centroids')
    plt.title('Update Centroids')
    plt.legend()
    plt.show()
    return new_centroids
We now update each centroid, and thereby the cluster assignments, based on the mean of its assigned points.
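One caveat: if a cluster ends up with no assigned points, .mean(axis=0) on the empty selection returns NaN (with a runtime warning). A common workaround, sketched below under my own naming (update_centroids_safe is not part of this post's pipeline), is to keep the previous centroid when its cluster is empty:
import numpy as np

def update_centroids_safe(data_np, nearest_centroids, k, old_centroids):
    # Recompute each centroid as the mean of its assigned points; if a
    # cluster is empty, fall back to its previous position instead of NaN
    new_centroids = old_centroids.copy()
    for j in range(k):
        members = data_np[nearest_centroids == j]
        if len(members) > 0:
            new_centroids[j] = members.mean(axis=0)
    return new_centroids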
K-Means:
def k_means(data, k, max_iters=100):
    data_np = data.values
    centroids = initialize_centroids(data_np, k)
    trajectory = [centroids.copy()]  # To store centroid positions at each iteration

    for i in range(max_iters):
        nearest_centroids = assign_clusters(data_np, centroids)
        centroids = update_centroids(data_np, nearest_centroids, k)
        trajectory.append(centroids.copy())
        # Stop once the centroids are exactly unchanged between iterations
        if np.all(trajectory[-1] == trajectory[-2]):
            break
    return centroids, nearest_centroids, trajectory
Movement of our Centroids:
def plot_centroid_movements(data, trajectory):
    data_np = data.values
    plt.figure(figsize=(10, 8))
    plt.scatter(data_np[:, 0], data_np[:, 1], color='gray', alpha=0.5, label='Data Points')

    # Plotting each centroid's trajectory
    colors = ['red', 'blue', 'green', 'purple']
    for i, trace in enumerate(np.array(trajectory).transpose(1, 0, 2)):
        # Plot line with 'o' markers for centroids
        plt.plot(trace[:, 0], trace[:, 1], '-o', color=colors[i % len(colors)], label=f'Centroid {i+1}')

        # Adding annotations for each centroid in each iteration
        for idx, point in enumerate(trace):
            plt.annotate(str(idx),                   # this will be the iteration number
                         (point[0], point[1]),       # this specifies the position
                         textcoords="offset points", # how to position the text
                         xytext=(0, 10),             # distance from text to points (x,y)
                         ha='center',                # horizontal alignment can be left, right or center
                         color=colors[i % len(colors)])

    plt.title('Movement of Centroids Over Iterations')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.legend()
    plt.show()
Here we create a plot to track the movement of our centroids across iterations.
4. Running our K-Means
penguins = sns.load_dataset('penguins')
# Check for missing values
print(penguins.isnull().sum())
# Drop rows with missing values (optional, based on how you want to handle missing data)
penguins.dropna(inplace=True)
penguins_numerical = penguins[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64
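One caveat worth flagging: these four columns sit on very different scales (body mass is in grams, the bill measurements in millimeters), so the Euclidean distances will be dominated by body_mass_g. This post clusters the raw values, but if you wanted comparable scales, a standardization sketch might look like:
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Rescale each feature to zero mean and unit variance so no single column
# (e.g., body_mass_g, measured in grams) dominates the distance computation
scaler = StandardScaler()
penguins_scaled = pd.DataFrame(scaler.fit_transform(penguins_numerical),
                               columns=penguins_numerical.columns)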
start_time = time.time()
centroids, labels, trajectory = k_means(penguins_numerical, k=4)
plot_centroid_movements(penguins_numerical, trajectory)
my_time = time.time() - start_time
5. Evaluating the Optimal Number of Clusters
To evaluate the optimal number of clusters for k-means clustering, we can calculate both the Within-Cluster Sum of Squares (WCSS) and the Silhouette Score for various values of \(k\). These metrics provide insight into the quality of the clustering process, with WCSS indicating how compact the clusters are and Silhouette Score measuring how well-separated the clusters are.
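Formally, for clusters \(C_1, \dots, C_k\) with centroids \(\mu_1, \dots, \mu_k\), WCSS (which scikit-learn exposes as inertia_) is the total squared distance from each point to its assigned centroid: \[ \text{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 \]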
def calculate_metrics(data, k_range):
    wcss = []
    silhouette_scores = []

    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=0)
        labels = kmeans.fit_predict(data)

        # Calculate WCSS (Inertia)
        wcss.append(kmeans.inertia_)

        # Calculate Silhouette Score
        if k > 1:  # Silhouette score requires at least 2 clusters
            score = silhouette_score(data, labels)
            silhouette_scores.append(score)
        else:
            silhouette_scores.append(None)

    return wcss, silhouette_scores
def plot_metrics(k_range, wcss, silhouette_scores):
    fig, ax1 = plt.subplots()

    # WCSS Plot
    color = 'tab:red'
    ax1.set_xlabel('Number of Clusters (k)')
    ax1.set_ylabel('WCSS', color=color)
    ax1.plot(k_range, wcss, marker='o', color=color)
    ax1.tick_params(axis='y', labelcolor=color)

    # Silhouette Score Plot
    ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
    color = 'tab:blue'
    ax2.set_ylabel('Silhouette Score', color=color)
    # Skip the first entry, which may be None when k_range starts at 1
    ax2.plot(k_range[1:], silhouette_scores[1:], marker='o', color=color)
    ax2.tick_params(axis='y', labelcolor=color)

    plt.title('WCSS and Silhouette Score for Different k')
    plt.show()
k_range = range(2, 8)
wcss, silhouette_scores = calculate_metrics(penguins_numerical, k_range)
plot_metrics(k_range, wcss, silhouette_scores)
In the graph above, the red line represents the WCSS score. WCSS measures the total squared distance between each point in a cluster and the centroid of that cluster, and the goal in k-means clustering is to minimize it. We can use the elbow method: choose the \(k\) at which WCSS begins to decrease at a slower rate. This point, known as the "elbow point," indicates that adding more clusters beyond it does not model the data much better. In the graph, there is a noticeable elbow around k=4, suggesting that increasing the number of clusters from 3 to 4 significantly decreases WCSS, but further increases in k (to 5, 6, or 7) yield smaller reductions.
The Silhouette Score is represented by the blue line and is a measure of how similar an object is to its own cluster compared to other clusters. The value ranges from -1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. A higher silhouette score means that the clusters are well separated and clearly defined. From the graph, the silhouette score generally decreases as k increases. The highest score is at k=3, suggesting that at this level, the clusters are most distinct from each other compared to higher values of k.
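For reference, the silhouette score of a single point \(i\) is \[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \] where \(a(i)\) is the mean distance from \(i\) to the other points in its own cluster and \(b(i)\) is the mean distance from \(i\) to the points of the nearest other cluster; the reported score is the average of \(s(i)\) over all points.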
6. Comparing our Custom K-Means to the Built-in Function
We are now going to compare our k_means function to the built-in implementation from scikit-learn.
def plot_clusters(data, labels, centroids):
    # Convert DataFrame to numpy array if necessary
    if isinstance(data, pd.DataFrame):
        data = data.values
    plt.figure(figsize=(10, 8))
    plt.scatter(data[:, 0], data[:, 1], c=labels, s=50, cmap='viridis', alpha=0.7, label='Cluster Points')
    plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=200, marker='X', label='Centroids')
    plt.title('Cluster Visualization with Centroids')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.legend()
    plt.show()
k = 4
kmeans = KMeans(n_clusters=k, random_state=0)
labels1 = kmeans.fit_predict(penguins_numerical)
centroids1 = kmeans.cluster_centers_
plot_clusters(penguins_numerical, labels1, centroids1)
# Start the timer
start_time = time.time()

# Create and fit the KMeans model
kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(penguins_numerical)

# End the timer
sklearn_time = time.time() - start_time
print("Execution Time for Scikit-learn's KMeans: ", sklearn_time)
print("Execution Time for My k_means: ", my_time)
Execution Time for Scikit-learn's KMeans: 0.005724906921386719
Execution Time for My k_means: 2.431870937347412
Comparing the built-in k-means function to our own, we see no noticeable difference in the cluster assignments. However, there is a large gap in execution time. Much of this is expected: our function redraws a plot on every assignment and update step and runs as interpreted Python, while scikit-learn's KMeans is heavily optimized compiled code. In terms of raw execution speed, the built-in library clearly outperforms our implementation.
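A fairer timing comparison would strip the plotting out of the custom loop. The sketch below (k_means_noplot is my own name, reusing only the NumPy math from the functions above) gives an idea of what that would look like:
import numpy as np
import time

def k_means_noplot(data_np, k, max_iters=100, tol=1e-4):
    # Same algorithm as k_means above, minus all plotting
    indices = np.random.choice(data_np.shape[0], size=k, replace=False)
    centroids = data_np[indices]
    for _ in range(max_iters):
        distances = np.sqrt(((data_np - centroids[:, np.newaxis]) ** 2).sum(axis=2))
        labels = np.argmin(distances, axis=0)
        new_centroids = np.array([data_np[labels == j].mean(axis=0) for j in range(k)])
        # Stop once no centroid moves more than tol between iterations
        if np.sqrt(((new_centroids - centroids) ** 2).sum(axis=1)).max() < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels

start = time.time()
k_means_noplot(penguins_numerical.values, k=4)
print("Plot-free custom k-means:", time.time() - start)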