Example of K-Means Clustering in Python

K-Means Clustering is an Unsupervised Learning algorithm that can be used to find groups within unlabeled data. To demonstrate it, we’ll review a simple example of K-Means Clustering in Python.

Topics to be covered:

  • Creating a DataFrame for a two-dimensional dataset
  • Finding the centroids of 3 clusters, and then of 4 clusters

Example of K-Means Clustering in Python

To start, let’s review a simple example with the following two-dimensional dataset:

import pandas as pd

data = {'x': [25,34,22,27,33,33,31,22,35,34,67,54,57,43,50,57,59,52,65,47,49,48,35,33,44,45,38,43,51,46],
        'y': [79,51,53,78,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,25,20,14,12,20,5,29,27,8,7]
       }
  
df = pd.DataFrame(data, columns=['x', 'y'])
print(df)

Run the code in Python, and you’ll get the following DataFrame:

     x   y
0   25  79
1   34  51
2   22  53
3   27  78
4   33  59
5   33  74
6   31  73
7   22  57
8   35  69
9   34  75
10  67  51
11  54  32
12  57  40
13  43  47
14  50  53
15  57  36
16  59  35
17  52  58
18  65  59
19  47  50
20  49  25
21  48  20
22  35  14
23  33  12
24  44  20
25  45   5
26  38  29
27  43  27
28  51   8
29  46   7

Next, you’ll see how to use sklearn to find the centroids of 3 clusters, and then of 4 clusters.

K-Means Clustering in Python – 3 clusters

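Once you have created the DataFrame based on the above data, you’ll need to import 2 additional Python modules:

  • matplotlib (pyplot), to plot the observations and the centroids
  • sklearn (KMeans), to apply the clustering

In the code below, you can specify the number of clusters. For this example, assign 3 clusters as follows:

KMeans(n_clusters=3).fit(df)

The full Python code for 3 clusters would look like this: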

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

data = {'x': [25,34,22,27,33,33,31,22,35,34,67,54,57,43,50,57,59,52,65,47,49,48,35,33,44,45,38,43,51,46],
        'y': [79,51,53,78,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,25,20,14,12,20,5,29,27,8,7]
       }
  
df = pd.DataFrame(data, columns=['x', 'y'])
  
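# Fit K-Means with 3 clusters and retrieve the cluster centers
kmeans = KMeans(n_clusters=3).fit(df)
centroids = kmeans.cluster_centers_
print(centroids)

# Color each observation by its cluster label, then overlay the centroids in red
plt.scatter(df['x'], df['y'], c=kmeans.labels_.astype(float), s=50, alpha=0.5)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50)
plt.show()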

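Run the code in Python, and you’ll see 3 clusters with 3 distinct centroids (the order of the rows may differ from run to run, since the initial centroids are chosen randomly):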

[[29.6  66.8]
 [43.2  16.7]
 [55.1  46.1]]

Note that the center of each cluster represents the mean of all the observations that belong to that cluster.
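
To check this property on the dataset above, here is a minimal sketch that averages the observations of each cluster and compares the result with cluster_centers_. The n_init and random_state arguments are not part of the original example; they are assumptions added here only to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

data = {'x': [25,34,22,27,33,33,31,22,35,34,67,54,57,43,50,57,59,52,65,47,49,48,35,33,44,45,38,43,51,46],
        'y': [79,51,53,78,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,25,20,14,12,20,5,29,27,8,7]
       }

df = pd.DataFrame(data, columns=['x', 'y'])

# n_init and random_state are assumptions, used only for repeatable results
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df)

# Average the x and y values of the observations assigned to each cluster label
cluster_means = df.groupby(kmeans.labels_).mean()
print(cluster_means)

# Compare the per-cluster means with the centroids reported by sklearn
print(np.allclose(cluster_means.to_numpy(), kmeans.cluster_centers_))
```

The rows of cluster_means are sorted by cluster label, so they line up with the rows of cluster_centers_.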

Additionally, each observation is closer to the center of its own cluster than to the centers of the other clusters.
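
The nearest-center property can be checked in a similar way. The sketch below assumes Euclidean distance, computes the distance from every observation to every centroid with NumPy, and compares the index of the closest centroid with the label assigned by KMeans. As before, n_init and random_state are assumptions added only for repeatability.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

data = {'x': [25,34,22,27,33,33,31,22,35,34,67,54,57,43,50,57,59,52,65,47,49,48,35,33,44,45,38,43,51,46],
        'y': [79,51,53,78,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,25,20,14,12,20,5,29,27,8,7]
       }

df = pd.DataFrame(data, columns=['x', 'y'])

# n_init and random_state are assumptions, used only for repeatable results
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df)

points = df.to_numpy(dtype=float)      # shape (30, 2)
centroids = kmeans.cluster_centers_    # shape (3, 2)

# Euclidean distance from every observation to every centroid, shape (30, 3)
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Index of the closest centroid for each observation
nearest = distances.argmin(axis=1)

# Check whether the closest centroid matches the label assigned by KMeans
print(np.array_equal(nearest, kmeans.labels_))
```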

K-Means Clustering in Python – 4 clusters

Let’s now see what happens if you use 4 clusters instead. In that case, you’ll need to change n_clusters from 3 to 4:

KMeans(n_clusters=4).fit(df)

The full Python code for 4 clusters would look like this:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

data = {'x': [25,34,22,27,33,33,31,22,35,34,67,54,57,43,50,57,59,52,65,47,49,48,35,33,44,45,38,43,51,46],
        'y': [79,51,53,78,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,25,20,14,12,20,5,29,27,8,7]
       }
  
df = pd.DataFrame(data, columns=['x', 'y'])
  
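# Fit K-Means with 4 clusters and retrieve the cluster centers
kmeans = KMeans(n_clusters=4).fit(df)
centroids = kmeans.cluster_centers_
print(centroids)

# Color each observation by its cluster label, then overlay the centroids in red
plt.scatter(df['x'], df['y'], c=kmeans.labels_.astype(float), s=50, alpha=0.5)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50)
plt.show()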

Run the code, and you’ll now see 4 clusters with 4 distinct centroids:

[[27.75       55.        ]
 [43.2        16.7       ]
 [55.1        46.1       ]
 [30.83333333 74.66666667]]
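
Two of these centroids are the same as in the 3-cluster result, which suggests that only the remaining cluster was split in two. One way to look into this is to count the observations per cluster. The sketch below is not part of the original example: n_init and random_state are assumptions added for repeatability, and sizes is just an illustrative variable name.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

data = {'x': [25,34,22,27,33,33,31,22,35,34,67,54,57,43,50,57,59,52,65,47,49,48,35,33,44,45,38,43,51,46],
        'y': [79,51,53,78,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,25,20,14,12,20,5,29,27,8,7]
       }

df = pd.DataFrame(data, columns=['x', 'y'])

# n_init and random_state are assumptions, used only for repeatable results
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(df)

# Number of observations assigned to each of the 4 cluster labels
sizes = np.bincount(kmeans.labels_, minlength=4)

# Print each centroid next to the size of its cluster
for label, (center, size) in enumerate(zip(kmeans.cluster_centers_, sizes)):
    print(label, center.round(2), size)
```

If the run reproduces the centroids shown above, the cluster sizes should work out to 4, 6, 10 and 10: the 10 observations around (29.6, 66.8) are divided into a group of 4 and a group of 6.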

That’s it. You can learn more about the application of K-Means Clustering in Python by visiting the sklearn documentation.