Example of K-Means Clustering in Python

K-Means Clustering is a concept that falls under Unsupervised Learning. This algorithm can be used to find groups within unlabeled data. To demonstrate this concept, we’ll review a simple example of K-Means Clustering in Python.

Topics to be covered:

  • Creating a DataFrame for a two-dimensional dataset
  • Finding the centroids of 3 clusters, and then of 4 clusters

Creating the DataFrame

To start, let’s review a simple example with the following two-dimensional dataset:

import pandas as pd

data = {
    'x': [25, 34, 22, 27, 33, 33, 31, 22, 35, 34, 67, 54, 57, 43, 50, 57, 59, 52, 65, 47, 49, 48, 35, 33, 44, 45, 38,
          43, 51, 46],
    'y': [79, 51, 53, 78, 59, 74, 73, 57, 69, 75, 51, 32, 40, 47, 53, 36, 35, 58, 59, 50, 25, 20, 14, 12, 20, 5, 29, 27,
          8, 7]
    }

df = pd.DataFrame(data)
print(df)

Run the code in Python, and you’ll get the following DataFrame:

     x   y
0   25  79
1   34  51
2   22  53
3   27  78
4   33  59
5   33  74
6   31  73
7   22  57
8   35  69
9   34  75
10  67  51
11  54  32
12  57  40
13  43  47
14  50  53
15  57  36
16  59  35
17  52  58
18  65  59
19  47  50
20  49  25
21  48  20
22  35  14
23  33  12
24  44  20
25  45   5
26  38  29
27  43  27
28  51   8
29  46   7
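
If you’d like to take a quick look at the raw data before running any clustering, a minimal sketch (assuming matplotlib is installed, and reusing the df created above) could look like this:

import matplotlib.pyplot as plt

# Plot the raw, unlabeled observations to get a feel for the data
plt.scatter(df['x'], df['y'], s=50, alpha=0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.show()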

Next, you’ll see how to use sklearn to find the centroids of 3 clusters, and then of 4 clusters.

K-Means Clustering in Python – 3 clusters

Once you’ve created the DataFrame based on the above data, you’ll need to import two additional Python modules: matplotlib (for plotting) and KMeans from sklearn.cluster.

In the code below, you can specify the number of clusters. For this example, assign 3 clusters as follows:

KMeans(n_clusters=3).fit(df)

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

data = {
    'x': [25, 34, 22, 27, 33, 33, 31, 22, 35, 34, 67, 54, 57, 43, 50, 57, 59, 52, 65, 47, 49, 48, 35, 33, 44, 45, 38,
          43, 51, 46],
    'y': [79, 51, 53, 78, 59, 74, 73, 57, 69, 75, 51, 32, 40, 47, 53, 36, 35, 58, 59, 50, 25, 20, 14, 12, 20, 5, 29, 27,
          8, 7]
    }

df = pd.DataFrame(data)

kmeans = KMeans(n_clusters=3).fit(df)  # fit K-Means with 3 clusters
centroids = kmeans.cluster_centers_  # coordinates of the 3 cluster centers
print(centroids)

plt.scatter(df['x'], df['y'], c=kmeans.labels_.astype(float), s=50, alpha=0.5)  # color each point by its assigned cluster
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50)  # mark the centroids in red
plt.show()

Run the code in Python, and you’ll see 3 clusters with 3 distinct centroids:

[[29.6  66.8]
 [43.2  16.7]
 [55.1  46.1]]
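
Keep in mind that K-Means starts from randomly chosen initial centroids, so if you rerun the code you may get the same centroids listed in a different order (or, for some datasets, slightly different clusters). If you want reproducible results, one option is to fix the random seed when fitting, for example:

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df)  # fixed seed for repeatable runs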

Note that the center of each cluster represents the mean of all the observations that belong to that cluster.

Additionally, each observation is closer to the center of its own cluster than to the centers of the other clusters.
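
You can check this directly: grouping the observations by their assigned label and averaging each group should reproduce the centroids printed above. A quick sketch, reusing the df and kmeans objects from the code above:

# Average x and y within each assigned cluster;
# the result should match kmeans.cluster_centers_
print(df.groupby(kmeans.labels_).mean())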

K-Means Clustering in Python – 4 clusters

Let’s now see what would happen if you use 4 clusters instead. In that case, you’ll need to change n_clusters from 3 to 4:

KMeans(n_clusters=4).fit(df)

The full Python code for 4 clusters would look like this:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

data = {
    'x': [25, 34, 22, 27, 33, 33, 31, 22, 35, 34, 67, 54, 57, 43, 50, 57, 59, 52, 65, 47, 49, 48, 35, 33, 44, 45, 38,
          43, 51, 46],
    'y': [79, 51, 53, 78, 59, 74, 73, 57, 69, 75, 51, 32, 40, 47, 53, 36, 35, 58, 59, 50, 25, 20, 14, 12, 20, 5, 29, 27,
          8, 7]
    }

df = pd.DataFrame(data)

kmeans = KMeans(n_clusters=4).fit(df)  # same as before, but with 4 clusters
centroids = kmeans.cluster_centers_
print(centroids)

plt.scatter(df['x'], df['y'], c=kmeans.labels_.astype(float), s=50, alpha=0.5)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50)
plt.show()

Run the code, and you’ll now see 4 clusters with 4 distinct centroids:

[[27.75       55.        ]
 [43.2        16.7       ]
 [55.1        46.1       ]
 [30.83333333 74.66666667]]
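
If you’d also like to see which cluster each observation was assigned to, one simple option (reusing the df and kmeans objects from the code above) is to store the labels in a new column:

# Attach each observation's cluster assignment (0 to 3) as a new column
df['cluster'] = kmeans.labels_
print(df)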

That’s it. You can learn more about the application of K-Means Clustering in Python by visiting the sklearn documentation.