K-Means Clustering is an unsupervised learning algorithm that can be used to find groups within unlabeled data. To demonstrate the concept, we'll review a simple example of K-Means Clustering in Python.
Topics to be covered:
- Creating a DataFrame for a two-dimensional dataset
- Finding the centroids of 3 clusters, and then of 4 clusters
Example of K-Means Clustering in Python
To start, let’s review a simple example with the following two-dimensional dataset:
import pandas as pd

data = {
    'x': [25, 34, 22, 27, 33, 33, 31, 22, 35, 34, 67, 54, 57, 43, 50, 57, 59, 52, 65, 47, 49, 48, 35, 33, 44, 45, 38, 43, 51, 46],
    'y': [79, 51, 53, 78, 59, 74, 73, 57, 69, 75, 51, 32, 40, 47, 53, 36, 35, 58, 59, 50, 25, 20, 14, 12, 20, 5, 29, 27, 8, 7]
}

df = pd.DataFrame(data)
print(df)
Run the code in Python, and you’ll get the following DataFrame:
x y
0 25 79
1 34 51
2 22 53
3 27 78
4 33 59
5 33 74
6 31 73
7 22 57
8 35 69
9 34 75
10 67 51
11 54 32
12 57 40
13 43 47
14 50 53
15 57 36
16 59 35
17 52 58
18 65 59
19 47 50
20 49 25
21 48 20
22 35 14
23 33 12
24 44 20
25 45 5
26 38 29
27 43 27
28 51 8
29 46 7
Next, you’ll see how to use sklearn to find the centroids of 3 clusters, and then of 4 clusters.
K-Means Clustering in Python – 3 clusters
Once you have created the DataFrame based on the above data, you'll need to import 2 additional Python modules:
- matplotlib – for creating charts in Python
- sklearn – for applying the K-Means Clustering in Python
In the code below, you can specify the number of clusters. For this example, assign 3 clusters as follows:
KMeans(n_clusters=3).fit(df)
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

data = {
    'x': [25, 34, 22, 27, 33, 33, 31, 22, 35, 34, 67, 54, 57, 43, 50, 57, 59, 52, 65, 47, 49, 48, 35, 33, 44, 45, 38, 43, 51, 46],
    'y': [79, 51, 53, 78, 59, 74, 73, 57, 69, 75, 51, 32, 40, 47, 53, 36, 35, 58, 59, 50, 25, 20, 14, 12, 20, 5, 29, 27, 8, 7]
}

df = pd.DataFrame(data)

# Fit K-Means with 3 clusters, then retrieve the cluster centers
kmeans = KMeans(n_clusters=3).fit(df)
centroids = kmeans.cluster_centers_
print(centroids)

# Plot the observations colored by cluster label, with the centroids in red
plt.scatter(df['x'], df['y'], c=kmeans.labels_.astype(float), s=50, alpha=0.5)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50)
plt.show()
Run the code in Python, and you'll see 3 clusters with 3 distinct centroids (since K-Means starts from randomly chosen initial centroids, the order of the centroids, and occasionally the exact values, may vary from run to run):
[[29.6 66.8]
[43.2 16.7]
[55.1 46.1]]
Note that the center of each cluster represents the mean of all the observations that belong to that cluster.
Additionally, the observations that belong to a given cluster are closer to the center of that cluster, in comparison to the centers of other clusters.
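The two properties above can be checked directly. The sketch below (not part of the original walkthrough) refits the 3-cluster model and verifies that each centroid equals the mean of its cluster's observations, and that every observation is nearer to its own centroid than to any other. The `random_state` and `n_init` arguments are added here only to make the run reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

data = {
    'x': [25, 34, 22, 27, 33, 33, 31, 22, 35, 34, 67, 54, 57, 43, 50, 57, 59, 52, 65, 47, 49, 48, 35, 33, 44, 45, 38, 43, 51, 46],
    'y': [79, 51, 53, 78, 59, 74, 73, 57, 69, 75, 51, 32, 40, 47, 53, 36, 35, 58, 59, 50, 25, 20, 14, 12, 20, 5, 29, 27, 8, 7]
}
df = pd.DataFrame(data)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_

# Property 1: each centroid is the mean of the observations assigned to it
for k in range(3):
    cluster_mean = df[labels == k].mean().to_numpy()
    assert np.allclose(cluster_mean, centroids[k])

# Property 2: each observation is closest to the centroid of its own cluster
# distances[i, k] = Euclidean distance from observation i to centroid k
distances = np.linalg.norm(df.to_numpy()[:, None, :] - centroids[None, :, :], axis=2)
assert (distances.argmin(axis=1) == labels).all()

print("Both properties hold")
```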
K-Means Clustering in Python – 4 clusters
Let's now see what would happen if you use 4 clusters instead. In that case, you'll need to change n_clusters from 3 to 4:
KMeans(n_clusters=4).fit(df)
The full Python code for 4 clusters would look like this:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

data = {
    'x': [25, 34, 22, 27, 33, 33, 31, 22, 35, 34, 67, 54, 57, 43, 50, 57, 59, 52, 65, 47, 49, 48, 35, 33, 44, 45, 38, 43, 51, 46],
    'y': [79, 51, 53, 78, 59, 74, 73, 57, 69, 75, 51, 32, 40, 47, 53, 36, 35, 58, 59, 50, 25, 20, 14, 12, 20, 5, 29, 27, 8, 7]
}

df = pd.DataFrame(data)

# Same as before, but with 4 clusters
kmeans = KMeans(n_clusters=4).fit(df)
centroids = kmeans.cluster_centers_
print(centroids)

plt.scatter(df['x'], df['y'], c=kmeans.labels_.astype(float), s=50, alpha=0.5)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50)
plt.show()
Run the code, and you’ll now see 4 clusters with 4 distinct centroids:
[[27.75 55. ]
[43.2 16.7 ]
[55.1 46.1 ]
[30.83333333 74.66666667]]
That's it. You can learn more about the application of K-Means Clustering in Python by visiting the sklearn documentation.
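One common next step, sketched below with two hypothetical observations that are not part of the original dataset, is assigning new points to the clusters a fitted model has already learned, using the model's predict() method:

```python
import pandas as pd
from sklearn.cluster import KMeans

data = {
    'x': [25, 34, 22, 27, 33, 33, 31, 22, 35, 34, 67, 54, 57, 43, 50, 57, 59, 52, 65, 47, 49, 48, 35, 33, 44, 45, 38, 43, 51, 46],
    'y': [79, 51, 53, 78, 59, 74, 73, 57, 69, 75, 51, 32, 40, 47, 53, 36, 35, 58, 59, 50, 25, 20, 14, 12, 20, 5, 29, 27, 8, 7]
}
df = pd.DataFrame(data)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df)

# Two hypothetical new observations with the same columns as the training data
new_points = pd.DataFrame({'x': [30, 55], 'y': [70, 40]})

# predict() returns the index of the nearest learned centroid for each new point
new_labels = kmeans.predict(new_points)
print(new_labels)
```

Each new point is simply assigned to whichever of the learned centroids it is closest to; no refitting takes place.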