K-Means Clustering is an unsupervised learning algorithm. It partitions unlabeled data into k groups by assigning each observation to the cluster with the nearest centroid.
Topics to be covered:
- Creating a DataFrame for a two-dimensional dataset
- Finding the centroids of 3 clusters, and then of 4 clusters
Example of K-Means Clustering in Python
To start, here is an example of a two-dimensional dataset:
import pandas as pd
# Two-dimensional dataset: 30 observations with "x" and "y" coordinates
data = {
    "x": [25, 34, 22, 27, 33, 33, 31, 22, 35, 34, 67, 54, 57, 43, 50, 57, 59, 52, 65, 47, 49, 48, 35, 33, 44, 45, 38,
          43, 51, 46],
    "y": [79, 51, 53, 78, 59, 74, 73, 57, 69, 75, 51, 32, 40, 47, 53, 36, 35, 58, 59, 50, 25, 20, 14, 12, 20, 5, 29, 27,
          8, 7],
}
df = pd.DataFrame(data)
print(df)
Run the code in Python, and you’ll get the following DataFrame:
x y
0 25 79
1 34 51
2 22 53
3 27 78
4 33 59
5 33 74
6 31 73
7 22 57
8 35 69
9 34 75
10 67 51
11 54 32
12 57 40
13 43 47
14 50 53
15 57 36
16 59 35
17 52 58
18 65 59
19 47 50
20 49 25
21 48 20
22 35 14
23 33 12
24 44 20
25 45 5
26 38 29
27 43 27
28 51 8
29 46 7
Find the Centroids of 3 Clusters
First, install the Matplotlib package. This package will be used to create the chart in Python.
pip install matplotlib
Then install the scikit-learn package (imported in code as sklearn). This package will be used to apply K-Means Clustering in Python.
pip install scikit-learn
You can then specify the number of clusters. For example, assign 3 clusters as follows:
KMeans(n_clusters=3)
The complete code to find the centroids of 3 clusters:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
data = {
    "x": [25, 34, 22, 27, 33, 33, 31, 22, 35, 34, 67, 54, 57, 43, 50, 57, 59, 52, 65, 47, 49, 48, 35, 33, 44, 45, 38,
          43, 51, 46],
    "y": [79, 51, 53, 78, 59, 74, 73, 57, 69, 75, 51, 32, 40, 47, 53, 36, 35, 58, 59, 50, 25, 20, 14, 12, 20, 5, 29, 27,
          8, 7],
}
df = pd.DataFrame(data)
# Fit K-Means with 3 clusters; n_init=10 keeps the best of 10 random initializations
kmeans = KMeans(n_clusters=3, n_init=10)
kmeans.fit(df)
# One row per cluster: the (x, y) coordinates of its centroid
centroids = kmeans.cluster_centers_
print(centroids)
# Color each observation by its cluster label, then overlay the centroids in red
plt.scatter(df["x"], df["y"], c=kmeans.labels_.astype(float), s=50, alpha=0.5)
plt.scatter(centroids[:, 0], centroids[:, 1], c="red", s=50)
plt.show()
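Run the code in Python, and you’ll see 3 clusters with 3 distinct centroids. The rows may appear in a different order, because the cluster numbering depends on the random initialization: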
[[29.6 66.8]
[43.2 16.7]
[55.1 46.1]]
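K-Means starts from randomly chosen initial centers, so the cluster numbering, and with it the row order of the printed centroids, can change from run to run. The following is a minimal sketch, assuming you want repeatable output: it passes a fixed random_state to KMeans. The seed value 42 and n_init=10 are arbitrary choices made for illustration, not settings taken from the example above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "x": [25, 34, 22, 27, 33, 33, 31, 22, 35, 34, 67, 54, 57, 43, 50,
          57, 59, 52, 65, 47, 49, 48, 35, 33, 44, 45, 38, 43, 51, 46],
    "y": [79, 51, 53, 78, 59, 74, 73, 57, 69, 75, 51, 32, 40, 47, 53,
          36, 35, 58, 59, 50, 25, 20, 14, 12, 20, 5, 29, 27, 8, 7],
})

# Two independent fits that share the same seed (42 is an arbitrary choice)
first = KMeans(n_clusters=3, n_init=10, random_state=42).fit(df)
second = KMeans(n_clusters=3, n_init=10, random_state=42).fit(df)

# With the same seed, both fits should return the centroids in the same order
same_centroids = np.allclose(first.cluster_centers_, second.cluster_centers_)
print(same_centroids)
```

Without random_state, the centroid values for this dataset should usually be the same, but their order in the array is not guaranteed.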
Note that each centroid is the mean of all the observations assigned to its cluster.
Additionally, each observation is closer to the centroid of its own cluster than to the centroid of any other cluster.
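Both properties can be checked directly on the fitted model. The sketch below is illustrative only: the variable names (group_means, nearest, and so on) are made up for this check, and random_state=0 and n_init=10 are arbitrary settings chosen to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "x": [25, 34, 22, 27, 33, 33, 31, 22, 35, 34, 67, 54, 57, 43, 50,
          57, 59, 52, 65, 47, 49, 48, 35, 33, 44, 45, 38, 43, 51, 46],
    "y": [79, 51, 53, 78, 59, 74, 73, 57, 69, 75, 51, 32, 40, 47, 53,
          36, 35, 58, 59, 50, 25, 20, 14, 12, 20, 5, 29, 27, 8, 7],
})

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_

# Property 1: each centroid is the mean of the observations assigned to it.
# Grouping by label sorts the groups as 0, 1, 2, matching the centroid rows.
group_means = df.groupby(labels).mean().to_numpy()
means_match = np.allclose(group_means, centroids, atol=1e-6)

# Property 2: each observation's nearest centroid is its own cluster's centroid.
points = df.to_numpy(dtype=float)
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
nearest = distances.argmin(axis=1)
nearest_match = np.array_equal(nearest, labels)

print(means_match, nearest_match)
```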
Find the Centroids of 4 Clusters
In this case, change n_clusters from 3 to 4:
KMeans(n_clusters=4)
The full Python code for 4 clusters:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
data = {
    "x": [25, 34, 22, 27, 33, 33, 31, 22, 35, 34, 67, 54, 57, 43, 50, 57, 59, 52, 65, 47, 49, 48, 35, 33, 44, 45, 38,
          43, 51, 46],
    "y": [79, 51, 53, 78, 59, 74, 73, 57, 69, 75, 51, 32, 40, 47, 53, 36, 35, 58, 59, 50, 25, 20, 14, 12, 20, 5, 29, 27,
          8, 7],
}
df = pd.DataFrame(data)
# Fit K-Means with 4 clusters; n_init=10 keeps the best of 10 random initializations
kmeans = KMeans(n_clusters=4, n_init=10)
kmeans.fit(df)
# One row per cluster: the (x, y) coordinates of its centroid
centroids = kmeans.cluster_centers_
print(centroids)
# Color each observation by its cluster label, then overlay the centroids in red
plt.scatter(df["x"], df["y"], c=kmeans.labels_.astype(float), s=50, alpha=0.5)
plt.scatter(centroids[:, 0], centroids[:, 1], c="red", s=50)
plt.show()
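Run the code, and you’ll now see 4 clusters with 4 distinct centroids. Two of them match the 3-cluster result, while the cluster previously centered at (29.6, 66.8) is split in two. As before, the row order may differ between runs: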
[[27.75 55. ]
[43.2 16.7 ]
[55.1 46.1 ]
[30.83333333 74.66666667]]
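One common way to compare different values of n_clusters is the fitted model's inertia_ attribute: the sum of squared distances from each observation to its nearest centroid. The sketch below is an optional extra, not part of the walkthrough above. The range of k values, random_state=0, and n_init=10 are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "x": [25, 34, 22, 27, 33, 33, 31, 22, 35, 34, 67, 54, 57, 43, 50,
          57, 59, 52, 65, 47, 49, 48, 35, 33, 44, 45, 38, 43, 51, 46],
    "y": [79, 51, 53, 78, 59, 74, 73, 57, 69, 75, 51, 32, 40, 47, 53,
          36, 35, 58, 59, 50, 25, 20, 14, 12, 20, 5, 29, 27, 8, 7],
})

# Fit one model per candidate k and record its inertia
inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(df)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")
```

Inertia generally falls as k grows, so a lower value alone does not mean a better clustering. A frequent heuristic is to look for the k after which the decrease levels off.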
That’s it. You can learn more about K-Means Clustering in Python by visiting the scikit-learn documentation.