K-means Clustering in Python

Table of Contents:

  1. Types of Learning
  2. K-means Clustering
  3. Advantages of K-means Clustering
  4. Disadvantages of K-means Clustering
  5. K-means Clustering Algorithm
  6. Implementation of K-means Clustering

Types of Learning

There are three types of learning algorithms in machine learning.

  1. Supervised Learning
  2. Unsupervised Learning
  3. Reinforcement Learning

K-means Clustering

K-means is an unsupervised learning method: the dataset carries no labels, so the data items are not pre-assigned to specified categories or groups. The K in the algorithm’s name refers to the number of groups into which the algorithm divides the data. The algorithm’s primary purpose is to partition the data into k categories, where k can be any number chosen by the user, by analyzing the features of the data items and grouping them according to their similarities.

A cluster is a group of data items, so this algorithm partitions n given data items into k clusters. The center point of each cluster is called its mean or centroid. These points decide which item belongs to which cluster: each data item is assigned to the cluster whose centroid is closest to it. Since the number of groups is not pre-defined, it is up to the developer to choose k.
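The nearest-centroid assignment described above can be sketched in plain NumPy; the sample points and centroids here are purely illustrative:

```python
import numpy as np

# Illustrative data points and two candidate centroids
points = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 11.0]])
centroids = np.array([[1.0, 2.0], [9.0, 9.0]])

# Squared Euclidean distance from every point to every centroid,
# shape (n_points, n_centroids)
distances = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

# Each point joins the cluster whose centroid is closest
labels = distances.argmin(axis=1)
print(labels)  # -> [0 0 1 1]
```

The first two points land in the cluster around (1, 2) and the last two in the cluster around (9, 9), mirroring the “closest mean wins” rule described above.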

Advantages of K-means Clustering

K-means is an iterative algorithm for partitioning data into groups. It proves helpful in business settings by discovering groups in complex, unlabeled data, and newly collected data can be assigned to the discovered clusters to check how well they generalize. Some advantages of this algorithm are the following:

Fast Computation

K-means clustering is considerably faster than hierarchical clustering when a large number of variables is involved. Keeping the number of clusters small makes the computation even more time-efficient.

Tightly Bound Clusters

Unlike hierarchical clustering, k-means helps produce tight clusters, i.e., groups whose members lie close to their centroid. This is especially useful when working with globular (roughly spherical) clusters.

Simpler Implementation

K-means is easy and simple to implement. The freedom to define and update the number of groups lets developers use it in whatever way they find most helpful.

Easy Adaptation

K-means scales to large datasets and is guaranteed to converge. The fitted model works well for the provided dataset and assigns newly added examples to sensible clusters.

Generalized variants of k-means can also handle clusters of different shapes and sizes, such as elliptical clusters.

Disadvantages of K-means Clustering

Despite its many advantages, the algorithm also has some shortcomings. The following are some of the disadvantages of k-means clustering:

Difficulty with the K value

Choosing the value of k manually can be problematic: predicting the right value of k in advance is a difficult task, and a poor choice leads to inaccurate clusters.
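One common heuristic for picking k is the elbow method: run k-means for a range of k values and look at the inertia (within-cluster sum of squared distances), choosing the k at which the curve stops dropping sharply. A sketch with scikit-learn, using a synthetic two-group dataset for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with two well-separated groups (illustrative)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# Inertia (within-cluster sum of squared distances) for several k values
inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

# The "elbow" is where inertia stops dropping sharply -- here at k=2
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

For this data the drop from k=1 to k=2 is large, while further increases in k buy little, which points to k=2 as a reasonable choice.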

Dependency on Initial Values

The solution produced by the algorithm depends heavily on the initially selected centroids, which can steer the algorithm away from suitable clusters: different initial partitions often result in different final clusters. A common remedy is to run the algorithm several times with different initializations and keep the best solution.
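scikit-learn’s `n_init` parameter automates this remedy: it repeats the entire fit with fresh starting centroids and keeps the run with the lowest inertia. A minimal sketch on an illustrative dataset:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative dataset with two obvious groups
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# A single random initialization may land in a poor local optimum;
# n_init=10 repeats the whole run with fresh starting centroids and
# keeps the solution with the lowest inertia.
single = KMeans(n_clusters=2, init="random", n_init=1, random_state=0).fit(X)
multi = KMeans(n_clusters=2, init="random", n_init=10, random_state=0).fit(X)

print(single.inertia_, multi.inertia_)
```

scikit-learn’s default `init="k-means++"` spreads the initial centroids apart, which further reduces the chance of a bad start.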

Varying Size and Density Problem

If the original data contain clusters of varying sizes and densities, k-means has difficulty separating them. In this case, generalizing k-means (for example, by modeling cluster widths) is the suggested solution.

Cluster of Outliers

Since this algorithm groups items by similarity, outliers may either be clustered separately from the rest of the data or drag centroids away from the true cluster centers. The usual solution in this scenario is to remove outliers from the data before clustering.
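A simple pre-clustering outlier filter can be sketched with a z-score threshold; the dataset and the cutoff of 3 standard deviations are illustrative choices, not part of k-means itself:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: 30 points near (1, 2) plus one extreme outlier
inliers = rng.normal(loc=[1.0, 2.0], scale=0.1, size=(30, 2))
X = np.vstack([inliers, [[50.0, 50.0]]])

# z-score of every feature against its column mean and std
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))

# Keep only points whose features all fall within 3 standard deviations
X_clean = X[(z < 3).all(axis=1)]
print(X.shape, X_clean.shape)  # the outlier row is dropped
```

Clustering `X_clean` instead of `X` prevents the single extreme point from pulling a centroid toward it or claiming a cluster of its own.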

Dimensionality Problem

The curse of dimensionality affects k-means clustering. Since each dimension represents a different attribute, the data may require transformation before analysis: attributes measured on different scales can distort the cluster analysis, and in high dimensions distances between points become less informative. Dimensionality reduction (together with feature scaling) is therefore needed in such cases to prevent the results from being misleading.
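One standard way to apply both remedies is to standardize the features and project onto a few principal components before clustering. A sketch with scikit-learn; the random dataset and the choice of 2 components are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Illustrative high-dimensional data: 100 samples, 20 features
X = rng.normal(size=(100, 20))

# Standardize so attributes on different scales contribute equally,
# then project onto the top 2 principal components
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=2).fit_transform(X_scaled)

# Cluster in the reduced, distance-friendly space
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
print(X_reduced.shape, np.unique(labels))
```

On real data the number of components would be chosen by the explained variance rather than fixed at 2.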

K-means Clustering Algorithm

The k-means clustering algorithm is an iterative process in which the cluster centroids are refined in every iteration until the result stabilizes. It takes k and the provided dataset, which consists of data points with various features. Initially, it estimates k centroids, which may be picked randomly from the data itself. These k centroids form k clusters by acquiring the nearest points: squared Euclidean distance decides which data point belongs to which cluster. Once the algorithm has made the initial clusters, it updates the centroids. The next step is to process the data points in each cluster and compute their mean as the new centroid value, after which the assignment step is repeated.

This process stops when the clusters are stable. The algorithm makes the stopping decision using criteria such as reaching the maximum number of iterations or observing that data points no longer change their clusters. To choose k itself, values of k are tested over a defined range; the developers therefore repeat the whole procedure several times to find the best value of k.
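The loop described above, assigning points to the nearest centroid, recomputing each centroid as its cluster mean, and stopping when assignments no longer change or an iteration cap is hit, can be sketched from scratch in NumPy (the `kmeans` helper and its parameter names are our own, not a library API):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means: squared-Euclidean assignment, mean update."""
    rng = np.random.default_rng(seed)
    # Initial centroids: k distinct points sampled from the data itself
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for it in range(max_iters):
        # Assignment step: nearest centroid by squared Euclidean distance
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break  # assignments stable: converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its cluster
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

X = np.array([[1.0, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```

Depending on the random initialization, such a basic version can settle into a local optimum, which is exactly why the re-running strategies discussed earlier matter in practice.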

Implementation of K-means Clustering

Python is a user-friendly language that provides a straightforward implementation of supervised and unsupervised machine learning algorithms. Although users can choose to write their own functions for these algorithms, the Sklearn library has many useful functions to apply these algorithms to the users’ datasets.

Following are the essential facts of K-means clustering before implementing it:

  • The K-means algorithm minimizes its objective over a finite set of possible partitions. Therefore, it is guaranteed to converge in a finite number of iterations, though possibly to a local optimum.
  • Its computational cost is O(k*n*d) per iteration, where k is the number of specified clusters, n is the number of data points, and d is the number of attributes.
  • K-means technique is fast and efficient, as discussed in the above paragraphs.
  • There is no efficient method to find the optimal number of clusters or “k.” Therefore, the user needs to run the algorithm many times with varying values of k and compare the results to choose a better value.

Below is a simple use of K-means in Python. Users can call scikit-learn’s KMeans with the number of clusters (“k”) as an argument to perform K-means clustering. They can then test the fitted model by passing it some new points and asking it to predict which cluster each point belongs to. Additionally, users can vary the value of “k” and observe the resulting clusters.

import numpy as np
from sklearn.cluster import KMeans

# sample set: two groups of points, around x=1 and x=10
X_array = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
Kmean = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_array)

# printing the cluster labels of the X_array
print(Kmean.labels_)

# predicting the clusters of two new points
print(Kmean.predict([[0, 0], [12, 13]]))

# printing cluster centers
print(Kmean.cluster_centers_)