AN 856: K-Mean Clustering with the Intel® FPGA SDK for OpenCL™

ID 683395
Date 6/12/2018
Public

1.1. K-Mean Clustering Algorithm

If you have a set of observations or data represented with , and each data includes features and no labels, k-mean clustering provides a simple method to group the whole data set into k clusters , where data in the same cluster are more similar to each other.

Each data () is labeled with the cluster , where the distance of the data point to the centroid of cluster is minimum compare to all other clusters.

The objective is , where is the mean of the data points in cluster .

Different methods for initializing this unsupervised clustering algorithm exist, and this initialization is important for reducing the complexity of the algorithm (the number of iterations required) and the results of clustering (local optimum).

Typically, if the initial centroids are closer to the final centroid, you can achieve better clustering results and lower complexity. This application note does not cover initialization methods.

In this example, we pick data as the centroids of the clusters randomly:
  1. Pick data points from the input data set as centroids for clusters.
  2. Find the distance of each data into all centroids and label the data with the cluster that has the minimum distance to its centroid.
  3. Calculate the new centroid of all clusters as a mean of the labeled data from the previous step.
  4. Check the error.

    If the centroids are no longer changing, the algorithm has converged, and you can exit the algorithm. Otherwise, repeat steps 2- 4.