Clustering and k-means
We now venture into our first application, which is clustering with the k-means algorithm. Clustering is a data mining exercise where we take a bunch of data and find groups of points that are similar to each other. K-means is an algorithm that is great for finding clusters in many types of datasets.
For more about cluster and k-means, see the scikit-learn documentation on its k-means algorithm or watch this video:
First up, we are going to need to generate some samples. We could generate the samples randomly, but that is likely to either give us very sparse points, or just one big group - not very exciting for clustering.
Instead, we are going to start by generating three centroids, and then randomly choose (with a normal distribution) around that point. First up, here is a method for doing this:
Put this code in functions.py
The way this works is to create
n_clusters different centroids at random (using
np.random.random((1, n_features))) and using those as the centre points for
tf.random_normal function generates normally distributed random values, which we then add to the current centre point.
This creates a blob of points around that center.
We then record the centroids (
centroids.append) and the generated samples (
Finally, we create “One big list of samples” using
tf.concat, and convert the centroids to a TensorFlow Variable as well, also using
create_samples method in a file called
functions.py allows us to import these methods into our scripts for this (and the next!) lesson.
Create a new file called
generate_samples.py, which has the following code:
This just sets up the number of clusters and features (I recommend keeping the number of features at 2, allowing us to visualise them later), and the number of samples to generate. Increasing the embiggen_factor will increase the “spread” or the size of the clusters. I chose a value here that provides good learning opportunity, as it generates visually identifiable clusters.
To visualise the results, lets create a plotting function using
matplotlib. Add this code to
Put this code in functions.py
All this code does is plots out the samples from each cluster using a different colour, and creates a big magenta X where the centroid is. The centroid is given as an argument, which will be handy later on.
generate_samples.py to import this function by adding
from functions import plot_clusters to the top of the file.
Then, add this line of code to the bottom:
generate_samples.py should now give you the following plot:
The k-means algorithm starts with the choice of the initial centroids, which are just random guesses of the actual centroids in the data. The following function will randomly choose a number of samples from the dataset to act as this initial guess:
Put this code in functions.py
This code first creates an index for each sample (using
tf.range(0, n_samples), and then randomly shuffles it.
From there, we choose a fixed number (
n_clusters) of indices using
These indices correlated to our initial centroids, which are then grouped together using
tf.gather to form our array of initial centroids.
Add this new
choose_random_centorids function to
functions.py, and create a new script (or update your previous one) to the following:
The major change here is that we create a variable for these initial centroids, and compute its value in the session.
We then plot out those first guesses to
plot_cluster, rather than the actual centroids that were used to generate the data.
Running this will net a similar image to above, but the centroids will be in random positions. Try running this script a few times, noting that the centroids move around quite a bit.
After starting with some guess for the centroid locations, the k-means algorithm then updates those guesses based on the data. The process is to assign each sample a cluster number, representing the centroid it is closest to. After that, the centroids are updated to be the means of all samples assigned to that cluster. The following code handles the assign to nearest cluster step:
Note that I’ve borrowed some code from this page which has a different type of k-means algorithm, and lots of other useful information.
The way this works is to compute the distance between each sample and each centroid, which occurs through the
distances = line.
The distance computation here is the Euclidean distance.
An important point here is that
tf.subtract will automatically expand the size of the two arguments.
This means that having our samples as a matrix, and the centroids as a column vector will produce the pairwise subtraction between them.
In order to do this, we use
tf.expand_dims to create an extra dimension for both samples and centroids, forcing this behaviour of
The next bit of code hands the update centroids bit:
This code takes the nearest indices for each sample, and grabs those out as separate groups using
From here, we use
tf.reduce_mean on a single group to find the average of that group, forming its new centroid.
From here, we just
tf.concat them together to form our new centroids.
Now we have the piece in place, we can add these calls to our script (or create a new one):
This code will:
- Generate samples from initial centroids
- Randomly choose initial centroids
- Associate each sample to its nearest centroid
- Update each centroid to be the mean of the samples associated to it
This is a single iteration of k-means! I encourage you to take a shot at the exercises, which turns this into an iterative version.
Stuck? Looking for more content?
If you are looking for solutions on the exercises, or just want to see how I solved them, then our solutions bundle is what you are after. Buying the bundle gives you free updates for life - meaning when we add a new lesson, you get an updated bundle with the solutions. It's just $7, and it also helps us to keep running the site with free lessons.
1) The seed option passed to
generate_samples ensures that the samples that are “randomly” generated are consistent everytime you run the script. We didn’t pass on the seed to the
choose_random_centroids function, which means those initial centroids are different each time the script is run.
Update the script to include a new seed for random centroids.
2) The k-means algorithm is performed iteratively, where the updated centroids from the previous iteration are used to assign clusters, which are then used to update the centroids, and so on.
In other words, the algorithm alternates between calling
Update the code to perform this iteration 10 times, before stopping. You’ll find that the resulting centroids are much closer on average with more iterations of k-means.
(For those that have experience with k-means, a future tutorial will look at convergence functions and other stopping criteria.)