
TensorFlow™ on Databricks


Clustering and k-means

We now venture into our first application, which is clustering with the k-means algorithm. Clustering is a data mining exercise where we take a bunch of data and find groups of points that are similar to each other. K-means is an algorithm that is great for finding clusters in many types of datasets.

For more about clustering and k-means, see the scikit-learn documentation on its k-means algorithm.

Generating Samples

First up, we are going to need to generate some samples. We could generate the samples randomly, but that is likely to either give us very sparse points, or just one big group - not very exciting for clustering.

Instead, we are going to start by generating three centroids, and then randomly generating points (with a normal distribution) around each of those centroids. Here is a method for doing this:

 

import tensorflow as tf
import numpy as np


def create_samples(n_clusters, n_samples_per_cluster, n_features, embiggen_factor, seed):
    np.random.seed(seed)
    slices = []
    centroids = []
    # Create samples for each cluster
    for i in range(n_clusters):
        samples = tf.random_normal((n_samples_per_cluster, n_features),
                                   mean=0.0, stddev=5.0, dtype=tf.float32, seed=seed, name="cluster_{}".format(i))
        current_centroid = (np.random.random((1, n_features)) * embiggen_factor) - (embiggen_factor/2)
        centroids.append(current_centroid)
        samples += current_centroid
        slices.append(samples)
    # Create a big "samples" dataset
    samples = tf.concat(slices, 0, name='samples')
    centroids = tf.concat(centroids, 0, name='centroids')
    return centroids, samples

 

Put this code in functions.py

The way this works is to create n_clusters different centroids at random (using np.random.random((1, n_features))) and use those as the centre points for tf.random_normal. The tf.random_normal function generates normally distributed random values, which we then add to the current centre point, creating a blob of points around that centre. We then record the centroids (centroids.append) and the generated samples (slices.append(samples)). Finally, we create one big array of samples using tf.concat, and concatenate the centroids into a single tensor as well, also using tf.concat.
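If you want to sanity-check what create_samples returns, a quick snippet like the following (just a sketch, assuming functions.py is importable from the current directory) prints the shapes of the two tensors:

import tensorflow as tf
from functions import create_samples

# Small, made-up numbers purely to inspect the output shapes
centroids, samples = create_samples(n_clusters=3, n_samples_per_cluster=10,
                                    n_features=2, embiggen_factor=70, seed=700)

with tf.Session() as session:
    centroid_values, sample_values = session.run([centroids, samples])

print(centroid_values.shape)  # (3, 2): one row per cluster centre
print(sample_values.shape)    # (30, 2): n_clusters * n_samples_per_cluster rows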

Saving this create_samples method in a file called functions.py allows us to import these methods into our scripts for this (and the next!) lesson. Create a new file called generate_samples.py, which has the following code:

 

import tensorflow as tf
import numpy as np

from functions import create_samples

n_features = 2
n_clusters = 3
n_samples_per_cluster = 500
seed = 700
embiggen_factor = 70

np.random.seed(seed)

centroids, samples = create_samples(n_clusters, n_samples_per_cluster, n_features, embiggen_factor, seed)

model = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(model)
    sample_values, centroid_values = session.run([samples, centroids])

 

This just sets up the number of clusters and features (I recommend keeping the number of features at 2, so we can visualise the results later), and the number of samples to generate. Increasing embiggen_factor will increase the "spread", i.e. the size of the clusters. I chose a value here that provides a good learning opportunity, as it generates visually identifiable clusters.
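To make the effect of embiggen_factor a little more concrete: inside create_samples, each cluster centre is drawn uniformly from [-embiggen_factor/2, embiggen_factor/2) in every dimension, while the spread of the points around that centre is fixed by stddev=5.0 in tf.random_normal. A tiny NumPy sketch of the centroid line (the same expression used in create_samples):

import numpy as np

embiggen_factor = 70
n_features = 2

np.random.seed(700)
# Mirrors the centroid calculation inside create_samples:
centroid = (np.random.random((1, n_features)) * embiggen_factor) - (embiggen_factor / 2)
print(centroid)  # each coordinate lies in [-35, 35)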

To visualise the results, let's create a plotting function using matplotlib. Add this code to functions.py:

 

def plot_clusters(all_samples, centroids, n_samples_per_cluster):
    import matplotlib.pyplot as plt
    # Plot out the different clusters
    # Choose a different colour for each cluster
    colour = plt.cm.rainbow(np.linspace(0, 1, len(centroids)))
    for i, centroid in enumerate(centroids):
        # Grab just the samples for the given cluster and plot them out with a new colour
        samples = all_samples[i*n_samples_per_cluster:(i+1)*n_samples_per_cluster]
        plt.scatter(samples[:, 0], samples[:, 1], c=colour[i])
        # Also plot the centroid
        plt.plot(centroid[0], centroid[1], markersize=35, marker="x", color='k', mew=10)
        plt.plot(centroid[0], centroid[1], markersize=30, marker="x", color='m', mew=5)
    plt.show()

 

Put this code in functions.py

All this code does is plot out the samples from each cluster using a different colour, and draw a big magenta X where each centroid is. The centroids are given as an argument, which will be handy later on.

Update the generate_samples.py to import this function by adding from functions import plot_clusters to the top of the file. Then, add this line of code to the bottom:

 

plot_clusters(sample_values, centroid_values, n_samples_per_cluster)

 

Running generate_samples.py should now give you a plot of the three generated clusters, with a large magenta X marking each centroid.

Initialisation

The k-means algorithm starts with the choice of the initial centroids, which are just random guesses of the actual centroids in the data. The following function will randomly choose a number of samples from the dataset to act as this initial guess:

 

def choose_random_centroids(samples, n_clusters):
    # Step 0: Initialisation: Select `n_clusters` number of random points
    n_samples = tf.shape(samples)[0]
    random_indices = tf.random_shuffle(tf.range(0, n_samples))
    begin = [0,]
    size = [n_clusters,]
    centroid_indices = tf.slice(random_indices, begin, size)
    initial_centroids = tf.gather(samples, centroid_indices)
    return initial_centroids

 

Put this code in functions.py

This code first creates an index for each sample (using tf.range(0, n_samples)), and then randomly shuffles it. From there, we choose a fixed number (n_clusters) of indices using tf.slice. These indices correspond to our initial centroids; the samples at those indices are then gathered together using tf.gather to form our array of initial centroids.
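If tf.gather is new to you, here is a tiny standalone sketch (with made-up values) showing what it does with the sliced indices: it simply picks out the rows of samples at those positions.

import tensorflow as tf

# Six made-up 2D samples and three indices, just to illustrate tf.gather
samples = tf.constant([[0., 0.], [1., 1.], [2., 2.],
                       [3., 3.], [4., 4.], [5., 5.]])
indices = tf.constant([4, 0, 2])

chosen = tf.gather(samples, indices)

with tf.Session() as session:
    print(session.run(chosen))  # rows 4, 0 and 2: [[4. 4.], [0. 0.], [2. 2.]]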

Add this new choose_random_centroids function to functions.py, and create a new script (or update your previous one) with the following:

 

import tensorflow as tf
import numpy as np

from functions import create_samples, choose_random_centroids, plot_clusters

n_features = 2
n_clusters = 3
n_samples_per_cluster = 500
seed = 700
embiggen_factor = 70

centroids, samples = create_samples(n_clusters, n_samples_per_cluster, n_features, embiggen_factor, seed)
initial_centroids = choose_random_centroids(samples, n_clusters)

model = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(model)
    sample_values, updated_centroid_value = session.run([samples, initial_centroids])

plot_clusters(sample_values, updated_centroid_value, n_samples_per_cluster)

 

The major change here is that we compute these initial centroids in the session, alongside the samples, and then pass those first guesses to plot_clusters, rather than the actual centroids that were used to generate the data.

Running this will produce a similar image to the one above, but the centroids will be in random positions. Try running this script a few times, noting that the centroids move around quite a bit.

Updating Centroids

After starting with an initial guess for the centroid locations, the k-means algorithm then updates those guesses based on the data. The process is to assign each sample a cluster number, representing the centroid it is closest to. After that, the centroids are updated to be the means of all samples assigned to that cluster. The following code handles the "assign to nearest cluster" step:

 

def assign_to_nearest(samples, centroids):
    # Finds the nearest centroid for each sample

    # START from https://esciencegroup.com/2016/01/05/an-encounter-with-googles-tensorflow/
    expanded_vectors = tf.expand_dims(samples, 0)
    expanded_centroids = tf.expand_dims(centroids, 1)
    distances = tf.reduce_sum( tf.square(
               tf.subtract(expanded_vectors, expanded_centroids)), 2)
    mins = tf.argmin(distances, 0)
    # END from https://esciencegroup.com/2016/01/05/an-encounter-with-googles-tensorflow/
    nearest_indices = mins
    return nearest_indices

 

Note that I’ve borrowed some code from this page which has a different type of k-means algorithm, and lots of other useful information.

The way this works is to compute the distance between each sample and each centroid, which happens in the distances = line. The distance computed here is the squared Euclidean distance; since we only need to know which centroid is nearest, leaving out the square root doesn't change the result of tf.argmin. An important point here is that tf.subtract will automatically broadcast the two arguments to a common shape. This means that having our samples as a matrix, and the centroids as a column vector, will produce the pairwise subtraction between them. In order to do this, we use tf.expand_dims to create an extra dimension for both samples and centroids, forcing this behaviour of tf.subtract.
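To see the shapes involved, here is a small sketch with made-up sizes (4 samples, 3 centroids, 2 features): the expanded tensors broadcast into a (3, 4, 2) difference, which reduces to a (3, 4) matrix of squared distances, and tf.argmin over axis 0 gives one nearest-centroid index per sample.

import tensorflow as tf

samples = tf.random_normal((4, 2), seed=1)    # 4 samples, 2 features (made-up sizes)
centroids = tf.random_normal((3, 2), seed=2)  # 3 centroids

expanded_vectors = tf.expand_dims(samples, 0)      # shape (1, 4, 2)
expanded_centroids = tf.expand_dims(centroids, 1)  # shape (3, 1, 2)

# Broadcasting gives a (3, 4, 2) tensor of differences
distances = tf.reduce_sum(tf.square(tf.subtract(expanded_vectors, expanded_centroids)), 2)
nearest = tf.argmin(distances, 0)  # shape (4,): index of the closest centroid per sample

with tf.Session() as session:
    distance_values, nearest_values = session.run([distances, nearest])

print(distance_values.shape)  # (3, 4)
print(nearest_values)         # four centroid indices, one per sample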

The next bit of code handles the "update centroids" step:

def update_centroids(samples, nearest_indices, n_clusters):
    # Updates the centroid to be the mean of all samples associated with it.
    nearest_indices = tf.to_int32(nearest_indices)
    partitions = tf.dynamic_partition(samples, nearest_indices, n_clusters)
    new_centroids = tf.concat([tf.expand_dims(tf.reduce_mean(partition, 0), 0) for partition in partitions], 0)
    return new_centroids

 

This code takes the nearest indices for each sample, and splits the samples into separate groups using tf.dynamic_partition. We then use tf.reduce_mean on each group to find its average, forming the new centroid for that cluster. Finally, we tf.concat the per-cluster means together to form our new centroids.
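If tf.dynamic_partition is unfamiliar, this tiny sketch with made-up values shows how it splits rows by cluster index, and how the per-group mean then becomes the new centroid:

import tensorflow as tf

# Five made-up 2D samples and their (made-up) nearest-centroid indices
samples = tf.constant([[0., 0.], [1., 1.], [10., 10.], [11., 11.], [2., 2.]])
nearest_indices = tf.constant([0, 0, 1, 1, 0])

partitions = tf.dynamic_partition(samples, nearest_indices, 2)
new_centroids = tf.concat(
    [tf.expand_dims(tf.reduce_mean(partition, 0), 0) for partition in partitions], 0)

with tf.Session() as session:
    print(session.run(new_centroids))
    # [[ 1.   1. ]   <- mean of the three samples assigned to cluster 0
    #  [10.5 10.5]]  <- mean of the two samples assigned to cluster 1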

Now we have the pieces in place, we can add these calls to our script (or create a new one):

 

import tensorflow as tf
import numpy as np

from functions import *

n_features = 2
n_clusters = 3
n_samples_per_cluster = 500
seed = 700
embiggen_factor = 70


data_centroids, samples = create_samples(n_clusters, n_samples_per_cluster, n_features, embiggen_factor, seed)
initial_centroids = choose_random_centroids(samples, n_clusters)
nearest_indices = assign_to_nearest(samples, initial_centroids)
updated_centroids = update_centroids(samples, nearest_indices, n_clusters)

model = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(model)
    sample_values, updated_centroid_value = session.run([samples, updated_centroids])
    print(updated_centroid_value)

plot_clusters(sample_values, updated_centroid_value, n_samples_per_cluster)

 

This code will:

  1. Generate samples from initial centroids
  2. Randomly choose initial centroids
  3. Associate each sample to its nearest centroid
  4. Update each centroid to be the mean of the samples associated to it

This is a single iteration of k-means! I encourage you to take a shot at the exercises, which turn this into an iterative version.

1) The seed option passed to create_samples ensures that the samples that are "randomly" generated are consistent every time you run the script. We didn't pass the seed on to the choose_random_centroids function, which means those initial centroids are different each time the script is run. Update the script to include a new seed for the random centroids (one possible approach is sketched below).
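As a hint (one possible approach, not the only one): tf.random_shuffle accepts a seed argument, so you could give choose_random_centroids an optional seed parameter and pass it straight through:

def choose_random_centroids(samples, n_clusters, seed=None):
    # Step 0: Initialisation: Select `n_clusters` number of random points
    n_samples = tf.shape(samples)[0]
    random_indices = tf.random_shuffle(tf.range(0, n_samples), seed=seed)
    begin = [0,]
    size = [n_clusters,]
    centroid_indices = tf.slice(random_indices, begin, size)
    initial_centroids = tf.gather(samples, centroid_indices)
    return initial_centroids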

2) The k-means algorithm is performed iteratively, where the updated centroids from the previous iteration are used to assign clusters, which are then used to update the centroids, and so on. In other words, the algorithm alternates between calling assign_to_nearest and update_centroids. Update the code to perform this iteration 10 times before stopping; a rough outline is sketched below. You'll find that the resulting centroids are much closer to the data's actual centroids, on average, with more iterations of k-means. (For those who have experience with k-means, a future tutorial will look at convergence functions and other stopping criteria.)
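One way to structure the iterative version (a sketch, not a full solution): build the graph so that each pass of a Python loop feeds the centroids from the previous pass back into assign_to_nearest, then evaluate the final centroids in a single session.run.

import tensorflow as tf
from functions import *

n_features = 2
n_clusters = 3
n_samples_per_cluster = 500
seed = 700
embiggen_factor = 70

data_centroids, samples = create_samples(n_clusters, n_samples_per_cluster,
                                         n_features, embiggen_factor, seed)
updated_centroids = choose_random_centroids(samples, n_clusters)

# Chain 10 assignment/update steps together; each pass re-uses the
# centroids produced by the previous pass.
for _ in range(10):
    nearest_indices = assign_to_nearest(samples, updated_centroids)
    updated_centroids = update_centroids(samples, nearest_indices, n_clusters)

model = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(model)
    sample_values, updated_centroid_value = session.run([samples, updated_centroids])

plot_clusters(sample_values, updated_centroid_value, n_samples_per_cluster)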