In [1]:

```
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
```

We use the `matplotlib` and `seaborn` packages to plot the charts in this notebook.

In [2]:

```
plot_size = 14
plot_width = 5
plot_height = 5
params = {'legend.fontsize': 'large',
          'figure.figsize': (plot_width, plot_height),
          'axes.labelsize': plot_size,
          'axes.titlesize': plot_size,
          'xtick.labelsize': plot_size*0.75,
          'ytick.labelsize': plot_size*0.75,
          'axes.titlepad': 25}
plt.rcParams.update(params)
```

As we are also using the outputs from the charts to create the illustrations in our lecture slides, we require an additional level of control over the appearance of the charts.

The commands above therefore override the default font sizes and figure dimensions.
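If the overrides should only apply to some of the charts, one option is to scope them with `plt.rc_context`. The following is a minimal sketch, not part of the original notebook: the `slide_params` name and its values are illustrative assumptions, and the `Agg` backend is selected only so that the snippet runs outside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical overrides for slide-ready figures (illustrative values).
slide_params = {"figure.figsize": (5, 5), "axes.titlesize": 14}

before = list(plt.rcParams["figure.figsize"])

# Inside the block the overrides are active; on exit the previous values return.
with plt.rc_context(slide_params):
    inside = list(plt.rcParams["figure.figsize"])
    fig, ax = plt.subplots()
    ax.set_title("Styled for slides")
    plt.close(fig)

after = list(plt.rcParams["figure.figsize"])
print(before, inside, after)
```

Calling `plt.rcParams.update(params)`, as the notebook does, changes the settings for the rest of the session, which is convenient when every chart needs the same look.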

In [3]:

```
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
```

We use the `KMeans` algorithm implementation from the `sklearn` package. `sklearn` also provides several other clustering algorithms under the `sklearn.cluster` namespace. You can read more about the other algorithms in the `sklearn` documentation.

Also from `sklearn`, we use the `make_blobs` command to create a random dataset. As the command name implies, it creates an arrangement of points that is concentrated around a defined number of "blobs".

In [4]:

```
num_customers = 40
```

In [5]:

```
coord, clust_true = make_blobs(n_samples=num_customers,
                               centers=3,
                               cluster_std=1,
                               random_state=2)
```

The `make_blobs` command returns two outputs, which we save in the `coord` and `clust_true` variables. Let's see what they contain.

In terms of input parameters, aside from the number of elements that we want to create (`n_samples`), we can also define how many blobs we are using (`centers`) and how widely the samples are spread around each "blob" centroid (`cluster_std`, the standard deviation of each blob).

As the outputs are random, we use `random_state` to provide a "seed" to the underlying random number generator used by the function, which ensures that we get the same results every time.
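The effect of `random_state` and `cluster_std` can be checked directly. The sketch below is an illustration rather than part of the original analysis: the `mean_spread` helper and the `cluster_std` values of 0.5 and 3.0 are assumptions chosen to make the contrast visible.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same seed, same call: the generated coordinates are identical.
a, _ = make_blobs(n_samples=40, centers=3, cluster_std=1, random_state=2)
b, _ = make_blobs(n_samples=40, centers=3, cluster_std=1, random_state=2)
same_seed_matches = np.array_equal(a, b)

# A larger cluster_std scatters the samples further from their blob centres.
tight, tight_labels = make_blobs(n_samples=40, centers=3, cluster_std=0.5, random_state=2)
loose, loose_labels = make_blobs(n_samples=40, centers=3, cluster_std=3.0, random_state=2)


def mean_spread(points, labels):
    """Hypothetical helper: average distance of samples from the mean of their own blob."""
    spreads = []
    for k in np.unique(labels):
        members = points[labels == k]
        spreads.append(np.linalg.norm(members - members.mean(axis=0), axis=1).mean())
    return float(np.mean(spreads))


print(same_seed_matches)
print(mean_spread(tight, tight_labels), mean_spread(loose, loose_labels))
```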

In [6]:

```
coord
```

Out[6]:

array([[ -1.54915892, -7.25010857], [ 0.61758013, -1.36802291], [ 1.36369409, 0.06608172], [ 0.35857025, -0.7851559 ], [ -0.89752435, -5.42677013], [ -3.71486953, -9.36874886], [ -4.25209341, -3.4847562 ], [ 1.60459034, -1.24558156], [ -0.8748411 , 0.43763252], [ -1.99653623, -4.77782225], [ 1.20936556, -3.15216453], [ -1.23856256, -10.59940081], [ -2.02797291, -9.47245011], [ -0.74104364, -10.07763506], [ -0.77722054, -10.72676345], [ -1.91775697, -10.66908765], [ -2.27031954, -4.83274261], [ -0.72864791, -7.18926735], [ -3.63296701, -3.34704806], [ 0.67974136, -0.52254041], [ -1.97416044, -3.32681457], [ -1.61892392, -9.71765939], [ -0.76794095, -2.14509066], [ -2.15820985, -9.63790953], [ -1.29923245, -8.30647414], [ -2.01196044, -3.52563248], [ -2.33805418, -10.39048298], [ -1.06834753, -2.658024 ], [ 1.49510676, -2.13776585], [ 1.42674589, -0.01517292], [ 1.99361544, -1.67464467], [ -2.70131918, -9.63497056], [ -1.02353151, -10.47025441], [ -1.6322142 , -3.06730015], [ 2.46092757, -1.62922949], [ -1.78211322, -3.47052225], [ -2.69138291, -1.80881652], [ 0.16411427, -1.20584193], [ 0.99325932, -0.75119958], [ -2.24589423, -2.5508473 ]])

In [7]:

```
clust_true
```

Out[7]:

array([0, 1, 1, 1, 2, 0, 2, 1, 1, 2, 1, 0, 0, 0, 0, 0, 2, 0, 2, 1, 2, 0, 2, 0, 0, 2, 0, 2, 1, 1, 1, 0, 0, 2, 1, 2, 2, 1, 1, 2])

It appears that `coord` is a two-dimensional array of X,Y coordinates. `clust_true` holds the index of the cluster that `make_blobs` thinks each coordinate should belong to.
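A quick way to confirm this reading is to look at the shapes of the two arrays and at how many samples fall into each true cluster. This is a small sketch that regenerates the data with the same parameters as above; the `counts` name is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs

coord, clust_true = make_blobs(n_samples=40, centers=3, cluster_std=1, random_state=2)

# One row per customer, one column per coordinate axis.
print(coord.shape)

# One cluster index per customer.
print(clust_true.shape)

# Number of customers that make_blobs placed in each blob.
counts = np.bincount(clust_true)
print(counts)
```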

It would be interesting to refer to this array later on, but we are not going to use it for the purposes of our analysis. We will determine the clusters on our own, with the help of the `KMeans` algorithm.

In [8]:

```
plt.scatter(coord[:, 0],
            coord[:, 1],
            s=plot_size*2,
            cmap='viridis');
```

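We use the `scatter` command to plot the customer locations. The `s` parameter defines the size of the nodes, while the `cmap` parameter picks a colour scheme (here `viridis`, which ships with `matplotlib`). Note that `cmap` only takes effect when colour values are supplied through the `c` parameter.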

`centers` and `cluster_std`?

In [9]:

```
model = KMeans(n_clusters=2)
model.fit(coord)
clust_pred = model.predict(coord)
```

Running the algorithm involves 3 distinct steps:

1) We initialise a `KMeans` instance and decide how many clusters we are going to seek (here: 2).

2) We "train" the model on the coordinates using the `fit()` function. This is the step that calculates the centroids.

3) Using the `predict()` function, we assign each coordinate to the cluster with the nearest centroid.
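To make the role of `predict()` concrete, the sketch below compares its output with a manual nearest-centroid lookup on the fitted model. The `n_init=10` and `random_state=0` arguments are assumptions added here for reproducibility; they are not used in the cell above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

coord, _ = make_blobs(n_samples=40, centers=3, cluster_std=1, random_state=2)

# n_init and random_state are assumptions, added so the result is repeatable.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(coord)                   # the centroids are computed here
clust_pred = model.predict(coord)  # each point is assigned to a centroid here

# Distance from every point (rows) to every fitted centroid (columns).
distances = np.linalg.norm(
    coord[:, np.newaxis, :] - model.cluster_centers_[np.newaxis, :, :], axis=2
)

# Index of the closest centroid for each point.
nearest = distances.argmin(axis=1)

print(np.array_equal(nearest, clust_pred))
```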

In [10]:

```
plt.scatter(coord[:, 0],
            coord[:, 1],
            c=clust_pred,
            s=plot_size*2,
            cmap='Accent')
centers = model.cluster_centers_
plt.scatter(centers[:, 0],
            centers[:, 1],
            c='red',
            s=plot_size*10,
            alpha=0.5);
```

We are using the `scatter` command twice. The first time, we use it to plot the customer coordinates.

Instead of a colour, we supply to `c` the predicted index of the cluster. The `Accent` colour scheme will automatically transform this into a visually appealing (hopefully) combination of colours.

The second time that we use `scatter`, we add the centroids of our customers to the original graph.

As a rule of thumb, every additional `matplotlib` command within a Jupyter cell will simply update the plot that was previously created within that cell.

In [11]:

```
model.inertia_
```

Out[11]:

172.85989892525026

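We can then inspect the model's `.inertia_` field, which provides a measure of how well `KMeans` has performed: the sum of squared distances between each sample and its closest centroid. Lower values indicate tighter clusters.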
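The inertia can be reproduced by hand from the fitted centroids and labels. The following is a minimal sketch under the same assumptions as before (`n_init=10` and `random_state=0` are added for reproducibility, so the value printed need not match the output above); `assigned_centers` and `manual_inertia` are names introduced for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

coord, _ = make_blobs(n_samples=40, centers=3, cluster_std=1, random_state=2)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coord)

# Centroid assigned to each sample, looked up through the fitted labels.
assigned_centers = model.cluster_centers_[model.labels_]

# Sum of squared Euclidean distances between samples and their centroids.
manual_inertia = np.sum((coord - assigned_centers) ** 2)

print(manual_inertia, model.inertia_)
```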

In [12]:

```
from yellowbrick.cluster import KElbowVisualizer
visualizer = KElbowVisualizer(model, k=(2,12),timings=False)
visualizer.fit(coord) # Fit the data to the visualizer
visualizer.show() # Finalize and render the figure
```

Out[12]:

<AxesSubplot:title={'center':'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>

As discussed in the class, we are using the elbow method to automatically identify a suggested number of clusters, without supplying any additional contextual information about the problem.

This technique usually works well as a preliminary step while performing an initial sift through the data. It must, however, be followed up with a proper facility location analysis.

In this case we are using the `KElbowVisualizer` that is provided by the `yellowbrick` package. With the parameter `k=(2,12)` we are instructing the algorithm to check every number of clusters from 2 up to, but not including, 12.

The term "distortion" refers to the within-cluster sum of squared distances between all elements and their corresponding centroids, and is therefore equivalent to "inertia".
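Because distortion and inertia coincide, the values behind an elbow curve can be approximated with a plain loop over `KMeans` models. This is a rough sketch, not the visualizer's own implementation: the `inertia_by_k` name, the `range(2, 12)` bounds, and the `n_init=10` and `random_state=0` arguments are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

coord, _ = make_blobs(n_samples=40, centers=3, cluster_std=1, random_state=2)

# Fit one model per candidate number of clusters and record its inertia.
inertia_by_k = {}
for k in range(2, 12):
    candidate = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coord)
    inertia_by_k[k] = candidate.inertia_

# The "elbow" is the k after which the inertia stops dropping sharply.
for k, inertia in inertia_by_k.items():
    print(f"k={k:2d}  inertia={inertia:.2f}")
```

Plotting `inertia_by_k` against `k` gives a curve that can be inspected by eye; the visualizer additionally marks a suggested elbow automatically.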