import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
stations = pd.read_csv('datasets/lu_stations.csv')
stations
links = pd.read_csv('datasets/lu_links.csv')
links
No surprises here - let's now convert these into a graph.
G = nx.Graph()
G.add_nodes_from(stations['id'])
G.add_edges_from(list(zip(links['station1'], links['station2'])))
nx.draw(G)
With so many nodes, this ends up being a very busy graph. We can amend the way that the nodes are plotted so that it looks a bit nicer, using the node_size parameter.
nx.draw(G, node_size = 6)
But it remains a bit difficult to see - what if we could make it a bit bigger?
This is possible using a few more advanced matplotlib features. You see, in every new cell we are creating a new instance of a matplotlib chart. Thanks to the pyplot library within matplotlib, the chart creation is quite similar to the one found in Matlab - so some concepts might look familiar.
To modify the size of the figure, we simply have to initialise the chart ourselves, using the plt.figure() command, and then specify its size using the figsize argument.
If you want more help with the transition from Matlab to Python, you can read this very helpful guide, or follow this DataCamp course.
plt.figure(figsize=(16,10))
nx.draw(G, node_size = 40)
Much better, but now that we have a better look at it, this certainly doesn't look anything like the London Tube.
Ah! But of course! We forgot to add the coordinates.
plt.figure(figsize=(16,10))
coords = list(zip(stations['longitude'],stations['latitude']))
pos = dict(zip(stations['id'], coords))
nx.draw(G,pos,node_size = 40)
What if we wanted to only illustrate the subgraph of the network that lies within Zone 1?
We can do that easily using the zone column in the stations dataframe - note that the authors of that list chose to use "half" values to denote stations that lie in two zones at the same time. Therefore Archway station is described as being in zone 2.5, whereas in official maps it is placed on the boundary of zones 2 and 3.
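If you want to see which stations carry these "half" values, a quick check (assuming the zone column has been parsed as a numeric type) is to look for non-integer zones:
# stations whose zone is not a whole number, i.e. stations sitting on a zone boundary
stations[stations['zone'] % 1 != 0]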
Therefore, if we want to obtain all the nodes that are found in zone 1, we really have to obtain the stations with a zone value of <2 - if we used <=1 to filter the list, we would exclude stations that lie on the zone boundary, such as Earls Court.
stations_z1 = pd.read_csv('datasets/lu_stations.csv')
stations_z1 = stations_z1[stations_z1['zone']<2]
len(stations_z1)
We filtered the stations using a condition applied on the zone column. This effectively says:
"Look at the zone column within the stations_z1 dataframe, and select the rows where its value is less than 2. Now return a new dataframe that contains only these rows".
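In other words, the comparison produces a boolean mask, which is then used to index the dataframe. Spelling the two steps out separately on the full stations dataframe:
mask = stations['zone'] < 2   # a boolean Series: True for the rows that satisfy the condition
stations[mask]                # a new dataframe containing only those rows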
We can now proceed to filter the links. The links dataframe does not contain any information on zones, but we can still filter it by checking whether both stations in each edge are found within our filtered list of Zone 1 stations.
To do this, we first create a list of all "allowed" node IDs. We then filter the list by excluding any link whose endpoints do not both belong in Zone 1.
allowed_stations = list(stations_z1['id'])
links_z1 = pd.read_csv('datasets/lu_links.csv')
links_z1 = links_z1.loc[links_z1['station1'].isin(allowed_stations)]
len(links_z1)
We have now cut the list down to 57 links, each of which has an allowed station in its station1 column. Let's now apply the same filter to station2.
links_z1 = links_z1.loc[links_z1['station2'].isin(allowed_stations)]
len(links_z1)
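As an aside, the two filtering steps could also be written as a single expression, requiring both endpoints to be allowed at once; the name links_z1_alt below is just a hypothetical placeholder, used to avoid overwriting our result:
links_z1_alt = pd.read_csv('datasets/lu_links.csv')
# keep only links where both endpoints are Zone 1 stations
links_z1_alt = links_z1_alt.loc[links_z1_alt['station1'].isin(allowed_stations) &
                                links_z1_alt['station2'].isin(allowed_stations)]
len(links_z1_alt)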
Let's now visualise this part of the network:
G_z1 = nx.Graph()
G_z1.add_nodes_from(stations_z1['id'])
G_z1.add_edges_from(list(zip(links_z1['station1'], links_z1['station2'])))
plt.figure(figsize=(16,10))
coords = list(zip(stations_z1['longitude'],stations_z1['latitude']))
pos = dict(zip(stations_z1['id'], coords))
nx.draw(G_z1, pos, node_size = 60)
We can now put together a table of our centralities.
I am going to use a lambda function to add station names into a column, based on a dictionary and the value of the ID column. There are much easier ways to achieve this (see the aside after the code below), but I wanted to take this opportunity to show you the lambda feature in action.
dict_names = dict(zip(stations['id'],stations['name']))
centralities = pd.DataFrame()
centralities['ID'] = G.nodes()
centralities['Names'] = centralities["ID"].map(lambda x:dict_names[x])
centralities['degree_centr'] = nx.degree_centrality(G).values()
centralities['closeness_centr'] = nx.closeness_centrality(G).values()
centralities['betweenness_centr'] = nx.betweenness_centrality(G).values()
centralities['eigenvector_centr'] = nx.eigenvector_centrality(G).values()
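As an aside, the "much easier way" hinted at above is to pass the dictionary straight to map(), which performs the same lookup without a lambda:
# equivalent, lambda-free version of the name lookup
centralities['Names'] = centralities['ID'].map(dict_names)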
Let us now obtain our "Top 10" lists.
centralities.sort_values(by='degree_centr', ascending=False).head(10).reset_index()[['Names','degree_centr']]
centralities.sort_values(by='closeness_centr', ascending=False).head(10).reset_index()[['Names','closeness_centr']]
centralities.sort_values(by='betweenness_centr', ascending=False).head(10).reset_index()[['Names','betweenness_centr']]
centralities.sort_values(by='eigenvector_centr', ascending=False).head(10).reset_index()[['Names','eigenvector_centr']]
The overwhelming message here is that Green Park is definitely important!
Let's now visualise these values:
plt.figure(figsize=(16,10))
coords = list(zip(stations['longitude'],stations['latitude']))
pos = dict(zip(stations['id'], coords))
nx.draw(G, pos, with_labels = False, node_color = list(centralities['degree_centr']))
plt.figure(figsize=(16,10))
coords = list(zip(stations['longitude'],stations['latitude']))
pos = dict(zip(stations['id'], coords))
nx.draw(G, pos, with_labels = False, node_color = list(centralities['betweenness_centr']))
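Another way to make the most central stations stand out is to scale the node sizes by their centrality values - a quick sketch, where the factor of 5000 is just an arbitrary scaling choice:
plt.figure(figsize=(16,10))
coords = list(zip(stations['longitude'],stations['latitude']))
pos = dict(zip(stations['id'], coords))
# node sizes proportional to betweenness centrality (same node order as G.nodes())
nx.draw(G, pos, with_labels = False, node_size = [5000 * v for v in centralities['betweenness_centr']])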
At this point we will apply the K-Means algorithm in order to split the network into zones. I would strongly advise that you pause this notebook for the time being, and instead have a look at the next one in this series (2.4 - Kmeans clustering), which introduces the fundamentals of this method in more detail.
You can return to this notebook once you have had a look at that one.
To proceed, we import the KMeans class from sklearn and the numpy package.
from sklearn.cluster import KMeans
import numpy as np
The KMeans algorithm can work with any dataset - in our case, we will simply apply it to an array of the station coordinates.
coord = np.array(list(zip(stations['longitude'],stations['latitude'])))
model = KMeans(n_clusters=10)
model.fit(coord)
clust_pred = model.predict(coord)
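Before plotting, a quick sanity check: each station has now been assigned a cluster label between 0 and 9, and we can count how many stations fall into each cluster:
# number of stations assigned to each of the 10 clusters
pd.Series(clust_pred).value_counts().sort_index()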
We use the same code as in Notebook 2.3 to plot the cluster memberships.
plot_size = 20
plot_width = 10
plot_height = 10
params = {'legend.fontsize': 'large',
'figure.figsize': (plot_width,plot_height),
'axes.labelsize': plot_size,
'axes.titlesize': plot_size,
'xtick.labelsize': plot_size*0.5,
'ytick.labelsize': plot_size*0.50,
'axes.titlepad': 25}
plt.rcParams.update(params)
plt.scatter(coord[:, 0],
coord[:, 1],
c = clust_pred,
s=plot_size*2,
cmap='Accent')
centers = model.cluster_centers_
plt.scatter(centers[:, 0],
centers[:, 1],
c = 'red',
s=plot_size*10,
alpha=0.5);
Next, we are going to use the KElbowVisualizer from the yellowbrick package to pick a suitable number of clusters.
from yellowbrick.cluster import KElbowVisualizer
visualizer = KElbowVisualizer(model, k=(2,12),timings=False)
visualizer.fit(coord) # Fit the data to the visualizer
visualizer.show() # Finalize and render the figure
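If you prefer to read the suggested number of clusters programmatically rather than off the chart, recent yellowbrick versions expose it on the fitted visualizer as elbow_value_ (an assumption worth checking against your installed version):
visualizer.elbow_value_   # the k at the detected "elbow" (None if no elbow was found)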
model = KMeans(n_clusters=5)
model.fit(coord)
clust_pred = model.predict(coord)
plt.scatter(coord[:, 0],
coord[:, 1],
c = clust_pred,
s=plot_size*2,
cmap='Accent')
centers = model.cluster_centers_
plt.scatter(centers[:, 0],
centers[:, 1],
c = 'red',
s=plot_size*10,
alpha=0.5);
These look relatively sensible at first sight. Let us now store the cluster memberships in the stations dataframe.
stations['cluster'] = clust_pred
stations
We are lucky enough to have an indication of travel demand patterns on the Night Tube, thanks to the Oyster data that were released by Transport for London under their Rolling Origin and Destination Survey (RODS). The entire set of data can be found at:
http://crowding.data.tfl.gov.uk
Before going any further, let's load the demand file.
demand = pd.read_csv('datasets/lu_od.csv')
demand
We can see in the above that demands are provided in a "directional" manner - as a flow of customers from one node to another.
To make better sense of the dataset, we will seek to aggregate these flows into a sum of departures and arrivals for each station.
To do this, we start by creating two dictionaries to store the total flows. The initial flow value for each station will be zero, and we will add flows to the correct place in the dictionary one by one.
We set up these dictionaries by zipping a list of station IDs with an array of zeroes - this should be exactly as long as the list of stations. We can do this using the np.zeros(len(stations)) command.
dict_from = dict(zip(stations['id'],np.zeros(len(stations))))
dict_to = dict(zip(stations['id'],np.zeros(len(stations))))
In the next step, we will go through each node in our set one by one, summing the flow values in our demand table for the current node, with respect to the origin_id and dest_id columns respectively.
for node in dict_from:
    dict_from[node] = demand.loc[demand['origin_id'] == node, 'demand'].sum()
for node in dict_to:
    dict_to[node] = demand.loc[demand['dest_id'] == node, 'demand'].sum()
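As an aside, the same totals could be computed without an explicit loop using pandas' groupby - a sketch under the assumption of the same column names. Note that, unlike the dictionaries above, the result only contains stations that actually appear in the demand table, so we will stick with the dictionary version here:
# equivalent aggregation without an explicit loop (stations absent from the demand table are not included)
departures = demand.groupby('origin_id')['demand'].sum()
arrivals = demand.groupby('dest_id')['demand'].sum()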
We can now store these values into our stations dataframe.
stations['tot_departures'] = dict_from.values()
stations['tot_arrivals'] = dict_to.values()
stations
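Mirroring the "Top 10" lists above, we could now rank the stations by their total departures - a quick sketch:
# busiest stations by total departures
stations.sort_values(by='tot_departures', ascending=False).head(10)[['name','tot_departures']]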