Transport Analytics Training Series - Last Revision: October 2022

Studying the London Underground¶

Now that we have experimented with a few small networks, we are ready to look at a more substantial dataset - the London Underground!

In [1]:
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt

Part 1 - Drawing the London Underground network¶

We have placed a simplified dataset of the LU network structure under the data-london-underground folder, which we will load using pandas.

In [2]:
stations = pd.read_csv('data-london-underground/lu_stations.csv')
stations
Out[2]:
id latitude longitude name zone
0 1 51.5028 -0.2801 Acton Town 3.0
1 8 51.5653 -0.1353 Archway 2.5
2 9 51.6164 -0.1331 Arnos Grove 4.0
3 10 51.5586 -0.1059 Arsenal 2.0
4 11 51.5226 -0.1571 Baker Street 1.0
... ... ... ... ... ...
139 296 51.5120 -0.2239 White City 2.0
140 297 51.5492 -0.2215 Willesden Green 2.5
141 303 51.5975 -0.1097 Wood Green 3.0
142 301 51.6070 0.0341 Woodford 4.0
143 302 51.6179 -0.1856 Woodside Park 4.0

144 rows × 5 columns

This is roughly half the number of stations found in the real-world network - we trimmed the dataset in order to keep things simple for the purposes of this analysis.

In [3]:
links = pd.read_csv('data-london-underground/lu_links.csv')
links
Out[3]:
station1 station2 line time
0 1 234 10 4
1 1 265 10 4
2 8 124 9 3
3 8 264 9 2
4 9 31 10 3
... ... ... ... ...
164 257 258 9 2
165 261 302 9 3
166 266 303 10 2
167 279 285 7 2
168 288 302 9 1

169 rows × 4 columns

No surprises here - let's now convert these into a graph.

In [4]:
G = nx.Graph()
G.add_nodes_from(stations['id'])
G.add_edges_from(list(zip(links['station1'], links['station2'])))

nx.draw(G)

With such a large number of nodes, this ends up being a very busy graph. We can amend the way that the nodes are plotted so that it looks a bit nicer, using the node_size parameter.

In [5]:
nx.draw(G, node_size = 6)

But it remains a bit difficult to see - what if we could make it a bit bigger?

This is possible using a few more advanced matplotlib features. You see, in every new cell we create a new instance of a matplotlib chart. Thanks to the pyplot module within matplotlib, chart creation is quite similar to the way it is done in Matlab - so some concepts might look familiar.

To modify the size of the figure, we simply have to initialise the chart ourselves, using the plt.figure() command, and then specify its size using the figsize argument.

If you want more help with the transition from Matlab to Python, you can read this very helpful guide, or follow this DataCamp course.

In [6]:
plt.figure(figsize=(16,10))
nx.draw(G, node_size = 40)

Much better, but now that we have a better look at it, this certainly doesn't look anything like the London Tube.

Ah! But of course! We forgot to add the coordinates.

In [7]:
plt.figure(figsize=(16,10))

coords = list(zip(stations['longitude'],stations['latitude']))
pos = dict(zip(stations['id'], coords))
nx.draw(G,pos,node_size = 40)
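If you also want to see which station each node represents, nx.draw accepts a labels dictionary mapping node IDs to display names, built exactly like the pos dictionary above. A minimal sketch using a handful of made-up stations (the IDs, names, and coordinates below are illustrative only, not taken from the dataset):

```python
import networkx as nx
import matplotlib.pyplot as plt

# Toy stand-in for the stations/links data (illustrative values only)
ids = [1, 2, 3]
names = ['A', 'B', 'C']
lons = [-0.28, -0.14, -0.13]
lats = [51.50, 51.56, 51.62]

G_demo = nx.Graph()
G_demo.add_nodes_from(ids)
G_demo.add_edges_from([(1, 2), (2, 3)])

pos = dict(zip(ids, zip(lons, lats)))   # node id -> (x, y)
labels = dict(zip(ids, names))          # node id -> display name

plt.figure(figsize=(6, 4))
nx.draw(G_demo, pos, node_size=40, labels=labels, font_size=8)
```

Passing labels switches the text rendering on automatically, so there is no need to set with_labels=True separately.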

Part 2 - Extraction of network subgraphs¶

What if we wanted to only illustrate the subgraph of the network that lies within Zone 1?

We can do that easily using the zone column in the stations dataframe - note that the authors of that list chose to use "half" values to denote stations that lie in two zones at the same time. Therefore, Archway station is described as being in zone 2.5, when in official maps it is placed on the boundary of zones 2 and 3.

Therefore, if we want to obtain all the nodes that are found in Zone 1, we really have to obtain the stations with a zone value of <2 - if we used <=1 to filter the list, we would have excluded stations that lie on the zone boundary, such as Earl's Court.

In [8]:
stations_z1 = pd.read_csv('data-london-underground/lu_stations.csv')
stations_z1 = stations_z1[stations_z1['zone']<2]
len(stations_z1)
Out[8]:
36

We filtered the stations using a condition applied on the zone column. This effectively says:

"Look at the zone column within the stations_z1 dataframe, and select the rows where its value is less than 2. Now return a new dataframe, that contains only these rows".

We can now proceed to filter the links. The links dataframe does not contain any information on zones, but we can filter it by checking whether both stations in each edge are found within our filtered list of Zone 1 stations.

To do this, we first create a list of all "allowed" node IDs. We then filter the list by excluding any link whose endpoints do not both belong to Zone 1.

In [9]:
allowed_stations = list(stations_z1['id'])

links_z1 = pd.read_csv('data-london-underground/lu_links.csv')
links_z1 = links_z1.loc[links_z1['station1'].isin(allowed_stations)]
len(links_z1)
Out[9]:
57

We have now narrowed the list down to 57 links, which have an allowed station in the station1 column. Let's now apply the same filter to station2.

In [10]:
links_z1 = links_z1.loc[links_z1['station2'].isin(allowed_stations)]
len(links_z1)
Out[10]:
54
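The two filtering steps above could also be combined into a single expression using the & operator, which performs an element-wise "and" on the two boolean masks. A sketch on toy data (the IDs below are made up for illustration):

```python
import pandas as pd

allowed = [10, 11, 20]  # illustrative "allowed" station IDs

links_demo = pd.DataFrame({'station1': [10, 10, 99],
                           'station2': [11, 99, 20]})

# Require *both* endpoints to be in the allowed set, in one step
both_in = (links_demo['station1'].isin(allowed)
           & links_demo['station2'].isin(allowed))
links_filtered = links_demo[both_in]
```

Note the parentheses around each isin() call - & binds more tightly than comparison operators in Python, so they are needed when the conditions are written inline.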

Let's now visualise this part of the network:

In [11]:
G_z1 = nx.Graph()
G_z1.add_nodes_from(stations_z1['id'])
G_z1.add_edges_from(list(zip(links_z1['station1'], links_z1['station2'])))

plt.figure(figsize=(16,10))
coords = list(zip(stations_z1['longitude'],stations_z1['latitude']))
pos = dict(zip(stations_z1['id'], coords))
nx.draw(G_z1, pos, node_size = 60)
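As an aside, NetworkX can extract a subgraph directly from a list of node IDs with Graph.subgraph(), which keeps only the edges whose endpoints are both in the given set - so we could have skipped filtering the links dataframe by hand. A minimal sketch on a toy graph:

```python
import networkx as nx

G_demo = nx.Graph()
G_demo.add_edges_from([(1, 2), (2, 3), (3, 4)])

# subgraph() keeps the given nodes plus only the edges
# whose endpoints are *both* in that set
H = G_demo.subgraph([1, 2, 3])
```

Filtering the dataframes by hand, as we did above, has the advantage of keeping the link attributes (line, time) available for later steps.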

Part 3 - Obtaining centrality metrics¶

We can now compute a set of centrality metrics.

I am going to use a lambda function to add station names into a column, based on a dictionary and the value of the ID column. There are much easier ways to achieve this, but I wanted to take this opportunity to show the lambda feature in action.
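In miniature, the pattern looks like this (the dictionary entries below are illustrative): map() calls the lambda once per element, and the lambda looks each ID up in the dictionary.

```python
import pandas as pd

dict_demo = {1: 'Acton Town', 8: 'Archway'}
s = pd.Series([1, 8, 1])

# map() applies the lambda element-wise; each ID is
# looked up in the dictionary
named = s.map(lambda x: dict_demo[x])
```

One of the easier ways alluded to above: Series.map also accepts a dictionary directly, so s.map(dict_demo) produces the same result without a lambda.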

In [12]:
dict_names = dict(zip(stations['id'],stations['name']))
In [13]:
centralities = pd.DataFrame()
centralities['ID'] = G.nodes()
centralities['Names'] = centralities["ID"].map(lambda x:dict_names[x])
centralities['degree_centr'] = nx.degree_centrality(G).values()
centralities['closeness_centr'] = nx.closeness_centrality(G).values()
centralities['betweenness_centr'] = nx.betweenness_centrality(G).values()
centralities['eigenvector_centr'] = nx.eigenvector_centrality(G).values()
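A caution on the .values() calls above: they rely on the centrality dictionaries iterating in the same order as G.nodes(). A slightly more defensive pattern wraps each result in pd.Series, which aligns values to the dataframe index by node ID regardless of order. A sketch on a toy graph (node IDs are illustrative):

```python
import networkx as nx
import pandas as pd

G_demo = nx.path_graph([10, 20, 30])  # toy path graph with explicit node IDs

df = pd.DataFrame({'ID': list(G_demo.nodes())}).set_index('ID')
# pd.Series(dict) aligns values to the index by node ID,
# so the row order of the dataframe no longer matters
df['degree_centr'] = pd.Series(nx.degree_centrality(G_demo))
```

Here the middle node of the path has degree 2 out of a possible 2, so its degree centrality is 1.0, while the endpoints get 0.5.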

Let us now obtain our "Top 10" lists.

In [14]:
centralities.sort_values(by='degree_centr', ascending=False).head(10).reset_index()[['Names','degree_centr']]
Out[14]:
Names degree_centr
0 Green Park 0.041958
1 Oxford Circus 0.034965
2 Waterloo 0.034965
3 Leicester Square 0.027972
4 Bond Street 0.027972
5 Euston 0.027972
6 Finsbury Park 0.027972
7 Piccadilly Circus 0.027972
8 Stockwell 0.027972
9 Tottenham Court Road 0.027972
In [15]:
centralities.sort_values(by='closeness_centr', ascending=False).head(10).reset_index()[['Names','closeness_centr']]
Out[15]:
Names closeness_centr