Now that we have experimented with a few small networks, we are ready to look at a more substantial dataset - the London Underground!
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
We have placed a simplified dataset of the LU network structure under the `data-london-underground` folder, which we will load using `pandas`.
stations = pd.read_csv('data-london-underground/lu_stations.csv')
stations
 | id | latitude | longitude | name | zone
---|---|---|---|---|---
0 | 1 | 51.5028 | -0.2801 | Acton Town | 3.0 |
1 | 8 | 51.5653 | -0.1353 | Archway | 2.5 |
2 | 9 | 51.6164 | -0.1331 | Arnos Grove | 4.0 |
3 | 10 | 51.5586 | -0.1059 | Arsenal | 2.0 |
4 | 11 | 51.5226 | -0.1571 | Baker Street | 1.0 |
... | ... | ... | ... | ... | ... |
139 | 296 | 51.5120 | -0.2239 | White City | 2.0 |
140 | 297 | 51.5492 | -0.2215 | Willesden Green | 2.5 |
141 | 303 | 51.5975 | -0.1097 | Wood Green | 3.0 |
142 | 301 | 51.6070 | 0.0341 | Woodford | 4.0 |
143 | 302 | 51.6179 | -0.1856 | Woodside Park | 4.0 |
144 rows × 5 columns
This is about half the number of stations in the real-world network - we trimmed the dataset to keep things simple for the purposes of this analysis.
links = pd.read_csv('data-london-underground/lu_links.csv')
links
 | station1 | station2 | line | time
---|---|---|---|---
0 | 1 | 234 | 10 | 4 |
1 | 1 | 265 | 10 | 4 |
2 | 8 | 124 | 9 | 3 |
3 | 8 | 264 | 9 | 2 |
4 | 9 | 31 | 10 | 3 |
... | ... | ... | ... | ... |
164 | 257 | 258 | 9 | 2 |
165 | 261 | 302 | 9 | 3 |
166 | 266 | 303 | 10 | 2 |
167 | 279 | 285 | 7 | 2 |
168 | 288 | 302 | 9 | 1 |
169 rows × 4 columns
No surprises here - let's now convert these into a graph.
G = nx.Graph()
G.add_nodes_from(stations['id'])
G.add_edges_from(list(zip(links['station1'], links['station2'])))
nx.draw(G)
With such a large number of nodes, this ends up being a very busy graph. We can amend the way the nodes are plotted, so that it looks a bit nicer, using the `node_size` parameter.
nx.draw(G, node_size = 6)
But it remains a bit difficult to see - what if we could make it a bit bigger?
This is possible using a few more advanced `matplotlib` features. You see, in every new cell we are creating a new instance of a `matplotlib` chart. Thanks to the `pyplot` library within `matplotlib`, chart creation is quite similar to the one found in Matlab - so some concepts might look familiar.

To modify the size of the figure, we simply have to initialise the chart ourselves using the `plt.figure()` command, and then specify its size using the `figsize` parameter.
If you want more help with the transition from Matlab to Python, you can read this very helpful guide, or follow this DataCamp course.
plt.figure(figsize=(16,10))
nx.draw(G, node_size = 40)
Much better, but now that we have a better look at it, this certainly doesn't look anything like the London Tube.
Ah! But of course! We forgot to add the coordinates.
plt.figure(figsize=(16,10))
coords = list(zip(stations['longitude'],stations['latitude']))
pos = dict(zip(stations['id'], coords))
nx.draw(G,pos,node_size = 40)
What if we wanted to illustrate only the subgraph of the network that lies within Zone 1?
We can do that easily using the `zone` column in the `stations` dataframe - note that the authors of that list chose to use "half" values to denote stations that lie in two zones at the same time. Therefore `Archway` station is described as being in zone `2.5`, whereas in official maps it is placed on the boundary of zones 2 and 3.
Therefore, if we want to obtain all the nodes that are found in Zone 1, we really have to obtain the stations with a `zone` value of `<2` - if we used `<=1` to filter the list, we would have excluded stations that lie on the zone boundary, such as Earls Court.
stations_z1 = pd.read_csv('data-london-underground/lu_stations.csv')
stations_z1 = stations_z1[stations_z1['zone']<2]
len(stations_z1)
36
We filtered the stations using a condition applied on the `zone` column. This effectively says:

"Look at the `zone` column within the `stations_z1` dataframe, and select the rows where its value is less than 2. Now return a new dataframe that contains only these rows."
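To make the mechanics of this explicit, here is the same boolean-mask idea on a tiny, made-up dataframe (toy values only, not the real station data):

```python
import pandas as pd

# A tiny, made-up dataframe mimicking the stations table
toy = pd.DataFrame({'name': ['A', 'B', 'C'], 'zone': [1.0, 1.5, 2.0]})

# The comparison produces a boolean Series - one True/False per row
mask = toy['zone'] < 2
print(mask.tolist())                 # [True, True, False]

# Indexing the dataframe with the mask keeps only the True rows
print(toy[mask]['name'].tolist())    # ['A', 'B']
```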
We can now proceed to filter the links. The `links` dataframe does not contain any information on zones, but we can filter it by checking whether both stations in each edge are found within our filtered list of Zone 1 stations.

To do this, we first create a list of all "allowed" node IDs. We then filter the list by excluding any link whose endpoints do not belong in Zone 1.
allowed_stations = list(stations_z1['id'])
links_z1 = pd.read_csv('data-london-underground/lu_links.csv')
links_z1 = links_z1.loc[links_z1['station1'].isin(allowed_stations)]
len(links_z1)
57
We have now narrowed the list down to 57 links, which have an allowed station in the `station1` column. Let's now apply the filter to `station2`.
links_z1 = links_z1.loc[links_z1['station2'].isin(allowed_stations)]
len(links_z1)
54
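Incidentally, the two filtering steps can also be collapsed into one by combining the two boolean masks with `&` - a sketch on made-up link data, not the real file:

```python
import pandas as pd

allowed = [1, 2, 3]
toy_links = pd.DataFrame({'station1': [1, 2, 5],
                          'station2': [2, 9, 3]})

# Keep only rows where BOTH endpoints are in the allowed list
both_in = (toy_links['station1'].isin(allowed)
           & toy_links['station2'].isin(allowed))
print(toy_links[both_in].values.tolist())   # [[1, 2]]
```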
Let's now visualise this part of the network:
G_z1 = nx.Graph()
G_z1.add_nodes_from(stations_z1['id'])
G_z1.add_edges_from(list(zip(links_z1['station1'], links_z1['station2'])))
plt.figure(figsize=(16,10))
coords = list(zip(stations_z1['longitude'],stations_z1['latitude']))
pos = dict(zip(stations_z1['id'], coords))
nx.draw(G_z1, pos, node_size = 60)
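As an aside, NetworkX can build the same kind of subgraph directly with `G.subgraph()`, which keeps only the listed nodes and the edges between them - shown here on a toy graph rather than the Underground data:

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (3, 4)])

# subgraph() keeps only the given nodes, plus edges whose
# endpoints both remain - node 4 and edge (3, 4) are dropped
H = G.subgraph([1, 2, 3])
print(sorted(H.edges()))   # [(1, 2), (2, 3)]
```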
We can now compute a table of centrality measures.
I am going to use a `lambda` function to add station names into a column, based on a dictionary and the value of the `ID` column. There are much easier ways to achieve this, but I wanted to take this opportunity to show you the `lambda` feature in action.
dict_names = dict(zip(stations['id'],stations['name']))
centralities = pd.DataFrame()
centralities['ID'] = G.nodes()
centralities['Names'] = centralities["ID"].map(lambda x:dict_names[x])
centralities['degree_centr'] = nx.degree_centrality(G).values()
centralities['closeness_centr'] = nx.closeness_centrality(G).values()
centralities['betweenness_centr'] = nx.betweenness_centrality(G).values()
centralities['eigenvector_centr'] = nx.eigenvector_centrality(G).values()
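For completeness, the "easier way" alluded to above: pandas' `.map()` accepts a dictionary directly, so the lambda is not strictly needed. A sketch with a made-up subset of station names:

```python
import pandas as pd

# Made-up subset of the id -> name dictionary
dict_names = {1: 'Acton Town', 8: 'Archway'}
df = pd.DataFrame({'ID': [1, 8]})

# map() with a dict looks each ID up directly - no lambda required
df['Names'] = df['ID'].map(dict_names)
print(df['Names'].tolist())   # ['Acton Town', 'Archway']
```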
Let us now obtain our "Top 10" lists.
centralities.sort_values(by='degree_centr', ascending=False).head(10).reset_index()[['Names','degree_centr']]
 | Names | degree_centr
---|---|---
0 | Green Park | 0.041958 |
1 | Oxford Circus | 0.034965 |
2 | Waterloo | 0.034965 |
3 | Leicester Square | 0.027972 |
4 | Bond Street | 0.027972 |
5 | Euston | 0.027972 |
6 | Finsbury Park | 0.027972 |
7 | Piccadilly Circus | 0.027972 |
8 | Stockwell | 0.027972 |
9 | Tottenham Court Road | 0.027972 |
centralities.sort_values(by='closeness_centr', ascending=False).head(10).reset_index()[['Names','closeness_centr']]
 | Names | closeness_centr
---|---|---