Geographic information system (GIS) is a system to capture and analyse spatial and geographical data.
It is often divided into two types of data:
For the Transport Systems module, you will be working with the vector data.
Vector GIS data is commonly stored in the shapefile format with related attribute data. These can be opened by commercial software packages such as ArcGIS by ESRI, creator of shapefiles, or QGIS.
Whilst the term shapefile sounds like there would be one singular file, a shapefile is a collection of multiple basic files:
All the files that form one shapefile needs to have the same prefix, e.g. roads.shp, roads.shx, roads.dbf, etc.
When using Python to process shapefiles, we will read in *.shp file, but all the other required files need to be present in the same folder.
One more important aspect to consider when working with GIS data is ensuring you are working in the same coordinate reference system (CRS) throughout your entire project. In this course, we will be using EPSG:4326 (WGS 84). This is the latitude and longitude CRS you will find when you Google London's latitude and longitude: 51.509865, -0.118092. However, shapefile geometry order is longitude then latitude.
Now that we understand what the files are for, let's install and import the necessary packages for us to be able to work with shapefiles.
Within your conda environment tsenv
, you will already have a few libraries installed from previous sessions:
We will now install geopandas, which is pandas for geospatial data. To do this, we need to first ensure to install GDAL (Geospatial Data Abstraction Library), which is a translator library for raster and vector geospatial data formats.
Regardless of having a Windows or Mac or Linux operating system, instead of pip install ..
, we will be using conda install ..
:
conda install -c conda-forge gdal
conda install geopandas
If this didn't work, especially for those of you, who are using Windows, try using an alternative method following a YouTube video here.
Please contact your GTAs, if you are having trouble installing GDAL and/or Geopandas.
As part of the GeoPandas, a package called Shapely will be downloaded. This helps us work with GIS data of points, polygons, and multipolygons.
Now, let's start coding.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from shapely.geometry import Point, Polygon, MultiPolygon
import geopandas as gpd
from sklearn.cluster import KMeans
from matplotlib import cm
from pulp import *
%matplotlib inline
You can read in a shapefile the following way by refering to the *.shp file.
ldn_df = gpd.read_file('LSOA_GreaterLondon_Households.shp')
ldn_df.head()
LSOA11CD | LSOA11NM | HHOLDS | AVHHOLDSZ | geometry | |
---|---|---|---|---|---|
0 | E01000001 | City of London 001A | 876 | 1.7 | POLYGON ((-0.09726 51.52158, -0.09649 51.52028... |
1 | E01000002 | City of London 001B | 830 | 1.7 | POLYGON ((-0.08810 51.51941, -0.08927 51.51752... |
2 | E01000003 | City of London 001C | 817 | 1.5 | POLYGON ((-0.09676 51.52325, -0.09644 51.52282... |
3 | E01000005 | City of London 001E | 467 | 2.1 | POLYGON ((-0.07320 51.51000, -0.07551 51.50974... |
4 | E01000006 | Barking and Dagenham 016A | 543 | 3.1 | POLYGON ((0.09118 51.53909, 0.09328 51.53787, ... |
The example shapefile contains Lower Super Output Areas (LSOA) in Greater London. LSOAs are a geographic hierarchy designed to improve the reporting of small area statistics in England and Wales.
In this example, our shapefile contains the LSOA zone codes (or IDs), the name of the LSOAs, the number of households in each area and the average household size for each zone. Then most importantly, the data also contains the geometry details of the polygon, which forms the outline of each LSOA zone.
As mentioned, it is very important to note that the geometry in Geopandas is in order of longitude then latitude. If you get these mixed up, instead of analysing London, you may be analysing bits of the sea next to Somalia!
In previous sessions, you have already learned how to works with pandas dataframes. Geopandas dataframe are the same. We will make a graph using our new ldn_df
. Latitude defines your y and longitutde defines your x.
We will use matplotlib.pyplot
that has been imported as plt
to plot the average household size of different LSOAs in London.
# The following is optional. You can set any minimum x and y and maximum x and y for your plot frame.
minx, miny, maxx, maxy = ldn_df.total_bounds
fig, ax = plt.subplots(1, figsize = (15,15))
# I have calculated the minimum and maximum latitudes and longitudes, so that my graph has very little whitespace.
# This is optional. You could have more white space, but ensure, you show your entire graph
ax.set_xlim(minx, maxx)
ax.set_ylim(miny, maxy)
# add a title
ax.set_title('Average household size per LSOA in London', fontdict={'fontsize': '15', 'fontweight': '3'})
# We will remove the axis for better presentation
ax.axis('off')
# We will annotate and add source
ax.annotate('Source: London Datastore, LSOAs in London as of 2011',xy=(0.1, .08), xycoords='figure fraction', horizontalalignment='left', verticalalignment='bottom', fontsize=12, color='#555555')
# Plot using a rainbow colormap
ldn_df.plot(ax=ax,alpha=1, column='AVHHOLDSZ', cmap='Spectral')
# Add colorbar
sm = plt.cm.ScalarMappable(cmap='Spectral', norm=plt.Normalize(vmin=ldn_df.AVHHOLDSZ.min(), vmax=ldn_df.AVHHOLDSZ.max()))# empty array for the data range
cbar = fig.colorbar(sm, shrink=0.5)