## How to work with GIS

A geographic information system (GIS) is a system for capturing and analysing spatial and geographical data.

GIS data is often divided into two types:

1. Vector data, consisting of point, line (or arc), and polygon data. A polygon is a closed shape whose boundary is defined by an ordered sequence of points. You may also come across multipolygons, which are multiple polygons grouped together as a single feature.
2. Raster data, consisting of continuous or discrete features of surfaces and areas, such as topographic information, population density, or satellite imagery.

For the Transport Systems module, you will be working with vector data.
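To make the vector types concrete, the sketch below writes one of each as a GeoJSON-style dictionary using only the standard library. The coordinates are made-up illustrative values near central London, given as (longitude, latitude) pairs, and `is_closed` is a hypothetical helper added for this example.

```python
# Illustrative coordinates only, as (longitude, latitude); not real features.
point = {"type": "Point", "coordinates": (-0.118, 51.510)}

line = {
    "type": "LineString",
    "coordinates": [(-0.118, 51.510), (-0.110, 51.512), (-0.100, 51.515)],
}

# A polygon is a list of rings; the first ring is the outer boundary and
# it closes by repeating its first point at the end.
polygon = {
    "type": "Polygon",
    "coordinates": [
        [(-0.12, 51.50), (-0.10, 51.50), (-0.10, 51.52), (-0.12, 51.52), (-0.12, 51.50)]
    ],
}

# A multipolygon groups several polygons into one feature.
multipolygon = {
    "type": "MultiPolygon",
    "coordinates": [
        polygon["coordinates"],
        [[(-0.09, 51.50), (-0.08, 51.50), (-0.08, 51.51), (-0.09, 51.50)]],
    ],
}


def is_closed(ring):
    """Return True if a polygon ring ends where it starts."""
    return len(ring) >= 4 and ring[0] == ring[-1]
```

Libraries such as Shapely wrap these same ideas in dedicated geometry objects, so you will rarely build the dictionaries by hand.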

Vector GIS data is commonly stored in the shapefile format together with related attribute data. Shapefiles can be opened in desktop GIS software such as the commercial ArcGIS by Esri (the creator of the shapefile format) or the free, open-source QGIS.

Whilst the term shapefile sounds like there would be one singular file, a shapefile is a collection of multiple basic files:

• *.shp   -   Main file (required) with the geometry data -- this often has the largest file size
• *.shx   -   Index file (required) with the geometry data index
• *.dbf   -   Attribute file (required) with the attribute information of features and geometry -- even with zero attributes, this is required.

• *.prj   -   The file that stores the coordinate system information (optional).
• *.cpg   -   The file that can be used to specify the codepage for identifying the character set to be used (optional).
• *.sbn   -   One of the two files that store spatial index of the features (optional).
• *.sbx   -   One of the two files that store spatial index of the features (optional).

All the files that form one shapefile need to have the same prefix, e.g. roads.shp, roads.shx, roads.dbf, etc.

When using Python to process shapefiles, we read in the *.shp file, but all the other required files need to be present in the same folder.
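Before reading a shapefile, it can help to confirm that the required companion files are actually there. The following is a minimal sketch using only the standard library; `missing_shapefile_parts` is a hypothetical helper written for this example, and it assumes lowercase file extensions.

```python
from pathlib import Path

# The three files every shapefile needs, as listed above.
REQUIRED_SUFFIXES = (".shp", ".shx", ".dbf")


def missing_shapefile_parts(shp_path):
    """Return the names of required files that are absent for a shapefile.

    `shp_path` points at the *.shp file; the companions are expected to
    share its prefix and live in the same folder.
    """
    shp_path = Path(shp_path)
    return [
        shp_path.with_suffix(suffix).name
        for suffix in REQUIRED_SUFFIXES
        if not shp_path.with_suffix(suffix).exists()
    ]
```

An empty list means all three required files were found; otherwise the list names the ones to track down.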

One more important aspect of working with GIS data is making sure you use the same coordinate reference system (CRS) throughout your entire project. In this course, we will be using EPSG:4326 (WGS 84). This is the latitude and longitude CRS you will find when you Google London's coordinates: 51.509865, -0.118092 (latitude, then longitude). Note, however, that shapefile geometries store coordinates in the opposite order: longitude first, then latitude.
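The ordering difference is easy to get wrong, so here is a small sketch of the conversion. `latlon_to_xy` is a hypothetical helper for illustration; it swaps a (latitude, longitude) pair into the (x, y) = (longitude, latitude) order used by geometries and rejects values outside the valid ranges.

```python
def latlon_to_xy(lat, lon):
    """Convert a (latitude, longitude) pair to (x, y) = (longitude, latitude)."""
    if not -90 <= lat <= 90:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180 <= lon <= 180:
        raise ValueError(f"longitude out of range: {lon}")
    return (lon, lat)


# London as you would find it in a web search: latitude first.
london_xy = latlon_to_xy(51.509865, -0.118092)
print(london_xy)  # (-0.118092, 51.509865)
```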

### Installation requirement

Now that we understand what the files are for, let's install and import the necessary packages for us to be able to work with shapefiles.

Within your conda environment tsenv, you will already have a few libraries installed from previous sessions:

• numpy
• pandas
• matplotlib
• scikit-learn (or sklearn)
• pulp

We will now install geopandas, which is pandas for geospatial data. To do this, we first need to install GDAL (Geospatial Data Abstraction Library), which is a translator library for raster and vector geospatial data formats.

Whether you are on Windows, Mac, or Linux, we will be using conda install .. instead of pip install ..:

conda install -c conda-forge gdal
conda install geopandas

If this does not work, particularly if you are using Windows, try the alternative method shown in a YouTube video here.

Installing GeoPandas also installs a package called Shapely. This helps us work with GIS data such as points, polygons, and multipolygons.
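One way to confirm the installation worked is to ask Python whether it can find the packages. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper, and the list of names is simply the packages this section relies on.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


needed = ["geopandas", "shapely", "matplotlib"]
missing = missing_packages(needed)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All packages found.")
```

Run this inside the tsenv environment; if anything is reported as not installed, repeat the conda commands above with that environment activated.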

### Working with GeoPandas

Now, let's start coding.

You can read in a shapefile by referring to the *.shp file.

The example shapefile contains Lower Layer Super Output Areas (LSOAs) in Greater London. LSOAs are part of a geographic hierarchy designed to improve the reporting of small area statistics in England and Wales.

In this example, our shapefile contains the LSOA zone codes (or IDs), the names of the LSOAs, the number of households in each area, and the average household size for each zone. Most importantly, the data also contains the geometry of the polygon that forms the outline of each LSOA zone.
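A minimal sketch of reading such a file is shown below. `read_shapefile` is a hypothetical wrapper written for this example, and the path in the usage comments is a placeholder rather than the course file's real name. It relies on `geopandas.read_file`, the `crs` attribute, and `to_crs`, and it reprojects to EPSG:4326 only when the file declares a different CRS.

```python
from pathlib import Path


def read_shapefile(shp_path):
    """Read a shapefile into a GeoDataFrame, reprojecting to EPSG:4326 if needed."""
    shp_path = Path(shp_path)
    if shp_path.suffix.lower() != ".shp":
        raise ValueError(f"expected a .shp file, got: {shp_path.name}")

    # Imported here so the path check above works even before GeoPandas is installed.
    import geopandas as gpd

    gdf = gpd.read_file(shp_path)
    # Reproject only when the file declares a CRS other than WGS 84.
    if gdf.crs is not None and gdf.crs.to_epsg() != 4326:
        gdf = gdf.to_crs(epsg=4326)
    return gdf


# Hypothetical usage; replace the path with your own file:
# ldn_df = read_shapefile("data/london_lsoa.shp")
# print(ldn_df.head())   # attribute columns plus a `geometry` column
# print(ldn_df.crs)      # coordinate reference system of the data
```

In a notebook you would more commonly put `import geopandas as gpd` at the top and call `gpd.read_file(...)` directly; the wrapper only adds the file-type and CRS checks.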

As mentioned, it is very important to note that the geometry in GeoPandas is in the order longitude, then latitude. If you get these mixed up, instead of analysing London, you may be analysing bits of the sea next to Somalia!

In previous sessions, you have already learned how to work with pandas dataframes. GeoPandas dataframes work in the same way. We will make a graph using our new ldn_df. Latitude defines your y and longitude defines your x.
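A quick way to catch swapped coordinates is to compare the extent of the data with a rough box around London. The sketch below is an illustration only: the longitude and latitude ranges are approximations chosen for this example, not an official boundary, and `bounds_look_like_london` is a hypothetical helper.

```python
# Rough bounding box around Greater London in EPSG:4326 (an approximation
# chosen for this sketch, not an official boundary).
LONDON_LON_RANGE = (-0.6, 0.4)
LONDON_LAT_RANGE = (51.2, 51.8)


def bounds_look_like_london(bounds):
    """Check (minx, miny, maxx, maxy) bounds, where x is longitude and y is latitude."""
    minx, miny, maxx, maxy = bounds
    lon_ok = LONDON_LON_RANGE[0] <= minx and maxx <= LONDON_LON_RANGE[1]
    lat_ok = LONDON_LAT_RANGE[0] <= miny and maxy <= LONDON_LAT_RANGE[1]
    return lon_ok and lat_ok


# Correct order: x = longitude, y = latitude.
print(bounds_look_like_london((-0.51, 51.29, 0.33, 51.69)))  # True
# Swapped order: latitude has been used as x.
print(bounds_look_like_london((51.29, -0.51, 51.69, 0.33)))  # False
```

With a GeoDataFrame, the `total_bounds` attribute returns the extent in this same (minx, miny, maxx, maxy) order, so it can be passed straight to the check.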

We will use matplotlib.pyplot that has been imported as plt to plot the average household size of different LSOAs in London.
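A hedged sketch of such a plot follows. `plot_choropleth` is a hypothetical helper, and the column name in the usage comments is a placeholder, since the real name depends on the shapefile (check `ldn_df.columns`). The function relies on the GeoDataFrame `plot` method with its `column`, `ax`, and `legend` arguments, and labels the axes to match the x = longitude, y = latitude convention.

```python
def plot_choropleth(gdf, column, ax):
    """Shade each zone in `gdf` by the values in `column`, drawing on `ax`."""
    gdf.plot(column=column, ax=ax, legend=True)
    ax.set_xlabel("Longitude")  # x axis
    ax.set_ylabel("Latitude")   # y axis
    ax.set_title(f"{column} by LSOA")
    return ax


# Hypothetical usage (the column name is a placeholder; check ldn_df.columns):
# import matplotlib.pyplot as plt
# fig, ax = plt.subplots(figsize=(10, 8))
# plot_choropleth(ldn_df, "avg_household_size", ax)
# plt.show()
```

Passing the axes in explicitly keeps the function easy to reuse when you want several maps side by side in one figure.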