This report was created for the final assignment for the Geographic Information Systems and Science module led by Dr. Andy MacLachlan at the Centre for Advanced Spatial Analysis as part of the MSc Urban Spatial Science module.
The original report starts below:
In this report, I will identify whether there are any patterns in accidents involving e-scooters in the Greater London Area. The questions that will be addressed in this report will be:
Are there any areas that have a higher rate of e-scooter accidents compared to the average in London, when taking into account the traffic?
I assume that the more e-scooters that run through the area, there will be more accidents that they are involved in. After confirming this, I will seek to find areas with higher density of e-scooter accidents in comparison with the traffic volume of e-scooters.
The datasets used are as follows:
This dataset is provided by the Department for Transport, and includes the spatial distribution and its characteristics for the United Kingdom. The characteristics include the road type, speed limit, number and type of vehicles involved, casualties, and other data involving the time and situation of the data.
The active travel counts data is collected at 2,297 points across the Greater London area every 15 minutes. This includes the data for e-scooters, allowing our analysis on the traffic possible. We will analyse the average number of e-scooters in an area in order to standardise the number of accidents.
This will be the basis of our analysis.
The data-process will be as follows.
For road safety data:
The active travel data has a separate file for the locations and the actual values. I will:
After the data cleaning is complete, the data is combined by borough to conduct the analysis.
The spatial boundaries that I will use are the London boroughs. Smaller statistical areas may be able to catch the characteristics of specific areas, but the traffic count data has only 2,297 points, which is not enough to have valid data for all 625 wards.
The spatial distribution will be calculated using the following models:
First of all, the required libraries need to be loaded.
library(tidyverse)
library(sf)
library(ggspatial)
library(spdep)
library(broom)
library(tmap)
library(here)
library(janitor)
Here, I will read in all of the necessary data. The road safety data, geometry data for London, and the active travel counts are loaded.
The active travel counts are separated into zones, so they are combined together at this stage to enable easier data wrangling in the future.
# read in casualties
casualty <- read_csv(
here::here("data", "road_safety", "dft-road-casualty-statistics-casualty-2022.csv")
)
# read in collisions
collision <- read_csv(
here::here("data", "road_safety", "dft-road-casualty-statistics-collision-2022.csv")
)
# read in e-scooter data
e_scooter <- read_csv(
here::here("data", "road_safety",
"dft-road-casualty-statistics-vehicle-e-scooter-2020-Latest-Published-Year.csv"))
# read in London LSOA shapefiles
# transform into EPSG:27700 again to make able to join,
# even though it is already in EPSG:27700
ward <- st_read(
here::here("data", "statistical-gis-boundaries-london",
"statistical-gis-boundaries-london","ESRI",
"London_Ward_CityMerged.shp")
) %>%
st_transform(., 27700)
## Reading layer `London_Ward_CityMerged' from data source
## `C:\Users\Soki\OneDrive - University College London\01_Courses\CASA0005_GIS\Forked\london_escooters\data\statistical-gis-boundaries-london\statistical-gis-boundaries-london\ESRI\London_Ward_CityMerged.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 625 features and 7 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 503568.2 ymin: 155850.8 xmax: 561957.5 ymax: 200933.9
## Projected CRS: OSGB36 / British National Grid
borough <- st_read(
here::here("data", "statistical-gis-boundaries-london",
"statistical-gis-boundaries-london","ESRI",
"London_Borough_Excluding_MHW.shp")
) %>%
st_transform(., 27700)
## Reading layer `London_Borough_Excluding_MHW' from data source
## `C:\Users\Soki\OneDrive - University College London\01_Courses\CASA0005_GIS\Forked\london_escooters\data\statistical-gis-boundaries-london\statistical-gis-boundaries-london\ESRI\London_Borough_Excluding_MHW.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 33 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 503568.2 ymin: 155850.8 xmax: 561957.5 ymax: 200933.9
## Projected CRS: OSGB36 / British National Grid
# read in active travel counts data
# clean the name of columns
count_points <- read_csv(here::here("data", "active_travel_counts", "0-Count locations.csv")) %>%
janitor::clean_names(.)
# read in the data for active travel counts
# central area
count_central <- read_csv(here::here("data", "active_travel_counts", "2022-Central.csv")) %>%
janitor::clean_names(.)
# inner area (separated into 2)
count_inner1 <- read_csv(here::here("data", "active_travel_counts", "2022-Inner-Part1.csv")) %>%
janitor::clean_names(.)
count_inner2 <- read_csv(here::here("data", "active_travel_counts", "2022-Inner-Part2.csv")) %>%
janitor::clean_names(.)
# outer area
count_outer <- read_csv(here::here("data", "active_travel_counts", "2022-Outer.csv")) %>%
janitor::clean_names(.)
# combine all data together
count_all <- bind_rows(count_central, count_inner1, count_inner2, count_outer)
Next, the csv data for collisions and active travel count data that has been read in will be transformed into spatial data. The data that has invalid location data is dropped, since it is difficult to consider the spatial characteristics in the following analysis.
# transform the collisions into spatial data
collision_sf <- collision %>%
drop_na(., c("location_easting_osgr", "location_northing_osgr")) %>%
st_as_sf(., coords = c("location_easting_osgr", "location_northing_osgr"), crs = 27700)
# transform the active travel counts into spatial data
count_points_sf <- count_points %>%
drop_na(., c("easting_uk_grid", "northing_uk_grid")) %>%
st_as_sf(., coords = c("easting_uk_grid", "northing_uk_grid"), crs = 27700)
The dataset for Road Safety includes the whole of London, therefore we will only use the spatial subset for the Greater London area. This will be combined with an inner join, to extract only the accident data involving e-scooters. The e-scooter accidents dataset includes data outside of London as well as data from previous years, so an inner join will extract the e-scooter accidents in London that happened in 2022.
# filter the collision data
# make spatial subset of dataset
collision_sf <- collision_sf[borough, ]
# join with e-scooter data, to get all collisions involving e-scooters
e_scooter_london <- collision_sf %>%
inner_join(., e_scooter, by = c("accident_index" = "accident_index"))
Next, I will join the data with the polygons to look at the number of accidents that occur in each statistical area. Accidents are to be standardised by area for a fair comparison between areas.
# join the number of accidents to borough
borough_accidents <- borough %>%
mutate(accidents = lengths(st_intersects(., e_scooter_london)))
# join the number of accidents to ward
ward_accidents <- ward %>%
mutate(accidents = lengths(st_intersects(., e_scooter_london)))
# standardise using the area
borough_accidents <- borough_accidents %>%
mutate(accidents_per_km2 = accidents / HECTARES * 100)
# ward_accidents <- ward_accidents %>%
# mutate(accidents_per_km2 = accidents / HECTARES * 100)
The spatial distribution of collisions is shown below.
ggplot() +
# draw wards
geom_sf(
data = borough_accidents,
color = NA,
aes(fill = accidents_per_km2)
) +
# e-scooter data
geom_sf(
data = e_scooter_london,
color = alpha("white", 0.5),
size = 0.5
) +
# add north arrow
annotation_north_arrow(
location = "tr",
style = north_arrow_fancy_orienteering(
text_col = NA
)
)
A spatial clustering can be observed in the centre of London, but the number of e-scooters are expected to be higher as well, so cannot immediately conclude there is significant danger in these areas.
For the active travel counts, the data will be filtered by e-scooters. They will then be summarised by the points to get the average traffic for each counting point. This will then be summarised by borough, to find the average traffic of e-scooters within the borough.
# filter by e-scooters, and then summarise by the counting point
count_summarise <- count_all %>%
filter(mode == "E-scooters") %>%
group_by(unq_id) %>%
summarise(counts = mean(count))
# join the data to the spatial data
count_points_sf <- count_points_sf %>%
left_join(., count_summarise, by = c("site_id" = "unq_id")) %>%
# drop na values for counts
drop_na(., "counts")
# summarise average traffic by ward
# dropping the geometry for ease of join later
count_by_ward <- count_points_sf %>%
st_join(., ward_accidents) %>%
st_drop_geometry(.) %>%
group_by(GSS_CODE) %>%
summarise(average_count = mean(counts))
# join back to ward
ward_accidents <- ward_accidents %>%
left_join(., count_by_ward, by = c("GSS_CODE" = "GSS_CODE"))
# summarise average traffic by borough
# dropping the geometry for ease of join later
count_by_borough <- count_points_sf %>%
st_join(., borough_accidents) %>%
st_drop_geometry(.) %>%
group_by(GSS_CODE) %>%
summarise(average_count = mean(counts))
# join back to borough
borough_accidents <- borough_accidents %>%
left_join(., count_by_borough, by = c("GSS_CODE" = "GSS_CODE"))
Now, the dataset has the number of accidents per square km that occurred in 2022 and the average number of e-scooters per 15 minutes.
The statistical value we seek to use is the accident-traffic ratio, and this is calculated here.
First, we will consider if there is a correlation between the number of e-scooters and the accidents that involve them.
# draw a scatter plot of average counts and accidents
ggplot(
data = borough_accidents,
aes(x = average_count, y = accidents_per_km2)
) +
# plot the points
geom_point() +
# draw the regression line
geom_smooth(method= "lm") +
# set theme
theme_classic()
The regression model is described as follows:
lm_ratio <- lm(borough_accidents$accidents_per_km2 ~ borough_accidents$average_count)
summary(lm_ratio)
##
## Call:
## lm(formula = borough_accidents$accidents_per_km2 ~ borough_accidents$average_count)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44696 -0.07332 -0.00563 0.03563 0.71136
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.11076 0.06406 -1.729 0.0938 .
## borough_accidents$average_count 7.03044 0.72315 9.722 6.26e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.202 on 31 degrees of freedom
## Multiple R-squared: 0.753, Adjusted R-squared: 0.7451
## F-statistic: 94.52 on 1 and 31 DF, p-value: 6.258e-11
We see a highly correlation between the traffic and the number of accidents that occur in an area. This has an adjusted R-squared value of 0.745, which indicates that 74.5% of the difference can be explained through this variable.
Now we will observe whether the spatial aspects play a role in the number of accidents that occur in the Greater London area.
The following is a map showing the accident-traffic ratio per borough.
# set mode to interactive
tmap_mode("view")
# add shape
tm_shape(borough_accident_ratio) +
# add fill
tm_fill("ratio", title = "Accidents-Traffic Ratio") +
# add title
tm_layout(title = "Ratio of Accidents per km2 to the E-scooter Traffic")
# ggplot() +
# geom_sf(
# data = borough_accident_ratio,
# aes(fill = ratio),
# color = NA
# ) +
#
# # set scale to remove outlier
# scale_fill_distiller(
# palette = "Oranges"
# )
The map shows there are several boroughs with a high ratio of accidents compared to the traffic volume of e-scooters. The London Boroughs of Lambeth and Islington show the highest ratio of accidents involving e-scooters when standardised with the traffic. The difference indicates there may be spatial factors that impact the number of accidents.
Now, I will identify whether the number of accidents has a spatial autocorrelation, or it has complete spatial randomness.
The Global Moran’s I will indicate whether there is a spatial autocorrelation. This ranges from -1 to 1, a figure closer to 1 showing spatial clustering, while a negative number shows values dispersed. The Local Moran’s I will indicate whether each value is significantly higher or lower compared to the neighbouring values.
For this analysis, we need to define neighbouring areas. The possible definitions are:
The result for the Queen’s case is shown below.
queens_nb <- borough_accident_ratio %>%
poly2nb(., queen = TRUE)
plot(queens_nb, st_geometry(borough_accident_ratio), col = "red")
plot(borough_accident_ratio$geometry, add = TRUE)
The results for K nearest neighbour model is shown below. We have
used k = 4
as our definition, as it is the smallest number
that has allowed all boroughs with shared edges are considered to be
neighbouring.
# define 4 nearest neighbours
knn_nb <- borough_accident_ratio %>%
st_centroid(.) %>%
st_geometry(.) %>%
knearneigh(., k = 4) %>%
knn2nb(.)
plot(knn_nb, st_geometry(borough_accident_ratio), col = "red")
plot(borough_accident_ratio$geometry, add = TRUE)
We must note that the in the Queen’s case, the Thames is separating the north and south of London, which does not reflect the actual network of roads connected by bridges.
Therefore, we will use the 4 nearest neighbours as our definition of neighbouring. The row standardised spatial weight matrix is calculated as below.
The Global Moran’s I is calculated as follows.
global_i <- borough_accident_ratio %>%
st_drop_geometry() %>%
dplyr::select(ratio) %>%
pull() %>%
moran.test(., knn_lw)
global_i
##
## Moran I test under randomisation
##
## data: .
## weights: knn_lw
##
## Moran I statistic standard deviate = 3.2156, p-value = 0.0006509
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.29630926 -0.03125000 0.01037684
The Global Moran’s I value has a positive value, indicating that there is a clustering of similar values in the distribution of accidents. The null hypothesis of complete spatial randomness can be rejected, and we can now assume that there is a positive spatial autocorrelation for this value.
The local Moran’s I is calculated to check where the clustering and dispersed values are distributed. The Local Moran’s I is calculated below.
# calculate moran's I
local_i <- borough_accident_ratio %>%
st_drop_geometry() %>%
dplyr::select(ratio) %>%
pull() %>%
localmoran(., knn_lw) %>%
as_tibble(.)
# join back to the dataset
borough_moran <- borough_accident_ratio %>%
mutate(i_local = local_i$Ii) %>%
mutate(i_zscore = local_i$Z.Ii)
Once the Moran’s I is calculated for all boroughs, the data is merged back to the original dataset. Below, we have plotted the local Moran’s I value for the accident-traffic ratio.
# create breaks
breaks <- c(-1000, -2.58, -1.96, -1.65, 1.65, 1.96, 2.58, 1000)
# run ggplot
ggplot() +
geom_sf(
data = borough_moran,
aes(fill = as.numeric(i_zscore)),
color = "white"
) +
# create scale
scale_fill_fermenter(
type = "div",
palette = "BrBG",
breaks = breaks,
labels = c("", -2.58, -1.96, -1.65, 1.65, 1.96, 2.58, ""),
name = "Z Score",
limits = c(-1000, 1000)
) +
# set labels
labs(
title = "Z-score of Local Moran's I for the accident-traffic ratio"
) +
# add north arrow
annotation_north_arrow(
location = "tr",
style = north_arrow_fancy_orienteering(
text_col = NA
)
) +
# set theme
theme_minimal() +
theme(
axis.text = element_blank(),
axis.ticks = element_blank()
)
This shows that the boroughs near the centre of London have higher rate of accidents compared to its neighbours. Interestingly, the western London areas of Hounslow and Wandworth also have significantly higher values compared to its surroundings. On the other hand, we can see the borough of Harrow being safer than its surroundings.
We have analysed the number of accidents that involved e-scooters by borough, standardised by the area and the average number of e-scooter traffic. We have found some clustering of relatively unsafe areas in the centre of London, and some in the western London area.
The results indicate that some areas have a higher probability of accidents than others. This has successfully located areas that need further investigation, in order to make the London roads safer.
Further analysis can be done in order to search other factors that can influence the number of accidents.
The limitations of this report is that the dataset is summarised in a fairly large area of a borough. This analysis has dropped the following information that might have been useful:
The wards or smaller administrative boundaries should be considered for a detailed analysis, in order to further identify the areas that have higher values of accidents. For this, we must investigate further into the active travel counts dataset, and find an appropriate way to interpolate values for areas without valid numbers.