How effective is self-testing for measuring the extent of infection in a population? Over the course of the pandemic, the UK Government relied heavily (although not entirely) on individuals to test themselves and report the result. Data on new positive tests was documented and used in public messaging. The Office for National Statistics (ONS), meanwhile, sent a standard number of PCR tests to a random sample of the population every week and used this data to build a predictive model estimating the total prevalence of Covid-19 in the population. Whilst it is never wise to assume a prediction is 100% accurate, the ONS estimates were (in my opinion) the closest approximation there is likely to be of the actual Covid prevalence in the population (i.e. including asymptomatic cases or cases where people did not test themselves). With this in mind, by comparing the government’s positive test data with the ONS population estimates, I hoped to visualise how effective self-testing and self-reporting were at indicating the extent of the pandemic.
There may also be time-dependent variation in the reliability of the government data. As mentioned, the ONS sent a consistent test to a consistent sample and did not vary this method. The Government’s positive test data, on the other hand, was vulnerable to changing biases over time. By relying on self-report, the data may have been biased by whether reporting a positive test was beneficial for the individual. For example, during furlough there was little reason not to report a positive test; however, once a positive test meant people might have to isolate from work and subsequently lose money, there was an incentive not to report. Which tests were accepted as a positive result (i.e. whether lateral flow results counted) also varied. In summary, the ONS applied a consistent method to a consistent sample, whereas the government relied on self-report over periods with varying incentives to report and with varying tests being used. Therefore, I wanted to visualise any differences in variance between the two metrics - whilst differences in scale were a given, did both metrics report peaks and troughs in the same periods, and of similar relative sizes?
I was also interested to see how the two metrics behaved over the different stages of lockdown. For this reason, only the data from England was analysed, as the different nations of the UK had differing responses to the pandemic. Had I used Covid data from the entire UK, I would have had to factor in separate lockdown measures from all four nations, which I felt would undermine the validity of the comparison; I therefore decided to use the data from the largest population (England). Data from the Institute for Government (IfG) was used to plot the lockdown stages, as they had compiled a timeline of the Government’s response to the pandemic. This was used alongside data from the government’s own announcements archive, in order to confirm the IfG timeline and fill in missing data.
In summary, the aims of this project were to compare the scale and variance of the two Covid metrics and to see how each behaved over the stages of lockdown. The data sources used were:
- Government positive test data
- ONS modelled estimates
- Covid response timeline
- Covid response announcements
library(dplyr)      # Data manipulation
library(here)       # Project-relative file paths
library(tidyverse)  # Core tidy data tools (includes dplyr and ggplot2)
library(ggplot2)    # Plotting
library(hablar)     # convert() for type conversion
library(npreg)      # Smoothing spline support
library(ggstance)   # Horizontal ggplot geoms
library(ggformula)  # geom_spline()
library(gridExtra)  # Arranging multiple plots
# Loading data
here()
onsfull <- read.csv(here::here("Data", "ons_c19_estimates_england.csv"),
skip = 5,
blank.lines.skip = TRUE)
# Skip top 5 lines of ONS due to irrelevant headings/notes
gvtfull <- read.csv(here::here("Data", "gvtengland.csv"))
Sample of the initial ONS data set. The items of interest to this project were the “Fortnightly.weighted.estimates” column (the time period) and the “X.3” column, which held the overall population prevalence estimate for the corresponding period.
Fortnightly.weighted.estimates | X | X.1 | X.2 | X.3 | X.4 | X.5 | X.6 | X.7 | X.8 | X.9 | X.10 | X.11 | X.12 | X.13 | X.14 | X.15 | X.16 | X.17 | X.18 | X.19 | X.20 | X.21 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
27 April 2020 to 10 May 2020 | 0.27 | 0.17 | 0.41 | 148,000 | 94,000 | 222,000 | 1 in 370 | 1 in 590 | 1 in 240 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
04 May 2020 to 17 May 2020 | 0.25 | 0.16 | 0.38 | 137,000 | 85,000 | 208,000 | 1 in 400 | 1 in 630 | 1 in 260 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
11 May 2020 to 24 May 2020 | 0.24 | 0.11 | 0.46 | 133,000 | 62,000 | 250,000 | 1 in 420 | 1 in 910 | 1 in 220 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
17 May 2020 to 30 May 2020 | 0.10 | 0.05 | 0.18 | 53,000 | 25,000 | 99,000 | 1 in 1,000 | 1 in 2,000 | 1 in 560 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
25 May 2020 to 07 June 2020 | 0.06 | 0.02 | 0.12 | 33,000 | 14,000 | 68,000 | 1 in 1,700 | 1 in 5,000 | 1 in 830 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
31 May 2020 to 13 June 2020 | 0.06 | 0.02 | 0.13 | 33,000 | 12,000 | 74,000 | 1 in 1,700 | 1 in 5,000 | 1 in 770 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Sample of the initial Government data set. This set was formatted more conveniently, with “date” and “newCasesByPublishDate” being the columns of interest.
areaType | areaName | areaCode | date | newCasesByPublishDate | cumCasesByPublishDate |
---|---|---|---|---|---|
nation | England | E92000001 | 29/03/2022 | 64935 | 17667861 |
nation | England | E92000001 | 28/03/2022 | 183459 | 17603000 |
nation | England | E92000001 | 27/03/2022 | 0 | 17419944 |
nation | England | E92000001 | 26/03/2022 | 0 | 17419944 |
nation | England | E92000001 | 25/03/2022 | 63371 | 17419944 |
nation | England | E92000001 | 24/03/2022 | 79373 | 17356644 |
# Select relevant data
onsfull <- onsfull[-c(10, 101:108), ]
# Removing mid-data note rows and the irrelevant tail from the data set
onsselect <- onsfull %>% select(1, 5)
# Selecting column 1 (date info) and column 5 (estimate info)
It was necessary to remove several rows from this data set, as the authors had included notes to inform the reader of methodological changes in how the population estimates were made. In terms of scalability, I would have preferred to use a loop to search the data automatically; however, the format in which the dates were entered, combined with the non-uniform data (due to the included notes), made this quite complicated. I ultimately decided to remove these rows manually, as this seemed the more time-efficient option.
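For reference, a programmatic alternative might look something like the sketch below. It relies on the assumption that every genuine data row begins with a day number (e.g. “27 April 2020 to ...”) while the inserted note rows do not, so it is only a sketch rather than what was actually used.
# A sketch of a programmatic alternative (assumes note rows do not start with a digit)
onsfull <- onsfull %>%
  filter(grepl("^\\d{1,2} ", Fortnightly.weighted.estimates))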
# Separating the ONS date range into two columns; splitting on " to " leaves strings that can be parsed as datetime data
onsdates <- data.frame(str_split_fixed(onsselect$Fortnightly.weighted.estimates, " to ", n = 2))
# Recognise data as datetime data
onsdate1 <- as.Date(onsdates$X1, format = "%d %B %Y")
onsdate2 <- as.Date(onsdates$X2, format = "%d %B %Y")
onsdf <- data.frame(onsdate1, onsdate2, onsselect$X.3,
stringsAsFactors = TRUE)
# Calculating the midpoint date for each row of the data frame
# (no format argument needed here, as both columns are already Date objects)
onsdf$date <- onsdf$onsdate1 + (onsdf$onsdate2 - onsdf$onsdate1) / 2
Here I split the ONS date column into two columns, as this put the dates in a format where I could properly parse them as ‘datetime’ data. I then calculated the midpoint of the two dates to get a single date to which I could assign the population estimate, and subsequently plot.
# Remove NA values
onsdf <- na.omit(onsdf)
# Remove commas from X.3, as they prevent the values being treated as numeric and caused issues when the variable was passed to ggplot
onsdf[,'covid'] <- gsub(",","", onsdf[,'onsselect.X.3'])
#Cleaning environment
rm(onsdates, onsdate1, onsdate2, onsfull, onsselect)
# Convert the covid column to numeric (hablar::convert)
onsdf <- onsdf %>% convert(num(covid))
#Creating final data set from average date and estimate values
onsdf <- onsdf %>% select(4, 5)
Sample of the final ONS data set:
date | covid |
---|---|
2020-05-03 | 148000 |
2020-05-10 | 137000 |
2020-05-17 | 133000 |
2020-05-23 | 53000 |
2020-05-31 | 33000 |
2020-06-06 | 33000 |
# Selecting relevant data
# Selecting values for England from the whole UK data set
gvtfull <- gvtfull %>% filter(grepl('England', areaName))
# Selecting relevant columns
gvtselect <- gvtfull %>% select(4, 5)
# Recognise the dates as datetime data
date <- as.Date(gvtselect$date, format = "%d/%m/%Y")
covid <- gvtselect$newCasesByPublishDate
gvtdf <- data.frame(covid, date)
# Set 0 vals as NA
gvtdf$covid[gvtdf$covid==0] <- NA
# Remove NA vals
gvtdf <- na.omit(gvtdf)
rm(gvtfull, gvtselect, covid, date)
This set was much easier to prepare. The only real processing required was removing cases where 0 new cases were logged in a day. This was not because no new cases actually occurred, but because cases were not logged over the weekend. No data was lost by doing this, as cases occurring over the weekend were entered at the start of the subsequent week. This led to systematic spikes in the data, which I will cover in more detail later on.
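As a quick sanity check on this weekend effect, mean reported cases can be compared by day of the week. This is only a sketch using the processed gvtdf (note that weekdays() returns day names in the system locale):
# Compare mean reported cases by weekday - Mondays should be inflated,
# as they also contain the weekend's unlogged cases
gvtdf %>%
  mutate(weekday = weekdays(date)) %>%
  group_by(weekday) %>%
  summarise(mean_cases = mean(covid))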
Sample of the final Government data set:
 | covid | date |
---|---|---|
1 | 64935 | 2022-03-29 |
2 | 183459 | 2022-03-28 |
5 | 63371 | 2022-03-25 |
6 | 79373 | 2022-03-24 |
7 | 85916 | 2022-03-23 |
8 | 74804 | 2022-03-22 |
As this data was qualitative (i.e. descriptions of lockdown stringency rather than a numerical rating), there was no significant processing required. Lockdown stringency was visualised as vertical red bars on the plots, with higher stringency shown by higher opacity of the red colouration. The scale ran from 0.0 to 0.5 (these being the opacity values for the ggplot objects used). I created a data frame of these values for incorporation into the final plot later:
# Adding the start dates of all the stringency bars
date1 <- as.Date(c("2020-05-03","2020-06-01","2020-06-15","2020-09-14","2020-11-05","2020-12-02",
"2021-01-06","2021-03-29","2021-04-12","2021-05-17","2021-07-19","2021-12-08"))
# Adding the end dates of all the stringency bars
date2 <- as.Date(c("2020-06-01","2020-06-15","2020-09-14","2020-11-05","2020-12-02","2021-01-06",
"2021-03-29","2021-04-12","2021-05-17","2021-07-19","2021-12-08","2022-02-24"))
# Adding the opacity (alpha) values used to represent stringency
alpha <- c(0.5,0.4,0.3,0.3,0.5,0.4,0.5,0.4,0.3,0.2,0.1,0.2)
#Creating dataframe
strindf <- data.frame(date1,date2,alpha)
rm(date1, date2, alpha)
The criteria for rating were as follows:
- 0.5 (highest): A full lockdown, where no one except essential workers was permitted to leave the house for work.
- 0.4: Lockdown remains, but with conditional allowances - for example, children being allowed to go back to school. The majority of the population is still under heavy restriction, however.
- 0.3: Partial lifting; lockdown measures remain widely in place, but allowances now extend to most or all of the population. For example, non-essential businesses are still closed, but people can meet in groups of 6 outside.
- 0.2: Non-essential businesses/practices are allowed to open, but restrictions remain, such as having to eat outside or operating at reduced capacity.
- 0.1: Restrictions remain but disruption to daily life is minimal - for example, having to wear a mask to a restaurant but being able to take it off and eat inside once seated.
- 0.0: Restrictions fully removed.
Variable name | Description |
---|---|
covid | The estimate or positive test counts for each metric of Covid prevalence. Named identically in both data frames in order to build the graph |
date | The datetime data for each data set |
Stringency | The severity level of lockdown measures for a given date |
barfunction() | Function to convert strindf data into geom_rect() objects |
strindf | Data frame containing lockdown stringency information |
onsdf | Processed data frame of ONS data |
gvtdf | Processed data frame of government data |
When building trial plots in preparation for the actual visualisation, it quickly became clear that it would be necessary to smooth the data in some way. As mentioned previously, the Government data set had large, systematic spikes because no new cases were processed over the weekend, meaning every Monday had approximately triple the usual number of new cases. Over the wide time period included in this visualisation, these large spikes made the data difficult to interpret.
As can be seen, the data becomes quite hard to interpret accurately. The large spikes naturally draw the eye to the top of each peak and make the Covid levels look much higher than they actually are. The challenge here was that none of the usual methods of smoothing seemed appropriate, usually because they relied on some form of linear formula. Fortunately, this could be dealt with using a spline - a piecewise polynomial regression that splits the data at a series of knots and fits a smooth curve between them. The more knots specified, the more closely the spline fits the original data.
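The trial smoothing looked something like the sketch below (a sketch only; nknots = 90 is the value settled on for the final plots, chosen by eye to balance smoothness against detail):
# Overlay a smoothing spline on the raw government data to check the fit
ggplot(gvtdf, aes(x = date, y = covid)) +
  geom_line(alpha = 0.3) +                                     # raw data with Monday spikes
  geom_spline(nknots = 90, size = 1.3, colour = "darkcyan")    # smoothed spline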
Much better! With this done, I could get to work on the first visualisation.
The first visualisation I wanted to try was a dual Y axis graph. This was primarily to see in detail how the two metrics varied across the different stages of lockdown. By plotting them together I aimed to provide the simplest comparison.
#Creating function to add stringency bars
barfunction <- function(date1, date2, alpha){
a <- annotate(geom = "rect",
xmin = as.Date(date1), xmax = as.Date(date2), ymin = 0, ymax = Inf,
alpha = alpha, fill = "red")
return(a)
}
##Building graph
# Coefficient linking the two y axes (the gvt data is divided by this for plotting)
coeff <- 0.05
#Adding initial data
compggp <- ggplot(NULL, aes(x = date, y = covid)) +
#Splined ons data
geom_spline(data = onsdf,
aes(x = date, y = covid, colour = "ONS Modeled Estimates"), nknots = 90, size = 1.3) +
#Splined government data
geom_spline(data = gvtdf,
aes(x = date, y = covid/coeff, colour = "Gvt Reported Positive Tests"), nknots = 90, size = 1.3)
#Adding stringency bars by applying barfunction() to the strindf data frame
compggp <- compggp + purrr::pmap(strindf, barfunction)
#Adding aesthetics
compggp <- compggp + scale_x_date(limits = as.Date(c("2020-05-03", NA))) +
  theme_minimal() +
  # Single y scale; the secondary axis maps plotted values back to the test-count scale
  scale_y_continuous(labels = scales::comma, name = "Modeled Population Estimates",
                     sec.axis = sec_axis(~ . * coeff, name = "Reported Positive Tests")) +
  labs(title = "Estimated vs Reported Covid Cases over Lockdown",
       subtitle = "Data sourced from ONS, UK Government and the Institute for Government",
       x = "Date (year - month)") +
  scale_colour_manual(name = "",
                      # Colours must be specified here as the aesthetics are added outside the original aes() call
                      values = c("ONS Modeled Estimates" = "khaki4",
                                 "Gvt Reported Positive Tests" = "darkcyan",
                                 "Lockdown Stringency" = "red"))
compggp
#Save graph
ggsave("comp_y_6422.pdf", path = here::here("Figures"))
It is important to note that this graph is NOT intended to provide an accurate comparison of the extent of Covid infection in the population, as the dual Y axis is very misleading in this regard. However, given the scale of the numerical difference between the two metrics, I thought a dual-axis view would still be useful for comparing how they varied.
As we can see from the graph, at first glance there do seem to be some similarities between the metrics, with similar gradients across the timeline. Whilst, as mentioned, the dual axis is misleading in terms of overall scale, the metrics seem to covary reasonably consistently across the various stages of the pandemic. I found this slightly surprising, given the potential differences in the validity of the data sets (especially how prone the government set may have been to non-reporting).
This graph type was handy for visualising the data for both metrics whilst retaining a high level of detail for both.
The second plot was a composite line graph. The benefit of this style of visualisation is that it gives a far more immediately representative visual comparison of the differences in scale between the two metrics.
##Comp line graph
#Adding initial data
ggp <- ggplot(NULL, aes(x = date, y = covid)) +
# Splined ons data
geom_spline(data = onsdf,
aes(x = date, y = covid, colour = "ONS Modeled Estimates"), nknots = 90, size = 1.3) +
# Splined government data
geom_spline(data = gvtdf,
aes(x = date, y = covid, colour = "Gvt Reported Positive Tests"), nknots = 90, size = 1.3)
#Adding lockdown stringency bars by applying barfunction() to strindf data frame
ggp <- ggp + purrr::pmap(strindf, barfunction)
#Adding aesthetics
ggp <- ggp + labs(title = "Estimated vs Reported Covid Cases over lockdown",
subtitle = "Data sourced from ONS, UK Government and Institute for Government",
x = "Date (year - month)", y = "Covid Cases") +
scale_y_continuous(labels = scales::comma) +
scale_x_date(limits = as.Date(c("2020-05-03", NA ))) +
scale_colour_manual(name = "Legend",
values = c("ONS Modeled Estimates"="khaki4",
"Gvt Reported Positive Tests" = "darkcyan",
"Lockdown Stringency"="red"))
ggp
#Save graph
ggsave("comp_line_6422.pdf", path = here::here("Figures"))
The composite graph really shows the main difference between the two metrics. Whilst it can be seen from visualisation 1 that the reported Covid tests also vary over the course of the lockdown, the sheer scale of the difference between the reported tests and the estimated total number of cases is quite striking (this does, of course, assume the modelled total is 100% accurate, which is unlikely).
When taking both graphs into account, I find it interesting to see the effect the lockdowns had, with the most stringent measures in particular showing a noticeable correlation with a falling gradient in estimated Covid prevalence.
Also of interest was the possible association between the end of December/beginning of the new year and significantly rising levels of infection.
Whilst both of these visualisations had useful aspects, they both had drawbacks as well. I wasn’t happy with the dual Y axis because it did not provide a valid visual comparison of the difference in scale. The composite line graph with a single axis, on the other hand, provided a very satisfying comparison of the difference in scale but rendered the government data set essentially illegible. So, how to combine the detail of the dual-axis graph with the scale of the single-axis graph?
###Shiny and chrome
# Generating ui
library(shiny)
library(ggplot2)
ui <- fluidPage(
h1("Positive tests vs Estimates: How do the metrics compare in scale and variance"),
h4("Graph showing ONS Covid 19 total population prevalence estimates vs number of recorded positive tests over stages of lockdown"),
sidebarLayout(
sidebarPanel(
width = 2,
sliderInput("scalegvt","Scale Positive Test Results by:", min = 1.0, max = 26, value = c(1.0)),
h5("Use the slider above to scale the data for number of reported positive Covid tests.
If you leave it at 1, you can see how the unmodified data sets compare in terms of estimates.
If you scale it up, you can see how the data sets covary over the stages of lockdown."),
br(),
h5("Sources for data:"),
h5(a("ONS Estimate Data", href = "https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/conditionsanddiseases/bulletins/coronaviruscovid19infectionsurveypilot/18march2022",
target = "_blank")),
h5(a("Government test data", href = "https://coronavirus.data.gov.uk/details/cases",
target = "_blank")),
h5(a("Covid response timeline", href = "https://www.instituteforgovernment.org.uk/charts/uk-government-coronavirus-lockdowns",
target = "_blank"))
),
mainPanel(
width = 10,
plotOutput("distPlot")
)
)
)
## Creating server
server <- function(input, output) {
coeff1 <- reactive({input$scalegvt
})
output$distPlot <- renderPlot({
compggp <- ggplot(NULL, aes(x = date, y = covid)) +
geom_spline(data = onsdf,
aes(x = date, y = covid, colour = "ONS Modelled Estimates"), nknots = 90, size = 1.3) +
geom_spline(data = gvtdf,
aes(x = date, y = covid*coeff1(), colour = "Gvt Reported Positive Tests"), nknots = 90,
size = 1.3)
#Stringency bars
compggp <- compggp + purrr::pmap(strindf, barfunction)
#Adding aesthetics
compggp <- compggp + scale_x_date(limits = as.Date(c("2020-05-03", NA ))) +
theme_minimal() +
theme(text = element_text(size = 20)) +
scale_y_continuous(labels = scales::comma) +
labs(x = "Date (year - month)", y = "Covid Cases (estimated and reported)") +
scale_colour_manual(name = "",
values = c("ONS Estimates"="khaki4",
"Gvt Reported Positive Tests" = "darkcyan",
"Lockdown Stringency" = "red"))
compggp
}, height = 600, width = 1200)
}
# Running application
shinyApp(ui = ui, server = server)
As you can’t run a Shiny app in a static markdown file, the final app can be viewed by clicking the link below:
The aim of this interactive graph was to allow the user to scale the data and explore its various facets. For example, if the viewer is interested in the numerical differences between the two data sets, they simply need to leave the scaling factor at 1, which does not change the data in any way. However, if they want to explore how the two metrics varied over the stages of lockdown, they can use the slider to adjust the Gvt data so that the two lines are on a comparable scale. This allows the viewer to explore all the intended aspects of the data whilst maintaining a single scale.
At one point I did have two sliders, allowing the viewer to scale both of the lines independently, however I ultimately felt that this provided no additional utility and served to make the interface more complicated.
So, what does the final plot tell us?
What would I improve about my visualisation, or where would I like to take it further? One option is adding more metrics of Covid prevalence, perhaps finding data sets on trends in Covid-related work absences or something similar. When beginning this project I wanted to work with very large data sets, so I bee-lined for national data. Having had the chance to work on these sets, I would find it very interesting to compare a larger number of metrics and see how each behaved over the stages of lockdown. I think this could be incorporated quite easily into the final visualisation; adding an extra geom_spline() object and another slider input to the Shiny graph would allow the new data to be scaled as appropriate to whatever the viewer is interested in comparing. The only minor consideration would be how many data points the new set contained: if it were significantly fewer than the sets used currently, the nknots value for the spline would have to be adjusted. If a significant number of new metrics were being added, it would also likely be worth making a custom geom_spline() helper function in a similar way to barfunction(), as sketched below.
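A minimal sketch of such a helper (hypothetical; it assumes each new metric arrives as a data frame with date and covid columns, and that nknots may need tuning per data set):
# Hypothetical helper: add one splined series per metric data frame
splinefunction <- function(df, label, nknots = 90) {
  geom_spline(data = df,
              aes(x = date, y = covid, colour = label),
              nknots = nknots, size = 1.3)
}
# Example usage (newdf is a placeholder for an additional metric):
# ggp + splinefunction(newdf, "Covid-related Work Absences")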
A further improvement I would have liked to make to this graph is, instead of applying a scaling factor to the line itself, to have a secondary Y axis whose scale is linked to a slider input. The default scale would be the same as the first axis, but the viewer could ‘zoom’ the second axis in to achieve the same effect as scaling the data, while having a detailed axis to refer to.
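Within the Shiny server this could plausibly be done by reusing the reactive coeff1() inside a sec_axis() transform, so the secondary axis always maps the scaled line back to raw test counts (a sketch only, not part of the current app):
# Inside renderPlot(): plot the gvt data as covid * coeff1(), then let the
# secondary axis undo that scaling so it reads in raw positive-test counts
scale_y_continuous(labels = scales::comma,
                   name = "Modelled Population Estimates",
                   sec.axis = sec_axis(~ . / coeff1(),
                                       name = "Reported Positive Tests",
                                       labels = scales::comma))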
A great resource for learning your way around Shiny.
Random strangers on Stackoverflow.com who told me: