Efficacy of Self-Testing - comparing two Covid-19 prevalence metrics across stages of the pandemic in England

 

Research Question

 

How effective is self-testing for measuring the extent of infection in a population? Over the course of the pandemic, the UK Government relied heavily (although not entirely) on individuals to test themselves and report the result. Data on new positive tests was documented and used in public messaging. The Office for National Statistics (ONS), meanwhile, sent a fixed number of PCR tests to a random sample of the population every week and used this data to build a predictive model estimating the total prevalence of Covid-19 in the population. Whilst it is never wise to assume a prediction is 100% accurate, the ONS estimates were (in my opinion) the closest approximation there is likely to be of the actual Covid prevalence in the population (i.e. including asymptomatic cases or cases where people did not test themselves). With this in mind, by comparing the government's positive test data with the ONS population estimates, I hoped to visualise how effective self-testing and self-reporting were at indicating the extent of the pandemic.

There may also be time-contingent variance in the reliability of the government data. As mentioned, the ONS sent a consistent test to a consistent sample and did not vary this method. The Government's positive test data, on the other hand, was vulnerable to changing biases over time. By relying on self-reporting, the data may have been biased by whether reporting a positive test was beneficial for the individual. For example, during furlough there was little reason not to report a positive test; however, once a positive test meant people might have to isolate from work and subsequently lose money, there was an incentive not to report. Which tests were accepted as a positive result (i.e. whether lateral flow results were accepted or not) also varied. In summary, the ONS applied a consistent method to a consistent sample, whereas the government relied on self-reporting over periods with varying incentives to report and varying tests in use. I therefore wanted to visualise any differences in variance between the two metrics - whilst differences in scale were a given, did both metrics report peaks and troughs in the same periods, and of similar relative sizes?

I was also interested to see how the two metrics behaved over the different stages of lockdown. For this reason, only data from England was analysed, as the nations of the UK had differing responses to the pandemic: had I used Covid data for the entire UK, factoring in the impact of lockdown measures would have meant reconciling separate measures from all four nations. I felt this would undermine the validity of the comparison and therefore decided to use the data from the largest population (England). Data from the Institute for Government (IfG) was used to plot the lockdown stages, as they had compiled a timeline of the Government's response to the pandemic. This was used alongside data from the government's own announcements archive, in order to confirm the IfG timeline and fill in missing data.

In conclusion, the aims of this project were:

  1. To visualise the estimated proportion of people in England who had Covid (asymptomatic or not) compared to the proportion of those people actually testing positive (and reporting it!)
  2. To visualise how closely the government data covaried with the ONS data (i.e. did reduced levels of control affect the government data?)
  3. To visualise how both of these metrics reacted to lockdown stringency (how effective was it?)

 

Data processing

 

library(dplyr)
library(here)
library(tidyverse)
library(ggplot2)
library(hablar)
library(npreg)
library(ggstance)
library(ggformula)
library(gridExtra)
# Loading data
here() # confirm the project root

# Skip the top 5 lines of the ONS file due to irrelevant headings/notes
onsfull <- read.csv(here::here("Data", "ons_c19_estimates_england.csv"),
                    skip = 5,
                    blank.lines.skip = TRUE)

gvtfull <- read.csv(here::here("Data", "gvtengland.csv"))

 

Sample of the initial ONS data set. The items of interest for this project were the “Fortnightly.weighted.estimates” column (the date range) and the “X.3” column, which held the overall population prevalence estimate for the corresponding time period.

 

Fortnightly.weighted.estimates X X.1 X.2 X.3 X.4 X.5 X.6 X.7 X.8 X.9 X.10 X.11 X.12 X.13 X.14 X.15 X.16 X.17 X.18 X.19 X.20 X.21
27 April 2020 to 10 May 2020 0.27 0.17 0.41 148,000 94,000 222,000 1 in 370 1 in 590 1 in 240 NA NA NA NA NA NA NA NA NA NA NA NA NA
04 May 2020 to 17 May 2020 0.25 0.16 0.38 137,000 85,000 208,000 1 in 400 1 in 630 1 in 260 NA NA NA NA NA NA NA NA NA NA NA NA NA
11 May 2020 to 24 May 2020 0.24 0.11 0.46 133,000 62,000 250,000 1 in 420 1 in 910 1 in 220 NA NA NA NA NA NA NA NA NA NA NA NA NA
17 May 2020 to 30 May 2020 0.10 0.05 0.18 53,000 25,000 99,000 1 in 1,000 1 in 2,000 1 in 560 NA NA NA NA NA NA NA NA NA NA NA NA NA
25 May 2020 to 07 June 2020 0.06 0.02 0.12 33,000 14,000 68,000 1 in 1,700 1 in 5,000 1 in 830 NA NA NA NA NA NA NA NA NA NA NA NA NA
31 May 2020 to 13 June 2020 0.06 0.02 0.13 33,000 12,000 74,000 1 in 1,700 1 in 5,000 1 in 770 NA NA NA NA NA NA NA NA NA NA NA NA NA

 

The initial Government data set. This set was formatted more conveniently, with “date” and “newCasesByPublishDate” being the items of interest.

 

areaType areaName areaCode date newCasesByPublishDate cumCasesByPublishDate
nation England E92000001 29/03/2022 64935 17667861
nation England E92000001 28/03/2022 183459 17603000
nation England E92000001 27/03/2022 0 17419944
nation England E92000001 26/03/2022 0 17419944
nation England E92000001 25/03/2022 63371 17419944
nation England E92000001 24/03/2022 79373 17356644

 

Preparing ONS data

 

# Selecting relevant data
onsfull <- onsfull[-c(10, 101:108), ]
# Removing mid-data notes and the irrelevant tail of the data set

onsselect <- onsfull %>% select(1, 5)
# Selecting column 1 (date info) and column 5 (estimate info)

It was necessary to remove several rows from this data set, as the authors had included notes to inform the reader of methodological changes in how the population estimates were made. In terms of scalability, I would have preferred to search the data programmatically; however, the format in which the dates were entered, combined with the non-uniform rows (due to the included notes), made this quite complicated. I ultimately decided to remove the rows manually, as that seemed the more time-efficient option.
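For reference, a pattern-based filter could achieve the same result without hard-coded row numbers. A minimal sketch, assuming every genuine data row (and no note row) begins with a day-month-year date:

# Keep only rows whose first column starts with a date such as "27 April 2020"
onsfull <- onsfull %>%
  filter(grepl("^\\d{1,2} \\w+ \\d{4} to ", Fortnightly.weighted.estimates))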

 

# Splitting the ONS date column into two on the " to " separator,
# so that each half can be parsed as a date
onsdates <- data.frame(str_split_fixed(onsselect$Fortnightly.weighted.estimates, " to ", n = 2))

# Parse both halves as Date data
onsdate1 <- as.Date(onsdates$X1, format = "%d %B %Y")
onsdate2 <- as.Date(onsdates$X2, format = "%d %B %Y")

onsdf <- data.frame(onsdate1, onsdate2, onsselect$X.3,
                    stringsAsFactors = TRUE)

# Calculating the midpoint date of each fortnight
onsdf$date <- onsdf$onsdate1 + ((onsdf$onsdate2 - onsdf$onsdate1) / 2)

Here I split the ONS date column into two, which rendered the data into a format where each half could be properly parsed as a date. I then calculated the midpoint of the two dates to get a single date to which I could assign the population estimate, and subsequently plot.
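As a quick worked example, using the first fortnight from the sample above:

# Midpoint of "27 April 2020 to 10 May 2020"
d1 <- as.Date("2020-04-27")
d2 <- as.Date("2020-05-10")
d1 + (d2 - d1) / 2 # "2020-05-03", matching the first row of the final data set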

 

# Remove NA values
onsdf <- na.omit(onsdf)

# Remove commas from X.3, as these caused issues when the variable was passed through ggplot
onsdf[,'covid'] <- gsub(",", "", onsdf[,'onsselect.X.3'])

# Cleaning environment
rm(onsdates, onsdate1, onsdate2, onsfull, onsselect)

# Convert the 'covid' column to numeric (hablar)
onsdf <- onsdf %>% convert(num(covid))

# Creating the final data set from the midpoint date and estimate values
onsdf <- onsdf %>% select(4, 5)

 

Sample of the final ONS data set:

 

date covid
2020-05-03 148000
2020-05-10 137000
2020-05-17 133000
2020-05-23 53000
2020-05-31 33000
2020-06-06 33000

 

Preparing the Government data set

 

# Selecting relevant data
# Selecting values for England from the whole UK data set
gvtfull <- gvtfull %>% filter(grepl('England', areaName))

# Selecting relevant columns
gvtselect <- gvtfull %>% select(4, 5)

# Recognise the dates as datetime data
date <- as.Date(gvtselect$date, format = "%d/%m/%Y") 
covid <- gvtselect$newCasesByPublishDate
gvtdf <- data.frame(covid, date)

# Set 0 vals as NA
gvtdf$covid[gvtdf$covid==0] <- NA 

# Remove NA vals
gvtdf <- na.omit(gvtdf) 
rm(gvtfull, gvtselect, covid, date)

 

This set was much easier to prepare. The only real processing required was removing days where 0 new cases were logged. This was not due to no new cases actually occurring, but due to cases not being processed over the weekend. No data was lost by doing this, as cases occurring over the weekend were entered at the start of the subsequent week. That batching led to systematic Monday spikes in the data, which I will cover in more detail later on.
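As a quick sanity check (a sketch, intended to be run after the zeros are set to NA but before na.omit()), the dropped rows should fall almost entirely on weekends:

# Which weekdays do the zero-case (now NA) rows fall on?
table(weekdays(gvtdf$date[is.na(gvtdf$covid)]))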

 

Sample of the final Government data set:

 

covid date
1 64935 2022-03-29
2 183459 2022-03-28
5 63371 2022-03-25
6 79373 2022-03-24
7 85916 2022-03-23
8 74804 2022-03-22

 

Institute for Government data

 

As this data was qualitative (i.e. descriptions of lockdown stringency rather than a numerical rating), no significant processing was required. The stringency rating was visualised as vertical red bars on the plots, with higher stringency represented by higher opacity of the red colouration. The scale ran from 0.0 to 0.5 (these being opacity values for the ggplot objects used). I created a data frame of these values for incorporation into the final plot later:

 

# Adding the start dates of all the stringency bars
date1 <- as.Date(c("2020-05-03","2020-06-01","2020-06-15","2020-09-14","2020-11-05","2020-12-02",
                   "2021-01-06","2021-03-29","2021-04-12","2021-05-17","2021-07-19","2021-12-08"))

# Adding the end dates of all the stringency bars
date2 <- as.Date(c("2020-06-01","2020-06-15","2020-09-14","2020-11-05","2020-12-02","2021-01-06",
                   "2021-03-29","2021-04-12","2021-05-17","2021-07-19","2021-12-08","2022-02-24"))

# Adding the pigmentation values to represent stringency
alpha <- c(0.5,0.4,0.3,0.3,0.5,0.4,0.5,0.4,0.3,0.2,0.1,0.2)

#Creating dataframe
strindf <- data.frame(date1,date2,alpha)
rm(date1, date2, alpha)

 

The criteria for the ratings were as follows:

0.5 (highest): This score was given to a full lockdown, where no one except essential workers was permitted to leave the house for work.

0.4: Lockdown remains but with conditional allowances, for example children being allowed to go back to school. The majority of the population is under heavy restriction, however.

0.3: Partial lifting; lockdown measures remain widely in place, but allowances now extend to most or the rest of the population. For example, non-essential businesses are still closed, but people can meet in groups of 6 outside.

0.2: Non-essential businesses/practices are allowed to open, but restrictions remain, such as having to eat outside or operating at reduced capacity.

0.1: Restrictions remain but disruption to daily life is minimal; examples include having to wear a mask into a restaurant but being able to take it off and eat inside once seated.

0.0: Restrictions fully removed.

 

Code book

 

Variable name Description
covid The estimate or test data for each metric of Covid prevalence. Named identically in both data frames in order to build the comparison graphs
date The datetime data for each data set
alpha The lockdown stringency level for a given date range, encoded as an opacity value
barfunction() Function to convert strindf data into annotated rectangle layers
strindf Data frame containing the lockdown stringency information
onsdf Processed data frame of the ONS data
gvtdf Processed data frame of the government data

 

First steps

 

When building trial plots in preparation for the actual visualisation, it quickly became clear that it would be necessary to smooth the data in some way. As mentioned previously, the Government data set had large, systematic spikes due to the fact that no new cases were processed over the weekend. This meant that every Monday had approximately triple the usual number of new cases. Over the wide time period included in this visualisation, these spikes quickly made the data difficult to interpret.
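The reporting cycle is easy to confirm from the cleaned data (a quick sketch; the Monday average should be roughly triple the other weekdays):

# Mean new cases by day of the week - the Monday batch should stand out
gvtdf %>%
  mutate(weekday = weekdays(date)) %>%
  group_by(weekday) %>%
  summarise(mean_cases = mean(covid))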

 

 

As can be seen, the data becomes quite hard to interpret accurately. The large spikes naturally draw the eye to the top of each peak and make the Covid levels look much higher than they actually are. The challenge here was that none of the usual methods of smoothing seemed appropriate, usually because they relied on some form of linear formula. Fortunately, this could be dealt with using a spline - a piecewise polynomial regression that splits the data at a number of knots and fits a smooth polynomial curve between them. The more knots specified, the more closely the spline fits the original data.
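The plots below use geom_spline() from the ggformula package; the same idea can be sketched with base R's smooth.spline() as a quick check:

# Base-R equivalent of the smoothing used in the plots below (sketch)
fit <- smooth.spline(as.numeric(gvtdf$date), gvtdf$covid, nknots = 90)
plot(gvtdf$date, gvtdf$covid, type = "l", col = "grey",
     xlab = "Date", ylab = "New cases")
lines(gvtdf$date, predict(fit, as.numeric(gvtdf$date))$y,
      col = "darkcyan", lwd = 2)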

 

 

Much better! With this done, I could get to work on the first visualisation.

 

Visualisation 1

 

The first visualisation I wanted to try was a dual Y axis graph. This was primarily to see in detail how the two metrics varied across the different stages of lockdown. By plotting them together I aimed to provide the simplest comparison.

 

#Creating function to add stringency bars
barfunction <- function(date1, date2, alpha){
  a <- annotate(geom = "rect",
                xmin = as.Date(date1), xmax = as.Date(date2), ymin = 0, ymax = Inf, 
                alpha = alpha, fill = "red")
  return(a)
}

##Building graph

# Scaling coefficient linking the two y axes (the gvt line is divided by this)
coeff <- 0.05

#Adding initial data
compggp <- ggplot(NULL, aes(x = date, y = covid)) +
  #Splined ons data
  geom_spline(data = onsdf, 
              aes(x = date, y = covid, colour = "ONS Modeled Estimates"), nknots = 90, size = 1.3) +
  #Splined government data
  geom_spline(data = gvtdf, 
              aes(x = date, y = covid/coeff, colour = "Gvt Reported Positive Tests"), nknots = 90, size = 1.3)

#Adding stringency bars by applying barfunction() to the strindf data frame
compggp <- compggp + purrr::pmap(strindf, barfunction) 

#Adding aesthetics
compggp <- compggp + scale_x_date(limits = as.Date(c("2020-05-03", NA))) +
  theme_minimal() +
  # Left axis shows the ONS scale; the secondary axis reverses the coeff scaling
  scale_y_continuous(labels = scales::comma, name = "Modeled Population Estimates",
                     sec.axis = sec_axis(~.*coeff, name = "Reported Positive Tests")) +
  labs(title = "Estimated vs Reported Covid Cases over Lockdown",
       subtitle = "Data sourced from ONS, UK Government and the Institute for Government",
       x = "Date (year - month)") +
  scale_colour_manual(name = "",
                      # Colours must be specified here as these aesthetics are added outside the original aes() call
                      values = c("ONS Modeled Estimates" = "khaki4",
                                 "Gvt Reported Positive Tests" = "darkcyan",
                                 "Lockdown Stringency" = "red"))
compggp

#Save graph
ggsave("comp_y_6422.pdf", path = here::here("Figures"))

 

It is important to note that this graph is NOT intended to provide an accurate comparison of the extent of Covid infection in the population, as the dual Y axis is very misleading in that regard. However, given the scale of the numerical difference between the two metrics, I thought it would still be useful for comparing their shapes.

 

As we can see from the graph, at first glance there do seem to be some similarities between the metrics, with similar gradients across the timeline. Whilst, as mentioned, the dual axis is misleading in terms of overall scale, the metrics seem to covary reasonably consistently across the various stages of the pandemic. I found this slightly surprising, given the potential differences in the validity of the data sets (especially how prone the government set may have been to non-reporting).

This graph type was handy for visualising both metrics whilst retaining a high level of detail for each.

 

Visualisation 2

 

The second plot was a composite line graph. The benefit of this style of visualisation is that it gives a far more immediately representative visual comparison of the differences in scale between the two metrics.

 

##Comp line graph
#Adding initial data
ggp <- ggplot(NULL, aes(x = date, y = covid)) +
  # Splined ons data
  geom_spline(data = onsdf, 
              aes(x = date, y = covid, colour = "ONS Modeled Estimates"), nknots = 90, size = 1.3) +
  # Splined government data
  geom_spline(data = gvtdf, 
              aes(x = date, y = covid, colour = "Gvt Reported Positive Tests"), nknots = 90, size = 1.3) 


#Adding lockdown stringency bars by applying barfunction() to strindf data frame
ggp <- ggp + purrr::pmap(strindf, barfunction) 


#Adding aesthetics
ggp <- ggp + labs(title = "Estimated vs Reported Covid Cases over lockdown", 
       subtitle = "Data sourced from ONS, UK Government and Institute for Government", 
       x = "Date (year - month)", y = "Covid Cases") +
  scale_y_continuous(labels = scales::comma) +
  scale_x_date(limits = as.Date(c("2020-05-03", NA ))) +
  scale_colour_manual(name = "Legend",
                      values = c("ONS Modeled Estimates"="khaki4", 
                                 "Gvt Reported Positive Tests" = "darkcyan",
                                 "Lockdown Stringency"="red"))
ggp

#Save graph
ggsave("comp_line_6422.pdf", path = here::here("Figures"))

 

The composite graph really shows the main difference between the two metrics. Whilst visualisation 1 shows that the reported Covid tests also varied over the course of the lockdowns, the sheer scale of the difference between the reported tests and the estimated total number of cases is striking (this does, of course, assume the modelled total is accurate, which is unlikely to be exactly true).

Taking both graphs into account, I find it interesting to see the effect the lockdowns had, with the most stringent measures in particular showing a noticeable correlation with a falling gradient in estimated Covid prevalence.

Also of interest was the possible association between the end of December/beginning of the new year and significantly rising levels of infection.

 

Thoughts

 

Whilst both of these visualisations had useful aspects, they both had drawbacks as well. I wasn't happy with the dual Y axis, as it did not provide a valid visual comparison of the scale of the differences. The composite line graph with a single axis, on the other hand, whilst providing a very satisfying comparison of the difference in scale, rendered the government data set practically illegible. So, how to combine the detail of the dual-axis graph with the scale of the single axis?

 

Final visualisation

 

###Shiny and chrome

# Generating ui
library(shiny)
library(ggplot2)

ui <- fluidPage(
  h1("Positive tests vs Estimates: How do the metrics compare in scale and variance"),
  h4("Graph showing ONS Covid 19 total population prevalence estimates vs number of recorded positive tests over stages of lockdown"),
  sidebarLayout(
    sidebarPanel(
      width = 2,
      sliderInput("scalegvt","Scale Positive Test Results by:",  min = 1.0, max = 26, value = c(1.0)),
      h5("Use the slider above to scale the data for number of reported positive Covid   tests. 
        If you leave it at 1, you can see how the unmodified data sets compare in terms of estimates. 
        If you scale it up, you can see how the data sets covary over the stages of lockdown."),
      br(),
      h5("Sources for data:"),
      h5(a("ONS Estimate Data", href = "https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/conditionsanddiseases/bulletins/coronaviruscovid19infectionsurveypilot/18march2022",
           target = "_blank")),
      h5(a("Government test data", href = "https://coronavirus.data.gov.uk/details/cases", 
           target = "_blank")),
      h5(a("Covid response timeline", href = "https://www.instituteforgovernment.org.uk/charts/uk-government-coronavirus-lockdowns",
           target = "_blank"))
    ),
    
    mainPanel(
      width = 10, 
      plotOutput("distPlot")
    )
  )
)

## Creating server 
server <- function(input, output) {
  # Reactive scaling coefficient taken from the slider
  coeff1 <- reactive({ input$scalegvt })

  output$distPlot <- renderPlot({
    compggp <- ggplot(NULL, aes(x = date, y = covid)) +
      geom_spline(data = onsdf, 
                  aes(x = date, y = covid, colour = "ONS Modelled Estimates"), nknots = 90, size = 1.3) +
      geom_spline(data = gvtdf, 
                  aes(x = date, y = covid*coeff1(), colour = "Gvt Reported Positive Tests"), nknots = 90, 
                  size = 1.3)
    
    #Stringency bars
    compggp <- compggp + purrr::pmap(strindf, barfunction)
    
    #Adding aesthetics
    compggp <- compggp + scale_x_date(limits = as.Date(c("2020-05-03", NA ))) +
      theme_minimal() +
      theme(text = element_text(size = 20)) +
      scale_y_continuous(labels = scales::comma) +
      labs(x = "Date (year - month)", y = "Covid Cases (estimated and reported)") +
      scale_colour_manual(name = "",
                          # Label must match the one used in aes() above
                          values = c("ONS Modelled Estimates" = "khaki4",
                                     "Gvt Reported Positive Tests" = "darkcyan",
                                     "Lockdown Stringency" = "red"))
    compggp
  }, height = 600, width = 1200)
}

# Running application
shinyApp(ui = ui, server = server)

 

As you can’t run a shiny app in a static markdown file, the final app can be viewed by clicking the link below:

Summary

 

So, what does the final plot tell us?

  • Self-testing/reporting on its own is NOT an accurate measure of overall prevalence
    • If the ONS data is treated as accurate, roughly 5% of infected individuals took a test AND tested positive (and reported it)

 

  • The metrics generally varied quite consistently together, indicating that bias may not have been as significant in the government data as I had thought
    • It is interesting that the largest deviation from the ONS data begins roughly when Covid passes became mandatory for entry to venues such as pubs and clubs (Dec 2021). Of course, this is only a correlation…

 

  • Lockdown stringency had a significant effect on Covid spread
    • The strictest lockdown measures invariably coincided with falling infection levels
    • Reducing measures invariably coincided with rising levels
    • When measures were officially abolished in England (Feb 2022), levels began to rise again quickly

 

  • Time of year may have played a significant role in infection levels
    • The periods around Christmas and New Year were both associated with large spikes in infection

 

Reflection

 

What would I improve about my visualisation, or where would I like to take it further? One option is adding more metrics of Covid prevalence, perhaps from data sets on trends in Covid-related work absences or something similar. When beginning this project I wanted to work with very large data sets, so I went straight for national data. Having now worked with these sets, I would find it very interesting to compare a larger number of metrics and see how each behaved over the stages of lockdown. I think this could be incorporated quite easily into the final visualisation: adding an extra geom_spline() object and another slider input to the Shiny graph would allow the new data to be scaled as appropriate to whatever the viewer is interested in comparing. The only minor consideration would be how many data points the new set contained; if it were significantly fewer than the current sets, the nknots value for the spline would have to be adjusted. If a significant number of new metrics were being added, it would also likely be worth writing a custom spline-adding function in a similar way to barfunction(), as sketched below.
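A minimal sketch of such a helper (the splinefunction name and the absencedf example are hypothetical; any new metric's data frame is assumed to use the same date/covid column names as onsdf and gvtdf):

# Reusable spline layer, analogous to barfunction(); the legend colour for the
# label string would be resolved in scale_colour_manual(), as in the plots above
splinefunction <- function(df, label, knots = 90) {
  geom_spline(data = df,
              aes(x = date, y = covid, colour = label),
              nknots = knots, size = 1.3)
}

# Hypothetical usage for a new metric:
# ggp + splinefunction(absencedf, "Work Absences")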

A further improvement I would have liked to make is, instead of having a scaling factor applied to the line itself, a composite Y axis with a scale linked to the slider input. The default scale would be the same as the first axis, but the viewer could 'zoom' the second axis in to achieve the same effect as scaling the data, with a detailed axis to refer to. A sketch of this idea follows.
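Within the existing renderPlot() call, this would amount to plotting the government line scaled by the slider (as it already is) and letting a secondary axis undo that scaling. A sketch under those assumptions, reusing coeff1() from the server above:

# Replace the single y scale in the Shiny plot with a slider-driven dual axis
scale_y_continuous(labels = scales::comma, name = "Modelled Population Estimates",
                   sec.axis = sec_axis(~ . / coeff1(),
                                       name = "Reported Positive Tests",
                                       labels = scales::comma))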

 

Resources used

A great resource for learning your way around Shiny.

Random strangers on Stackoverflow.com who told me:

Right idea, wrong function

Add brackets