For a more detailed description see the associated MedRxiv preprint

Overview

The COVID-19 epidemic has provided us with information, for many countries, on the evolution of a number of key epidemiological time series such as the number of cases, deaths, hospitalization, or intensive care units (ICUs) occupation. In this work we explore the functional relationship between these time series (in what follows we name these time series as indicators). Let \(f(t)\) and \(g(t)\) be a pair of indicators, we will assume that there exists a smooth time-varying delay, \(s(t)\), between the values of both indicators (for instance, we expect a temporal delay (or time warping) between cases and deaths). Moreover, we assume that once the delay between both indicators is compensated, the ratio between their values, \(r(t)\), is a smooth function. Therefore we look for smooth functions \(s(t)\) and \(r(t)\) such that

\[\begin{equation} r(t)f(t) \approx g(t+s(t)) \end{equation}\]

of course, this problem is ill-posed because there are an infinite number of combinations of functions \(r(t)\) and \(s(t)\) that satisfy the above equation. However, the problem becomes well-posed if we force the functions \(r(t)\) and \(s(t)\) to be smooth. The problem is then posed in terms of “we want the above equation to verify as well as possible but for \(r(t)\) and \(s(t)\) as smooth as possible”. This leads us to the following variational formulation of the problem where \(r(t)\) and \(s(t)\) are estimated by minimizing the following error functional :

\[\begin{equation} E(r,s)=\!\!\int_{0}^T \!\!\!\left( r(t)f(t)-g(t+s(t)) \right)^2 + w_r\int_{0}^T\!\!r'(t)^2dt+w_s\int_{0}^T\!\!s'(t)^2dt \end{equation}\]

the first term of this energy penalizes a poor matching between \(r(t)f(t)\) and \(g(t+s(t))\). The second and third terms of the energy impose that \(r(t)\) and \(s(t)\) be smooth modulus the values of the weights \(w_r\) and \(w_s\). The larger the values of these weights the more regular \(r(t)\) and \(s(t)\) will be. We can also include a term in the energy to penalize that \(s(t)\) be outside an expected interval \([s_{min},s_{max}]\) (see the MedRxiv preprint), but if we use an interval \([s_{min},s_{max}]\), large enough to include the possible values of \(s(t)\), this term is not needed.

We implement the minimization of the above energy in the EpiIndicators function. The input of EpiIndicators is a dataframe with the date and values of the indicators \(f(t)\) and \(g(t)\). The zero value should be assigned to the indicators in the case where the real value is not available.

The most important parameters of EpiIndicators are \(w_r\) and \(w_s\) which determine the regularity of \(r(t)\) and \(s(t)\), the default values of these parameters are \(w_r=1000\) and \(w_s=10\). These values have been obtained experimentally using simulated data. We always normalize \(f(t)\) and \(g(t)\) by dividing them by their medians before minimizing the energy.

The user can also change the shift interval, \([s_{min},s_{max}]\), where \(s(t)\) is expected to be (by default \([s_{min},s_{max}]=[-10,25]\)). By default, the boundary values of \(r(t)\) and \(s(t)\) at the extremes of the time interval \([0,T]\) are estimated automatically, but the user can manually fix any of the boundary values \(r(0)\), \(r(T)\), \(s(0)\) or \(s(T)\). The indicators \(f(t)\) and \(g(t)\) must be smooth functions. So, for instance, the raw registered number of cases or deaths are not adequate to run the function. These particular indicators should be smoothed before executing EpiIndicators, for instance, the restored indicator values obtained by EpiInvert can be used to get a smooth version of these indicators.

The output of EpiIndicators is a dataframe containing the following columns:

owid dataset

We use, owid, a dataset containing COVID-19 epidemiological indicators for Canada, France, Germany, Italy, UK and the USA obtained from Our World in data up to 2022-11-28. In the case a data value is not available for a given day we assign the value 0 to the indicator. owid is a dataframe containing the following variables :

Not all countries have recorded values for all indicators in this database. For example, France and Italy have data for all the indicators, but the rest of the countries do not. Therefore, before using the data from a country, it is convenient to analyze, by exploring the dataset, which indicators have values other than zero.

Package installation

EpiIndicators is included in the EpiInvert CRAN package, so you can install EpiInvert directly from CRAN. You can also install the development version of EpiInvert from GitHub with:

 install.packages("devtools")
 devtools::install_github("lalvarezmat/EpiInvert")

We attach some required packages

library(EpiInvert)
## Warning: package 'EpiInvert' was built under R version 4.3.2
library(ggplot2)
library(dplyr)
library(ggpubr)

We load the owid dataset:

data(owid)
summary(owid)
##    iso_code           location             date             new_cases      
##  Length:6007        Length:6007        Length:6007        Min.   :      0  
##  Class :character   Class :character   Class :character   1st Qu.:   2058  
##  Mode  :character   Mode  :character   Mode  :character   Median :  11336  
##                                                           Mean   :  37533  
##                                                           3rd Qu.:  41396  
##                                                           Max.   :1355242  
##  new_cases_smoothed new_cases_restored_EpiInvert   new_deaths    
##  Min.   :     0     Min.   :     7               Min.   :   0.0  
##  1st Qu.:  3180     1st Qu.:  3244               1st Qu.:  31.0  
##  Median : 15233     Median : 15750               Median : 104.0  
##  Mean   : 37318     Mean   : 37499               Mean   : 305.9  
##  3rd Qu.: 42573     3rd Qu.: 43033               3rd Qu.: 310.0  
##  Max.   :806898     Max.   :854609               Max.   :4389.0  
##  new_deaths_smoothed new_deaths_restored_EpiInvert  icu_patients  
##  Min.   :   0.0      Min.   :   0.0                Min.   :    0  
##  1st Qu.:  40.0      1st Qu.:  41.0                1st Qu.:  326  
##  Median : 113.0      Median : 112.0                Median :  932  
##  Mean   : 305.1      Mean   : 305.8                Mean   : 2642  
##  3rd Qu.: 333.0      3rd Qu.: 342.5                3rd Qu.: 2755  
##  Max.   :3380.0      Max.   :3375.0                Max.   :28891  
##  hosp_patients      weekly_icu_admissions weekly_hosp_admissions
##  Min.   :     0.0   Min.   :   0.0        Min.   :     0        
##  1st Qu.:   977.5   1st Qu.:   0.0        1st Qu.:   552        
##  Median :  6629.0   Median :   0.0        Median :  4590        
##  Mean   : 13940.8   Mean   : 324.7        Mean   : 10666        
##  3rd Qu.: 19734.0   3rd Qu.: 351.0        3rd Qu.: 10618        
##  Max.   :154497.0   Max.   :4838.0        Max.   :153977

Comparison of cases and deaths

For France, we compare the cases and deaths.

# Getting the France values from owid
sel <- filter(owid,iso_code=="FRA")
# Building a dataframe to call EpiIndicators
df<-data.frame(sel$date,sel$new_cases_restored_EpiInvert,sel$new_deaths_restored_EpiInvert)
# Running EpiIndicators
res <- EpiIndicators(df,EpiIndicators_params(list(wr=1000,ws=100)))
# Plotting the results 
EpiIndicators_plot(res)

We notice that, in general, there is a good agreement between \(r(t)f(t)\) and \(g(t+s(t))\) except at the beginning of the epidemic where \(r(t)f(t)\) is much smaller than \(g(t+s(t))\), and \(s(t)\) is negative. This is likely due to a very significant underestimation of the number of cases at the beginning of the epidemic. We highlight that the mortality rate, \(r(t)\) starts around 3.7%, decreases sharply up to 2022 and then it stabilizes around 0.17%. The delay \(s(t)\) starts at -7.6 days, increases sharply up to 2021 and then it stabilizes around 16 days.

Modifying \(w_r\) and \(w_s\)

For France, we compare the cases and deaths modifying \(w_r\) and \(w_s\)

# Getting the France values from owid
sel <- filter(owid,iso_code=="FRA")
# Building a dataframe to call EpiIndicators_plot
df<-data.frame(sel$date,sel$new_cases_restored_EpiInvert,sel$new_deaths_restored_EpiInvert)
# Running EpiIndicators
res <- EpiIndicators(df,EpiIndicators_params(list(wr=10,ws=1)))
# Plotting the results 
EpiIndicators_plot(res)

As expected, decreasing the value of \(w_r\) and \(w_s\) improves the agreement between \(r(t)f(t)\) and \(g(t+s(t))\) but \(r(t)\) and \(s(t)\) are more irregular.

Comparison of hospitalizations and ICU’s new admissions

For France, we compare hospitalizations’ weekly admissions and ICU’s weekly admissions

df<-data.frame(sel$date,sel$weekly_hosp_admissions,sel$weekly_icu_admissions)
res <- EpiIndicators(df,EpiIndicators_params(list(wr=1000,ws=100)))
EpiIndicators_plot(res)

We observe that there exists a very good agreement between \(r(t)f(t)\) and \(g(t+s(t))\). This is likely due to the fact that the quality of these indicators are higher than the quality of the reported cases used above.

Comparison of epidemiological waves between countries

We compare the deaths between Italy and France to estimate the delay between the COVID-19 waves in both countries

sel1 <- filter(owid,iso_code=="ITA")
sel2 <- filter(owid,iso_code=="FRA")
#we joint the deaths of both countries in a single dataframe
df <- joint_indicators_by_date(sel1$date,sel1$new_deaths_restored_EpiInvert,sel2$date,sel2$new_deaths_restored_EpiInvert)
#running EpiIndicators enlarging the expected interval for s(t) 
res <- EpiIndicators(df,EpiIndicators_params(list(s_min=-40,s_max=40,wr=1000,ws=100)))
EpiIndicators_plot(res)

We know that the epidemic started in Italy before France. This is reflected in that \(s(0)\approx\) 11.8 days. Then \(s(t)\) decreases sharply and stabilizes near zero, indicating that there is currently a little delay in the epidemic waves in France and Italy. We also know that the first wave of the epidemic in Italy generated a large number of deaths, this is reflected in that \(r(0)\approx 1.12\). The median of \(r(t)\) for the whole period is \(0.885\), which is a value very close to the ratio between the populations of Italy and France (around 0.875), which indicates that, globally, the mortality rate in both countries has been similar.

Using EpiInvert to smooth the raw values of cases and deaths

In the dataset owid, the restored values of the incidence and deaths using EpiInvert are precomputed and stored in the database, but if you need to smooth your own data using EpiInvert you have to run EpiInvert on your data. We illustrate this with an example of how to use EpiInvert to smooth the cases and deaths in real time for the USA :

sel <- filter(owid,iso_code=="USA")
# last day of available data in the sequence 
last_date <- sel$date[length(sel$date)]
# we do not use festive days, so we initialize festive days using a day outside the period
festive_days <- c("1999-01-01")
# running EpiInvert using the last 365 days for the raw cases and deaths
cases <- EpiInvert(sel$new_cases,last_date,festive_days,select_params(list(max_time_interval = 365)))
deaths <- EpiInvert(sel$new_deaths,last_date,festive_days,select_params(list(max_time_interval = 365)))
#we joint the restored cases and deaths in a single dataframe
df <- joint_indicators_by_date(cases$date,cases$i_restored,deaths$date,deaths$i_restored)
#running EpiIndicators 
res <- EpiIndicators(df,EpiIndicators_params(list(wr=1000,ws=100)))
EpiIndicators_plot(res)

We observe that the current mortality rate in the USA is around 0.63% which is higher than in France which is around 0.17%. This could indicate that the number of new cases is currently underestimated in the USA with respect to France. The current delay, in the USA, \(s(T)\) is around 19.8 days (16 days in France), which can indicate that it takes less time in the USA to register new cases (and/or it takes more time to register new deaths).

Comparing EpiInvert with the smooth values of cases provided by Our World in data

Our World in data dataset includes smooth values of the cases and deaths that we have stored in the owid dataset. Here we compare this smooth data with the one provided by EpiInvert for the new cases in France. We observe that the Our World in data smooth version of cases is more irregular than EpiInvert, it can be zero sometimes and the median of the delay \(s(t)\) is about 1.81 days. So, in general, to smooth the raw values of cases and deaths, it is better to use EpiInvert instead of the Our World in data smooth data.

sel <- filter(owid,iso_code=="FRA")
df<-data.frame(sel$date,sel$new_cases_restored_EpiInvert,sel$new_cases_smoothed)
res <- EpiIndicators(df,EpiIndicators_params(list(wr=1000,ws=100)))
EpiIndicators_plot(res)

Simulation study

To evaluate the capacity of EpiIndicators to correctly estimate \(r(t)\) and \(s(t)\) we developed a simulation framework where both functions were known. We use as second indicator \(g(t)\) the restored number of deaths in France using EpiInvert and we generate
the first indicator \(f(t)\) from simulated \(r(t)\) and \(s(t)\) using the formula

\[\begin{equation} f(t)=\frac{g(t+s(t))}{r(t)} \end{equation}\]

in this way, we obtain a perfect matching between both indicators. \(r(t)\) is obtained using atan() type functions where we simulate a decreasing behavior of \(r(t)\) and \(s(t)\) is obtained in the same way to simulate an increasing behavior of \(s(t)\). The pre-computed values of \(g(t)\), \(r(t)\) and \(s(t)\) are stored in the URL file “EpiIndicatorSimulatedData.csv”. Next, we read this file, compute \(f(t)\) using the above formula, and apply EpiIndicators to the resulting indicators:

# reading the file with simulated data
ControlData <- read.csv2(url("https://www.ctim.es/covid19/EpiIndicatorSimulatedData.csv"),sep=";")
g<-as.numeric(ControlData$g)
r0<-as.numeric(ControlData$r0)
s0<-as.numeric(ControlData$s0)
# computing f(t) using the above formula 
f<-apply_delay(g,s0)/r0
# building a dataframe to run EpiIndicators
df<-data.frame(ControlData$date,f,g)
res <- EpiIndicators(df,EpiIndicators_params(list(wr=1000,ws=100)))
EpiIndicators_plot(res)

Next, we compare the obtained values of \(r(t)\) and \(s(t)\) with the ground truth:

df<-data.frame(date=as.Date(c(ControlData$date,res$date)),value=c(r0,res$r),legend=c(rep("true r(t)",length(r0)),rep("estimated r(t)",length(res$r))))
g1<-ggplot(df,aes(date,value, col=legend))+theme_bw() + theme(legend.position="bottom",axis.title.x = element_blank()) + geom_line()
df<-data.frame(date=as.Date(c(ControlData$date,res$date)),value=c(s0,res$s),legend=c(rep("true s(t)",length(r0)),rep("estimated s(t)",length(res$s))))
g2<-ggplot(df,aes(date,value, col=legend))+theme_bw() + theme(legend.position="bottom",axis.title.x = element_blank()) + geom_line()
ggarrange(ggarrange(g1, ncol = 1, labels = c("A")),ggarrange(g2, ncol = 1, labels = c("B")),nrow = 2)

We observe a good agreement between the estimated values of \(r(t)\), \(s(t)\), and the ground truth. Notice that we cannot expect the values to be exactly the same because in the areas where \(g(t+s(t))\) is very small or constant, there is not enough information to compute \(s(t)\) and in that case, the energy regularity term dominates, causing the estimated value to be slightly smoother than the ground-truth. On the contrary, at the peaks of the epidemic waves, the information is more robust and we can expect the calculation of \(r(t)\) and \(s(t)\) to be more accurate.

Next we study, using this simulated data, the influence of the order in which the indicators are taken. Let \(\tilde{r}(t)\) and \(\tilde{s}(t)\) the ratio and delay obtained using \(g(t)\) as the first indicator and \(f(t)\) as the second indicator. Then, accordingly with our model we have that

\[\begin{equation} \tilde{r}(t)g(t) \approx f(t+\tilde{s}(t)) \approx \frac{g(t+\tilde{s}(t)+s(t+\tilde{s}(t))) }{r(t+\tilde{s}(t))} \end{equation}\]

To check if this relation is satisfied we study the normalized error distribution:

\[\begin{equation} \frac{r(t+\tilde{s}(t))\tilde{r}(t)g(t)-g(t+\tilde{s}(t)+s(t+\tilde{s}(t)))}{ median(g)} \end{equation}\]

# computing EpiIndicators changing the role of f(t) and g(t) 
df<-data.frame(ControlData$date,g,f)
res2 <- EpiIndicators(df,EpiIndicators_params(list(s_min=-25,s_max=10,wr=1000,ws=100)))
EpiIndicators_plot(res2)

# computing the normalized error 
gs<-apply_delay(g,res$s)
gss<-apply_delay(gs,res2$s)
rs<-apply_delay(res$r,res2$s)
dif <- (rs*res2$r*g-gss)/median(g)
summary(dif)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.405273 -0.023496  0.008841 -0.001021  0.038259  0.657578
ggplot(data.frame(date=as.Date(res$date),normalized_error=dif),aes(x=date,y=normalized_error))+theme_bw() + geom_line()

The small size of the above normalized error shows the performance of the method and its stability when we change the role of the indicators. At the beginning we find larger errors due to boundary effects in the estimations.

Transitivity

In this section we study the transitivity of the estimation when 3 indicators are involved. Let \(f_1(t)\), \(f_2(t)\) and \(f_3(t)\) be 3 indicators. In this example we consider, in the case of France, that \(f_1(t)\) is the number of new cases restored using EpiInvert, \(f_2(t)\) is the number of new hospital admissions and \(f_3(t)\) is the number of new deaths restored using EpiInvert. The hospitalization data has been taken from the COVID-19 European Hub. We will study the relation between the ratio and delay between the 3 indicators. Let \(r_{i,j}\) and \(s_{i,j}\) be the ratio and delay between indicators \(f_i(t)\) and \(f_j(t)\). Then we have that

\[\begin{equation} r_{i,j}(t)f_{i}(t) \approx f_{j}(t+s_{i,j}(t)) \end{equation}\]

using this relation for \(\{i,j\}=\{1,2\},\{2,3\},\{1,3\}\) we obtain the following transitivity relations:

\[\begin{equation} r_{2,3}(t+s_{1,2}(t))r_{1,2}(t)f_{1}(t) \approx r_{2,3}(t+s_{1,2}(t))f_{2}(t+s_{1,2}(t)) \approx f_{3}(t+s_{1,2}(t)+s_{2,3}(t+s_{1,2}(t))) \end{equation}\] \[\begin{equation} r_{1,3}(t)\approx r_{1,2}(t)r_{2,3}(t+s_{1,2}(t)) \end{equation}\] \[\begin{equation} s_{1,3}(t)\approx s_{1,2}(t)+s_{2,3}(t+s_{1,2}(t)) \end{equation}\]

next we check if the above 2 relations are satisfied for the example we use :

Loading the data and running EpiIndicators

data <- read.csv("https://www.ctim.es/covid19/FRA_cases_hosp_deaths.csv",sep=";")
df<-data.frame(data$date,data$cases,data$hosp)
res12 <- EpiIndicators(df,EpiIndicators_params(list(wr=1000,ws=100)))
EpiIndicators_plot(res12)

df<-data.frame(data$date,data$hosp,data$deaths)
res23 <- EpiIndicators(df,EpiIndicators_params(list(wr=1000,ws=100)))
EpiIndicators_plot(res23)

df<-data.frame(data$date,data$cases,data$deaths)
res13 <- EpiIndicators(df,EpiIndicators_params(list(wr=1000,ws=100)))
EpiIndicators_plot(res13)

Estimating \(r_{i,j}\) and \(s_{i,j}\) fixing the same dates in all cases (notice that EpiIndicators can remove some of the sequence values)

r13<-res13$r
s13<-res13$s
df<-joint_indicators_by_date(res13$date,res13$r,res12$date,res12$s)
s12<-df$g
df<-joint_indicators_by_date(res13$date,res13$r,res12$date,res12$r)
r12<-df$g
df<-joint_indicators_by_date(res13$date,res13$r,res23$date,res23$s)
s23<-df$g
df<-joint_indicators_by_date(res13$date,res13$r,res23$date,res23$r)
r23<-df$g

#estimating r23(t+s12(t))
r23s <- apply_delay(r23,s12)
#estimating s23(t+s12(t))
s23s<-apply_delay(s23,s12)

Plotting \(r_{1,3}(t)\) and \(r_{2,3}(t+s_{1,2}(t))*r_{1,2}(t)\) in (A) and \(s_{1,3}(t)\) and \(s_{1,2}(t)+s_{2,3}(t+s_{1,2}(t))\) in (B)

df<-data.frame(date=as.Date(c(res13$date,res13$date)),value=c(r13,r23s*r12),legend=c(rep("r13(t)",length(r13)),rep("r23(t+s12(t))*r12(t)",length(r13))))
g1<-ggplot(df,aes(date,value, col=legend))+theme_bw() + theme(legend.position="bottom",axis.title.x = element_blank()) + geom_line()
df<-data.frame(date=as.Date(c(res13$date,res13$date)),value=c(s13,s12+s23s),legend=c(rep("s13(t)",length(r13)),rep("s12(t)+s23(t+s12(t))",length(r13))))
g2<-ggplot(df,aes(date,value, col=legend))+theme_bw() + theme(legend.position="bottom",axis.title.x = element_blank()) + geom_line()
ggarrange(ggarrange(g1, ncol = 1, labels = c("A")),ggarrange(g2, ncol = 1, labels = c("B")),nrow = 2)

We notice a reasonable transitivity between the estimations of the ratio and delay for the 3 indicators except at the beginning of the epidemic likely due to the poor quality of the cases sequence.