EpiInvert
to smooth the raw
values of cases and deathsEpiInvert
with the
smooth values of cases provided by OWIDEpiIndicators
for cases, hospitalizations and deathsFor a more detailed description see the associated MedRxiv preprint
The COVID-19 epidemic has provided us with information, for many countries, on the evolution of a number of key epidemiological time series such as the number of cases, deaths, hospitalization, or intensive care units (ICUs) occupation. In this work we explore the functional relationship between these time series (in what follows we name these time series as indicators). Let \(f(t)\) and \(g(t)\) be a pair of indicators, we will assume that there exists a smooth time-varying delay, \(s(t)\), between the values of both indicators (for instance, we expect a temporal delay (or time warping) between cases and deaths). Moreover, we assume that once the delay between both indicators is compensated, the ratio between their values, \(r(t)\), is a smooth function. Therefore we look for smooth functions \(s(t)\) and \(r(t)\) such that
\[\begin{equation} r(t)f(t) \approx g(t+s(t)) \end{equation}\]
of course, this problem is ill-posed because there are an infinite number of combinations of functions \(r(t)\) and \(s(t)\) that satisfy the above equation. However, the problem becomes well-posed if we force the functions \(r(t)\) and \(s(t)\) to be smooth. The problem is then posed in terms of “we want the above equation to verify as well as possible but for \(r(t)\) and \(s(t)\) as smooth as possible”. This leads us to the following variational formulation of the problem where \(r(t)\) and \(s(t)\) are estimated by minimizing the following error functional :
\[\begin{equation} E(r,s)=\!\!\int_{0}^T \!\!\!\left( r(t)f(t)-g(t+s(t)) \right)^2 + w_r\int_{0}^T\!\!r'(t)^2dt+w_s\int_{0}^T\!\!s'(t)^2dt \end{equation}\]
the first term of this energy penalizes a poor matching between \(r(t)f(t)\) and \(g(t+s(t))\). The second and third terms of the energy impose that \(r(t)\) and \(s(t)\) be smooth modulus the values of the weights \(w_r\) and \(w_s\). The larger the values of these weights the more regular \(r(t)\) and \(s(t)\) will be. We can also include a term in the energy to penalize that \(s(t)\) be outside an expected interval \([s_{min},s_{max}]\) (see the MedRxiv preprint), but if we use an interval \([s_{min},s_{max}]\), large enough to include the possible values of \(s(t)\), this term is not needed.
We implement the minimization of the above energy in the
EpiIndicators
function. The input of
EpiIndicators
is a dataframe
with the date and
values of the indicators \(f(t)\) and
\(g(t)\). The zero value should be
assigned to the indicators in the case where the real value is not
available.
The most important parameters of EpiIndicators
are \(w_r\) and \(w_s\) which determine the regularity of
\(r(t)\) and \(s(t)\), the default values of these
parameters are \(w_r=1000\) and \(w_s=10\). These values have been obtained
experimentally using simulated data. We always normalize \(f(t)\) and \(g(t)\) by dividing them by their medians
before minimizing the energy.
The user can also change the shift interval, \([s_{min},s_{max}]\), where \(s(t)\) is expected to be (by default \([s_{min},s_{max}]=[-10,25]\)). By default,
the boundary values of \(r(t)\) and
\(s(t)\) at the extremes of the time
interval \([0,T]\) are estimated
automatically, but the user can manually fix any of the boundary values
\(r(0)\), \(r(T)\), \(s(0)\) or \(s(T)\). The indicators \(f(t)\) and \(g(t)\) must be smooth functions. So, for
instance, the raw registered number of cases or deaths are not adequate
to run the function. These particular indicators should be smoothed
before executing EpiIndicators
, for instance, the restored
indicator values obtained by EpiInvert
can be used to get a
smooth version of these indicators.
The output of EpiIndicators
is a dataframe
containing the following columns:
date
: the date of the indicators valuef
: first indicator \(f(t)\)g
: second indicator \(g(t)\)r
: the estimated ratio \(r(t)\)s
: the estimated shift (delay) \(s(t)\)f.r
: \(r(t)*f(t)\)g.s
: \(g(t+s(t))\)We use, owid, a dataset containing COVID-19
epidemiological indicators for Canada, France, Germany, Italy, UK and
the USA obtained from Our
World in data up to 2022-11-28. In the case a data value is not
available for a given day we assign the value 0 to the indicator.
owid is a dataframe
containing the
following variables :
iso_code
: iso code of the countrylocation
: country namedate
: date of the indicator valuenew_cases
: new confirmed casesnew_cases_smoothed
: new confirmed cases smoothed (as
provided by Our
World in data)new_cases_restored_EpiInvert
: new confirmed cases
restored using EpiInvert
new_deaths
: new deaths attributed to COVID-19new_deaths_smoothed
: new deaths smoothed (as provided
by Our
World in data)new_deaths_restored_EpiInvert
: new deaths restored
using EpiInvert
icu_patients
: number of COVID-19 patients in intensive
care units (ICUs) on a given dayhosp_patients
: number of COVID-19 patients in hospital
on a given dayweekly_icu_admissions
: number of COVID-19 patients
newly admitted to intensive care units (ICUs) in a given week (reporting
date and the preceding 6 days)weekly_hosp_admissions
: number of COVID-19 patients
newly admitted to hospitals in a given week (reporting date and the
preceding 6 days)Not all countries have recorded values for all indicators in this database. For example, France and Italy have data for all the indicators, but the rest of the countries do not. Therefore, before using the data from a country, it is convenient to analyze, by exploring the dataset, which indicators have values other than zero.
EpiIndicators
is included in the
EpiInvert
CRAN package, so you can install EpiInvert
directly
from CRAN. You can also install the development version of
EpiInvert
from
GitHub with:
install.packages("devtools")
devtools::install_github("lalvarezmat/EpiInvert")
We attach some required packages
library(EpiInvert)
## Warning: package 'EpiInvert' was built under R version 4.3.2
library(ggplot2)
library(dplyr)
library(ggpubr)
We load the owid dataset:
data(owid)
summary(owid)
## iso_code location date new_cases
## Length:6007 Length:6007 Length:6007 Min. : 0
## Class :character Class :character Class :character 1st Qu.: 2058
## Mode :character Mode :character Mode :character Median : 11336
## Mean : 37533
## 3rd Qu.: 41396
## Max. :1355242
## new_cases_smoothed new_cases_restored_EpiInvert new_deaths
## Min. : 0 Min. : 7 Min. : 0.0
## 1st Qu.: 3180 1st Qu.: 3244 1st Qu.: 31.0
## Median : 15233 Median : 15750 Median : 104.0
## Mean : 37318 Mean : 37499 Mean : 305.9
## 3rd Qu.: 42573 3rd Qu.: 43033 3rd Qu.: 310.0
## Max. :806898 Max. :854609 Max. :4389.0
## new_deaths_smoothed new_deaths_restored_EpiInvert icu_patients
## Min. : 0.0 Min. : 0.0 Min. : 0
## 1st Qu.: 40.0 1st Qu.: 41.0 1st Qu.: 326
## Median : 113.0 Median : 112.0 Median : 932
## Mean : 305.1 Mean : 305.8 Mean : 2642
## 3rd Qu.: 333.0 3rd Qu.: 342.5 3rd Qu.: 2755
## Max. :3380.0 Max. :3375.0 Max. :28891
## hosp_patients weekly_icu_admissions weekly_hosp_admissions
## Min. : 0.0 Min. : 0.0 Min. : 0
## 1st Qu.: 977.5 1st Qu.: 0.0 1st Qu.: 552
## Median : 6629.0 Median : 0.0 Median : 4590
## Mean : 13940.8 Mean : 324.7 Mean : 10666
## 3rd Qu.: 19734.0 3rd Qu.: 351.0 3rd Qu.: 10618
## Max. :154497.0 Max. :4838.0 Max. :153977
For France, we compare the cases and deaths.
# Getting the France values from owid
sel <- filter(owid,iso_code=="FRA")
# Building a dataframe to call EpiIndicators
df<-data.frame(sel$date,sel$new_cases_restored_EpiInvert,sel$new_deaths_restored_EpiInvert)
# Running EpiIndicators
res <- EpiIndicators(df,EpiIndicators_params(list(wr=1000,ws=100)))
# Plotting the results
EpiIndicators_plot(res)
We notice that, in general, there is a good agreement between \(r(t)f(t)\) and \(g(t+s(t))\) except at the beginning of the epidemic where \(r(t)f(t)\) is much smaller than \(g(t+s(t))\), and \(s(t)\) is negative. This is likely due to a very significant underestimation of the number of cases at the beginning of the epidemic. We highlight that the mortality rate, \(r(t)\) starts around 3.7%, decreases sharply up to 2022 and then it stabilizes around 0.17%. The delay \(s(t)\) starts at -7.6 days, increases sharply up to 2021 and then it stabilizes around 16 days.
For France, we compare the cases and deaths modifying \(w_r\) and \(w_s\)
# Getting the France values from owid
sel <- filter(owid,iso_code=="FRA")
# Building a dataframe to call EpiIndicators_plot
df<-data.frame(sel$date,sel$new_cases_restored_EpiInvert,sel$new_deaths_restored_EpiInvert)
# Running EpiIndicators
res <- EpiIndicators(df,EpiIndicators_params(list(wr=10,ws=1)))
# Plotting the results
EpiIndicators_plot(res)
As expected, decreasing the value of \(w_r\) and \(w_s\) improves the agreement between \(r(t)f(t)\) and \(g(t+s(t))\) but \(r(t)\) and \(s(t)\) are more irregular.
For France, we compare hospitalizations’ weekly admissions and ICU’s weekly admissions
df<-data.frame(sel$date,sel$weekly_hosp_admissions,sel$weekly_icu_admissions)
res <- EpiIndicators(df,EpiIndicators_params(list(wr=1000,ws=100)))
EpiIndicators_plot(res)
We observe that there exists a very good agreement between \(r(t)f(t)\) and \(g(t+s(t))\). This is likely due to the fact that the quality of these indicators are higher than the quality of the reported cases used above.
We compare the deaths between Italy and France to estimate the delay between the COVID-19 waves in both countries
sel1 <- filter(owid,iso_code=="ITA")
sel2 <- filter(owid,iso_code=="FRA")
#we joint the deaths of both countries in a single dataframe
df <- joint_indicators_by_date(sel1$date,sel1$new_deaths_restored_EpiInvert,sel2$date,sel2$new_deaths_restored_EpiInvert)
#running EpiIndicators enlarging the expected interval for s(t)
res <- EpiIndicators(df,EpiIndicators_params(list(s_min=-40,s_max=40,wr=1000,ws=100)))
EpiIndicators_plot(res)
We know that the epidemic started in Italy before France. This is reflected in that \(s(0)\approx\) 11.8 days. Then \(s(t)\) decreases sharply and stabilizes near zero, indicating that there is currently a little delay in the epidemic waves in France and Italy. We also know that the first wave of the epidemic in Italy generated a large number of deaths, this is reflected in that \(r(0)\approx 1.12\). The median of \(r(t)\) for the whole period is \(0.885\), which is a value very close to the ratio between the populations of Italy and France (around 0.875), which indicates that, globally, the mortality rate in both countries has been similar.
EpiInvert
to smooth the raw values of
cases and deathsIn the dataset owid, the restored values of the
incidence and deaths using EpiInvert
are precomputed and
stored in the database, but if you need to smooth your own data using
EpiInvert
you have to run EpiInvert
on your
data. We illustrate this with an example of how to use
EpiInvert
to smooth the cases and deaths in real time for
the USA :
sel <- filter(owid,iso_code=="USA")
# last day of available data in the sequence
last_date <- sel$date[length(sel$date)]
# we do not use festive days, so we initialize festive days using a day outside the period
festive_days <- c("1999-01-01")
# running EpiInvert using the last 365 days for the raw cases and deaths
cases <- EpiInvert(sel$new_cases,last_date,festive_days,select_params(list(max_time_interval = 365)))
deaths <- EpiInvert(sel$new_deaths,last_date,festive_days,select_params(list(max_time_interval = 365)))
#we joint the restored cases and deaths in a single dataframe
df <- joint_indicators_by_date(cases$date,cases$i_restored,deaths$date,deaths$i_restored)
#running EpiIndicators
res <- EpiIndicators(df,EpiIndicators_params(list(wr=1000,ws=100)))
EpiIndicators_plot(res)
We observe that the current mortality rate in the USA is around 0.63% which is higher than in France which is around 0.17%. This could indicate that the number of new cases is currently underestimated in the USA with respect to France. The current delay, in the USA, \(s(T)\) is around 19.8 days (16 days in France), which can indicate that it takes less time in the USA to register new cases (and/or it takes more time to register new deaths).
EpiInvert
with the smooth values of
cases provided by Our
World in dataOur
World in data dataset includes smooth values of the cases and deaths
that we have stored in the owid dataset. Here we
compare this smooth data with the one provided by EpiInvert
for the new cases in France. We observe that the Our
World in data smooth version of cases is more irregular than
EpiInvert
, it can be zero sometimes and the median of the
delay \(s(t)\) is about 1.81 days. So,
in general, to smooth the raw values of cases and deaths, it is better
to use EpiInvert
instead of the Our
World in data smooth data.
sel <- filter(owid,iso_code=="FRA")
df<-data.frame(sel$date,sel$new_cases_restored_EpiInvert,sel$new_cases_smoothed)
res <- EpiIndicators(df,EpiIndicators_params(list(wr=1000,ws=100)))
EpiIndicators_plot(res)
To evaluate the capacity of EpiIndicators
to correctly
estimate \(r(t)\) and \(s(t)\) we developed a simulation framework
where both functions were known. We use as second indicator \(g(t)\) the restored number of deaths in
France using EpiInvert
and we generate
the first indicator \(f(t)\) from
simulated \(r(t)\) and \(s(t)\) using the formula
\[\begin{equation} f(t)=\frac{g(t+s(t))}{r(t)} \end{equation}\]
in this way, we obtain a perfect matching between both indicators.
\(r(t)\) is obtained using
atan()
type functions where we simulate a decreasing
behavior of \(r(t)\) and \(s(t)\) is obtained in the same way to
simulate an increasing behavior of \(s(t)\). The pre-computed values of \(g(t)\), \(r(t)\) and \(s(t)\) are stored in the URL file
“EpiIndicatorSimulatedData.csv”. Next, we read this file, compute \(f(t)\) using the above formula, and apply
EpiIndicators
to the resulting indicators:
# reading the file with simulated data
ControlData <- read.csv2(url("https://www.ctim.es/covid19/EpiIndicatorSimulatedData.csv"),sep=";")
g<-as.numeric(ControlData$g)
r0<-as.numeric(ControlData$r0)
s0<-as.numeric(ControlData$s0)
# computing f(t) using the above formula
f<-apply_delay(g,s0)/r0
# building a dataframe to run EpiIndicators
df<-data.frame(ControlData$date,f,g)
res <- EpiIndicators(df,EpiIndicators_params(list(wr=1000,ws=100)))
EpiIndicators_plot(res)
Next, we compare the obtained values of \(r(t)\) and \(s(t)\) with the ground truth:
df<-data.frame(date=as.Date(c(ControlData$date,res$date)),value=c(r0,res$r),legend=c(rep("true r(t)",length(r0)),rep("estimated r(t)",length(res$r))))
g1<-ggplot(df,aes(date,value, col=legend))+theme_bw() + theme(legend.position="bottom",axis.title.x = element_blank()) + geom_line()
df<-data.frame(date=as.Date(c(ControlData$date,res$date)),value=c(s0,res$s),legend=c(rep("true s(t)",length(r0)),rep("estimated s(t)",length(res$s))))
g2<-ggplot(df,aes(date,value, col=legend))+theme_bw() + theme(legend.position="bottom",axis.title.x = element_blank()) + geom_line()
ggarrange(ggarrange(g1, ncol = 1, labels = c("A")),ggarrange(g2, ncol = 1, labels = c("B")),nrow = 2)
We observe a good agreement between the estimated values of \(r(t)\), \(s(t)\), and the ground truth. Notice that we cannot expect the values to be exactly the same because in the areas where \(g(t+s(t))\) is very small or constant, there is not enough information to compute \(s(t)\) and in that case, the energy regularity term dominates, causing the estimated value to be slightly smoother than the ground-truth. On the contrary, at the peaks of the epidemic waves, the information is more robust and we can expect the calculation of \(r(t)\) and \(s(t)\) to be more accurate.
Next we study, using this simulated data, the influence of the order in which the indicators are taken. Let \(\tilde{r}(t)\) and \(\tilde{s}(t)\) the ratio and delay obtained using \(g(t)\) as the first indicator and \(f(t)\) as the second indicator. Then, accordingly with our model we have that
\[\begin{equation} \tilde{r}(t)g(t) \approx f(t+\tilde{s}(t)) \approx \frac{g(t+\tilde{s}(t)+s(t+\tilde{s}(t))) }{r(t+\tilde{s}(t))} \end{equation}\]
To check if this relation is satisfied we study the normalized error distribution:
\[\begin{equation} \frac{r(t+\tilde{s}(t))\tilde{r}(t)g(t)-g(t+\tilde{s}(t)+s(t+\tilde{s}(t)))}{ median(g)} \end{equation}\]
# computing EpiIndicators changing the role of f(t) and g(t)
df<-data.frame(ControlData$date,g,f)
res2 <- EpiIndicators(df,EpiIndicators_params(list(s_min=-25,s_max=10,wr=1000,ws=100)))
EpiIndicators_plot(res2)
# computing the normalized error
gs<-apply_delay(g,res$s)
gss<-apply_delay(gs,res2$s)
rs<-apply_delay(res$r,res2$s)
dif <- (rs*res2$r*g-gss)/median(g)
summary(dif)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.405273 -0.023496 0.008841 -0.001021 0.038259 0.657578
ggplot(data.frame(date=as.Date(res$date),normalized_error=dif),aes(x=date,y=normalized_error))+theme_bw() + geom_line()
The small size of the above normalized error shows the performance of the method and its stability when we change the role of the indicators. At the beginning we find larger errors due to boundary effects in the estimations.
In this section we study the transitivity of the estimation when 3 indicators are involved. Let \(f_1(t)\), \(f_2(t)\) and \(f_3(t)\) be 3 indicators. In this example we consider, in the case of France, that \(f_1(t)\) is the number of new cases restored using EpiInvert, \(f_2(t)\) is the number of new hospital admissions and \(f_3(t)\) is the number of new deaths restored using EpiInvert. The hospitalization data has been taken from the COVID-19 European Hub. We will study the relation between the ratio and delay between the 3 indicators. Let \(r_{i,j}\) and \(s_{i,j}\) be the ratio and delay between indicators \(f_i(t)\) and \(f_j(t)\). Then we have that
\[\begin{equation} r_{i,j}(t)f_{i}(t) \approx f_{j}(t+s_{i,j}(t)) \end{equation}\]
using this relation for \(\{i,j\}=\{1,2\},\{2,3\},\{1,3\}\) we obtain the following transitivity relations:
\[\begin{equation} r_{2,3}(t+s_{1,2}(t))r_{1,2}(t)f_{1}(t) \approx r_{2,3}(t+s_{1,2}(t))f_{2}(t+s_{1,2}(t)) \approx f_{3}(t+s_{1,2}(t)+s_{2,3}(t+s_{1,2}(t))) \end{equation}\] \[\begin{equation} r_{1,3}(t)\approx r_{1,2}(t)r_{2,3}(t+s_{1,2}(t)) \end{equation}\] \[\begin{equation} s_{1,3}(t)\approx s_{1,2}(t)+s_{2,3}(t+s_{1,2}(t)) \end{equation}\]
next we check if the above 2 relations are satisfied for the example we use :
EpiIndicators
data <- read.csv("https://www.ctim.es/covid19/FRA_cases_hosp_deaths.csv",sep=";")
df<-data.frame(data$date,data$cases,data$hosp)
res12 <- EpiIndicators(df,EpiIndicators_params(list(wr=1000,ws=100)))
EpiIndicators_plot(res12)
df<-data.frame(data$date,data$hosp,data$deaths)
res23 <- EpiIndicators(df,EpiIndicators_params(list(wr=1000,ws=100)))
EpiIndicators_plot(res23)
df<-data.frame(data$date,data$cases,data$deaths)
res13 <- EpiIndicators(df,EpiIndicators_params(list(wr=1000,ws=100)))
EpiIndicators_plot(res13)
EpiIndicators
can remove some of the
sequence values)r13<-res13$r
s13<-res13$s
df<-joint_indicators_by_date(res13$date,res13$r,res12$date,res12$s)
s12<-df$g
df<-joint_indicators_by_date(res13$date,res13$r,res12$date,res12$r)
r12<-df$g
df<-joint_indicators_by_date(res13$date,res13$r,res23$date,res23$s)
s23<-df$g
df<-joint_indicators_by_date(res13$date,res13$r,res23$date,res23$r)
r23<-df$g
#estimating r23(t+s12(t))
r23s <- apply_delay(r23,s12)
#estimating s23(t+s12(t))
s23s<-apply_delay(s23,s12)
df<-data.frame(date=as.Date(c(res13$date,res13$date)),value=c(r13,r23s*r12),legend=c(rep("r13(t)",length(r13)),rep("r23(t+s12(t))*r12(t)",length(r13))))
g1<-ggplot(df,aes(date,value, col=legend))+theme_bw() + theme(legend.position="bottom",axis.title.x = element_blank()) + geom_line()
df<-data.frame(date=as.Date(c(res13$date,res13$date)),value=c(s13,s12+s23s),legend=c(rep("s13(t)",length(r13)),rep("s12(t)+s23(t+s12(t))",length(r13))))
g2<-ggplot(df,aes(date,value, col=legend))+theme_bw() + theme(legend.position="bottom",axis.title.x = element_blank()) + geom_line()
ggarrange(ggarrange(g1, ncol = 1, labels = c("A")),ggarrange(g2, ncol = 1, labels = c("B")),nrow = 2)
We notice a reasonable transitivity between the estimations of the ratio and delay for the 3 indicators except at the beginning of the epidemic likely due to the poor quality of the cases sequence.