In this vignette we explore using epinowcast to estimate COVID-19 hospitalisations by date of positive test in Germany, stratified by age, using several model specifications with different degrees of flexibility. We then evaluate the resulting nowcasts using visual checks, approximate leave-one-out (LOO) cross-validation with Pareto smoothed importance sampling, and out-of-sample scoring using the weighted interval score and other scoring measures for the single report date considered here. Before working through this vignette we advise reading the model definition (vignette("model-definition")).
We use the epinowcast package, data.table and purrr for data manipulation, ggplot2 for plotting, knitr to produce tables of output, loo to approximately evaluate out-of-sample performance, and scoringutils to evaluate out-of-sample forecast performance.
library(epinowcast)
library(data.table)
library(purrr)
library(ggplot2)
library(loo)
library(scoringutils)
library(knitr)
This vignette includes several models that take upwards of 30 minutes to fit on a moderately equipped laptop. To speed up model fitting when more CPUs are available, set the number of threads used per chain to half the number of physical cores available (here 6, as we are using 2 MCMC chains and have 12 physical cores available). Note that this may cause conflicts with other processes running on your computer; if this is an issue, reduce the number of threads used.
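As a sketch, the fitting options described above could be set with enw_fit_opts(), which forwards sampler arguments to the cmdstanr backend; the specific iteration counts and adapt_delta value here are illustrative assumptions, and threads_per_chain should be adjusted to your hardware:

```r
# Sampler options shared by all model fits in this vignette.
# threads_per_chain = 6 assumes 2 chains and 12 physical cores;
# reduce it if other processes need CPU time.
fit <- enw_fit_opts(
  chains = 2, parallel_chains = 2, threads_per_chain = 6,
  iter_warmup = 500, iter_sampling = 500,
  adapt_delta = 0.95, show_messages = FALSE
)
```

The resulting `fit` object is passed to each call to epinowcast() below so all models are fit with the same sampler settings.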
Nowcasting is effectively the estimation of reporting patterns for recently reported data. This requires data on these patterns for previous observations, which typically means the time series of data as reported on multiple consecutive days (in theory non-consecutive days could be used, but this is not yet supported in epinowcast).
Here we use COVID-19 hospitalisations by date of positive test in Germany, stratified by age group, available up to the 1st of September 2021 (with 40 days of data included prior to this) as an example of data available in real time, and hospitalisations by date of positive test available up to the 20th of October 2021 to represent hospitalisations as finally reported. These data are sourced from the Robert Koch Institute via the Germany Nowcasting hub, where they are deconvolved from weekly data and days with negative reported hospitalisations are adjusted.
We first filter the data to those observations that would have been available on the 1st of September, keeping the last 40 days of reference dates.
nat_germany_hosp <- epinowcast::germany_covid19_hosp[location == "DE"]
retro_nat_germany <- nat_germany_hosp |>
enw_filter_report_dates(latest_date = "2021-09-01") |>
enw_filter_reference_dates(include_days = 40)
retro_nat_germany
#> reference_date location age_group confirm report_date
#> <IDat> <fctr> <fctr> <int> <IDat>
#> 1: 2021-07-23 DE 00+ 30 2021-07-23
#> 2: 2021-07-24 DE 00+ 31 2021-07-24
#> 3: 2021-07-25 DE 00+ 8 2021-07-25
#> 4: 2021-07-26 DE 00+ 9 2021-07-26
#> 5: 2021-07-27 DE 00+ 35 2021-07-27
#> ---
#> 6023: 2021-07-23 DE 05-14 1 2021-09-01
#> 6024: 2021-07-23 DE 15-34 21 2021-09-01
#> 6025: 2021-07-23 DE 35-59 39 2021-09-01
#> 6026: 2021-07-23 DE 60-79 21 2021-09-01
#> 6027: 2021-07-23 DE 80+ 5 2021-09-01
Similarly we then find the data that were available on the 20th of October for these dates, which will serve as the target “true” data.
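A sketch of this step, reusing the filtering helpers from above; the assumption that enw_filter_reference_dates() accepts an earliest_date argument, and the use of enw_latest_data() to keep only the most recently reported count for each reference date, should be checked against the package documentation:

```r
# Hospitalisations as reported up to the 20th of October, restricted to
# the same reference dates as the retrospective snapshot above
latest_germany_hosp <- nat_germany_hosp |>
  enw_filter_report_dates(latest_date = "2021-10-20") |>
  enw_filter_reference_dates(
    earliest_date = "2021-07-23", latest_date = "2021-09-01"
  ) |>
  enw_latest_data()
```

These "final" counts are what the nowcasts are later scored against.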
epinowcast works by assuming data have been preprocessed into the reporting format it requires, coupled with metadata for both reference and report dates. enw_preprocess_data() can be used for this, although users can also use the internal functions to produce their own custom preprocessing steps. It is at this stage that arbitrary groupings of observations can be defined, which are then propagated throughout all subsequent modelling steps. Here we have data stratified by age, and hence grouped by age group, but in principle this could be any grouping or combination of groups, independent of the reference and report date models. We must also specify a maximum delay, which is required to make the model identifiable. We set this to 40 days due to evidence of long reporting delays in this example data. However, note that in most cases the majority of right truncation occurs in the first few days, and that increasing the maximum delay has a non-linear effect on run time (i.e. a model with a maximum delay of 20 days will be much faster to fit than one with a maximum delay of 40 days). Note also that under the current formulation delays longer than the maximum are ignored, so the adjusted estimate is really for data reported within the maximum delay rather than for finally reported data.
Another key modelling choice we make at this stage is to model overall hospitalisations jointly with age groups rather than as an aggregation of age group estimates. This implicitly assumes that aggregated and non-aggregated data are not comparable (which may or may not be the case) but that the reporting process shares some of the same mechanisms. Another way to approach this would be to only model age stratified hospitalisations and then to aggregate the nowcast estimates into total counts after fitting the model.
pobs <- enw_preprocess_data(retro_nat_germany, max_delay = 40, by = "age_group")
pobs
#> obs new_confirm latest
#> <list> <list> <list>
#> 1: <data.table[6020x9]> <data.table[6020x11]> <data.table[287x10]>
#> missing_reference reporting_triangle metareference
#> <list> <list> <list>
#> 1: <data.table[0x6]> <data.table[287x42]> <data.table[287x9]>
#> metareport metadelay max_delay time snapshots by
#> <list> <list> <num> <int> <int> <list>
#> 1: <data.table[560x12]> <data.table[40x5]> 40 41 287 age_group
#> groups max_date timestep
#> <int> <IDat> <char>
#> 1: 7 2021-09-01 day
Here we explore a range of increasingly complex models using subject area knowledge and posterior predictive checks to motivate modelling choices.
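Our starting point is the package default model, whose posterior is summarised in the next step. A minimal sketch, assuming `fit` holds sampler options (for example from enw_fit_opts()):

```r
# Fit the package default model: an intercept-only parametric reference
# date (delay) model with the default expectation and report date models
nowcast <- epinowcast(pobs, fit = fit)
```

All other model components are left at their defaults here, so this serves as a baseline against which the more flexible specifications below can be compared.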
To speed up model fitting we make use of posterior information from the previous model (with some inflation of the standard deviations) as priors for some parameters. Note that this is not a truly Bayesian approach and in some situations may be problematic.
priors <- summary(
nowcast,
type = "fit",
variables = c("refp_mean_int", "refp_sd_int", "sqrt_phi")
)
priors[, sd := sd * 1.5]
priors
#> variable mean median sd mad q5
#> <char> <num> <num> <num> <num> <num>
#> 1: refp_mean_int[1] 1.1076321 1.0995100 0.17138501 0.10470121 0.9402545
#> 2: refp_sd_int[1] 2.5812547 2.5733600 0.18830440 0.12121738 2.3878600
#> 3: sqrt_phi[1] 0.6439369 0.6435505 0.02715621 0.01895875 0.6133418
#> q20 q80 q95 rhat ess_bulk ess_tail
#> <num> <num> <num> <num> <num> <num>
#> 1: 1.012036 1.1917780 1.3073580 0.9991323 698.8579 524.3550
#> 2: 2.473288 2.6831340 2.7961235 0.9995040 698.7063 418.7872
#> 3: 0.628594 0.6596474 0.6736205 0.9997282 1428.7212 698.8907
The underlying data we are trying to nowcast clearly has day-of-week periodicity. In our default model, a group-specific random walk on the log of notified cases, this is not accounted for. Accounting for it using a random effect on the day of the week is likely to improve nowcast performance and reduce the computation needed to fit the model. We can specify this using the enw_expectation() module and the formula interface as follows.
expectation_module <- enw_expectation(
~ 0 + (1 | day_of_week) + (1 | day:.group), data = pobs
)
exp_nowcast <- epinowcast(pobs,
expectation = expectation_module,
fit = fit,
priors = priors
)
We again visualise the nowcasts (figure: plot of chunk exp_nowcast).
In order to identify areas where the current model poorly reproduces the data, we plot the posterior predictions against the observations. This plot is faceted by age group and reference date, with the y axis showing the number of observations reported on a given day for a given reference date and the x axis showing the report date. We see fairly clear oscillations in reported cases every 7 days, expressed in the plot as oscillations within each facet that appear to move from left to right across facets. This indicates that some kind of day-of-week adjustment may be needed.
plot(exp_nowcast, type = "posterior") +
facet_wrap(vars(age_group, reference_date), scales = "free")