When comparing forecast models, not all models will have made predictions for every target. Naively averaging scores across different sets of targets can produce misleading summaries. This vignette walks through diagnosing which targets each model covers, scoring the forecasts, and then adjusting the scores using filtering and imputation so that summaries are comparable.
The approaches here were initially inspired by Kim, Ray & Reich (2026), who discuss the importance of handling missing forecasts when evaluating model contributions beyond simple leaderboard rankings.
Before adjusting scores we should aim to understand the patterns of potential missingness. Models may have different coverage for legitimate reasons (a model may only forecast deaths, not cases) or because of operational failures (a missed submission deadline).
get_forecast_counts() tabulates how many forecasts each
model has, grouped by the columns you choose.
library(scoringutils)
fc <- as_forecast_quantile(example_quantile)
#> ℹ Some rows containing NA values may be removed. This is fine if not
#> unexpected.
get_forecast_counts(fc, by = c("model", "target_type"))
#> Key: <model, target_type>
#> model target_type count
#> <char> <char> <int>
#> 1: EuroCOVIDhub-baseline Cases 128
#> 2: EuroCOVIDhub-baseline Deaths 128
#> 3: EuroCOVIDhub-ensemble Cases 128
#> 4: EuroCOVIDhub-ensemble Deaths 128
#> 5: UMass-MechBayes Cases 0
#> 6: UMass-MechBayes Deaths 128
#> 7: epiforecasts-EpiNow2 Cases 128
#> 8: epiforecasts-EpiNow2 Deaths 119UMass-MechBayes does not forecast cases at all, and
epiforecasts-EpiNow2 has fewer death forecasts than the
other models.
To see exactly which death targets epiforecasts-EpiNow2
is missing, we can request counts at a finer level and filter to
zero-count rows.
death_counts <- get_forecast_counts(
fc,
by = c("model", "target_type", "location", "target_end_date")
)
death_counts[
model == "epiforecasts-EpiNow2" &
target_type == "Deaths" &
count == 0
]
#> Key: <model, target_type, location, target_end_date>
#> model target_type location target_end_date count
#> <char> <char> <char> <Date> <int>
#> 1: epiforecasts-EpiNow2 Deaths FR 2021-06-19 0Note that get_pairwise_comparisons() handles missingness
internally, restricting each pair of models to their shared set of
targets before comparing. The functions below give you explicit control
over how to handle missingness when computing score summaries.
We score all forecasts first, then adjust the resulting scores table.
Both filter_scores() and
impute_missing_scores() accept a compare
argument (default "model") that specifies which column
identifies the units being compared.
A naive summary averages over different numbers of targets per model, which can make direct comparison misleading.
summarise_scores(scores, by = "model")
#> model wis overprediction underprediction dispersion
#> <char> <num> <num> <num> <num>
#> 1: EuroCOVIDhub-ensemble 8992.62316 5025.130095 2120.64029 1846.85278
#> 2: EuroCOVIDhub-baseline 14321.48926 7081.000000 5143.53567 2096.95360
#> 3: epiforecasts-EpiNow2 10827.40786 6179.439535 1697.23411 2950.73422
#> 4: UMass-MechBayes 52.65195 8.978601 16.80095 26.87239
#> bias interval_coverage_50 interval_coverage_90 ae_median
#> <num> <num> <num> <num>
#> 1: 0.00812500 0.6328125 0.9023438 12077.10156
#> 2: 0.21851562 0.4960938 0.9101562 19353.42969
#> 3: -0.04336032 0.4453441 0.8461538 14521.10526
#> 4: -0.02234375 0.4609375 0.8750000 78.47656The sections below show two approaches to addressing this: filtering scores to a common set of targets, and imputing scores for missing targets.
filter_scores() removes scores based on the supplied
strategy. This can be appropriate when a model legitimately does not
cover certain targets and you want to compare only on shared ground.
The default strategy, filter_to_intersection(), keeps
only targets covered by all models.
scores_filtered <- filter_scores(scores)
#> ℹ Filtered out 411 rows.
#> ℹ 476 of 887 rows remaining.
summarise_scores(scores_filtered, by = "model")
#> model wis overprediction underprediction dispersion
#> <char> <num> <num> <num> <num>
#> 1: EuroCOVIDhub-ensemble 41.30642 7.366094 4.313117 29.62721
#> 2: EuroCOVIDhub-baseline 158.92682 66.312386 2.257216 90.35722
#> 3: UMass-MechBayes 49.58008 7.935331 15.134819 26.50993
#> 4: epiforecasts-EpiNow2 66.64282 18.892583 15.893314 31.85692
#> bias interval_coverage_50 interval_coverage_90 ae_median
#> <num> <num> <num> <num>
#> 1: 0.06890756 0.8655462 1.0000000 53.86555
#> 2: 0.32941176 0.6638655 1.0000000 230.96639
#> 3: -0.01495798 0.4621849 0.8991597 74.17647
#> 4: -0.00512605 0.4201681 0.9075630 104.74790This drops all case targets (since UMass-MechBayes has
no case forecasts) and the death targets
epiforecasts-EpiNow2 missed.
The default requires every model to cover a target for it to be kept.
The min_coverage argument relaxes this by keeping targets
covered by at least a given proportion of models. With four models,
min_coverage = 0.75 requires coverage by at least
three.
scores_relaxed <- filter_scores(
scores,
strategy = filter_to_intersection(min_coverage = 0.75)
)
#> ℹ No rows filtered. Returning scores unchanged.
summarise_scores(scores_relaxed, by = "model")
#> model wis overprediction underprediction dispersion
#> <char> <num> <num> <num> <num>
#> 1: EuroCOVIDhub-ensemble 8992.62316 5025.130095 2120.64029 1846.85278
#> 2: EuroCOVIDhub-baseline 14321.48926 7081.000000 5143.53567 2096.95360
#> 3: epiforecasts-EpiNow2 10827.40786 6179.439535 1697.23411 2950.73422
#> 4: UMass-MechBayes 52.65195 8.978601 16.80095 26.87239
#> bias interval_coverage_50 interval_coverage_90 ae_median
#> <num> <num> <num> <num>
#> 1: 0.00812500 0.6328125 0.9023438 12077.10156
#> 2: 0.21851562 0.4960938 0.9101562 19353.42969
#> 3: -0.04336032 0.4453441 0.8461538 14521.10526
#> 4: -0.02234375 0.4609375 0.8750000 78.47656In this example, case targets are covered by three of four models and
death targets by three or four, so min_coverage = 0.75
retains all targets. The default (min_coverage = 1) is
stricter and drops all case targets because UMass-MechBayes
has no case forecasts. Between 0.75 and 1.0 no intermediate threshold
changes the result here because no target is covered by exactly three
out of four models while being missing for one that isn’t
UMass-MechBayes.
filter_to_include() restricts to targets covered by
named models. For example, to evaluate all models only on the targets
epiforecasts-EpiNow2 covered:
scores_epinow2 <- filter_scores(
scores,
strategy = filter_to_include("epiforecasts-EpiNow2")
)
#> ℹ Filtered out 27 rows.
#> ℹ 860 of 887 rows remaining.
summarise_scores(scores_epinow2, by = "model")
#> model wis overprediction underprediction dispersion
#> <char> <num> <num> <num> <num>
#> 1: EuroCOVIDhub-ensemble 9318.72435 5208.081676 2197.86217 1912.78050
#> 2: EuroCOVIDhub-baseline 14837.28683 7336.810069 5330.95195 2169.52482
#> 3: epiforecasts-EpiNow2 10827.40786 6179.439535 1697.23411 2950.73422
#> 4: UMass-MechBayes 49.58008 7.935331 15.13482 26.50993
#> bias interval_coverage_50 interval_coverage_90 ae_median
#> <num> <num> <num> <num>
#> 1: 0.003967611 0.6194332 0.8987854 12515.57490
#> 2: 0.209473684 0.4898785 0.9068826 20049.01215
#> 3: -0.043360324 0.4453441 0.8461538 14521.10526
#> 4: -0.014957983 0.4621849 0.8991597 74.17647This keeps both case and death targets where EpiNow2 submitted
forecasts, but UMass-MechBayes will still be missing case
scores in the result.
Instead of dropping data, impute_missing_scores() fills
in scores for target combinations a model did not cover. Imputed rows
are marked with .imputed = TRUE so they can be identified
later.
The simplest option: fill missing scores with NA. This
preserves the structure of the data without making assumptions about
what the score would have been. NA values propagate through
summaries, so this can also serve as a diagnostic check to confirm where
missingness exists.
scores_na <- impute_missing_scores(
scores,
strategy = impute_na_score()
)
#> ℹ Imputing 137 missing score rows.
#> ℹ 2 model values affected.
summarise_scores(scores_na, by = "model")
#> model wis overprediction underprediction dispersion
#> <char> <num> <num> <num> <num>
#> 1: EuroCOVIDhub-ensemble 8992.623 5025.13 2120.640 1846.853
#> 2: EuroCOVIDhub-baseline 14321.489 7081.00 5143.536 2096.954
#> 3: epiforecasts-EpiNow2 NA NA NA NA
#> 4: UMass-MechBayes NA NA NA NA
#> bias interval_coverage_50 interval_coverage_90 ae_median
#> <num> <num> <num> <num>
#> 1: 0.0081250 0.6328125 0.9023438 12077.10
#> 2: 0.2185156 0.4960938 0.9101562 19353.43
#> 3: NA NA NA NA
#> 4: NA NA NA NAFill each missing score with the worst (maximum) observed score for that target across all models. This penalises models most heavily for missing targets.
scores_worst <- impute_missing_scores(
scores,
strategy = impute_worst_score()
)
#> ℹ Imputing 137 missing score rows.
#> ℹ 2 model values affected.
summarise_scores(scores_worst, by = "model")
#> model wis overprediction underprediction dispersion
#> <char> <num> <num> <num> <num>
#> 1: EuroCOVIDhub-ensemble 8992.623 5025.130 2120.640 1846.853
#> 2: EuroCOVIDhub-baseline 14321.489 7081.000 5143.536 2096.954
#> 3: epiforecasts-EpiNow2 10452.972 5964.931 1638.933 2850.699
#> 4: UMass-MechBayes 15389.614 7632.698 6020.870 3725.102
#> bias interval_coverage_50 interval_coverage_90 ae_median
#> <num> <num> <num> <num>
#> 1: 0.00812500 0.6328125 0.9023438 12077.10
#> 2: 0.21851562 0.4960938 0.9101562 19353.43
#> 3: -0.02054687 0.4648438 0.8515625 14020.69
#> 4: 0.13046875 0.6015625 0.9101562 21026.91We can check that the imputed rows match the models and targets we identified as missing earlier.
Fill with the scores of a named baseline model, treating a missing forecast as performing no better than that baseline.
scores_ref <- impute_missing_scores(
scores,
strategy = impute_model_score("EuroCOVIDhub-baseline")
)
#> ℹ Imputing 137 missing score rows.
#> ℹ 2 model values affected.
summarise_scores(scores_ref, by = "model")
#> model wis overprediction underprediction dispersion
#> <char> <num> <num> <num> <num>
#> 1: EuroCOVIDhub-ensemble 8992.623 5025.130 2120.640 1846.853
#> 2: EuroCOVIDhub-baseline 14321.489 7081.000 5143.536 2096.954
#> 3: epiforecasts-EpiNow2 10452.583 5964.318 1637.566 2850.699
#> 4: UMass-MechBayes 14268.113 7052.540 5150.887 2064.687
#> bias interval_coverage_50 interval_coverage_90 ae_median
#> <num> <num> <num> <num>
#> 1: 0.00812500 0.6328125 0.9023438 12077.10
#> 2: 0.21851562 0.4960938 0.9101562 19353.43
#> 3: -0.02542969 0.4531250 0.8515625 14019.86
#> 4: 0.03781250 0.3945312 0.8476562 19276.04This is a reasonable default when a suitable baseline exists, though more research is needed on best practice for choosing the reference model and understanding the impact of this choice.
Fill with the mean score across models that did forecast each target. This is the least severe penalty as it assigns the average performance, which may be close to the ensemble performance.
scores_mean <- impute_missing_scores(
scores,
strategy = impute_mean_score()
)
#> ℹ Imputing 137 missing score rows.
#> ℹ 2 model values affected.
summarise_scores(scores_mean, by = "model")
#> model wis overprediction underprediction dispersion
#> <char> <num> <num> <num> <num>
#> 1: EuroCOVIDhub-ensemble 8992.623 5025.130 2120.640 1846.853
#> 2: EuroCOVIDhub-baseline 14321.489 7081.000 5143.536 2096.954
#> 3: epiforecasts-EpiNow2 10450.295 5963.217 1638.036 2849.042
#> 4: UMass-MechBayes 11236.152 6012.164 2972.151 2251.837
#> bias interval_coverage_50 interval_coverage_90 ae_median
#> <num> <num> <num> <num>
#> 1: 0.00812500 0.6328125 0.9023438 12077.10
#> 2: 0.21851562 0.4960938 0.9101562 19353.43
#> 3: -0.03634115 0.4544271 0.8463542 14015.78
#> 4: -0.01739583 0.4283854 0.8398438 15122.32Consider combining filtering and imputation when you want to focus on
a specific model’s targets but still need complete scores for all
models. For example, to evaluate on the targets
epiforecasts-EpiNow2 covered and then impute scores for
models that are missing forecasts within that set:
result <- scores |>
filter_scores(
strategy = filter_to_include("epiforecasts-EpiNow2")
) |>
impute_missing_scores(
strategy = impute_worst_score()
)
#> ℹ Filtered out 27 rows.
#> ℹ 860 of 887 rows remaining.
#> ℹ Imputing 128 missing score rows.
#> ℹ 1 model value affected.
summarise_scores(result, by = "model")
#> model wis overprediction underprediction dispersion
#> <char> <num> <num> <num> <num>
#> 1: EuroCOVIDhub-ensemble 9318.724 5208.082 2197.862 1912.781
#> 2: EuroCOVIDhub-baseline 14837.287 7336.810 5330.952 2169.525
#> 3: epiforecasts-EpiNow2 10827.408 6179.440 1697.234 2950.734
#> 4: UMass-MechBayes 15946.971 7909.983 6238.839 3859.681
#> bias interval_coverage_50 interval_coverage_90 ae_median
#> <num> <num> <num> <num>
#> 1: 0.003967611 0.6194332 0.8987854 12515.57
#> 2: 0.209473684 0.4898785 0.9068826 20049.01
#> 3: -0.043360324 0.4453441 0.8461538 14521.11
#> 4: 0.139595142 0.6072874 0.9230769 21788.14