Outbreak analytics pipelines often start with case line lists: data tables in which every row is a different case/patient, and columns record variables of potential epidemiological interest, such as dates of events (e.g. symptom onset, case notification), disease outcome, or patient data (e.g. age, sex, occupation). Such data are typically held in a data.frame (or a tibble) and used in various downstream analyses.
While this approach is functional, it often means that each analysis step will:
need to check that the required inputs are present in the data, and require the user to specify where they are (e.g. ‘This is the column where dates of onset are stored.’)
need to validate the required data (e.g. ‘Check that the field storing dates of onset is indeed a date, and not a character.’)
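To make the repetition concrete, the two checks above could look like the following in base R. This is only a sketch: check_onset() and the toy data are hypothetical and not part of linelist.

```r
# Hypothetical helper (not part of linelist): the kind of check that each
# analysis step would otherwise have to repeat on its own.
check_onset <- function(data, onset_col) {
  # 1. the user has to say where the dates of onset are stored
  if (!onset_col %in% names(data)) {
    stop("Column '", onset_col, "' not found in the data", call. = FALSE)
  }
  # 2. the column has to be validated before it is used
  if (!inherits(data[[onset_col]], "Date")) {
    stop("Column '", onset_col, "' must be of class Date", call. = FALSE)
  }
  invisible(TRUE)
}

# Toy data, made up for illustration
cases <- data.frame(
  id = 1:3,
  onset = as.Date(c("2024-01-03", "2024-01-05", "2024-01-09"))
)

check_onset(cases, "onset") # passes silently

# Dates stored as character are rejected; tryCatch() only captures the message
cases_chr <- transform(cases, onset = as.character(onset))
res <- tryCatch(check_onset(cases_chr, "onset"), error = conditionMessage)
res
```

Every function that consumes the data would need a similar block, and the column name would have to be passed around each time.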
The aim of linelist is to take care of these pre-requisites once and for all before downstream analyses, thus helping to make data pipelines more robust and straightforward.
linelist is an R package which implements a basic data representation for case line lists, alongside accessors and basic methods. It essentially provides three types of functionality:
tagging: a tagging system lets you pre-identify key epidemiological variables needed in downstream analyses (e.g. dates of case notification, symptom onset, age, gender, disease outcome)
validation: functions checking that tagged variables are indeed present in the data.frame/tibble, and that they have the expected type (e.g. checking that dates are Date, integer or numeric)
secured methods: generic functions which could lead to the loss of tagged variables have dedicated methods for linelist objects with adapted behaviours, either updating tags as needed (e.g. rename(), names() <- ...) or issuing warnings/errors when tagged variables are lost (e.g. select(), x[], x[[]])
linelist is designed to add a robust, foundational layer to your data pipelines, but it might add unnecessary complexity to your analysis scripts. Here are a few hints to gauge if you should consider using the package.
You may have use for linelist if …:
your data changes/updates over time (e.g. new entries, new variables, renamed variables)
you build data pipelines entailing multiple layers of data processing and analysis
you are looking to build reusable analysis scripts, i.e. scripts which will work on other datasets with minimal changes
Conversely, you probably do not need it if …:
you work on historical data, which has likely already been curated/validated and will no longer change
you perform some quick, simple analysis of your data, which you will not need to expand on later
your analysis scripts are very specific and will not be re-used elsewhere
Our stable versions are released periodically on CRAN, and can be installed using:
install.packages("linelist")
If you prefer using the latest features and bug fixes, you can alternatively install the development version of linelist from GitHub using the following commands:
# install remotes first if it is not already available
if (!require(remotes)) {
  install.packages("remotes")
}
remotes::install_github("epiverse-trace/linelist", build_vignettes = TRUE)
Once installed, you can load the package in your R session using:
library(linelist)
A linelist object is an instance of a data.frame or a tibble in which key epidemiological variables have been tagged. The main features of the package are broken down into the three categories outlined above.
Tags are key-value pairs that map a reference epidemiological variable to the name of a column in a data.frame or tibble. The tagging system lets you construct linelist objects, modify tags in existing objects, and check and access existing tags and the corresponding variables.
make_linelist(): to create a linelist object by tagging key epi variables in a data.frame or a tibble
set_tags(): to add, remove, or modify tags in a linelist
tags(): to list variables which have been tagged in a linelist
tags_names(): to list all recognized tag names; details on what the tags represent can be found at ?make_linelist
tags_df(): to obtain a data.frame of all the tagged variables in a linelist
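As a rough illustration, the functions listed above could be combined as follows. The cases data frame and its column names are made up for this sketch; only the linelist functions named in the list are used.

```r
library(linelist)

# Toy data, made up for illustration
cases <- data.frame(
  id = 1:3,
  onset = as.Date(c("2024-01-03", "2024-01-05", "2024-01-09")),
  age = c(34, 51, 8),
  sex = factor(c("f", "m", "f"))
)

# Create a linelist by mapping tag names to column names
x <- make_linelist(cases, date_onset = "onset", age = "age")

# List the tags currently set (a named list: tag name -> column name)
tags(x)

# Add a tag, then remove it again by setting it to NULL
x <- set_tags(x, gender = "sex")
x <- set_tags(x, gender = NULL)

# All tag names recognized by the package
tags_names()

# Tagged variables only, with columns named after their tags
tags_df(x)
```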
Basic routines are provided to validate linelist objects. More advanced validation, e.g. looking at the compatibility of dated events, will be implemented in a separate package.
validate_tags(): check that tagged variables are present in the dataset, and that tags match the pre-defined list of recognized tag names
validate_types(): check that tagged variables have an acceptable class, as defined in tags_types()
validate_linelist(): general validation of linelist objects, equivalent to running both validate_tags() and validate_types(), and checking the class of the object
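The sketch below shows how validation might be used in practice, on made-up data: one object where the onset dates are stored as Date, and one where they are stored as character. The cases data frame and its columns are assumptions for illustration only.

```r
library(linelist)

# Toy data, made up for illustration
cases <- data.frame(
  id = 1:3,
  onset = as.Date(c("2024-01-03", "2024-01-05", "2024-01-09")),
  onset_txt = c("2024-01-03", "2024-01-05", "2024-01-09"),
  age = c(34, 51, 8)
)

# Dates of onset stored as Date: validation is expected to pass
good <- make_linelist(cases, date_onset = "onset", age = "age")
validate_linelist(good)

# Dates of onset stored as character: validation is expected to fail.
# tryCatch() only turns the error into a message so the script keeps running.
bad <- make_linelist(cases, date_onset = "onset_txt", age = "age")
tryCatch(validate_linelist(bad), error = conditionMessage)

# Classes accepted for a given tag
tags_types()$date_onset
```

Running the validation once, right after tagging, is what lets later steps rely on the tagged variables without re-checking them.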
These are dedicated S3 methods for existing generics which can be used to prevent the loss of tagged variables.
lost_tags_action(): to set the behaviour to adopt when tagged variables would be lost by an operation: issue a warning (default), an error, or ignore
get_lost_tags_action(): to check the current behaviour for lost tagged variables
names<-(): the ‘base R’ approach to renaming columns of a linelist; will rename tags as needed to match the new column names
x[] and x[[]]: for subsetting columns using ‘base R’ syntax; will behave according to get_lost_tags_action() if tagged variables are lost
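A short sketch of the ‘base R’ methods described above, again on made-up data. It assumes the lost-tags behaviour is still set to its default (a warning).

```r
library(linelist)

# Toy data, made up for illustration
cases <- data.frame(
  id = 1:3,
  onset = as.Date(c("2024-01-03", "2024-01-05", "2024-01-09")),
  age = c(34, 51, 8)
)
x <- make_linelist(cases, date_onset = "onset", age = "age")

# Renaming a tagged column the base R way: the tag should follow the new name
names(x)[names(x) == "onset"] <- "date_symptoms"
tags(x)$date_onset

# Subsetting columns the base R way: dropping the tagged onset column is
# expected to trigger the current lost-tags behaviour (a warning by default).
# tryCatch() only captures the warning message for display.
tryCatch(x[, c("id", "age")], warning = conditionMessage)
```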
In this example, we use the case line list of the Hagelloch 1861 measles outbreak, distributed by the outbreaks package as measles_hagelloch_1861.
data(measles_hagelloch_1861, package = "outbreaks")
# overview of the data
head(measles_hagelloch_1861)
#> case_ID infector date_of_prodrome date_of_rash date_of_death age gender
#> 1 1 45 1861-11-21 1861-11-25 <NA> 7 f
#> 2 2 45 1861-11-23 1861-11-27 <NA> 6 f
#> 3 3 172 1861-11-28 1861-12-02 <NA> 4 f
#> 4 4 180 1861-11-27 1861-11-28 <NA> 13 m
#> 5 5 45 1861-11-22 1861-11-27 <NA> 8 f
#> 6 6 180 1861-11-26 1861-11-29 <NA> 12 m
#> family_ID class complications x_loc y_loc
#> 1 41 1 yes 142.5 100.0
#> 2 41 1 yes 142.5 100.0
#> 3 41 0 yes 142.5 100.0
#> 4 61 2 yes 165.0 102.5
#> 5 42 1 yes 145.0 120.0
#> 6 42 2 yes 145.0 120.0
Let us assume we want to tag the following variables to facilitate downstream analyses, after having checked their tag name in ?make_linelist:
the date of symptom onset, recorded here as the date of prodrome (tag: date_onset)
the date of death (tag: date_death)
the age of the patient (tag: age)
the gender of the patient (tag: gender)
We first load a few useful packages, and create a linelist with the above information:
library(tibble) # data.frame but with nice printing
library(dplyr) # for data handling
library(magrittr) # for the %>% operator
library(linelist) # this package!
x <- measles_hagelloch_1861 %>%
  tibble() %>%
  make_linelist(date_onset = "date_of_prodrome",
                date_death = "date_of_death",
                age = "age",
                gender = "gender")
head(x)
#>
#> // linelist object
#> # A tibble: 6 × 12
#> case_ID infector date_of_prodrome date_of_rash date_of_death age gender
#> <int> <int> <date> <date> <date> <dbl> <fct>
#> 1 1 45 1861-11-21 1861-11-25 NA 7 f
#> 2 2 45 1861-11-23 1861-11-27 NA 6 f
#> 3 3 172 1861-11-28 1861-12-02 NA 4 f
#> 4 4 180 1861-11-27 1861-11-28 NA 13 m
#> 5 5 45 1861-11-22 1861-11-27 NA 8 f
#> 6 6 180 1861-11-26 1861-11-29 NA 12 m
#> # ℹ 5 more variables: family_ID <int>, class <fct>, complications <fct>,
#> # x_loc <dbl>, y_loc <dbl>
#>
#> // tags: date_onset:date_of_prodrome, date_death:date_of_death, gender:gender, age:age
The printing of the object confirms that the tags have been added. If we want to double-check which variables have been tagged:
tags(x)
Now that key variables have been tagged in x, we can use these pre-defined fields in downstream analyses, without having to worry about variable names and types. We could access tagged variables using any of the following means:
# select tagged variables only
x %>%
  select(has_tag(c("date_onset", "date_death")))
#> Warning: The following tags have lost their variable:
#> gender:gender, age:age
#>
#> // linelist object
#> # A tibble: 188 × 2
#> date_of_prodrome date_of_death
#> <date> <date>
#> 1 1861-11-21 NA
#> 2 1861-11-23 NA
#> 3 1861-11-28 NA
#> 4 1861-11-27 NA
#> 5 1861-11-22 NA
#> 6 1861-11-26 NA
#> 7 1861-11-24 NA
#> 8 1861-11-21 NA
#> 9 1861-11-26 NA
#> 10 1861-11-21 NA
#> # ℹ 178 more rows
#>
#> // tags: date_onset:date_of_prodrome, date_death:date_of_death
# select tagged variables only with renaming on the fly
x %>%
  select(onset = has_tag("date_onset"))
#> Warning: The following tags have lost their variable:
#> date_death:date_of_death, gender:gender, age:age
#>
#> // linelist object
#> # A tibble: 188 × 1
#> onset
#> <date>
#> 1 1861-11-21
#> 2 1861-11-23
#> 3 1861-11-28
#> 4 1861-11-27
#> 5 1861-11-22
#> 6 1861-11-26
#> 7 1861-11-24
#> 8 1861-11-21
#> 9 1861-11-26
#> 10 1861-11-21
#> # ℹ 178 more rows
#>
#> // tags: date_onset:onset
# get all tagged variables in a data.frame
x %>%
  tags_df()
#> # A tibble: 188 × 4
#> date_onset date_death gender age
#> <date> <date> <fct> <dbl>
#> 1 1861-11-21 NA f 7
#> 2 1861-11-23 NA f 6
#> 3 1861-11-28 NA f 4
#> 4 1861-11-27 NA m 13
#> 5 1861-11-22 NA f 8
#> 6 1861-11-26 NA m 12
#> 7 1861-11-24 NA m 6
#> 8 1861-11-21 NA m 10
#> 9 1861-11-26 NA m 13
#> 10 1861-11-21 NA f 7
#> # ℹ 178 more rows
Because x remains a valid tibble, we can use any data handling operations implemented in dplyr. However, some of these operations may cause accidental removal of key tagged variables. linelist provides a safeguard mechanism against this. For instance, let’s assume we want to select only some columns of x:
x %>%
  select(1:2)
#> Warning: The following tags have lost their variable:
#> date_onset:date_of_prodrome, date_death:date_of_death, gender:gender, age:age
#>
#> // linelist object
#> # A tibble: 188 × 2
#> case_ID infector
#> <int> <int>
#> 1 1 45
#> 2 2 45
#> 3 3 172
#> 4 4 180
#> 5 5 45
#> 6 6 180
#> 7 7 42
#> 8 8 45
#> 9 9 182
#> 10 10 45
#> # ℹ 178 more rows
#>
#> // tags: [no tagged variable]
The command above issued a meaningful warning, because select() removed variables that were tagged.
We can also use the has_tag() select helper to select columns via their tag. For example, to retain the first 2 variables, and the variable tagged as gender:
# hybrid selection
x %>%
  select(1:2, has_tag("gender"))
#> Warning: The following tags have lost their variable:
#> date_onset:date_of_prodrome, date_death:date_of_death, age:age
#>
#> // linelist object
#> # A tibble: 188 × 3
#> case_ID infector gender
#> <int> <int> <fct>
#> 1 1 45 f
#> 2 2 45 f
#> 3 3 172 f
#> 4 4 180 m
#> 5 5 45 f
#> 6 6 180 m
#> 7 7 42 m
#> 8 8 45 m
#> 9 9 182 m
#> 10 10 45 f
#> # ℹ 178 more rows
#>
#> // tags: gender:gender
Again, we observe a warning due to the loss of tagged variables in the operation. This behaviour can be silenced if needed, or changed to issue an error (for stricter pipelines, for instance):
# hybrid selection - warning (default behaviour)
x %>%
  select(1:2, has_tag("gender"))
#> Warning: The following tags have lost their variable:
#> date_onset:date_of_prodrome, date_death:date_of_death, age:age
#>
#> // linelist object
#> # A tibble: 188 × 3
#> case_ID infector gender
#> <int> <int> <fct>
#> 1 1 45 f
#> 2 2 45 f
#> 3 3 172 f
#> 4 4 180 m
#> 5 5 45 f
#> 6 6 180 m
#> 7 7 42 m
#> 8 8 45 m
#> 9 9 182 m
#> 10 10 45 f
#> # ℹ 178 more rows
#>
#> // tags: gender:gender
# hybrid selection - no warning
lost_tags_action("none")
#> Lost tags will now be ignored.
x %>%
  select(1:2, has_tag("gender"))
#>
#> // linelist object
#> # A tibble: 188 × 3
#> case_ID infector gender
#> <int> <int> <fct>
#> 1 1 45 f
#> 2 2 45 f
#> 3 3 172 f
#> 4 4 180 m
#> 5 5 45 f
#> 6 6 180 m
#> 7 7 42 m
#> 8 8 45 m
#> 9 9 182 m
#> 10 10 45 f
#> # ℹ 178 more rows
#>
#> // tags: gender:gender
# hybrid selection - error due to lost tags
lost_tags_action("error")
#> Lost tags will now issue an error.
x %>%
  select(1:2, has_tag("gender"))
#> Error: The following tags have lost their variable:
#> date_onset:date_of_prodrome, date_death:date_of_death, age:age
# note that `lost_tags_action` sets the behavior for any later operation, so we
# need to reset the default
get_lost_tags_action() # check current behaviour
#> [1] "error"
lost_tags_action() # reset default
#> Lost tags will now issue a warning.
If you wish to change the lost_tags_action in a way that persists across R sessions, you can do so by setting the LINELIST_LOST_ACTION environment variable. For example, your .Renviron file could contain the following line:
LINELIST_LOST_ACTION="error"
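After editing .Renviron and restarting R, one way to confirm that the variable was picked up is to query it with base R. This is only a quick check of the environment, not a linelist function.

```r
# Value seen by the current R session; NA here means the variable is not set
# (for example because R has not been restarted since editing .Renviron)
Sys.getenv("LINELIST_LOST_ACTION", unset = NA)
```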