| Title: | Parse and Deduplicate Author Names |
|---|---|
| Description: | Utilities to parse author fields from DESCRIPTION files, and general-purpose functions to deduplicate names in a database, beyond the specific case of R package authors. |
| Authors: | Hugo Gruson [aut, cre, cph] (ORCID: <https://orcid.org/0000-0002-4094-1476>), Chris Hartgerink [rev] (ORCID: <https://orcid.org/0000-0003-1050-6809>), data.org [fnd] (until version 0.2.0 included) |
| Maintainer: | Hugo Gruson |
| License: | MIT + file LICENSE |
| Version: | 0.2.0.9000 |
| Built: | 2026-05-31 10:29:28 UTC |
| Source: | https://github.com/Bisaloo/authoritative |
A data.frame of historical metadata from epidemiology-related CRAN packages.
cran_epidemiology_packages
A data.frame with 5 variables:
package name
package version
authors as listed in the Authors@R field from the DESCRIPTION file
authors as listed in the Author field from the DESCRIPTION file
package maintainer
Expand names from abbreviated forms or initials
expand_names(short, expanded)
| Argument | Description |
|---|---|
| short | A character vector of potentially abbreviated names |
| expanded | A character vector of potentially expanded names |
When you have a list x of abbreviated and non-abbreviated names and you want
to deduplicate them, this function can be used as expand_names(x, x), which
will return the most expanded version available in x for each name.
A character vector with the same length as short
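To make the notion of an "abbreviated form" concrete, here is a minimal base R sketch of one way to test whether a short name could be an abbreviation of a longer one. The helper `is_abbreviation_of()` is hypothetical and written only for this illustration. It is not part of the package, and it is an assumption that expand_names() uses comparable matching rules.

```r
# Hypothetical helper for illustration only; not the package's implementation.
# Returns TRUE when every word of `short` is a prefix of some word of
# `expanded`, with the words appearing in the same order.
is_abbreviation_of <- function(short, expanded) {
  short_parts <- strsplit(short, " ", fixed = TRUE)[[1]]
  expanded_parts <- strsplit(expanded, " ", fixed = TRUE)[[1]]
  pos <- 1
  for (part in short_parts) {
    # Skip expanded words until one starts with the current short word
    while (pos <= length(expanded_parts) &&
             !startsWith(expanded_parts[pos], part)) {
      pos <- pos + 1
    }
    if (pos > length(expanded_parts)) {
      return(FALSE)
    }
    pos <- pos + 1
  }
  TRUE
}

is_abbreviation_of("W A Mozart", "Wolfgang Amadeus Mozart")
is_abbreviation_of("Wolfgang Mozart", "Wolfgang Amadeus Mozart")
is_abbreviation_of("Ludwig Beethoven", "Wolfgang Amadeus Mozart")
```

The first two calls should return TRUE and the last one FALSE. A real implementation would also need to handle punctuation after initials, case differences, and ambiguous matches, which this sketch ignores.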
    expand_names(
      c("W A Mozart", "Wolfgang Mozart", "Wolfgang A Mozart"),
      "Wolfgang Amadeus Mozart"
    )
    # Real-case application example
    # Deduplicate names in list, as described in "details"
    epi_pkg_authors <- cran_epidemiology_packages |>
      subset(!is.na(`Authors@R`), `Authors@R`, drop = TRUE) |>
      parse_authors_r() |>
      # Drop email, role, ORCID and format as string rather than person object
      lapply(function(x) format(x, include = c("given", "family"))) |>
      unlist()
    # With all duplicates
    length(unique(epi_pkg_authors))
    # Deduplicate
    epi_pkg_authors_normalized <- expand_names(epi_pkg_authors, epi_pkg_authors)
    length(unique(epi_pkg_authors_normalized))
Invert 'LastName FirstName' to 'FirstName LastName' (or the reverse)
invert_names(names, correct_names)
| Argument | Description |
|---|---|
| names | A character vector of potentially inverted names |
| correct_names | A character vector of correct names |
When you have a list x of mixed 'First Last' and 'Last First' names, but no
source of truth, and you want to deduplicate them, this function can be used
as invert_names(x, x), which will return the most common version available
in x for each name.
A character vector with the same length as names
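The idea of inverting names against a reference can be sketched in a few lines of base R. The helper `swap_if_known()` below is hypothetical and exists only for this illustration. It covers just the case where a list of correct names is supplied, and it is an assumption that invert_names() behaves similarly, since its actual implementation is not described here.

```r
# Hypothetical helper for illustration only; not the package's implementation.
# A name is replaced by its reversed word order only when the original is not
# a known correct name but the reversed form is.
swap_if_known <- function(names, correct_names) {
  reversed <- vapply(
    strsplit(names, " ", fixed = TRUE),
    function(parts) paste(rev(parts), collapse = " "),
    character(1)
  )
  needs_swap <- !(names %in% correct_names) & reversed %in% correct_names
  ifelse(needs_swap, reversed, names)
}

swap_if_known(
  c("Wolfgang Mozart", "Mozart Wolfgang", "Clara Schumann"),
  "Wolfgang Mozart"
)
```

Names with no match in either order, such as the third one here, are left unchanged, which keeps the output the same length as the input.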
    invert_names(
      c("Wolfgang Mozart", "Mozart Wolfgang"),
      "Wolfgang Mozart"
    )
    # Real-case application example
    # Deduplicate names in list, as described in "details"
    epi_pkg_authors <- cran_epidemiology_packages |>
      subset(!is.na(`Authors@R`), `Authors@R`, drop = TRUE) |>
      parse_authors_r() |>
      # Drop email, role, ORCID and format as string rather than person object
      lapply(function(x) format(x, include = c("given", "family"))) |>
      unlist()
    # With all duplicates
    length(unique(epi_pkg_authors))
    # Deduplicate
    epi_pkg_authors_normalized <- invert_names(epi_pkg_authors, epi_pkg_authors)
    length(unique(epi_pkg_authors_normalized))
Parse the Author field from a DESCRIPTION file into a person object
parse_authors(author_string)
| Argument | Description |
|---|---|
| author_string | A character containing the Author field |
A character vector, or a list of character vectors of length equal
to the length of author_string
    # Read from a DESCRIPTION file directly
    utils_description <- system.file("DESCRIPTION", package = "utils")
    utils_authors <- read.dcf(utils_description, "Author")
    parse_authors(utils_authors)
    # Read from a database of CRAN metadata
    cran_epidemiology_packages$Author |>
      parse_authors() |>
      unlist() |>
      unique() |>
      sort()
Parse the Authors@R field from a DESCRIPTION file into a person object
parse_authors_r(authors_r_string)
| Argument | Description |
|---|---|
| authors_r_string | A character containing the Authors@R field |
A person object, or a list of person objects of length equal
to the length of authors_r_string
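For readers unfamiliar with the field, an Authors@R entry is a string of R code that builds a person object with utils::person(). The base R sketch below shows what such a string looks like and how it maps to a person object. The field value is made up for this illustration, and evaluating it with eval(parse()) is a simplification rather than a description of how parse_authors_r() works internally.

```r
# Hypothetical Authors@R value, made up for this illustration
authors_r <- 'c(
  person("Ada", "Lovelace", role = c("aut", "cre"), email = "ada@example.org"),
  person("Charles", "Babbage", role = "ctb")
)'

# Evaluate the R code stored in the field.
# Only do this with input you trust: eval() runs arbitrary code.
people <- eval(parse(text = authors_r))
class(people)

# Keep only given and family names, as plain strings
plain_names <- format(people, include = c("given", "family"))
plain_names
```

Formatting with include = c("given", "family") is the same step used in the deduplication examples to drop emails, roles, and ORCIDs before comparing names.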
    # Read from a DESCRIPTION file directly
    pkg_description <- system.file("DESCRIPTION", package = "authoritative")
    authors_r_pkg <- read.dcf(pkg_description, "Authors@R")
    parse_authors_r(authors_r_pkg)
    # Read from a database of CRAN metadata
    cran_epidemiology_packages |>
      subset(!is.na(`Authors@R`), `Authors@R`, drop = TRUE) |>
      parse_authors_r() |>
      head()