A substitute for plyr’s mapvalues
The mapvalues() function in plyr was immensely useful, but with plyr’s retirement, it’s time to move on and find something better.
The R package plyr was retired in 2023. In any case, it was redundant because dplyr had been created to take its place, although some of the functions differed. One of the functions to be lost was mapvalues(), which I liked and used quite a bit. Thankfully, there is an alternative approach available.
A Reproducible Example
Let’s start with a tiny sample of data from this paper:
df <- structure(list(barcode = c("B2020001c14r10n176", "B2020001c14r08n136",
"B2020001c02r07n129", "B2020001c13r10n177", "B2020001c07r08n142",
"B2020001c08r06n106", "B2020001c12r05n086", "B2020001c03r07n116",
"B2020001c19r08n143", "B2020001c13r07n131", "B2020001c08r10n180"
), line = c("14", "14", "2", "13", "7", "8", "12", "3", "19", "13", "8"),
filename = c("B2019004t0g076n125_2019-07-29 17-52-46_B2019004_90-deg_46497501_2410_0.png",
"B2019004t1g110n001_2019-07-31 00-03-59_B2019004_0-deg_46733801_1065_0.png",
"B2019004t1g142n470_2019-08-14 08-40-04_B2019004_90-deg_48091701_2241_0.png",
"B2019004t1g004n153_2019-09-04 02-51-08_B2019004_0-deg_49923101_1078_0.png",
"B2019004t1g182n471_2019-08-25 08-40-59_B2019004_90-deg_48992601_2245_0.png",
"B2019004t1g031n079_2019-08-21 01-29-31_B2019004_0-deg_48532901_2206_0.png",
"B2019004t1g155n627_2019-08-11 11-39-30_B2019004_0-deg_47796701_1772_0.png",
"B2019004t0g179n347_2019-08-18 06-24-41_B2019004_90-deg_48408201_1754_0.png",
"B2019004t1g143n688_2019-09-01 12-44-45_B2019004_90-deg_49741701_1528_0.png",
"B2019004t0g262n880_2019-08-25 16-19-18_B2019004_90-deg_49115301_1004_0.png",
"B2019004t0g190n444_2019-08-14 08-10-22_B2019004_0-deg_48083801_2402_0.png"
), count_unique_branches = c(2L, 1L, 1L, 2L, 4L, 2L, 2L,
4L, 3L, 5L, 3L)), class = "data.frame", row.names = c(NA,
-11L))
Here, I have some data about branch numbers related to individual lentil plants, each of which has a barcode. I need to identify which variety each plant belongs to, based on the line number. Thankfully, I have that data in another data frame:
cultivars <- structure(list(variety = c("Aldinga", "CDC Ruby", "CIPAL0717",
"Cobber", "Commondo", "Cumra", "Digger", "Eston", "ILL2024",
"ILL7537", "Indianhead", "Matilda", "Nipper", "Northfield", "PBA Bolt",
"SP1333", "PBA Hallmark XT", "PBA Jumbo2", "PBA Greenfield"),
line_number = c("1", "2", "3", "4", "5", "6", "7", "8", "9",
"10", "11", "12", "13", "14", "15", "16", "17", "18", "19"
)), class = "data.frame", row.names = c(NA, -19L))
So what I need to do is pull out the variety data by matching line_number from cultivars with line from df.
plyr
The plyr method to achieve this was easy. First create a vector (which I’ll call genotype) with the mapped values:
library(plyr)
genotype <- plyr::mapvalues(df$line, from=cultivars$line_number, to=cultivars$variety)
Then add it to the original data frame (df):
df$genotype <- genotype
dplyr: A better way
The better method is to use dplyr’s left_join() function as follows:
library(tidyverse)
join <- left_join(df, cultivars, join_by(line == line_number))
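The joined column keeps the name it had in cultivars (variety), whereas the plyr version stored the result as genotype. A minimal sketch of renaming it in the same pipeline follows. It assumes dplyr 1.1.0 or later (for join_by()) and R 4.1 or later (for the native pipe), and it uses cut-down stand-ins, df_small and cultivars_small, built from a few rows of the data above; those two names are mine, not part of the original example.

```r
library(dplyr)

# Cut-down stand-ins for the df and cultivars data frames above
df_small <- data.frame(
  barcode = c("B2020001c14r10n176", "B2020001c02r07n129", "B2020001c07r08n142"),
  line = c("14", "2", "7")
)
cultivars_small <- data.frame(
  variety = c("CDC Ruby", "Digger", "Northfield"),
  line_number = c("2", "7", "14")
)

# Join on line == line_number, then rename variety to genotype
joined <- df_small |>
  left_join(cultivars_small, join_by(line == line_number)) |>
  rename(genotype = variety)

joined$genotype
```

With an equality join like this one, the key column from the second data frame (line_number) is not repeated in the result, so only genotype is added.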
A left join keeps all observations in x, the first data frame passed to it. So in this case, we’re getting dplyr to:
- Keep all observations from df.
- Match the data in df with cultivars, specifically by matching df$line with cultivars$line_number.
Make sure that the columns being matched are of the same data type: you can’t match an integer with a character string, for instance.
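If the key columns do differ in type, one option is to convert one of them before joining. The sketch below uses hypothetical data frames (plants and lookup, not part of the example above) in which the line numbers are stored as integers on one side and as character strings on the other.

```r
library(dplyr)

# Hypothetical case: line numbers stored as integers in one table
plants <- data.frame(line = c(14L, 2L))
lookup <- data.frame(
  variety = c("CDC Ruby", "Northfield"),
  line_number = c("2", "14")
)

# Convert the key to character so both sides share a type before joining
fixed <- plants |>
  mutate(line = as.character(line)) |>
  left_join(lookup, join_by(line == line_number))

fixed$variety
```

Without the mutate() step, left_join() should stop with an error about incompatible types rather than silently returning no matches.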
Note that an inner_join() would also work here, but could result in the loss of data because unmatched rows in either input are not included in the result.
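To make the difference concrete, here is a small sketch with hypothetical data (plants and lookup are illustrative names) in which one plant has a line number that is missing from the lookup table.

```r
library(dplyr)

# Hypothetical data: line "99" has no entry in the lookup table
plants <- data.frame(
  barcode = c("a", "b", "c"),
  line = c("2", "99", "7")
)
lookup <- data.frame(
  variety = c("CDC Ruby", "Digger"),
  line_number = c("2", "7")
)

left  <- left_join(plants, lookup, join_by(line == line_number))
inner <- inner_join(plants, lookup, join_by(line == line_number))

nrow(left)   # 3: the unmatched plant is kept, with variety set to NA
nrow(inner)  # 2: the unmatched plant is dropped
```

This also differs from mapvalues(), which leaves values with no match unchanged rather than replacing them with NA, so it is worth checking for NA values after switching to a join.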