Cumulative totals by group in R
Use Tidyverse’s dplyr package in R to add cumulative totals to a table based on group.
This week I needed to add some cumulative totals to a dataset in R by group. This isn’t too difficult with the dplyr package from Tidyverse. To demonstrate this concept, we’ll use some timely data from the State of Victoria showing daily COVID-19 vaccination rates. Here’s the data from 1 March 2021 to 4 June 2021:
date <- c("1-Mar-21", "2-Mar-21", "3-Mar-21", "4-Mar-21", "5-Mar-21", "6-Mar-21", "7-Mar-21", "8-Mar-21", "9-Mar-21", "10-Mar-21", "11-Mar-21", "12-Mar-21", "13-Mar-21", "14-Mar-21", "15-Mar-21", "16-Mar-21", "17-Mar-21", "18-Mar-21", "19-Mar-21", "20-Mar-21", "21-Mar-21", "22-Mar-21", "23-Mar-21", "24-Mar-21", "25-Mar-21", "26-Mar-21", "27-Mar-21", "28-Mar-21", "29-Mar-21", "30-Mar-21", "31-Mar-21", "1-Apr-21", "2-Apr-21", "3-Apr-21", "4-Apr-21", "5-Apr-21", "6-Apr-21", "7-Apr-21", "8-Apr-21", "9-Apr-21", "10-Apr-21", "11-Apr-21", "12-Apr-21", "13-Apr-21", "14-Apr-21", "15-Apr-21", "16-Apr-21", "17-Apr-21", "18-Apr-21", "19-Apr-21", "20-Apr-21", "21-Apr-21", "22-Apr-21", "23-Apr-21", "24-Apr-21", "25-Apr-21", "26-Apr-21", "27-Apr-21", "28-Apr-21", "29-Apr-21", "30-Apr-21", "1-May-21", "2-May-21", "3-May-21", "4-May-21", "5-May-21", "6-May-21", "7-May-21", "8-May-21", "9-May-21", "10-May-21", "11-May-21", "12-May-21", "13-May-21", "14-May-21", "15-May-21", "16-May-21", "17-May-21", "18-May-21", "19-May-21", "20-May-21", "21-May-21", "22-May-21", "23-May-21", "24-May-21", "25-May-21", "26-May-21", "27-May-21", "28-May-21", "29-May-21", "30-May-21", "31-May-21", "1-Jun-21", "2-Jun-21", "3-Jun-21", "4-Jun-21")
vaccinated <- c(100, 1158, 1656, 1969, 2262, 1944, 727, 844, 1072, 2831, 3992, 3631, 4350, 904, 606, 3474, 3666, 4041, 4742, 4552, 1398, 1342, 4486, 4385, 5698, 6034, 5072, 3512, 2072, 6483, 6300, 8457, 7555, 910, 247, 443, 1023, 6059, 5778, 131513, 13256, 7408, 3528, 13094, 14382, 14581, 14182, 13009, 5283, 18777, 16705, 15074, 16937, 19185, 13395, 6376, 1109, 11854, 17195, 20489, 16805, 12503, 4964, 2281, 19155, 23867, 22537, 24564, 18528, 7956, 3213, 22044, 19769, 23393, 21895, 19954, 7667, 4370, 24217, 25162, 24961, 26553, 21741, 9175, 4964, 11424, 30498, 30426, 41389, 43458, 27448, 20909, 40389, 51176, 50710, 21475)
We need to format the dates so R can read them:
date <- as.Date(date, "%d-%b-%y")
We also need to load the Tidyverse package (which contains dplyr):
library(tidyverse)
First, we make a dataframe combining the dates and the daily vaccination numbers:
df <- data.frame(date, vaccinated)
Cumulative totals are easy using the cumsum()
function combined with dplyr’s mutate()
:
df <- df %>% mutate(cumulative=cumsum(vaccinated))
Here’s a little bit of output generated using head(df)
:
date vaccinated cumulative
1 2021-03-01 100 100
2 2021-03-02 1158 1258
3 2021-03-03 1656 2914
4 2021-03-04 1969 4883
5 2021-03-05 2262 7145
6 2021-03-06 1944 9089
Looks good.
Now what if we wanted to subset the data by groups? For simplicity in this example, we’ll add some categorical information and break our data into groups based on week of the year. (Typical experimental data would already have categorical information). To add the week number to our data frame, we use the following commands:
week <- as.numeric(format(as.Date(date), "%W"))
df <- data.frame(df, week)
Now we have some group data. Let’s check it with print(df)
(I’ll only show a sample of the output here):
date vaccinated cumulative week
1 2021-03-01 100 100 9
2 2021-03-02 1158 1258 9
3 2021-03-03 1656 2914 9
4 2021-03-04 1969 4883 9
5 2021-03-05 2262 7145 9
6 2021-03-06 1944 9089 9
7 2021-03-07 727 9816 9
8 2021-03-08 844 10660 10
9 2021-03-09 1072 11732 10
10 2021-03-10 2831 14563 10
11 2021-03-11 3992 18555 10
12 2021-03-12 3631 22186 10
13 2021-03-13 4350 26536 10
14 2021-03-14 904 27440 10
15 2021-03-15 606 28046 11
16 2021-03-16 3474 31520 11
17 2021-03-17 3666 35186 11
18 2021-03-18 4041 39227 11
19 2021-03-19 4742 43969 11
20 2021-03-20 4552 48521 11
21 2021-03-21 1398 49919 11
22 2021-03-22 1342 51261 12
23 2021-03-23 4486 55747 12
24 2021-03-24 4385 60132 12
25 2021-03-25 5698 65830 12
26 2021-03-26 6034 71864 12
27 2021-03-27 5072 76936 12
28 2021-03-28 3512 80448 12
...
Now we need to summarise the data (to get a weekly total number of vaccinations) then mutate the output to calculate a cumulative total for the weekly statistics. Let’s do it in two steps:
summary <- df %>% group_by(week) %>% summarise(total_vaccinated = sum(vaccinated), .groups = 'drop')
This summarises the totals. If we head(summary)
, we get the following output:
# A tibble: 6 x 2
week total_vaccinated
<dbl> <dbl>
1 9 9816
2 10 17624
3 11 22479
4 12 30529
5 13 32024
6 14 165480
Now we can add the mutate()
function to get a weekly cumulative total:
summary <- df %>% group_by(week) %>% summarise(total_vaccinated = sum(vaccinated), .groups = 'drop') %>% mutate(cumulative_vaccinated = cumsum(total_vaccinated))
This yields the following output:
# A tibble: 14 x 3
week total_vaccinated cumulative_vaccinated
<dbl> <dbl> <dbl>
1 9 9816 9816
2 10 17624 27440
3 11 22479 49919
4 12 30529 80448
5 13 32024 112472
6 14 165480 277952
7 15 78059 356011
8 16 106449 462460
9 17 84919 547379
10 18 118888 666267
11 19 117935 784202
12 20 136179 920381
13 21 189607 1109988
14 22 184659 1294647
And there we have it: summarised data with cumulative totals by group (in this case week of vaccination). Don’t forget that this can be converted into a data frame via data.frame(summary)
.
Comments
No comments have yet been submitted. Be the first!