# A tibble: 6 × 8
trt age marker stage grade response death ttdeath
<chr> <dbl> <dbl> <fct> <fct> <int> <int> <dbl>
1 Drug A 23 0.16 T1 II 0 0 24
2 Drug B 9 1.11 T2 I 1 0 24
3 Drug A 31 0.277 T1 II 0 0 24
4 Drug A NA 2.07 T3 III 1 1 17.6
5 Drug A 51 2.77 T4 III 1 1 16.4
6 Drug B 39 0.613 T4 I 0 1 15.6
Basic tbl_summary() table
When including both numeric and categorical variables, a long table will be created.
Code
tbl_summary(data = trial, include = c(age, marker, grade))
Characteristic          N = 200¹
Age                     47 (38, 57)
  Unknown               11
Marker Level (ng/mL)    0.64 (0.22, 1.41)
  Unknown               10
Grade
  I                     68 (34%)
  II                    68 (34%)
  III                   64 (32%)
¹ Median (Q1, Q3); n (%)
We can stratify by treatment arm by using the by argument.
Code
# stratify by treatment arm
tbl_summary(data = trial, include = c(age, marker, grade), by = trt, missing = "no")
Characteristic          Drug A, N = 98¹      Drug B, N = 102¹
Age                     46 (37, 60)          48 (39, 56)
Marker Level (ng/mL)    0.84 (0.23, 1.60)    0.52 (0.18, 1.21)
Grade
  I                     35 (36%)             33 (32%)
  II                    32 (33%)             36 (35%)
  III                   31 (32%)             33 (32%)
¹ Median (Q1, Q3); n (%)
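The stratified call above sets `missing = "no"`, which hides the Unknown rows. As a minimal sketch of the alternative, the same table can keep missing counts where they occur and relabel them; the label "Missing" and the object name `tbl_missing` are illustrative choices, not part of the original post.

```r
library(gtsummary)

# Same stratified table, but show missing counts only for variables that
# actually have missing values, and relabel the default "Unknown" row.
tbl_missing <- tbl_summary(
  data = trial,
  include = c(age, marker, grade),
  by = trt,
  missing = "ifany",        # other accepted values: "no", "always"
  missing_text = "Missing"  # illustrative label; the default is "Unknown"
)
tbl_missing
```

With `missing = "ifany"`, Grade gets no extra row because it has no missing values in `trial`.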
Stratify by 2 variables
In the example above, we stratified by one variable: treatment arm. The other variables are summarized as rows. Notice the indentation: there are still only 3 variables (age, marker, grade), and the indented rows under Grade are its levels.
What if you want to break down a row variable by another variable? For example, you may want to see age across the different levels of stage. You can do this:
Code
# stratify by stage
tbl_summary(data = trial, include = c(age), by = stage, missing = "no")
Characteristic    T1, N = 53¹    T2, N = 54¹    T3, N = 43¹    T4, N = 50¹
Age               45 (36, 57)    48 (42, 55)    50 (39, 60)    46 (37, 56)
¹ Median (Q1, Q3)
The disadvantage of this is that you cannot stratify by treatment arm at the same time. You need to make use of the wide format data.
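As an aside, gtsummary also exports `tbl_strata()`, which builds one `tbl_summary()` per stratum and merges them side by side. The sketch below is not the approach this post takes, and it produces a different layout (stages as columns nested under each treatment arm rather than as rows); the object name `tbl_nested` is illustrative.

```r
library(gtsummary)
library(dplyr)

# One age-by-stage table per treatment arm, merged into a single table.
tbl_nested <- trial |>
  select(trt, stage, age) |>
  tbl_strata(
    strata = trt,
    .tbl_fun = ~ .x |>
      tbl_summary(by = stage, include = age, missing = "no")
  )
tbl_nested
```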
Use wide format
Add an ID variable before reshaping to wide format, so that each row stays uniquely identifiable. Without it, patients who share the same treatment arm and stage cannot be told apart after pivoting.
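To see why the ID matters, here is a small sketch of what happens when it is left out. The object name `no_id_wide` is illustrative, and the warning that `pivot_wider()` emits is silenced on purpose.

```r
library(gtsummary)
library(dplyr)
library(tidyr)

# Without an ID, `trt` is the only identifying column left, so each arm
# collapses into a single row and every stage column becomes a list-column.
# pivot_wider() warns that the values are not uniquely identified.
no_id_wide <- suppressWarnings(
  trial |>
    select(trt, age, stage) |>
    pivot_wider(names_from = stage, values_from = age)
)

nrow(no_id_wide)        # one row per treatment arm, not one per patient
is.list(no_id_wide$T1)  # TRUE: the ages are bundled into a list-column
```

A table like this cannot be passed to `tbl_summary()` as patient-level data, which is why the ID column is added first.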
Code
trial_mini <- select(trial, trt, age, stage) |>
  mutate(id = 1:nrow(trial))
# wide format
trial_mini_wide <- trial_mini |>
  pivot_wider(names_from = stage, values_from = age)
# what the long format looks like
head(trial_mini)
# A tibble: 6 × 4
trt age stage id
<chr> <dbl> <fct> <int>
1 Drug A 23 T1 1
2 Drug B 9 T2 2
3 Drug A 31 T1 3
4 Drug A NA T3 4
5 Drug A 51 T4 5
6 Drug B 39 T4 6
Code
# stage variable is expanded into 4 variables: T1, T2, T3, T4
head(trial_mini_wide)
# A tibble: 6 × 6
trt id T1 T2 T3 T4
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 Drug A 1 23 NA NA NA
2 Drug B 2 NA 9 NA NA
3 Drug A 3 31 NA NA NA
4 Drug A 4 NA NA NA NA
5 Drug A 5 NA NA NA 51
6 Drug B 6 NA NA NA 39
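A quick structural check of the wide data can be sketched as follows. It rebuilds `trial_mini_wide` so the block runs on its own, using `row_number()` in place of `1:nrow(trial)` (equivalent here); the helper names `stage_cols` and `filled` are illustrative.

```r
library(gtsummary)
library(dplyr)
library(tidyr)

trial_mini_wide <- trial |>
  select(trt, age, stage) |>
  mutate(id = row_number()) |>
  pivot_wider(names_from = stage, values_from = age)

stage_cols <- c("T1", "T2", "T3", "T4")

# The reshape keeps one row per patient.
nrow(trial_mini_wide) == nrow(trial)

# Each patient has an age in at most one stage column
# (zero when the patient's age is missing).
filled <- rowSums(!is.na(trial_mini_wide[stage_cols]))
table(filled)

# Number of non-missing ages available for each stage column.
colSums(!is.na(trial_mini_wide[stage_cols]))
```

Because every other stage column is `NA` for a given patient, `missing = "no"` is what keeps the summary tables from being dominated by Unknown rows.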
Now you can summarize all the stage variables together using the include argument. They will be stacked as rows, each one summarizing the target numeric variable, age, within one stage.
---
title: "Which variable to summarize in `tbl_summary()`"
code-fold: true
categories: [tbl_summary]
---

Quick way to remember where to put what variable:

- `by`: stratification, as column. Typically treatment arm.
- `include`: variables to summarize, as rows.
  - continuous: value in one row, missing in another row.
  - categorical: one row per category, missing in another row.

To stratify by **more than one variable**, you can make use of the **wide format** data.

### Load data and packages

We are going to use the `trial` data from `gtsummary`.

```{r}
#| eval: true
#| message: false
#| warning: false
library(gtsummary)
library(dplyr)
library(tidyr)
head(trial)
```

# Basic `tbl_summary()` table

When including both numeric and categorical variables, a long table will be created.

```{r}
#| eval: true
tbl_summary(data = trial, include = c(age, marker, grade))
```

We can stratify by treatment arm by using the `by` argument.

```{r}
#| eval: true
# stratify by treatment arm
tbl_summary(data = trial, include = c(age, marker, grade), by = trt, missing = "no")
```

# Stratify by 2 variables

In the example above, we stratified by one variable: treatment arm. The other variables are summarized as rows. Notice the indentation: there are still only 3 variables (age, marker, grade), and the indented rows under Grade are its levels.

What if you want to break down a row variable by another variable? For example, you may want to see the **age across different levels of stage**. You can do this:

```{r}
# stratify by stage
tbl_summary(data = trial, include = c(age), by = stage, missing = "no")
```

The disadvantage of this is that you cannot stratify by **treatment arm** at the same time.
You need to make use of the wide format data.

### Use wide format

Add an ID variable before reshaping to wide format, so that each row stays uniquely identifiable. Without it, patients who share the same treatment arm and stage cannot be told apart after pivoting.

```{r}
trial_mini <- select(trial, trt, age, stage) |>
  mutate(id = 1:nrow(trial))
# wide format
trial_mini_wide <- trial_mini |>
  pivot_wider(names_from = stage, values_from = age)
# what the long format looks like
head(trial_mini)
# stage variable is expanded into 4 variables: T1, T2, T3, T4
head(trial_mini_wide)
```

Now you can summarize all the stage variables together using the `include` argument. They will be stacked as rows, each one summarizing the target numeric variable, **age**, within one stage.

```{r}
tbl_summary(
  trial_mini_wide,
  include = c('T1', 'T2', 'T3', 'T4'),
  missing = "no"
)
```

You should label the variables in the wide format so that the row names are informative. Use the `label` argument, and specify a list.

```{r}
tbl_summary(
  trial_mini_wide,
  by = trt,
  include = c('T1', 'T2', 'T3', 'T4'),
  missing = 'no',
  label = list(
    'T1' = "Stage T1",
    'T2' = "Stage T2",
    'T3' = "Stage T3",
    'T4' = "Stage T4"
  )
)
```

### What statistics?

You can specify a statistic other than the default with the `statistic` argument.

```{r}
tbl_summary(
  trial_mini_wide,
  by = trt,
  include = c('T1', 'T2', 'T3', 'T4'),
  missing = 'no',
  label = list(
    'T1' = "Stage T1",
    'T2' = "Stage T2",
    'T3' = "Stage T3",
    'T4' = "Stage T4"
  ),
  statistic = list(all_continuous() ~ "{mean} ({sd})")
)
```
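As a sanity check on the wide-format table, the same mean and SD of age per stage and treatment arm can be computed directly from the long data with dplyr. This is only a cross-check sketch; the object and column names (`age_by_arm_stage`, `mean_age`, `sd_age`) are illustrative.

```{r}
library(gtsummary)
library(dplyr)

# Mean and SD of age within each stage-by-arm cell, ignoring missing ages
# (this mirrors missing = "no" in the tables above).
age_by_arm_stage <- trial |>
  filter(!is.na(age)) |>
  group_by(stage, trt) |>
  summarise(
    n = n(),
    mean_age = mean(age),
    sd_age = sd(age),
    .groups = "drop"
  )
age_by_arm_stage
```

Each row of this result corresponds to one cell of the wide-format table: the stage gives the row and the treatment arm gives the column.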