Which variable to summarize in tbl_summary()

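Quick way to remember where to put which variable: the variable passed to by defines the columns (strata), and the variables passed to include are summarized as rows.

To stratify by more than one variable, reshape the data into wide format first.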

Load data and packages

We are going to use the trial data from gtsummary.

Code
library(gtsummary)
library(dplyr)
library(tidyr)
head(trial)
# A tibble: 6 × 8
  trt      age marker stage grade response death ttdeath
  <chr>  <dbl>  <dbl> <fct> <fct>    <int> <int>   <dbl>
1 Drug A    23  0.16  T1    II           0     0    24  
2 Drug B     9  1.11  T2    I            1     0    24  
3 Drug A    31  0.277 T1    II           0     0    24  
4 Drug A    NA  2.07  T3    III          1     1    17.6
5 Drug A    51  2.77  T4    III          1     1    16.4
6 Drug B    39  0.613 T4    I            0     1    15.6

Basic tbl_summary() table

When you include both numeric and categorical variables, tbl_summary() stacks them as rows, so the table grows long.

Code
tbl_summary(data = trial, include = c(age, marker, grade))
Characteristic           N = 200¹
Age                      47 (38, 57)
    Unknown              11
Marker Level (ng/mL)     0.64 (0.22, 1.41)
    Unknown              10
Grade
    I                    68 (34%)
    II                   68 (34%)
    III                  64 (32%)
¹ Median (Q1, Q3); n (%)

We can stratify by treatment arm by using the by argument.

Code
# stratify by treatment arm
tbl_summary(data = trial, include = c(age, marker, grade), by = trt, missing = "no")
Characteristic           Drug A, N = 98¹      Drug B, N = 102¹
Age                      46 (37, 60)          48 (39, 56)
Marker Level (ng/mL)     0.84 (0.23, 1.60)    0.52 (0.18, 1.21)
Grade
    I                    35 (36%)             33 (32%)
    II                   32 (33%)             36 (35%)
    III                  31 (32%)             33 (32%)
¹ Median (Q1, Q3); n (%)
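
The Unknown rows disappeared from the stratified table because of missing = "no". As a minimal sketch of the opposite setting, the call below uses missing = "always" to print a missing-count row for every variable and missing_text to relabel that row. The object name tbl_missing and the label "Missing" are chosen here for illustration only.

```r
library(gtsummary)

# missing = "always" prints a missing-count row for every variable,
# even when the count is zero; missing_text relabels that row
# (tbl_missing is a hypothetical name used for this sketch)
tbl_missing <- tbl_summary(
  data = trial,
  include = c(age, marker, grade),
  by = trt,
  missing = "always",
  missing_text = "Missing"
)
tbl_missing
```

The default, missing = "ifany", shows the row only for variables that actually have missing values, which is what the first table does.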

Stratify by 2 variables

In the example above, we stratified by one variable, treatment arm. The other variables are summarized as rows. Notice the indentation: the table still contains only three variables (age, marker, grade), and the indented rows under Grade are its levels.

What if you want to stratify by one of the variables presented in the rows? For example, you may want to compare age across the levels of stage. You can do this:

Code
# stratify by stage
tbl_summary(data = trial, include = c(age), by = stage, missing = "no")
Characteristic    T1, N = 53¹    T2, N = 54¹    T3, N = 43¹    T4, N = 50¹
Age               45 (36, 57)    48 (42, 55)    50 (39, 60)    46 (37, 56)
¹ Median (Q1, Q3)

The drawback is that by accepts a single variable, so you cannot stratify by treatment arm at the same time. To do that, reshape the data into wide format.

Use wide format

Before pivoting, add an ID variable so that each row stays uniquely identifiable. Without it, pivot_wider() cannot keep one row per patient in the wide data.

Code
# keep only the needed columns and add a row ID
trial_mini <- trial |>
  select(trt, age, stage) |>
  mutate(id = row_number())

# wide format: one age column per stage
trial_mini_wide <- trial_mini |>
  pivot_wider(names_from = stage, values_from = age)

# what the long format looks like
head(trial_mini)
# A tibble: 6 × 4
  trt      age stage    id
  <chr>  <dbl> <fct> <int>
1 Drug A    23 T1        1
2 Drug B     9 T2        2
3 Drug A    31 T1        3
4 Drug A    NA T3        4
5 Drug A    51 T4        5
6 Drug B    39 T4        6
Code
# stage variable is expanded into 4 variables: T1, T2, T3, T4
head(trial_mini_wide)
# A tibble: 6 × 6
  trt       id    T1    T2    T3    T4
  <chr>  <int> <dbl> <dbl> <dbl> <dbl>
1 Drug A     1    23    NA    NA    NA
2 Drug B     2    NA     9    NA    NA
3 Drug A     3    31    NA    NA    NA
4 Drug A     4    NA    NA    NA    NA
5 Drug A     5    NA    NA    NA    51
6 Drug B     6    NA    NA    NA    39
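
To see why the id column matters, here is a minimal sketch on a made-up toy data frame. The names toy, no_id and with_id and all the values are hypothetical. Two rows share the same treatment arm and stage, so without an ID they collide in the same cell.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: rows 1 and 2 share the same trt and stage
toy <- tibble(
  trt   = c("Drug A", "Drug A", "Drug B"),
  stage = c("T1", "T1", "T2"),
  age   = c(23, 31, 9)
)

# Without an id, both Drug A / T1 ages land in one cell, so pivot_wider()
# warns and returns list-columns (warning suppressed here for brevity)
no_id <- suppressWarnings(
  pivot_wider(toy, names_from = stage, values_from = age)
)

# With an id, every original row keeps its own row in the wide data
with_id <- toy |>
  mutate(id = row_number()) |>
  pivot_wider(names_from = stage, values_from = age)

nrow(no_id)    # one row per treatment arm
nrow(with_id)  # one row per original observation
```

List-columns cannot be summarized as numeric variables, which is why the ID is added before pivoting.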

Now you can summarize all the stage variables together using the include argument. They are stacked as rows, and each row summarizes the target numeric variable, age, within one stage.

Code
tbl_summary(
  trial_mini_wide,
  include = c('T1', 'T2', 'T3', 'T4'),
  missing = "no"
)
Characteristic    N = 200¹
Age               45 (36, 57)
Age               48 (42, 55)
Age               50 (39, 60)
Age               46 (37, 56)
¹ Median (Q1, Q3)
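
Note that the header N = 200 counts every row of the wide data, not the patients in each stage. As a rough cross-check, the per-stage counts and medians can be computed directly from the long data with dplyr. This is only a sketch: age_by_stage and its column names are made up here, and the medians may differ slightly from the table because of rounding.

```r
library(gtsummary)  # provides the trial data
library(dplyr)

# Per-stage patient count, non-missing age count, and median age,
# computed from the original long data (names are hypothetical)
age_by_stage <- trial |>
  group_by(stage) |>
  summarise(
    n_stage    = n(),
    n_age      = sum(!is.na(age)),
    median_age = median(age, na.rm = TRUE)
  )
age_by_stage
```

Each row of age_by_stage corresponds to one row of the wide-format table above.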

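All four rows above carry the same label, Age, so relabel the wide-format variables to make them informative. Pass a named list to the label argument. The example below also adds by = trt, which gives the stratification by both stage (rows) and treatment arm (columns).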

Code
tbl_summary(
  trial_mini_wide,
  by = trt,
  include = c('T1', 'T2', 'T3', 'T4'),
  missing = 'no',
  label = list(
    'T1' = "Stage T1",
    'T2' = "Stage T2",
    'T3' = "Stage T3",
    'T4' = "Stage T4"
  )
)
Characteristic    Drug A, N = 98¹    Drug B, N = 102¹
Stage T1          43 (31, 53)        47 (43, 57)
Stage T2          48 (41, 63)        49 (42, 53)
Stage T3          48 (38, 61)        53 (40, 59)
Stage T4          46 (36, 60)        45 (37, 54)
¹ Median (Q1, Q3)
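
With many levels, typing each label by hand gets tedious. One possible shortcut, sketched here under the assumption that the wide columns are named after the levels of stage, is to build the named list from the factor levels. The helper objects stages and stage_labels are hypothetical names.

```r
library(gtsummary)
library(dplyr)
library(tidyr)

# Rebuild the wide data so this sketch runs on its own
trial_mini_wide <- trial |>
  select(trt, age, stage) |>
  mutate(id = row_number()) |>
  pivot_wider(names_from = stage, values_from = age)

# Build list(T1 = "Stage T1", ...) from the factor levels
stages <- levels(trial$stage)
stage_labels <- as.list(setNames(paste("Stage", stages), stages))

tbl_summary(
  trial_mini_wide,
  by = trt,
  include = all_of(stages),
  missing = "no",
  label = stage_labels
)
```

This should produce the same table as the hand-written label list.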

What statistics?

You can request statistics other than the default, median (Q1, Q3), with the statistic argument.

Code
tbl_summary(
  trial_mini_wide,
  by = trt,
  include = c('T1', 'T2', 'T3', 'T4'),
  missing = 'no',
  label = list(
    'T1' = "Stage T1",
    'T2' = "Stage T2",
    'T3' = "Stage T3",
    'T4' = "Stage T4"
  ),
  statistic = list(all_continuous() ~ "{mean} ({sd})")
)
Characteristic    Drug A, N = 98¹    Drug B, N = 102¹
Stage T1          44 (15)            50 (14)
Stage T2          50 (13)            46 (12)
Stage T3          49 (14)            50 (15)
Stage T4          45 (17)            44 (15)
¹ Mean (SD)
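
If one line of statistics is not enough, a continuous variable can be shown over several lines by setting its type to "continuous2". The sketch below uses the original trial data and the age variable for brevity. The object name tbl_multi is hypothetical, and the choice of statistics is just one example.

```r
library(gtsummary)

# type = "continuous2" prints one row per statistic string
# (tbl_multi is a hypothetical name used for this sketch)
tbl_multi <- tbl_summary(
  trial,
  by = trt,
  include = age,
  missing = "no",
  type = age ~ "continuous2",
  statistic = age ~ c(
    "{mean} ({sd})",
    "{median} ({p25}, {p75})",
    "{min}, {max}"
  )
)
tbl_multi
```

The same type and statistic settings could be applied to the wide-format stage columns, for example with all_continuous() on the left-hand side of the formulas.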