Which variable to summarize in `tbl_summary()`


A quick way to remember which argument each variable goes in:

`by`: the stratification variable, shown as columns. Typically the treatment arm.
`include`: the variables to summarize, shown as rows.
- continuous: the summary value in one row, the missing count in another row.
- categorical: one row per category, the missing count in another row.

`by` takes a single variable. To stratify by more than one variable, you can reshape the data into wide format.

Load data and packages

We are going to use the trial data from gtsummary.


library(gtsummary)
library(dplyr)
library(tidyr)
head(trial)

# A tibble: 6 × 8
  trt      age marker stage grade response death ttdeath
  <chr>  <dbl>  <dbl> <fct> <fct>    <int> <int>   <dbl>
1 Drug A    23  0.16  T1    II           0     0    24  
2 Drug B     9  1.11  T2    I            1     0    24  
3 Drug A    31  0.277 T1    II           0     0    24  
4 Drug A    NA  2.07  T3    III          1     1    17.6
5 Drug A    51  2.77  T4    III          1     1    16.4
6 Drug B    39  0.613 T4    I            0     1    15.6

Basic `tbl_summary()` table

When both numeric and categorical variables are included, they are stacked as rows, which produces a long table.


tbl_summary(data = trial, include = c(age, marker, grade))

Characteristic	N = 200¹
Age	47 (38, 57)
Unknown	11
Marker Level (ng/mL)	0.64 (0.22, 1.41)
Unknown	10
Grade
I	68 (34%)
II	68 (34%)
III	64 (32%)
¹ Median (Q1, Q3); n (%)
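To check the row layout described at the top (a label row, one row per category, and a separate row for missing values), you can look inside the table object. This is a minimal sketch that assumes the returned object stores its rows in `table_body` with a `row_type` column, as recent gtsummary versions do; the object name `tbl_basic` is only for illustration.

```r
library(gtsummary)

# Same table as above, stored so we can inspect it
tbl_basic <- tbl_summary(data = trial, include = c(age, marker, grade))

# One line per displayed row:
# "label" = variable row, "level" = category row, "missing" = Unknown row
tbl_basic$table_body[, c("variable", "row_type", "label")]
```

Continuous variables should show a "label" row plus a "missing" row, while the categorical variable shows a "label" row followed by one "level" row per category.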

We can stratify by treatment arm using the `by` argument. Setting `missing = "no"` hides the Unknown rows.


# stratify by treatment arm
tbl_summary(data = trial, include = c(age, marker, grade), by = trt, missing = "no")

Characteristic	Drug A N = 98¹	Drug B N = 102¹
Age	46 (37, 60)	48 (39, 56)
Marker Level (ng/mL)	0.84 (0.23, 1.60)	0.52 (0.18, 1.21)
Grade
I	35 (36%)	33 (32%)
II	32 (33%)	36 (35%)
III	31 (32%)	33 (32%)
¹ Median (Q1, Q3); n (%)

Stratify by 2 variables

In the example above, we stratified by one variable, treatment arm, and the other variables were summarized as rows. Notice the indentation in the rendered table: the grade levels sit under Grade, so there are still only 3 variables (age, marker, grade).

What if you want to stratify by one of the variables currently presented in the rows? For example, you may want to compare age across the levels of stage. You can do this:


# stratify by stage
tbl_summary(data = trial, include = c(age), by = stage, missing = "no")

Characteristic	T1 N = 53¹	T2 N = 54¹	T3 N = 43¹	T4 N = 50¹
Age	45 (36, 57)	48 (42, 55)	50 (39, 60)	46 (37, 56)
¹ Median (Q1, Q3)

The disadvantage is that `by` is now taken by stage, so you cannot stratify by treatment arm in the same call. To get both, reshape the data into wide format.

Use wide format

Before pivoting to wide format, add an ID variable so that each row stays uniquely identifiable. Without it, the pivot cannot keep one row per patient.
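To see why the ID matters, here is a small sketch of what happens if you pivot without one. The name `no_id_wide` is hypothetical, and the warning that `pivot_wider()` raises about values not being uniquely identified is suppressed here only to keep the output short.

```r
library(gtsummary)
library(dplyr)
library(tidyr)

# Pivot WITHOUT an id column: only trt is left to identify rows,
# so many ages compete for the same cell
no_id_wide <- suppressWarnings(
  trial |>
    select(trt, age, stage) |>
    pivot_wider(names_from = stage, values_from = age)
)

dim(no_id_wide)         # one row per treatment arm, not one per patient
is.list(no_id_wide$T1)  # the stage columns have become list-columns
```

A table like this cannot be passed to `tbl_summary()` in a meaningful way, which is why the code that follows adds `id` first.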


# keep only the variables we need and add a row id
trial_mini <- trial |>
  select(trt, age, stage) |>
  mutate(id = row_number())

# wide format: one age column per stage level
trial_mini_wide <- trial_mini |>
  pivot_wider(names_from = stage, values_from = age)

# what the long format looks like
head(trial_mini)

# A tibble: 6 × 4
  trt      age stage    id
  <chr>  <dbl> <fct> <int>
1 Drug A    23 T1        1
2 Drug B     9 T2        2
3 Drug A    31 T1        3
4 Drug A    NA T3        4
5 Drug A    51 T4        5
6 Drug B    39 T4        6


# stage variable is expanded into 4 variables: T1, T2, T3, T4
head(trial_mini_wide)

# A tibble: 6 × 6
  trt       id    T1    T2    T3    T4
  <chr>  <int> <dbl> <dbl> <dbl> <dbl>
1 Drug A     1    23    NA    NA    NA
2 Drug B     2    NA     9    NA    NA
3 Drug A     3    31    NA    NA    NA
4 Drug A     4    NA    NA    NA    NA
5 Drug A     5    NA    NA    NA    51
6 Drug B     6    NA    NA    NA    39
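A quick sanity check on the reshaped data can be sketched as follows. It rebuilds the wide table so the block runs on its own; `stage_cols`, `n_filled`, `wide_n`, and `long_n` are illustrative names, not part of gtsummary.

```r
library(gtsummary)
library(dplyr)
library(tidyr)

trial_mini_wide <- trial |>
  select(trt, age, stage) |>
  mutate(id = row_number()) |>
  pivot_wider(names_from = stage, values_from = age)

stage_cols <- c("T1", "T2", "T3", "T4")

# Each patient has one stage, so at most one stage column is filled per row
# (zero when age itself is missing)
n_filled <- rowSums(!is.na(trial_mini_wide[stage_cols]))
table(n_filled)

# Non-missing ages per stage should agree between wide and long data
wide_n <- colSums(!is.na(trial_mini_wide[stage_cols]))
long_n <- trial |> filter(!is.na(age)) |> count(stage)
wide_n
long_n
```

If a row had more than one filled stage column, or the counts disagreed, the pivot would not be a faithful reshaping of the long data.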

Now you can summarize all the stage columns together by listing them in the `include` argument. They are stacked as rows, and every row summarizes the same target numeric variable, age.


tbl_summary(
  trial_mini_wide,
  include = c('T1', 'T2', 'T3', 'T4'),
  missing = "no"
)

Characteristic	N = 200¹
Age	45 (36, 57)
Age	48 (42, 55)
Age	50 (39, 60)
Age	46 (37, 56)
¹ Median (Q1, Q3)

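Every row is labeled Age because the new columns carry over the label of the original age variable. To make the labels informative, pass a named list to the `label` argument. The call below also adds `by = trt`, which gives the breakdown by stage (rows) and treatment arm (columns).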


tbl_summary(
  trial_mini_wide,
  by = trt,
  include = c('T1', 'T2', 'T3', 'T4'),
  missing = 'no',
  label = list(
    'T1' = "Stage T1",
    'T2' = "Stage T2",
    'T3' = "Stage T3",
    'T4' = "Stage T4"
  )
)

Characteristic	Drug A N = 98¹	Drug B N = 102¹
Stage T1	43 (31, 53)	47 (43, 57)
Stage T2	48 (41, 63)	49 (42, 53)
Stage T3	48 (38, 61)	53 (40, 59)
Stage T4	46 (36, 60)	45 (37, 54)
¹ Median (Q1, Q3)
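With many stage columns, typing each label by hand gets tedious. One possible shortcut, sketched here with base R only, is to build the same named list programmatically; `stage_cols`, `stage_labels`, and `tbl_labeled` are illustrative names.

```r
library(gtsummary)
library(dplyr)
library(tidyr)

trial_mini_wide <- trial |>
  select(trt, age, stage) |>
  mutate(id = row_number()) |>
  pivot_wider(names_from = stage, values_from = age)

stage_cols <- c("T1", "T2", "T3", "T4")

# Named list: names are column names, values are the display labels
stage_labels <- as.list(setNames(paste("Stage", stage_cols), stage_cols))

tbl_labeled <- tbl_summary(
  trial_mini_wide,
  by = trt,
  include = all_of(stage_cols),
  missing = "no",
  label = stage_labels
)
tbl_labeled
```

The list has the same shape as the hand-written one above, so the resulting table should be the same.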

What statistics?

You can replace the default statistics with the `statistic` argument. Here we report the mean and standard deviation instead of the median and quartiles.


tbl_summary(
  trial_mini_wide,
  by = trt,
  include = c('T1', 'T2', 'T3', 'T4'),
  missing = 'no',
  label = list(
    'T1' = "Stage T1",
    'T2' = "Stage T2",
    'T3' = "Stage T3",
    'T4' = "Stage T4"
  ),
  statistic = list(all_continuous() ~ "{mean} ({sd})")
)

Characteristic	Drug A N = 98¹	Drug B N = 102¹
Stage T1	44 (15)	50 (14)
Stage T2	50 (13)	46 (12)
Stage T3	49 (14)	50 (15)
Stage T4	45 (17)	44 (15)
¹ Mean (SD)
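If you want more than one statistic per variable, gtsummary also offers a multi-line continuous summary type, `"continuous2"`. The following is a sketch under the same setup as above; the object name `tbl_two_stats` is illustrative, and the wide data is rebuilt so the block runs on its own.

```r
library(gtsummary)
library(dplyr)
library(tidyr)

trial_mini_wide <- trial |>
  select(trt, age, stage) |>
  mutate(id = row_number()) |>
  pivot_wider(names_from = stage, values_from = age)

tbl_two_stats <- tbl_summary(
  trial_mini_wide,
  by = trt,
  include = c(T1, T2, T3, T4),
  missing = "no",
  label = list(
    T1 = "Stage T1",
    T2 = "Stage T2",
    T3 = "Stage T3",
    T4 = "Stage T4"
  ),
  # "continuous2" prints each statistic on its own row under the variable
  type = all_continuous() ~ "continuous2",
  statistic = all_continuous() ~ c(
    "{mean} ({sd})",
    "{median} ({p25}, {p75})"
  )
)
tbl_two_stats
```

Each stage then gets two rows, one for the mean (SD) and one for the median (Q1, Q3), which makes it easier to compare the two summaries side by side.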
