Nice layout for comparison of categorical outcomes

Steve Simon

2024-06-27

While it is easy to get R to produce descriptive statistics for categorical outcomes in a research study, it is a bit harder to get R to produce a layout that is publication-ready. Hard, but not impossible. Here is a simple example.

Here is a brief description of the Titanic dataset used in this example, taken from the data dictionary on my GitHub site.

The Titanic was a large ocean liner, the biggest of its kind in 1912. It was thought to be unsinkable, but when it set sail from England to America on its maiden voyage, it struck an iceberg and sank, killing many of the passengers and crew. You can get fairly good data on the characteristics of passengers who died and compare them to those who survived. The data indicate a strong effect due to age and gender, representing a philosophy of “women and children first” that held during the boarding of lifeboats.

Here are the first few rows of data.

##                                            Name   Age PClass    Sex
## 1                  Allen, Miss Elisabeth Walton 29.00    1st female
## 2                   Allison, Miss Helen Loraine  2.00    1st female
## 3           Allison, Mr Hudson Joshua Creighton 30.00    1st   male
## 4 Allison, Mrs Hudson JC (Bessie Waldo Daniels) 25.00    1st female
## 5                 Allison, Master Hudson Trevor  0.92    1st   male
## 6                            Anderson, Mr Harry 47.00    1st   male
##   Survived
## 1        1
## 2        0
## 3        0
## 4        0
## 5        1
## 6        1

I have hidden the R code up to this point, as it is mundane and not of great interest. I will show the R code and output for the rest of the analysis.

Using the group_by and summarize functions in R shows the number of deaths and survivors among men and women.

ti0 %>%
  group_by(Sex, Survived) %>%
  # dplyr:: prefix: the unqualified call was picking up another package's summarize
  dplyr::summarize(n1=n(), .groups="drop") -> tab01
tab01

Two quick notes about the code. First, it would be simpler and better to use the single function count instead of the two functions group_by and summarize. I prefer the two functions because you can generalize a bit more easily to other cases like using a continuous outcome.
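Here is a minimal sketch comparing the two approaches. The toy data frame is made up for illustration (it is not the Titanic data), and the name argument in count is used only so that the column matches n1 from the two-function version.

```r
library(dplyr)

# Hypothetical toy data standing in for ti0 (not the real Titanic counts)
toy <- tibble(
  Sex      = c("female", "female", "female", "male", "male", "male", "male"),
  Survived = c(1, 1, 0, 0, 0, 0, 1)
)

# Two-function version, as used in this post
two_step <- toy %>%
  group_by(Sex, Survived) %>%
  dplyr::summarize(n1 = n(), .groups = "drop")

# One-function version; name = "n1" keeps the count column name the same
one_step <- toy %>%
  count(Sex, Survived, name = "n1")

# Compare the two results as plain data frames
all.equal(as.data.frame(two_step), as.data.frame(one_step))
```

The two-function version pays off when you swap n() for something like mean(Age), which count cannot do.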

Second, the .groups argument is relatively new (it arrived with dplyr 1.0.0). When you use a group_by statement with two or more variables, the result is a grouped tibble. The summarize function will then produce either a tibble that is still grouped by everything except the last grouping variable or a tibble that is not grouped at all. Most of the time (in my experience), you want the latter, but the default is the former. To produce a straight tibble rather than a grouped tibble, use the .groups = "drop" argument in summarize.
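A small sketch makes the difference visible. The toy data frame is again made up, and group_vars is the dplyr function that reports which grouping variables remain on a tibble.

```r
library(dplyr)

# Hypothetical toy data standing in for ti0
toy <- tibble(
  Sex      = c("female", "female", "male", "male", "male"),
  Survived = c(1, 0, 0, 0, 1)
)

# Default behavior: the last grouping variable (Survived) is peeled off,
# so the result is still grouped by Sex (dplyr also prints a message)
kept <- toy %>%
  group_by(Sex, Survived) %>%
  dplyr::summarize(n1 = n())
group_vars(kept)

# With .groups = "drop" the result carries no grouping at all
dropped <- toy %>%
  group_by(Sex, Survived) %>%
  dplyr::summarize(n1 = n(), .groups = "drop")
group_vars(dropped)
```

The first call to group_vars should report Sex and the second should report nothing.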

If you examine the results, there are a lot more male deaths than female deaths, but be careful because there were also more men than women on the boat. Calculating the proportions of deaths and survivors among men and women is fairly easy.

tab01 %>%
  group_by(Sex) %>%
  mutate(n2=sum(n1)) %>%
  mutate(pct=round(100*n1/n2)) -> tab02
tab02

There will be times when the sprintf function will lay things out a bit better than round. If you have numbers of 1,000 or more, you might consider adding a comma (e.g., 1,000) using format(., big.mark=",") to make the large numbers a bit easier to read.
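Here is a short sketch of both ideas using base R only. The counts are invented for illustration. Note that format pads the elements of a vector to a common width, so on a whole column you may want to add trim=TRUE.

```r
# Made-up counts, large enough to need a thousands separator
n1 <- 1234
n2 <- 5678
pct <- 100 * n1 / n2

# round() drops a trailing zero when printing; sprintf() always keeps it
rounded <- round(33.0, 1)        # prints as 33
fixed   <- sprintf("%.1f", 33.0) # "33.0"

# Combine comma-formatted counts with a one-decimal percentage
cell <- sprintf(
  "%s/%s (%.1f%%)",
  format(n1, big.mark = ","),
  format(n2, big.mark = ","),
  pct
)
cell  # "1,234/5,678 (21.7%)"
```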

You can combine the numerator, denominator, and percentage into one nice package using the glue function.

tab02 %>%
  mutate(out=glue("{n1}/{n2} ({pct}%)")) %>%
  # dplyr:: prefix: the unqualified call was picking up another package's select
  dplyr::select(Sex, Survived, out) -> tab03
tab03

Again, the sprintf function might work better than glue.

Next, arrange things so that the data on survivors is in one column and the data on deaths is in a separate column. This is done with the pivot_wider function.

tab03 %>%
  pivot_wider(
    names_from=Survived, 
    values_from=out) -> tab05
tab05

Use the set_names function to produce nicer-looking variable names. The set_names function is one of the useful aliases available in the magrittr package.

tab05 %>%
  set_names(c(
    "Sex", 
    "Deaths", 
    "Survivors")) -> tab06
tab06

Next, let’s use the kableExtra package to make things look really nice. There are a gazillion different options in kableExtra, and you might want to review the kableExtra vignette. There are also lots of other packages out there that can create publication-quality tables.

library(kableExtra)
tab06 %>%
  kbl %>%
  kable_paper(
    "striped", 
    full_width=FALSE) -> tab07
tab07

If you have multiple independent variables, the pack_rows function is nice. Notice that we have to set the name of the first column to blank to avoid redundancy.

library(kableExtra)
tab06 %>%
  set_names(c(" ", "Deaths", "Survivors")) %>%
  kbl %>%
  kable_paper(
    "striped", 
    full_width=FALSE) %>%
  pack_rows("Sex", 1, 2) -> tab08
tab08
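To show what pack_rows buys you with more than one independent variable, here is a rough sketch that stacks a block for Sex on top of a block for passenger class. The toy data frame and the helper function one_block are both hypothetical, written just for this illustration, and the row positions passed to pack_rows assume two levels per variable.

```r
library(dplyr)
library(tidyr)
library(glue)
library(kableExtra)

# Hypothetical toy data standing in for ti0
toy <- tibble(
  Sex      = c("female", "female", "male", "male", "male", "female"),
  PClass   = c("1st", "2nd", "1st", "2nd", "2nd", "1st"),
  Survived = c(1, 0, 0, 0, 1, 1)
)

# Hypothetical helper: one row per level of the variable named in v,
# with "n/N (pct%)" cells for deaths and survivors
one_block <- function(dat, v) {
  dat %>%
    count(level = .data[[v]], Survived, name = "n1") %>%
    group_by(level) %>%
    mutate(n2 = sum(n1), pct = round(100 * n1 / n2)) %>%
    ungroup() %>%
    mutate(out = as.character(glue("{n1}/{n2} ({pct}%)"))) %>%
    dplyr::select(level, Survived, out) %>%
    pivot_wider(names_from = Survived, values_from = out) %>%
    dplyr::select(level, Deaths = `0`, Survivors = `1`)
}

# Stack the two blocks: rows 1-2 are Sex, rows 3-4 are passenger class
both <- bind_rows(one_block(toy, "Sex"), one_block(toy, "PClass"))

both %>%
  magrittr::set_names(c(" ", "Deaths", "Survivors")) %>%
  kbl() %>%
  kable_paper("striped", full_width = FALSE) %>%
  pack_rows("Sex", 1, 2) %>%
  pack_rows("Passenger class", 3, 4) -> tab_multi
tab_multi
```

If a level has no deaths or no survivors, pivot_wider will leave that cell as NA, so real data may need a values_fill argument or a similar fix.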

Here is the code all in one place rather than step by step. It looks quite complex, but if you build things one step at a time, the complex code is not all that complex to generate.

ti0 %>%
  group_by(Sex, Survived) %>%
  # dplyr:: prefixes keep other packages from masking summarize and select
  dplyr::summarize(n1=n(), .groups = "drop") %>%
  group_by(Sex) %>%
  mutate(n2=sum(n1)) %>%
  mutate(pct=round(100*n1/n2)) %>%
  mutate(out=glue("{n1}/{n2} ({pct}%)")) %>%
  dplyr::select(Sex, Survived, out) %>%
  pivot_wider(names_from=Survived, values_from=out) %>%
  set_names(c(" ", "Deaths", "Survivors")) %>%
  kbl %>%
  kable_paper("striped", full_width=FALSE) %>%
  pack_rows("Sex", 1, 2)

Here is what the code would look like if you used sprintf instead of glue.

ti0 %>%
  group_by(Sex, Survived) %>%
  # dplyr:: prefixes keep other packages from masking summarize and select
  dplyr::summarize(n1=n(), .groups = "drop") %>%
  group_by(Sex) %>%
  mutate(n2=sum(n1)) %>%
  mutate(pct=100*n1/n2) %>%
  mutate(out=sprintf("%3.0f/%3.0f (%3.0f%%)", n1, n2, pct)) %>%
  dplyr::select(Sex, Survived, out) %>%
  pivot_wider(names_from=Survived, values_from=out) %>%
  set_names(c(" ", "Deaths", "Survivors")) %>%
  kbl %>%
  kable_paper("striped", full_width=FALSE) %>%
  pack_rows("Sex", 1, 2)