Illustrating splines using the Titanic dataset

Steve Simon

2024-05-07

Splines provide a useful way to model relationships that are more complex than a simple linear relationship. They work in a variety of regression models. Here is an illustration of how to use a spline in a logistic regression model with data from survival of passengers on the Titanic.

The Titanic was a large cruise ship, the biggest of its kind in 1912. It was thought to be unsinkable, but when it set sail from England to American in its maiden voyage, it struck an iceberg and sank, killing many of the passengers and crew. You can get fairly good data on the characteristics of passengers who died and compare them to those that survived. The data indicate a strong effect due to age and gender, representing a philosophy of “women and children first” that held during the boarding of life boats. Let’s look at the effect of age on survival using a logistic regression model.

Here is a brief description of the diamond pricing dataset, taken from the data dictionary on my github site.

The Titanic was a large cruise ship, the biggest of its kind in 1912. It was thought to be unsinkable, but when it set sail from England to America in its maiden voyage, it struck an iceberg and sank, killing many of the passengers and crew. You can get fairly good data on the characteristics of passengers who died and compare them to those that survived. The data indicate a strong effect due to age and gender, representing a philosophy of “women and children first” that held during the boarding of life boats.

Here are the first few rows of data.

f2 <- "titanic.txt"
ti0 <- 
  data.frame(
    read_tsv(
      file=paste0(f0, f2),
      col_types="ccncn"))
names(ti0) <- tolower(names(ti0))
head(ti0)
##                                            name pclass   age    sex
## 1                  Allen, Miss Elisabeth Walton    1st 29.00 female
## 2                   Allison, Miss Helen Loraine    1st  2.00 female
## 3           Allison, Mr Hudson Joshua Creighton    1st 30.00   male
## 4 Allison, Mrs Hudson JC (Bessie Waldo Daniels)    1st 25.00 female
## 5                 Allison, Master Hudson Trevor    1st  0.92   male
## 6                            Anderson, Mr Harry    1st 47.00   male
##   survived
## 1        1
## 2        0
## 3        0
## 4        0
## 5        1
## 6        1

Here are a few descriptive statistics

summary(ti0$Age)
## Length  Class   Mode 
##      0   NULL   NULL
count(ti0, pclass)
##   pclass   n
## 1    1st 322
## 2    2nd 280
## 3    3rd 711
count(ti0, sex)
##      sex   n
## 1 female 462
## 2   male 851
count(ti0, survived)
##   survived   n
## 1        0 863
## 2        1 450

The boxplots reveal little differences between the ages of survivors and deaths. If something is going on, it is subtle.

ggplot(data=ti0, aes(factor(survived), age)) + 
  geom_boxplot()
## Warning: Removed 557 rows containing non-finite values (stat_boxplot).

plot of chunk titanic-spline-04

Fit a linear model first.

ti1 <- filter(ti0, !is.na(age))
linear_fit <- glm(
  survived ~ age,
  family=binomial,
  data=ti1)
tidy(linear_fit)
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) -0.0814    0.174      -0.468  0.640 
## 2 age         -0.00879   0.00523    -1.68   0.0928

There may be a downward trend in the odds of survival over time, but it is not statistically significant.

Plot the predicted values.

age_sequence <- data.frame(
  age = seq(min(ti1$age), max(ti1$age), length=100))
age_sequence$prediction <- 
  predict(linear_fit, newdata=age_sequence, type="response")
ggplot(data=ti1, aes(age, survived)) +
  geom_point(pch=1) +
  geom_line(
    data=age_sequence,
    aes(age, prediction))

plot of chunk titanic-spline-06

The plot is linear, but you should really look at it on the log odds scale because the logistic regression model is linear in the log odds. You need to change the predict function from type=“response” to type=“link” to get predictions on a log odds scale.

age_sequence$prediction <- 
  predict(linear_fit, newdata=age_sequence, type="link")
ggplot(data=age_sequence, aes(age, prediction)) +
  geom_line()

plot of chunk titanic-spline-07

Now fit a spline function.

spline_fit <- glm(
  survived ~ rcs(age),
  family=binomial,
  data=ti1)
tidy(spline_fit)
## # A tibble: 5 x 5
##   term           estimate std.error statistic  p.value
##   <chr>             <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)       1.29     0.399      3.24  0.00120 
## 2 rcs(age)age      -0.102    0.0296    -3.42  0.000616
## 3 rcs(age)age'      0.223    0.159      1.40  0.161   
## 4 rcs(age)age''    -0.141    1.20      -0.118 0.906   
## 5 rcs(age)age'''   -0.752    1.57      -0.478 0.633

The coefficients from the restricted cubic spline are pretty much uninterpretable. You have to visualize the spline graphically. First do this on the log odds scale to see how far from linear the spline fit is.

age_sequence$prediction <- 
  predict(spline_fit, newdata=age_sequence, type="link")
ggplot(data=age_sequence, aes(age, prediction)) +
  geom_line()

plot of chunk titanic-spline-09

Now plot on the original scale to get a picture of what is going on.

age_sequence$prediction <- 
  predict(spline_fit, newdata=age_sequence, type="response")
ggplot(data=ti1, aes(age, survived)) +
  geom_point(pch=1) +
  geom_line(
    data=age_sequence,
    aes(age, prediction))

plot of chunk titanic-spline-10

It looks like “women and children” first might actually be “women, children, and old people first”.