Splines provide a useful way to model relationships that are more complex than a simple linear relationship. They work in a variety of regression models. Here is an illustration of how to use a spline in a logistic regression model with data from survival of passengers on the Titanic.
The Titanic was a large cruise ship, the biggest of its kind in 1912. It was thought to be unsinkable, but when it set sail from England to American in its maiden voyage, it struck an iceberg and sank, killing many of the passengers and crew. You can get fairly good data on the characteristics of passengers who died and compare them to those that survived. The data indicate a strong effect due to age and gender, representing a philosophy of “women and children first” that held during the boarding of life boats. Let’s look at the effect of age on survival using a logistic regression model.
Here is a brief description of the diamond pricing dataset, taken from the data dictionary on my github site.
The Titanic was a large cruise ship, the biggest of its kind in 1912. It was thought to be unsinkable, but when it set sail from England to America in its maiden voyage, it struck an iceberg and sank, killing many of the passengers and crew. You can get fairly good data on the characteristics of passengers who died and compare them to those that survived. The data indicate a strong effect due to age and gender, representing a philosophy of “women and children first” that held during the boarding of life boats.
Here are the first few rows of data.
f2 <- "titanic.txt"
ti0 <-
data.frame(
read_tsv(
file=paste0(f0, f2),
col_types="ccncn"))
names(ti0) <- tolower(names(ti0))
head(ti0)
## name pclass age sex
## 1 Allen, Miss Elisabeth Walton 1st 29.00 female
## 2 Allison, Miss Helen Loraine 1st 2.00 female
## 3 Allison, Mr Hudson Joshua Creighton 1st 30.00 male
## 4 Allison, Mrs Hudson JC (Bessie Waldo Daniels) 1st 25.00 female
## 5 Allison, Master Hudson Trevor 1st 0.92 male
## 6 Anderson, Mr Harry 1st 47.00 male
## survived
## 1 1
## 2 0
## 3 0
## 4 0
## 5 1
## 6 1
Here are a few descriptive statistics
summary(ti0$Age)
## Length Class Mode
## 0 NULL NULL
count(ti0, pclass)
## pclass n
## 1 1st 322
## 2 2nd 280
## 3 3rd 711
count(ti0, sex)
## sex n
## 1 female 462
## 2 male 851
count(ti0, survived)
## survived n
## 1 0 863
## 2 1 450
The boxplots reveal little differences between the ages of survivors and deaths. If something is going on, it is subtle.
ggplot(data=ti0, aes(factor(survived), age)) +
geom_boxplot()
## Warning: Removed 557 rows containing non-finite values (stat_boxplot).
Fit a linear model first.
ti1 <- filter(ti0, !is.na(age))
linear_fit <- glm(
survived ~ age,
family=binomial,
data=ti1)
tidy(linear_fit)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.0814 0.174 -0.468 0.640
## 2 age -0.00879 0.00523 -1.68 0.0928
There may be a downward trend in the odds of survival over time, but it is not statistically significant.
Plot the predicted values.
age_sequence <- data.frame(
age = seq(min(ti1$age), max(ti1$age), length=100))
age_sequence$prediction <-
predict(linear_fit, newdata=age_sequence, type="response")
ggplot(data=ti1, aes(age, survived)) +
geom_point(pch=1) +
geom_line(
data=age_sequence,
aes(age, prediction))
The plot is linear, but you should really look at it on the log odds scale because the logistic regression model is linear in the log odds. You need to change the predict function from type=“response” to type=“link” to get predictions on a log odds scale.
age_sequence$prediction <-
predict(linear_fit, newdata=age_sequence, type="link")
ggplot(data=age_sequence, aes(age, prediction)) +
geom_line()
Now fit a spline function.
spline_fit <- glm(
survived ~ rcs(age),
family=binomial,
data=ti1)
tidy(spline_fit)
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.29 0.399 3.24 0.00120
## 2 rcs(age)age -0.102 0.0296 -3.42 0.000616
## 3 rcs(age)age' 0.223 0.159 1.40 0.161
## 4 rcs(age)age'' -0.141 1.20 -0.118 0.906
## 5 rcs(age)age''' -0.752 1.57 -0.478 0.633
The coefficients from the restricted cubic spline are pretty much uninterpretable. You have to visualize the spline graphically. First do this on the log odds scale to see how far from linear the spline fit is.
age_sequence$prediction <-
predict(spline_fit, newdata=age_sequence, type="link")
ggplot(data=age_sequence, aes(age, prediction)) +
geom_line()
Now plot on the original scale to get a picture of what is going on.
age_sequence$prediction <-
predict(spline_fit, newdata=age_sequence, type="response")
ggplot(data=ti1, aes(age, survived)) +
geom_point(pch=1) +
geom_line(
data=age_sequence,
aes(age, prediction))
It looks like “women and children” first might actually be “women, children, and old people first”.