Splines provide a useful way to model relationships that are more complex than a simple linear relationship. They work in a variety of regression models. Here is an illustration of how to use a spline in a dataset looking at pricing for diamonds.
Here is a brief description of the diamond pricing dataset, taken from the data dictionary on my github site.
The data is intended to teach some lessons about regression models. The size of a diamond, as well as some categorical descriptors (color and clarity) are listed for 308 diamonds, along with their sales price.
You can also find the original source of the data and a more detailed description, both found on the Journal of Statistics Education datasets website.
f2 <- "singapore-diamond-prices.txt"
raw_data <- read.fwf(
file=glue("{f0}/{f1}/{f2}"),
widths=c(5,2,5,4,5))
di <- raw_data
names(di) <- c(
"carat",
"colour",
"clarity",
"certification",
"price"
)
Make adjustments for exchange rate and inflation.
exchange_rate <- 1.72
inflation_rate <- 1.81
di$adjusted_price <- di$price*inflation_rate/exchange_rate
head(di)
## carat colour clarity certification price adjusted_price
## 1 0.30 D VS2 GIA 1302 1370.128
## 2 0.30 E VS1 GIA 1510 1589.012
## 3 0.30 G VVS1 GIA 1510 1589.012
## 4 0.30 G VS1 GIA 1260 1325.930
## 5 0.31 D VS1 GIA 1641 1726.866
## 6 0.31 E VS1 GIA 1555 1636.366
Here are a few descriptive statistics
summary(di$carat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1800 0.3500 0.6200 0.6309 0.8500 1.1000
summary(di$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 638 1625 4215 5019 7446 16008
summary(di$adjusted_price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 671.4 1710.0 4435.6 5282.1 7835.6 16845.6
count(di, colour)
## colour n
## 1 D 16
## 2 E 44
## 3 F 82
## 4 G 65
## 5 H 61
## 6 I 40
count(di, clarity)
## clarity n
## 1 IF 44
## 2 VS1 81
## 3 VS2 53
## 4 VVS1 52
## 5 VVS2 78
count(di, certification)
## certification n
## 1 GIA 151
## 2 HRD 79
## 3 IGI 78
Suppose that you want to build a model that predicts the adjusted price using carat, colour, and clarity. You suspect, however, that the relationship between carat and adjusted price is nonlinear.
ggplot(data=di, aes(carat, adjusted_price)) +
geom_point(pch=1) -> scatter1
scatter1
First, examine a linear trend.
scatter1+geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
It looks like you could do a bit better than a simple linear fit, and a spline is a good option.
There are many options for splines in R, and you might have additional options beyond splines that also work well.
Choosing among the splines is tricky. To be honest, most of the choices will work out well and the choices end up being between various gradations of good to great.
A type of spline that I like a lot are the restricted cubic splines developed by Frank Harrell in the rms package. You implement these splines by adding an rcs function to your model formula. You use the number of knots to control the complexity from splines that represent a gentle deviation from normality to splines that wiggle all around and are far from linear. The default option for knots in rcs usually works fairly well.
There is a problem with direct interpretation, though. The coefficients for a linear fit are fairly easy to interpret.
linear_fit <- lm(adjusted_price ~ carat, data=di)
linear_coefficients <- tidy(linear_fit)
linear_coefficients
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2419. 167. -14.5 1.28e- 36
## 2 carat 12206. 242. 50.4 3.04e-150
The slope, 1.2206 × 104, represents the estimated average change in price when carats increase by one unit.
Things don’t have such a nice interpretation with your spline model.
spline_fit <- lm(adjusted_price ~ rcs(carat), data=di)
tidy(spline_fit)
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 24.2 544. 0.0445 0.965
## 2 rcs(carat)carat 4600. 1940. 2.37 0.0183
## 3 rcs(carat)carat' 23215. 10824. 2.14 0.0328
## 4 rcs(carat)carat'' -52143. 31938. -1.63 0.104
## 5 rcs(carat)carat''' 56668. 44127. 1.28 0.200
The coefficients from the restricted cubic spline are pretty much uninterpretable. You have to visualize the spline graphically.
carat_sequence <- data.frame(
carat = seq(min(di$carat), max(di$carat), length=100))
carat_sequence$prediction <-
predict(spline_fit, newdata=carat_sequence)
scatter1 + geom_line(
data=carat_sequence,
aes(carat, prediction))
You might be interested in seeing if the spline fits better than a simple linear function. Fit both a linear term and a spline and then compare to the fit with just a linear term.
linear_plus_spline_fit <- lm(
adjusted_price ~ carat + rcs(carat),
data=di)
anova(linear_fit, linear_plus_spline_fit)
## Analysis of Variance Table
##
## Model 1: adjusted_price ~ carat
## Model 2: adjusted_price ~ carat + rcs(carat)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 306 423220454
## 2 303 346288655 3 76931799 22.438 3.8e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is small, indicating that adding a spline to a linear fit significantly improves the prediction model.