Illustrating splines using the diamond pricing dataset

Steve Simon

2024-05-07

Splines provide a useful way to model relationships that are more complex than a simple linear relationship. They work in a variety of regression models. Here is an illustration of how to use a spline in a dataset looking at pricing for diamonds.

Here is a brief description of the diamond pricing dataset, taken from the data dictionary on my github site.

The data is intended to teach some lessons about regression models. The size of a diamond, as well as some categorical descriptors (color and clarity) are listed for 308 diamonds, along with their sales price.

You can also find the original source of the data and a more detailed description, both found on the Journal of Statistics Education datasets website.

f2 <- "singapore-diamond-prices.txt"
raw_data <- read.fwf(
  file=glue("{f0}/{f1}/{f2}"), 
  widths=c(5,2,5,4,5))
di <- raw_data
names(di) <- c(
  "carat",
  "colour",
  "clarity",
  "certification",
  "price"
)

Make adjustments for exchange rate and inflation.

exchange_rate <- 1.72
inflation_rate <- 1.81
di$adjusted_price <- di$price*inflation_rate/exchange_rate
head(di)
##   carat colour clarity certification price adjusted_price
## 1  0.30     D    VS2            GIA   1302       1370.128
## 2  0.30     E    VS1            GIA   1510       1589.012
## 3  0.30     G    VVS1           GIA   1510       1589.012
## 4  0.30     G    VS1            GIA   1260       1325.930
## 5  0.31     D    VS1            GIA   1641       1726.866
## 6  0.31     E    VS1            GIA   1555       1636.366

Here are a few descriptive statistics

summary(di$carat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1800  0.3500  0.6200  0.6309  0.8500  1.1000
summary(di$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     638    1625    4215    5019    7446   16008
summary(di$adjusted_price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   671.4  1710.0  4435.6  5282.1  7835.6 16845.6
count(di, colour)
##   colour  n
## 1     D  16
## 2     E  44
## 3     F  82
## 4     G  65
## 5     H  61
## 6     I  40
count(di, clarity)
##   clarity  n
## 1   IF    44
## 2   VS1   81
## 3   VS2   53
## 4   VVS1  52
## 5   VVS2  78
count(di, certification)
##   certification   n
## 1          GIA  151
## 2          HRD   79
## 3          IGI   78

Suppose that you want to build a model that predicts the adjusted price using carat, colour, and clarity. You suspect, however, that the relationship between carat and adjusted price is nonlinear.

ggplot(data=di, aes(carat, adjusted_price)) +
  geom_point(pch=1) -> scatter1
scatter1

plot of chunk diamond-spline-05

First, examine a linear trend.

scatter1+geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'

plot of chunk diamond-spline-06

It looks like you could do a bit better than a simple linear fit, and a spline is a good option.

There are many options for splines in R, and you might have additional options beyond splines that also work well.

Choosing among the splines is tricky. To be honest, most of the choices will work out well and the choices end up being between various gradations of good to great.

A type of spline that I like a lot are the restricted cubic splines developed by Frank Harrell in the rms package. You implement these splines by adding an rcs function to your model formula. You use the number of knots to control the complexity from splines that represent a gentle deviation from normality to splines that wiggle all around and are far from linear. The default option for knots in rcs usually works fairly well.

There is a problem with direct interpretation, though. The coefficients for a linear fit are fairly easy to interpret.

linear_fit <- lm(adjusted_price ~ carat, data=di)
linear_coefficients <- tidy(linear_fit)
linear_coefficients
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   -2419.      167.     -14.5 1.28e- 36
## 2 carat         12206.      242.      50.4 3.04e-150

The slope, 1.2206 × 104, represents the estimated average change in price when carats increase by one unit.

Things don’t have such a nice interpretation with your spline model.

spline_fit <- lm(adjusted_price ~ rcs(carat), data=di)
tidy(spline_fit)
## # A tibble: 5 x 5
##   term               estimate std.error statistic p.value
##   <chr>                 <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)            24.2      544.    0.0445  0.965 
## 2 rcs(carat)carat      4600.      1940.    2.37    0.0183
## 3 rcs(carat)carat'    23215.     10824.    2.14    0.0328
## 4 rcs(carat)carat''  -52143.     31938.   -1.63    0.104 
## 5 rcs(carat)carat'''  56668.     44127.    1.28    0.200

The coefficients from the restricted cubic spline are pretty much uninterpretable. You have to visualize the spline graphically.

carat_sequence <- data.frame(
  carat = seq(min(di$carat), max(di$carat), length=100))
carat_sequence$prediction <- 
  predict(spline_fit, newdata=carat_sequence)
scatter1 + geom_line(
  data=carat_sequence,
  aes(carat, prediction))

plot of chunk diamond-spline-09

You might be interested in seeing if the spline fits better than a simple linear function. Fit both a linear term and a spline and then compare to the fit with just a linear term.

linear_plus_spline_fit <- lm(
  adjusted_price ~ carat + rcs(carat), 
  data=di)
anova(linear_fit, linear_plus_spline_fit)
## Analysis of Variance Table
## 
## Model 1: adjusted_price ~ carat
## Model 2: adjusted_price ~ carat + rcs(carat)
##   Res.Df       RSS Df Sum of Sq      F  Pr(>F)    
## 1    306 423220454                                
## 2    303 346288655  3  76931799 22.438 3.8e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is small, indicating that adding a spline to a linear fit significantly improves the prediction model.