suppressMessages(suppressWarnings(library(broom)))
suppressMessages(suppressWarnings(library(dplyr)))
suppressMessages(suppressWarnings(library(magrittr)))
Downward bend
Suppose you want to fit a function that is flat from x=a to x=b and which declines linearly from x=b to x=c.
Here is the formula:
\(f(x) = d - e*(x-b)_+\)
where
\(u_+ = u\) if \(u \ge 0\), \(= 0\) if \(u<0\)
Here is some simulated data for testing this function.
library(ggplot2)
x <- 1:20
y <- c(rep(40, 15), 40-(1:5)^2)+round(rnorm(10), 1)
simulated_example <- data.frame(x, y)
simulated_example
## x y
## 1 1 40.2
## 2 2 40.8
## 3 3 40.0
## 4 4 39.2
## 5 5 39.6
## 6 6 39.4
## 7 7 39.7
## 8 8 40.2
## 9 9 40.3
## 10 10 39.7
## 11 11 40.2
## 12 12 40.8
## 13 13 40.0
## 14 14 39.2
## 15 15 39.6
## 16 16 38.4
## 17 17 35.7
## 18 18 31.2
## 19 19 24.3
## 20 20 14.7
ggplot(simulated_example, aes(x, y)) +
geom_point()
Fit a linear spline
I want a function that is flat for the values x = 0 to 15, and that drops linearly for x > 15. I also want to function to be continuous (no abrupt jumps).
To fit this function, you need a special X matrix.
linear_spline_ivs <- data.frame(x0=rep(1, 20), x1=c(rep(0, 15), 1:5))
linear_spline_ivs
## x0 x1
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
## 7 1 0
## 8 1 0
## 9 1 0
## 10 1 0
## 11 1 0
## 12 1 0
## 13 1 0
## 14 1 0
## 15 1 0
## 16 1 1
## 17 1 2
## 18 1 3
## 19 1 4
## 20 1 5
Surprisingly, even though the function you are trying to fit is not a linear function of x, you can still fit it with the lm function, as long as you have the correct independent variable matrix.
simple_spline_data <-
data.frame(
x1=linear_spline_ivs$x1,
x=simulated_example$x,
y=simulated_example$y)
linear_spline_model <- lm(y~x1, data=simple_spline_data)
tidy(linear_spline_model)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 40.3 0.455 88.5 3.27e-25
## 2 x1 -4.19 0.275 -15.3 9.74e-12
To get the plot of predictions for this model and future models accurate, you need to create values between the existing values.
x_extra <- seq(0, 20, length=1000)
data.frame(
x=x_extra,
x1=pmax(0, x_extra-15),
x1sq=pmax(0, x_extra-15)^2) %>%
full_join(simulated_example, by="x") -> enhanced_data
linear_spline_predictions <- augment(linear_spline_model, newdata=enhanced_data)
ggplot(linear_spline_predictions, aes(x, y)) +
geom_point() +
geom_line(y=linear_spline_predictions$.fitted)
## Warning: Removed 999 rows containing missing values (geom_point).
## Warning: Removed 19 row(s) containing missing values (geom_path).
Fit a quadratic spline
Suppose you want to fit a function which is flat up to a certain point and then declines quadratically. You could do this with the following model.
simple_spline_data %>%
mutate(x1sq=x1*x1) -> quadratic_spline_data
quadratic_spline_model <- lm(y~x1sq, data=quadratic_spline_data)
tidy(quadratic_spline_model)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 39.9 0.117 342. 8.74e-36
## 2 x1sq -0.997 0.0167 -59.9 3.59e-22
quadratic_spline_predictions <- augment(quadratic_spline_model, newdata=enhanced_data)
ggplot(quadratic_spline_predictions, aes(x, y)) +
geom_point() +
geom_line(y=quadratic_spline_predictions$.fitted)
## Warning: Removed 999 rows containing missing values (geom_point).
## Warning: Removed 19 row(s) containing missing values (geom_path).
Notice that the quadratic model is smoother. It does not have a sharp elbow at the transition. Smoothness is often preferred in model fitting, though there are exceptions.
Fit a cubic spline
Here’s a final example of very simple splines. There is a “magic” cubic polynomial, \(3x^2-2x^3\). It rises smoothly from 0 at x=0, to 1 at x=1. It is flat (zero first derivative) at x=0 and x=1. This is what the function looks like.
x <- seq(0, 1, length=100)
y <- 3*x^2-2*x^3
data.frame(x=x, y=y) %>%
ggplot(aes(x, y)) +
geom_line()
You can add a constant and/or multiply by a constant to get different starting and ending points. So, for example,
\(0.75 - 0.5(3x^2-2x^3)\)
will start at 0.75 and drop 0.5 units to end at 0.25.
x <- seq(0, 1, length=100)
y <- 3*x^2-2*x^3
data.frame(x=x, y=y) %>%
mutate(y=0.75-0.5*y) %>%
ggplot(aes(x, y)) +
geom_line() +
expand_limits(y=0:1)
I will use this function to show transition probabilities in a graph.
In a study of dog walking. Data was collected before and during the COVID pandemic. A “dog walker” was classified as someone who spent 150 minutes or more per week walking their dog. Did the proportion of dog walkers increase or decrease during the pandemic?
Source: Wallengren O, Bosaeus I, Frändin K, Lissner L, Falk Erhag H, Wetterberg H, Rydberg Sterner T, Rydén L, Rothenberg E, Skoog I. Comparison of the 2010 and 2019 diagnostic criteria for sarcopenia by the European Working Group on Sarcopenia in Older People (EWGSOP) in two cohorts of Swedish older adults. BMC Geriatr. 2021 Oct 26;21(1):600. doi: 10.1186/s12877-021-02533-y. PMID: 34702174; PMCID: PMC8547086. Available in [html format][wal1] or [pdf format][wal2].
Here’s the data.
During
Pandemic
Before
Pandemic Dog walker Non-walker Total
Dog walker 56 72 128
Non-walker 40 48 96
Total 96 120 216
Let’s convert these to cell percentages.
During
Pandemic
Before
Pandemic Dog walker Non-walker Total
Dog walker 26% 33% 59%
Non-walker 19% 22% 41%
Total 45% 55% 100%
Vucinic M, Vucicevic M, Nenadovicć K. THE COVID-19 PANDEMIC AFFECTS OWNERS WALKING WITH THEIR DOGS. J Vet Behav. 2021 Oct 20. doi: 10.1016/j.jveb.2021.10.009. Epub ahead of print. PMID: 34690614; PMCID: PMC8527592. Avialble in html format or pdf format.
par(mar=rep(0.1,4))
plot(c(-1,2), c(0, 216), type="n", axes=FALSE, xlab="", ylab="")
polygon(x=c(-1, 0, 0, -1), y=c( 0, 0, 88, 88), col="pink", border=NA)
polygon(x=c( 2, 1, 1, 2), y=c( 0, 0, 120, 120), col="pink", border=NA)
polygon(x=c( 0, 1, 1, 0), y=c( 0, 0, 48, 48), col="pink", border=NA)
polygon(x=c(-1, 0, 0, -1), y=c(216, 216, 88, 88), col="lightgreen", border=NA)
polygon(x=c( 2, 1, 1, 2), y=c(216, 216, 120, 120), col="lightgreen", border=NA)
polygon(x=c( 0, 1, 1, 0), y=c(216, 216, 160, 160), col="lightgreen", border=NA)
segments(-1, 0, 2, 0)
segments(-1, 216, 2, 216)
segments(-1, 88, 0, 88)
segments( 1, 120, 2, 120)
segments(-0.5, 48, 0, 48, lty="dotted")
segments(-0.5, 160, 0, 160, lty="dotted")
segments( 1.5, 48, 1, 48, lty="dotted")
segments( 1.5, 160, 1, 160, lty="dotted")
x <-seq(0,1, length=100)
y <- 3*x^2-2*x^3
for (i in 1:length(x)) {
segments(x[i], 160-40*y[i], x[i], 88-40*y[i], col=rgb(floor(colorRamp(c("lightgreen", "pink"))(x[i]))/255), lwd=3)
}
lines(x, 88 - 40*y)
lines(x, 160 - 40*y)
lines(x, 160 + 0*y)
for (i in 1:length(x)) {
segments(x[i], 48+72*y[i], x[i], 88+72*y[i], col=rgb(floor(colorRamp(c("pink", "lightgreen"))(x[i]))/255), lwd=3)
}
lines(x, 48 + 0*y)
lines(x, 48 + 72*y)
lines(x, 88 + 72*y)
text(-0.95, 49, "No dog walking", srt=90)
text(-0.95, 157, "Dog walking", srt=90)
text(-0.5, 210, "Before pandemic")
text( 1.5, 210, "During pandemic")
text(-0.7, 48, "41%")
text(-0.2, 24, "22%")
text(-0.2, 68, "19%")
text( 1.2, 24, "22%")
text( 1.2, 84, "33%")
text( 1.7, 48, "55%")
text(-0.7,160, "59%")
text(-0.2,188, "26%")
text(-0.2,124, "33%")
text( 1.2,188, "26%")
text( 1.2,140, "19%")
text( 1.7,160, "45%")
Notice how the 19% slides smoothly upward and the 33% slides smoothly downward. So you can see that while some people took up dog walking during the pandemic, this was more than offset by the number who dropped dog walking during the pandemic.
Aris Perperoglou, Willi Sauerbrei, Michal Abrahamowicz, Matthias Schmid. A review of spline function procedures in R. BMC Medical Research Methodology, 2019-03-06, 19(46). Available in html format or pdf format.