The hedging hyperprior for the exponential distribution

2023-02-28

suppressMessages(suppressWarnings(library(MASS)))
suppressMessages(suppressWarnings(library(tidyverse)))
knitr::opts_chunk$set(echo=TRUE, fig.width=8, fig.height=2)

I want to write a paper about the hedging hyperprior. I need to work out some simple examples first. Here is an example involving the exponential distribution.

A simple Bayesian analysis with an informative prior

Consider a Bayesian analysis of an outcome variable that has an exponential distribution. The density of the exponential distribution is

\(f(x) = \theta^{-1} e^{-x / \theta}\)

You have an informative prior on \(\theta\) that has an inverse gamma distribution with prior parameters \(\alpha_0\) and \(\beta_0\).

\(g(\theta; \alpha_0, \beta_0) = \frac {\beta_0^{\alpha_0}} {\Gamma(\alpha_0)} \theta^{-\alpha_0-1} e^{-\beta_0 / \theta}\)

The mean of this prior distribution is

\(\frac{\beta_0}{\alpha_0-1}\)

You collect a sample of n data points, \(\mathbf{X} = X_1, X_2, ..., X_n\). This gives you a likelihood of

\(L(\mathbf{X}) = \prod_{i=1}^n f(X_i) = \theta^{-n} e^{-n \bar X / \theta}\)

and a posterior distribution of

\(h(\theta|\mathbf{X}) \propto \frac {\beta_0^{\alpha_0}} {\Gamma(\alpha_0)} \theta^{-\alpha_0-1} e^{-\beta_0 / \theta} \theta^{-n} e^{-n \bar X / \theta}\)

which simplifies to

\(h(\theta | \mathbf{X}) \propto \theta^{-\alpha_0-1-n} e^{-(\beta_0 + n\bar X) / \theta}\)

which, thanks to the miracle of conjugacy, is an inverse gamma distribution with posterior parameters

\(\alpha_1 = \alpha_0 + n\)

\(\beta_1 = \beta_0 + n\bar X\)
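
Here is a minimal numerical check of the conjugate update, using illustrative values for \(\alpha_0\), \(\beta_0\), \(n\), and \(\bar X\) (the _chk suffix marks them as throwaway): the ratio of the unnormalized posterior to the inverse gamma density with the updated parameters should be the same constant at every value of \(\theta\).

# minimal check of the conjugate update with illustrative (_chk) values
inv_gamma <- function(theta, a, b) {
  b^a / gamma(a) * theta^(-a-1) * exp(-b/theta)
}
a0_chk <- 3; b0_chk <- 4; n_chk <- 5; xbar_chk <- 1.2
theta_chk <- c(0.5, 1, 2)
unnormalized <- inv_gamma(theta_chk, a0_chk, b0_chk) *
  theta_chk^(-n_chk) * exp(-n_chk*xbar_chk/theta_chk)
# the ratio should be identical for every theta
unnormalized / inv_gamma(theta_chk, a0_chk + n_chk, b0_chk + n_chk*xbar_chk)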

The posterior mean

\(\frac{\beta_0 + n\bar X}{\alpha_0-1+n}\)

can be written as a weighted average of the prior mean and the mean of the observed data.

\(\Big( \frac {\alpha_0-1} {\alpha_0-1+n}\Big) \frac {\beta_0} {\alpha_0-1} + \Big( \frac {n} {\alpha_0-1+n}\Big) \bar X\)

A simple numerical example

x_max <- 3

a0 <- 41
b0 <- 44

n <- 60
xbar <- 2.1

a1 <- a0 + n
b1 <- b0 + n*xbar


x_grid <- seq(0.01, x_max, length=1000)

f <- function(x, theta) {
  theta^(-1) * 
    exp(-x/theta)
}

g <- function(theta, alpha, beta) {
  beta^alpha/gamma(alpha) *
    theta^(-alpha-1) * 
    exp(-beta/theta)
}

# inverse gamma quantiles are reciprocals of the opposite-tail gamma quantiles
prior05 <- 1/qgamma(0.975, a0, b0) 
prior95 <- 1/qgamma(0.025, a0, b0) 

post05 <- 1/qgamma(0.975, a1, b1) 
post95 <- 1/qgamma(0.025, a1, b1) 

Let’s illustrate this with an informative prior with \(\alpha_0\) = 41 and \(\beta_0\) = 44.

g_grid <- g(x_grid, a0, b0)   # prior density over the theta grid
outside <- x_grid < prior05 | x_grid > prior95
x_color <- ifelse(outside, "white", "gray")   # shade the central 95% interval
data.frame(x=x_grid, y=g_grid) %>%
  ggplot(aes(x, y)) +
    geom_segment(aes(xend=x, yend=0), color=x_color) +
    geom_line() + xlab(" ") + ylab(" ")

This corresponds to a fairly informative prior distribution. It places \(\theta\) between 0.81 and 1.50 with 95% probability.
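
These limits come straight from the inverse gamma quantiles computed above; printing them, along with the prior mean, makes the claim concrete.

round(c(prior05, prior95), 2)   # 95% prior interval for theta
b0/(a0-1)                       # prior mean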

Let’s see what happens with this informative prior if you observe 60 values of x with a mean of 2.1. The likelihood is

f_grid <- f(xbar, x_grid)^n   # likelihood of the data as a function of theta
data.frame(x=x_grid, y=f_grid) %>%
  ggplot(aes(x, y)) +
    geom_line() + xlab(" ") + ylab(" ")

Multiply these two together to get the posterior distribution.

h_grid <- g(x_grid, a1, b1)   # posterior density over the theta grid
outside <- x_grid < post05 | x_grid > post95
x_color <- ifelse(outside, "white", "gray")   # shade the central 95% interval
data.frame(x=x_grid, y=h_grid) %>%
  ggplot(aes(x, y)) +
    geom_segment(aes(xend=x, yend=0), color=x_color) +
    geom_line() + xlab(" ") + ylab(" ")

The posterior distribution is not close to either the prior distribution or the likelihood. It places \(\theta\) between 1.40 and 2.07 with 95% probability, which excludes both the prior mean of 1.10 and the data mean of 2.1.
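
The posterior mean shows the same compromise. Here is a quick check of the weighted-average identity from earlier, using the values already computed above; both expressions should equal 1.7.

w <- (a0-1)/(a0-1+n)                      # weight given to the prior mean
c(b1/(a1-1), w*b0/(a0-1) + (1-w)*xbar)    # posterior mean, computed two ways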

The hedging hyperprior

When the likelihood of your data and your informative prior are in conflict, you may wish to downweight the informative prior. You can do this by including a hedging hyperprior.

Define your prior distribution as an inverse gamma distribution with parameters \(1 + \lambda(\alpha_0 - 1)\) and \(1 + \lambda(\beta_0 - 1)\), where \(\lambda\) ranges between 0 and 1. Then place a hyperprior on the parameter \(\lambda\). The uniform(0, 1) distribution is a reasonable choice for the hyperprior, but others are possible.
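
As a quick sanity check on this parameterization, \(\lambda = 1\) should recover the original informative prior and \(\lambda = 0\) should collapse to a weakly informative inverse gamma(1, 1) prior. The helper below (hedged_pars, just an illustrative name) uses the a0 and b0 defined earlier.

# endpoints of the hedging parameterization
hedged_pars <- function(lambda) {
  c(alpha=1 + lambda*(a0-1), beta=1 + lambda*(b0-1))
}
hedged_pars(1)   # full weight: recovers the informative prior (41, 44)
hedged_pars(0)   # no weight: a weak inverse gamma(1, 1) prior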

knitr::opts_chunk$set(echo=TRUE, fig.width=8, fig.height=8)
lambda_grid <- seq(0, 1, length=60)
theta_grid <- seq(0, x_max, length=60)


# hedged prior: for each lambda, an inverse gamma in theta with hedged parameters
gstar <- function(theta, lambda) {g(theta, lambda*(a0-1)+1, lambda*(b0-1)+1)}
density <- outer(theta_grid, lambda_grid, gstar)
op <- par(bg = "white")
z_max <- max(density, na.rm=TRUE)
persp(
  theta_grid, 
  lambda_grid, 
  density, 
  zlim=c(0, 1.1*z_max),
  xlab="theta",
  ylab="lambda",
  theta = 135, 
  phi = 30, 
  expand=0.5,
  ticktype="detailed")

It is hard to see at this scale, but the prior density is higher for small values of \(\lambda\) when \(\theta\) is very large. This becomes more obvious if you truncate the central peak of the surface and rescale the Z-axis. There is a similar pattern on the far side of the graph: the prior again places more weight on small values of \(\lambda\) when \(\theta\) is very small.

This illustrates how the hedging hyperprior works its magic. The posterior puts more probability on large values of \(\lambda\) (giving close to full weight to the informative prior) when the observed data values are close to the prior mean, and more probability on small values of \(\lambda\) (giving much less weight to the informative prior) when the data values are much larger or much smaller than the prior mean.
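
You can see this numerically before building the full surfaces below. The rough sketch here (marginal_lambda is just an illustrative helper) approximates the marginal posterior of \(\lambda\) by summing the joint prior-times-likelihood over a fine grid of \(\theta\) values, once for a data mean that agrees with the prior mean (1.1) and once for the conflicting mean of 2.1 used in this example. The conflict row should concentrate its weight at small values of \(\lambda\), while the agree row should lean toward \(\lambda = 1\).

# rough marginal posterior of lambda: sum prior x likelihood over a theta grid
marginal_lambda <- function(xb, lambda_values) {
  theta_values <- seq(0.01, x_max, length=400)
  joint <- outer(theta_values, lambda_values,
    function(theta, lambda) {
      g(theta, lambda*(a0-1)+1, lambda*(b0-1)+1) * f(xb, theta)^n
    })
  colSums(joint) / sum(joint)
}
lam <- seq(0.05, 0.95, by=0.10)
round(rbind(agree=marginal_lambda(1.1, lam), conflict=marginal_lambda(2.1, lam)), 3)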

density <- outer(theta_grid, lambda_grid, gstar)
density[density>0.3] <- 0.3   # truncate the central peak so the tails are visible
op <- par(bg = "white")
z_max <- max(density, na.rm=TRUE)
persp(
  theta_grid, 
  lambda_grid, 
  density, 
  zlim=c(0, 1.1*z_max),
  xlab="theta",
  ylab="lambda",
  theta = 135, 
  phi = 30, 
  expand=0.5,
  ticktype="detailed")

The likelihood function is constant across all values of \(\lambda\), so its surface looks like the following.

lambda_grid <- seq(0, 1, length=60)
theta_grid <- seq(0, x_max, length=60)


fstar <- function(theta, lambda) {f(xbar, theta)^n}   # does not depend on lambda
likelihood <- outer(theta_grid, lambda_grid, fstar)
z_max <- max(likelihood, na.rm=TRUE)
persp(
  theta_grid, 
  lambda_grid, 
  likelihood, 
  zlim=c(0, 1.1*z_max),
  xlab="theta",
  ylab="lambda",
  theta = 135, 
  phi = 30, 
  expand=0.5,
  ticktype="detailed")

The product of the prior and the likelihood is proportional to the posterior distribution.

lambda_grid <- seq(0, 1, length=60)
theta_grid <- seq(0, x_max, length=60)


# unnormalized joint posterior: hedged prior times likelihood
hstar <- function(theta, lambda) {g(theta, lambda*(a0-1)+1, lambda*(b0-1)+1)*f(xbar, theta)^n}
density <- outer(theta_grid, lambda_grid, hstar)
z_max <- max(density, na.rm=TRUE)
persp(
  theta_grid, 
  lambda_grid, 
  density, 
  zlim=c(0, 1.1*z_max),
  xlab="theta",
  ylab="lambda",
  theta = 135, 
  phi = 30, 
  expand=0.5,
  ticktype="detailed")

A contour plot shows where the probability is concentrated. The labels on the contour lines show how much probability falls inside each contour.

# Normalize the grid and replace each cell's density with the cumulative
# probability of all cells at least as dense, so that a contour labeled 0.50
# encloses roughly 50% of the posterior probability.
density <- density / sum(density, na.rm=TRUE)
density %>%
  as.vector %>%
  sort %>%
  rev -> sorted_density
cumulative_p <- 0
for (p in sorted_density) {
  cumulative_p <- p + cumulative_p
  density[density==p] <- cumulative_p
}
contour(
  theta_grid, 
  lambda_grid, 
  density,
  labcex=1,
  levels=c(10, 25, 50, 75, 90)/100,
  xlab="theta",
  ylab="lambda")

You can see that the hedging hyperprior did its job. The most probable values of \(\lambda\) are small and the prior mean (1.1) is largely ignored in favor of the mean of the data (2.1).
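
As a rough numerical check, you can reuse hstar and the grids from above and treat the grid cells as a discrete probability distribution. The approximate posterior mean of \(\theta\) should land near the data mean, and the approximate posterior mean of \(\lambda\) should be small.

# approximate posterior means from the (theta, lambda) grid
joint <- outer(theta_grid, lambda_grid, hstar)
joint[is.na(joint)] <- 0           # theta = 0 produces NaN; give it zero weight
joint <- joint / sum(joint)
sum(rowSums(joint) * theta_grid)   # approximate posterior mean of theta
sum(colSums(joint) * lambda_grid)  # approximate posterior mean of lambda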