# The hedging hyperprior for the normal distribution: unknown mean, known precision

## 2023-02-28

suppressMessages(suppressWarnings(library(MASS)))
suppressMessages(suppressWarnings(library(tidyverse)))
knitr::opts_chunk$set(echo=TRUE, fig.width=8, fig.height=2) I want to write a paper about the hedging hyperprior. I need to work out some simple examples first. Here is an example involving the normal distribution. ### A simple Bayesian analysis with an informative prior. There are lots of guides on the Internet on using a normal prior for data that is normally distributed with the population standard deviation $$\sigma$$ known or unknown. I used papers written by Kevin P. Murphy and Michael I. Jordan to get these formulas. For simplicity, all sums and products are assumed to go from i=1 to n. Consider a Bayesian analysis of an outcome variable that has a normal distribution. The density of the normal distribution is $$f(X; \mu, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}}e^{-(X-\mu)^2 / 2\sigma^2}$$ For this derivation, it is easier to write the density in terms of $$f(X; \mu, \tau) = \sqrt{\frac{\tau}{2 \pi}}e^{-\frac{1}{2}\tau (X-\mu)^2}$$ where $$\tau=\frac{1}{\sigma^2}$$ The parameter $$\tau$$ is called the precision. Note that some define the precision differently as the inverse of the standard deviation rather than the inverse of the variance. Assume that $$\tau$$ is known and equal to $$\tau_F$$. Other cases with mean known and precision unknown and with both mean and precision unknown appear on different web pages. For simplicity, we will drop any multiplicative constants not involving X or $$\mu$$ $$f(X; \mu) \propto e^{-\frac{1}{2}\tau_F(X-\mu)^2}$$ The conjugate prior for a normal distribution is also a normal distribution. $$g(\mu; \mu_0, \tau_0) \propto e^{- \frac{1}{2}\tau_0(\mu-\mu_0)^2}$$ where the prior mean is $$\mu_0$$ and the prior precision is $$\tau_0$$. For an informative prior distribution, $$\tau_0$$ is usually much smaller than $$\tau_F$$. The ratio $$n_0=\tau_F/\tau_0$$ is the effective prior sample size and is equivalent to adding $$n_0$$ pseudo data values of $$\mu_0$$ to the observed data. You have collected a sample of n data points, $$\mathbf{X} = X_1, X_2, ..., X_n$$. This gives you a likelihood of $$L(\mathbf{X} | \mu) \propto e^{- \frac{1}{2} \tau_F\sum(X_i-\mu)^2}$$ Rewrite this by incorporating $$\bar{X}$$ into the squared term $$L(\mathbf{X} | \mu) \propto e^{-\frac{1}{2} \tau_F \sum (X_i-\bar{X}+\bar{X}-\mu)^2}$$ When you square the term inside the sum, the cross-product cancels out, leaving you with $$L(\mathbf{X} | \mu) \propto e^{- \frac{1}{2} \tau_F (\sum(X_i-\bar{X})^2} e^{- \frac{n}{2} \tau_F (\bar{X}-\mu)^2}$$ $$L(\mathbf{X} | \mu) \propto e^{-\frac{n-1}{2}\tau_F S^2} e^{-\frac{n}{2} \tau_F (\mu-\bar{X})^2}$$ The posterior distribution is proportional to the product of the prior and the likelihood. $$h(\mu | \mathbf{X}) \propto e^{- \frac{1}{2} \tau_0 (\mu-\mu_0)^2} e^{- \frac{n-1}{2} \tau_F S^2} e^{-\frac{n}{2} \tau_F(\mu-\bar{X})^2)}$$ Drop the middle exponential term, which is constant with respect to $$\mu$$ and use high school algebra (completing the square) to get $$h(\mu | \mathbf{X}) \propto e^{ -(n\tau_F+\tau_0) (\mu- \frac {n\tau_F\bar{X}+\tau_0\mu_0} {n\tau_F+\tau_0})^2}$$ which is a normal distribution. The posterior mean $$\frac{n \tau_F \bar{X} + \tau_0 \mu_0}{n\tau_F+\tau_0}$$ and the posterior precision is $$n \tau_F + \tau_0$$ The posterior mean can be written as a weighted average of the prior mean and the mean of the observed data. $$\frac{n \tau_F}{n\tau_F+\tau_0}\bar{X} + \frac{\tau_0}{n\tau_F+\tau_0}\mu_0$$ ### A simple numerical example x_max <- 8 mu0 <- 4 tau0 <- 8 tauF <- 0.6 n <- 10 xbar <- 2 mu1 <- (tau0*mu0+n*tauF*xbar)/(tau0+n*tauF) tau1 <- tau0+n*tauF x_grid <- seq(0.01, x_max, length=1000) f <- function(x, mu, tau) { sqrt(tau/2*pi)*exp(-0.5*tau*(x-mu)^2) } g <- f prior05 <- qnorm(0.025, mu0, 1/sqrt(tau0)) prior95 <- qnorm(0.975, mu0, 1/sqrt(tau0)) post05 <- qnorm(0.025, mu1, 1/sqrt(tau1)) post95 <- qnorm(0.975, mu1, 1/sqrt(tau1))  Let’s illustrate this with an informative prior with $$\mu_0$$ = 4 and $$\tau_0$$ = 8. g_grid <- g(x_grid, mu0, tau0) outside <- x_grid < prior05 | x_grid > prior95 x_color <- ifelse(outside, "white", "gray") data.frame(x=x_grid, y=g_grid) %>% ggplot(aes(x, y)) + geom_segment(aes(xend=x, yend=0), color=x_color) + geom_line() + xlab(" ") + ylab(" ") This corresponds to a fairly informative prior distribution. This distribution places $$\mu$$ within 3.31 and 4.69 with 95% probability. Let’s see what happens with this informative prior if you observe 10 values of x with a mean of 2. The likelihood is f_grid <- f(xbar, x_grid, tauF)^n x_color <- "white" data.frame(x=x_grid, y=f_grid) %>% ggplot(aes(x, y)) + geom_line() + xlab(" ") + ylab(" ") Multiply these two together to get the posterior distribution. h_grid <- g(x_grid, mu1, tau1) outside <- x_grid < post05 | x_grid > post95 x_color <- ifelse(outside, "white", "gray") data.frame(x=x_grid, y=h_grid) %>% ggplot(aes(x, y)) + geom_segment(aes(xend=x_grid, yend=0), color=x_color) + geom_line() + xlab(" ") + ylab(" ") The posterior distribution is not close to either the prior distribution or the likelihood. It places $$\mu$$ within 2.62 and 3.67 with 95% probability. It excludes the prior mean of 4 and it excludes 2, the mean of the data. ### The hedging hyperprior When the likelihood of your data and your informative prior are in conflict, you may wish to downweight the informative prior. You can do this by including a hedging hyperprior. Define your prior distribution as a normal distributions with prior mean $$\mu0$$ and prior precision $$\tau_F+\lambda (\tau_0-\tau_F)$$ where $$\lambda$$ ranges between 0 and 1 and $$\tau_F$$ is less than $$\tau_0$$. When $$\lambda=0$$, the prior precision reduces to $$\tau_F$$ which is equivalent to a prior sample size of $$n_0=1$$. When $$lambda=1$$, the prior precision is $$\tau_0$$ which represents the full strength of the informative priori distribution. Place a hyperprior on the parameter $$\lambda$$. The uniform(0,1) distribution is a reasonable choice for the hyperprior, but others are possible. knitr::opts_chunk$set(echo=TRUE, fig.width=8, fig.height=8)
lambda_grid <- seq(0, 1, length=60)
mu_grid <- seq(0, x_max, length=60)

gstar <- function(mu, lambda) {g(mu, mu0, lambda*(tau0-tauF)+tauF)}
density <- outer(mu_grid, lambda_grid, gstar)
op <- par(bg = "white")
z_max <- max(density, na.rm=TRUE)
persp(
mu_grid,
lambda_grid,
density,
zlim=c(0, 1.1*z_max),
xlab="mu",
ylab="lambda",
mu = 135,
phi = 30,
expand=0.5,
ticktype="detailed")
## Warning in persp.default(mu_grid, lambda_grid, density, zlim = c(0, 1.1 * : "mu"
## is not a graphical parameter

This is the joint prior distribution of $$\mu_0$$ and $$\lambda$$.

It is hard to see at this scale, but the prior distribution is higher for small values of $$\lambda$$ when $$\mu$$ is very large or very small. This becomes more obvious if you remove the middle of the surface and rescale the Z-axis.

density <- outer(mu_grid, lambda_grid, gstar)
density[16:44, ] <- NA
op <- par(bg = "white")
z_max <- max(density, na.rm=TRUE)
persp(
mu_grid,
lambda_grid,
density,
zlim=c(0, 1.1*z_max),
xlab="mu",
ylab="lambda",
mu = 135,
phi = 30,
expand=0.5,
ticktype="detailed")
## Warning in persp.default(mu_grid, lambda_grid, density, zlim = c(0, 1.1 * : "mu"
## is not a graphical parameter

The likelihood function is constant across all values of $$\tau$$, so its surface looks like the following.

lambda_grid <- seq(0, 1, length=60)
mu_grid <- seq(0, x_max, length=60)

fstar <- function(mu, lambda) {f(xbar, mu, tau0)^n}
likelihood <- outer(mu_grid, lambda_grid, fstar)
z_max <- max(likelihood, na.rm=TRUE)
persp(
mu_grid,
lambda_grid,
likelihood,
zlim=c(0, 1.1*z_max),
xlab="mu",
ylab="lambda",
mu = 135,
phi = 30,
expand=0.5,
ticktype="detailed")
## Warning in persp.default(mu_grid, lambda_grid, likelihood, zlim = c(0, 1.1 * :
## "mu" is not a graphical parameter

The product of the prior and the likelihood is proportional to the posterior distribution.

lambda_grid <- seq(0, 1, length=60)
mu_grid <- seq(0, x_max, length=60)

hstar <- function(mu, lambda) {g(mu, mu0, lambda*(tau0-tauF) + tauF)*f(xbar, mu, tauF)^n}
density <- outer(mu_grid, lambda_grid, hstar)
z_max <- max(density, na.rm=TRUE)
persp(
mu_grid,
lambda_grid,
density,
zlim=c(0, 1.1*z_max),
xlab="mu",
ylab="lambda",
mu = 135,
phi = 30,
expand=0.5,
ticktype="detailed")
## Warning in persp.default(mu_grid, lambda_grid, density, zlim = c(0, 1.1 * : "mu"
## is not a graphical parameter

A contour plot shows where the probability is concentrated. The labels on the contour lines represent how much probabilty is found inside the contour line.

density <- density / sum(density, na.rm=TRUE)
density %>%
as.vector %>%
sort %>%
rev -> sorted_density
cumulative_p <- 0
for (p in sorted_density) {
cumulative_p <- p + cumulative_p
density[density==p] <- cumulative_p
}
contour(
mu_grid,
lambda_grid,
density,
labcex=1,
levels=c(50, 75, 90)/100,
xlab="mu",
ylab="lambda")

You can see that the hedging hyperprior did its job. The most probable values of $$\lambda$$ are small and the prior mean (4) is largely ignored in favor of the mean of the data (2).