With the generalized linear model, you have the option of choosing a model that works well with a variety of different distributions (normal, Poisson, Gamma, etc.). But you also have the option of working outside of these distributions. You can fit models that do not specify the distribution, but which fit a particular transformation of the linear effect and particular mean-variance relationship.
suppressMessages(suppressWarnings(library(broom)))
suppressMessages(suppressWarnings(library(dplyr)))
suppressMessages(suppressWarnings(library(ggplot2)))
suppressMessages(suppressWarnings(library(foreign)))
suppressMessages(suppressWarnings(library(magrittr)))
Link and variance functions
A distribution is said to be in the exponential family
\(ln f(y, \theta)=d(\theta)e(y)+g(\theta)+h(y)\)
but sometimes this is rewritten in the form
\(ln f(y, \theta)=\frac{\theta y+b(\theta)}{a(\phi)}+c(y, \phi)\)
where \(\phi\) is a dispersion or scale parameter. This is called the canonical form. There are names for these parameters and functions.
- \(\theta\) is the natural parameter,
- \(\phi\) is the dispersion parameter,
- $a( ) is the variance function, and
- $g( ) = b’( )^{-1} is the link function.
If you also define
\(\mu=E[Y]\)
then the link function, g, and its inverse, b, define how the regression model works.
For Poisson regression, the link function is
\(ln(f(y, \lambda) = ln(\frac{\lambda^y e^{-\lambda}}{y!})=\lambda ln(\lambda)-\lambda-ln(y!)\)
To write this in the canonical form, you need to define the natural parameter as
\(\theta=ln(\lambda)\)
and
\(a(\phi)=1\)
The link function is
\(g(\mu)=ln(\mu)\)
and the variance function is
\(V(\mu)=\mu\)
These functions are derived from the form of the Poisson distribution, and the estimation methods in the generalized linear model depend only on the link and variance functions.
In 1974, Robert Weddeburn developed a theory that would work for any link and variance function, even if it could not be derived from a particular probability distribution. His approach used the same principles as maximum likelihood, but it could not be called a likelihood approach since there was no probability distribution. He called his approach quasi-likelihood and showed that it created estimates with reasonable properties.
The quasi Poisson model
If you modify the variance function of Poisson regression, from
\(V(\mu)=\mu\)
to
\(V(\mu)=\phi \mu\)
and keep the link function the same, then you get a quasi Poisson model. There is no simple probability distribution that would produce this link and variance function, but it seems like a more reasonable approach, because there is more flexibility.
For technical reasons, the parameter \(\phi\) has to be greater than one, implying that the dispersion of the quasiPoisson model is greater than the dispersion of the Poissson model. The quasiPoisson model is often called an overdispersion model.
There are two ways to estimate \(\phi\). The first is very simple and direct.
\(\hat{\phi}= dispersion / df\)
The amount of overdispersion is equal to much much larger the dispersion is than the degrees of freedom.
A second approach estimates \(\phi\) iteratively along with the other terms in the regression model.
One important limitation of the quasi likelihood approach is that there is no natural way to estimate AIC (Akaike’s Information Criteria) an important statistic for comparing different models.
Start with Poisson regression
It is easiest to understand the quasiPoisson model by comparing it to the Poisson model.
Francis Huang has a nice web page showing two datasets and how to analyze them using the Poisson regression model as well as several more sophisticated models. The first data set is available for anyone to download. The link I originally used no longer works.
fn <- "http://faculty.missouri.edu/huangf/data/poisson/articles.csv"
articles <- read.csv(fn)
head(articles)
Here is a different link for the same file, used at German Rodriguez’s website. I’ve placed a copy on my website in case this link breaks. Also, this dataset is available with the pscl library of R under the name, bioChemists.
fn <- "http://www.pmean.com/00files/couart2.dta"
articles <- read.dta(fn)
head(articles)
## art fem mar kid5 phd ment
## 1 0 Men Married 0 2.52 7
## 2 0 Women Single 0 2.05 6
## 3 0 Women Single 0 3.75 6
## 4 0 Men Married 1 1.18 3
## 5 0 Women Single 0 3.75 26
## 6 0 Women Married 2 3.59 2
The variables are + gender (fem), + marital status (mar), + number of children (kid5), + prestige of graduate program (phd), + the number of articles published by the individual’s mentor (ment) + the number of articles published by the scientist (art: the outcome).
To fit a Poisson regression model in R, use the glm function with the argument, family=“poisson”.
pois1 <- glm(art ~ fem + ment + phd + mar + kid5, family=poisson, data = articles)
summary(pois1)
##
## Call:
## glm(formula = art ~ fem + ment + phd + mar + kid5, family = poisson,
## data = articles)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.5672 -1.5398 -0.3660 0.5722 5.4467
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.304617 0.102981 2.958 0.0031 **
## femWomen -0.224594 0.054613 -4.112 3.92e-05 ***
## ment 0.025543 0.002006 12.733 < 2e-16 ***
## phd 0.012823 0.026397 0.486 0.6271
## marMarried 0.155243 0.061374 2.529 0.0114 *
## kid5 -0.184883 0.040127 -4.607 4.08e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1817.4 on 914 degrees of freedom
## Residual deviance: 1634.4 on 909 degrees of freedom
## AIC: 3314.1
##
## Number of Fisher Scoring iterations: 5
The interpretation of the coefficients should be done after exponentiating their values.
pois1 %>%
tidy %>%
mutate(lower_limit=estimate-1.96*std.error) %>%
mutate(upper_limit=estimate+1.96*std.error) %>%
select(term, estimate, lower_limit, upper_limit) %>%
mutate(estimate=exp(estimate)) %>%
mutate(lower_limit=exp(lower_limit)) %>%
mutate(upper_limit=exp(upper_limit)) -> pois1_coefficients
pois1_coefficients
## # A tibble: 6 x 4
## term estimate lower_limit upper_limit
## <chr> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.36 1.11 1.66
## 2 femWomen 0.799 0.718 0.889
## 3 ment 1.03 1.02 1.03
## 4 phd 1.01 0.962 1.07
## 5 marMarried 1.17 1.04 1.32
## 6 kid5 0.831 0.768 0.899
Fitting a quasi-Poisson model
quasi1 <- glm(art ~ fem + ment + phd + mar + kid5, family=quasipoisson, data = articles)
summary(quasi1)
##
## Call:
## glm(formula = art ~ fem + ment + phd + mar + kid5, family = quasipoisson,
## data = articles)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.5672 -1.5398 -0.3660 0.5722 5.4467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.304617 0.139273 2.187 0.028983 *
## femWomen -0.224594 0.073860 -3.041 0.002427 **
## ment 0.025543 0.002713 9.415 < 2e-16 ***
## phd 0.012823 0.035700 0.359 0.719544
## marMarried 0.155243 0.083003 1.870 0.061759 .
## kid5 -0.184883 0.054268 -3.407 0.000686 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 1.829006)
##
## Null deviance: 1817.4 on 914 degrees of freedom
## Residual deviance: 1634.4 on 909 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
Notice that there is no value listed for AIC.
quasi1 %>%
tidy %>%
mutate(lower_limit=estimate-1.96*std.error) %>%
mutate(upper_limit=estimate+1.96*std.error) %>%
select(term, estimate, lower_limit, upper_limit) %>%
mutate(estimate=exp(estimate)) %>%
mutate(lower_limit=exp(lower_limit)) %>%
mutate(upper_limit=exp(upper_limit)) -> quasi1_coefficients
quasi1_coefficients
## # A tibble: 6 x 4
## term estimate lower_limit upper_limit
## <chr> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.36 1.03 1.78
## 2 femWomen 0.799 0.691 0.923
## 3 ment 1.03 1.02 1.03
## 4 phd 1.01 0.944 1.09
## 5 marMarried 1.17 0.993 1.37
## 6 kid5 0.831 0.747 0.924
If you compare these confidence intervals to the earlier ones, they are wider. This reflects the overdispersion that is only accounted for properly by the quasi Poisson model.
References
David Lillis. Generalized Linear Models in R, Part 7: Checking for Overdispersion in Count Regression. The Analysis Factor blog. Available in html format