Working with Poisson regression

Poisson regression is a statistical model under the large umbrella of generalized linear models. I want to explain how Poisson regression works. I also will show and interpret an example.

suppressMessages(suppressWarnings(library(broom)))
suppressMessages(suppressWarnings(library(dplyr)))
suppressMessages(suppressWarnings(library(ggplot2)))
suppressMessages(suppressWarnings(library(foreign)))
suppressMessages(suppressWarnings(library(magrittr)))

The linear regression model

A linear regression model is written as

$Y_i=\beta_0+X_i\beta_1+\epsilon_i$

and has two assumptions that are often violated:

the $\epsilon_i$ are normally distributed, and
the variance is the same for all $\epsilon_i$.

This led to a variety of alternative approaches that worked better, especially for dependent variables that represented proportions or counts.

The generalized linear model

In 1982, McCullagh and Nelder wrote a book outlining the generalized linear model, an approach to regression that would work well both for situations that met the assumptions of linear regression and for situations that did not meet those assumptions.

It was the sort of breakthrough that Physicists are still hoping for in the grand unified theory of the electromagnetic, strong, and weak forces under one model.

McCullagh and Nelder developed a model that works for any statistical distribution in a very broad class of distributions known as the exponential family. The exponential family of distributions includes both continuous and categorical distributions. The only requirement is that the probability density function (for continuous distributions) or the probability mass function (for categorical variables) can be factored in a a way that isolates the parameter(s) of the distribution in the following way:

$ln(f(y; \theta))=d(\theta)e(y)+g(\theta)+h(y)$

The Poisson distribution

$f(y; \theta)=\theta^{-y}e^{-\theta}/y!$ for y=0, 1, …$

clearly fits into the exponential family, because

$lnf(y; \theta)=y ln(\theta)-\theta-ln(y!)$

Other distributions that fit into this family include the

bernoulli
beta
chi-squared
exponential
gamma
geometric
normal

There are some distributions which are not part of the exponential family. The t-distribution is one example.

If you know that a distribution is part of the exponential family, you know the general form of the maximum likelihood estimates and find a solution using iteratively reweighted least squares.

More on Poisson regression

Counts are never negative, so we need to insure that no matter what the values of the covariates in a Poisson regression model, the predicted values produced by this model are never negative. A simple way to achieve this is to calculate the regression coefficients on a log scale.

The logarithm is an interesting function that converts multiplication into addition.

$log(ab)=log(a)+log(b)$

Back before computers, tables of logarithms simplified complex calculations. To compute a product of 4,156 and 2,023, you could multiply things out directly, but it would look something like

   2141
 x 2123
 ------
   6423
  4282
 2141
4282
-------
4545343

Instead, you could look up the base 10 logarithm of each number, add them together, and then convert back from the log scale to the original scale.

log10(2141)

## [1] 3.330617

log10(2123)

## [1] 3.32695

log10(2141)+log10(2123)

## [1] 6.657567

10^(log10(2141)+log10(2123))

## [1] 4545343

It only saves you a little bit of time, but back when mathematicians had to do hundreds or thousands of these calculations, it added up.

There’s a joke along these lines.

After the flood waters receded, Noah’s ark landed on solid ground. Noah opened the door and let all the animals out with the exhortation “Go forth and multiply.” There were two snakes, however, that did not want to leave the ark. Noah looked at them and said, “I told you to go forth and multiply.” The snakes looked sadly at him and said “We can’t multiply. We’re adders.” So Noah left the ark and found some wood to build a small flat platform. “Here.” he said. “This is a log table. It will allow adders to multiply.”

The Poisson regression model produces estimates on a log scale. This implies that the relationships are effectively multiplicative when transformed back to the original scale.

Simple example

Francis Huang has a nice web page showing two datasets and how to analyze them using the Poisson regression model as well as several more sophisticated models. The first data set is available for anyone to download. The link I originally used no longer works.

fn <- "http://faculty.missouri.edu/huangf/data/poisson/articles.csv"
articles <- read.csv(fn)
head(articles)

Here is a different link for the same file, used at German Rodriguez’s website. I’ve placed a copy on my website in case this link breaks. Also, this dataset is available with the pscl library of R under the name, bioChemists.

fn <- "http://www.pmean.com/00files/couart2.dta"
articles <- read.dta(fn)
head(articles)

##   art   fem     mar kid5  phd ment
## 1   0   Men Married    0 2.52    7
## 2   0 Women  Single    0 2.05    6
## 3   0 Women  Single    0 3.75    6
## 4   0   Men Married    1 1.18    3
## 5   0 Women  Single    0 3.75   26
## 6   0 Women Married    2 3.59    2

The variables are + gender (fem), + marital status (mar), + number of children (kid5), + prestige of graduate program (phd), + the number of articles published by the individual’s mentor (ment) + the number of articles published by the scientist (art: the outcome).

To fit a Poisson regression model in R, use the glm function with the argument, family=“poisson”.

pois1 <- glm(art ~ fem + ment + phd + mar + kid5, family=poisson, data = articles)
summary(pois1)

## 
## Call:
## glm(formula = art ~ fem + ment + phd + mar + kid5, family = poisson, 
##     data = articles)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5672  -1.5398  -0.3660   0.5722   5.4467  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.304617   0.102981   2.958   0.0031 ** 
## femWomen    -0.224594   0.054613  -4.112 3.92e-05 ***
## ment         0.025543   0.002006  12.733  < 2e-16 ***
## phd          0.012823   0.026397   0.486   0.6271    
## marMarried   0.155243   0.061374   2.529   0.0114 *  
## kid5        -0.184883   0.040127  -4.607 4.08e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1817.4  on 914  degrees of freedom
## Residual deviance: 1634.4  on 909  degrees of freedom
## AIC: 3314.1
## 
## Number of Fisher Scoring iterations: 5

The interpretation of the coefficients is tricky. The Poisson regression model is linear on the log scale. If you take the anti-log, you will convert this model to a multiplicative model. Here are the coefficents on the log scale.

pois1 %>%
  tidy %>%
  mutate(lower_limit=estimate-1.96*std.error) %>%
  mutate(upper_limit=estimate+1.96*std.error) %>%
  select(term, estimate, lower_limit, upper_limit) -> log_coefficients
log_coefficients

## # A tibble: 6 x 4
##   term        estimate lower_limit upper_limit
##   <chr>          <dbl>       <dbl>       <dbl>
## 1 (Intercept)   0.305       0.103       0.506 
## 2 femWomen     -0.225      -0.332      -0.118 
## 3 ment          0.0255      0.0216      0.0295
## 4 phd           0.0128     -0.0389      0.0646
## 5 marMarried    0.155       0.0349      0.276 
## 6 kid5         -0.185      -0.264      -0.106

log_coefficients %>%
  mutate(estimate=exp(estimate)) %>%
  mutate(lower_limit=exp(lower_limit)) %>%
  mutate(upper_limit=exp(upper_limit)) -> multiplicative_coefficients
multiplicative_coefficients

## # A tibble: 6 x 4
##   term        estimate lower_limit upper_limit
##   <chr>          <dbl>       <dbl>       <dbl>
## 1 (Intercept)    1.36        1.11        1.66 
## 2 femWomen       0.799       0.718       0.889
## 3 ment           1.03        1.02        1.03 
## 4 phd            1.01        0.962       1.07 
## 5 marMarried     1.17        1.04        1.32 
## 6 kid5           0.831       0.768       0.899

Analysis of rates

Often the Poisson observations are observed over varying numbers of patients, varying amounts of time, or both. Here’s an example. Auto insurance claims in Sweden are borken down into several subgroups, as described on this web page.

Here are the variables:

Kilometres Kilometres travelled per year
- 1: < 1000
- 2: 1000-15000
- 3: 15000-20000
- 4: 20000-25000
- 5: > 25000
Zone Geographical zone
- 1: Stockholm, G?teborg, Malm? with surroundings
- 2: Other large cities with surroundings
- 3: Smaller cities with surroundings in southern Sweden
- 4: Rural areas in southern Sweden
- 5: Smaller cities with surroundings in northern Sweden
- 6: Rural areas in northern Sweden
- 7: Gotland
Bonus No claims bonus. Equal to the number of years, plus one, since last claim
Make 1-8 represent eight different common car models. All other models are combined in class 9
Insured Number of insured in policy-years
Claims Number of claims
Payment Total value of payments in Skr

fn <- "http://www.statsci.org/data/general/motorins.txt"
swe <- read.delim(fn, header=TRUE, sep="\t")
head(swe)

##   Kilometres Zone Bonus Make Insured Claims Payment
## 1          1    1     1    1  455.13    108  392491
## 2          1    1     1    2   69.17     19   46221
## 3          1    1     1    3   72.88     13   15694
## 4          1    1     1    4 1292.39    124  422201
## 5          1    1     1    5  191.01     40  119373
## 6          1    1     1    6  477.66     57  170913

The number of drives and the months that they are covered by insurance varies from group to group. the measure, Insured, calculates the number of policy-years in each group. We want to estimate the effect of certain variables on the rate of claims (number of claims per policy year). You calculate the rate by including an offset variable equal to the natural log of policy years.

pois2 <- 
  glm(
    Claims~offset(log(Insured)) + factor(Kilometres) + factor(Zone) + Bonus + factor(Make),
    data=swe, family=poisson)
summary(pois2)

## 
## Call:
## glm(formula = Claims ~ offset(log(Insured)) + factor(Kilometres) + 
##     factor(Zone) + Bonus + factor(Make), family = poisson, data = swe)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.9411  -0.9325  -0.2387   0.6105   9.4199  
## 
## Coefficients:
##                      Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)         -1.739938   0.013630 -127.651  < 2e-16 ***
## factor(Kilometres)2  0.204203   0.007511   27.188  < 2e-16 ***
## factor(Kilometres)3  0.310530   0.008631   35.977  < 2e-16 ***
## factor(Kilometres)4  0.395459   0.012024   32.889  < 2e-16 ***
## factor(Kilometres)5  0.567998   0.012801   44.372  < 2e-16 ***
## factor(Zone)2       -0.239438   0.009495  -25.216  < 2e-16 ***
## factor(Zone)3       -0.389831   0.009668  -40.320  < 2e-16 ***
## factor(Zone)4       -0.585540   0.008650  -67.696  < 2e-16 ***
## factor(Zone)5       -0.328624   0.014529  -22.618  < 2e-16 ***
## factor(Zone)6       -0.529803   0.011874  -44.620  < 2e-16 ***
## factor(Zone)7       -0.735746   0.040697  -18.079  < 2e-16 ***
## Bonus               -0.198056   0.001288 -153.741  < 2e-16 ***
## factor(Make)2        0.078421   0.021239    3.692 0.000222 ***
## factor(Make)3       -0.240878   0.025092   -9.600  < 2e-16 ***
## factor(Make)4       -0.656632   0.024181  -27.155  < 2e-16 ***
## factor(Make)5        0.158690   0.020234    7.843 4.41e-15 ***
## factor(Make)6       -0.340402   0.017374  -19.593  < 2e-16 ***
## factor(Make)7       -0.056424   0.023343   -2.417 0.015642 *  
## factor(Make)8       -0.037993   0.031603   -1.202 0.229286    
## factor(Make)9       -0.068676   0.009955   -6.899 5.25e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 34070.6  on 2181  degrees of freedom
## Residual deviance:  4025.1  on 2162  degrees of freedom
## AIC: 11703
## 
## Number of Fisher Scoring iterations: 4

The coefficients are easier to interpret after a transformation.

pois2 %>%
  tidy %>%
  mutate(lower_limit=estimate-1.96*std.error) %>%
  mutate(upper_limit=estimate+1.96*std.error) %>%
  select(term, estimate, lower_limit, upper_limit) %>%
  mutate(estimate=exp(estimate)) %>%
  mutate(lower_limit=exp(lower_limit)) %>%
  mutate(upper_limit=exp(upper_limit)) -> multiplicative_coefficients
multiplicative_coefficients

## # A tibble: 20 x 4
##    term                estimate lower_limit upper_limit
##    <chr>                  <dbl>       <dbl>       <dbl>
##  1 (Intercept)            0.176       0.171       0.180
##  2 factor(Kilometres)2    1.23        1.21        1.24 
##  3 factor(Kilometres)3    1.36        1.34        1.39 
##  4 factor(Kilometres)4    1.49        1.45        1.52 
##  5 factor(Kilometres)5    1.76        1.72        1.81 
##  6 factor(Zone)2          0.787       0.773       0.802
##  7 factor(Zone)3          0.677       0.664       0.690
##  8 factor(Zone)4          0.557       0.547       0.566
##  9 factor(Zone)5          0.720       0.700       0.741
## 10 factor(Zone)6          0.589       0.575       0.603
## 11 factor(Zone)7          0.479       0.442       0.519
## 12 Bonus                  0.820       0.818       0.822
## 13 factor(Make)2          1.08        1.04        1.13 
## 14 factor(Make)3          0.786       0.748       0.826
## 15 factor(Make)4          0.519       0.495       0.544
## 16 factor(Make)5          1.17        1.13        1.22 
## 17 factor(Make)6          0.711       0.688       0.736
## 18 factor(Make)7          0.945       0.903       0.989
## 19 factor(Make)8          0.963       0.905       1.02 
## 20 factor(Make)9          0.934       0.916       0.952

Extensions

McCullagh and Nelder also extended their results to an even broader setting using an approach known as quasi likelihood. In a quasi likelihood model, you do not specify a distribution, but rather a link function that converts the nonlinear regression relationship to a linear one, and a function that describes the mean-variance relationship. The latter is very important. In many statistical models, there is a specific type of heterogeneity where the mean and variance march together in lock step. Groups that have larger means also have larger variances. Groups that have smaller means also have smaller variances.

In the Poisson model, we specify a mean-variance relationship that hte identify relationship. If a group has a mean that is 12, the variance is assumed to be 12 as well. If the mean is 5, then the variance is 5. That works fairly well for some data, but other times it does not work so well.

There are many alternatives. You can assume that the variance is not equal to the mean but equal to the mean times a mutliplicative constant which you can estimate from the data.

Notice, for example, in the first data set that

mean(articles$art)

## [1] 1.692896

var(articles$art)

## [1] 3.709742

The variance looks to be about twice as large as the mean. Additional evidence of this is provided by the deviance statistic. The deviance is a quantity produced by a generalized linear model and should equal the degrees of freedom in a Poisson regression model.

glance(pois1)

## # A tibble: 1 x 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1         1817.     914 -1651. 3314. 3343.    1634.         909   915

You will notice here that the deviance is about twice as big as the degrees of freedom.

You can fix this problem in seveal ways. First, you can specify a quasi-likelihood that behaves like the Poisson except that the function specifying the mean-variance relationship has an extra parameter, estimated from the data.

You can also use a counting distribution, the negative binomial distribution, that allows for greater flexibility in the mean-variance relationship.

Finally, you can adjust the zero probability upward from what a Poisson model would expect. This is commonly thought of as a mixture distribution where there is a certain probability p, of getting exactly zero, and a probability 1-p or getting a Poisson random variable (which can sometimes equal zero, but only sometimes). These models are known as zero-inflated models.

The web page cited earlier has more details about these models.