# Working with Poisson regression

## 2021-08-30

Poisson regression is a statistical model under the large umbrella of generalized linear models. I want to explain how Poisson regression works. I also will show and interpret an example.

suppressMessages(suppressWarnings(library(broom)))
suppressMessages(suppressWarnings(library(dplyr)))
suppressMessages(suppressWarnings(library(ggplot2)))
suppressMessages(suppressWarnings(library(foreign)))
suppressMessages(suppressWarnings(library(magrittr)))

### The linear regression model

A linear regression model is written as

$$Y_i=\beta_0+X_i\beta_1+\epsilon_i$$

and has two assumptions that are often violated:

1. the $$\epsilon_i$$ are normally distributed, and
2. the variance is the same for all $$\epsilon_i$$.

This led to a variety of alternative approaches that worked better, especially for dependent variables that represented proportions or counts.

### The generalized linear model

In 1982, McCullagh and Nelder wrote a book outlining the generalized linear model, an approach to regression that would work well both for situations that met the assumptions of linear regression and for situations that did not meet those assumptions.

It was the sort of breakthrough that Physicists are still hoping for in the grand unified theory of the electromagnetic, strong, and weak forces under one model.

McCullagh and Nelder developed a model that works for any statistical distribution in a very broad class of distributions known as the exponential family. The exponential family of distributions includes both continuous and categorical distributions. The only requirement is that the probability density function (for continuous distributions) or the probability mass function (for categorical variables) can be factored in a a way that isolates the parameter(s) of the distribution in the following way:

$$ln(f(y; \theta))=d(\theta)e(y)+g(\theta)+h(y)$$

The Poisson distribution

$$f(y; \theta)=\theta^{-y}e^{-\theta}/y!$$ for y=0, 1, …$clearly fits into the exponential family, because $$lnf(y; \theta)=y ln(\theta)-\theta-ln(y!)$$ Other distributions that fit into this family include the • bernoulli • beta • chi-squared • exponential • gamma • geometric • normal There are some distributions which are not part of the exponential family. The t-distribution is one example. If you know that a distribution is part of the exponential family, you know the general form of the maximum likelihood estimates and find a solution using iteratively reweighted least squares. ### More on Poisson regression Counts are never negative, so we need to insure that no matter what the values of the covariates in a Poisson regression model, the predicted values produced by this model are never negative. A simple way to achieve this is to calculate the regression coefficients on a log scale. The logarithm is an interesting function that converts multiplication into addition. $$log(ab)=log(a)+log(b)$$ Back before computers, tables of logarithms simplified complex calculations. To compute a product of 4,156 and 2,023, you could multiply things out directly, but it would look something like  2141 x 2123 ------ 6423 4282 2141 4282 ------- 4545343 Instead, you could look up the base 10 logarithm of each number, add them together, and then convert back from the log scale to the original scale. log10(2141) ## [1] 3.330617 log10(2123) ## [1] 3.32695 log10(2141)+log10(2123) ## [1] 6.657567 10^(log10(2141)+log10(2123)) ## [1] 4545343 It only saves you a little bit of time, but back when mathematicians had to do hundreds or thousands of these calculations, it added up. There’s a joke along these lines. After the flood waters receded, Noah’s ark landed on solid ground. Noah opened the door and let all the animals out with the exhortation “Go forth and multiply.” There were two snakes, however, that did not want to leave the ark. Noah looked at them and said, “I told you to go forth and multiply.” The snakes looked sadly at him and said “We can’t multiply. We’re adders.” So Noah left the ark and found some wood to build a small flat platform. “Here.” he said. “This is a log table. It will allow adders to multiply.” The Poisson regression model produces estimates on a log scale. This implies that the relationships are effectively multiplicative when transformed back to the original scale. ### Simple example Francis Huang has a nice web page showing two datasets and how to analyze them using the Poisson regression model as well as several more sophisticated models. The first data set is available for anyone to download. The link I originally used no longer works. fn <- "http://faculty.missouri.edu/huangf/data/poisson/articles.csv" articles <- read.csv(fn) head(articles) Here is a different link for the same file, used at German Rodriguez’s website. I’ve placed a copy on my website in case this link breaks. Also, this dataset is available with the pscl library of R under the name, bioChemists. fn <- "http://www.stata-press.com/data/lf2/couart2.dta" articles <- read.dta(fn) head(articles) ## art fem mar kid5 phd ment ## 1 0 Men Married 0 2.52 7 ## 2 0 Women Single 0 2.05 6 ## 3 0 Women Single 0 3.75 6 ## 4 0 Men Married 1 1.18 3 ## 5 0 Women Single 0 3.75 26 ## 6 0 Women Married 2 3.59 2 The variables are + gender (fem), + marital status (mar), + number of children (kid5), + prestige of graduate program (phd), + the number of articles published by the individual’s mentor (ment) + the number of articles published by the scientist (art: the outcome). To fit a Poisson regression model in R, use the glm function with the argument, family=“poisson”. pois1 <- glm(art ~ fem + ment + phd + mar + kid5, family=poisson, data = articles) summary(pois1) ## ## Call: ## glm(formula = art ~ fem + ment + phd + mar + kid5, family = poisson, ## data = articles) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -3.5672 -1.5398 -0.3660 0.5722 5.4467 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.304617 0.102981 2.958 0.0031 ** ## femWomen -0.224594 0.054613 -4.112 3.92e-05 *** ## ment 0.025543 0.002006 12.733 < 2e-16 *** ## phd 0.012823 0.026397 0.486 0.6271 ## marMarried 0.155243 0.061374 2.529 0.0114 * ## kid5 -0.184883 0.040127 -4.607 4.08e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 1817.4 on 914 degrees of freedom ## Residual deviance: 1634.4 on 909 degrees of freedom ## AIC: 3314.1 ## ## Number of Fisher Scoring iterations: 5 The interpretation of the coefficients is tricky. The Poisson regression model is linear on the log scale. If you take the anti-log, you will convert this model to a multiplicative model. Here are the coefficents on the log scale. pois1 %>% tidy %>% mutate(lower_limit=estimate-1.96*std.error) %>% mutate(upper_limit=estimate+1.96*std.error) %>% select(term, estimate, lower_limit, upper_limit) -> log_coefficients log_coefficients ## # A tibble: 6 x 4 ## term estimate lower_limit upper_limit ## <chr> <dbl> <dbl> <dbl> ## 1 (Intercept) 0.305 0.103 0.506 ## 2 femWomen -0.225 -0.332 -0.118 ## 3 ment 0.0255 0.0216 0.0295 ## 4 phd 0.0128 -0.0389 0.0646 ## 5 marMarried 0.155 0.0349 0.276 ## 6 kid5 -0.185 -0.264 -0.106 log_coefficients %>% mutate(estimate=exp(estimate)) %>% mutate(lower_limit=exp(lower_limit)) %>% mutate(upper_limit=exp(upper_limit)) -> multiplicative_coefficients multiplicative_coefficients ## # A tibble: 6 x 4 ## term estimate lower_limit upper_limit ## <chr> <dbl> <dbl> <dbl> ## 1 (Intercept) 1.36 1.11 1.66 ## 2 femWomen 0.799 0.718 0.889 ## 3 ment 1.03 1.02 1.03 ## 4 phd 1.01 0.962 1.07 ## 5 marMarried 1.17 1.04 1.32 ## 6 kid5 0.831 0.768 0.899 ### Analysis of rates Often the Poisson observations are observed over varying numbers of patients, varying amounts of time, or both. Here’s an example. Auto insurance claims in Sweden are borken down into several subgroups, as described on this web page. Here are the variables: • Kilometres Kilometres travelled per year • 1: < 1000 • 2: 1000-15000 • 3: 15000-20000 • 4: 20000-25000 • 5: > 25000 • Zone Geographical zone • 1: Stockholm, G?teborg, Malm? with surroundings • 2: Other large cities with surroundings • 3: Smaller cities with surroundings in southern Sweden • 4: Rural areas in southern Sweden • 5: Smaller cities with surroundings in northern Sweden • 6: Rural areas in northern Sweden • 7: Gotland • Bonus No claims bonus. Equal to the number of years, plus one, since last claim • Make 1-8 represent eight different common car models. All other models are combined in class 9 • Insured Number of insured in policy-years • Claims Number of claims • Payment Total value of payments in Skr fn <- "http://www.statsci.org/data/general/motorins.txt" swe <- read.delim(fn, header=TRUE, sep="\t") head(swe) ## Kilometres Zone Bonus Make Insured Claims Payment ## 1 1 1 1 1 455.13 108 392491 ## 2 1 1 1 2 69.17 19 46221 ## 3 1 1 1 3 72.88 13 15694 ## 4 1 1 1 4 1292.39 124 422201 ## 5 1 1 1 5 191.01 40 119373 ## 6 1 1 1 6 477.66 57 170913 The number of drives and the months that they are covered by insurance varies from group to group. the measure, Insured, calculates the number of policy-years in each group. We want to estimate the effect of certain variables on the rate of claims (number of claims per policy year). You calculate the rate by including an offset variable equal to the natural log of policy years. pois2 <- glm( Claims~offset(log(Insured)) + factor(Kilometres) + factor(Zone) + Bonus + factor(Make), data=swe, family=poisson) summary(pois2) ## ## Call: ## glm(formula = Claims ~ offset(log(Insured)) + factor(Kilometres) + ## factor(Zone) + Bonus + factor(Make), family = poisson, data = swe) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -8.9411 -0.9325 -0.2387 0.6105 9.4199 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -1.739938 0.013630 -127.651 < 2e-16 *** ## factor(Kilometres)2 0.204203 0.007511 27.188 < 2e-16 *** ## factor(Kilometres)3 0.310530 0.008631 35.977 < 2e-16 *** ## factor(Kilometres)4 0.395459 0.012024 32.889 < 2e-16 *** ## factor(Kilometres)5 0.567998 0.012801 44.372 < 2e-16 *** ## factor(Zone)2 -0.239438 0.009495 -25.216 < 2e-16 *** ## factor(Zone)3 -0.389831 0.009668 -40.320 < 2e-16 *** ## factor(Zone)4 -0.585540 0.008650 -67.696 < 2e-16 *** ## factor(Zone)5 -0.328624 0.014529 -22.618 < 2e-16 *** ## factor(Zone)6 -0.529803 0.011874 -44.620 < 2e-16 *** ## factor(Zone)7 -0.735746 0.040697 -18.079 < 2e-16 *** ## Bonus -0.198056 0.001288 -153.741 < 2e-16 *** ## factor(Make)2 0.078421 0.021239 3.692 0.000222 *** ## factor(Make)3 -0.240878 0.025092 -9.600 < 2e-16 *** ## factor(Make)4 -0.656632 0.024181 -27.155 < 2e-16 *** ## factor(Make)5 0.158690 0.020234 7.843 4.41e-15 *** ## factor(Make)6 -0.340402 0.017374 -19.593 < 2e-16 *** ## factor(Make)7 -0.056424 0.023343 -2.417 0.015642 * ## factor(Make)8 -0.037993 0.031603 -1.202 0.229286 ## factor(Make)9 -0.068676 0.009955 -6.899 5.25e-12 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 34070.6 on 2181 degrees of freedom ## Residual deviance: 4025.1 on 2162 degrees of freedom ## AIC: 11703 ## ## Number of Fisher Scoring iterations: 4 The coefficients are easier to interpret after a transformation. pois2 %>% tidy %>% mutate(lower_limit=estimate-1.96*std.error) %>% mutate(upper_limit=estimate+1.96*std.error) %>% select(term, estimate, lower_limit, upper_limit) %>% mutate(estimate=exp(estimate)) %>% mutate(lower_limit=exp(lower_limit)) %>% mutate(upper_limit=exp(upper_limit)) -> multiplicative_coefficients multiplicative_coefficients ## # A tibble: 20 x 4 ## term estimate lower_limit upper_limit ## <chr> <dbl> <dbl> <dbl> ## 1 (Intercept) 0.176 0.171 0.180 ## 2 factor(Kilometres)2 1.23 1.21 1.24 ## 3 factor(Kilometres)3 1.36 1.34 1.39 ## 4 factor(Kilometres)4 1.49 1.45 1.52 ## 5 factor(Kilometres)5 1.76 1.72 1.81 ## 6 factor(Zone)2 0.787 0.773 0.802 ## 7 factor(Zone)3 0.677 0.664 0.690 ## 8 factor(Zone)4 0.557 0.547 0.566 ## 9 factor(Zone)5 0.720 0.700 0.741 ## 10 factor(Zone)6 0.589 0.575 0.603 ## 11 factor(Zone)7 0.479 0.442 0.519 ## 12 Bonus 0.820 0.818 0.822 ## 13 factor(Make)2 1.08 1.04 1.13 ## 14 factor(Make)3 0.786 0.748 0.826 ## 15 factor(Make)4 0.519 0.495 0.544 ## 16 factor(Make)5 1.17 1.13 1.22 ## 17 factor(Make)6 0.711 0.688 0.736 ## 18 factor(Make)7 0.945 0.903 0.989 ## 19 factor(Make)8 0.963 0.905 1.02 ## 20 factor(Make)9 0.934 0.916 0.952 ### Extensions McCullagh and Nelder also extended their results to an even broader setting using an approach known as quasi likelihood. In a quasi likelihood model, you do not specify a distribution, but rather a link function that converts the nonlinear regression relationship to a linear one, and a function that describes the mean-variance relationship. The latter is very important. In many statistical models, there is a specific type of heterogeneity where the mean and variance march together in lock step. Groups that have larger means also have larger variances. Groups that have smaller means also have smaller variances. In the Poisson model, we specify a mean-variance relationship that hte identify relationship. If a group has a mean that is 12, the variance is assumed to be 12 as well. If the mean is 5, then the variance is 5. That works fairly well for some data, but other times it does not work so well. There are many alternatives. You can assume that the variance is not equal to the mean but equal to the mean times a mutliplicative constant which you can estimate from the data. Notice, for example, in the first data set that mean(articles$art)
## [1] 1.692896
var(articles\$art)
## [1] 3.709742

The variance looks to be about twice as large as the mean. Additional evidence of this is provided by the deviance statistic. The deviance is a quantity produced by a generalized linear model and should equal the degrees of freedom in a Poisson regression model.

glance(pois1)
## # A tibble: 1 x 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1         1817.     914 -1651. 3314. 3343.    1634.         909   915

You will notice here that the deviance is about twice as big as the degrees of freedom.

You can fix this problem in seveal ways. First, you can specify a quasi-likelihood that behaves like the Poisson except that the function specifying the mean-variance relationship has an extra parameter, estimated from the data.

You can also use a counting distribution, the negative binomial distribution, that allows for greater flexibility in the mean-variance relationship.

Finally, you can adjust the zero probability upward from what a Poisson model would expect. This is commonly thought of as a mixture distribution where there is a certain probability p, of getting exactly zero, and a probability 1-p or getting a Poisson random variable (which can sometimes equal zero, but only sometimes). These models are known as zero-inflated models.

The web page cited earlier has more details about these models.