Zero-inflated models

Steve Simon

2021-09-20

In models of counts, zeros represent a special case. Often there is more than one reason why you might observe a zero count, and you might want to model this probability separately. One approach is known as a zero inflated Poisson model.

suppressMessages(suppressWarnings(library(broom)))
suppressMessages(suppressWarnings(library(dplyr)))
suppressMessages(suppressWarnings(library(ggplot2)))
suppressMessages(suppressWarnings(library(foreign)))
suppressMessages(suppressWarnings(library(magrittr)))
suppressMessages(suppressWarnings(library(pscl)))

The zero inflated Poisson distribution

In any Poisson distribution, the probabilty of obtaining a zero is defined as

\(P[Y=0]=\frac{\lambda^0e^{-\lambda}}{0!}=e^{-\lambda}\)

Sometimes this probability is too low because there are multiple reasons why you might observe a zero. The zero-inflated Poisson model is a mixture of the Poisson random variable with a random variable that places all of its probability at zero.

\(\pi \ 0 + (1-\pi)\frac{e^{-\lambda}\lambda^y}{y!}\)

This changes the probability of zero to

\(P[Y=0]=\pi+(1-\pi)e^{-\lambda}\)

Since \(\lambda\) has to be a positive value, the probability above is greater than the Poisson probability for any value of \(\pi>0\).

The remaining probabilities are shrunk by a factor of \(1-\pi\) to compensate for the increase in the probability at zero.

You can develop a separate regression model for \(\pi\) the increase in the probability of zero in addition to the regression model for \(\lambda\), the mean of the Poisson distribution.

The zero-inflated Poisson probability distribution is not part of the exponential family, so you need separate methods for estimation. You can find a function in the pscl library that will fit a zero-inflated Poisson model.

Start with Poisson regression

It is easiest to understand the quasiPoisson model by comparing it to the Poisson model.

Francis Huang has a nice web page showing two datasets and how to analyze them using the Poisson regression model as well as several more sophisticated models. The first data set is available for anyone to download. The link I originally used no longer works.

fn <- "http://faculty.missouri.edu/huangf/data/poisson/articles.csv"
articles <- read.csv(fn)
head(articles)

Here is a different link for the same file, used at German Rodriguez’s website. I’ve placed a copy on my website in case this link breaks. Also, this dataset is available with the pscl library of R under the name, bioChemists.

fn <- "http://www.pmean.com/00files/couart2.dta"
articles <- read.dta(fn)
head(articles)
##   art   fem     mar kid5  phd ment
## 1   0   Men Married    0 2.52    7
## 2   0 Women  Single    0 2.05    6
## 3   0 Women  Single    0 3.75    6
## 4   0   Men Married    1 1.18    3
## 5   0 Women  Single    0 3.75   26
## 6   0 Women Married    2 3.59    2

The variables are + gender (fem), + marital status (mar), + number of children (kid5), + prestige of graduate program (phd), + the number of articles published by the individual’s mentor (ment) + the number of articles published by the scientist (art: the outcome).

To fit a Poisson regression model in R, use the glm function with the argument, family=“poisson”.

pois1 <- glm(art ~ fem + ment + phd + mar + kid5, family=poisson, data = articles)
summary(pois1)
## 
## Call:
## glm(formula = art ~ fem + ment + phd + mar + kid5, family = poisson, 
##     data = articles)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5672  -1.5398  -0.3660   0.5722   5.4467  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.304617   0.102981   2.958   0.0031 ** 
## femWomen    -0.224594   0.054613  -4.112 3.92e-05 ***
## ment         0.025543   0.002006  12.733  < 2e-16 ***
## phd          0.012823   0.026397   0.486   0.6271    
## marMarried   0.155243   0.061374   2.529   0.0114 *  
## kid5        -0.184883   0.040127  -4.607 4.08e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1817.4  on 914  degrees of freedom
## Residual deviance: 1634.4  on 909  degrees of freedom
## AIC: 3314.1
## 
## Number of Fisher Scoring iterations: 5

The interpretation of the coefficients should be done after exponentiating their values.

pois1 %>%
  tidy %>%
  mutate(lower_limit=estimate-1.96*std.error) %>%
  mutate(upper_limit=estimate+1.96*std.error) %>%
  select(term, estimate, lower_limit, upper_limit) %>%
  mutate(estimate=exp(estimate)) %>%
  mutate(lower_limit=exp(lower_limit)) %>%
  mutate(upper_limit=exp(upper_limit)) -> pois1_coefficients
pois1_coefficients
## # A tibble: 6 x 4
##   term        estimate lower_limit upper_limit
##   <chr>          <dbl>       <dbl>       <dbl>
## 1 (Intercept)    1.36        1.11        1.66 
## 2 femWomen       0.799       0.718       0.889
## 3 ment           1.03        1.02        1.03 
## 4 phd            1.01        0.962       1.07 
## 5 marMarried     1.17        1.04        1.32 
## 6 kid5           0.831       0.768       0.899

Fitting a zero-inflated-Poisson model

zip1 <- zeroinfl(art ~ fem + ment + phd + mar + kid5 | 1, family=poisson, data = articles)
## Warning in optim(fn = countloglikfun, gr = countgradfun, par = c(lmstart, :
## unknown names in control: family
## Warning in optim(fn = loglikfun, gr = gradfun, par = c(start$count,
## start$zero, : unknown names in control: family
summary(zip1)
## 
## Call:
## zeroinfl(formula = art ~ fem + ment + phd + mar + kid5 | 1, data = articles, 
##     family = poisson)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5296 -0.9976 -0.3026  0.5398  7.2884 
## 
## Count model coefficients (poisson with log link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.553995   0.113836   4.867 1.14e-06 ***
## femWomen    -0.231609   0.058670  -3.948 7.89e-05 ***
## ment         0.021543   0.002160   9.973  < 2e-16 ***
## phd          0.002526   0.028511   0.089    0.929    
## marMarried   0.131972   0.066130   1.996    0.046 *  
## kid5        -0.170474   0.043296  -3.937 8.24e-05 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.6813     0.1558  -10.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 13 
## Log-likelihood: -1621 on 7 Df

We cannot use the broom package to extract coefficients.

References