Introduction
We show how to construct a Zero-Inflated Poisson (ZIP) Regression model in Excel using a Real Statistics data analysis tool, which in turn uses Solver.
Example
Example 1: Create a Poisson regression model for the number of fish caught (column A) based on the 250 rows of data in Figure 1 (only 15 of which are displayed).
Figure 1 – Fishing data
Using the Poisson Regression data analysis tool, we obtain the output in Figure 2.
Figure 2 – Poisson regression analysis
We are concerned about this model since the phi correction is high and p-values in cells P4 and P5 are quite low, indicating a poor fit. Looking at all the data in Figure 1, we see that a large portion of the people in the survey didn’t catch any fish. This leads us to trying ZIP regression.
Data Analysis Tool
To perform ZIP regression using the data in Figure 1, press the key sequence Ctrl-m and choose the Zero-Inflated Poisson Regression option from the Reg tab of the menu that appears. Next, fill in the dialog box that now appears as shown in Figure 3.
Figure 3 – ZIP Regression dialog box
After clicking on the OK button, the output in Figures 4, 5, and 6 appears. We next explain each of these figures one at a time.
Coefficients
We start by referring to the output on the upper left-hand side of Figure 4.
Figure 4 – ZIP regression (part 1)
The values in range G4:H7 are estimated by Solver to minimize the LL value in cell O14. We see from cell G9 that Solver converged to a solution. Range G4:G7 contains the Poisson regression coefficients and range H4:H7 contains the corresponding inflation coefficients based on logistic regression.
The full parameter report is provided in range F11:L20 using the array formula =ZIPParam(B2:D251,A2:A251,G4:H7,B1:D1,,O20) as described below.
We also see that 56.80% of the data contains zeros (no fish caught). This was accomplished via formulas =COUNTIF(AB4:AB253,0) in cell O10 and =O10/O7 in cell O11.
Covariance Matrix
The covariance matrix shown in Figure 5 is calculated using the array formula =NegBinomCov(D2:G13,B2:B13,J4:J9,C2:C13). Note that this matrix covers both the Poisson and logistics coefficients from Figure 4. Also note that the square root of the entries on the diagonal correspond to the standard errors of the coefficients from Figure 4. E.g. the square root of 0.106344, the value in cell Y11, is .326105, the value of cell H20 from Figure 4.
Figure 5 – ZIP regression (part 2)
Residuals
The residuals for each of the entries in Figure 1 are shown in Figure 6 (although only the first 12 of 250 are displayed).
Figure 6 – ZIP regression (part 3)
The entries are similar to those used for Poisson Regression. The main difference is that the πi values in column AE need to be considered.
E.g. the following formulas are used in row 4. The other formulas in row 4 are similar to those used for Poisson regression.
Figure 7 – Key residual formulas
Goodness of Fit
As for Poisson regression, the residuals can be used to determine the overall fit of the ZIP regression model.
The Pearson goodness of fit statistic (cell O4 of Figure 4) is equal to the sum of the squares of the Pearson residuals, as for Poisson regression.
LLfit in cell O14 is calculated using the formula =ZIPLL(B2:D251,A2:A251,G4:H7). AIC and BIC are calculated as for Poisson regression.
R-Square in cell O16 is calculated via the formula =1-O14/O13.
There are three choices for LLmin as reflected in the dialog box in Figure 3. The first choice is not to calculate LLmin at all, in which case the values in cells O13 and O16 contain N/A values. The second choice is to base LLmin on a model containing only the two intercepts. This is done using Solver with the intercept coefficients shown in J4:K4. In this case, cell O13 contains the formula =ZIPLL(B2:B251,A2:A251,J4:K5).
The third option is for Solver to use all the logistic regression coefficients but only the intercept for Poisson regression. In this case, Solver would fill in cell J4 and all the cells in range K4:K7 and the cells in range J5:J7 would contain zeros. Cell O13 would contain the formula =ZIPLL(B2:B251,A2:A251,J4:K7).
Predictions
Except for the standard errors, the predictions on the right side of Figure 6 are calculated as for Poisson regression. E.g. cell AL4 contains the formula =SQRT(AK4*(1+AC4*AE4)). In general, even for data not in the original sample, you can use the ZIPPredC and ZIPPredCC functions to predict the expected count for any combination of regressors. You can also use the ZIPProb function. See ZIP Regression Predictions for more details.
Worksheet Functions
Real Statistics Functions: The Real Statistics Resource Pack provides the following worksheet functions. These pertain to a ZIP regression model based on X data in Rx (with k columns), Y data in Ry, coefficients in Rc, and frequency data in Rt. If Rt is omitted it defaults to a column of ones. Rc is a k+1 × 2 array whose first column specifies the Poisson regression coefficients and whose second column specifies the logistic regression coefficients.
ZIPLL(Rx, Ry, Rc, Rt): returns the LL fitted value.
ZIPCov(Rx, Ry, Rc, Rt): returns a 2k+2 × 2k+2 array containing the coefficient covariance matrix.
ZIPParam(Rx, Ry, Rc, Rh, Rt, alpha): returns a coefficient parameter array similar to range F11:L20 of of Figure 4. Here, Rh is an optional row array specifying the names of the k regressors. If omitted then these names default to var1, … alpha defaults to .05.
Examples Workbook
Click here to download the Excel workbook with the examples described on this webpage.
References
Hilbe, J. M. (2014) Modeling count data. Cambridge University Press
https://assets.cambridge.org/97811070/28333/frontmatter/9781107028333_frontmatter.pdf
Hintze, J. L. (2007) Zero-Inflated Poisson regression. NCSS
https://www.ncss.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Zero-Inflated_Poisson_Regression.pdf
Long, J. S. and Freese, J. (2001) Regression models for categorical dependent variables using Stata
http://investigadores.cide.edu/aparicio/data/refs/Long%26Freese_RegModelsUsingStata_2001.pdf