class: center, middle, inverse, title-slide # Statistics for International Relations Research II ## Counts ###
James Hollway
--- class: center, middle .pull-1[.circleon[![](https://www.dictionary.com/e/wp-content/uploads/2018/04/Sophies-choice.jpg)]] .pull-1[.circleon[![](https://imgaz1.staticbg.com/thumb/view/oaupload/banggood/images/7D/F5/55632c9a-6d5a-4a7e-acf8-37642e2dc570.JPG)]] .pull-1[.circleon[![](https://cdn.shopify.com/s/files/1/0080/8372/products/tattly_triangle_yoko_sakao_ohama_00_1024x1024@2x.png?v=1575322215)]] --- class: center, middle .pull-1[.circleon[![](https://static.turbosquid.com/Preview/2016/07/05__06_16_48/FishSkeletonb.jpg46B6AAD0-198F-40AC-B932-1768C4B0F869Zoom.jpg)]] .pull-1[.circleon[![](https://www.psassets.ch/thumbs/2e/1d/08716b4f86e44e119ac2ec013b10-909889.jpg)]] .pull-1[.circleon[![](https://static.toiimg.com/thumb/imgsize-5231,msid-35966935,width-400,resizemode-4/35966935.jpg)]] ??? Ok, so this week we'll talk about the following: - What are event counts (and what not)? - Theory and estimation of the Poisson model for event counts - Violations of the Poisson assumptions. - Over- and underdispersion. - Theory and estimation of the Negative Binomial model. - Zero-inflated issues and estimation. - Hurdle models --- class: center, middle # Event counts and the Poisson model .pull-1[.circleon[![](https://static.turbosquid.com/Preview/2016/07/05__06_16_48/FishSkeletonb.jpg46B6AAD0-198F-40AC-B932-1768C4B0F869Zoom.jpg)]] .pull-1[.circleoff[![](https://www.psassets.ch/thumbs/2e/1d/08716b4f86e44e119ac2ec013b10-909889.jpg)]] .pull-1[.circleoff[![](https://static.toiimg.com/thumb/imgsize-5231,msid-35966935,width-400,resizemode-4/35966935.jpg)]] --- ## Event counts Event count models are models where the dependent variable, *Y*, is a non-negative integer count of events `\(\in 0,1,2,...,\infty\)`, bounded at zero below, unbounded above. -- Political scientific count data is often related to (supposedly) indivisible units, like individuals or states, or in relation to discrete events that occur. -- Some data that looks like count should still be treated differently: - Binary data that expresses successes/failures of known number of trials best estimated with binomial model - Ordinal data like Likert scales may look like counts but best estimated with ordinal logit instead --- ## Today's data .pull-left[ ```r # install.packages("devtools") # devtools::install_github("RamiKrispin/coronavirus") library(coronavirus) # update_datasets() covid <- coronavirus %>% gr oup_by(Country.Region, type) %>% summarise(total_cases = sum(cases)) %>% pivot_wider(names_from = type, values_from = total_cases) %>% rename(Death = death, Confirmed = confirmed, Recovered = recovered) %>% arrange(-Confirmed) library(ggplot2) ggplot(covid, aes(Death)) + geom_histogram(binwidth = 200, fill = "orange") + ylab("Number of states") ggplot(covid, aes(Death)) + geom_histogram(binwidth = 200, fill = "orange") + ylab("Number of states") + scale_y_sqrt() ``` ] .pull-right[ <img src="STAT_L7_Count_files/figure-html/covid-1.png" width="504" /><img src="STAT_L7_Count_files/figure-html/covid-2.png" width="504" /> ] ??? Ok, I couldn't resist any longer. We'll do coronavirus data. It's all on our minds anyway. --- ### Adding some additional data ```r library(countrycode) library(wbstats) # wbcache(lang = "en") covid <- covid %>% mutate(iso3c = countrycode(Country.Region, "country.name", "iso3c")) wbdata <- wb(country = "countries_only", indicator = c("SP.POP.TOTL", "AG.LND.TOTL.K2", "EN.POP.DNST", "EN.URB.LCTY", "NY.GDP.PCAP.KD"), startdate = 2018, enddate = 2020, return_wide = TRUE) covid <- left_join(covid, wbdata, by = "iso3c") %>% mutate(Population = SP.POP.TOTL/1000000, # millions of people Land = AG.LND.TOTL.K2/1000000, # millions of square kilometers LargestCity = EN.URB.LCTY/1000000, # millions of people in largest city Density = EN.POP.DNST/1000, # thousands of people per square kilometer GDPpc = NY.GDP.PCAP.KD/1000) # thousands of 2010 US$ per person ``` --- ### Why not use OLS? Event counts can be a bit hard to deal with because they are: - *Discrete* (meaning they only take on integer values) - *Strictly non-negative* (meaning they only take on positive values, or zero) -- This makes using OLS on event counts a bad idea (King 1988) because it yields estimates that are: - *Inaccurate* (e.g. OLS can predict negative counts) - *Inefficient* (because they fail to account for the heteroskedastic nature of event counts) ??? One obvious reason for not regressing directly against fatalities is that (in our example) fatalities is restricted to be non-negative, a restraint that linear regression can’t enforce. Other reasons include the wide distribution of values and the relative or multiplicative structure of errors on outcomes. More on the last thing in a bit... --- ## What's the alternative? The Poisson process Suppose we are interested in studying events, and that those events occur over time. We will consider the rate at which events occur `\(\lambda\)`. It’s useful to think of `\(\lambda\)` as the expected number of events in any particular time “period” of length *h*. If, in the data-generating process (DGP) that gives rise to the events in question, the events are or can be considered to be independent (that is `\(\Pr(event_i) = \Pr(event_i|event_j)\)`), we can conceive of this as a .red[Poisson process]: events occur independently with a constant probability equal to `\(\lambda\)` times the length of the interval (that is, `\(\lambda h\)`): - `\(\Pr(Y = 1|(t,t+h]) = \lambda h\)` - `\(\Pr(Y = 0|(t,t+h]) = 1 - \lambda h\)` <!-- Then as the length of the interval `\(h \to 0\)`, `\(\lambda\)` will just be the rate. --> --- ### The Poisson distribution Consider our outcome variable *Y* as the number of events that have occurred in the interval `\((t,t+h]\)`. For such a process, the probability that the number of events occurring equals some value in the set `\({0,1,2,3,...}\)` is: `$$\Pr(Y_t=y) = \frac{\exp(-\lambda h)\lambda h^y}{y!}\tag{1}\label{1}$$` If all the intervals are of the same length (and equal to 1), this reduces to: `$$\Pr(Y_t=y) = \frac{\exp(-\lambda)\lambda^y}{y!}\tag{2}\label{2}$$` This is the way we typically see the Poisson distribution written. By this logic, the Poisson distribution is the limiting distribution for the number of independent events occurring in some fixed period of length *h* (Eq. \ref{1}) or 1 (Eq. \ref{2}). Therefore the assumptions underlying the *Poisson process* – constant arrival rates, and independence across events – are key to deriving the *Poisson distribution* in this way. ??? Relaxing these assumptions results in other distributions... --- ### Characteristics of the Poisson distribution .pull-left[ <img src="STAT_L7_Count_files/figure-html/poissondis-1.png" width="504" /> ] .pull-right[ The Poisson distribution is a discrete probability distribution, with *support* on non-negative integers. The “rate” `\(\lambda\)` can be interpreted as the expected number of events during an observation period. As `\(\lambda\)` increases, - the mean/mode of the distribution gets larger - the variance of the distribution gets larger too - makes sense: it is bounded from below - in fact, for the Poisson, the mean equals the variance `\(E(Y) = Var(Y) = \lambda\)` - the distribution becomes more normal ] ??? So what kind of odd fish is this 'Poisson' anyway? Note also that the Poisson distribution is not preserved under affine transformations. That is, affine transformations of Poisson variates are not themselves (necessarily) Poisson variates as well. It is preserved under addition (convolution) provided that the components are independent. That is, for two Poisson variates `\(X \sim Poisson(\lambda_x)\)` and `\(Y \sim Poisson(\lambda_y)\)`, `\(Z = X + Y \sim Poisson(\lambda_{x+y})\)` if *X* and *Y* are independent. However, the same is not true for differences of Poisson variates. --- ### The Poisson regression model If we assume our event count is distributed according to a Poisson distribution, then the next thing we probably want to know is the effect of covariates *X* on (the expected value of) *Y*. Since we know `\(E(Y)>0\)`, we need to restrict the “link” to be positive. The exponential is the standard one for this. So, we have `\(E(Y_i) \equiv \lambda_i = \exp(X_i\beta)\)`. .pull-left[ Incorporating this into Eq. \ref{2} yields a probability model that looks like this: `$$\Pr(Y_i = y|X_i,\beta) = \frac{\exp[-\exp(X_i\beta)][\exp(X_i\beta)^y]}{y!} \tag{4}\label{4}$$` ] .pull-right[ Making the usual i.i.d. assumptions yields a pretty simple likelihood: `$$L = \prod_{i=1}^N \frac{[\exp-\exp(X_i\beta)][\exp(X_i\beta)^{y_i}]}{y_i!} \tag{5}\label{5}$$` ] And an equally simple log-likelihood: `$$\log L = \sum_{i=1}^N -\exp(X_i\beta) + Y_i X_i \beta - \log(Y_i!) \tag{6}\label{6}$$` Where the last term `\(- \log(Y_i!)\)` can be omitted because it doesn’t vary with `\(\beta\)`. ??? It’s actually pretty easy to demonstrate that this log-likelihood is globally concave, and so estimation is really easy and reliable using more-or-less any optimizer we care to adopt. We’ll talk about some tinkering with this model a bit later. --- ## Estimation .small[ ```r library(sjPlot) poi1 <- glm(Death ~ Population + Land + Density + LargestCity + GDPpc, covid, family = poisson) *tab_model(poi1, transform = NULL, digits = 4, * vcov.fun = "HC", vcov.type = "HC3", show.aic = T, show.loglik = T) ``` <table style="border-collapse:collapse; border:none;"> <tr> <th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; text-align:left; "> </th> <th colspan="3" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">Death</th> </tr> <tr> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; text-align:left; ">Predictors</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Log-Mean</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">CI</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">p</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">(Intercept)</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">4.6231</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">2.9788 – 6.2674</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong><0.001</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Population</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.0013</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.0028 – 0.0055</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.529</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Land</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.0222</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.2844 – 0.3288</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.886</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Density</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.3594</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-1.4069 – 0.6880</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.498</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">LargestCity</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.0403</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.2487 – 0.3293</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.783</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">GDPpc</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.0403</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.0178 – 0.0628</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong>0.001</strong></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm; border-top:1px solid;">Observations</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="3">139</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">R<sup>2</sup> Nagelkerke</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="3">1.000</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">AIC</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="3">204793.499</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">log-Likelihood</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="3">-102390.750</td> </tr> </table> ] ??? Ok, so let's try estimating a Poisson model for our data now. If we are concerned that maybe our assumptions are mildly violated, we can use robust standard errors. Note that we can also print out all the usual goodness-of-fit checks for the model, including a global LR test (usual F-type test against the global null), AIC etc, and pseudo-R2s, for what they're worth. In terms of 'sign-ificance', the signs of the `\(\beta\)`s indicate the effect on the expected number of counts. So, positive signs indicate positive effects – higher values of *X* drive *Y* (the count) higher – while the opposite is true for negative signs. So here GDP per capita seems to matter: the richer countries are hit with more coronavirus deaths, it seems. --- ### Incident rate ratios The Poisson model is a log-linear model not unlike the various flavors of logit we talked about before. This means that (surprise, surprise!) there is something like an odds-ratio interpretation of Poisson regression coefficients here too. In the Poisson case, the quantity of interest is known as the incident rate - that is, `\(\hat\lambda\)`. The natural way to compare two observations, then, is the incidence rate ratio (or IRR). For a binary covariate `\(X_D\)` this would look like: `$$\frac{\hat\lambda\mid X_D=1}{\hat\lambda\mid X_D=0} = \frac{\exp(\hat\beta_0+\bar{X}\hat\beta+(X_D=1)\hat\beta_0{X_D})}{\exp(\hat\beta_0+\bar{X}\hat\beta+(X_D=0)\hat\beta_0{X_D})} = \exp(\hat\beta_{X_D})$$` That is, we can tell the relative change in incidence rate for a one-unit change in any given variable by simply exponentiating its coefficient estimate. --- ```r # Fortunately, `sjPlot` even offers IRRs by default. tab_model(poi1, vcov.fun = "HC", vcov.type = "HC3", digits = 4, show.aic = T, show.loglik = T) ``` <table style="border-collapse:collapse; border:none;"> <tr> <th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; text-align:left; "> </th> <th colspan="3" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">Death</th> </tr> <tr> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; text-align:left; ">Predictors</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Incidence Rate Ratios</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">CI</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">p</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">(Intercept)</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">101.8072</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">19.6638 – 527.0949</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong><0.001</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Population</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">1.0013</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.9972 – 1.0055</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.529</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Land</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">1.0225</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.7525 – 1.3893</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.886</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Density</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.6981</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.2449 – 1.9897</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.498</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">LargestCity</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">1.0411</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.7798 – 1.3900</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.783</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">GDPpc</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">1.0411</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">1.0180 – 1.0648</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong>0.001</strong></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm; border-top:1px solid;">Observations</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="3">139</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">R<sup>2</sup> Nagelkerke</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="3">1.000</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">AIC</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="3">204793.499</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">log-Likelihood</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="3">-102390.750</td> </tr> </table> ??? Substantively, this means that a one-unit ($1000) increase in GDP per capita results in a exp(0.04) = 1.04 = 4% increase in the IRR. --- ### Predicted counts We can also calculate and graph the predicted counts, holding all other variables at their mean. ```r plot_model(poi1, terms = "GDPpc", type = "pred") ``` <img src="STAT_L7_Count_files/figure-html/margins-1.png" width="504" /> --- ## Exposure Of course, you can't die from coronavirus if you don't have coronavirus. So maybe the association between GDP per capita and deaths is just because richer countries have more cases (so far). -- Note that the format for the Poisson distribution in \ref{1} includes *h* as a time interval, and we've been assuming that this is constant/shared for all. We can conceive of this as the “exposure” each subject *i* has that they can generate a count from. We have been assuming that the number of "Bernoulli trials" `\(M \to \infty\)`, but there's an upper limit in our case: If a state has 3000 cases, it *cannot* have 4000 deaths. This upper limit can change our interpretation. A state that has 200 fatalities from 2000 cases is different from a state that has 200 fatalities from 4000 cases. Lastly, states have different numbers of cases, and so the 200/4000 fatalities is clearly doing better than the 200/2000. Their *rates* of fatality clearly differ, and so the amount of exposure needs to be accounted for in some way, or the model is misspecified. ??? King (2001, 124-126) says that we can think of the exposure issue as being an unequal observation interval issue. So far we have assumed that each of our observations of event counts come from intervals of time that have the same length. --- ### Offsets An offset is a predictor variable whose coefficient is fixed at 1. Let's add `\(Z\)`, an offset/exposure variable, to our standard Poisson regression setup: `$$\begin{align} \log E(Y) & = \beta X + \log(Z)\\ \log E(Y) - \log(Z) & = \beta X\\ \log E(Y/Z) & = \beta X\\ \end{align}$$` Your underlying random variable is still *Y*, but by dividing by `\(Z\)` we've converted the LHS into a *rate* of events per unit exposure. Formally, if observations don’t have the same exposure, then the expected count is proportional to the exposure. Since this division also alters the variance of the response, we'd need to reweight by `\(Z\)` or just declare an offset. ```r *poi2 <- glm(Death/Confirmed ~ Population + Land + Density + LargestCity + GDPpc, * weights = Confirmed, covid, family = poisson) poi3 <- glm(Death ~ Population + Land + Density + LargestCity + GDPpc, * offset = log(Confirmed), covid, family = poisson) ``` --- ```r tab_model(poi2, poi3, vcov.fun = "HC", vcov.type = "HC3", digits = 4, show.aic = T, show.loglik = T, show.r2 = F) ``` <table style="border-collapse:collapse; border:none;"> <tr> <th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; text-align:left; "> </th> <th colspan="3" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">Death/Confirmed</th> <th colspan="3" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">Death</th> </tr> <tr> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; text-align:left; ">Predictors</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Incidence Rate Ratios</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">CI</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">p</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Incidence Rate Ratios</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">CI</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; col7">p</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">(Intercept)</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.0720</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.0277 – 0.1868</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong><0.001</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.0720</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.0277 – 0.1868</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col7"><strong><0.001</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Population</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">1.0003</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.9991 – 1.0015</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.602</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">1.0003</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.9991 – 1.0015</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col7">0.602</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Land</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.8563</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.7752 – 0.9458</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong>0.002</strong></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.8563</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.7752 – 0.9458</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col7"><strong>0.002</strong></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Density</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.6162</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.0963 – 3.9435</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.607</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.6162</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.0963 – 3.9416</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col7">0.607</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">LargestCity</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">1.0187</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.9487 – 1.0938</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.608</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">1.0187</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.9487 – 1.0938</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col7">0.608</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">GDPpc</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.9993</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.9844 – 1.0144</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.927</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.9993</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.9844 – 1.0144</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col7">0.927</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm; border-top:1px solid;">Observations</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="3">139</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="3">139</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">AIC</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="3">Inf</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="3">21098.745</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">log-Likelihood</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="3">-Inf</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="3">-10543.373</td> </tr> </table> Now it looks like GDP per capita is no longer significant (as we thought), but land is, and negative. ??? Note that exposure only really matters if: - observations have very different exposure, and/or - observations have different proportions of exposure --- ### When your DV is a percentage When your DV is a proportion or a percentage, then it will at least be bounded between 0 and 1. It may additionally be sigmoidal (like a flattened S). You have several options: Stick with OLS - if your data fall in the middle, linear section of the curve (generally between .2 and .8, though I've heard .3 to .7 is better), then you don't really need to worry about inconsistent predicted values. - if you plan to make the model complex in other ways, such as a complicated mixed model or structural equation model, it may be better to base this on a more straightforward model Use logit/probit regression - treat the proportion as a binary response - this will only work if the proportion can be thought of and you have the data for the number of successes and the total number of trials Use Poisson or other count model - and treat the proportion as a censored continuous variable - the censoring means that you don’t have information below 0 or above 1 - this could be a Poisson with an offset, or a two-limit tobit (Long, 1997) - this approach works best if there isn’t an excessive amount of censoring (values of 0 and 1). --- ### A bit more on link functions and data transforms Poisson regression is unbiased and calibrated: it preserves the conditional expectations and rollups of the training data. Regressing against the log of the outcome not calibrated, but will have lower relative error than a Poisson regression. .small[ <table style="border-collapse:collapse; border:none;"> <tr> <th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; text-align:left; "> </th> <th colspan="3" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">Death</th> <th colspan="3" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">log(Death + 1)</th> <th colspan="3" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">Death</th> </tr> <tr> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; text-align:left; ">Predictors</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Estimates</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">CI</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">p</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Estimates</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">CI</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; col7">p</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; col8">Incidence Rate Ratios</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; col9">CI</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; 0">p</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">(Intercept)</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-120.65</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-536.66 – 295.36</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.567</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">1.07</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.66 – 1.47</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col7"><strong><0.001</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col8">101.81</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col9">100.08 – 103.56</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; 0"><strong><0.001</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Population</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">1.31</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.99 – 3.60</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.262</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.00</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.00 – 0.00</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col7">0.153</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col8">1.00</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col9">1.00 – 1.00</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; 0"><strong><0.001</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Land</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">10.79</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-146.93 – 168.51</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.893</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.02</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.13 – 0.18</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col7">0.753</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col8">1.02</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col9">1.02 – 1.03</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; 0"><strong><0.001</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Density</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-193.46</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-618.80 – 231.89</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.370</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.47</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.88 – -0.06</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col7"><strong>0.026</strong></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col8">0.70</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col9">0.68 – 0.72</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; 0"><strong><0.001</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">LargestCity</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">21.96</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-41.12 – 85.04</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.492</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.12</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.06 – 0.18</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col7"><strong><0.001</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col8">1.04</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col9">1.04 – 1.04</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; 0"><strong><0.001</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">GDPpc</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">26.78</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">11.35 – 42.22</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong>0.001</strong></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.07</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.05 – 0.08</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col7"><strong><0.001</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col8">1.04</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col9">1.04 – 1.04</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; 0"><strong><0.001</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm; border-top:1px solid;">Observations</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="3">139</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="3">139</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="3">139</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">R<sup>2</sup> / R<sup>2</sup> adjusted</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="3">0.114 / 0.081</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="3">0.489 / 0.470</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="3">1.000</td> </tr> </table> ] ??? But even those who recognise that to use OLS here could be an issue disagree about the appropriate approach. Minimizing relative error is appropriate in situations when differences are naturally expressed in percentages rather than in absolute amounts. --- class: center, middle # Overdispersion and the Negative Binomial model .pull-1[.circleoff[![](https://static.turbosquid.com/Preview/2016/07/05__06_16_48/FishSkeletonb.jpg46B6AAD0-198F-40AC-B932-1768C4B0F869Zoom.jpg)]] .pull-1[.circleon[![](https://www.psassets.ch/thumbs/2e/1d/08716b4f86e44e119ac2ec013b10-909889.jpg)]] .pull-1[.circleoff[![](https://static.toiimg.com/thumb/imgsize-5231,msid-35966935,width-400,resizemode-4/35966935.jpg)]] --- ## Some fishy assumptions Of course, some of the assumptions underlying the Poisson may not be appropriate for most IRPS count data. We might say there are four assumptions about the latent process underlying the Poisson distribution: 1. We observe from the start of the period (or account for how many events occurred before start of study period) 1. Periods are all of the same length (though we can account for this with an offset) 1. Events can't occur at the same time (here we're treating fatalities as if they were occurring in continuous time) 1. Probability of an event is constant across a period and independent of other events in the same period If these assumptions hold, for `\(\lambda>0\)` and `\(y\in 0,1,2,3,...\)`: `$$\Pr(Y_t = y) = \frac{\exp(-\lambda)\lambda^y}{y!}$$` Where the parameter `\(\lambda\)` is 1. the unobserved “rate” of occurrence 2. the expected value of the variable Y, *and* 3. the variance of the variable Y ??? Recall that event count models are similar to logits and probits in that they can be characterized as a realization of a latent process. In the event count case, what is latent is the rate at which events occur. --- ## Dispersion This last part, `\(\lambda = E(Y) = Var(Y)\)`, doesn't always make sense... -- - often there is *overdispersion*: `\(\lambda = E(Y) < Var(Y)\)` - less often there is *underdispersion*: `\(\lambda = E(Y) > Var(Y)\)` An example could be observations of cats, antelopes and foxes in our backyard each day: 1. cats: Y = 0,1,1,0,2,0,1,0,3,1,2,1,0,2 - `\(\mu = 1\)`, `\(\sigma = 0.92\)`, i.e. Poisson dispersion: `\(E(Y) = Var(y) \equiv \sigma = 1\)` 2. antelope: Y = 0,0,0,0,0,0,0,0,0,0,0,0,7,7 - `\(\mu = 1\)`, `\(\sigma = 6.46\)`, i.e. overdispersion: `\(E(Y) = Var(y) \equiv \sigma > 1\)` 3. foxes: Y = 1,0,1,1,1,1,1,2,1,1,1,1,1,1 - `\(\mu = 1\)`, `\(\sigma = 0.15\)`, i.e. underdispersion: `\(E(Y) = Var(y) \equiv 0 < \sigma < 1\)` Note what we are interested in here is whether data is *conditionally* Poisson distributed, i.e. conditional on the effects of the independent variables [i.e. `\(Var(Y \mid X,\beta)\)`]. When we introduce covariates, we factor out some of the heterogeneity in the data. So, as model specification gets better, our data can go from overdispersed, to Poisson, to underdispersed. But political science data are often not Poisson distributed, even conditionally. We regularly study dependent processes leaded to unobserved heterogeneity or our measures are noisy (see Prentice 1986). ??? Cats seem like a rather independent process, but antelope cluster or come together in groups, and foxes are usually solitary and almost never seen together. Underdispersion would be observed where there are remarkably uniform observations. Overdispersion is often observed where there are large asymmetries in event counts, as there often are in political science. Overdispersion is much more common in social scientific settings than is underdispersion. This could be due to groupings, contagion, or other reinforcement effects that affect our assumption that these observations are independent. A non-constant rate induces greater random variability in Y than would a constant `\(\lambda\)`. --- ### Diagnosing overdispersion We can test for overdispersion pretty easily. You could estimate a Poisson model and see whether the (squared) “errors” have a variance statistically different from `\(\lambda\)`. But a better way is to do a posthoc test based on the results of the negative binomial: ```r library(pscl); library(lmtest) nb1 <- glm.nb(Death ~ Population + Land + Density + LargestCity + GDPpc + offset(log(Confirmed)), covid) odTest(nb1) # = lrtest(poi3, nb1) ``` ``` ## Likelihood ratio test of H0: Poisson, as restricted NB model: ## n.b., the distribution of the test-statistic under H0 is non-standard ## e.g., see help(odTest) for details/references ## ## Critical value of test statistic at the alpha= 0.05 level: 2.7055 ## Chi-Square Test Statistic = 20057.972 p-value = < 2.2e-16 ``` ??? Note that a t-test of 𝛼 = 0 will, `\(\theta\)` will, asymptotically, give the same results. --- ### So what if there is overdispersion? Dispersion can indicate dependence that needs to be taken care of. After all, all of the systematic influences on `\(\lambda\)` are supposed to be included in `\(X\beta\)` so that only (constant) random noise remains. (So your first option is to improve model specification.) Continuing to fit a Poisson model to these data imposes `\(\lambda = E(Y) = Var(Y)\)`, (the mean-variance equality restriction). This artificially overestimates the variability in Y, leading us to overestimate the standard errors, and underestimate our precision/confidence in the model parameters. --- ## Models for overdispersed counts To address dispersion, we drop the assumption that `\(\lambda\)` is constant within an observation, and add a random variable that expresses that variance. The result is that we have the usual conditional Poisson variate, but with a random error term: `$$E(Y_i) \equiv \exp(X_i \beta + u_i) = \exp(X_i \beta)\exp(u_i) = \lambda_i v_i$$` We then have to specify the distribution of `\(u_i\)` to identify the model. We usually use the Gamma distribution because: 1. It is nonnegative 2. It has a nice (if complicated) closed-form solution If *v*s are assumed to be randomly distributed according to a one-parameter Gamma distribution with `\(E(v) = 1\)` and `\(Var(v) \equiv \sigma^2 = \frac{1}{\theta}\)`, then the marginal density of *Y* is .red[negative binomial]: `$$\Pr(Y_i = y \mid \lambda_i,\theta) = \Bigl(\frac{\Gamma(\theta^{-1}+Y_i)}{\Gamma(\theta^{-1})\Gamma(Y_i+1)}\Bigr)\Bigl(\frac{\theta^{-1}}{\theta^{-1}+\lambda_i}\Bigr)^{\theta^{-1}}\Bigl(\frac{\lambda_i}{\lambda_i+\theta^{-1}}\Bigr)^{Y_i}\tag{7}\label{7}$$` where `\(\Gamma\)` is the Gamma function: `$$\Gamma(\theta)=\int^\infty \exp(-t)t^{\theta-1}dt$$` --- ### The negative binomial distribution .pull-left[ <img src="STAT_L7_Count_files/figure-html/nbdis1-1.png" width="504" /> ] .pull-right[ <img src="STAT_L7_Count_files/figure-html/nbdis2-1.png" width="504" /> ] --- ### Estimation To summarize: If we assume an event count process with gamma heterogeneity in the rate of event arrivals, the resulting event count will follow a negative binomial distribution. With `\(\lambda_i = \exp(X_i \beta)\)` we get the log-likelihood: `$$\log L_{NB} = \sum_{i=1}^N \Bigl[ \Bigl(\sum_{j=0}^{Y_i-1} \log(j+\theta^{-1})\Bigr) -\log Y_i! - (Y_i - \theta^{-1}) \log[1+\theta \exp(X_i\beta)] + Y_i \log \theta + Y_iX_i\beta\Bigr]$$` We still model `\(\lambda_i = \exp(X_i \beta)\)`, but `\(E(Y) = \lambda(1+\theta\lambda)\)` where `\(\theta>0\)`. This means: - the model remains log-linear (which means IRR and predicted count interpretations remain open) - variance can be greater than mean - variance still depends on mean (which we want for an event count model) - larger values of `\(\theta\)` means greater dispersion - the model reduces to a Poisson when `\(\theta=0\)` - because `\(\theta>0\)`, R estimates `\(\frac{1}{\theta}\)` or `\(\log(\frac{1}{\theta})\)` ??? For underdispersion, you will need to use the “continuous parameter binomial” (CPB). It is a variant of the binomial distribution which is “scaled” to ensure that probabilities sum to 1.0. It has `\(E(Y_i)=\lambda_i\)` again and `\(Var(Y)=\lambda_i\theta\)` with `\(0<\theta<1\)`. It also reduces to the standard Poisson when `\(\theta=1\)`. The CPB distribution therefore imposes a theoretical “upper limit” on the count variable. This limit is due to the fact that the variability of *Y* is constrained by `\(\theta\)`; as `\(\theta \to 1\)` the upper limit disappears (i.e. goes to infinity). --- .pull-left-1[ ## Comparing Poisson and NB models Note that the significant theta in the NB model indicates overdispersion. The log-likelihoods and AICs confirm that we should prefer the NB here. Also the coefficients changed quite a bit. - pop and gdppc no longer significant (Poisson's tend to underestimate ses) - largest city seems to have grown in relative importance ] .pull-right-2[ .small[ ```r stargazer::stargazer(poi3, nb1, type = "html", style = "io", single.row = T, align = T, intercept.bottom = F, model.numbers = F) ``` <table style="text-align:center"><tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="2">Death</td></tr> <tr><td style="text-align:left"></td><td><em>Poisson</em></td><td><em>negative</em></td></tr> <tr><td style="text-align:left"></td><td><em></em></td><td><em>binomial</em></td></tr> <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">CONSTANT</td><td>-2.631<sup>***</sup> (0.016)</td><td>-3.606<sup>***</sup> (0.129)</td></tr> <tr><td style="text-align:left">POPULATION</td><td>0.0003<sup>***</sup> (0.00002)</td><td>0.0001 (0.001)</td></tr> <tr><td style="text-align:left">LAND</td><td>-0.155<sup>***</sup> (0.003)</td><td>-0.082<sup>**</sup> (0.039)</td></tr> <tr><td style="text-align:left">DENSITY</td><td>-0.484<sup>***</sup> (0.040)</td><td>-0.258<sup>**</sup> (0.117)</td></tr> <tr><td style="text-align:left">LARGESTCITY</td><td>0.018<sup>***</sup> (0.001)</td><td>0.037<sup>**</sup> (0.016)</td></tr> <tr><td style="text-align:left">GDPPC</td><td>-0.001<sup>**</sup> (0.0003)</td><td>0.002 (0.004)</td></tr> <tr><td style="text-align:left"><em>Observations</em></td><td>139</td><td>139</td></tr> <tr><td style="text-align:left"><em>Log likelihood</em></td><td>-10,543.370</td><td>-515.387</td></tr> <tr><td style="text-align:left">theta</td><td></td><td>1.454<sup>***</sup> (0.212)</td></tr> <tr><td style="text-align:left"><em>Akaike information criterion</em></td><td>21,098.740</td><td>1,042.773</td></tr> <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Notes:</em></td><td colspan="2" style="text-align:left"><sup>***</sup>p < .01; <sup>**</sup>p < .05; <sup>*</sup>p < .1</td></tr> </table> ] ] --- class: center, middle # Zero-Inflation and the Hurdle model .pull-1[.circleoff[![](https://static.turbosquid.com/Preview/2016/07/05__06_16_48/FishSkeletonb.jpg46B6AAD0-198F-40AC-B932-1768C4B0F869Zoom.jpg)]] .pull-1[.circleoff[![](https://www.psassets.ch/thumbs/2e/1d/08716b4f86e44e119ac2ec013b10-909889.jpg)]] .pull-1[.circleon[![](https://static.toiimg.com/thumb/imgsize-5231,msid-35966935,width-400,resizemode-4/35966935.jpg)]] --- ## Heaps of zeros .pull-left[ Ok, let's go back to the start. We've taken care of the large dispersion in this distribution, but there is still a literal heap of zeros... Indeed 25% have not had any fatalities yet (tragically subject to change, I'm sure). That is more than we'd expect from such an overdispersed distribution with a reasonably high `\(\lambda\)`. We can thus characterise this as an excessive number of zero outcomes for the distribution, or an 'inflation' of zeros. ] .pull-right[ <img src="STAT_L7_Count_files/figure-html/rehisto-1.png" width="504" /> ] --- ### Zero-Inflation We may have two different reasons for observing a 0 here: 1. some zeros are quite possible due to the main count/Poisson process - i.e. the rate is running, but just hasn't led to fatalities yet 2. other zeros might be due to some other, often unobserved process - i.e. (this is a bit of a stretch) reporting differences in coronavirus fatalities due to capacity or governmental edict A zero-inflated model assumes that 0s are due to two different processes: 1. a count model (usually NB because zero-inflation often goes hand-in-hand with overdispersion, but could be a Poisson) 2. a binary model (usually a logit) to model whether a zero observation is part of the first or the second population The zero-inflated negative binomial regression generates two separate models and then combines them. 1. a logit model is generated for the "certain zero" cases described above, predicting whether or not a country-year would be in this group 2. a negative binomial model is generated predicting the counts for country-years that are not certain zeros 3. the two models are combined; the expected count is expressed as a combination of the two processes. --- ## Estimating a ZINB When running the zero-inflated negative binomial in R, we must specify both models: first the count model, then the model predicting the certain zeros. If you don't, R will apply the same formula to both parts of the model. Sometimes this is what you want, but you *can* specify them differently. ```r library(pscl) zinb1 <- zeroinfl(Death ~ Population + Land + Density + LargestCity + GDPpc + offset(log(Confirmed)), covid, dist = "negbin") ``` In this example, we are predicting the count with all previous covariates, and predicting the certain zeros with them as well (depending on our expectations we can also include less). This is the response variable predicted by the full model. This refers to the logistic model predicting whether or not a country-year is a certain zero. This is the natural log of alpha. This is the dispersion parameter of the count model. We are mostly interested in the first part of the estimation (and report that). But the logistic model could be of interest as well. However, none of our variables seems to be associated to structural absence/presence of refugees. Nonetheless, the alpha and lnalpha are different from 0, so we know there is dispersion and a Poisson would be inappropriate. --- ### Diagnosing zero-inflation To know whether the ZINB was actually necessary, we use the Vuong test on both fitted models. ```r vuong(nb1, zinb1) ``` ``` ## Vuong Non-Nested Hypothesis Test-Statistic: ## (test-statistic is asymptotically distributed N(0,1) under the ## null that the models are indistinguishible) ## ------------------------------------------------------------- ## Vuong z-statistic H_A p-value ## Raw -2.3858147 model2 > model1 0.0085207 ## AIC-corrected 0.3284238 model1 > model2 0.3712956 ## BIC-corrected 4.3108548 model1 > model2 8.1312e-06 ``` The significant p-value shows us that we were right in going with the ZINB instead of a regular NB. --- ## REALLY two processes Zero-inflated models are based around the idea that there are two processes: a model for the general count (that can include some zeros) and a model for the 'extra' zeros. -- But what if they are **really** two processes? That is, that there is a process for whether or not there are any, and then if there are some how many. This is called a .red[hurdle model], since the first process represents a hurdle that must be crossed before another (count) process engages. ```r hur1 <- hurdle(Death ~ Population + Land + Density + LargestCity + GDPpc + offset(log(Confirmed)), covid, dist = "negbin") ``` --- class:center, middle.pull-left[ <img src="STAT_L7_Count_files/figure-html/zeroinf-1.png" width="504" /> ] .pull-right[ <img src="STAT_L7_Count_files/figure-html/hurdly-1.png" width="504" /> ] --- .tiny[ ```r texreg::htmlreg(list(poi3, nb1, zinb1, hur1), digits = 4, single.row = T, doctype = F, custom.model.names = c("Poisson","Negative Binomial","Zero-Inflated NB","Hurdle")) ``` <table cellspacing="0" align="center" style="border: none;"> <caption align="bottom" style="margin-top:0.3em;">Statistical models</caption> <tr> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b></b></th> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b>Poisson</b></th> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b>Negative Binomial</b></th> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b>Zero-Inflated NB</b></th> <th style="text-align: left; border-top: 2px solid black; border-bottom: 1px solid black; padding-right: 12px;"><b>Hurdle</b></th> </tr> <tr> <td style="padding-right: 12px; border: none;">(Intercept)</td> <td style="padding-right: 12px; border: none;">-2.6315 (0.0157)<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">-3.6058 (0.1287)<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> </tr> <tr> <td style="padding-right: 12px; border: none;">Population</td> <td style="padding-right: 12px; border: none;">0.0003 (0.0000)<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">0.0001 (0.0006)</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> </tr> <tr> <td style="padding-right: 12px; border: none;">Land</td> <td style="padding-right: 12px; border: none;">-0.1552 (0.0026)<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">-0.0822 (0.0393)<sup style="vertical-align: 0px;">*</sup></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> </tr> <tr> <td style="padding-right: 12px; border: none;">Density</td> <td style="padding-right: 12px; border: none;">-0.4841 (0.0403)<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">-0.2577 (0.1166)<sup style="vertical-align: 0px;">*</sup></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> </tr> <tr> <td style="padding-right: 12px; border: none;">LargestCity</td> <td style="padding-right: 12px; border: none;">0.0185 (0.0011)<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">0.0365 (0.0162)<sup style="vertical-align: 0px;">*</sup></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> </tr> <tr> <td style="padding-right: 12px; border: none;">GDPpc</td> <td style="padding-right: 12px; border: none;">-0.0007 (0.0003)<sup style="vertical-align: 0px;">*</sup></td> <td style="padding-right: 12px; border: none;">0.0016 (0.0040)</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> </tr> <tr> <td style="padding-right: 12px; border: none;">Count model: (Intercept)</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">-3.5396 (0.1396)<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">-3.6630 (0.1503)<sup style="vertical-align: 0px;">***</sup></td> </tr> <tr> <td style="padding-right: 12px; border: none;">Count model: Population</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">0.0001 (0.0005)</td> <td style="padding-right: 12px; border: none;">0.0002 (0.0006)</td> </tr> <tr> <td style="padding-right: 12px; border: none;">Count model: Land</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">-0.0827 (0.0444)</td> <td style="padding-right: 12px; border: none;">-0.0892 (0.0442)<sup style="vertical-align: 0px;">*</sup></td> </tr> <tr> <td style="padding-right: 12px; border: none;">Count model: Density</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">-0.2500 (0.1212)<sup style="vertical-align: 0px;">*</sup></td> <td style="padding-right: 12px; border: none;">-0.2722 (0.1275)<sup style="vertical-align: 0px;">*</sup></td> </tr> <tr> <td style="padding-right: 12px; border: none;">Count model: LargestCity</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">0.0332 (0.0185)</td> <td style="padding-right: 12px; border: none;">0.0383 (0.0194)<sup style="vertical-align: 0px;">*</sup></td> </tr> <tr> <td style="padding-right: 12px; border: none;">Count model: GDPpc</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">0.0004 (0.0041)</td> <td style="padding-right: 12px; border: none;">0.0030 (0.0042)</td> </tr> <tr> <td style="padding-right: 12px; border: none;">Count model: Log(theta)</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">0.4006 (0.1451)<sup style="vertical-align: 0px;">**</sup></td> <td style="padding-right: 12px; border: none;">0.3717 (0.1675)<sup style="vertical-align: 0px;">*</sup></td> </tr> <tr> <td style="padding-right: 12px; border: none;">Zero model: (Intercept)</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">130.1756 (12.2067)<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">-3.6424 (0.5258)<sup style="vertical-align: 0px;">***</sup></td> </tr> <tr> <td style="padding-right: 12px; border: none;">Zero model: Population</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">0.3628</td> <td style="padding-right: 12px; border: none;">-0.0064 (0.0041)</td> </tr> <tr> <td style="padding-right: 12px; border: none;">Zero model: Land</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">-45.1725 (12.1865)<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">1.0116 (0.7045)</td> </tr> <tr> <td style="padding-right: 12px; border: none;">Zero model: Density</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">107.8707 (40.8595)<sup style="vertical-align: 0px;">**</sup></td> <td style="padding-right: 12px; border: none;">-0.0068 (0.9325)</td> </tr> <tr> <td style="padding-right: 12px; border: none;">Zero model: LargestCity</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">-3.3828 (3.9407)</td> <td style="padding-right: 12px; border: none;">0.1326 (0.1551)</td> </tr> <tr> <td style="padding-right: 12px; border: none;">Zero model: GDPpc</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;">-215.1866 (4.7755)<sup style="vertical-align: 0px;">***</sup></td> <td style="padding-right: 12px; border: none;">0.0019 (0.0432)</td> </tr> <tr> <td style="border-top: 1px solid black;">AIC</td> <td style="border-top: 1px solid black;">21098.7451</td> <td style="border-top: 1px solid black;">1042.7731</td> <td style="border-top: 1px solid black;">1044.2251</td> <td style="border-top: 1px solid black;">1048.3435</td> </tr> <tr> <td style="padding-right: 12px; border: none;">BIC</td> <td style="padding-right: 12px; border: none;">21116.3520</td> <td style="padding-right: 12px; border: none;">1063.3144</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> </tr> <tr> <td style="padding-right: 12px; border: none;">Log Likelihood</td> <td style="padding-right: 12px; border: none;">-10543.3726</td> <td style="padding-right: 12px; border: none;">-514.3866</td> <td style="padding-right: 12px; border: none;">-509.1126</td> <td style="padding-right: 12px; border: none;">-511.1718</td> </tr> <tr> <td style="padding-right: 12px; border: none;">Deviance</td> <td style="padding-right: 12px; border: none;">20538.1600</td> <td style="padding-right: 12px; border: none;">149.2866</td> <td style="padding-right: 12px; border: none;"></td> <td style="padding-right: 12px; border: none;"></td> </tr> <tr> <td style="border-bottom: 2px solid black;">Num. obs.</td> <td style="border-bottom: 2px solid black;">139</td> <td style="border-bottom: 2px solid black;">139</td> <td style="border-bottom: 2px solid black;">139</td> <td style="border-bottom: 2px solid black;">139</td> </tr> <tr> <td style="padding-right: 12px; border: none;" colspan="6"><span style="font-size:0.8em"><sup style="vertical-align: 0px;">***</sup>p < 0.001, <sup style="vertical-align: 0px;">**</sup>p < 0.01, <sup style="vertical-align: 0px;">*</sup>p < 0.05</span></td> </tr> </table> ] --- ### Rootograms .pull-left[ ```r countreg::rootogram(poi3) ``` <img src="STAT_L7_Count_files/figure-html/root1-1.png" width="504" /> ```r # countreg::rootogram(nb1) ``` ] .pull-right[ ```r countreg::rootogram(zinb1) ``` <img src="STAT_L7_Count_files/figure-html/root2-1.png" width="504" /> ```r # countreg::rootogram(hur1) ``` ] ??? We can use a rootogram to visualize the fit of a count regression model. The `rootogram()` function in the `countreg` package makes this easy. ThBasically, the function `counter::rootogram(model)` plots the empirical/observed distribution (black bars) as 'hanging' off the predicted distribution (red line). Bars that do not reach zero indicate overprediction (because the model expected more than there were) and bars that reach beyond zero into the negative indicate underprediction (because the model expected fewer than there were). You can use these rootograms to not only establish which fits better overall, but where there are still issues with the fit. The counts have been transformed with a square root transformation to prevent smaller counts from getting obscured and overwhelmed by larger counts. - class: center, middle # Summary .pull-1[.circleon[![](https://static.turbosquid.com/Preview/2016/07/05__06_16_48/FishSkeletonb.jpg46B6AAD0-198F-40AC-B932-1768C4B0F869Zoom.jpg)]] .pull-1[.circleon[![](https://www.psassets.ch/thumbs/2e/1d/08716b4f86e44e119ac2ec013b10-909889.jpg)]] .pull-1[.circleon[![](https://static.toiimg.com/thumb/imgsize-5231,msid-35966935,width-400,resizemode-4/35966935.jpg)]]