class: center, middle, inverse, title-slide # Statistics for International Relations Research II ## Maximum Likelihood Estimation ###
James Hollway
--- class: center, middle .pull-1[.circleon[![](https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTc5EWzkqd2tomQAKz3aV9rA-Fk_IKD623ZsLLr76-tfkI9Ienf)]] .pull-1[.circleon[![](https://api.time.com/wp-content/uploads/2019/11/icecream.jpg?w=800&quality=85)]] .pull-1[.circleon[![](https://thumbs.dreamstime.com/b/isometric-businessman-walking-to-different-way-other-people-think-stand-out-crowd-unique-concept-68511039.jpg)]] --- class: center, middle .pull-1[.circleon[![](https://is4-ssl.mzstatic.com/image/thumb/Purple125/v4/5f/8f/55/5f8f55e1-6ca3-3911-b3f1-5aad674a5095/source/512x512bb.jpg)]] .pull-1[.circleon[![](https://i.ya-webdesign.com/images/polo-drawing-cartoon-4.png)]] .pull-1[.circleon[![](https://www.topratedforexbrokers.com/wp-content/uploads/comparison-list.svg)]] --- class: center, middle # Introduction to MLE .pull-1[.circleon[![](https://is4-ssl.mzstatic.com/image/thumb/Purple125/v4/5f/8f/55/5f8f55e1-6ca3-3911-b3f1-5aad674a5095/source/512x512bb.jpg)]] .pull-1[.circleoff[![](https://activecalculus.org/single/images/1_6_Intro.svg)]] .pull-1[.circleoff[![](https://www.topratedforexbrokers.com/wp-content/uploads/comparison-list.svg)]] --- ## Today's data .pull-left-2[ ```r library(WDI) wdi <- WDI(country = "all", indicator = c("EN.POP.DNST", "EN.ATM.CO2E.KT", "NY.GDP.PCAP.PP.CD"), start = 2012, end = 2012, extra = TRUE, cache = NULL) wdi <- subset(wdi, region != "Aggregates") wdi <- na.omit(wdi) names(wdi)[4:7] <- c("PopDensity", "CO2", "GDPpc","WB") table1::table1(~ CO2 + GDPpc + PopDensity, wdi) ```
|            |                   | Overall (N=192)        |
|------------|-------------------|------------------------|
| CO2        | Mean (SD)         | 172000 (837000)        |
|            | Median [Min, Max] | 9420 [11.0, 10000000]  |
| GDPpc      | Mean (SD)         | 19000 (22200)          |
|            | Median [Min, Max] | 11100 [670, 142000]    |
| PopDensity | Mean (SD)         | 333 (1540)             |
|            | Median [Min, Max] | 82.2 [1.82, 18900]     |
]
.pull-right-1[
![](https://farm6.staticflickr.com/5540/31519242204_424deb06ca_b.jpg)
]
---
## Ordinary method

Recall that inference is making statements about parameter(s), \\(\theta\\), given a sample of data, \\(X_1, X_2, ..., X_n\\).

One thing we could do is estimate the effect of GDP per capita on `\(CO_2\)` emissions using OLS. Recall that, assuming the conditions for the Gauss-Markov Theorem are met, all we need to do to identify the .blue[best linear unbiased estimates] for the data is plug the values of _x_ and _y_ into the following equations:

$$ \hat{\beta} = \frac{\sum_i^n (x_i - \bar{x}) (y_i - \bar{y})}{\sum_i^n (x_i - \bar{x})^2} $$

$$ \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} $$

--

.pull-left-1[![:scale 60%](https://media.giphy.com/media/JUqiFbumTAPYIeM8yJ/giphy.gif)]
.pull-right-2[Cool, _but_... what if:
- some assumptions are not met, e.g. nonlinearity?
- the dependent variable is not continuous?]
---
### Beyond the bounds of OLS

OLS assumes a linear model but also an unbounded, continuous outcome.

.blockquote[The linear model is a good one, but the world around can be both nonlinear and bounded. ~ Ward and Ahlquist]

--

- For example, we need to take logs of both `\(CO_2\)` and per capita GDP for the model to work.
- Or we may want to predict a probability, which varies between 0 and 1. (OLS may predict values below 0 or above 1...)
- Or perhaps `\(Y\)` _isn’t_ normally distributed around the mean for each value of `\(X\)`... e.g. for a binary or discrete response.
---
## Introduction to MLE

.red[Maximum likelihood estimation] (MLE) is an efficient and widely-used approach. It is an iterative process that estimates the best-fitting parameter values, \\(\theta\\), _for any probability distribution_.

--

Basically, MLE searches for the value(s) of \\(\theta\\) that _maximises the likelihood_ of observing the sample data collected.

--

.pull-down[
... Er, what does 'likelihood' mean?
]
---
## ‘Probably’ a probability review would be useful

When we talked about probability (e.g. in Stats I), we were talking about the probability of observing events

- probabilities of different outcomes for a certain event must add up to 1 (.red[law of total probability])
- e.g. if heads = `\(p\)`, then tails = `\(1-p\)` (the .red[complement])

If events are independent, then the probability of pairs of outcomes is the product of these probabilities (.red[joint probability]).

- e.g. probability of observing two heads from an unbiased coin = 0.5 x 0.5 = 0.25
- useful for expressing more complex problems, as we shall see...

--

But events are rarely fully independent...

- e.g. though schizophrenia in the population is 1%, having a sister with schizophrenia greatly increases the risk of your becoming schizophrenic (.red[conditional probability])
- `\(\Pr(X|Y)\)`, e.g. `\(\Pr(\text{Sz}|\text{sister has Sz})\)`

We can abstract the proportion of observations expected with a smooth, continuous curve called a .red[probability density function] (PDF).

- Probability that, for any one observation `\(i\)`, `\(Y\)` will take on the particular value `\(y\)`.
- A function of the expected value of the distribution (e.g. \\(\mu\\)) and the variability of the distribution around that mean (e.g. \\(\sigma^2\\)).

???
A PDF also has a corresponding .red[cumulative distribution function] (CDF) that shows the cumulated probability up to that point.
---
### PDF, CDF, come again?

- The PDF (Probability Density Function) looks at the probability density at one point.
- The CDF (Cumulative Distribution Function) is the total probability of anything below it.

.center[![:scale 60%](http://www.fekete.com/SAN/WebHelp/FeketeHarmony/Harmony_WebHelp/Content/images/Reference_Material/Analysis_Methods_Theory/Risk/Normal_Distribution_Plot_500x304.png) ]
---
### Likelihood ≠ Probability

Now I want to clear something up: These are not the same.

The _probability_ of an event `\(X\)` depends on model parameters \\(\theta\\): i.e.
\\(\Pr(X|\theta)\\)

- Knowing parameters → Prediction of outcome.

The _likelihood_ concerns how likely parameter values are given the data: \\(L(\theta|X)\\)

- Observation of data → Estimation of parameters.

The likelihood is being maximised with respect to the parameters, not the sample data.

--

.pull-down[
What does this mean? Be careful when using the term “likelihood” when you are talking about model results. Don't say “there is a high likelihood that the stock market will crash tomorrow”; that’s a probability.
]
---
## Intuition

Ok, with that out of the way, let’s learn the intuition behind MLE on a toy example first:

--

Let’s say that you are having a disagreement with your friends about the best beer (or wine, or coffee, or tea). You prefer one type of beer, and all three of your friends prefer the other type. You want to show them.

--

You devise a [taste test](https://adland.tv/adnews/schlitz-schlitz-vs-michelob-live-1981-060-usa). You put both types into unmarked mugs and ask your friends to indicate which is their favourite. You mark down 1 if they choose your favourite and 0 if they choose their favourite.

```r
drinks <- c(1,1,0)
```

<!-- Given this data, what is the unknown probability distribution most likely to have generated it? -->
---
.pull-left-2[
### Bernoulli

This is a binary outcomes problem. If the probability of success is \\(\theta\\) then the complement is \\(1- \theta\\).

In other words, this is a Bernoulli\\(^1\\) distribution: \\(Y \sim f_B(y|\theta)\\) with \\(E[Y] = \theta\\) and \\(\operatorname{Var}(Y) = \theta(1-\theta)\\), such that the probability of observing (a particular set of) data is:

$$ \theta^{\text{\#1s}} (1 - \theta)^{\text{\#0s}} $$

.footnote[1 Jacob Bernoulli was a Swiss mathematician who derived the law of large numbers, discovered the mathematical constant _e_, and formulated the eponymous Bernoulli and binomial distributions.]
Let's consider how likely three particular parameter values are for each observation: $$ L(\theta|x_i) $$ ] -- .pull-right-1[ .panelset[ .panel[ .panel-name[Drinks indistinguishable] Maybe drinks are indistinguishable (\\(\hat\theta = 0.5\\)): ```r 0.5^1 * (1-0.5)^0 ``` ``` ## [1] 0.5 ``` ```r 0.5^1 * (1-0.5)^0 ``` ``` ## [1] 0.5 ``` ```r 0.5^0 * (1-0.5)^1 ``` ``` ## [1] 0.5 ``` ] .panel[ .panel-name[Friends are right] Or maybe your friends are right (\\(\hat\theta = 0\\)): ```r 0^1 * (1-0)^0 ``` ``` ## [1] 0 ``` ```r 0^1 * (1-0)^0 ``` ``` ## [1] 0 ``` ```r 0^0 * (1-0)^1 ``` ``` ## [1] 1 ``` ] .panel[ .panel-name[I’m right] Or maybe I'm right (\\(\hat\theta = 1\\)): ```r 1^1 * (1-1)^0 ``` ``` ## [1] 1 ``` ```r 1^1 * (1-1)^0 ``` ``` ## [1] 1 ``` ```r 1^0 * (1-1)^1 ``` ``` ## [1] 0 ``` ] ] ] --- ### Aggregating individual likelihoods <!-- Let \\(X_i\\) be random variables and \\(x_i\\) their realizations (actual observed values). --> <!-- Let’s assume \\(X_i\\) are distributed according to a certain probability distribution \\(D\\) (e.g. Bernoulli) --> <!-- with density \\(f(\theta)\\) where \\(\theta\\) is a parameter of said distribution. --> If we assume that observations are: - .red[independent and identically distributed] (i.i.d.), i.e. not related to one another, and - governed by the same .red[data generating process] (DGP), \\(f(x|\theta)\\) then joint probability of the observations, \\( \Pr(Y_i = y_i \forall_i) \\), is simply the product of the marginals because unconditional probabilities compound. $$ f(x_1|\theta)f(x_2|\theta)f(x_3|\theta) = \prod_i^N \Pr(X_i = x_i) = L(\theta|x_1, x_2, x_3) $$ <!-- The joint probability that our observations were generated by our distribution \\(D(\theta)\\) is therefore: --> This product is our .red[likelihood function], \\(L(\theta|x_1, x_2, ..., x_n)\\), the probability of observing all data as a function of \\(\theta\\)! 
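This likelihood function can be sketched directly in R; a minimal illustration (the `drinks` vector is redefined here so the snippet stands alone):

```r
# joint Bernoulli likelihood of the taste-test data, as a function of theta
drinks <- c(1, 1, 0)
lik <- function(theta) prod(theta^drinks * (1 - theta)^(1 - drinks))

lik(0.5)  # 0.125
lik(2/3)  # ~0.148, the highest this function gets
```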
Maximizing this function yields \\(\hat\theta\\), the value of the parameter that makes the sample most probable.
---
.pull-left[
### Grid search

To find the most likely estimate \\(\hat\theta\\) for this data, you could use .red[grid search]: a systematic and evenly spaced grid of points to see which has the higher likelihood. Since \\(\theta\\) must be between 0 and 1, we check every .1 increment and calculate the likelihood for this data.

If we compare the likelihood function at two parameter points and find \\(L(\hat\theta_1|x) > L(\hat\theta_2|x)\\), then the observed data are more likely to have occurred if \\(\theta = \hat\theta_1\\) than if \\(\theta = \hat\theta_2\\), i.e. \\(\hat\theta_1\\) is a more plausible value for \\(\theta\\) than \\(\hat\theta_2\\).

Their beer (\\(\hat\theta = 0\\)) is totally discredited, but neither the equiprobable guess (\\(\hat\theta = 0.5\\)) nor my supposition that mine is best (\\(\hat\theta = 1\\)) is the value most likely to have generated this data. But what about in between .6 and .7? Do we just repeatedly guess?
]
.pull-right[

```r
library(ggplot2)
thetahat <- seq(0, 1, .1)
likeli <- data.frame(theta = thetahat,
                     likelihood = thetahat^sum(drinks == 1) *
                       (1 - thetahat)^sum(drinks == 0))
ggplot(likeli, aes(theta, likelihood)) + geom_point() +
  scale_x_continuous(breaks = seq(0, 1, .1)) +
  xlab("theta") + ylab("likelihood")
```

<img src="STAT_L4_MLE_files/figure-html/plotintuition-1.png" width="504" />
]
---
### Differential calculus

--

<!-- But what is the most likely parameter estimate for _all_ the data? (i.e. maybe the parameter is none of those 3...) -->

<!-- We will call the expression that describes the joint probability of the observed data as a function of the parameters the likelihood function: \\(L(y|\theta)\\) -->

Of course, we don’t know \\(\hat\theta\\); that's what we want to figure out.
We want to know the most likely values of \\(\theta\\), given \\(X\\):

$$ L(\hat{\theta}|X) \propto \Pr(X|\hat{\theta}) $$

We can get this by treating the likelihood as a function (which it is) and using tools from calculus to solve for the maximum:

$$ L = \theta^2(1-\theta)^1 $$

$$ \log L = 2 \log \theta + 1 \log(1-\theta) $$

$$ \frac{\partial \log L}{\partial \theta} = \frac{2}{\theta} - \frac{1}{(1-\theta)} = 0 $$

$$ \hat\theta = \frac{2}{3} $$

The value of \\(\theta\\) that maximises the likelihood function is called the .red[maximum likelihood estimate] or MLE.

.footnote[Ward and Ahlquist show how you would get the same result by using the binomial distribution instead of the Bernoulli distribution.]

???
Ok, I swear this is still intuition, we'll get to the guts of this next.

<!-- --- -->
<!-- ## Binomial distribution -->
<!-- $$ P(x) = \frac{n!}{x!(n-x)!} p^x (1-p)^{n-x} $$ -->
<!-- - \\(P(x)\\) is number of heads obtained -->
<!-- - \\(n!\\) is count of how many times we flip the coin (# event can occur) -->
<!-- - \\(x!\\) deletes duplications -->
<!-- - \\((n-x)!\\) ends count of events -->
<!-- - \\(p^x\\) is probability of success (head) for n trials -->
<!-- - \\((1-p)^{n-x}\\) is probability of failure (tail) for n trials -->
<!-- Joint probability of obtaining h heads and n-h tails is simply the product of all the individual probabilities. -->
---
class: center, middle
# Deeper into MLE

.pull-1[.circleoff[![](https://is4-ssl.mzstatic.com/image/thumb/Purple125/v4/5f/8f/55/5f8f55e1-6ca3-3911-b3f1-5aad674a5095/source/512x512bb.jpg)]]
.pull-1[.circleon[![](https://activecalculus.org/single/images/1_6_Intro.svg)]]
.pull-1[.circleoff[![](https://www.topratedforexbrokers.com/wp-content/uploads/comparison-list.svg)]]
---
## Back to our data

Ok, let's get back to our main example for today: `\(CO_2\)` emissions.
```r
x.mat <- cbind(1, log(wdi$GDPpc))
y.vec <- log(wdi$CO2)
```

--

### Express joint probability

If observations are _i.i.d._, then we can express the joint probability thus:

$$ \Pr(y_1|\theta_1) = f(y_1|\theta_1) $$

$$ \Pr(y_1,y_2|\theta_1,\theta_2) = f(y_1|\theta_1) \times f(y_2|\theta_2) $$

$$ \Pr(y_1,...,y_n|\theta_1,...,\theta_n) = \prod_i^n f(y_i|\theta_i) $$
---
### Convert joint probability into a likelihood

$$ L(\theta|y) = h(y) \Pr(y|\theta) \propto \Pr(y|\theta) = \prod_i^n f(y_i|\theta_i) $$

Note the presence of _h(y)_, an _unknown_ scaling factor that is a function of the data but not the parameter. Because it is unknown and cannot be estimated, the likelihood is only defined up to this multiplicative constant and does not have a direct interpretation as a probability; we have to interpret likelihoods relative to one another.

_But_ the .red[likelihood principle] states that, after the data have been collected, all information relevant for inference about \\(\theta\\) is contained in the likelihood function for the observed data under the assumed probability model. Two likelihood functions that are proportional to one another contain the same information about \\(\theta\\).

Since _h(y)_ just adds notational complexity, we’re going to ignore it, and just say that the likelihood is ‘_proportional to_’ \\( \propto \Pr(y|\theta) \\)

$$ L(\theta|x) = h(x) f(x|\theta) \propto f(x|\theta) $$
---
## Two decisions

Using maximum likelihood estimation means making two decisions (this relates back to our discussion about [partitioning the data into pattern and residuals](https://jhollway.github.io/RISP062/STAT_L1_Refresher.html?panelset5=r-code6&panelset6=r-code7&panelset7=r-code8&panelset8=r-code9&panelset9=r-code10#55)...):

1. pattern: what is our model for the parameters of the assumed probability distribution?
  - also called the _systematic component_: \\( \theta_i = \beta_0 + \beta_1 x_i \\)
1.
residuals: what probability distribution are we assuming governs our data generating process?
  - also called the _stochastic component_: \\( Y_i \sim f(\theta) \\)

--

We addressed the systematic component in previous classes, but now we have much more flexibility on the stochastic component.

--

In \\( Y_i \sim f(\theta) \\), \\(f\\) could be one of a pretty wide range of probability density or mass functions.

- we'll tackle a number of these this semester: e.g. the binomial, Poisson, or Weibull, or even a mixture (NB: mixture ≠ mixed)
---
### Gaussian

Here we choose the .red[Gaussian], where residuals are expected to be normally distributed around the mean. We therefore consider a model that looks like \\(Y_i \sim f_N(\mu, \sigma^2)\\), where \\(E(Y) = \mu\\) and \\(\operatorname{Var}(Y) = \sigma^2\\), giving a PDF:

$$ \Pr(Y_i = y_i) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp[-\frac{(Y_i - \mu)^2}{2 \sigma ^2}] $$

\\( \theta_i = (\mu_i, \sigma^2) \\) are the parameters we'll be estimating, with \\( \mu_i = \beta_0 + \beta_1 x_i \\) representing our systematic component

The squared term in \\(-\frac{(Y_i - \mu)^2}{2 \sigma ^2}\\) makes the PDF symmetric around the mean

Because that exponent is never positive, \\(\exp[-\frac{(Y_i - \mu)^2}{2 \sigma ^2}]\\) lies between 0 and 1 for all values of the variable between \\(-\infty\\) and \\(\infty\\)

\\(\frac{1}{\sqrt{2 \pi \sigma^2}}\\) ensures the PDF integrates to a total mass of 1

This means our likelihood is:

$$ L(\mu, \sigma^2|y) = \prod_i^n (2\pi\sigma^2)^\frac{-1}{2} \exp[\frac{-(y_i - \mu_i)^2}{2\sigma^2}] $$
---
## Simplify expression

.pull-left[
$$ L(\mu, \sigma^2|y) = \prod_i^n (2\pi\sigma^2)^\frac{-1}{2} \exp[\frac{-(y_i - \mu_i)^2}{2\sigma^2}] $$
]
.pull-right[
Problem: likelihood functions often look scary:
- products are generally hard to deal with, and...
- issues with numeric precision when we get into teeeeny probabilities...
]

--

So, we work with the _log_ of this function. The log is a monotonic transformation which preserves important properties (i.e.
extrema occur in the same place), but turns a product into a sum, which is easier to work with: Whereas \\(L(\theta|x_1, x_2) = f(x_1|\theta)f(x_2|\theta)\\), the .red[log-likelihood] is \\(\log L(\theta|x_1, x_2) = \log f(x_1|\theta) + \log f(x_2|\theta)\\), or $$ \log(\prod_i^N \Pr(X_i = x_i)) = \sum_i^N \log(\Pr(X_i = x_i)) $$ That is because one of the main .red[logarithmic identities] or .red[logarithmic laws] is that the logarithm of a product is the sum of the logarithms of the numbers being multiplied: $$ \log(xy) = \log(x) + \log(y) $$ ??? Multiplying small numbers yields even smaller numbers, which makes the floating point approximation used by a computer less accurate. The asymptotic properties of sums are easier to analyze. Products are not numerically stable: they tend to converge quickly to zero or to infinity. Sums are instead more stable from a numerical standpoint; this is important because the maximum likelihood problem is often solved numerically on computers where limited machine precision does not allow to distinguish a very small number from zero and a very large number from infinity. 
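This numerical-stability point can be demonstrated in a couple of lines of R:

```r
# a product of many tiny probabilities underflows double precision to zero...
p <- rep(1e-10, 40)
prod(p)      # 0
# ...while the equivalent sum of logs remains perfectly representable
sum(log(p))  # about -921
```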
---
### Taking the log of our normal PDF

$$ \log L(\mu, \sigma^2|y) = \sum_i^n \log[ (2\pi\sigma^2)^\frac{-1}{2} \exp[\frac{-(y_i - \mu_i)^2}{2\sigma^2}] ] $$

$$ = \sum_i^n [ - \frac{1}{2} \log(2\pi) - \frac{1}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} (y_i - (\beta_0 + \beta_1 x_i))^2] $$

$$ -2 \log L(\mu, \sigma^2|y) = \sum_i^n [ \log(2\pi) + \log(\sigma^2) + \frac{1}{\sigma^2} (y_i - (\beta_0 + \beta_1 x_i))^2] $$

### Eliminating constants

Since the first term, \\(\log(2\pi)\\), is just a constant unrelated to any of the parameters to be estimated, it can be safely dropped, leaving us with:

$$ -2 \log L(\mu, \sigma^2|y) = \sum_i^n [\log(\sigma^2) + \frac{1}{\sigma^2} (y_i - (\beta_0 + \beta_1 x_i))^2] $$
---
## Find the extrema of this expression

Therefore, the maximum likelihood estimator \\(\hat\theta\\) is:

$$ \hat\theta = \arg\max_\theta L(\theta|x) $$

It is easy to estimate linear normal regression in R using MLE. Here are the main steps, showing you how to program the likelihood directly so that you might understand what is happening ‘under the hood’. First let’s write up a custom log-likelihood function for calculating the log-likelihood for our data.

```r
loglik.my <- function(par, X = x.mat, y = y.vec){
  k <- ncol(X); n <- nrow(X)
  xbeta <- X %*% par[1:k]                      # linear predictor, in matrix notation
  sigma <- sqrt(sum((y - xbeta)^2) / (n - k))  # sigma profiled out from the residuals
  sum(-log(sigma) - (1/(2*sigma^2)) * (y - xbeta)^2)  # log-likelihood, up to constants
}
```
---
### Marco Polo

To find the \\(\hat\theta\\) that maximises the likelihood, we could just try many values of \\(\theta\\) (i.e. grid search, if evenly spaced). But this is tedious and carries no guarantee that the maximum found will actually be the .red[global maximum].

--

.pull-left[![](https://external-content.duckduckgo.com/iu/?u=http%3A%2F%2Fpbs.twimg.com%2Fmedia%2FCoyZEnbWIAAIegl.jpg&f=1&nofb=1)]
.pull-right[
Instead we can play a sophisticated game of ‘Marco Polo’ or ‘hotter-colder’.
Rather than a grid search, we’re going to check where we are at each estimate, get feedback on whether we are close to the maximum and which way we should go, move in that direction, and then check again. The ‘hotter/colder’ information such algorithms utilize comes from the way in which the likelihood function changes as we move across the parameter space. ] --- ### Alpine Skiing/Climbing .pull-left[ The intuition of Newton-Raphson is pretty simple: you pick a starting point and look around for the steepest slope. - if you are trying to minimise a function, ski downhill a little way, check again, ski again, etc until you can't go any lower - if you are trying to maximise a function, climb uphill until you can't get any higher We need two pieces of information at each check point to do this: - whether on a plateau or not - which way to go if not The neat thing about this approach is that it will work for any family of models that you can write an equation for. ] .pull-right[ <img src="STAT_L4_MLE_files/figure-html/plotintuition2-1.png" width="504" /> ] --- ### Regularity So long as some .red[regularity conditions] are met... - \\(L\\) should be smooth without any kinks or holes (i.e. 
continuous), at least in the neighbourhood of \\(\hat\theta\\)
- \\(\hat\theta\\) should not fall on some edge or boundary of the parameter space
- the support of the distribution or mass function for `\(X\)` should not depend on the value of \\(\theta\\)
- likelihoods with a single maximum, rather than many small, local hills or plateaus, are easier to work with

...then the log-likelihood should be twice differentiable around the MLE:

- the .red[first derivative], \\(f'(y)\\), gives us information about whether we are on a plateau and which direction to move in to maximise/minimise the log-likelihood
- the .red[second derivative], \\(f''(y)\\), gives us information about whether we are at a maximum or minimum, and how much uncertainty there is
---
#### First derivative

We want to move the parameter vector to the point at which the value of the likelihood is at its highest, and to do this we want information about the slope of the likelihood function at each point. This information is offered by the matrix of first derivatives of the (log-)likelihood w.r.t. the \\(\theta\\)s, called the .red[gradient] matrix \\(\nabla\\).

The .red[score function], \\(S(\theta)\\), is the gradient of the log-likelihood. Evaluated at a particular point \\(\theta\\), the score function will describe the steepness of the log-likelihood.

- If the gradient is positive, then increasing \\(\hat\theta\\) is increasing the likelihood, so we should increase our estimate more...
- If the gradient is negative, then increasing \\(\hat\theta\\) is decreasing the likelihood, so we should decrease our estimate of \\(\hat\theta\\)
- If the function is continuous, then at the maximum or minimum, this vector of first partial derivatives equals 0, telling us when we are at a maximum (or minimum).

--

We can therefore find an extremum of this curve by setting the first derivative to zero and solving.

--

But is it the maximum or the minimum?
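For the toy Bernoulli example from earlier, the calculus answer (\\(\hat\theta = 2/3\\)) can be checked numerically; a minimal sketch using R's one-dimensional optimiser:

```r
# log-likelihood for two 1s and one 0 (the taste-test data)
loglik <- function(theta) 2 * log(theta) + log(1 - theta)

# climb to the theta with the highest log-likelihood
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum  # ~0.667
```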
--- #### Second derivative The second derivative is closely related to the notion of curvature of a function, and is critical in both estimating the MLE and describing our uncertainty. Second derivative gives us the direction in which the slope of the function is changing: <!-- - If \\( f''(Y)\\) positive, then slope of \\( f''(Y)\\) is becoming more positive in the variable, indicating a minimum. --> - If \\( f''(Y)\\) negative, then slope of \\( f'(Y)\\) becoming less positive, indicating a maximum. - The bigger the absolute value of the second derivative, the more ‘curved’ the function, which tells us the rate at which the slope is changing... -- #### Putting it together .pull-left[ $$ \hat\theta_k = \hat\theta_j + \Delta_j \lambda_j $$ - ∆ (delta) tells us the direction we want to take a step in - λ (lambda) tells us the amount by which we want to change our estimates (the “length” of the step size) - using _j_ here for _k - 1_; the old parameter estimates... ] .pull-right[ ![](https://external-content.duckduckgo.com/iu/?u=http%3A%2F%2Fimg.sparknotes.com%2Ffigures%2F6%2F6feb2790f92651105afa8c668bf0b7ba%2Ffigure3-8p-3.gif&f=1&nofb=1) ] --- ### Using optim ```r mle.fit <- optim( par = c(5,5), # practically speaking, OLS estimates usually good starting guesses fn = loglik.my, # function to minimise method = "BFGS", # algorithm choice (a quasi-Newton-Raphson method) control = list( trace = T, # trace optimisation progress maxit = 1000, # maximum number of iterations fnscale = -1 # changes minimiser to maximiser ), hessian = T) # return Hessian matrix (later) ``` ``` ## initial value 814.675620 ## iter 10 value 251.514767 ## iter 10 value 251.514767 ## iter 10 value 251.514767 ## final value 251.514767 ## converged ``` ```r if(mle.fit$convergence != 0) print("WARNING: Convergence Problems; Try again!") ``` <!-- https://www.r-bloggers.com/how-to-build-a-logistic-regression-model-from-scratch-in-r/ --> --- ## Establishing uncertainty .pull-left[ The second derivative 
is also important for describing our uncertainty about \\(\hat\theta\\). The negative of the second derivative tells us how much information (a quantity, not a function) about \\(\theta\\) we have at the MLE. This is called .red[Fisher information], \\(I(\hat\theta)\\).

When \\(\theta\\) is a _k_-dimensional vector, the matrix of second partial derivatives is known as the .red[Hessian matrix], \\(H(\theta)\\). Inverting the negative of the Hessian matrix, \\(-H(\hat\theta)^{-1}\\), will provide the variance-covariance matrix, the square root of the diagonal of which contains the estimated standard errors for the parameters.
]
.pull-right[

```r
ses <- sqrt(-diag(solve(mle.fit$hessian)))  # standard errors
z <- mle.fit$par/ses
p.z <- 2 * (1 - pnorm(abs(z)))  # p-values
out.table <- data.frame(Est = mle.fit$par, SE = ses, Z = z, pval = p.z)
round(out.table, 2)
```

```
##     Est   SE    Z pval
## 1  0.39 1.27 0.30 0.76
## 2  0.96 0.14 7.08 0.00
```
]

These results imply that, in the set of countries/year examined, higher levels of GDP are associated with higher levels of `\(CO_2\)` emissions. The estimated coefficient for GDP, `\(\beta_1\)`, means that for every order of magnitude increase in GDP per capita, there will be roughly an order of magnitude increase in annual `\(CO_2\)` emissions, on average.

???
Therefore, expected Fisher information, written in terms of the Hessian matrix, is:

$$ I(\theta) \equiv E[S(\theta)S(\theta)^T] = -E[H(\theta)] $$
---
## MLE Properties

.red[Invariance]
- a transformation of the MLE equals the MLE of that transformation

.red[Large-sample properties] based on the .red[Law of Large Numbers] and the .red[Central Limit Theorem]

1. .red[Consistency]: given infinite data, the MLE collapses to the ‘true’ parameter value
2. .red[Asymptotic efficiency]: as sample size increases, the MLE becomes an unbiased minimum-variance estimator
3.
.red[Asymptotic normality]: as sample size increases, the sampling distribution of the MLE becomes normal
  - allowing conventional hypothesis testing and _z_-scores for results
  - enables predictions of yet-to-be observed data

--

_No guarantee_ that the MLE is unbiased, and it often is biased in small samples. But bias is only a second-order concern here:

1. bias disappears as _n_ increases
1. bias in the MLE can be estimated and accounted for, if needed, using cross-validation
1. an unbiased estimator is not robust to re-parameterisation
---
class: center, middle
# Diagnostics and Comparisons

.pull-1[.circleoff[![](https://is4-ssl.mzstatic.com/image/thumb/Purple125/v4/5f/8f/55/5f8f55e1-6ca3-3911-b3f1-5aad674a5095/source/512x512bb.jpg)]]
.pull-1[.circleoff[![](https://activecalculus.org/single/images/1_6_Intro.svg)]]
.pull-1[.circleon[![](https://www.topratedforexbrokers.com/wp-content/uploads/comparison-list.svg)]]
---
## Diagnostics

Because we cannot get _h(x)_, the likelihood is itself not meaningful as a raw value but, so long as you are basing these calculations on the _exact same data_, we can use the likelihood to compare across models through the:

- likelihood ratio
- score test
- Wald statistics
- Akaike information criterion
- Bayesian information criterion

NB: Sometimes including a new parameter will mean losing some observations when only non-missing data is retained. In this case, it is difficult to compare/contrast likelihoods. _Again_, comparing likelihood-based measures is only useful on the _same data_.
---
### Likelihood ratios

You can simply divide one likelihood by the other, canceling out the unknown scaling factor. The likelihood ratio is used to compare ‘nested’ models, i.e. when one is a restricted case of the other.

$$ LR(\theta_R, \theta_G|x) = -2 \log \frac{L(\theta_R|x)}{L(\theta_G|x)} $$

This likelihood ratio statistic is distributed according to the \\(\chi^2\\) distribution under the null hypothesis that the more restricted model is the “correct” one.
The degrees of freedom of the null distribution equals the number of parameter restrictions imposed as we go from the general to the restricted model. Therefore we can calculate the probability of observing a likelihood ratio statistic at least as great as the one we observe for our dataset and proposed models.

Statistical packages often report a _model_ \\(\chi^2\\). This compares the fitted model to one that includes _only_ the intercept. That is, is the model better than just using the mean?
---
.pull-left[
### Score tests

The score test relies on the score function to consider the slope of the log-likelihood when we constrain a particular parameter to some hypothesised value. If the constraint is appropriate, then the slope of the log-likelihood near the constraint should be about 0 since, at the maximum of the likelihood function, the derivative with respect to the parameters equals 0.
]
.pull-right[
![](https://stats.idre.ucla.edu/wp-content/uploads/2016/02/nested_tests.gif)
<!-- ![](https://zhanxw.com/blog/wp-content/uploads/2014/12/FAQ__How_are_the_likelihood_ratio__Wald__and_Lagrange_multiplier__score__tests_different_and_or_similar_.png) -->
]

### Wald tests

The Wald test is simply the squared difference of the estimated parameters from some constrained value (typically 0), weighted by the curvature of the log-likelihood. The Wald test is a generalised version of the standard _t_-test.
---
### Information criteria

The maximised likelihood, _L_, also serves as the basis for the AIC and BIC, where _k_ is the number of estimated parameters in the fitted model, and _n_ is the number of observations:

.pull-left-1[
$$AIC = -2\log L + 2k $$

$$AICc = AIC + \frac{2k^2 + 2k}{n - k - 1} $$

$$BIC = -2 \log L + k \log n $$
]
.pull-left-2[
As relative measures of model parsimony, these are only meaningful in comparison to other model specifications _on the exact same data_. A lower score indicates a more parsimonious fit than a higher one. BIC imposes a higher penalty than AIC so long as \\(n ≥ 8\\).
AICc corrects for small sample sizes.
]

--

What’s the difference between the LR, score, and Wald tests and ICs?

- ICs can compare non-nested models (e.g. linear to polynomial)
- ICs penalise models for greater complexity

???

A few pieces of advice:

For large sample sizes, model selection conducted with the AIC will choose the same model as leave-one-out cross-validation (where we leave out one data point, fit the model, and then evaluate its fit to that point).

Don't compare too many models with the AIC. You will run into the same problems with multiple model comparison as you would with p-values, in that you might by chance find a model with the lowest AIC that isn’t truly the most appropriate model.

When using the AIC you might also end up with multiple models that perform similarly to each other, leaving you with similar evidence weights for different alternative hypotheses.

---

## What if L is wrong?

Note that information about \\(\theta\\) comes from both the observed data _and_ our assumptions about the DGP as embodied in the probability model \\(f(x|\theta)\\).

Many of the MLE’s desirable properties are predicated on the untestable assumption that our probability model correctly captures some objective, “true” DGP. If the model is wrong, then parameter estimates can be (but are not necessarily) biased and inconsistent.

What this comes down to is that we need to choose the best probability model we can, given what we know about the DGP for a particular application. We can use the AIC and BIC to compare different model specifications, including specifications that use a different probability model.

--

Under certain circumstances we may also want to correct for some misspecification problems. For example, if our estimates are still consistent with respect to `\(\theta\)`, then robust standard errors can help.

---

### Model identification issues

It is not always possible to find one set of parameter values that uniquely optimises the log-likelihood.

1. There may be too many parameters (the model is ‘underidentified’).
1. Local minima/maxima can happen!

![](https://www.researchgate.net/profile/Eugene_Fink2/publication/220080335/figure/fig1/AS:341297194913803@1458383000853/Illustration-of-the-local-maxima-surrounding-a-minimum-a-i-These-maxima-are-used-in.png)

--

If this happens, estimates may not converge at all. If convergence is difficult, try the following:

- Make sure that the model is properly- (or, at least, well-) specified.
- Delete all missing data explicitly.
- Rescale the variables so that they’re all more-or-less on the same scale.
- Perhaps try another optimisation algorithm.

---

## OLS vs MLE

So is MLE better than OLS?

--

Not really. Indeed, they are equivalent in certain situations: with normally distributed errors, the maximum likelihood estimates of the coefficients coincide with the OLS estimates.

--

Ward and Ahlquist say that OLS is just marketed as the 'ordinary' model, but it is really a special case of MLE.

--

.pull-left-2[But there are some important conceptual advantages to MLE:

- more than just a collection of statistical models or even an estimation procedure
- a unified way of thinking about model construction, estimation, and evaluation (following Ronald Fisher)
- enables us to custom-design our own statistical estimators for our own research questions]
.pull-right-1[
![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2Fa%2Faa%2FYoungronaldfisher2.JPG%2F220px-Youngronaldfisher2.JPG&f=1&nofb=1)
]

---

## Other estimation techniques

MLE is just one family of estimation techniques. Others include the .red[generalised method of moments] and .red[Bayesian] approaches.

--

Actually, MLE and Bayesian estimation are quite closely related. Bayesian thinking builds on Rev. Thomas Bayes' Rule, which provides a principled basis for combining prior knowledge and beliefs with newly acquired data to generate an updated set of beliefs.
$$ \Pr(\theta|x) = \frac{\Pr(x|\theta) \Pr(\theta)}{\Pr(x)} \propto \Pr(x|\theta) \Pr(\theta) \propto L(\theta)\Pr(\theta) $$

From Bayes' Rule, we see that the posterior \\( \Pr(\theta|x) \\) is the prior \\( \Pr(\theta) \\) times the likelihood \\( \Pr(x|\theta) \\), scaled by the marginal probability of the data, \\(\Pr(x) \\).

- As the volume of observed data increases, the information in the data dominates our prior beliefs and the posterior comes to resemble the likelihood.
- If we assume that the prior, \\(\Pr(\theta)\\), follows a uniform distribution with support that includes \\(\hat\theta\\), then the mode of the posterior will be the MLE.

While Bayesian estimation is more involved than MLE (and thus beyond this course) and the final interpretation of the parameter estimates differs, in many situations ML and Bayesian estimation yield similar results.

???

Maximum likelihood is based on the idea that the observed data are more likely to have come about as the result of a particular set of parameters. That is, rather than considering the data as random and the parameters as fixed, MLE treats the observed data as fixed and asks: what parameter values are most likely to have generated the data? It is thus quite similar to a Bayesian framework.

---

## Generalized Linear Models

So what is MLE good for then?

--

MLE is most often used as a flexible estimation technique for .red[Generalized Linear Models] (GLMs).

Whereas .red[General Linear Models] (I know, I know) require that the errors follow the normal distribution (also known as the bell curve or Gaussian distribution), .red[General_ized_ Linear Models] allow the specification of models whose errors follow different distributions.

GLMs are one of the most useful modern statistical tools, because they can be applied to many different types of data: e.g. binary, ordinal, categorical, count, and waiting-time data.
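For instance, count data can be modelled in R with `glm()` and a Poisson family. A minimal sketch on simulated data (the numbers are illustrative only):

```r
# Simulated counts whose log-mean is linear in x
set.seed(123)
x <- runif(200)
y <- rpois(200, lambda = exp(0.5 + 2 * x))

# Poisson GLM fit by maximum likelihood (iteratively reweighted least squares)
mod <- glm(y ~ x, family = poisson(link = "log"))
coef(mod)  # estimates should land near the true values 0.5 and 2
```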
GLMs use a link function to specify the relationship between the linear predictor and the expected value of the response (e.g. identity, inverse, logit, log). For example, for Poisson regression the canonical link function is the log. Note that the link function is not the same thing as the conditional distribution of the response.

In R, we often use the `glm()` function for GLMs. There we specify the conditional distribution, e.g. `binomial`, as well as sometimes the link function (there are canonical defaults), e.g. `family = binomial(link = "logit")`.

```r
?family # for more
```

???

For _general_ linear models, the link function is just the identity function, `\(f(z)=z\)`. See [here](https://www.r-bloggers.com/generalized-linear-models-understanding-the-link-function/).

---

class: center, middle

# Summary

.pull-1[.circleon[![](https://is4-ssl.mzstatic.com/image/thumb/Purple125/v4/5f/8f/55/5f8f55e1-6ca3-3911-b3f1-5aad674a5095/source/512x512bb.jpg)]]
.pull-1[.circleon[![](https://activecalculus.org/single/images/1_6_Intro.svg)]]
.pull-1[.circleon[![](https://www.topratedforexbrokers.com/wp-content/uploads/comparison-list.svg)]]

Try to digest all this a bit before next week...

---

background-image: url(https://thumbs.gfycat.com/BlackPoisedGiantschnauzer-max-1mb.gif)
background-size: contain

<!-- url(https://media.giphy.com/media/xUNd9OonKkNhU1mF9e/giphy.gif) -->