class: center, middle, inverse, title-slide # Statistics for International Relations Research II ## Modelling ###
James Hollway
--- class: center, middle .pull-1[.circleon[![](https://graduateinstitute.ch/sites/default/files/styles/medium/public/2019-01/James%20Hollway.jpg?itok=1Yw0keum)]] .pull-1[.circleon[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fmedia.istockphoto.com%2Fvectors%2Fgrade-a-plus-result-vector-icon-school-red-mark-handwriting-a-plus-in-vector-id1136966571%3Fk%3D6%26m%3D1136966571%26s%3D612x612%26w%3D0%26h%3DS3pDI_xutxq1nLoWAW_D3h5j9wfwkVhWe7LSoVJRA00%3D&f=1&nofb=1)]] .pull-1[.circleon[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2Fc%2Fcf%2FRewind_button.svg%2F240px-Rewind_button.svg.png&f=1&nofb=1)]] --- class: center, middle .pull-1[.circleon[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.encyclopediaofmath.org%2Flegacyimages%2Fcommon_img%2Fe036910a.gif&f=1&nofb=1)]] .pull-1[.circleon[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fcdn4.iconfinder.com%2Fdata%2Ficons%2Fevil-icons-user-interface%2F64%2Fdownload-512.png&f=1&nofb=1)]] .pull-1[.circleon[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2Fd%2Fdf%2FTwo_Parallel_lines.svg%2F200px-Two_Parallel_lines.svg.png&f=1&nofb=1)]] --- background-image: url(https://upload.wikimedia.org/wikipedia/commons/d/d1/Swiss_children.jpg) background-size: 105% 105% --- .pull-left[ ```r library(table1) data("swiss", package = 'Zelig') table1::table1(~. , swiss) ```
| Overall (N=47) | Mean (SD) | Median [Min, Max] |
|:---------------|:---------:|:-----------------:|
| Fertility | 70.1 (12.5) | 70.4 [35.0, 92.5] |
| Agriculture | 50.7 (22.7) | 54.1 [1.20, 89.7] |
| Examination | 16.5 (7.98) | 16.0 [3.00, 37.0] |
| Education | 11.0 (9.62) | 8.00 [1.00, 53.0] |
| Catholic | 41.1 (41.7) | 15.1 [2.15, 100] |
| Infant.Mortality | 19.9 (2.91) | 20.0 [10.8, 26.6] |
] .pull-right[ In 1888, Switzerland was entering a period known as the demographic transition; i.e., its fertility was beginning to fall from the high level typical of underdeveloped countries. The data collected are for 47 French-speaking “provinces” circa 1888. - **Fertility**: Ig, ‘common standardized fertility measure’ - **Agriculture**: % of males involved in agriculture as occupation - **Examination**: % draftees receiving highest mark on army examination - **Education**: % education beyond primary school for draftees. - **Catholic**: % ‘catholic’ (as opposed to ‘protestant’). - **Infant.Mortality**: live births who live less than 1 year. All variables but ‘Fertility’ give proportions of the population. ] ??? There really are _heaps_ of R packages for creating data summary tables. Swetha shared [this link](https://thatdatatho.com/easily-create-descriptive-summary-statistic-tables-r-studio/) which reviews `{amisc}`, `{arsenal}`, `{compareGroups}`, `{furniture}`, `{htmltable}`, `{qwraps2}`, `{table1}`, `{tableone}`, and `{tangram}`, but there are (even) more out there now since 2018, for example [`{gtsummary}`](https://github.com/ddsjoberg/gtsummary). The `{bookdown}` package has a pretty good [overview and current suggestions](https://bookdown.org/yihui/rmarkdown-cookbook/table-other.html). --- ## Assumption breakdown structure .large[ 1. Defining 1. Diagnosing 1. So what? 1. 
Dealing
]

---
class: center, middle

# Nonlinearity

.pull-1[.circleon[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.encyclopediaofmath.org%2Flegacyimages%2Fcommon_img%2Fe036910a.gif&f=1&nofb=1)]]
.pull-1[.circleoff[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fcdn4.iconfinder.com%2Fdata%2Ficons%2Fevil-icons-user-interface%2F64%2Fdownload-512.png&f=1&nofb=1)]]
.pull-1[.circleoff[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2Fd%2Fdf%2FTwo_Parallel_lines.svg%2F200px-Two_Parallel_lines.svg.png&f=1&nofb=1)]]

---
## Defining nonlinearity

Recall that ordinary least squares (OLS) regression is a *linear* regression model.

`$$Y =\alpha + \beta _{1}X_{1} + \beta _{2}X_{2} + \cdots + \beta _{k}X_{k} + \varepsilon$$`

In statistics, the functional form (defining characteristic) of a regression model is *linear* when:

- all terms in the model are either the constant or an independent variable multiplied by a parameter
- the coefficients \\(\beta\\)s and the error term \\(\varepsilon\\) are only added (or subtracted) together

This makes \\(Y\\) a linear function of \\(X\\) and \\(\varepsilon\\), or “linear in its parameters”, thus allowing relatively straightforward estimation of the parameters.

> The regression model is linear in parameters

Note that we are _not_ saying that the dependent variable must be linear. It is the relationship between the dependent variable on the one hand and the independent variables and error term on the other that must be linear.

Obviously, linear regression is not appropriate when the relationship between variables is *nonlinear*...
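To make "linear in parameters" concrete, here is a minimal sketch with simulated data (the object names are illustrative): even though `y` depends on the *square* of `x`, `lm()` can estimate the model by ordinary least squares, because the model is linear in \\(\alpha\\) and the \\(\beta\\)s.

```r
# Sketch: a curvilinear relationship that is still linear in its parameters
set.seed(42)
x <- runif(200, 0, 10)
y <- 2 + 3 * x - 0.5 * x^2 + rnorm(200)  # 'true' parameters: 2, 3, -0.5
fit <- lm(y ~ x + I(x^2))                # still an ordinary OLS problem
coef(fit)                                # estimates close to 2, 3, -0.5
```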
<!-- Related: The Difference Between Linear and Nonlinear Regression and How to Specify a Regression Model -->

---
## Diagnosing non-linearity

.pull-left[
To detect nonlinearity, I'm going to create a quick function that will show us what we're missing:

```r
library(ggplot2); library(gridExtra)
plotpreds <- function(model, data){
  df_plt <- data.frame("fitted" = fitted(model),
                       "observed" = data$Fertility)
  p1 <- ggplot(df_plt, aes(x=fitted, y=observed)) + geom_point() +
    stat_smooth(method="loess") +
    geom_abline(intercept = 0, slope = 1, col="red") +
    xlab("Predicted") + ylab("Observed") +
    ggtitle("Observed vs. Predicted Values")
  p2 <- ggplot(model, aes(.fitted, .resid)) + geom_point() +
    stat_smooth(method="loess") +
    geom_hline(yintercept=0, col="red") +
    xlab("Predicted") + ylab("Residuals") +
    ggtitle("Residuals vs. Predicted Values")
  grid.arrange(p1, p2, ncol=2)
}
```
]
.pull-right[

```r
cath <- lm(Fertility ~ Catholic, swiss)
plotpreds(cath, swiss)
```

<img src="STAT_L2_Modelling_files/figure-html/nonlinearity-1.png" width="504" />

Points should be symmetrically distributed around the red line in both cases (with constant variance: next week). A 'bowed' pattern indicates the model makes systematic errors, over- or underpredicting in some parts.
]

---
## So what if there's nonlinearity?

If there is a pattern to observations (i.e., not just due to inherent variation), but the pattern is *not linear* (i.e. _nonlinear_ or _curvilinear_), then using linear regression can lead to:

- underestimating the 'true' association (i.e. biased estimates of the parameters)
- serious prediction errors, especially .red[out-of-sample] (data not used in fitting/training the model)

The model expects Catholicism to increase fertility, but while that might be generally true, it seems that there are some communes that are not very Catholic but nonetheless moderately fertile: Grandson, Paysd'enhaut, and Oron.
This means that the slope of the line differs across values of Catholicism: at low levels of Catholicism, an increase is associated with slightly *lower* fertility, while at higher levels an increase is associated with *higher* fertility. This suggests that the relationship is _nonlinear_.

---
## Dealing with non-linearity

Fortunately, it is not that difficult to deal with non-linear relationships, even within a linear model family:

1. Squared terms - a transformation that includes the square (or square root) of an independent variable as an additional variable
1. Logged terms - another transformation, usually of an independent variable
1. Adding features that were not considered before - e.g. interactions
1. Nonlinear regression - lets us fit a squiggly line to the data...

---
### 1. Squared terms

.pull-left[
The most commonly used way to take account of non-linearity is .red[polynomial regression], such as a quadratic model:

`$$Y =\alpha + \beta _{1}X_{1} + \beta _{2}X_{1}^{2} + \varepsilon$$`

Note that while \\(X\\) is raised to the power of 2, the model is still _linear in its parameters_.

You can probably increase the fit by adding further polynomials, e.g. `\(x^3\)`, but the trade-offs to such complexity are rarely worth it in political science...

```r
cath2 <- lm(Fertility ~ Catholic + I(Catholic^2), swiss)
```
]
.pull-right[
<img src="STAT_L2_Modelling_files/figure-html/squared-1.png" width="504" />
]

---
#### Interpreting squared terms

.pull-left[
A statistically significant \\(X^2\\) term is evidence that the relationship is indeed non-linear.
<table style="border-collapse:collapse; border:none;">
<tr>
<th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; text-align:left; "> </th>
<th colspan="3" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">Fertility</th>
</tr>
<tr>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; text-align:left; ">Predictors</td>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Estimates</td>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">CI</td>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">p</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">(Intercept)</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">70.96</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">65.40 – 76.52</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong>&lt;0.001</strong></td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Catholic</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.67</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-1.13 – -0.20</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong>0.006</strong></td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Catholic^2</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.01</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.00 – 0.01</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong>0.001</strong></td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm; border-top:1px solid;">R<sup>2</sup> / R<sup>2</sup> adjusted</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="3">0.389 / 0.362</td>
</tr>
</table>
]
.pull-right[
When interpreting model results, _always_ remember that terms that rely on the same underlying variable _must_ be interpreted together.

- \\(\beta_2\\) shows how the marginal effect of a one-unit change in `\(x\)` varies with the level of `\(x\)` itself.
- Depending on the value of \\(\beta_2\\), the function will be convex (if \\(\beta_2 > 0\\)) or concave (if \\(\beta_2 < 0\\)).
- What if both parameters related to `\(x\)` have the same direction? Positive or negative?

Since \\(\beta_1 < 0 \\) and \\(\beta_2 > 0 \\), the marginal effect of `\(x\)` on `\(y\)` starts negative and then goes up, becoming positive for large values of `\(x\)`.
]

---
#### Examples of quadratic interpretations

.panelset[
.panel[.panel-name[R Code]

```r
sq_intrp <- tibble(x = 0:10, x2 = (0:10)^2) %>%
  mutate(y_1 = 0.2 + -0.2*x + 2*x2,
         y_2 = 200 + -20*x + 1*x2,
         y_3 = 80 + -20*x + 2*x2,
         y_4 = 100 + .2*x + -1*x2,
         y_5 = -20 + 20*x + -1*x2,
         y_6 = -20 + 20*x + -2*x2)
library(gridExtra)
g1 <- ggplot(sq_intrp, aes(x = x, y = y_1)) + geom_line()
g2 <- ggplot(sq_intrp, aes(x = x, y = y_2)) + geom_line()
g3 <- ggplot(sq_intrp, aes(x = x, y = y_3)) + geom_line()
g4 <- ggplot(sq_intrp, aes(x = x, y = y_4)) + geom_line()
g5 <- ggplot(sq_intrp, aes(x = x, y = y_5)) + geom_line()
g6 <- ggplot(sq_intrp, aes(x = x, y = y_6)) + geom_line()
grid.arrange(g1,g2,g3,g4,g5,g6, ncol = 3)
```
]
.panel[.panel-name[Plot]
![](STAT_L2_Modelling_files/figure-html/sqintrp-1.png)
]
]

---
### 2.
Logged terms

.pull-left[
Another option is to take the log of an independent variable:

`$$Y =\alpha + \beta _{1}\log(X_{1}) + \varepsilon$$`

On a log scale, equal *ratios* become equal distances, e.g. the step from 2 to 4 is as large as the step from 4 to 8

- this bunches up high values relative to low values
- can convert a nonlinear `\(X, Y\)` relationship into a close-to-linear `\(\log(X), Y\)` relationship
- especially useful when the relationship looks like an exponential function

Again, note that the model is still _linear in its parameters_.

```r
cathl <- lm(Fertility ~ log(Catholic), swiss)
```
]
.pull-right[
<img src="STAT_L2_Modelling_files/figure-html/logged-1.png" width="504" />

In this case, it just seems to have shifted the 'bow'.
]

---
#### Interpreting logged terms

.pull-left[
The effect is statistically significant, but certainly not as well-fitting as the quadratic model (we'll come back to fit later).

<table style="border-collapse:collapse; border:none;">
<tr>
<th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; text-align:left; "> </th>
<th colspan="3" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">Fertility</th>
</tr>
<tr>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; text-align:left; ">Predictors</td>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Estimates</td>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">CI</td>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">p</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">(Intercept)</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">61.15</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">53.11 – 69.19</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top;
text-align:center; "><strong>&lt;0.001</strong></td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Catholic [log]</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">3.08</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.60 – 5.56</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong>0.016</strong></td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm; border-top:1px solid;">R<sup>2</sup> / R<sup>2</sup> adjusted</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="3">0.122 / 0.102</td>
</tr>
</table>
]
.pull-right[
Careful with interpretation though `\((\log(X) \neq X)\)`:

- if \\(Y = \alpha + \beta_1 \log(X)\\), a 1% increase in `\(X\)` is associated with a change of about \\(\beta_1/100\\) in `\(Y\)`
- if \\(\log(Y) = \alpha + \beta_1 \log(X)\\), a 1% increase in `\(X\)` is associated with about a \\(\beta_1\\)% change in `\(Y\)`

Be _very careful extrapolating_ from these models, as they can easily go beyond bounds of plausibility...

Note that `\(\log(0)\)` and e.g. `\(\log(-1)\)` are undefined.

- if just a few 0s, add 1 to all scores `\((\log(1) = 0)\)`
- if negative, maybe rescale the variable but be _careful with interpretation_
]

---
.pull-left[
### 3. Interaction terms

Interaction terms _can_ also help with nonlinearity:

`$$Y =\alpha + \beta _{1}X_{1} + \beta _{2}X_{2} + \beta _{3}X_{1}X_{2} + \varepsilon$$`

They express the effect of a _combination_ of variables, e.g. binary and continuous. Though they involve multiplication, the _terms_ are still additive.

Note though that this only helps if the main effects of both variables are linear. \\(>1\\) interaction is possible, but difficult to interpret. Mismodelling a main effect may introduce spurious interactions...
]
.pull-right[

```r
cathi <- lm(Fertility ~ Catholic * Education, data = swiss)
plotpreds(cathi, swiss)
```

<img src="STAT_L2_Modelling_files/figure-html/interaction-1.png" width="504" />

Remember, always interpret effects with the same underlying variables together:

- the marginal effect of \\(x_1\\) is \\(\beta_1 + \beta_3 x_2\\)
- the marginal effect of \\(x_2\\) is \\(\beta_2 + \beta_3 x_1\\)
- if \\(\beta_1\\) is significant, it gives the \\(x_1\\) effect on `\(Y\)` when \\(x_2 = 0\\)
- if \\(\beta_1\\) is not significant, there is no evidence of an \\(x_1\\) effect on `\(Y\)` when \\(x_2 = 0\\)
]

???
For nonlinear relationships, consider multivariable fractional polynomial interactions (MFPI)

---
.pull-left-1[
#### Interpreting interaction terms

<table style="border-collapse:collapse; border:none;">
<tr>
<th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; text-align:left; "> </th>
<th colspan="3" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">Fertility</th>
</tr>
<tr>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; text-align:left; ">Predictors</td>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Estimates</td>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">CI</td>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">p</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">(Intercept)</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">70.94</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">64.67 – 77.20</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong>&lt;0.001</strong></td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left;
">Catholic</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.18</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.07 – 0.29</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong>0.002</strong></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Education</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.43</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.95 – 0.10</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.108</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Catholic * Education</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.01</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.02 – 0.00</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.119</td> </tr> </table> No statistically significant interaction, so could ignore/drop interaction or test whether linear combination of effects together are significant... ] .pull-right-2[ .panelset[ .panel[.panel-name[xCatholic] ```r library(multcomp) summary(glht(cathi, linfct = c("Catholic + Catholic:Education = 0"))) ``` ``` ## ## Simultaneous Tests for General Linear Hypotheses ## ## Fit: lm(formula = Fertility ~ Catholic * Education, data = swiss) ## ## Linear Hypotheses: ## Estimate Std. Error t value Pr(>|t|) ## Catholic + Catholic:Education == 0 0.17462 0.04966 3.516 0.00105 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 ## (Adjusted p values reported -- single-step method) ``` ] .panel[.panel-name[xEducation] ```r summary(glht(cathi, linfct = c("Education + Catholic:Education = 0"))) ``` ``` ## ## Simultaneous Tests for General Linear Hypotheses ## ## Fit: lm(formula = Fertility ~ Catholic * Education, data = swiss) ## ## Linear Hypotheses: ## Estimate Std. Error t value Pr(>|t|) ## Education + Catholic:Education == 0 -0.437 0.255 -1.714 0.0938 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## (Adjusted p values reported -- single-step method) ``` ] ] ] ??? ...if we were to interpret, would mean that with every extra unit of education, expect to see less fertility by Catholicism. Important to test whether the marginal effect `\(x_1\)` on `\(y\)` at different values of `\(x_2\)` is significant. If `\(b_1 > 0\)` (i.e. positive), `\(b_{int} < 0\)` (i.e. negative), and both are `\(p<0.05\)`, then it might be that once `\(x_2 > 0\)`, `\(b_1\)` becomes insignificant. Solution: Set `\(x_2\)` to its mean and test if `\(b_1 + b_{int}(x_2) ≠ 0\)`. --- #### DIY .pull-left[ ```r plot_model(cathi, type = "int") # plots minmax by default ``` <img src="STAT_L2_Modelling_files/figure-html/intplot-1.png" width="504" /> ] .pull-right[ This model shows how Catholicism affects provinces with low education differently than those that are highly educated. For those provinces with .red[low education], increasing Catholicism (vs Protestantism) leads to increased fertility. But for those provinces with .blue[high education], increasing Catholicism (vs Protestantism) leads to _decreased_ fertility. (But why are the confidence intervals for the blue so wide?) 
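The crossing slopes in the plot can also be computed by hand. A sketch, assuming base R's built-in `swiss` data (the same dataset as the Zelig copy); `me_catholic()` is an illustrative helper, not a package function:

```r
cathi <- lm(Fertility ~ Catholic * Education, data = swiss)
b <- coef(cathi)
# marginal effect of Catholic at a given Education level: b1 + b3 * Education
me_catholic <- function(edu) unname(b["Catholic"] + b["Catholic:Education"] * edu)
me_catholic(min(swiss$Education))  # positive: low-education provinces
me_catholic(max(swiss$Education))  # negative: high-education provinces
```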
] --- class: center .panelset[ .panel[.panel-name[R Code] ```r new_d <- expand.grid(Catholic=c(min(swiss$Catholic), mean(swiss$Catholic), max(swiss$Catholic)), Education=c(min(swiss$Education), mean(swiss$Education), max(swiss$Education))) new_d$Fertility <- predict(cathi,new_d) ggplot(swiss, aes(x=Catholic,y=Fertility,color=Education))+ geom_point()+ geom_line(data=new_d,aes(group=Education)) + scale_color_continuous(low="green",high="blue") + geom_text(aes(x=50,y=85,label="Slope: (0.18 - 0.01 * 1) * Catholic"), color="green",angle=8,family="Arial") + geom_text(aes(x=50,y=65,label="Slope: (0.18 - 0.01 * 11) * Catholic"), color="brown",angle=3,family="Arial") + geom_text(aes(x=50,y=25,label="Slope: (0.18 - 0.01 * 53) * Catholic"), color="blue",angle=-11,family="Arial") ``` ] .panel[.panel-name[Plot] ![](STAT_L2_Modelling_files/figure-html/interplot-1.png) ] ] --- ## Nonlinearity summary - Nonlinearity is a serious but relatively easily remediable issue for most linear models. - Most common solution is to include a squared term as this is flexible, does not usually require any further transformations, and is usually pretty interpretable. - Where the curvilinear relationship is particularly pronounced and behaves in an exponential-like manner, include a log of the `\(X\)` variable, though be careful of zeros/negative values and with interpretation. - Interactions can help, and can lead to interesting insights, but only if main effects are linear, and need to watch out for potentially spurious 'discoveries'. - There are also more sophisticated methods for dealing with nonlinearity, some of which are beyond this course... 
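Since the squared term is the usual first resort, the three Catholic-only specifications above can be compared directly on fit. A sketch, assuming base R's built-in `swiss` data (the same dataset as the Zelig copy):

```r
cath  <- lm(Fertility ~ Catholic, swiss)                  # linear
cath2 <- lm(Fertility ~ Catholic + I(Catholic^2), swiss)  # quadratic
cathl <- lm(Fertility ~ log(Catholic), swiss)             # logged
# adjusted R^2 (higher is better) and AIC (lower is better) for each model
sapply(list(linear = cath, quadratic = cath2, logged = cathl),
       function(m) c(adj.R2 = summary(m)$adj.r.squared, AIC = AIC(m)))
```

Here the quadratic model comes out ahead (adjusted \\(R^2 \approx 0.36\\) against \\(\approx 0.10\\) for the logged model), consistent with the tables earlier.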
--- class: center, middle # Endogeneity .pull-1[.circleoff[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.encyclopediaofmath.org%2Flegacyimages%2Fcommon_img%2Fe036910a.gif&f=1&nofb=1)]] .pull-1[.circleon[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fcdn4.iconfinder.com%2Fdata%2Ficons%2Fevil-icons-user-interface%2F64%2Fdownload-512.png&f=1&nofb=1)]] .pull-1[.circleoff[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2Fd%2Fdf%2FTwo_Parallel_lines.svg%2F200px-Two_Parallel_lines.svg.png&f=1&nofb=1)]] --- ## Defining endogeneity I'm going to talk about two sets of assumptions here: > All independent variables are uncorrelated with the error term - remember that the error term is supposed to represent unpredictable random error - if an independent variable is correlated with the error term, then we could use that to predict errors - there are three types of this kind of endogeneity... -- > Observations of the error term are uncorrelated with each other - if one observation is correlated with another observation, say an observation is more positive than expected and this systematically increases the probability that the following observation is also positive - this problem known generally as .red[autocorrelation] --- ## When is the exogeneity assumption violated? When features and residuals are correlated somehow... 1. **measurement error** -> *attenuation bias* - `\(X\)` causes `\(Y\)` but we cannot observe `\(X\)`, and `\(Z\)` (which we observe) is influenced by `\(X\)` but also by `\(Y\)` - does higher income both decrease child mortality and fertility rates? 1. **instantaneous/reverse causation** -> *simultaneity bias* - `\(X\)` influences `\(Y\)`, but `\(Y\)` reinforces `\(X\)` too - does child mortality lead to higher fertility rates, vice versa, or both? 1. 
**omitted variables** -> *selection bias*
  - `\(Z\)` causes both `\(X\)` and `\(Y\)` fairly contemporaneously
  - does education both decrease child mortality and lower fertility rates?

Actually, selection/omitted variable bias is *the* problem in observational research that undermines causal inference. Measurement error and instantaneous causation can both be posed as problems of omitted variables.

---
## The OVB Triangle

.pull-left[
Omitted variable bias (OVB) is one of the most common and vexing problems in ordinary least squares regression.

<!-- https://www.albany.edu/faculty/kretheme/PAD705/SupportMat/OVB.pdf -->

OVB occurs when a variable that is correlated with both the dependent and one or more included independent variables is omitted from a regression equation. An “OVB Triangle” exists when there is a third variable that is:

- not currently in the model, and is
- correlated with _both_ the independent and dependent variable

Variables are usually omitted because:

- the data are simply not available
- the effect of the explanatory variable on the response variable is unknown
]
.pull-right[
]

---
## Diagnosing omitted variable bias

.pull-left[
### 1. Plotting residuals

One place to look for evidence of OVB is the residuals. If there seems to be a pattern of residuals becoming smaller or larger, this would be prima facie evidence of OVB.

From this plot... _maybe_.
]
.pull-right[

```r
plotpreds(cath2, swiss)
```

<img src="STAT_L2_Modelling_files/figure-html/ovbresid-1.png" width="504" />
]

---
### 2. Ramsey RESET test

The RESET test is a popular diagnostic for the correctness of the specification of a linear model. Basically, it tests whether an alternative model that adds powers of the fitted response or of the regressor variables, e.g. `\((x_1, x_1^2, x_1^3)\)`, fits better than the current fitted model. If the F-test (see below) shows a significant improvement, this suggests the model is misspecified in the sense that there might be some missing variable (e.g. an OV), or that a polynomial or other nonlinear functional form would be a better fit.

.panelset[
.panel[.panel-name[Fitted Response Variable]

```r
library(lmtest)
resettest(cath2, type = "fitted")
```

```
## 
## 	RESET test
## 
## data:  cath2
## RESET = 2.3524, df1 = 2, df2 = 42, p-value = 0.1076
```
]
.panel[.panel-name[Regressor Variables]

```r
resettest(cath2, type = "regressor")
```

```
## 
## 	RESET test
## 
## data:  cath2
## RESET = 4.2654, df1 = 4, df2 = 40, p-value = 0.00573
```
]]

---
### 3. Testing correlations between IVs and residuals

Lastly, we can check the Pearson correlation between each feature and the residuals, and report the `\(p\)`-value from a test of the null hypothesis that the correlation is zero.
```r
for (i in 1:(dim(swiss)[2])){
  cor_test <- cor.test(swiss[, i], cath2$residuals)
  print(paste(colnames(swiss)[i],
              '--- correlation:', as.character(round(cor_test$estimate,2)),
              ', p-value:', as.character(round(cor_test$p.value,3)),
              sep = " ", collapse = NULL))
}
```

```
## [1] "Fertility --- correlation: 0.78 , p-value: 0"
## [1] "Agriculture --- correlation: -0.06 , p-value: 0.68"
## [1] "Examination --- correlation: -0.29 , p-value: 0.047"
## [1] "Education --- correlation: -0.41 , p-value: 0.004"
## [1] "Catholic --- correlation: 0 , p-value: 1"
## [1] "Infant.Mortality --- correlation: 0.49 , p-value: 0.001"
```

Looks like there might be some further candidate features here... You can test each one like so:

```r
cor.test(swiss$Education, cath2$residuals)
```

???
We can reject the null hypothesis (that there is a lack of correlation) for several pairs, and cannot reject the null hypothesis for a couple of others. Which ones? Low p-values suggest we can reject the null hypothesis that the "true" correlation is 0. This provides some evidence of potential OVB. Additionally, we sometimes say that if adding in variables that could be correlated with both Catholicism and the residuals changes the coefficient for Catholicism by more than about 3-5%, then this is indicative of OVB. A reasonable supposition would be that we are still missing at least one key variable.

---
## So what if there's omitted variable bias?

Violating this assumption _biases the coefficient estimate_.

--

Remember that the error term always explains some of the variability in the dependent variable. When an independent variable correlates with the error term, OLS incorrectly attributes some of the variance that the error term actually explains to the independent variable instead. The amount of "extra" distance between `\(Y\)` and \\(\hat{Y}\\) created by leaving out the omitted variable will be correlated with any included variables the OV is correlated with.
In other words, if there are omitted variables, `\(E[\varepsilon_i|x_i] \neq 0\)`, which biases your estimates no matter how many observations you have (the estimator is .red[inconsistent]).

--

The bias runs in predictable directions _if you knew all the correlations_, including with features beyond the model, though you typically don't...

| | cor(X, Z) > 0 | cor(X, Z) < 0 |
|:-------------:|:-------------:|:-------------:|
| cor(Z, Y) > 0 | Bias + | Bias - |
| cor(Z, Y) < 0 | Bias - | Bias + |

This leaves us in a pretty uncertain position, because there could be a serious problem (bias) with our inference and we wouldn't necessarily know about it or in which direction it goes.

???
However, the error term is also defined as the difference between the actual observations (\\(Y_i\\)) – what we actually earn – and the predicted line found via regression (\\(\hat{Y}\\)) – what our regression says we should earn.

---
## Dealing with endogeneity

### 1. Instrumental variables (IV)

- An instrumental variable is strongly correlated with the explanatory variable that has problems with endogeneity (a _strong first stage_) and is not itself correlated with the error term (the _exclusion restriction_), conditional on all other covariates.
- E.g. Acemoglu, Johnson and Robinson (2001): good institutions => economic development, _but_ economic development => good institutions? Try an instrumental variable:
  - strong first stage: settler mortality rates (exogenous: disease environments) => colonization strategies => modern institutions => economic development
  - exclusion restriction: settler mortality rates ≠> economic development
  - e.g. high African mortality deterred all except those interested in imposing predatory institutions, whereas low North American mortality attracted droves of migrants that demanded institutions of equal rights. Institutions persisted in both cases, but only the latter resulted in long-run economic growth.
- Creativity is required to identify appropriate instrumental variables, and sometimes it is impossible...

---

.pull-left[

### 2. Difference-in-differences (DID)

- A difference-in-differences design is also quasi-experimental, but relies on the assumption that you can identify a (quasi-)treatment and control group and that you have two or more panel waves.
- Assumption that the development of the two groups from pre to post would have been parallel _but for_ the causal impact of the treatment.

### 3. Ignore?

- Not advisable, because OLS is then inconsistent
- But bounded rationality: we cannot identify all variables potentially correlated with the IVs and DV and, even if we could, many are difficult to measure
- Since the bias decreases as the covariance between the OV and IV decreases, we can live with a little bias...

]

.pull-right[

![](https://upload.wikimedia.org/wikipedia/commons/d/da/Illustration_of_Difference_in_Differences.png)

]

---

## Defining autocorrelation

Autocorrelation or serial correlation is related to problems of endogeneity, but instead of the errors being correlated with an independent variable, they are correlated with themselves.

Imagine our predictor of the weather is the season. We don't predict that it will be sunny in winter. _But_ if it is sunny on 2 March (so we have a large residual), it's likely to be sunny on 3 March too (and again we have a large residual). When one residual is high, the next one is also likely to be high.

Serial/autocorrelation is common in panel data (observations of the same entity over time, e.g. country-year designs, actor-month, etc.), but also in spatial and structural (i.e. network) data...

---

.pull-left[

### 1. Visualise with an ACF plot

The easiest way to visually diagnose autocorrelation is using an autocorrelation function (ACF) plot. The ACF plot shows us how correlated observations or residuals are with those at different lags (steps away).
- if not time-series data, then the row number is used
- `\(x\)`-axis corresponds to the lags of the residual, increasing in steps of 1
- leftmost line shows the correlation of the residual with itself (lag 0), therefore always = 1

If values are significantly correlated with lagged values, then there is autocorrelation.

- if not autocorrelated, lag 1 onwards will fall within the blue bounds (ideally randomly distributed)
- significant correlations lie beyond the blue bounds

]

.pull-right[

```r
library(forecast)
ggAcf(cath2$residuals)
```

<img src="STAT_L2_Modelling_files/figure-html/acf-1.png" width="504" />

<img src="STAT_L2_Modelling_files/figure-html/acf2-1.png" width="504" />

]

???

Do we observe autocorrelation here? Looks like very weak autocorrelation in the residuals of our cath2, but strong autocorrelation in the observations of air passengers (and maybe seasonality?).

---

.pull-left-1[

### 2. Durbin-Watson test

The Durbin-Watson (DW) test tests for residual autocorrelation at lag 1. The DW test statistic is approximately equal to \\(2(1-a)\\), where \\(a\\) is the lag 1 residual autocorrelation.

Some notes on the Durbin-Watson test:

- test statistic always has a value between 0 and 4
- \\(2\\) means no autocorrelation in the sample
- \\(< 2\\) indicates positive autocorrelation
- \\(> 2\\) indicates negative autocorrelation

]

.pull-right-2[

```r
lmtest::dwtest(cath)
```

```
## 
##  Durbin-Watson test
## 
## data:  cath
## DW = 0.90423, p-value = 1.156e-05
## alternative hypothesis: true autocorrelation is greater than 0
```

```r
lmtest::dwtest(cath2)
```

```
## 
##  Durbin-Watson test
## 
## data:  cath2
## DW = 0.92734, p-value = 1.155e-05
## alternative hypothesis: true autocorrelation is greater than 0
```

Looks like we can reject the null hypothesis of no autocorrelation, and it looks like there is indeed some positive autocorrelation at lag 1.

]

---

.pull-left-1[

### 3.
Runs test

Unlike the Durbin-Watson test, which only tests whether there is autocorrelation at lag 1, the Wald–Wolfowitz runs test assesses the distribution of the length of runs. The runs test is based on the null hypothesis that each element in the sequence is independently drawn from the same distribution.

]

.pull-right-2[

```r
lawstat::runs.test(cath$residuals)
```

```
## 
##  Runs Test - Two sided
## 
## data:  cath$residuals
## Standardized Runs Statistic = -1.9149, p-value = 0.0555
```

```r
lawstat::runs.test(cath2$residuals)
```

```
## 
##  Runs Test - Two sided
## 
## data:  cath2$residuals
## Standardized Runs Statistic = -4.2757, p-value = 1.906e-05
```

Looks like we cannot reject the hypothesis of random runs/sequences for our basic catholicism model, so we can reasonably assume its residuals are not autocorrelated, but there does seem to be a definite pattern to the residuals in the quadratic model.

]

---

## So what if there's autocorrelation?

Moderate autocorrelation reduces the precision of OLS estimates. Extreme autocorrelation can also signal a badly misspecified model:

- e.g. a violation of the linearity assumption (an underlying exponential growth process will present as autocorrelation)
- or bias due to omitted variables (interaction terms or dummy variables for identifiable conditions).

Possible in both time-series models (current value depends on previous (historic) values) and non-time-series models (current value depends on spatially or structurally adjacent values).
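The approximation \\(DW \approx 2(1-a)\\) from the Durbin-Watson slide can be checked by hand. A minimal sketch on simulated AR(1) residuals (hypothetical data, not the `cath` models; only base R functions are used):

```r
set.seed(1)
# simulate positively autocorrelated residuals (AR(1) with rho = 0.6)
e <- as.numeric(arima.sim(list(ar = 0.6), n = 500))

# Durbin-Watson statistic computed directly from its definition
dw <- sum(diff(e)^2) / sum(e^2)

# lag-1 autocorrelation of the series
a <- acf(e, plot = FALSE)$acf[2]

dw          # well below 2, flagging positive autocorrelation
2 * (1 - a) # approximately equal to the DW statistic
```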
---

## Dealing with autocorrelation

.pull-left[

If DW < 1, possibly model misspecification:

- add lags of variables using the `slide()` function in the `DataCombine` package
- stationarize time-series variables by differencing, logging, and/or deflating
- account for seasonal components using dummy variables or seasonally adjust variables
- add variables that may capture this information (like with endogeneity):
  - temporal effects
  - spatial effects
  - structural (network) effects

If DW > 3, some variables possibly overdifferenced:

- include a linear (trend) term for a consistent increasing/decreasing pattern in the residuals
- use Generalized Least Squares

]

.pull-right[

Mixed models:

- .red[Fixed effects] (FE) control for entity-specific time-invariant characteristics we cannot observe or quantify (e.g. culture, religion, gender, race)
  - good for studying changes within an entity over time.
- .red[Random effects] (RE) used when we suspect variation between entities is random and is an important determinant of our DV; can include entity-specific time-invariant characteristics
  - good for studying changes across entities over time.

Autocorrelation can be treated as a _nuisance_ or a _feature_. May need to be creative; more like art than algorithm.
] --- class: center, middle # Multicollinearity .pull-1[.circleoff[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.encyclopediaofmath.org%2Flegacyimages%2Fcommon_img%2Fe036910a.gif&f=1&nofb=1)]] .pull-1[.circleoff[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fcdn4.iconfinder.com%2Fdata%2Ficons%2Fevil-icons-user-interface%2F64%2Fdownload-512.png&f=1&nofb=1)]] .pull-1[.circleon[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2Fd%2Fdf%2FTwo_Parallel_lines.svg%2F200px-Two_Parallel_lines.svg.png&f=1&nofb=1)]] --- ## Defining multicollinearity Multicollinearity means that independent variables we are interested in are closely related _with each other_. This contravenes the assumption that: > No independent variable is a perfect linear function of other explanatory variables i.e. that all features are linearly _independent_, i.e. we cannot predict one feature from another. - e.g. `\(x_1 = 2 + 3 * x_2\)` violates the assumption Say we are interested in EU elections and ask respondents whether they are in favour of a common foreign policy and common defence policy (e.g. ordinal scale). - these are likely to be (positively) correlated - this makes it difficult to work out what happens when foreign policy goes up and defence stays the same Unfortunately, this is a common problem in social science because our concepts/variables _often_ overlap... --- ## Diagnosing multicollinearity ### 1. High \\(R^2\\) but no statistically significant variables .pull-left[ ```r lin_reg <- lm(Fertility ~ ., swiss) tab_model(lin_reg, show.obs = FALSE) ``` - but this is really just suggestive... 
- and doesn't seem to be a problem here ] .pull-right[ <table style="border-collapse:collapse; border:none;"> <tr> <th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; text-align:left; "> </th> <th colspan="3" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">Fertility</th> </tr> <tr> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; text-align:left; ">Predictors</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Estimates</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">CI</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">p</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">(Intercept)</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">66.92</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">45.29 – 88.54</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong><0.001</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Agriculture</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.17</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.31 – -0.03</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong>0.019</strong></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Examination</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.26</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.77 – 0.25</td> <td style=" 
padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.315</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Education</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.87</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-1.24 – -0.50</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong><0.001</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Catholic</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.10</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.03 – 0.18</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong>0.005</strong></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Infant.Mortality</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">1.08</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.31 – 1.85</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong>0.007</strong></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm; border-top:1px solid;">R<sup>2</sup> / R<sup>2</sup> adjusted</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="3">0.707 / 0.671</td> </tr> </table> ] --- .pull-left[ ### 2. High correlations between IVs Obviously if variables are highly correlated, then there is multicollinearity. A nice, visual way to check is using a _correlogram_. There are a bunch of correlogram packages available in R. 
Here's one I like for its flexibility:

Watch out for the "dummy variable trap":

- unless you exclude a reference category, there will be perfect correlation between the dummy variables and the constant

Note this only explores pairwise correlations among explanatory variables...

]

.pull-right[

.panelset[

.panel[.panel-name[R Code]

```r
library(ggcorrplot)
ggcorrplot(cor(swiss[, -1]), type = "lower",
           p.mat = cor_pmat(swiss[, -1]))
```

]

.panel[.panel-name[Plot]

![](STAT_L2_Modelling_files/figure-html/corrplot-1.png)

]

]

]

---

### 3. High variance inflation factors

.pull-left[

A VIF measures how much the other explanatory variables explain a variable's own variance, calculated as:

$$ \mathrm{VIF} = \frac{1}{1-R^2} $$

where \\(R^2\\) is for the model regressing `\(X\)` on all other predictors. It is the factor by which the variance of `\(X\)`'s coefficient is inflated compared to what it would be if `\(X\)` were uncorrelated with all other features.

Practically,

- lower VIFs are better
- if no features are correlated, then all VIFs = 1
- higher VIFs mean the information in those variables is already provided by other variables and is thus redundant
- highly correlated variables each have high VIFs

]

.pull-right[

```r
library(car)
t(t(vif(lin_reg)))
```

```
##                      [,1]
## Agriculture      2.284129
## Examination      3.675420
## Education        2.774943
## Catholic         1.937160
## Infant.Mortality 1.107542
```

We conventionally use a threshold of 4 for identifying candidates for exclusion

- i.e. that 75% of the variance in that IV is explained by the other IVs
- since 1/(1-0.75) = 1/0.25 = 4

Doesn't seem to be too much of a problem with multicollinearity here, fortunately...

]

---

```r
library(performance)
plot(check_collinearity(lin_reg))
```

<img src="STAT_L2_Modelling_files/figure-html/vvif-1.png" width="1008" />

---

## So what if there's multicollinearity?

.pull-left[

_Perfect correlation_ (+1 or -1) is a show-stopper...

- any statistical software will simply break because it cannot find any way to distinguish these variables.
- suggests variables are different forms of the same variable - e.g. games won and games lost (-1) - e.g. temperature in Fahrenheit and temperature in Celsius (+1) - you must remove one of the variables from the model or _you shall not pass!_ ] .pull-right[ ![](https://media.giphy.com/media/OdyGA2spBFRCg/giphy.gif) ] _Imperfect but strong correlations_ in OLS models can be fit, but: - while it won't reduce a model's overall predictive power, SEs will be high and it will be difficult to discern statistically significant results - while estimates still unbiased, less efficient/precise - estimates will be highly sensitive to particular sets of data - makes it more difficult to interpret regression coefficients since variables (kind of) move together ??? Multicollinearity even more of a problem under iterative estimation techniques because the little independent information makes it more difficult to find the most likely estimates --- ## Dealing with multicollinearity 1. Just exclude one of the highly correlated independent variables - say the one with the highest VIF - makes sense when the causal relationships are quite clear - may need to do this several times, be wary of OVB though! 1. Make a scale/factor out of highly correlated independent variables - use principal component analysis (PCA) to reduce features to a smaller set of uncorrelated components - makes sense when there is an underlying/latent variable that we haven’t measured - requires additional expertise and can make interpreting effects of individual IVs difficult 1. Get more data/variables - basically, multicollinearity just means not enough independent variation to separately identify marginal effects - if you think variables _ought_ to vary independently, then maybe more data might introduce more independent variation with which to identify effects - sometimes easier said than done... 1. 
Do nothing
  - report \\(R^2\\) and \\(F\\) statistics and explain the joint significance of the IVs in the model, which accounts for the lack of individual significance

---

## Model fit

.pull-left-1[

![](https://i.pinimg.com/originals/c3/42/4b/c3424b53196516a7e9a7436786aae4de.gif)

]

.pull-right-2[

- We've talked about when a model is the 'best', but when is it 'good'?
- For linear regression, we normally use a measure called \\(R^2\\)
  - measures the proportion of variation in `\(Y\)` that is explained by the fitted model
  - varies between 0 and 1, with 1 representing 'perfect' explanation
  - never decreases with each additional feature

]

- However, sometimes \\(R^2\\) is not the most useful measure of model 'performance'
  - High \\(R^2\\) values don't automatically make your model a "good" model
  - Many social science models have low \\(R^2\\) values
  - We still might be interested in whether there is a relationship between `\(X\)` and `\(Y\)` though
  - Or whether it makes sense to favour a slightly more complex model over a more parsimonious one

---

.pull-left[

### Overfitting

Overfitting is "analysis that corresponds too closely to a particular set of data and may therefore fail to fit additional data or predict future observations reliably".

*Curve-fitting* can lead to *false precision*.
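A quick way to see this is to fit an overly flexible model to one half of some simulated data and score it on the other half (a hypothetical sketch; the degree-12 polynomial is chosen purely for illustration):

```r
set.seed(7)
n <- 60
x <- runif(n, -2, 2)
y <- x + rnorm(n)            # the true relationship is linear
train <- sample(n, n / 2)    # hold out half the data

fit_lin  <- lm(y ~ x, subset = train)            # parsimonious model
fit_poly <- lm(y ~ poly(x, 12), subset = train)  # overly flexible model

# in-sample fit: the polynomial always wins (lower residual variance)
mean(residuals(fit_lin)^2)
mean(residuals(fit_poly)^2)

# out-of-sample fit: the polynomial's wiggles typically fail to generalise
mse <- function(fit) mean((y[-train] - predict(fit, data.frame(x = x[-train])))^2)
mse(fit_lin)
mse(fit_poly)
```

The in-sample comparison always flatters the bigger model; only the held-out comparison reveals the curve-fitting.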
To lessen the chance of, or amount of, overfitting, one can use: - adjusted model selection methods to avoid unhelpfully inflating model complexity - .red[regularization] to explicitly penalize overly complex models - .red[cross-validation] to test model's ability to generalize by evaluating its performance on a set of data not used for training the model, approximating the unseen data a model might encounter ] .pull-right[ ![](https://i.gifer.com/g3IK.gif) ![](https://upload.wikimedia.org/wikipedia/commons/6/68/Overfitted_Data.png) ] --- background-image: url(https://imgs.xkcd.com/comics/curve_fitting_2x.png) background-size: contain --- ### Adjusting R2 For model selection, we use adjusted \\(R^2\\), Akaike's information criterion (AIC), or Bayesian information criterion (BIC). <table style="border-collapse:collapse; border:none;"> <tr> <th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; text-align:left; "> </th> <th colspan="2" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">Fertility</th> <th colspan="2" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">Fertility</th> <th colspan="2" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">Fertility</th> <th colspan="2" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">Fertility</th> <th colspan="2" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">Fertility</th> </tr> <tr> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; text-align:left; ">Predictors</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Estimates</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">p</td> <td 
style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Estimates</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">p</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Estimates</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; col7">p</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; col8">Estimates</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; col9">p</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; 0">Estimates</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; 1">p</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">(Intercept)</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">64.43</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong><0.001</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">70.96</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong><0.001</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">70.94</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col7"><strong><0.001</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col8">48.68</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col9"><strong><0.001</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; 0">62.10</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; 
1"><strong><0.001</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Catholic</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.14</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong>0.001</strong></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.67</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong>0.006</strong></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.18</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col7"><strong>0.002</strong></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col8">0.10</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col9"><strong>0.001</strong></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; 0">0.12</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; 1"><strong><0.001</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Catholic^2</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.01</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong>0.001</strong></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col7"></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col8"></td> <td style=" padding:0.2cm; 
text-align:left; vertical-align:top; text-align:center; col9"></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; 0"></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; 1"></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Education</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.43</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col7">0.108</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col8">-0.76</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col9"><strong><0.001</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; 0">-0.98</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; 1"><strong><0.001</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Catholic * Education</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.01</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; 
col7">0.119</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col8"></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col9"></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; 0"></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; 1"></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Infant.Mortality</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col7"></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col8">1.30</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col9"><strong>0.002</strong></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; 0">1.08</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; 1"><strong>0.007</strong></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">Agriculture</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; 
text-align:left; vertical-align:top; text-align:center; "></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col7"></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col8"></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; col9"></td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; 0">-0.15</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; 1"><strong>0.029</strong></td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm; border-top:1px solid;">R<sup>2</sup> / R<sup>2</sup> adjusted</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="2">0.215 / 0.198</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="2">0.389 / 0.362</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="2">0.598 / 0.570</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="2">0.663 / 0.639</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="2">0.699 / 0.671</td> </tr> <tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">AIC</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="2">364.348</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; 
padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="2">354.535</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="2">336.882</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="2">328.668</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="2">325.241</td> </tr> </table>

???

Adjusted `\(R^2\)` is based on the `\(R^2\)` value, but introduces a penalty for additional terms. Higher is better.

AIC (and AICc) and BIC are similar, but based on the concept of the information criterion. Lower is better.

---

## Model selection

I have some data, how should I analyze it?

### Kitchen-sink

One way of analyzing data is to throw in *every* possible variable and see which are statistically significant

--

This is a nonsensical way to proceed, of course.

--

1. Just because two variables are correlated does not mean there is causation: e.g. does belief in fairies determine height?
2. Moreover, every 100 variables tested results in ~5 statistically significant (at \\(\alpha = 0.05\\) level) just by chance alone.
3. And then are we supposed to include every possible interaction as well? \\(k(k-1)/2\\) just for 2nd-order interactions...
4. Since independent variables are often correlated with one another, including them all doesn't really make sense

---

### Bathwater

.pull-left[

![](https://4kwallpaper.wiki/wp-content/uploads/2019/09/304552.jpg)

]

.pull-right[

- Generally begin with a previous empirical model (or models) based on a theory (or theories)
  - a literature review is crucial for making sure competing/alternative explanations are not responsible and for making sensible decisions that readers can recognise and understand: don't throw the baby out with the bathwater!
  - basically your baseline from previous work
- Then introduce any additional variables or interactions according to your tweaked, transformed, or totally new theory
  - e.g. previous work suggests that a father's height predicts his son's height. We think fathers who worked as miners also have shorter sons, so we include this

]

---

.pull-left[

### F... When's an Increase a Real Increase?

We can test whether increases are statistically significant using something called an *F-test*

- This is based on a distribution called the F-distribution.
- `\(F\)` can take only nonnegative values.
- Distribution is skewed right.
- Mean is approximately equal to 1 (closer to 1 for large `\(n\)`).
- Exact shape depends on two `\(df\)` values:
  - `\(df_1 = k\)` (number of explanatory variables in the model)
  - `\(df_2 = n - (k + 1)\)` (sample size minus number of model parameters)
- The test tells us whether we can reject `\(H_0\)` that the increase in model fit is 0

]

.pull-right[

We are testing the collective influence of all explanatory variables in a model on Y:

- `\(H_0\)`: \\(b_1 = b_2 = ... = b_k = 0\\)
  - (i.e. `\(y\)` is independent of all explanatory variables)
- `\(H_A\)`: At least one \\(b \neq 0\\)
  - (i.e. at least one explanatory variable has an effect on `\(y\)`, controlling for the others in the model)

```r
anova(cathm, catha)
```

```
## Analysis of Variance Table
## 
## Model 1: Fertility ~ Catholic + Education + Infant.Mortality
## Model 2: Fertility ~ Catholic + Education + Infant.Mortality + Agriculture
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     43 2422.2                              
## 2     42 2158.1  1    264.18 5.1413 0.02857 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1 ``` ] --- ### Stepwise model selection A popular, usually more exploratory method for searching for an overall fitted model rather than testing a particular hypothesis involves fitting several models and using information about them to make decisions about which features should be in the theoretical model - **Forward stepwise model selection** starts with a minimal model (just the established state of the art or just your variable(s) of interest) and adds features as alternative explanations or controls - **Backward stepwise model selection** starts with a maximal model (all the credible effects you could include) and removes features However, be careful with semi-automated model selection: _commonly used, commonly abused_. Often coupled with (mistaken) belief that _only_ statistically significant features should be retained and presented. - goes against principle of hypothesis testing - __a lack of statistical significance does not mean definitely no effect; it just means we cannot reject the hypothesis that there is no effect__ - sometimes just an ambiguous relationship, particular if nonlinear/endogeneity/multicollinearity - moreover, can lose valuable predictive information from deleting marginally significant variables ??? For more on (machine-learning style) model selection, see [this link](http://r-statistics.co/Model-Selection-in-R.html). There are, of course, [other variable selection approaches](http://r-statistics.co/Variable-Selection-and-Importance-With-R.html). --- .pull-left[ ```r library(olsrr) model <- lm(Fertility ~ ., data = swiss) ols_step_forward_aic(model) ``` ``` ## ## Selection Summary ## --------------------------------------------------------------------------- ## Variable AIC Sum Sq RSS R-Sq Adj. 
R-Sq ## --------------------------------------------------------------------------- ## Education 348.422 3162.719 4015.236 0.44062 0.42818 ## Catholic 337.564 4123.786 3054.169 0.57451 0.55517 ## Infant.Mortality 328.668 4755.710 2422.245 0.66254 0.63900 ## Agriculture 325.241 5019.885 2158.069 0.69935 0.67071 ## --------------------------------------------------------------------------- ``` ```r # ols_step_forward_aic(model, details = T) ``` ] .pull-right[ ```r library(olsrr) model <- lm(Fertility ~ ., data = swiss) ols_step_backward_aic(model) ``` ``` ## ## ## Backward Elimination Summary ## ---------------------------------------------------------------------- ## Variable AIC RSS Sum Sq R-Sq Adj. R-Sq ## ---------------------------------------------------------------------- ## Full Model 326.072 2105.043 5072.912 0.70674 0.67097 ## Examination 325.241 2158.069 5019.885 0.69935 0.67071 ## ---------------------------------------------------------------------- ``` ```r # ols_step_backward_aic(model, details = T) ``` ] ??? For more on `olsrr`'s stepwise functions, see [this link](https://olsrr.rsquaredacademy.com/articles/variable_selection.html). --- ### KISS Basically: - if significant and expected sign, cool keep it - if insignificant but main feature of the theory or an alternative explanation, keep it as a control - if insignificant and only weak theoretical grounds for inclusion, drop it (though keep as robustness check) - if significant and unexpected sign, review data, review interactions with other variables, review statistical power, review theory... Generally, KEEP IT SIMPLE STUPID (KISS) - our models are meant to help us understand reality; who could keep in mind an 80 parameter model? 
- small datasets hold less information, and thus less statistical power to identify many 'true' effects
- unnecessary variables are likely to increase VIFs, and thus SEs for all variables, making everything difficult to interpret
- consider Christopher Achen's ART (A Rule of Three) imperative... do you agree?
- as much art as science

Last words: It's ok to say that we don't know, if you've tried everything, can point out what doesn't work, and can argue that what might work is beyond what we have (e.g. a richer dataset, better methods). You have helped us get to the field's frontier!

---

### Cheat

```r
library(performance)
plot(compare_performance(cath, cath2, cathi, cathm, catha, rank=T))
```

<img src="STAT_L2_Modelling_files/figure-html/modelPerform-1.png" width="504" />

---
class: center, middle
# Summary

.pull-1[.circleon[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.encyclopediaofmath.org%2Flegacyimages%2Fcommon_img%2Fe036910a.gif&f=1&nofb=1)]]
.pull-1[.circleon[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fcdn4.iconfinder.com%2Fdata%2Ficons%2Fevil-icons-user-interface%2F64%2Fdownload-512.png&f=1&nofb=1)]]
.pull-1[.circleon[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2Fd%2Fdf%2FTwo_Parallel_lines.svg%2F200px-Two_Parallel_lines.svg.png&f=1&nofb=1)]]
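---

### Appendix: the F-statistic by hand

For reference, the F-test comparing nested models (as in the `anova()` comparison earlier) can be reproduced directly from the two residual sums of squares. A minimal sketch, assuming the built-in `swiss` data; `m0` and `m1` are hypothetical stand-ins for the deck's `cathm` and `catha`:

```r
# Nested models: m1 adds Agriculture to m0
m0 <- lm(Fertility ~ Catholic + Education + Infant.Mortality, data = swiss)
m1 <- lm(Fertility ~ Catholic + Education + Infant.Mortality + Agriculture,
         data = swiss)

rss0 <- sum(resid(m0)^2)   # RSS of restricted model (~2422.2)
rss1 <- sum(resid(m1)^2)   # RSS of full model (~2158.1)
q    <- 1                  # number of added variables (restrictions tested)
df2  <- df.residual(m1)    # n - (k + 1) = 47 - 5 = 42

Fstat <- ((rss0 - rss1) / q) / (rss1 / df2)     # ~5.14, as anova() reports
pval  <- pf(Fstat, q, df2, lower.tail = FALSE)  # ~0.029
```

If these match `anova(m0, m1)`, each piece of the F-test is doing what we expect.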