Statistics for International Relations Research II

class: center, middle, inverse, title-slide

# Statistics for International Relations Research II
## Refresher
### <large>James Hollway</large>

---

class: center, middle

.pull-1[.circleon[![](https://graduateinstitute.ch/sites/default/files/styles/medium/public/2019-01/James%20Hollway.jpg?itok=1Yw0keum)]]
.pull-1[.circleon[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fmedia.istockphoto.com%2Fvectors%2Fgrade-a-plus-result-vector-icon-school-red-mark-handwriting-a-plus-in-vector-id1136966571%3Fk%3D6%26m%3D1136966571%26s%3D612x612%26w%3D0%26h%3DS3pDI_xutxq1nLoWAW_D3h5j9wfwkVhWe7LSoVJRA00%3D&f=1&nofb=1)]]
.pull-1[.circleon[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2Fc%2Fcf%2FRewind_button.svg%2F240px-Rewind_button.svg.png&f=1&nofb=1)]]

---
class: center, middle
# Introductions

.pull-1[.circleon[![](https://graduateinstitute.ch/sites/default/files/styles/medium/public/2019-01/James%20Hollway.jpg?itok=1Yw0keum)]]
.pull-1[.circleoff[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fmedia.istockphoto.com%2Fvectors%2Fgrade-a-plus-result-vector-icon-school-red-mark-handwriting-a-plus-in-vector-id1136966571%3Fk%3D6%26m%3D1136966571%26s%3D612x612%26w%3D0%26h%3DS3pDI_xutxq1nLoWAW_D3h5j9wfwkVhWe7LSoVJRA00%3D&f=1&nofb=1)]]
.pull-1[.circleoff[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2Fc%2Fcf%2FRewind_button.svg%2F240px-Rewind_button.svg.png&f=1&nofb=1)]]

---
background-image: url(https://media.giphy.com/media/lQPHNPMPZ7tcIwseGn/giphy.gif)
background-size: contain

---
class: center, middle, inverse, iheid-red

## Us

.center[
.pull-left[
.polaroid[![](https://graduateinstitute.ch/sites/default/files/styles/medium/public/2019-01/James%20Hollway.jpg?itok=1Yw0keum)]

James Hollway (Instructor)

james.hollway@ graduateinstitute.ch

Office hours: Fridays, 14–16
]

.pull-left[
.polaroid[![](https://0.academia-photos.com/13004503/4247925/4942167/s200_juliette.ganne.jpg_oh_dbe60a07890a3ced925fd1a9f71769fc_oe_54811513___gda___1415625964_87641a50172edbca992e61c6c445d7c3)]

Juliette Ganne (TA)

juliette.ganne@ graduateinstitute.ch

Office hours: By appointment
]
]

---
background-image: url(https://media.giphy.com/media/b7sNm0aR2EOqI/giphy.gif)
background-size: contain

---
## You

.pull-left[
![](https://owl.excelsior.edu/wp-content/uploads/sites/2/2018/07/Name-Tag-1.png)
]

.pull-right[
- Your name
- Your area of research interest
- One thing you remembered from Stats I
- One thing you are looking forward to in this class
- One surprising thing about you
]

---
class: center, middle
# Course

.pull-1[.circleoff[![](https://graduateinstitute.ch/sites/default/files/styles/medium/public/2019-01/James%20Hollway.jpg?itok=1Yw0keum)]]
.pull-1[.circleon[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fmedia.istockphoto.com%2Fvectors%2Fgrade-a-plus-result-vector-icon-school-red-mark-handwriting-a-plus-in-vector-id1136966571%3Fk%3D6%26m%3D1136966571%26s%3D612x612%26w%3D0%26h%3DS3pDI_xutxq1nLoWAW_D3h5j9wfwkVhWe7LSoVJRA00%3D&f=1&nofb=1)]]
.pull-1[.circleoff[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2Fc%2Fcf%2FRewind_button.svg%2F240px-Rewind_button.svg.png&f=1&nofb=1)]]

---
## Course aims

This is an intermediate statistics course for applied researchers...

Primary goals are to:
- identify, explain, and evaluate model choices in terms of questions
- recognise key inferential assumptions and strategies for if they are not met 
- apply statistical terms, concepts, and programming
- generate, interpret, and communicate statistical findings verbally and in writing
- use data to make evidence-based decisions
- think critically about contemporary methodological issues

Basically I want to help with two more general problems researchers have:
1. recognise what data and models you would need to answer the questions that you have
1. recognise what questions you could answer with available data

This course is about developing this knowledge.

---
class: left, middle

.blockquote[99% of statistics only tell 49% of the story.

~ Ron Delegge II]

---
## Stats II in one slide!

.pull-left[
.full-width[.content-box-blue[1: Refresher]]

.full-width[.content-box-blue[2: Modelling]]

.full-width[.content-box-yellow[3: no class]]

.full-width[.content-box-blue[4: Assumptions]]

.full-width[.content-box-purple[5: MLE]]

.full-width[.content-box-blue[6: Models for Binary Outcomes]]

.full-width[.content-box-blue[7: Models for Multinomial Outcomes]]
]
.pull-left[
.full-width[.content-box-blue[8: Models for Longitudinal Outcomes]]

.full-width[.content-box-blue[9: Models for Text Outcomes]]

.full-width[.content-box-blue[10: Models for Network Outcomes]]

.full-width[.content-box-purple[11: Advanced]]

.full-width[.content-box-orange[12: Consultancies]]

.full-width[.content-box-orange[13: Consultancies]]

.full-width[.content-box-red[14: Projects]]
]

---
## Course sessions

.pull-left[
**Lectures**, James Hollway

Room: Zoom

Tuesdays, 16:15-18:00

1. Mix of conceptual and practical

1. Complementary to readings
]
--
.pull-left[
**Lab sessions**, Juliette Ganne

Room: S8

Wednesdays, 16:15-18:00

1. Deepening comprehension

1. DIY experience
]

---
## Course evaluation

**30% Exercises** 
- Submit 6/8 weekly exercises by noon following week, worth 20 points each
- Download and fill in the RMarkdown template, e.g. "E1_Surname.Rmd"; ensure replicability

**15% Reviews** 
- Submit a short review of a statistical publication of your choice in your field
- Check publication with us by 16th March; needs to use a method covered in class
- Due one week after session that uses method in the paper; guidelines and grading rubric shared on Moodle

**40% Project Report**
- Submit an RMarkdown report analysing data of your choice with a method discussed during the seminar
- Check with us by 16th April with an (ungraded) data report describing the data you are interested in
- Due final week; will be shared online and includes grade for commenting on each others' projects

**15% Participation**
- Submit answers _and_ questions about conceptual or practical matters on Moodle
- Nonlinear grading scheme rewards all those who put the extra effort in to understand/help others understand

---
##  Books

No single textbook for this course, but some good starting points...

.center[

]

All additional required and recommended readings are provided in electronic format on Moodle or will be available on reserve in the library.

---
## Software

In this class, we will use `$\mathcal{R}$`. It is **free**. **Flexible**. And **facilitates** more modern research.

.center[

]

There are many highly recommended resources for learning `$\mathcal{R}$` which I will share with you on Moodle,
and Juliette will make sure you have the basic understanding to get started during the lab sessions.

In the meantime, please download and install the most recent versions of both `$\mathcal{R}$` (the statistical software) and RStudio (a front-end which makes using R much easier).
RStudio makes it easier not only to manage what `$\mathcal{R}$` is doing,
but also to run Python (e.g. for webscraping) and C++ (for faster programming),
as well as writing assignments (like in this course), presentations, 
even apps that help you communicate your research more widely.
The assignments in this course will be in Rmarkdown.

???

Basically, GUI statistical software (e.g. Excel, SPSS) [were a mistake for modern research](https://www.r-bloggers.com/2020/11/graphical-user-interfaces-were-a-mistake-but-you-can-still-make-things-right/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29).

---
## Since you are here...

...you might also be interested in...

.center[

]

- for writing IHEID-style consistent dissertations (and syllabi and presentations..)
  - less prone to corruption, crashing, or inconsistency than Word
  - easier to write (and read) than `$\LaTeX$`, but nearly as flexible and nearly as small files
  - automatic figure, table, and cross-referencing; includes dedications, tables of contents, bibliography
  - write separate sections (with comments) and compile in chapters/papers (for supervisors) or as a whole (for submission),
  with automatic versioning
  - much, much more!

---
## Suggested approach

Learning statistics is all about:
- *repetition*

- *repetition*

- and application.

In this sense, learning statistics is _very_ much like learning a language.

Some are going to have a different familiarity or affinity with the material, have a stronger vocabulary or 'pronunication', etc.

That's fine -- you do you! -- but statistics is still an extraordinarily useful toolbelt for understanding the world (and people) around you and so it's worth leaning into it and getting out what you can. This is your chance.

So read, listen, do exercises, ask questions, play with ideas, get things wrong, ask for advice, etc...

---
class: center, middle
# Recap

.pull-1[.circleoff[![](https://graduateinstitute.ch/sites/default/files/styles/medium/public/2019-01/James%20Hollway.jpg?itok=1Yw0keum)]]
.pull-1[.circleoff[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fmedia.istockphoto.com%2Fvectors%2Fgrade-a-plus-result-vector-icon-school-red-mark-handwriting-a-plus-in-vector-id1136966571%3Fk%3D6%26m%3D1136966571%26s%3D612x612%26w%3D0%26h%3DS3pDI_xutxq1nLoWAW_D3h5j9wfwkVhWe7LSoVJRA00%3D&f=1&nofb=1)]]
.pull-1[.circleon[![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2Fc%2Fcf%2FRewind_button.svg%2F240px-Rewind_button.svg.png&f=1&nofb=1)]]

---
## Stats I in one slide!

.pull-left[
.full-width[.content-box-blue[1: Introduction]]

.full-width[.content-box-blue[2: Sampling and Measurement]]

.full-width[.content-box-blue[3: Descriptive Statistics]]

.full-width[.content-box-blue[4: Probability Distributions]]

.full-width[.content-box-blue[5: Confidence Intervals]]

.full-width[.content-box-orange[6: Mock Exam]]

.full-width[.content-box-red[7: Midterm]]
]
.pull-left[
.full-width[.content-box-yellow[8: no class]]

.full-width[.content-box-blue[9: Significance Testing]]

.full-width[.content-box-blue[10: Bivariate Analysis]]

.full-width[.content-box-blue[11: Linear Regression]]

.full-width[.content-box-blue[12: Multiple Regression]]

.full-width[.content-box-orange[13: Review Week]]

.full-width[.content-box-red[14: Take-Home Exam]]
]

---
## Modelling

Let me try and pick up more or less where you left off,
and talk a bit about statistical **modelling**.

The goal of a model is to provide a *simple, low-dimensional summary* of historical data
that might help us understand: 
- what basically happened here?

With a reasonable 'model' of the data, of what basically appeared to happen here, we can explore all sorts of potential applications...

---
## Why model?

--
My favourite answer to this important question is from Joshua Epstein (2008), who provides at least 16 reasons:

1. Explain (very distinct from predict) 
1. Guide data collection
1. Illuminate core dynamics
1. Suggest dynamical analogies
1. Discover new questions
1. Promote a scientific habit of mind
1. Bound (bracket) outcomes to plausible ranges 
1. Illuminate core uncertainties.
1. Offer crisis options in near-real time
1. Demonstrate tradeoffs / suggest efficiencies
1. Challenge the robustness of prevailing theory through perturbations 
1. Expose prevailing wisdom as incompatible with available data
1. Train practitioners
1. Discipline the policy dialogue
1. Educate the general public
1. Reveal the apparently simple (complex) to be complex (simple)

???

The choice, then, is not whether to build models; it's whether to build explicit ones.

First, cherry picked examples is a terrible bias. We live cases but (should) decide on populations.

In explicit models, assumptions are laid out in detail, so we can study exactly what they entail. On these assumptions, this sort of thing happens. When you alter the assumptions that is what happens.

One can sweep a huge range of parameters over a vast range of possible scenarios to identify the most salient uncertainties, regions of robustness, and important thresholds, which is difficult to do with only an implicit model.

Note this does not obviate the need for judgment. But by revealing tradeoffs, uncertainties, and sensitivities, models can discipline the dialogue about options and make unavoidable judgments more considered.

---
## Three “models”

The term “model” can sometimes be confusing, because it can mean different things:

1. a **theoretical model** captures how we think concepts relate to one another such that it _can_ (but need not) be written as an equation, e.g.
$$y = \alpha + \beta_1 x_1 - \beta_2 x_2 $$

1. a **family of models** express a precise but generic pattern like a straight line or quadratic curve as an equation, e.g.
`$$y = a + b_1 x_1 - b_2 x_2 + e$$`

1. a **fitted model** finds the specific model from the family that is closest to your data, e.g.
`$$y = 7 + 3x_1 - 2x_2$$`

Note a fitted model is just the closest model from a family of models.
It implies that you have the “best” model (according to some criteria); it doesn’t imply that you have a “good” model and it certainly doesn’t imply that the model is “true”.

So this is sometimes why talk about 'model choice' is plagued by misunderstandings...

---
## Form and notation for linear models

Note that the model family expressed in the previous slide is a common one.
It is called a *linear model*, and the general form of the multiple regression model reads:

$$ Y = \alpha + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k + \varepsilon  $$

where `$Y$` is the dependent or response variable, and `$x_{1}, x_{2}, ..., x_{k}$` represent a set of explanatory variables.

It is a pretty straightforward model, which is why it is so commonly used.
- we can express a direct, linear relationship between the left-hand side (LHS) and right-hand side (RHS) variables
- we can include numerous independent variables
- we can interpret coefficients as indicating the change of the dependent variable associated with a one unit increase/decrease of the independent variable holding all other variables constant

That is:

$$ E(Y) =  \alpha + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k  $$

---
## Some totally made-up data

.pull-left[Before we start using models on real, interesting datasets, let's explore the basics of how models work.

Let's use the simulated dataset `sim1`, included with the `{modelr}` package.

Let's plot its two continuous variables, *x* and *y*, to see how they’re related...

You can see a strong (linear) pattern in the data: the more *x* the same amount more/less *y* and vice versa.

What do you think this data could be??

Let’s use a model to capture that pattern and make it explicit...
]

.pull-right[

.panelset[
.panel[.panel-name[R Code]

```r
library(modelr)
library(ggplot2)
ggplot(sim1, aes(x, y)) + geom_point()
```
]

.panel[.panel-name[Plot]

![](STAT_L1_Refresher_files/figure-html/simulatedData-1.png)
]
]

]

---
## Lots of models

.pull-left[
Let’s start by getting a feel for what models from that family look like 
by randomly generating a few and overlaying them on the data.

```r
seed(123)
models <- tibble(a1 = runif(250, -20, 40), 
                 a2 = runif(250, -5, 5))

ggplot(sim1, aes(x, y)) + 
  geom_abline(aes(intercept = a1, slope = a2), 
              data = models, alpha = 1/4) +
  geom_point()
```

There are 250 models on this plot, but lots are *really* bad!

.red[What does that *mean*?]

We need a way to measure the distance/closeness between the data and the model
so that we can work out which ones are good, bad, and ugly.
]

.pull-right[

.panelset[
.panel[.panel-name[R Code]

```r
models <- tibble(a1 = runif(250, -20, 40), a2 = runif(250, -5, 5))

ggplot(sim1, aes(x, y)) + 
  geom_abline(aes(intercept = a1, slope = a2), data = models, alpha = 1/4) +
  geom_point()
```
]

.panel[.panel-name[Plot]

![](STAT_L1_Refresher_files/figure-html/lotsOfModels-1.png)
]
]

]

---
## Throwing shapes

.pull-left[
A good place to start is to consider the vertical distance between each observed point and the value the model would predict based on one of the dimensions

OLS does this for all observations, not just the one we selected. It collapses them all into a single number: the “root-mean-squared deviation”.

RMSE calculates the difference between all actual and predicted observations, squares them, averages them, and then takes the square root:

$$ RMSE = \sqrt{\frac{1}{n} \sum\limits_{i=1}^{n} (Y_i - \hat{Y}_i)^2}   $$

RMSE has lots of appealing mathematical properties, which we’re not going to talk about here...
]

.pull-right[

.panelset[
.panel[.panel-name[R Code]

```r
library(dplyr)
candidates <- models %>% filter(a1 > 0) %>% 
  filter(a2 > 0) %>% filter(a1 < 15) %>% 
  slice(1:2)
obs <- sim1 %>% filter(x == 5) %>% slice(3)
pred <- candidates[,1] + candidates[,2]*5
dists <- (pred$a1 - obs$y)
squares <- data.frame(x1 = pmin(5, 5+dists), 
                      x2 = pmax(5, 5+dists), 
                  y1 = pmin(obs$y, 
                            obs$y+dists), 
                  y2 = pmax(obs$y, 
                            obs$y+dists))

ggplot(sim1, aes(x, y)) + geom_point() +
  scale_x_continuous(limits = c(-5, 30)) +
  scale_y_continuous(limits = c(-5, 30)) +
  geom_abline(data = candidates, 
              aes(intercept = a1, slope = a2), 
              color = c("blue", "red"),
              alpha = 1/4, show.legend = F) +
  geom_rect(data=sim1[1,], 
            aes(xmin=squares[1,1], xmax=squares[1,2],
                ymin=squares[1,3], ymax=squares[1,4]), 
            fill="blue", alpha=0.2) +
  geom_rect(data=sim1[2,], 
            aes(xmin=squares[2,1], xmax=squares[2,2],
                ymin=squares[2,3], ymax=squares[2,4]), 
            fill="red", alpha=0.2)
```
]

.panel[.panel-name[Plot]

![](STAT_L1_Refresher_files/figure-html/twoModels-1.png)
]
]

]

---
.pull-left[
## The better models

We could iterate over all the 250 (random) models we created and calculate the RMSE for each one.

Then we just highlight the candidate model that comes closest to (or, least far from) all the points.

Let's overlay the 10 best models from our random selection on to the data.

I’ve coloured the models by `-dist`: this is an easy way to make sure that the best models (i.e. the ones with the smallest distance) get the brighest colours.

]

.pull-right[

.panelset[
.panel[.panel-name[R Code]

```r
model1 <- function(a, data) {
  a[1] + data$x * a[2]
}
measure_distance <- function(mod, data) {
  diff <- data$y - model1(mod, data)
  sqrt(mean(diff ^ 2))
}
sim1_dist <- function(a1, a2) {
  measure_distance(c(a1, a2), sim1)
}
models <- models %>% 
  mutate(dist = purrr::map2_dbl(a1, a2, 
                                sim1_dist))

ggplot(sim1, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  geom_abline(
    aes(intercept = a1, slope = a2, 
        colour = -dist), 
    data = filter(models, rank(dist) <= 10)
  )
```
]

.panel[.panel-name[Plot]

![](STAT_L1_Refresher_files/figure-html/plotBestModels-1.png)
]
]

]

---

.pull-left[
## The best model

However, since these belong to the family of *linear models*, 
we can exploit some connections between geometry, calculus, and linear algebra to quickly identify the single best solution. 
So instead of trying lots of random models "brute force", 
R has a tool specifically designed for fitting linear models called `lm()`:

```r
sim1_mod <- lm(y ~ x, data = sim1)
coef(sim1_mod)
```

```
## (Intercept)           x 
##    4.220822    2.051533
```

`lm()` has a special way to specify the model family: formulas. Formulas look like y ~ x, which `lm()` will translate to a function like

$$ y = a + b_1 x $$
]

.pull-right[

.panelset[
.panel[.panel-name[R Code]

```r
sim1_mod <- lm(y ~ x, data = sim1)
alpha_est <- sim1_mod$coefficients[1]
beta_est <- sim1_mod$coefficients[2]

ggplot(sim1, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  geom_abline(intercept = alpha_est, 
              slope = beta_est)
```
]

.panel[.panel-name[Plot]

![](STAT_L1_Refresher_files/figure-html/linearRegression-1.png)
]
]

]

???

R uses a formula system for specifying a model.

- You put the outcome variable on the left
- A tilde (`~`) is used for saying "predicted by"
- Exclude an intercept term by adding `-1` to your formula
- You can use a `.` to predict by all other variables e.g. `y ~ .`
- Use a `+` to provide multiple independent variables e.g. `y ~ a + b`
- You can use a `:` to use the interaction of two variables e.g. `y ~ a:b`
- You can use a `*` to use two variables and their interaction e.g. `y ~ a*b` (equivalent to `y ~ a + b + a:b`)
- You can construct features on the fly e.g. `y ~ log(x)` or use `I()` when adding values e.g. `y ~ I(a+b)`

For more info, check out `?formula`

Some other useful parameters
- `na.action` can be set to amend the handling of missings in the data
- `model`,`x`,`y` controls whether you get extra info about the model and data back. Setting these to `FALSE` saves space

---
## Interpretation

Cool, so we have a line. 
--
Now what?

Well, we can use this 'line' (our model) to think about the relationship between variable _x_ (whatever that is) and variable _y_ (whatever that is)
- the constant/intercept _a_ tells us what _y_ should be at _x=0_

$$ E(Y) = a + b_10 $$

- the parameter coefficient/estimate _b1_ tells us how _y_ changes with each unit increase in _x_

$$ (a + b_11) - (a + b_10) = b_1 $$

We talk about this being a 'one unit change', but you can also give it a substantive interpretation...

---

.pull-left[
## Simple linear regression

With statistics, and _incomplete data more generally_ (especially observational data),
we _always_ need to be skeptical and ask:
_couldn't we just have gotten this particular model (i.e. slope) by chance?_

If it's just by chance, 
then our model wouldn't actually tell us very much beyond this particular sample,
which kind of undermines its utility for us.

While it is tricky to evaluate how likely we could've gotten this result by chance from just one "sample", 
we can say something about how likely we are to get a slope
this steep or greater if actually there isn't any relationship in the population
based on how large and varied the sample is...

Let's run the simple linear regression again, but this time give a bit more output...
What can we say about our 'hypothesis' that _y_ is significantly related to _x_?
]

.pull-right[

```r
# Recommended
library(sjPlot)
lm(y ~ x, sim1) %>% tab_model()
```

<table style="border-collapse:collapse; border:none;">
<tr>
<th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm;  text-align:left; ">&nbsp;</th>
<th colspan="3" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">y</th>
</tr>
<tr>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal;  text-align:left; ">Predictors</td>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal;  ">Estimates</td>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal;  ">CI</td>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal;  ">p</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">(Intercept)</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center;  ">4.22</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center;  ">2.44&nbsp;&ndash;&nbsp;6.00</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center;  "><strong>&lt;0.001</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">x</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center;  ">2.05</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center;  ">1.76&nbsp;&ndash;&nbsp;2.34</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center;  "><strong>&lt;0.001</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm; border-top:1px solid;">Observations</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="3">30</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">R<sup>2</sup> / R<sup>2</sup> adjusted</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="3">0.885 / 0.880</td>
</tr>

</table>

```r
# sim1_mod <- lm(y ~ x, sim1)
# stargazer::stargazer(sim1_mod, type = "text")

# finalfit::finalfit(sim1, “y”, “x”)

# models <- list()
# models[['OLS 1']] <- lm(y ~ x, sim1)
# stargazer::msummary(models)
```

]

???

For another approach to understanding OLS through simulation, see [this link](http://yukiyanai.github.io/teaching/rm1/contents/R/linear-regression-1.html).

Or, [this](https://enchufa2.shinyapps.io/ls-springs/) is also a fun app...

---
## Significant notes

- `$p$`-values are not the probability that `$H_0$` is true
  - it is the probability, _assuming `$H_0$` to be true_, of observing a test statistic this far from the null hypothesised value
  - the further the observed statistic diverges from null hypothesised value, less probable such a divergent statistic obtained simply by chance
- `$p$`-values are not the probability that `$H_a$` is true
  - inference relates _only to the null hypothesis_; there could be other alternative hypotheses
  - in practice, we provisionally accept the alternative when rejecting the null (but more scholarship always warranted!)
- `$p$`-values are not the probability that `$H_0$` is false
  - statistical significance does not necessarily mean that the effect is real
  - by chance alone, about 1-in-20 significant findings will be spurious (implications for replication crisis...)
- non-significance does not mean no effect
  - small studies will often report non-significance even when there are important, real effects which a larger study would have detected
- significance does not mean important effect
  - it is the size of the effect that determines the importance, not the presence of statistical significance

---
## An example

> “The results **do not provide clear support** for the lack-of control hypothesis. Self-reported feelings of low and high control are positively associated with conspiracy belief in observational data (model 1; p<.05 and p<.01, respectively). **We are reluctant to engage in post hoc speculation about this unexpected result, but it does not clearly support our hypothesis.** Moreover, our experimental treatment effect estimate for our low-control manipulation is null relative to both the high-control condition (the preregistered hypothesis test) as well as the baseline condition (a RQ) in both the combined (table 2) and individual item results (table B7). Finally, **we find no evidence** that the association with self-reported feelings of control in model 1 of table 2 or the effect of the control treatments in model 2 are moderated by anti-Western or anti-Jewish attitudes (results available on request). **Our expectations are thus not supported.**”

.pull-right[Nyhan, Brendan, and Thomas Zeitzoff. 2018. "Conspiracy and Misperception Belief in the Middle East and North Africa." _The Journal of Politics_ 80(4): 1400-1404.]

---
## Extra features

The linear model is extensible to an arbitrary number of explanatory/independent variables/features.

But then (well, always) we need to interpret the coefficients carefully.

For example, consider the prediction equation of _k=2_ explanatory variables:

$$ E(Y) = a + b_1 x_1 + b_2 x_2 $$

`$b_1, b_2, ..., b_k$` are partial regression coefficients. 
That is, there is a linear relationship between `$E(Y)$` and `$x_1$` with the slope `$b_1$`, __controlling for other predictors in the model__.

If `$x_1$` goes up 1 unit with `$x_2$` held constant, the change in `$E(Y)$` is:

$$ (a + b_1 (x_1 + 1) + b_2 x_2) -  (a + b_1 x_1 + b_2 x_2) = b_1$$

The effect of each independent variable is the slope *controlling for* or *adjusting for* the effects of (all) other variables.

Thus, the best way of thinking about regression with more than one independent variable is to imagine a separate regression line for age at each value of religiosity, and vice versa.

---
.pull-left[
One trend, at least in presentations,
is to present results as "forest" plots (rather than tables):

```r
sim1$z <- sim2$y[1:30]
sim2_model <- lm(y ~ x + z, sim1)
plot_model(sim2_model, vline.color = "red", 
           sort.est = TRUE, show.values = T, 
           value.offset = .3)
```

<img src="STAT_L1_Refresher_files/figure-html/forestplot-1.png" width="504" />
]

.pull-right[
`{sjPlot}`, which is what we are using here, also makes it easy to standardise coefficients...

```r
plot_model(sim2_model, vline.color = "red", 
           sort.est = TRUE, 
           show.values = TRUE, 
           value.offset = .3, 
           type= "std")
```

<img src="STAT_L1_Refresher_files/figure-html/forestplotstd-1.png" width="504" />
]

???

See [here](https://strengejacke.github.io/sjPlot/articles/plot_model_estimates.html) for more on forest plots.

---
## Residuals

Another way to think about all this is that what we're really doing is using models to partition data into _patterns_ and _residuals_.

We’re trying to predict Y, but we’re never going to be spot on. Remember...

---
class: left, middle

.blockquote[All models are wrong, some are useful.

~ George E. P. Box]

.blockquote[A map is not the territory it represents, but, if correct, it has a similar structure to the territory, which accounts for its usefulness.

~ Alfred Korzybski]

???

It’s worth reading the fuller context of the quote:

Now it would be very remarkable if any system existing in the real world could be exactly represented by any simple model. 
However, cunningly chosen parsimonious models often do provide remarkably useful approximations. 
For example, the law PV = RT relating pressure P, volume V and temperature T of an “ideal” gas 
via a constant R is not exactly true for any real gas, 
but it frequently provides a useful approximation and furthermore its structure is informative 
since it springs from a physical view of the behavior of gas molecules.

For such a model there is no need to ask the question “Is the model true?”. 
If “truth” is to be the “whole truth” the answer must be “No”. 
The only question of interest is “Is the model illuminating and useful?”.

The goal of a model is not to uncover truth, but to discover a simple approximation that is still useful.

---
## Residuals

What we're really doing is using models to partition data into _patterns_ and _residuals_.
We’re trying to predict Y, but we’re never going to be spot on. 
_How wrong they are_ and _how they are wrong_ can tell us a lot about whether we can improve our model or whether we are using the right family of models.

The main way to explore this is by examining the *residuals*: the remaining deviation between model predictions and the actual observations
(the _e_ in our model equation).

```r
plot(sim1_mod)
```

---
## Some final points on data

All this depends on some important *assumptions*: "a hammer works best for nailing problems".

We'll cover several assumptions over the next 2 weeks,
but to conclude today I want to highlight three extra conditions concerning the **data** we are using here:

1. Independent:
  - Sampling theory vs stochastic theory
  - SRS
  - independent and identically distributed (IID or iid)
  
--

1. Size: `dim()`
  - `$n > k$`
  - CLT says `$n>30$` will provide normal sampling distribution even if population distribution not normal
  
--

1. Variance: `var()`
  - There must be some variance in the features
  - What is the opposite of a variable?

---
background-image: url(https://imgs.xkcd.com/comics/statistics.png)
background-size: contain

---
class: center, middle
# Summary

What questions do you have for me?