Confidence Intervals and Hypothesis Testing

Where are we now?

  • We can now estimate population parameters from data

    • And think about bias and consistency of estimators
  • Now: How can we use these estimates to test a hypothesis about a parameter?

    • Is the mean treatment effect of pre-election mailings > 0?

    • Is there majority support for a carbon tax?

  • We will put the probability intuition we have been building to good use!

Supposed Origins of Hypothesis Testing

The Lady Tasting Tea

  • Biologist (and tea aficionado) Muriel Bristol claimed she could tell whether tea or milk was added first to a cup.

  • Statistician R.A. Fisher was skeptical, so he devised a simple test:

    • Make 8 cups of tea, 4 each way

    • Present the cups in random order and ask Bristol to pick which 4 are milk-first

  • She picked all of them correctly.

    • What can we learn from this?

    • Could she have just gotten all 8 right by chance?

Probability of Randomly Getting all 8

  • How often would she get all 8 right if she were just guessing at random?

    • One way to choose all 4 correctly

    • \(\binom{8}{4}\) = 70 total ways to choose 4 cups

  • So, probability of getting all 8 correct with random guessing is \(\frac{1}{70}\) \(\approx 0.014\)

  • It’s pretty unlikely she was guessing randomly!

    • Bayesian perspective might be a little different.
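The combinatorics above are easy to verify directly in R; a quick sketch:

```r
# Number of ways to choose 4 "milk-first" cups out of 8
total_ways <- choose(8, 4)     # 70
# Only one of those choices gets all cups right
p_all_correct <- 1 / total_ways
total_ways     # 70
p_all_correct  # approximately 0.014
```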

Hypothesis Testing Framework

What is a hypothesis?

  • A hypothesis is a statement about a population parameter

  • We might have causal hypotheses

    • Does social pressure cause higher voter turnout (mean turnout higher under social pressure than control)?

    • Does dropping standardized test requirements increase student diversity?

  • Or descriptive hypotheses

    • Is Keir Starmer’s (UK PM) approval rating higher than 50%?

    • Do more Americans support leaving NATO now than did in 2010?

NHST Framework

  • Choose null and alternative hypotheses

  • Choose a test statistic, \(T_{n}\)

  • Choose a test level, \(\alpha\)

  • Determine the rejection region

  • Reject if \(T_{n}\) is in the rejection region, fail to reject otherwise

Null and Alternative Hypothesis

  • The Null hypothesis is the one we explicitly test

    • Usually of the form, “No relationship/difference/effect”

    • \(H_{0}\) : Social pressure mailings don’t impact turnout, \(\tau = 0\)

  • The Alternative Hypothesis is the complement of the null

    • Usually, “there is a relationship/difference/effect”

    • \(H_{1}\): social pressure mailings do impact turnout, \(\tau \neq 0\)

  • Testing the null, quantity of interest remains \(\tau\)

    • In papers, we generally state \(H_{1}\)

    • Bayes: Posterior distribution with credible intervals

Two Sided versus One Sided

  • One sided tests are of the form \(H_{1}: \theta > \theta_{0} \text{ or } \theta < \theta_{0}\)

    • Explicitly tests for either a positive or negative difference
  • Two sided tests are of the form \(H_{1}: \theta \neq \theta_{0}\)

    • tests for (lack of) equality
  • We almost always use two sided tests.

    • One sided tests are ignoring information/evidence in one direction

    • Two sided is much more conservative, and much more common

General Framework

  • Hypothesis tests choose to reject or not reject the null based on the observed data

    • Assumption: we know the data generating process
  • Rejection is based on test statistic, \(T_{n}\)

    • Helps us reason about likelihood of Null vs Alternative

    • Larger values of \(T_{n}\) mean that the null is less plausible

    • A test statistic is a random variable (it has a distribution, etc.)

  • Intuitively, reject the null when \(|\bar{Y}_{1} - \bar{Y}_{0}|\) is large

Rejection

  • The Rejection Region C is a region of the sample space

    • If our data lies in C, we reject \(H_{0}\)

    • If not, we fail to reject \(H_{0}\)

  • Regions are based on some test statistic \(T_{n}\). Usually:

    • We have some critical value c.

    • \(|T_{n}| > c\) means that we reject the null

  • c also defines the rejection region C

    • Reject when \(T_{n} \in C\)

Types of Error

                   \(H_0\) True                       \(H_0\) False
Retain \(H_0\)     Great                              Type II error (False Negative)
Reject \(H_0\)     Type I error (False Positive)      Great
  • I don’t like the terminology

  • False Positive (Type I)

    • No treatment effect but we reject the null
  • False Negative (Type II)

    • Treatment effect is nonzero, but we fail to reject the null

Features of a test

A good test rejects the null when it should, and retains it when there is no treatment effect

Power Function of a test: probability of rejection of a null as a function of \(\theta\)

\[ \pi(\theta) = P(\text{Reject } H_{0} | \theta) = P(T_{n} \in C | \theta) \]

Hypothetical: if we knew \(\theta\), what is the probability that the test rejects the null?

The Power of a test against an alternative \(\theta_{1} \in H_{1}\) is \(\pi(\theta_{1})\)
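For a two-sided test based on an asymptotically normal statistic, the power function can be written in closed form. A minimal sketch (the function name and illustrative values are mine, assuming \(T_{n} \sim N(\theta/se, 1)\)):

```r
# Power of a two-sided z test at level alpha:
# pi(theta) = P(|T_n| > c | theta), where T_n ~ N(theta/se, 1)
power_fn <- function(theta, se, alpha = 0.05) {
  crit <- qnorm(1 - alpha / 2)
  pnorm(-crit - theta / se) + 1 - pnorm(crit - theta / se)
}
power_fn(0, se = 1)   # at the null, power equals the size (0.05)
power_fn(3, se = 1)   # larger effects are rejected more often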

Size

The size of a test is the probability of a false positive/false discovery:

\[ \pi(\theta_{0}) = P(\text{Reject } H_{0}|\theta = \theta_{0}) \]

Size of a two sided test: \(P(|T_{n}| > c|\theta = \theta_{0})\)

We want to minimize the size of the test
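We can check by simulation that a two-sided test run on data generated under the null rejects at roughly the nominal rate; a sketch with illustrative sample sizes:

```r
set.seed(42)
n <- 100; sims <- 10000
# Simulate T_n under the null (true mean = 0) and estimate the rejection rate
reject <- replicate(sims, {
  x <- rnorm(n, mean = 0, sd = 1)         # data generated under H0
  t_stat <- mean(x) / (sd(x) / sqrt(n))
  abs(t_stat) > 1.96
})
mean(reject)   # close to the nominal size of 0.05
```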

Test Statistic Example

From the Central Limit Theorem, the Difference in Means estimator is asymptotically normal

\[ \frac{\widehat{\tau}_{n} - \tau}{\widehat{se}[\widehat{\tau}_{n}]} \overset{d}{\rightarrow} N(0,1) \]

Under the null \(H_{0}: \tau = E[Y_{1}] - E[Y_{0}] = 0\),

\[ T_{n} = \frac{\widehat{\tau}_{n}}{\widehat{se}[\widehat{\tau}_{n}]} \overset{d}{\rightarrow} N(0,1) \]

Unfortunately, the bigger the power, the bigger the size: there is a tradeoff.

Size Power Tradeoff


Controlling the Size

  • Generally, we cannot both reduce size and increase power at once

    • The classic resolution is to cap the rate of false discoveries
  • In frequentist statistics, we set a significance level \(\alpha\) as the maximum size of a test

    • Convention in the social sciences is 0.05, which is arbitrary.

      • Benjamin et al. propose \(\alpha = 0.005\) in a recent Nature piece

      • My take: it’s all arbitrary. Specify in advance and justify based on the research question or application

Why set \(\alpha\) at .05?

  • Justification of \(\alpha\) is that at most \(100\times\alpha\%\) of discoveries will be false discoveries

    • In practice, this is woefully optimistic

    • Much more on this next semester, but a combination of p-hacking, the file-drawer problem, and journals’ bias against null results is to blame.

    • Bayesians hate this approach, but as of now it remains dominant

  • As researchers, our goal is never significance.

    • Our job is to accurately describe the world

One Sided Test

  • How would we select c such that \(\alpha = 0.05\)?

    • Let \(G_{0}(t) = P(T_{n} \leq t \mid H_{0})\) be the CDF of the test statistic under the null

    • We want to find c that puts \(\alpha\) probability in the tail: \(1 - G_{0}(c) = \alpha\)

    • Use the quantile function: \(c = G_{0}^{-1}(1 - \alpha)\)

  • If \(G_{0}\) is the \(N(0,1)\) CDF and \(\alpha = 0.05\), then \(c = G_{0}^{-1}(0.95) \approx 1.645\)

Two Sided Test

  • How would we select c such that \(\alpha = 0.05\)?

    • Same logic as before, but now we want \(c = G_{0}^{-1}(1 - \frac{\alpha}{2})\)
  • If \(G_{0}\) is the \(N(0,1)\) CDF and \(\alpha = 0.05\), then \(c = G_{0}^{-1}(0.975) \approx 1.96\)
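In R, the quantile function of the standard normal is qnorm, so both critical values fall out in one line each:

```r
alpha <- 0.05
qnorm(1 - alpha)       # one-sided critical value, about 1.645
qnorm(1 - alpha / 2)   # two-sided critical value, about 1.96
```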

Hypothesis Test Steps

  • Hypothesis: \(H_{0}, \tau = 0\) vs \(H_{1},\tau \neq 0\)

  • Test Statistic: \(T_{n} = \frac{\widehat{\tau}_{n}}{\widehat{se}[\widehat{\tau}_{n}]}\)

  • Pick \(\alpha\), often 0.05

  • Rejection region is from the quantile function

    • The computer will calculate it for you, or use a lookup table

t-test/Wald test

  • Consider any asymptotically normal estimator \(\widehat{\theta}\) for parameter \(\theta\)

  • Test \(H_{0}, \theta = \theta_{0}\) vs \(H_{1}, \theta \neq \theta_{0}\)

    Note

    A size \(\alpha\) t or Wald test rejects \(H_{0}\) when \(|T_{n}| > c\), where \[T_{n} = \frac{\widehat{\theta}_{n} - \theta_{0}}{\widehat{se}[\widehat{\theta}_{n}]}\]

Critical Value comes from the quantile function as before
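The test boxed above is simple enough to sketch directly; a minimal version (the function name wald_test and the example values are mine):

```r
# Minimal sketch of a size-alpha Wald/t test
wald_test <- function(theta_hat, theta_0, se_hat, alpha = 0.05) {
  t_n <- (theta_hat - theta_0) / se_hat   # standardized test statistic
  crit <- qnorm(1 - alpha / 2)            # two-sided critical value
  list(statistic = t_n, critical = crit, reject = abs(t_n) > crit)
}
# e.g., is a sample proportion of 0.57 (se = 0.03) different from 0.5?
wald_test(theta_hat = 0.57, theta_0 = 0.5, se_hat = 0.03)
```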

p-values

  • p-values are the probability of observing a test statistic as or more extreme than \(T_{n}\) under \(H_{0}\)

    • Equivalently, the smallest \(\alpha\) at which we could reject the null

    • Less arbitrary than picking some c: a continuous measure of the strength of evidence against the null

      Note

      For a two sided test

      \[ p = 2(1 - G_{0}(|T_{n}|)) \]

      but in practice R will do this for you
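For a standard normal null distribution, the two-sided formula above is one line of R (the observed statistic here is an illustrative value):

```r
t_n <- 2.1                        # hypothetical observed test statistic
p <- 2 * (1 - pnorm(abs(t_n)))    # two-sided p-value under a N(0,1) null
p                                 # about 0.036, so reject at alpha = 0.05
```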

Careful with P values!

  • Low P value \(\sim\) data unlikely if null is true \(\sim\) evidence against the null

  • P values are not

    • An indication of large substantive effects

    • The probability that the null hypothesis is false

    • The probability that the alternative hypothesis is true

  • They are \(P(\text{data as or more extreme than observed} \mid H_{0} \text{ is true})\), not \(P(H_{0} \mid \text{data})\)

Student t Exact Test

  • Asymptotics are approximations. Can we get precise inference at any sample size?

  • Yes! Assume \(X_{1},...,X_{n}\) are iid samples from \(N(\mu, \sigma^{2})\)

  • Note

    Under the null \(H_{0}: \mu = \mu_{0}\) we have \[T_{n} = \frac{\bar{X}_{n} - \mu_{0}}{s_{n}/\sqrt{n}} \sim t_{n-1},\] a Student t distribution with n − 1 degrees of freedom

  • Use quantiles of the Student t rather than normal for critical values

    • Asymptotically equivalent to normal, but more conservative with small n.
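The conservatism with small n is easy to see by comparing quantiles; the qt function gives Student t critical values:

```r
# Two-sided 5% critical values: Student t is wider for small n
qt(0.975, df = 4)    # about 2.78 with n = 5
qt(0.975, df = 29)   # about 2.05 with n = 30
qnorm(0.975)         # 1.96, the asymptotic (normal) limit
```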

Comparing Student and Normal Distributions

Confidence Intervals

  • Hypothesis tests give us a binary decision: reject or fail to reject the null hypothesis

  • An alternative approach is to focus on the range of plausible values for a given estimate, rather than simply whether we reject the null

  • The concept of confidence intervals allows us to place upper and lower bounds on the plausible values for our estimate/treatment/intervention.

    • If we calculate a 95% confidence interval, then (assuming no systematic biases, measurement error, etc.) 95% of intervals constructed this way across repeated samples will contain the population parameter.

Calculating Confidence Intervals

  • Can calculate for any confidence level. The steps are as follows:
    • Pick some \(0 < \alpha < 1\).

    • Determine the corresponding critical value c from the quantile function or a lookup table

    • Confidence interval is \([\bar{x} - c \times se, \bar{x} + c \times se]\)

An example

Suppose we sample 1500 survey respondents from some population and ask them whether they support increased public funding for charter schools. We get back a sample mean of 0.57 (0 = oppose, 1 = support), with a standard error of 0.03. How can we calculate a 92% confidence interval?

It turns out (I asked ChatGPT) that the critical value for a 92% confidence interval is about 1.75. The qt function will do this for you in R. So,

Lower Bound = \(0.57 - 1.75 \times 0.03 = 0.5175\)

Upper Bound = \(0.57 + 1.75 \times 0.03 = 0.6225\)
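No need to ask ChatGPT: the whole calculation fits in a few lines of R (using the normal quantile, which is nearly identical to the t with 1499 degrees of freedom at this sample size):

```r
x_bar <- 0.57; se <- 0.03
alpha <- 1 - 0.92                # for a 92% confidence interval
c_val <- qnorm(1 - alpha / 2)    # critical value, about 1.75
c(lower = x_bar - c_val * se, upper = x_bar + c_val * se)
```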

Visualizing Confidence Intervals

Confidence Intervals w/PDF

When do we do this?

  • A simple difference in means or proportions hypothesis test is most useful when

    • We want to compare two groups

      • We can compare more groups, but it’s a little more complicated
    • We have reason to believe that there is balance across the groups such that we don’t need to adjust/control for confounders

      • Experiments/random assignment
    • OR, we want to know if a population parameter is different than some discrete value or in some interval

  • We instead use other tools (mostly regression) if

    • We want to model relationships between a continuous or count independent variable and the DV

    • We think we need to adjust for some covariates that may impact both our independent and dependent variables

Hypothesis Testing in R

## load the data
data(mtcars)
## split mpg by transmission type (am: 0 = automatic, 1 = manual)
mpg_auto <- mtcars$mpg[mtcars$am == 0]
mpg_manual <- mtcars$mpg[mtcars$am == 1]
## perform a two-sample (Welch) t-test
t_test_result <- t.test(mpg_auto, mpg_manual,
                        alternative = "two.sided", var.equal = FALSE)


t_test_result

    Welch Two Sample t-test

data:  mpg_auto and mpg_manual
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.280194  -3.209684
sample estimates:
mean of x mean of y 
 17.14737  24.39231 

Your Turn

  • Let’s go back to the data on potential racial bias in hiring (resumes.csv)

  • Test whether individuals with black and white names have the same probability of getting a call back

  • Conduct a formal hypothesis test and then plot the difference in means with confidence intervals

    • Start with a significance level of .05 (the default)

    • What if we adopt the recent recommendation of .005?
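A sketch of the mechanics with simulated stand-in data, since resumes.csv (and its column names) isn't reproduced here; the callback rates below are invented for illustration, and with the real data you would instead subset the callback indicator by the race-of-name variable:

```r
set.seed(1)
n <- 2435                                # illustrative per-group sample size
call_black <- rbinom(n, 1, 0.065)        # simulated 0/1 callback indicators
call_white <- rbinom(n, 1, 0.097)
sim_test <- t.test(call_white, call_black, alternative = "two.sided")
sim_test$p.value                         # compare against .05, then .005
```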