Confidence Intervals and Hypothesis Testing

Where are we now?

  • We can now estimate population parameters from data

    • And think about bias and consistency of estimators
  • Now: How can we use these estimates to test a hypothesis about a parameter?

    • Is the mean treatment effect of pre-election mailings > 0?

    • Is there majority support for a carbon tax?

  • We will put the probability intuition we have been building to good use!

Supposed Origins of Hypothesis Testing

The Lady Tasting Tea

  • Biologist (and tea aficionado) Muriel Bristol claimed she could tell whether tea or milk was added first to a cup.

  • Statistician R.A. Fisher was skeptical, so he devised a simple test:

    • Make 8 cups of tea, 4 each way

    • Present the cups in random order and ask Bristol to pick which 4 are milk-first

  • She picked all of them correctly.

    • What can we learn from this?

    • Could she have just gotten all 8 right by chance?

Probability of Randomly Getting all 8

  • How often would she get all 8 right if she were just guessing at random?

    • One way to choose all 4 correctly

    • \(\binom{8}{4}\) = 70 total ways to choose 4 cups

  • So, probability of getting all 8 correct with random guessing is \(\frac{1}{70}\) \(\approx 0.014\)

  • It’s pretty unlikely she was guessing randomly!

    • Bayesian perspective might be a little different.
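The combinatorics above are easy to verify directly in R; a quick sketch:

```r
# Number of ways to choose 4 "milk-first" cups out of 8
total_ways <- choose(8, 4)     # 70
# Only one of those choices gets all cups right
p_all_correct <- 1 / total_ways
total_ways     # 70
p_all_correct  # approximately 0.014
```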

Hypothesis Testing Framework

What is a hypothesis?

  • A hypothesis is a statement about a population parameter

  • We might have causal hypotheses

    • Does social pressure cause higher voter turnout (mean turnout higher under social pressure than control)?

    • Does dropping standardized test requirements increase student diversity?

  • Or descriptive hypotheses

    • Is Keir Starmer’s (UK PM) approval rating higher than 50%?

    • Do more Americans support leaving NATO now than did in 2010?

NHST Framework

  • Choose null and alternative hypotheses

  • Choose a test statistic, \(T_{n}\)

  • Choose a test level, \(\alpha\)

  • Determine the rejection region

  • Reject if \(T_{n}\) is in the rejection region, fail to reject otherwise

Null and Alternative Hypothesis

  • The Null hypothesis is the one we explicitly test

    • Usually of the form, “No relationship/difference/effect”

    • \(H_{0}\) : Social pressure mailings don’t impact turnout, \(\tau = 0\)

  • The Alternative Hypothesis is the complement of the null

    • Usually, “there is a relationship/difference/effect”

    • \(H_{1}\): social pressure mailings do impact turnout, \(\tau \neq 0\)

  • Testing the null, quantity of interest remains \(\tau\)

    • In papers, we generally state \(H_{1}\)

    • Bayes: Posterior distribution with credible intervals

Two Sided versus One Sided

  • One sided tests are of the form \(H_{1}: \theta > \theta_{0} \text{ or } \theta < \theta_{0}\)

    • Explicitly tests for either a positive or negative difference
  • Two sided tests are of the form \(H_{1}: \theta \neq \theta_{0}\)

    • tests for (lack of) equality
  • We almost always use two sided tests.

    • One sided tests are ignoring information/evidence in one direction

    • Two sided is much more conservative, and much more common

General Framework

  • Hypothesis tests choose to reject or not reject the null based on the observed data

    • Assumption: we know the data generating process
  • Rejection is based on test statistic, \(T_{n}\)

    • Helps us reason about likelihood of Null vs Alternative

    • Larger values of \(T_{n}\) mean that the null is less plausible

    • A test statistic is a random variable (it has a distribution, etc.)

  • Intuitively, reject the null when \(|\bar{Y}_{1} - \bar{Y}_{0}|\) is large

Rejection

  • The Rejection Region C is a region of the sample space

    • If our data lies in C, we reject \(H_{0}\)

    • If not, we fail to reject \(H_{0}\)

  • Regions are based on some test statistic \(T_{n}\). Usually:

    • We have some critical value c.

    • \(|T_{n}| > c\) means that we reject the null

  • c also defines the rejection region C

    • Reject when \(T_{n} \in C\)

Types of Error

                   \(H_0\) True                       \(H_0\) False
Retain \(H_0\)     Great                              Type II error (False Negative)
Reject \(H_0\)     Type I error (False Positive)      Great
  • I don’t like the terminology

  • False Positive (Type I)

    • No treatment effect but we reject the null
  • False Negative (Type II)

    • Treatment effect is nonzero, but we fail to reject the null

Features of a test

A good test rejects the null when it should, and retains it when there is no treatment effect

Power Function of a test: probability of rejection of a null as a function of \(\theta\)

\[ \pi(\theta) = P(\text{Reject } H_{0} | \theta) = P(T_{n} \in C | \theta) \]

Hypothetical: if we knew \(\theta\), what is the probability that the test rejects the null?

The Power of a test against an alternative \(\theta_{1} \in H_{1}\) is \(\pi(\theta_{1})\)
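For a two-sided test based on an asymptotically normal statistic, the power function can be written in closed form. A minimal sketch (the function name and illustrative values are mine, assuming \(T_{n} \sim N(\theta/se, 1)\)):

```r
# Power of a two-sided z test at level alpha:
# pi(theta) = P(|T_n| > c | theta), where T_n ~ N(theta/se, 1)
power_fn <- function(theta, se, alpha = 0.05) {
  crit <- qnorm(1 - alpha / 2)
  pnorm(-crit - theta / se) + 1 - pnorm(crit - theta / se)
}
power_fn(0, se = 1)   # at the null, power equals the size (0.05)
power_fn(3, se = 1)   # larger effects are rejected more often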

Size

The size of a test is the probability of a false positive/false discovery:

\[ \pi(\theta_{0}) = P(\text{Reject } H_{0}|\theta = \theta_{0}) \]

Size of a two sided test: \(P(|T_{n}| > c|\theta = \theta_{0})\)

We want to minimize the size of the test
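We can check by simulation that a two-sided test run on data generated under the null rejects at roughly the nominal rate; a sketch with illustrative sample sizes:

```r
set.seed(42)
n <- 100; sims <- 10000
# Simulate T_n under the null (true mean = 0) and estimate the rejection rate
reject <- replicate(sims, {
  x <- rnorm(n, mean = 0, sd = 1)         # data generated under H0
  t_stat <- mean(x) / (sd(x) / sqrt(n))
  abs(t_stat) > 1.96
})
mean(reject)   # close to the nominal size of 0.05
```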

Test Statistic Example

From the Central Limit Theorem, the Difference in Means estimator is asymptotically normal

\[ \frac{\widehat{\tau}_{n} - \tau}{\widehat{se}[\widehat{\tau}_{n}]} \overset{d}{\rightarrow} N(0,1) \]

Under the null \(H_{0}: \tau = E[Y_{1}] - E[Y_{0}] = 0\),

\[ T_{n} = \frac{\widehat{\tau}_{n}}{\widehat{se}[\widehat{\tau}_{n}]} \overset{d}{\rightarrow} N(0,1) \]

Unfortunately, the bigger the power, the bigger the size: there is a tradeoff.

Size Power Tradeoff


Controlling the Size

  • Generally, we cannot both reduce size and increase power at once

    • The classic resolution is to cap the rate of false discoveries
  • In frequentist statistics, we set a significance level \(\alpha\) as the maximum size of a test

    • Convention in the social sciences is 0.05, which is arbitrary.

      • Benjamin et al. propose \(\alpha = 0.005\) in a recent Nature piece

      • My take: it’s all arbitrary. Specify in advance and justify based on the research question or application

Why set \(\alpha\) at .05?

  • Justification of \(\alpha\) is that at most \(100\times\alpha\%\) of discoveries will be false discoveries

    • In practice, this is woefully optimistic

    • Much more on this next semester, but a combination of p-hacking, the file-drawer problem, and journals’ bias against null results is to blame.

    • Bayesians hate this approach, but as of now it remains dominant

  • As researchers, our goal is never significance.

    • Our job is to accurately describe the world

One Sided Test

  • How would we select c such that \(\alpha = 0.05\)?

    • Let \(G_{0}(t) = P(T_{n} \leq t \mid H_{0})\) be the CDF of the test statistic under the null

    • We want to find c that puts \(\alpha\) probability in the tail: \(1 - G_{0}(c) = \alpha\)

    • Use the quantile function: \(c = G_{0}^{-1}(1 - \alpha)\)

  • If \(G_{0}\) is the \(N(0,1)\) CDF and \(\alpha = 0.05\), then \(c = G_{0}^{-1}(0.95) \approx 1.645\)

Two Sided Test

  • How would we select c such that \(\alpha = 0.05\)?

    • Same logic as before, but now we want \(c = G_{0}^{-1}(1 - \frac{\alpha}{2})\)
  • If \(G_{0}\) is the \(N(0,1)\) CDF and \(\alpha = 0.05\), then \(c = G_{0}^{-1}(0.975) \approx 1.96\)
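In R, the quantile function of the standard normal is qnorm, so both critical values fall out in one line each:

```r
alpha <- 0.05
qnorm(1 - alpha)       # one-sided critical value, about 1.645
qnorm(1 - alpha / 2)   # two-sided critical value, about 1.96
```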

Hypothesis Test Steps

  • Hypothesis: \(H_{0}, \tau = 0\) vs \(H_{1},\tau \neq 0\)

  • Test Statistic: \(T_{n} = \frac{\widehat{\tau}_{n}}{\widehat{se}[\widehat{\tau}_{n}]}\)

  • Pick \(\alpha\), often 0.05

  • Rejection region is from the quantile function

    • The computer will calculate it for you, or use a lookup table

t-test/Wald test

  • Consider any asymptotically normal estimator \(\widehat{\theta}\) for parameter \(\theta\)

  • Test \(H_{0}, \theta = \theta_{0}\) vs \(H_{1}, \theta \neq \theta_{0}\)

    Note

    A size \(\alpha\) t or Wald test rejects \(H_{0}\) when \(|T_{n}| > c\), where \[T_{n} = \frac{\widehat{\theta}_{n} - \theta_{0}}{\widehat{se}[\widehat{\theta}_{n}]}\]

Critical Value comes from the quantile function as before
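The test boxed above is simple enough to sketch directly; a minimal version (the function name wald_test and the example values are mine):

```r
# Minimal sketch of a size-alpha Wald/t test
wald_test <- function(theta_hat, theta_0, se_hat, alpha = 0.05) {
  t_n <- (theta_hat - theta_0) / se_hat   # standardized test statistic
  crit <- qnorm(1 - alpha / 2)            # two-sided critical value
  list(statistic = t_n, critical = crit, reject = abs(t_n) > crit)
}
# e.g., is a sample proportion of 0.57 (se = 0.03) different from 0.5?
wald_test(theta_hat = 0.57, theta_0 = 0.5, se_hat = 0.03)
```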

p-values

  • p-values are the probability of observing a test statistic as or more extreme than \(T_{n}\) under \(H_{0}\)

    • Equivalently, the smallest \(\alpha\) at which we could reject the null

    • Less arbitrary than picking some c: a continuous measure of the strength of evidence against the null

      Note

      For a two sided test

      \[ p = 2(1 - G_{0}(|T_{n}|)) \]

      but in practice R will do this for you
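For a standard normal null distribution, the two-sided formula above is one line of R (the observed statistic here is an illustrative value):

```r
t_n <- 2.1                        # hypothetical observed test statistic
p <- 2 * (1 - pnorm(abs(t_n)))    # two-sided p-value under a N(0,1) null
p                                 # about 0.036, so reject at alpha = 0.05
```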

Careful with P values!

  • Low P value \(\sim\) data unlikely if null is true \(\sim\) evidence against the null

  • P values are not

    • An indication of large substantive effects

    • The probability that the null hypothesis is false

    • The probability that the alternative hypothesis is true

  • They are \(P(\text{data as or more extreme than observed} \mid H_{0} \text{ is true})\), not \(P(H_{0} \mid \text{data})\)

Student t Exact Test

  • Asymptotics are approximations. Can we get precise inference at any sample size?

  • Yes! Assume \(X_{1},...,X_{n}\) are iid samples from \(N(\mu, \sigma^{2})\)

  • Note

    Under the null \(H_{0}: \mu = \mu_{0}\) we have \[T_{n} = \frac{\bar{X}_{n} - \mu_{0}}{s_{n}/\sqrt{n}} \sim t_{n-1},\] a Student t distribution with n − 1 degrees of freedom

  • Use quantiles of the Student t rather than normal for critical values

    • Asymptotically equivalent to normal, but more conservative with small n.
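The conservatism with small n is easy to see by comparing quantiles; the qt function gives Student t critical values:

```r
# Two-sided 5% critical values: Student t is wider for small n
qt(0.975, df = 4)    # about 2.78 with n = 5
qt(0.975, df = 29)   # about 2.05 with n = 30
qnorm(0.975)         # 1.96, the asymptotic (normal) limit
```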

Comparing Student and Normal Distributions

Confidence Intervals

  • Hypothesis tests give us a binary decision: reject or fail to reject the null hypothesis

  • An alternative approach is to focus on the range of plausible values for a given estimate, rather than simply whether we reject the null

  • The concept of confidence intervals allows us to place upper and lower bounds on the plausible values for our estimate/treatment/intervention.

    • If we calculate a 95% confidence interval, then (assuming no systematic biases, measurement error, etc.) 95% of intervals constructed this way across repeated samples will contain the population parameter.

Calculating Confidence Intervals

  • Can calculate for any confidence level. The steps are as follows:
    • Pick some \(0 < \alpha < 1\).

    • Determine the corresponding critical value c from the quantile function or a lookup table

    • Confidence interval is \([\bar{x} - c \times se, \bar{x} + c \times se]\)

An example

Suppose we sample 1500 survey respondents from some population and ask them whether they support increased public funding for charter schools. We get back a sample mean of 0.57 (0 = oppose, 1 = support), with a standard error of 0.03. How can we calculate a 92% confidence interval?

It turns out (I asked ChatGPT) that the critical value for a 92% confidence interval is about 1.75. The qt function will do this for you in R. So,

Lower Bound = \(0.57 - 1.75 \times 0.03 = 0.5175\)

Upper Bound = \(0.57 + 1.75 \times 0.03 = 0.6225\)
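No need to ask ChatGPT: the whole calculation fits in a few lines of R (using the normal quantile, which is nearly identical to the t with 1499 degrees of freedom at this sample size):

```r
x_bar <- 0.57; se <- 0.03
alpha <- 1 - 0.92                # for a 92% confidence interval
c_val <- qnorm(1 - alpha / 2)    # critical value, about 1.75
c(lower = x_bar - c_val * se, upper = x_bar + c_val * se)
```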

Visualizing Confidence Intervals

Confidence Intervals w/PDF

When do we do this?

  • A simple difference in means or proportions hypothesis test is most useful when

    • We want to compare two groups

      • We can compare more groups, but it’s a little more complicated
    • We have reason to believe that there is balance across the groups such that we don’t need to adjust/control for confounders

      • Experiments/random assignment
    • OR, we want to know if a population parameter is different than some discrete value or in some interval

  • We instead use other tools (mostly regression) if

    • We want to model relationships between a continuous or count independent variable and the DV

    • We think we need to adjust for some covariates that may impact both our independent and dependent variables

Hypothesis Testing in R

## load the data
data(mtcars)
## split mpg by transmission type (am: 0 = automatic, 1 = manual)
mpg_auto <- mtcars$mpg[mtcars$am == 0]
mpg_manual <- mtcars$mpg[mtcars$am == 1]
## perform a two-sample (Welch) t-test
t_test_result <- t.test(mpg_auto, mpg_manual,
                        alternative = "two.sided", var.equal = FALSE)


t_test_result

    Welch Two Sample t-test

data:  mpg_auto and mpg_manual
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.280194  -3.209684
sample estimates:
mean of x mean of y 
 17.14737  24.39231 

Your Turn

  • Let’s go back to the data on potential racial bias in hiring (resumes.csv)

  • Test whether individuals with black and white names have the same probability of getting a call back

  • Conduct a formal hypothesis test and then plot the difference in means with confidence intervals

    • Start with a significance level of .05 (the default)

    • What if we adopt the recent recommendation of .005?
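A sketch of the mechanics with simulated stand-in data, since resumes.csv (and its column names) isn't reproduced here; the callback rates below are invented for illustration, and with the real data you would instead subset the callback indicator by the race-of-name variable:

```r
set.seed(1)
n <- 2435                                # illustrative per-group sample size
call_black <- rbinom(n, 1, 0.065)        # simulated 0/1 callback indicators
call_white <- rbinom(n, 1, 0.097)
sim_test <- t.test(call_white, call_black, alternative = "two.sided")
sim_test$p.value                         # compare against .05, then .005
```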