Sampling, Estimators, and Estimation

Will Horne

Final Project

  • Reminder: Due date is December 13th, final class is the 2nd

    • We will have in-class presentations (not graded), roughly 7 minutes per presentation (a chance for feedback)
  • Requirements for final project:

    • Final paper should be 10-15 pages (including tables, figures, references). Include a clear research question and a brief literature review

    • Quantitative description of your data

    • Analysis of the relationship between two (or more) variables (Covariance, difference in means, regression…)

A Pitch for POSC 8410!

  • I’m teaching it!

  • Revamping both curriculum and qualifying exam process

    • POST Curriculum –> Exam
  • Evaluate published research with state of the art methods and propose extensions using appropriate methods.

  • POSC 8410 will cover regression modelling, measurement, and techniques for causal inference

    • Direct Goal - give you the tools you need for research

    • Indirect benefit - You will be very well prepared for qualifying exams!

Job Talks

  • The department is hiring a tenured professor for the POST program

  • I highly recommend attending the job talks and taking the opportunity to meet with the candidates and ask questions

    • Faculty vote on candidates, but your feedback is important
  • Even if you don’t care who gets hired, this is a chance to see what good public policy research looks like!

  • For those of you who are interested in pursuing academic careers, it is also useful to see what a job talk looks like

Job Talk Schedule

  • Friday 11/15 2:30 - 4 - Qing Miao (RIT): Assessing Social Equity in Federal Disaster Aid Distribution: Evidence from County-Level Analyses

  • Monday 11/18 2:30 - 4 - Ping Xu (URI): Federalism and the Politics of Immigrant Welfare Exclusion in the US

    • Sadly, we have class. If you are very interested in this talk, let me know.
  • Wednesday 11/20 11:30 - 1 - Michael Jones (University of Tennessee)

Checking in

  • So far, we have been learning about properties of random variables and their distributions

  • Now, we are ready to try to estimate features of a population with data

  • Where do our estimators come from? What are their properties?

  • How does sampling work?

  • Problem Set next week on Estimators and Hypothesis Testing.

Some Terminology

Sample: The subset of units that are observed and measured. The size of the sample is denoted n.

Population: The set of units from which your sample is drawn. The size of the population is N.

Statistic: A numerical summary of the sample - such as the sample mean or variance

Population Parameter: A numerical summary of the population. Sample statistics have analogs in the population

We can say that the sample mean is an estimator of the population mean.
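A quick sketch of that idea in R (the population values below are simulated, purely for illustration):

```r
# Simulated illustration: the sample mean as an estimator of the population mean
set.seed(1)
population <- rnorm(10000, mean = 5, sd = 2)  # a made-up population, N = 10000
samp <- sample(population, 100)               # a simple random sample, n = 100

mean(population)  # the population parameter
mean(samp)        # the sample statistic: our estimate of the parameter
```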

Motivating Example

  • Rational choice theories of political behavior struggle to explain why individuals vote

  • Formally, vote only if \(pB + D > C\), where

    • p is the probability of pivotality

    • B is the policy benefit to the voter

    • D is the “direct benefit” to the voter

    • C is the cost of voting

  • What about the role of norms and social pressure?

Gerber et al Experiment

What is our sample? What is our population? What is a good quantity to estimate?

Gerber et al Results

How confident should we be in these predicted differences?

Should campaigns update their mailing strategies?

Can we think of some assumptions that might underlie the validity of these results?

Work with the Gerber Data

Load in the Gerber, Green and Larimer Data.

The Dependent Variable is “Vote” and the Treatment (or Independent) Variable is called treatment.

Convert treatment to a factor. Calculate the mean and standard deviation for each treatment category (use R functions, no need to code your own)

Calculate the Difference in means for the Neighbors Treatment and each other category.
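A sketch of that workflow in R. Since the real file is not loaded here, this uses a small simulated stand-in; the column names `voted` and `treatment` and the category labels are assumptions — check them against the actual Gerber, Green, and Larimer file.

```r
# Simulated stand-in for the Gerber, Green, and Larimer data
# (column names and category labels are assumptions, not the real file's)
set.seed(42)
gerber <- data.frame(
  voted = rbinom(5000, 1, 0.32),
  treatment = sample(c("Control", "Civic Duty", "Hawthorne", "Self", "Neighbors"),
                     5000, replace = TRUE)
)

# Convert treatment to a factor
gerber$treatment <- factor(gerber$treatment)

# Mean and standard deviation of turnout within each treatment category
means <- tapply(gerber$voted, gerber$treatment, mean)
sds   <- tapply(gerber$voted, gerber$treatment, sd)

# Difference in means: Neighbors versus every other category
diffs <- means["Neighbors"] - means[names(means) != "Neighbors"]
```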

The Magic of Randomization

One reason to be skeptical of the results is if the treatment and control groups look different in ways that are either observable or unobservable.

We cannot test whether we have balance on unobservables - but we feel better about things if our observable covariates are balanced. Let’s check if sex, age, and voting in the 2004 primary are balanced.

If randomization is successful, we are nearly guaranteed a balanced sample across treatments. We will get into the math later, but as long as each treatment condition is reasonably large, we should be OK. This means for experimental results, we don’t even need regression with controls!
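A minimal balance check might look like the following; the data are simulated here, and the covariate names (`sex`, `age`, `p2004`) are assumptions about how the file codes them.

```r
# Simulated data standing in for the experiment (covariate names are assumptions)
set.seed(7)
dat <- data.frame(
  treatment = sample(c("Control", "Neighbors"), 2000, replace = TRUE),
  sex   = rbinom(2000, 1, 0.5),
  age   = round(rnorm(2000, 50, 12)),
  p2004 = rbinom(2000, 1, 0.4)           # voted in the 2004 primary
)

# Covariate means by arm; under successful randomization these should be close
balance <- aggregate(cbind(sex, age, p2004) ~ treatment, data = dat, FUN = mean)
balance
```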

What are the goals? (1)

  • Inference

    • What is our best guess about some quantity of interest?

    • What is a plausible range of values for that quantity of interest?

    • Under certain assumptions/conditions, recover causal inferences about treatments

      • Much more about this next semester. Causal inference is a whole sub-discipline of both statistics and the quantitative social sciences.

What are the goals? (2)

  • Compare Estimators

    • Difference in sample means

      • \(\bar{Y} - \bar{X}\)
    • Post-stratification estimator

      • Estimate group means separately, then weight to recover the overall estimate. If we let W stand for white respondents and B stand for black respondents, that is:

        • \((\bar{Y}_{W} - \bar{X}_{W})\bar{Z} + (\bar{Y}_{B} - \bar{X}_{B})(1 - \bar{Z})\), where \(\bar{Z}\) is the share of W respondents
    • How do we choose which estimator is appropriate?
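A sketch of the two estimators side by side on simulated data, where the true treatment effect is 0.3 in both groups (all numbers illustrative):

```r
# Simulated data: two groups with different baselines, same treatment effect (0.3)
set.seed(3)
group <- sample(c("W", "B"), 1000, replace = TRUE, prob = c(0.7, 0.3))
treat <- rbinom(1000, 1, 0.5)
y <- 0.3 * treat + ifelse(group == "W", 0.1, 0.4) + rnorm(1000, 0, 0.5)

# Simple difference in sample means
simple_diff <- mean(y[treat == 1]) - mean(y[treat == 0])

# Post-stratification: within-group differences, weighted by group shares
share_W <- mean(group == "W")
diff_W  <- mean(y[group == "W" & treat == 1]) - mean(y[group == "W" & treat == 0])
diff_B  <- mean(y[group == "B" & treat == 1]) - mean(y[group == "B" & treat == 0])
post_strat <- diff_W * share_W + diff_B * (1 - share_W)
```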

Sampling Intuition

Imagine you are a TA in POSC1010. Let’s say there are 300 students in the class, and you would like to know the distribution of class year (Freshman, Sophomore, etc).

Rather than ask every single student, you decide to ask the 20 students seated in the first two rows.

The first student answers “Freshman”, the second “Junior” and so on, and you convert these to numeric codes (1 = F, 2 = So, 3 = J, 4 = Se).

Raw Data

Student ID Year
1 2
2 3
3 1
4 1
5 1

Plotting the Distribution

[1] "Sample Mean = 1.75"

The Components

Our sample is the twenty students we asked for their class year (n = 20)

Our population is the entire class (N = 300)

Our statistic/estimator is the sample mean \(\bar{X}\) = 1.75. This is our estimate of the population mean \(\mu\).

How good of an estimate is this?

Types of Error

Selection Bias

One type of bias you might come across is selection bias.

This occurs when not all units in the population have equal probabilities of being sampled.

A common example in social science research is the convenience sample, in which the units most easily available are chosen

Weighting and Stratification are potential solutions to selection bias

Measurement Bias

Measurement bias is introduced when your process of measuring a variable systematically misses the target in one direction.

In surveys, measurement bias can arise when questions are confusingly worded or leading, or when respondents may not be comfortable answering honestly.

In social sciences, we often use proxy measures to measure things we cannot observe directly, or indexes to attempt to measure latent concepts. If we are not careful, these will introduce measurement bias.

Non-Response Bias

Non-Response bias is introduced when units originally selected for the sample fail to provide data. When non-response is present, the final sample size for which there is full data is less than the initial sample size.

If non-responders differ from responders, we have bias. This is an untestable assumption; we can only hope that non-responders are missing at random.

Recall the first P-set. The attrition by survey respondents might introduce bias into our panel.

Back to our example

What types of bias are likely to occur if we randomly sample students from the first two rows of class?

We likely have selection bias, because students do not choose where to sit randomly. They may choose to sit with friends, or more experienced students may choose to sit closer to the front (or back).

Less likely that we would have measurement bias or non-response bias here.

Causes of Variance

Sampling variance describes the variability from one sample to the next - how much variation you would see if you were to draw a different sample.

Measurement variance is caused by inconsistency with measurement. If you measure the same variable twice, do you get the same reading?

Variance in Our Example

What type of variance are we likely to have?

Sampling variance depends on both the size of the population and of the sample. We have a sample of size 20 from a population of 300 - this is a relatively small sample!

We will formalize how to measure sampling variance shortly, but the importance of a large sample, as opposed to a random sample, is often overstated!

Now for the Notation

  • Model-Based Inference: Random Vectors \(X_{1},...,X_{n}\) are iid draws from CDF F

    • Model based because we assume some probability model F

    • Example: \(X_{i} = 1 \text{ if respondent } i \text{ votes and 0 otherwise}\)

    • The iid assumption is justified by random sampling from an (effectively) infinite population.

Point Estimation

  • Goal: Learn about features of the population

  • Parameter: \(\theta\) is any function of the population CDF F

    • Aka: Quantities of Interest, Estimands
  • Some common parameters

    • \(\mu = E[X_{i}]\)

    • \(\sigma^{2} = Var[X]\)

    • \(\mu_{y} - \mu_{x} = E[Y_{i}] - E[X_{i}]\) The difference in means between groups

  • Point estimation provides a single best guess about these parameters

Estimators

An Estimator \(\hat{\theta}_{n}\) for some parameter \(\theta\) is a statistic intended as a guess about \(\theta\)

\(\hat{\theta}_{n}\) is a random variable, because it is a function of the random sample

Implication: \(\hat{\theta}_{n}\) has a distribution, expectation, variance, etc

An Estimate is one particular realization of the estimator.

For example, “my estimator was the sample mean, and my estimate was 0.6”

Some Estimators

  • We could use many possible estimators for the population expectation \(E[X_{i}]\)

    • \(\hat{\theta}_{n} = \bar{X}_{n}\) the sample mean

    • \(\hat{\theta}_{n} = X_{1}\) just use the first observation

    • \(\hat{\theta}_{n} = \text{max}(X_{1},...,{X}_{n})\) use the largest observation

    • \(\hat{\theta}_{n} = 0\) always guess 0

  • Clearly, some estimators are better guesses of the population parameter than others!
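We can make “better” concrete by simulating all four candidates and comparing their average squared distance from the truth. The Bernoulli setup with \(\mu = 0.5\) is an illustrative choice:

```r
# Compare four estimators of E[X] by simulation (mu = 0.5 is illustrative)
set.seed(8)
mu <- 0.5
ests <- replicate(2000, {
  x <- rbinom(30, 1, mu)                 # a fresh sample of size 30
  c(mean = mean(x),                      # the sample mean
    first = x[1],                        # just the first observation
    max = max(x),                        # the largest observation
    zero = 0)                            # always guess 0
})

mse <- rowMeans((ests - mu)^2)  # average squared error of each estimator
mse                             # the sample mean lands far closest to the truth
```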

Three Distributions

  • Population Distribution: the data-generating process

    • In the Gerber et al paper, Bernoulli because vote/not vote is binary
  • Empirical Distribution: \(X_{1},...,X_{n}\)

    • The data in your specific sample.
  • Sampling Distribution: Distribution of the estimator over repeated samples from the population distribution

    • 0.38 turnout for the “neighbors” treatment is one draw from this distribution

Sampling from a population

# Set seed for reproducibility
set.seed(30317)

# Create a population of size 10000 with a mean of 0.4
# Using probability of success = 0.4 and number of trials = 1 for a binary outcome
population <- rbinom(10000, size = 1, prob = 0.4)

# Take 3 draws of size 30 from the population
sample1 <- sample(population, 30)
sample2 <- sample(population, 30)
sample3 <- sample(population, 30)

# Display the sample means
list(mean(sample1), mean(sample2), mean(sample3))
[[1]]
[1] 0.3

[[2]]
[1] 0.1666667

[[3]]
[1] 0.4333333

100000 repeated draws
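A sketch of what generates a figure like this: extend the sampling code above to many repeated draws and collect the sample means (the sampling distribution of \(\bar{X}_{n}\)).

```r
# Repeat the sampling above many times and collect the sample means
set.seed(30317)
population <- rbinom(10000, size = 1, prob = 0.4)
xbars <- replicate(100000, mean(sample(population, 30)))

mean(xbars)  # centered at the population mean
sd(xbars)    # the standard error of the sample mean
```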

Where do estimators come from?

  • Parametric Models: Assume X ~ F, iid, and specify what family of distributions F is from

    • Example: F is the Binomial(1000, 0.4) distribution

    • We often construct our estimator using maximum likelihood

      • The sample mean is the maximum likelihood estimator for the population mean
    • Inferences are model dependent - and our model may be wrong

Estimators Part 2

  • Non-Parametric Models: Minimal or no assumptions about F

    • K-nearest neighbors (KNN) methods are a basic example
  • Plug-in principle: Replace F with the empirical distribution

    • If \(\theta\) = E[X], replace with the sample mean \(\hat{\theta} = \frac{1}{n}\sum_{i =1}^{n} X_{i}\)
    • Semi-Parametric in that we only make limited assumptions about the population distribution

Other Plug-in Estimators

Variance:

\[ \sigma^{2} = E[(X_{i} - E[X_{i}])^{2}] \rightarrow \widehat{\sigma}^{2} = \frac{1}{n}\sum_{i = 1}^{n}(X_{i} - \bar{X}_{n})^{2} \]

Covariance:

\[ \sigma_{x,y} = Cov[X_{i}, Y_{i}] = E[(X_{i} - E[X_{i}])(Y_{i} - E[Y_{i}])] \rightarrow \\ \widehat{\sigma}_{x,y} = \frac{1}{n}\sum_{i=1}^{n}(X_{i} - \bar{X}_{n})(Y_{i} - \bar{Y}_{n})\]
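A quick check of these plug-in formulas in R; note they divide by n, so they differ from R’s built-in `var()` and `cov()` (which divide by n − 1) by a factor of (n − 1)/n:

```r
# Plug-in variance and covariance versus R's n-1 versions
set.seed(5)
n <- 200
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

plugin_var <- mean((x - mean(x))^2)               # (1/n) * sum of squared deviations
plugin_cov <- mean((x - mean(x)) * (y - mean(y))) # (1/n) * sum of cross-deviations

# Agreement with var()/cov() after rescaling by (n - 1)/n
plugin_var - var(x) * (n - 1) / n
plugin_cov - cov(x, y) * (n - 1) / n
```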

Estimator Properties

  • We only get one draw from the sampling distribution, \(\widehat{\theta_{n}}\)

    • Want to use estimators whose distribution is “close” to the true value.
  • Two ways to evaluate estimators

    • Finite Sample Properties: The properties of a sampling distribution for a fixed sample size n.

    • Large (or infinite) Sample Properties: The properties of an estimator as \(n \rightarrow \infty\)

Bias

The bias of an estimator \(\widehat{\theta}\) for parameter \(\theta\) is

\[ \text{bias}[\widehat{\theta}] = E[\widehat\theta] - \theta \]

An estimator is unbiased if \(E[\hat{\theta}] - \theta = 0\).

Is the sample mean unbiased?

\[ E[\bar{X}_{n}] = \frac{1}{n}\sum_{i=1}^{n}E[X_{i} ] = \frac{1}{n}\sum_{i = 1}^{n} \mu = \mu\]
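This can also be checked by simulation: average many realized sample means and the result should sit on top of \(\mu\) (here \(\mu = 2\), an arbitrary choice):

```r
# Unbiasedness check: the average of many sample means is approximately mu
set.seed(11)
mu <- 2
xbars <- replicate(50000, mean(rnorm(25, mean = mu)))
mean(xbars)  # very close to mu
```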

Estimation Variance

Sampling Variance: The variance of an estimator Var[\(\widehat{\theta}\)].

Sampling Variance of the sample mean:

\[ Var[{\bar{X}_{n}}] = \frac{1}{n^{2}}\sum_{i=1}^{n} Var[X_{i}] = \frac{1}{n^{2}}\sum_{i=1}^{n}\sigma^{2} = \frac{\sigma^{2}}{n} \]

Standard Error: Standard Deviation of the Estimator \(se(\widehat{\theta}) = \sqrt{Var(\widehat{\theta})}\)

Which means that the standard error of the sample mean is \(\frac{\sigma}{\sqrt{n}}\)
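This is easy to verify by simulation (\(\sigma = 3\) and n = 36 are arbitrary choices, giving a theoretical standard error of 3/6 = 0.5):

```r
# Check Var[Xbar] = sigma^2 / n, i.e. se = sigma / sqrt(n)
set.seed(12)
sigma <- 3
n <- 36
xbars <- replicate(50000, mean(rnorm(n, mean = 0, sd = sigma)))

sd(xbars)        # simulated standard error
sigma / sqrt(n)  # theoretical standard error: 0.5
```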

Mean Squared Error

Mean Squared Error (MSE) is

\[ E[(\widehat{\theta}_{n} - \theta)^{2}] \]

One way to estimate the quality of an estimator.

How large are the squared deviations from the true parameter? Lower = better!

Also a key metric to measure model performance in machine learning.

Bias Variance Tradeoff

Useful Decomposition:

\[ MSE = \text{bias}[\widehat{\theta}_{n}]^{2} + Var[\widehat{\theta}_{n}] \]

Therefore - For unbiased estimators, MSE is just the sampling variance

Of course we want low bias! Often, we want the best unbiased estimator. But, we might accept some bias for large reductions in variance. This can give us a better overall MSE.

Examples include Regularization, which increases bias and decreases variance in an effort to avoid overfitting the model to the data.
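A toy version of the tradeoff: shrinking the sample mean toward zero introduces bias but reduces variance, and when the true \(\mu\) is small the biased estimator wins on MSE. The shrinkage factor 0.8 is an arbitrary illustrative choice:

```r
# Bias-variance tradeoff: a shrunken (biased) mean can beat the unbiased mean on MSE
set.seed(13)
mu <- 0.2
xbars  <- replicate(20000, mean(rnorm(10, mean = mu)))  # unbiased estimator
shrunk <- 0.8 * xbars                                   # biased, lower-variance estimator

mean((xbars  - mu)^2)  # MSE of the unbiased sample mean (about sigma^2/n = 0.1)
mean((shrunk - mu)^2)  # lower MSE, despite the bias
```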

Understanding Bias and Variance

Imagine that we actually had access to an accurate record for all students enrolled in the class from the earlier example. Think of it as a census of the student population.

Bias in Selecting the Front Row

What if wanting to sit in the front row is caused by some hidden/latent trait, like enthusiasm?

Sampling from the front row

We have reason to suspect that students in the front row are more enthusiastic. What sort of bias might that introduce?

We can model the bias (in a sort of naive way) by selecting students with a probability proportional to enthusiasm

slice(pop_enthusiastic, 1:5)
  year enthusiasm
1    1         10
2    2          6
3    4          1
4    1         10
5    2          6

Code

# Load libraries
library(dplyr)
library(ggplot2)
library(infer)
library(patchwork)

# Generate sample data, weighting selection by enthusiasm
samp_1 <- pop_enthusiastic %>%
  slice_sample(n = 18, replace = FALSE, weight_by = enthusiasm) 

many_samps <- samp_1 %>%
  mutate(replicate = 1)

set.seed(40211)

for (i in 2:500) {
  many_samps <- pop_enthusiastic %>%
    slice_sample(n = 18, replace = FALSE, weight_by = enthusiasm) %>%
    mutate(replicate = i) %>%
    bind_rows(many_samps)
}

# Define a custom theme for all plots
custom_theme <- theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 10)
  )

# Plot Sample 1
p1 <- many_samps %>%
  filter(replicate == 1) %>%
  ggplot(aes(x = year)) + 
  geom_bar(fill = "#4B0082", color = "black", width = 0.7) +
  labs(title = "Sample 1", x = "Year", y = "Count") +
  custom_theme

# Plot Sample 2
p2 <- many_samps %>%
  filter(replicate == 2) %>%
  ggplot(aes(x = year)) + 
  geom_bar(fill = "#6A0DAD", color = "black", width = 0.7) +
  labs(title = "Sample 2", x = "Year", y = "Count") +
  custom_theme

# Plot Sample 3
p3 <- many_samps %>%
  filter(replicate == 3) %>%
  ggplot(aes(x = year)) + 
  geom_bar(fill = "#8A2BE2", color = "black", width = 0.7) +
  labs(title = "Sample 3", x = "Year", y = "Count") +
  custom_theme

# Sampling Distribution of Means
many_xbars <- many_samps %>%
  group_by(replicate) %>%
  summarize(xbar = mean(as.numeric(year)))

p4 <- many_xbars %>%
  ggplot(aes(x = xbar)) +
  geom_histogram(fill = "purple", color = "black", bins = 15) +  # Histogram for distribution of means
  lims(x = c(0, 4)) +
  labs(title = "Sampling Distribution", x = "Mean Year", y = "Frequency") +
  custom_theme

# Arrange all plots in a grid using patchwork
(p1 + p2 + p3) / p4

Samples

Survey Sampling

Survey sampling is a type of design-based inference where we take a finite sample from a population of size N.

A simple random sample of size n from a finite population is a sample in which the probability of inclusion of each unit is

\[ \pi = \frac{n}{N} \]

and we let \(I_{i} \in \{0, 1\}\) be an indicator of whether population unit i is included in the sample

More complex sampling designs lead to different inclusion probabilities and, thus, different design-based inferences

Estimands and Estimators

  • Estimand: Population mean \(\bar{x} = \frac{1}{N}\sum_{i=1}^{N}x_{i}\)

    • Fixed quantity because pop is fixed and finite

    • But we never observe it (short of a true census)

  • Estimator: sample mean \(\bar{X}_{n} = \frac{1}{n}\sum_{i=1}^{N} I_{i}x_{i}\)

    • The estimator is random because the sample is random
  • Design Based Inference - Randomness comes from sample alone, and depends on the sampling design

Survey Variance

Variance of \(\bar{X}_{n}\) over repeated samples

\[ Var[\bar{X}_{n}] = (1 - \frac{n}{N}) \frac{s^{2}}{n} \]

What is \(s^{2}\)? It’s the population variance

\[ s^{2} = \frac{1}{N-1} \sum_{i=1}^{N}(x_{i} -\bar{x})^{2} \]

Variance of the sample mean

We can apply the plug-in principle from before, and use the sample variance \(S^{2}\)

\[ \widehat{Var}[\bar{X}_{n}] = (1 - \frac{n}{N})\frac{S^{2}}{n} \\ S^{2} = \frac{1}{n-1}\sum_{i=1}^{N}I_{i}(x_{i}-\bar{X_n})^{2} \]

This is an unbiased estimate of the variance of the sample mean
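As a sketch, this estimator is a one-liner in R; note how the finite population correction sends the variance to zero as n approaches N (the population below is simulated for illustration):

```r
# Estimated variance of the sample mean with the finite population correction
svy_var_mean <- function(x, N) {
  n <- length(x)
  (1 - n / N) * var(x) / n   # var() already uses the n - 1 denominator
}

set.seed(21)
pop  <- rnorm(300, mean = 2)   # a made-up finite population, N = 300
samp <- sample(pop, 20)        # simple random sample, n = 20

svy_var_mean(samp, N = 300)    # smaller than var(samp)/20 because of the FPC
svy_var_mean(pop, N = 300)     # a full census has zero sampling variance
```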

Weighting

  • Often, we have unequal sampling probabilities

    • Some groups may be hard to reach

      • Low trust individuals

      • Non-English speakers

  • Let \(\pi_{i}\) be the probability of an individual from a specific group being sampled

    \[ \tilde{X}_{ipw} = \frac{\sum_{i=1}^{N} I_{i}x_{i}/\pi_{i}}{\sum_{i=1}^{N} I_{i}/\pi_{i}} \]
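A sketch of this weighted estimator in action, with two made-up groups sampled at different rates (all numbers illustrative). The unweighted mean is pulled toward the easy-to-reach group; reweighting by \(1/\pi_{i}\) recovers the population mean:

```r
# Unequal inclusion probabilities: naive mean vs. the inverse-probability-weighted mean
set.seed(31)
N <- 10000
group <- rep(c("easy", "hard"), c(7000, 3000))
x <- ifelse(group == "easy", rnorm(N, mean = 1), rnorm(N, mean = 3))
pi_i <- ifelse(group == "easy", 0.10, 0.02)  # hard-to-reach group is undersampled

I <- rbinom(N, 1, pi_i)                      # inclusion indicators
naive <- mean(x[I == 1])                     # biased toward the easy group
ipw   <- sum(I * x / pi_i) / sum(I / pi_i)   # weighted estimate

mean(x)  # the population mean both estimators are chasing
```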