Reminder: Due date is December 13th, final class is the 2nd
Requirements for final project:
Final paper should be 10-15 pages (including tables, figures, references). Include a clear research question and a brief literature review
Quantitative description of your data
Analysis of the relationship between two (or more) variables (Covariance, difference in means, regression…)
I’m teaching it!
Revamping both curriculum and qualifying exam process
Evaluate published research with state of the art methods and propose extensions using appropriate methods.
POSC 8410 will cover regression modelling, measurement, and techniques for causal inference
Direct Goal - give you the tools you need for research
Indirect benefit - You will be very well prepared for qualifying exams!
The department is hiring a tenured professor for the POST program
I highly recommend attending the job talks and taking the opportunity to meet with the candidates and ask questions
Even if you don’t care who gets hired, this is a chance to see what good public policy research looks like!
For those of you who are interested in pursuing academic careers, it is also useful to see what a job talk looks like
Friday 11/15 2:30 - 4 - Qing Miao (RIT): Assessing Social Equity in Federal Disaster Aid Distribution: Evidence from County-Level Analyses
Monday 11/18 2:30 - 4 - Ping Xu (URI): Federalism and the Politics of Immigrant Welfare Exclusion in the US
Wednesday 11/20 11:30 - 1 - Michael Jones (University of Tennessee)
So far, we have been learning about properties of random variables and their distributions
Now, we are ready to try to estimate features of a population with data
Where do our estimators come from? What are their properties?
How does sampling work?
Problem Set next week on Estimators and Hypothesis Testing.
Sample: The subset of units that are observed and measured. The size of the sample is denoted n.
Population: The set of units from which your sample is drawn. The size of the population is N.
Statistic: A numerical summary of the sample - such as the sample mean or variance
Population Parameter: A numerical summary of the population. Sample statistics have analogs in the population
We can say that the sample mean is an estimator of the population mean.
Rational choice theories of political behavior struggle to explain why individuals vote
Formally, vote only if \(pB + D > C\), where
p is the probability of pivotality
B is the policy benefit to the voter
D is the “direct benefit” to the voter
C is the cost of voting
What about the role of norms and social pressure?
What is our sample? What is our population? What is a good quantity to estimate?
How confident should we be in these predicted differences?
Should campaigns update their mailing strategies?
Can we think of some assumptions that might underlie the validity of these results?
Load in the Gerber, Green and Larimer Data.
The Dependent Variable is “Vote” and the Treatment (or Independent) Variable is called treatment.
Convert treatment to a factor. Calculate the mean and standard deviation for each treatment category (use R functions, no need to code your own)
Calculate the Difference in means for the Neighbors Treatment and each other category.
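A minimal sketch of the exercise, using simulated stand-in data; the real Gerber, Green and Larimer data frame, the column names (`voted`, `treatment`), and the treatment labels used here are assumptions for illustration:

```r
# Simulated stand-in for the Gerber, Green and Larimer data (names assumed)
set.seed(1)
ggl <- data.frame(
  voted = rbinom(500, 1, 0.3),
  treatment = factor(sample(
    c("Control", "Civic Duty", "Hawthorne", "Self", "Neighbors"),
    500, replace = TRUE
  ))
)

# Mean and standard deviation of turnout within each treatment category
means <- tapply(ggl$voted, ggl$treatment, mean)
sds   <- tapply(ggl$voted, ggl$treatment, sd)

# Difference in means: Neighbors minus every other category
diffs <- means["Neighbors"] - means[names(means) != "Neighbors"]
round(diffs, 3)
```

With the real data, only the `data.frame` construction changes; the `tapply` calls and the difference-in-means line carry over directly.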
One reason to be skeptical of the results is if the treatment and control groups look different in ways that are either observable or unobservable.
We cannot test whether we have balance on unobservables, but we feel better if our observable covariates are balanced. Let's check whether sex, age, and voting in the 2004 primary are balanced.
If randomization is successful, we are nearly guaranteed a balanced sample across treatments. We will get into the math later, but as long as each treatment condition is reasonably large, we should be OK. This means for experimental results, we don’t even need regression with controls!
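One way to eyeball balance is a table of covariate means by treatment arm. This is a sketch with simulated stand-in data; the column names (`sex`, `age`, `p2004`) mirror the covariates named above but are assumptions, not the real variable names:

```r
# Simulated stand-in data; variable names are assumptions
set.seed(2)
dat <- data.frame(
  treatment = factor(sample(c("Control", "Neighbors"), 400, replace = TRUE)),
  sex   = rbinom(400, 1, 0.5),
  age   = round(rnorm(400, mean = 50, sd = 12)),
  p2004 = rbinom(400, 1, 0.4)
)

# Covariate means by treatment arm; under successful randomization,
# the rows should look similar
balance <- aggregate(cbind(sex, age, p2004) ~ treatment, data = dat, FUN = mean)
print(balance)
```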
Inference
What is our best guess about some quantity of interest?
What is a plausible range of values for that quantity of interest?
Under certain assumptions/conditions, recover causal inferences about treatments
Compare Estimators
Difference in sample means
Post-stratification estimator
Estimate group means separately, then weight to recover the overall estimator. If we let W stand for white respondents and B stand for black respondents, that is:
\[ \widehat{\mu} = \frac{N_{W}}{N}\bar{X}_{W} + \frac{N_{B}}{N}\bar{X}_{B} \]
How do we choose which estimator is appropriate?
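The comparison can be sketched with simulated data. The 50/50 population shares and the over-sampling of one group below are assumptions for illustration:

```r
# Simulated sample that over-represents group W relative to the population
set.seed(3)
samp <- data.frame(
  group = rep(c("W", "B"), times = c(80, 20)),
  y     = c(rnorm(80, mean = 1), rnorm(20, mean = 3))
)

# Naive estimator: the raw sample mean (implicitly weights by sample shares)
naive <- mean(samp$y)

# Post-stratification: estimate each group mean, then weight by the
# (hypothetical) population shares rather than the sample shares
pop_share <- c(W = 0.5, B = 0.5)
group_means <- tapply(samp$y, samp$group, mean)
post_strat <- sum(pop_share[names(group_means)] * group_means)

c(naive = naive, post_stratified = post_strat)
```

Because group B is under-represented in the sample but has the higher mean, the post-stratified estimate is pulled upward relative to the raw sample mean.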
Imagine you are a TA in POSC1010. Let’s say there are 300 students in the class, and you would like to know the distribution of class year (Freshman, Sophomore, etc).
Rather than ask every single student, you decide to ask the 20 students seated in the first two rows.
The first student answers “Freshman”, the second “Junior” and so on, and you convert these to numeric codes (1 = F, 2 = So, 3 = J, 4 = Se).
| Student ID | Year |
|---|---|
| 1 | 2 |
| 2 | 3 |
| 3 | 1 |
| 4 | 1 |
| 5 | 1 |
[1] "Sample Mean = 1.75"
Our sample is the twenty students we asked for their class year (n = 20)
Our population is the entire class (N = 300)
Our statistic/estimator is the sample mean \(\bar{X}\) = 1.75. This is our estimate of the population mean \(\mu\).
How good of an estimate is this?
One type of bias you might come across is selection bias.
This occurs when not all units in the population have equal probabilities of being sampled.
A common example in social science research is the convenience sample, in which the units most easily available are chosen
Weighting and Stratification are potential solutions to selection bias
Measurement bias is introduced when your process of measuring a variable systematically misses the target in one direction.
In surveys, measurement bias can arise when questions are confusingly worded or leading, or when respondents may not be comfortable answering honestly.
In social sciences, we often use proxy measures to measure things we cannot observe directly, or indexes to attempt to measure latent concepts. If we are not careful, these will introduce measurement bias.
Non-Response bias is introduced when units originally selected for the sample fail to provide data. When non-response is present, the final sample size for which there is full data is less than the initial sample size.
If non-responders differ from responders, we have bias. This is an untestable assumption; we hope non-responders are missing at random.
Recall the first P-set. The attrition by survey respondents might introduce bias into our panel.
What types of bias are likely to occur if we randomly sample students from the first two rows of class?
We likely have selection bias, because students do not choose where to sit randomly. They may choose to sit with friends, or more experienced students may choose to sit closer to the front (or back).
It is less likely that we would have measurement bias or non-response bias here.
Sampling variance describes the variability from one sample to the next: how much would your estimate vary if you were to draw a different sample?
Measurement variance is caused by inconsistency with measurement. If you measure the same variable twice, do you get the same reading?
What type of variance are we likely to have?
Sampling variance depends on both the size of the population and of the sample. We have a sample of size 20 from a population of 300 - this is a relatively small sample!
We will formalize how to measure sampling variance shortly, but the importance of a large sample, as opposed to a random sample, is often overstated!
Model-Based Inference: Random variables \(X_{1},...,X_{n}\) are iid draws from CDF F
Model based because we assume some probability model F
Example: \(X_{i} = 1\) if respondent \(i\) votes and 0 otherwise
The iid assumption is justified by assuming a random sample from an (effectively) infinite population.
Goal: Learn about features of the population
Parameter: \(\theta\) is any function of the population CDF F
Some common parameters
\(\mu = E[X_{i}]\)
\(\sigma^{2} = Var[X_{i}]\)
\(\mu_{y} - \mu_{x} = E[Y_{i}] - E[X_{i}]\) The difference in means between groups
Point estimation provides a single best guess about these parameters
An Estimator \(\hat{\theta}_{n}\) for some parameter \(\theta\) is a statistic intended as a guess about \(\theta\)
\(\hat{\theta}_{n}\) is a random variable, because it is a function of the random sample \(X_{1},...,X_{n}\)
Implication: \(\hat{\theta}_{n}\) has a distribution, expectation, variance, etc
An Estimate is one particular realization of the estimator.
For example, “my estimator was the sample mean, and my estimate was 0.6”
We could use many possible estimators for the population expectation \(E[X_{i}]\)
\(\hat{\theta}_{n} = \bar{X}_{n}\) the sample mean
\(\hat{\theta}_{n} = X_{1}\) just use the first observation
\(\hat{\theta}_{n} = \max(X_{1},...,X_{n})\) use the largest observation
\(\hat{\theta}_{n} = 0\) always guess 0
Clearly, some estimators are better guesses of the population parameter than others!
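A quick simulation makes this concrete: draw many Bernoulli samples (matching the binary turnout example above) and compare the four candidate estimators by how far each lands from the truth on average. The parameter values are made up for illustration:

```r
# Simulate the sampling distribution of each candidate estimator of
# E[X_i] = 0.4 for a Bernoulli(0.4) population
set.seed(123)
mu <- 0.4
reps <- 5000
n <- 25

est_mean  <- replicate(reps, mean(rbinom(n, 1, mu)))  # sample mean
est_first <- replicate(reps, rbinom(n, 1, mu)[1])     # first observation only
est_max   <- replicate(reps, max(rbinom(n, 1, mu)))   # sample maximum
est_zero  <- rep(0, reps)                             # always guess 0

# Mean squared error of each estimator around the true parameter
mse <- sapply(
  list(mean = est_mean, first = est_first, max = est_max, zero = est_zero),
  function(e) mean((e - mu)^2)
)
round(mse, 4)
```

The sample mean comes out far ahead; the maximum is badly biased upward, and the constant guess of 0 has zero variance but large bias.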
Population Distribution: the data-generating process
Empirical Distribution: \(X_{1},....,X_{n}\)
Sampling Distribution: Distribution of the estimator over repeated samples from the population distribution
# Set seed for reproducibility
set.seed(30317)
# Create a population of size 10000 with a mean of 0.4
# Using probability of success = 0.4 and number of trials = 1 for a binary outcome
population <- rbinom(10000, size = 1, prob = 0.4)
# Take 3 draws of size 30 from the population
sample1 <- sample(population, 30)
sample2 <- sample(population, 30)
sample3 <- sample(population, 30)
# Display the sample means
list(mean(sample1), mean(sample2), mean(sample3))
[[1]]
[1] 0.3
[[2]]
[1] 0.1666667
[[3]]
[1] 0.4333333
Parametric Models: Assume X ~ F, iid, and specify what family of distributions F is from
Example: \(F \sim \text{Binomial}(1000, 0.4)\)
We often construct our estimator using maximum likelihood
Inferences are model dependent - and our model may be wrong
Non-Parametric Models: Minimal or no assumptions about F
Plug-in principle: Replace F with the empirical distribution
Variance:
\[ \sigma^{2} = E[(X_{i} - E[X_{i}])^{2}] \rightarrow \widehat{\sigma}^{2} = \frac{1}{n}\sum_{i = 1}^{n}(X_{i} - \bar{X}_{n})^{2} \]
Covariance:
\[ \sigma_{x,y} = Cov[X_{i}, Y_{i}] = E[(X_{i} - E[X_{i}])(Y_{i} - E[Y_{i}])] \rightarrow \\ \widehat{\sigma}_{x,y} = \frac{1}{n}\sum_{i=1}^{n}(X_{i} - \bar{X}_{n})(Y_{i} - \bar{Y}_{n})\]
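The plug-in estimators can be computed directly. One caveat worth flagging: base R's `var()` and `cov()` divide by \(n - 1\), not \(n\), so the plug-in versions differ by a factor of \((n-1)/n\). A minimal sketch with simulated data:

```r
# Plug-in estimates of variance and covariance (dividing by n)
set.seed(7)
n <- 200
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

var_hat <- mean((x - mean(x))^2)                 # plug-in variance
cov_hat <- mean((x - mean(x)) * (y - mean(y)))   # plug-in covariance

# Relationship to R's n - 1 versions
all.equal(var_hat, var(x) * (n - 1) / n)   # TRUE
```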
We only get one draw from the sampling distribution, \(\widehat{\theta_{n}}\)
Two ways to evaluate estimators
Finite Sample Properties: The properties of a sampling distribution for a fixed sample size n.
Large (or infinite) Sample Properties: The properties of an estimator as \(n \to \infty\)
The bias of an estimator \(\widehat{\theta}\) for parameter \(\theta\) is
\[ \text{bias}[\widehat{\theta}] = E[\widehat\theta] - \theta \]
An estimator is unbiased if \(E[\hat{\theta}]\) - \(\theta\) = 0.
Is the sample mean unbiased?
\[ E[\bar{X}_{n}] = \frac{1}{n}\sum_{i=1}^{n}E[X_{i} ] = \frac{1}{n}\sum_{i = 1}^{n} \mu = \mu\]
Sampling Variance: The variance of an estimator Var[\(\widehat{\theta}\)].
Sampling Variance of the sample mean:
\[ Var[{\bar{X}_{n}}] = \frac{1}{n^{2}}\sum_{i=1}^{n} Var[X_{i}] = \frac{1}{n^{2}}\sum_{i=1}^{n}\sigma^{2} = \frac{\sigma^{2}}{n} \]
Standard Error: Standard Deviation of the Estimator \(se(\widehat{\theta}) = \sqrt{Var(\widehat{\theta})}\)
Which means that the standard error of the sample mean is \(\frac{\sigma}{\sqrt{n}}\)
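This is easy to check by simulation: draw many samples, compute the sample mean for each, and compare the empirical standard deviation of those means to \(\sigma/\sqrt{n}\). The parameter values below are made up for illustration:

```r
# Check Var[X-bar] = sigma^2 / n by simulating the sampling distribution
set.seed(11)
sigma <- 2
n <- 50
reps <- 10000

xbars <- replicate(reps, mean(rnorm(n, mean = 0, sd = sigma)))

c(empirical_se = sd(xbars), theoretical_se = sigma / sqrt(n))
```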
Mean Squared Error (MSE) is
\[ E[(\widehat{\theta}_{n} - \theta)^{2}] \]
One way to estimate the quality of an estimator.
How large are the squared deviations from the true parameter? Lower = better!
Also a key metric to measure model performance in machine learning.
Useful Decomposition:
\[ \text{MSE} = \text{bias}[\widehat{\theta}_{n}]^{2} + Var[\widehat{\theta}_{n}] \]
Therefore - For unbiased estimators, MSE is just the sampling variance
Of course we want low bias! Often, we want the best unbiased estimator. But, we might accept some bias for large reductions in variance. This can give us a better overall MSE.
Examples include Regularization, which increases bias and decreases variance in an effort to avoid overfitting the model to the data.
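A toy simulation illustrates the tradeoff. Shrinking the sample mean toward zero (the 0.8 multiplier below is a hypothetical shrinkage factor, not a method from the source) adds bias but cuts variance, and here the net effect is a lower MSE:

```r
# Bias-variance tradeoff: a (hypothetical) shrinkage estimator vs. the
# unbiased sample mean
set.seed(21)
mu <- 0.3
n <- 10
reps <- 20000

xbar   <- replicate(reps, mean(rnorm(n, mean = mu)))
shrunk <- 0.8 * xbar   # biased, but lower variance

mse_xbar   <- mean((xbar - mu)^2)
mse_shrunk <- mean((shrunk - mu)^2)
c(unbiased = mse_xbar, shrunk = mse_shrunk)
```

Whether shrinkage helps depends on the true parameter and the variance; when the truth is far from the shrinkage target, the bias term dominates and the unbiased estimator wins.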
Imagine that we actually had access to an accurate record for all students enrolled in the class from the earlier example. Think of it as a census of the student population.
What if wanting to sit in the front row is caused by some hidden/latent trait like enthusiasm?
We have reason to suspect that students in the front row are more enthusiastic. What sort of bias might that introduce?
# Load libraries
library(dplyr)
library(ggplot2)
library(infer)
library(patchwork)
# Generate sample data (seed set before sampling so results are reproducible;
# pop_eager is the class population data frame, defined elsewhere)
set.seed(40211)
samp_1 <- pop_eager %>%
  slice_sample(n = 18, replace = FALSE, weight_by = eagerness)
many_samps <- samp_1 %>%
  mutate(replicate = 1)
for (i in 2:500) {
  many_samps <- pop_eager %>%
    slice_sample(n = 18, replace = FALSE, weight_by = eagerness) %>%
    mutate(replicate = i) %>%
    bind_rows(many_samps)
}
# Define a custom theme for all plots
custom_theme <- theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 10)
  )
# Plot Sample 1
p1 <- many_samps %>%
  filter(replicate == 1) %>%
  ggplot(aes(x = year)) +
  geom_bar(fill = "#4B0082", color = "black", width = 0.7) +
  labs(title = "Sample 1", x = "Year", y = "Count") +
  custom_theme
# Plot Sample 2
p2 <- many_samps %>%
  filter(replicate == 2) %>%
  ggplot(aes(x = year)) +
  geom_bar(fill = "#6A0DAD", color = "black", width = 0.7) +
  labs(title = "Sample 2", x = "Year", y = "Count") +
  custom_theme
# Plot Sample 3
p3 <- many_samps %>%
  filter(replicate == 3) %>%
  ggplot(aes(x = year)) +
  geom_bar(fill = "#8A2BE2", color = "black", width = 0.7) +
  labs(title = "Sample 3", x = "Year", y = "Count") +
  custom_theme
# Sampling Distribution of Means
many_xbars <- many_samps %>%
  group_by(replicate) %>%
  summarize(xbar = mean(as.numeric(year)))
p4 <- many_xbars %>%
  ggplot(aes(x = xbar)) +
  geom_histogram(fill = "purple", color = "black", bins = 15) + # Histogram of sample means
  lims(x = c(0, 4)) +
  labs(title = "Sampling Distribution", x = "Mean Year", y = "Frequency") +
  custom_theme
# Arrange all plots in a grid using patchwork
(p1 + p2 + p3) / p4
Survey sampling is a type of design-based inference where we take a finite sample from a population of size N.
A simple random sample of size n from a finite population is a sample in which the probability of inclusion of each unit is
\[ \pi = \frac{n}{N} \]
and we let \(I_{i} \in \{0,1\}\) be an indicator of whether population unit \(i\) is included in the sample
More complex sampling designs lead to different inclusion probabilities and design inferences
Estimand: Population mean \(\bar{x} = \frac{1}{N}\sum_{i=1}^{N}x_{i}\)
Fixed quantity because pop is fixed and finite
But we never observe it (short of a true census)
Estimator: sample mean \(\bar{X} = \frac{1}{n}\sum_{i=1}^{N} I_{i}x_{i}\)
Design Based Inference - Randomness comes from sample alone, and depends on the sampling design
Variance of \(\bar{X}_{n}\) over repeated samples
\[ Var[\bar{X}_{n}] = \left(1 - \frac{n}{N}\right) \frac{s^{2}}{n} \]
What is \(s^{2}\)? It’s the population variance
\[ s^{2} = \frac{1}{N-1} \sum_{i=1}^{N}(x_{i} -\bar{x})^{2} \]
We can apply the plug-in principle from before, and use the sample variance \(S^{2}\)
\[ \widehat{Var}[\bar{X}_{n}] = (1 - \frac{n}{N})\frac{S^{2}}{n} \\ S^{2} = \frac{1}{n-1}\sum_{i=1}^{N}I_{i}(x_{i}-\bar{X_n})^{2} \]
This is an unbiased estimate of the variance of the sample mean
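The estimate is a one-liner once we have a sample. This sketch reuses the class-survey setting from earlier (N = 300, n = 20) with a simulated toy population:

```r
# Estimated variance of the sample mean under SRS, with the finite
# population correction (1 - n/N)
set.seed(5)
N <- 300
n <- 20
pop  <- rbinom(N, 1, 0.4)   # a toy finite population
samp <- sample(pop, n)      # simple random sample without replacement

S2      <- var(samp)                  # sample variance (divides by n - 1)
var_hat <- (1 - n / N) * S2 / n       # FPC-adjusted variance estimate
se_hat  <- sqrt(var_hat)
c(var_hat = var_hat, se_hat = se_hat)
```

The correction factor \(1 - n/N\) shrinks the usual \(S^{2}/n\) estimate; it vanishes as the sample approaches a census.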
Often, we have unequal sampling probabilities
Some groups may be hard to reach
Low trust individuals
Non-English speakers
Let \(\pi_{i}\) be the probability of an individual from a specific group being sampled
\[ \tilde{X}_{ipw} = \frac{\sum_{i=1}^{N} I_{i}x_{i}/\pi_{i}}{\sum_{i=1}^{N} I_{i}/\pi_{i}} \]
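A sketch of this inverse-probability-weighted estimator with simulated data; the two groups, their means, and the inclusion probabilities below are made up for illustration:

```r
# IPW estimator: weight each sampled unit by 1 / pi_i
set.seed(9)
N <- 10000
x  <- c(rnorm(N / 2, mean = 4), rnorm(N / 2, mean = 6))  # two groups, different means
pi <- rep(c(0.30, 0.05), each = N / 2)                   # second group is hard to reach

I <- rbinom(N, 1, pi)   # inclusion indicators

# Unweighted sample mean vs. IPW estimate of the population mean
naive <- mean(x[I == 1])
ipw   <- sum(I * x / pi) / sum(I / pi)
c(naive = naive, ipw = ipw, truth = mean(x))
```

Because the hard-to-reach group has a higher mean, the unweighted sample mean is biased downward; reweighting by inverse inclusion probabilities pulls the estimate back toward the population mean.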