Often, when doing research, you end up with data from multiple sources (or in different files from the same source) and you need to merge them.
Broadly you might need to
bind_rows - combine two datasets with the same columns covering different time periods or extra observations
left_join - merge new measurements/variables
Important to make sure your merge works as expected
may require harmonizing data types or renaming variables
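As a rough illustration of the two operations above, here is a minimal sketch using dplyr. All data frames, column names, and numbers below are made up for illustration (they are not the course data), and the mismatch in column name and type is an assumed example of the kind of problem you may need to fix.

```r
library(dplyr)

# Hypothetical toy data: unemployment split across two files (made-up numbers)
unemp_a <- data.frame(year = c(2019, 2020), unemp = c(3.7, 8.1))
unemp_b <- data.frame(Year = c("2021", "2022"), unemp = c(5.4, 3.6),
                      stringsAsFactors = FALSE)

# Harmonize before binding: same column name and same type in both pieces
unemp_b <- unemp_b %>%
  rename(year = Year) %>%
  mutate(year = as.numeric(year))

# Stack the two time periods on top of each other
unemp <- bind_rows(unemp_a, unemp_b)

# Hypothetical second source with a new variable (made-up numbers)
gdp <- data.frame(year = c(2019, 2020, 2021), gdp_growth = c(2.3, -2.8, 5.9))

# Add the new variable; every row of unemp is kept, unmatched years get NA
merged <- left_join(unemp, gdp, by = "year")

# Check that the merge worked as expected
nrow(merged) == nrow(unemp)          # a left join should not drop rows here
anti_join(unemp, gdp, by = "year")   # rows of unemp with no match in gdp
```

Without the `rename()` step, `bind_rows()` would keep `year` and `Year` as separate columns padded with `NA`; without the type conversion it would stop with an error about incompatible types.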
Try it out
Download the data from the course site
Bind the rows of the unemployment data and then join it with the GDP data
You will likely run into error messages; look at the data and adjust as needed
Plot the relationship between GDP and Unemployment (any type of plot is fine)
Where are we now?
We have now discussed how univariate distributions generalize to multivariate distributions
Joint, Marginal and Conditional Distributions
Covariance and Correlation
We will discuss a very important quantity today - the conditional expectation
Then we will discuss how to estimate features of a population from a sample (Finally!)
Conditional Expectation
The conditional expectation of (Y|X) is: \[
\mu(\mathbf{x}) = \mathbb{E}[Y \mid \mathbf{X} = \mathbf{x}] =
\begin{cases}
\sum_{y} y \, P(Y = y \mid \mathbf{X} = \mathbf{x}) & \text{discrete } Y \\
\int_{-\infty}^{\infty} y \, f_{Y \mid \mathbf{X}}(y \mid \mathbf{x}) \, dy & \text{continuous } Y
\end{cases}
\]
This is the expected value of Y given X = x
Can be viewed as a function of x, in which case we call it the conditional expectation function (CEF)
The CEF tells us how the average value of Y changes for different values of X
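When X takes only a few values, the CEF can be estimated by averaging Y within each value of X. The sketch below uses simulated data; the variable meanings and the data-generating process are assumptions chosen for illustration.

```r
set.seed(1)
n <- 10000

# Hypothetical data: X takes three values, and the mean of Y depends on X
x <- sample(c(10, 12, 16), n, replace = TRUE)
y <- 5 + 2 * x + rnorm(n, sd = 3)   # by construction, E[Y | X = x] = 5 + 2x

# Sample analogue of the CEF: the mean of Y within each value of X
cef_hat <- tapply(y, x, mean)
cef_hat
```

Each entry of `cef_hat` should be close to `5 + 2 * x` for the corresponding value of x, which is the true CEF in this simulation.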
Why does conditioning matter?
Fred is a 30 year old man. If the average life expectancy in Fred’s country is 80 years, should Fred conclude that he has 50 years (on average) of life expectancy remaining?
No! We have some good news for Fred, which is that if we let T be Fred’s lifespan,
\[
E[T] < E[T \mid T \geq 30]
\]
Of course we can (and will) be able to get even better estimates for Fred’s lifespan if we condition on other variables (Where does he live? What does he do for work? Does he smoke?)
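One way to see why the inequality holds is to split the unconditional expectation by whether a person survives to age 30. This is a sketch that assumes some people die before 30, so that \(P(T < 30) > 0\).

```latex
\begin{align*}
E[T] &= E[T \mid T \geq 30]\, P(T \geq 30) + E[T \mid T < 30]\, P(T < 30) \\
     &< E[T \mid T \geq 30]\, P(T \geq 30) + E[T \mid T \geq 30]\, P(T < 30) \\
     &= E[T \mid T \geq 30]
\end{align*}
```

The strict inequality uses \(E[T \mid T < 30] < 30 \leq E[T \mid T \geq 30]\): the unconditional average includes people who died young, and Fred already knows he is not one of them.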
Two Envelope Problem
You are on a game show! Everyone knows how to solve the Monty Hall problem, so the host has changed the puzzle. There are two envelopes, one of which has twice as much money as the other. You can open one envelope, and then choose whether to switch.
You open the envelope, and it has $100 in it. Should you switch?
By the symmetry of the setup, the expectation of each envelope is equal. But, if your envelope has $100 in it, doesn’t that mean the other envelope has equal probability of having either $50 or $200? The expectation would then be $125, so you should switch! Or should you…?
Simulating Sticking or Switching
    # Load required libraries
    library(ggplot2)

    # Simulation parameters
    set.seed(30317)           # For reproducibility (my zip code)
    n_simulations <- 100000   # Number of simulations

    # Function to simulate one round of the game
    simulate_game <- function() {
      # Randomly pick a base amount
      X <- sample(1:100, 1) * 10  # Random amount between 10 and 1000

      # Assign amounts to envelopes
      envelope1 <- X
      envelope2 <- 2 * X

      # Randomly assign which envelope is picked first
      envelopes <- sample(c(envelope1, envelope2))
      first_choice <- envelopes[1]   # Amount in the first envelope chosen
      second_choice <- envelopes[2]  # Amount in the other envelope

      # Outcomes based on sticking vs switching
      stick_value <- first_choice
      switch_value <- second_choice
      return(c(stick_value, switch_value))
    }

    # Run the simulation (returns a 2 x n_simulations matrix)
    results <- replicate(n_simulations, simulate_game())

    # Convert results to a data frame
    # as.vector() reads the matrix column by column (stick, switch, stick, ...),
    # so the strategy labels must alternate rather than come in two blocks
    results_df <- data.frame(
      Strategy = rep(c("Stick", "Switch"), times = n_simulations),
      Value = as.vector(results)
    )

    # Plot the results
    ggplot(results_df, aes(x = Strategy, y = Value, fill = Strategy)) +
      geom_boxplot() +
      labs(title = "Simulation Results for Two Envelopes Game",
           x = "Strategy",
           y = "Amount in Envelope") +
      theme_minimal()
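To put numbers on the comparison, the sketch below re-implements the same game in vectorized form and compares average winnings under each strategy. It assumes the same setup as the simulation above (a base amount that is a multiple of 10 between 10 and 1000); it will not reproduce the exact same draws, and the variable names are illustrative.

```r
set.seed(30317)
n_simulations <- 100000

# Base amount in the smaller envelope (same assumed range as the simulation above)
base <- sample(1:100, n_simulations, replace = TRUE) * 10

# TRUE if the first envelope picked is the smaller one
picked_small <- sample(c(TRUE, FALSE), n_simulations, replace = TRUE)

# Payoff from keeping the first envelope vs taking the other one
stick_value  <- ifelse(picked_small, base, 2 * base)
switch_value <- ifelse(picked_small, 2 * base, base)

# Average winnings under each strategy
c(stick = mean(stick_value), switch = mean(switch_value))
```

The two averages should be nearly identical: before you look inside, switching gains nothing, in line with the symmetry argument.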
Coinflip Problem
I saw an interesting example of a problem that we can solve with conditional expectation go viral last spring:
Suppose a policymaker is interested in assessing the impact of a new education program on future earnings. This program, aimed at low-income students, provides additional resources, tutoring, and counseling services to improve educational outcomes. To evaluate the effect, they want to know the expected earnings of individuals who participated in the program.
Let Y represent future earnings.
Let P be an indicator of participation in the program (1 if participated, 0 if not).
Calculating Expectation
Assume 30% of individuals complete the program, and
Foreshadowing next semester: Did the program have a causal effect on student earnings? Or, what assumptions would we need to make?
Properties of the CEF
\(E[g(X)Y|X] = g(X)E[Y|X]\) for any function g(X)
If X and Y are independent RVs then
\[
E[Y|X = x] = E[Y]
\]
If X and Y are conditionally independent given Z, then
\[
E[Y|X = x, Z=z] = E[Y|Z=z]
\]
Linearity
\[
E[Y + X|Z] = E[Y|Z] + E[X|Z]
\]
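Two of these properties can be checked numerically with group means, since within-group sample averages obey the same algebra. This is a sketch on simulated data; the distributions and the function `g` are arbitrary choices for illustration.

```r
set.seed(2)
n <- 5000

# Linearity: E[Y + X | Z] = E[Y | Z] + E[X | Z], with a discrete Z
z <- sample(1:3, n, replace = TRUE)
x <- z + rnorm(n)
y <- 2 * z + rnorm(n)

lhs_linear <- as.vector(tapply(y + x, z, mean))
rhs_linear <- as.vector(tapply(y, z, mean) + tapply(x, z, mean))
all.equal(lhs_linear, rhs_linear)

# Taking out what is known: E[g(X) Y | X] = g(X) E[Y | X], with a discrete X
xd <- sample(1:3, n, replace = TRUE)
yd <- xd^2 + rnorm(n)
g  <- function(v) 3 * v + 1   # any function of X works here

x_values  <- sort(unique(xd))  # tapply() orders groups the same way
lhs_known <- as.vector(tapply(g(xd) * yd, xd, mean))
rhs_known <- g(x_values) * as.vector(tapply(yd, xd, mean))
all.equal(lhs_known, rhs_known)
```

Both comparisons hold up to floating-point error, because within a group `g(X)` is a constant and can be pulled out of the average.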
CEF Error
We can also write down a measure of the prediction error of the CEF
\[
\epsilon = Y - E[Y|X]
\]
It has the following properties
\(E[\epsilon|X] = 0\) (think about why)
\(E[\epsilon] = 0\)
We won’t cover the matrix algebra, but E[Y|X] is the projection of Y onto a plane representing all functions of X, where E[Y|X] is the function of X that is closest to Y. (See Theorem 9.3.9 in B&H)
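For reference, the two properties of the CEF error can be derived in a few lines. The sketch uses linearity, the fact that \(E[Y|X]\) is itself a function of X, and the law of iterated expectations \(E[\epsilon] = E[E[\epsilon|X]]\).

```latex
\begin{align*}
E[\epsilon \mid X] &= E\big[\,Y - E[Y \mid X] \,\big|\, X\,\big] \\
                   &= E[Y \mid X] - E\big[\,E[Y \mid X] \,\big|\, X\,\big] \\
                   &= E[Y \mid X] - E[Y \mid X] = 0 \\[6pt]
E[\epsilon]        &= E\big[\,E[\epsilon \mid X]\,\big] = E[0] = 0
\end{align*}
```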
Conditional Expectation as Best Predictor
Something we often want to do is to predict Y given some X
we can use any function of X, g(X) to do so
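The mean squared error (MSE) for our predictions is \(E[(Y - g(X))^2]\), and it is minimized by choosing \(g(X) = E[Y|X]\)
The variance of Y can be decomposed as \(Var[Y] = E[Var[Y|X]] + Var[E[Y|X]]\), which can be conceptualized as within group variation \(E[Var[Y|X]]\) and between group variation \(Var[E[Y|X]]\)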
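A small simulation can illustrate the best-predictor idea by comparing the MSE of the true CEF against two other functions of X. The data-generating process below is an assumption chosen so that the CEF is known to be \(x^2\).

```r
set.seed(3)
n <- 100000

x <- runif(n, 0, 3)
y <- x^2 + rnorm(n)   # by construction, E[Y | X = x] = x^2

# Mean squared error of a vector of predictions for y
mse <- function(pred) mean((y - pred)^2)

mse_cef    <- mse(x^2)                   # predict with the true CEF
mse_linear <- mse(fitted(lm(y ~ x)))     # predict with the best straight line
mse_const  <- mse(mean(y))               # ignore X and predict the overall mean

c(cef = mse_cef, linear = mse_linear, constant = mse_const)
```

The CEF has the smallest MSE, roughly equal to the noise variance of 1, which is the part of Y that no function of X can predict.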
Height Example
Suppose we have the heights (in centimeters) of individuals drawn from three different countries:
Country 1: 160, 162, 158, 161
Country 2: 170, 168, 172, 171
Country 3: 180, 179, 181, 182
What is \(Var[height]\)? What is the within group variation? What is the between group variation?
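One way to check your answers is to compute the three quantities directly. This sketch treats the 12 people as the whole population, so variances divide by n rather than n - 1 (note that R's built-in `var()` divides by n - 1); that choice is an assumption, and it is what makes the decomposition hold exactly.

```r
heights <- data.frame(
  country = rep(c("Country 1", "Country 2", "Country 3"), each = 4),
  height  = c(160, 162, 158, 161,
              170, 168, 172, 171,
              180, 179, 181, 182)
)

# Population variance: average squared deviation from the mean (divides by n)
pop_var <- function(v) mean((v - mean(v))^2)

group_means <- tapply(heights$height, heights$country, mean)
group_vars  <- tapply(heights$height, heights$country, pop_var)

total_var   <- pop_var(heights$height)
within_var  <- mean(group_vars)       # E[Var[height | country]]; groups are equal-sized
between_var <- pop_var(group_means)   # Var[E[height | country]]

c(total = total_var, within = within_var, between = between_var)
```

The within and between components add up to the total, and most of the variation in height here comes from differences between countries.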
Skedasticity
Homoskedasticity means the conditional variance of Y does not depend on X, such that for all x, \(\sigma^{2}(x) = Var[Y|X = x] = \sigma^{2}\)
Unless we show that our data is homoskedastic, we should assume that it is instead heteroskedastic. This will matter for calculating error terms for regressions!
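To make the distinction concrete, the sketch below simulates one outcome whose spread is constant in X and one whose spread grows with X, then compares the conditional variances. Both data-generating processes are assumptions for illustration.

```r
set.seed(4)
n <- 20000

x <- sample(1:4, n, replace = TRUE)

# Same conditional mean in both cases; only the spread around it differs
y_homo   <- 1 + 2 * x + rnorm(n, sd = 1)   # Var[Y | X = x] = 1 for every x
y_hetero <- 1 + 2 * x + rnorm(n, sd = x)   # Var[Y | X = x] = x^2

# Sample variance of Y within each value of X
cond_var <- data.frame(
  x          = sort(unique(x)),
  var_homo   = as.vector(tapply(y_homo, x, var)),
  var_hetero = as.vector(tapply(y_hetero, x, var))
)
cond_var
```

The `var_homo` column stays close to 1 at every x, while `var_hetero` rises with x, which is the pattern heteroskedasticity describes.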