Problem Set will be posted online EOD Wednesday 9/18
Due Sunday 9/29 (11:59:59). Upload to Blackboard
Theoretical and Analytical Questions can be word or PDF
Please also submit code as an .rmd or .qmd.
Please Please Please - don’t use AI beyond what is specified in syllabus
I am willing to review code once. Will not give full answers, but will point in right direction.
Final Paper:
Topic of your choosing
Analysis Plan + Data Analysis
Topic by 10/7 - 1 page memo identifying data source + importance of project
Must schedule a meeting with me by 10/7 to discuss and make sure of feasibility
Take Home Mid-Term week of 10/14 (No class, Holiday)
Conditional probability is the heart of modern statistics and social science. IMO - getting a firm grasp on conditional probability is the most important thing we do this semester!
Unconditional Probability: What is the probability that A occurs?
Conditional Probability: If we know B has occurred, what is the probability that A occurs (not necessarily sequential)?
We condition our estimate of A on B having occurred.
Could spend a whole semester on conditional probability and fun examples.
Consider a standard 52 card deck. Let A be the event of drawing a Spade and B be the event of drawing a red card.
What is P(A)?
What is P(B)?
What is P(A|B)?
What is P(B|A)?
If \(P(B) > 0\), then we define the conditional probability of A given B as:
\[ P(A|B) = \frac{P(A \cap B)}{P(B)} \]
How often A and B jointly occur, divided by how often B occurs. Why do we need to divide by B?
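The card questions above can be checked by brute-force counting. This is a minimal sketch, assuming every card is equally likely to be drawn; the helper name `prob` is made up for illustration. Restricting the count to cards where B is true is the same as dividing by \(P(B)\).

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["Spades", "Hearts", "Diamonds", "Clubs"]
deck = list(product(ranks, suits))  # 52 equally likely (rank, suit) cards


def is_spade(card):
    return card[1] == "Spades"


def is_red(card):
    return card[1] in ("Hearts", "Diamonds")


def prob(event, given=lambda card: True):
    """P(event | given): count inside the cards where `given` is true."""
    conditioning = [c for c in deck if given(c)]
    joint = [c for c in conditioning if event(c)]
    return Fraction(len(joint), len(conditioning))


p_a = prob(is_spade)                        # P(A) = 1/4
p_b = prob(is_red)                          # P(B) = 1/2
p_a_given_b = prob(is_spade, given=is_red)  # P(A|B) = 0, no spade is red
p_b_given_a = prob(is_red, given=is_spade)  # P(B|A) = 0
print(p_a, p_b, p_a_given_b, p_b_given_a)
```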
What do we think \(P(\text{Policy Nerd} | \text{POST PhD Student})\) is?
What about \(P(\text{POST PhD Student}|\text{Policy Nerd})\)?
It is often assumed that \(P(\text{POST PhD Student}|\text{Policy Nerd})\) should also be high. This is referred to as the base rate fallacy. With cohorts under 10, it simply cannot be high!
Let A = Presence of clouds and B = Rain
What is \(P(A|B)\)? (Don’t over-think it - we don’t need math here!)
Does that mean \(P(B|A)\) is 1 too?
What about \(P(B|A^{c})\)?
Choose one senator at random from this population. What is the probability a randomly selected Democrat is a woman?
What is the probability that a randomly selected woman is a Republican?
Conditional Probabilities are valid probability functions
All the Axioms of probability are satisfied
P(A|A) = 1
But, why isn’t this true?\[P(A|B \cup C) = P(A|B) + P(A|C)\]
The probability of the intersection of two events
If we think through conditional prob definition, it implies\[P(A,B) = P(A)P(B|A) = P(B)P(A|B)\]
We can generalize to joint probability for arbitrarily many events\[P(A_{1},...,A_{n}) = P(A_{1})P(A_{2}|A_{1})P(A_{3}|A_{1},A_{2}) \cdots P(A_{n}|A_{1},...,A_{n-1})\]
You may have heard there is an election in 2024. Suppose we know the proportion of Trump and Harris supporters in each city in Georgia.
How can we use this information to work out state wide support for each candidate?
All of the cities together make up a partition of the state
In technical terms, a partition is a set of mutually disjoint events whose union makes up the sample space.
The law of total probability says that if \(A_{1}, ... , A_{k}\) is a partition\[P(B) = \sum_{j = 1}^{k}P(B|A_{j})P(A_{j})\]
In practical terms, what does this mean?
How do we use this to work out the probability of a random Georgia voter supporting Harris?
Imagine Georgia has 3 cities, Atlanta, Helen, and Macon.
P(Harris|Atlanta) = 0.60, P(Harris|Helen) = 0.1, P(Harris|Macon) = 0.15.
But, we need to consider populations. Atlanta has 500,000 voters, Macon has 80,000 voters and Helen has 20,000 voters.
How do we put this all together with LoTP?
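One way to put the pieces together is to treat each city's share of voters as \(P(A_{j})\) and weight the city-level support by it. A minimal sketch, assuming the three cities are the whole state as in the toy example above; the variable names are made up for illustration.

```python
# Toy numbers from the example: three cities partition the state.
voters = {"Atlanta": 500_000, "Macon": 80_000, "Helen": 20_000}
p_harris_given_city = {"Atlanta": 0.60, "Macon": 0.15, "Helen": 0.10}

total_voters = sum(voters.values())

# P(city): probability a randomly chosen voter lives in each city.
p_city = {city: n / total_voters for city, n in voters.items()}

# Law of total probability: P(Harris) = sum_j P(Harris | city_j) * P(city_j)
p_harris = sum(p_harris_given_city[city] * p_city[city] for city in voters)

print(round(p_harris, 4))  # 0.5233
```

Atlanta dominates the result because it holds five sixths of the voters, even though Harris support is low in the other two cities.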
The Smiths have two children. The older child is a girl. What is the probability that both children are girls (assume gender/sex is binary here for simplification, and birth rates are equal for both sexes)?
1/2 - why?
The Smiths have two children. At least one of them is a boy. What is the probability that both children are boys?
1/3 - why?
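Both answers can be checked by listing the four equally likely families and throwing away the ones ruled out by what we were told. A minimal sketch under the stated simplifications (binary sex, equal birth rates, independent births); `cond_prob` is a made-up helper name.

```python
from fractions import Fraction
from itertools import product

# Each family is (older child, younger child); all four are equally likely.
families = list(product("GB", repeat=2))


def cond_prob(event, given):
    """P(event | given) by counting among families consistent with `given`."""
    kept = [f for f in families if given(f)]
    return Fraction(sum(1 for f in kept if event(f)), len(kept))


# "The older child is a girl" leaves GG and GB.
p_both_girls = cond_prob(lambda f: f == ("G", "G"), lambda f: f[0] == "G")

# "At least one is a boy" leaves GB, BG and BB.
p_both_boys = cond_prob(lambda f: f == ("B", "B"), lambda f: "B" in f)

print(p_both_girls, p_both_boys)  # 1/2 1/3
```

The two statements sound similar but condition on different events, so they leave different numbers of families in play.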
Possibly the most famous conditional probability problem.
Imagine you are a contestant on a game show. You can choose between three doors, and receive the prize behind the door
Two doors have a goat behind them, one has a car. Monty knows where the car is, and opens one door such that he never reveals a car. You then have the option to switch doors
If you want to win a car, what should you do?
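The game above is small enough to enumerate exactly. This sketch assumes the car's location and your first pick are each uniform over the three doors, and that Monty chooses at random when he has two goat doors available; those assumptions are mine, not part of the problem statement.

```python
from fractions import Fraction

doors = (0, 1, 2)
p_win_stay = Fraction(0)
p_win_switch = Fraction(0)

for car in doors:
    for pick in doors:
        p_setup = Fraction(1, 9)  # car location and first pick, both uniform
        # Monty may only open a door that is neither your pick nor the car.
        options = [d for d in doors if d != pick and d != car]
        for opened in options:
            p_path = p_setup / len(options)
            # The door you would move to if you switch.
            switched = next(d for d in doors if d not in (pick, opened))
            if pick == car:
                p_win_stay += p_path
            if switched == car:
                p_win_switch += p_path

print(p_win_stay, p_win_switch)  # 1/3 2/3
```

Switching wins exactly when the first pick was a goat, which happens two times out of three.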
Imagine you are a public health regulator. A pharmaceutical rep comes to you with a fantastic new cancer screening test that can detect a deadly cancer early 99% of the time.
What’s more, it has a low false positive rate, only 3%.
The manufacturers want you to approve the screening test and recommend regular screenings for the public. What should you do?
Many (Most?) people would say go ahead and approve it, but this ignores the base rate (the base rate fallacy)
Fortunately, most people do not have cancer at any given point in time
Imagine that the population prevalence of the specific cancer is 1 in 1,000 at any given point in time.
Imagine we give a random person the cancer screening, and it comes back positive. What is the probability that they actually have the disease?
Reverend Thomas Bayes (1701-1761): English Minister and Statistician
Bayes Rule: if \(P(B) > 0\) then \[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]
We can expand this out to
\[P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|A^{c})P(A^{c})}\]
Denominator follows from LoTP (do you see why?)
We call the resulting P(A|B) our Posterior Probability
So, what is the probability of the patient actually having cancer?
We have some prior information
P(Cancer) = .001
P(Positive|Cancer) = 0.99
P(Positive|No Cancer) = 0.03
Apply the formula! Public Health decisions are complicated!
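Plugging the three numbers above into the expanded form of Bayes Rule can be sketched as follows. The function name `posterior` is made up for illustration; the inputs are exactly the prior, the detection rate, and the false positive rate listed above.

```python
def posterior(prior, p_pos_given_a, p_pos_given_not_a):
    """P(A | Positive) using the LoTP expansion of the denominator."""
    numerator = p_pos_given_a * prior
    evidence = numerator + p_pos_given_not_a * (1 - prior)
    return numerator / evidence


p_cancer_given_positive = posterior(
    prior=0.001,             # P(Cancer)
    p_pos_given_a=0.99,      # P(Positive | Cancer)
    p_pos_given_not_a=0.03,  # P(Positive | No Cancer)
)

print(f"{p_cancer_given_positive:.3f}")  # 0.032
```

So roughly 3 in 100 people who test positive actually have the disease, because the false positives from the large healthy population swamp the true positives.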
Imagine a murder is committed, and analysis of the crime scene shows that the murderer has a rare blood type shared by only 5% of the population.
Imagine a suspect is given a blood test - and is shown to share that blood type. The prosecutor argues that this establishes a 95% chance that the suspect is the murderer.
If you were on the jury - how would you evaluate that piece of evidence?
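To see why the prosecutor's 95% is suspect, we can run Bayes Rule with an explicit prior. This is only a sketch: the pool of 100,000 possible suspects is a hypothetical number chosen for illustration (it is not in the example), and it assumes that before the blood test everyone in the pool is equally likely to be the murderer.

```python
def p_guilty_given_match(n_possible_suspects, p_match_if_innocent=0.05):
    """P(Murderer | blood type matches) with a uniform prior over the pool."""
    prior = 1 / n_possible_suspects      # HYPOTHETICAL prior, see lead-in
    numerator = 1.0 * prior              # the murderer matches for certain
    evidence = numerator + p_match_if_innocent * (1 - prior)
    return numerator / evidence


# HYPOTHETICAL pool size of 100,000 people.
p = p_guilty_given_match(100_000)
print(f"{p:.5f}")
```

The 5% figure is \(P(\text{Match}|\text{Innocent})\), not \(P(\text{Innocent}|\text{Match})\). With a large pool, thousands of innocent people share the blood type, so the match alone is weak evidence.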
Recall that we determined that the probability of Paul selecting all 8 games correctly was .0025
But…what do we think the base rate of prophetic Octopi is?
What are some other possible uses of Bayes Rule?
Consider two doctors, Dr. Nick and Dr. Hibbert who both operate in Springfield. They each offer two types of surgeries: Heart Surgery and Band-Aid Removal. Each surgery can be a success or a failure.
Dr. Hibbert | Heart | Band-Aid |
---|---|---|
Success | 70 | 10 |
Failure | 20 | 0 |
Dr. Nick | Heart | Band-Aid |
---|---|---|
Success | 2 | 81 |
Failure | 8 | 9 |
Dr. Nick has a higher success rate (83%) than Dr. Hibbert (80%). Which doctor would you prefer to use?
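Before choosing, it helps to compute the success rates within each type of surgery as well as overall. A minimal sketch using the counts from the two tables above; the dictionary layout and names are made up for illustration.

```python
# (successes, failures) copied from the two tables above.
results = {
    "Dr. Hibbert": {"Heart": (70, 20), "Band-Aid": (10, 0)},
    "Dr. Nick": {"Heart": (2, 8), "Band-Aid": (81, 9)},
}


def rate(successes, failures):
    return successes / (successes + failures)


# Success rate conditional on the type of surgery.
by_surgery = {
    doc: {surgery: rate(*counts) for surgery, counts in table.items()}
    for doc, table in results.items()
}

# Overall success rate, pooling both surgery types.
overall = {
    doc: rate(
        sum(s for s, _ in table.values()),
        sum(f for _, f in table.values()),
    )
    for doc, table in results.items()
}

for doc in results:
    print(doc, "overall:", round(overall[doc], 3), by_surgery[doc])
```

Dr. Hibbert has the higher success rate for each surgery taken separately, yet Dr. Nick looks better overall because he mostly does the easy surgery. This reversal is commonly known as Simpson's paradox.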
Bayes Rule tells us how knowing B changes the probability of A.
Sometimes, knowing B tells us nothing about A!
Formally, two events A and B are independent (or A \(\perp\) B) if \(P(A \cap B)\) = \(P(A)P(B)\). Why??
Why is \(\perp\) an intuitive symbol for independence?
Independence is symmetric (A \(\perp\) B implies B \(\perp\) A)
If events are not independent, they are dependent
If \(A \perp B\) and P(B) > 0, then:
\[ P(A|B) = \frac{P(A \cap B)}{P(B)} \]
from the definition of conditional probability, and then…
\[ = \frac{P(A)P(B)}{P(B)} \]
And if we do a little algebra…
\[= P(A)\]
Imagine your friend has 10 fair coins she will flip one after the other, showing you the result each time.
She flips each of the first 9 coins. Each time, it comes up heads.
What is the probability that the 10th coin will come up heads? Why?
Instead, imagine your friend has 10 fair coins that she flips, and then chooses 9 coins to show you.
After flipping all the coins, she reveals 9 heads, reserving one already flipped coin.
What is the probability that the 10th coin is also heads? Why?
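The answer to the second version depends on how your friend picks which coins to show. The sketch below makes one specific assumption, which is mine and not part of the example: she looks at all 10 results and shows you 9 heads whenever at least 9 heads came up. Under that assumption, seeing 9 heads only tells you that the flips contained at least 9 heads.

```python
from fractions import Fraction
from itertools import product

# All 2**10 equally likely sequences of 10 fair coin flips.
flips = list(product("HT", repeat=10))

# ASSUMPTION: she reveals 9 heads whenever she is able to.
can_show_nine_heads = [f for f in flips if f.count("H") >= 9]

# The hidden coin is heads only if all 10 flips were heads.
all_heads = [f for f in can_show_nine_heads if f.count("H") == 10]

p_hidden_is_heads = Fraction(len(all_heads), len(can_show_nine_heads))
print(p_hidden_is_heads)  # 1/11
```

In the first version the 10th flip is independent of the first nine, so the answer is 1/2. Here her selection of which coins to reveal carries information about the hidden one.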
Are disjoint events independent - dependent - or does it depend?
Suppose that the current prevalence of COVID in South Carolina is 1.5%.
If we sample 20 random people, what is the likelihood that at least one has COVID?
What if the first person we sample has COVID - what is the probability at least one of the remaining people has COVID?
Note - sampling with replacement is independent. Sampling without replacement adds dependency.
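The COVID questions above can be worked through the complement: "at least one" is one minus "none". A minimal sketch, assuming each sampled person's status is independent of the others (reasonable when the population is very large relative to the sample).

```python
prevalence = 0.015
n_sampled = 20

# P(at least one of 20 has COVID) = 1 - P(none of the 20 has COVID)
p_at_least_one = 1 - (1 - prevalence) ** n_sampled

# Under independence, learning that the first person has COVID tells us
# nothing about the other 19, so we repeat the calculation with 19 people.
p_at_least_one_of_rest = 1 - (1 - prevalence) ** (n_sampled - 1)

print(round(p_at_least_one, 3), round(p_at_least_one_of_rest, 3))
```

The second number is slightly smaller only because there are 19 people left instead of 20, not because of anything we learned from the first person.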
Two events A and B are conditionally independent given C if \[P(A\cap B|C) = P(A|C)P(B|C)\]
This is a very important concept once we get to regression
Independence does not imply conditional independence
Consider undergrad admission to a prestigious public university in South Carolina that happens to have a good football team. Assume that high school GPA and football talent are independent in the general population.
This public university wants both good students and good athletes, so they value both GPA and football talent in their admissions process. Among admitted students, would we expect GPA and football talent to be independent?
Also closely related to collider bias
Consider a regression model (University GPA ~ SAT Score), showing no association between GPA and SAT scores
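The admissions example can be illustrated with a small simulation. Everything about the setup is an assumption made for illustration: GPA and football talent are drawn as independent standard normal scores, and a student is admitted when the sum of the two scores exceeds 1.5. The `pearson` helper is written out by hand so the sketch needs nothing beyond the standard library.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)  # fixed seed so the run is repeatable
n_students = 20_000

# ASSUMPTION: independent standardized scores in the general population.
gpa = [rng.gauss(0, 1) for _ in range(n_students)]
talent = [rng.gauss(0, 1) for _ in range(n_students)]

# ASSUMPTION: admission rule values both traits (sum above a cutoff).
admitted = [(g, t) for g, t in zip(gpa, talent) if g + t > 1.5]

r_population = pearson(gpa, talent)
r_admitted = pearson([g for g, _ in admitted], [t for _, t in admitted])

print("population:", round(r_population, 2), "admitted:", round(r_admitted, 2))
```

In the full population the correlation is close to zero, but among admitted students it is clearly negative: a student admitted with a low GPA must have had high football talent, and vice versa. Conditioning on admission creates a dependence that was not there before.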
First of all - I did not name these!
Random variables provide a link between probability and data
A Random Variable is a function that maps from the sample space to the real number line.
A numeric representation of uncertain events
Imagine an event A - the numeric value of that event is then X(A) where X is a random variable
Randomness comes from the randomness of the “experiment” or event - not from X
Each poll is an event where X(poll) is % support for Harris.
Imagine polls coming from a distribution centered around the true support.
In practice, we don’t know the true level of support. X is then the sample mean of support for Harris in each poll.
This is the core of what FiveThirtyEight and similar are doing for their election forecasts.
For any given ‘experiment’, there can be many different random variables
Imagine randomly sampling university students, where we measure their class (Freshman, Sophomore, Junior, Senior)
With 2 students, there are 10 possible outcomes (FF, FSo, FJ, FSe, SoSo, SoJ, SoSe, JJ, JSe, SeSe)
Random Variable could be number of Freshmen
Or..number of Juniors + Seniors
Or..number of non-Seniors
Discrete and Continuous
Today + Next Week, Discrete
Closely related, but different techniques and tests (e.g., Logit vs OLS vs Poisson Regressions) based on the type of RV
Definition: A Random Variable is discrete if the set of values it takes with positive probability is finite or countably infinite
Uncertainty over the sample space -> uncertainty over the value of X
The distribution of a random variable specifies that uncertainty
Specifically, it gives you the probabilities of all possible events
X = number of days a randomly chosen student was absent from school
Distribution tells you - What is P(X > 10)? What is P(X = 0)?
Consider flipping 4 fair coins, where X is the number of heads.
What is P(X = 1)?
What about P(X = 2 or 3)?
What about P(X > 4)?
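The three questions above can be answered by writing out the distribution of X directly. A minimal sketch: with 4 fair coins there are 16 equally likely sequences, and \(\binom{4}{k}\) of them contain exactly k heads.

```python
from fractions import Fraction
from math import comb

n_coins = 4

# P(X = k) = (number of sequences with k heads) / (total sequences)
pmf = {k: Fraction(comb(n_coins, k), 2 ** n_coins) for k in range(n_coins + 1)}

p_one = pmf[1]                                              # 1/4
p_two_or_three = pmf[2] + pmf[3]                            # 5/8
p_more_than_four = sum(p for k, p in pmf.items() if k > 4)  # 0

print(pmf)
print(p_one, p_two_or_three, p_more_than_four)
```

P(X > 4) is 0 because no value above 4 is in the set of values X can take.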
The Probability Mass Function (PMF) is specified as:\[p_{x}(x) = P(X = x)\]
X = x is an event
The support of X for a discrete random variable is the values for which it has a positive probability
What does all this mean in plain language?
A valid PMF with support \(x_{1}, x_{2},...\) has the following properties
Non-Negativity: \(p_{x}(x) > 0\) if \(x \in \{x_{1}, x_{2}, ...\}\) and \(p_{x}(x) = 0\) otherwise
Sums to 1: \(\sum_{j} p_{x}(x_{j}) = 1\), where the sum runs over every value in the support
The probability of any set of values S in \(\{x_{1}, x_{2}, ...\}\) is \[P(X\in S) = \sum_{x \in S} p_{x}(x)\]
Imagine you are designing an experimental educational policy intervention with 3 treatment conditions (Control, T1 and T2).
Good randomization is the foundation of experimental inference, so you decide to randomize into conditions by flipping four coins. Let X be the number of heads. If X is 0 or 1 you assign the school to the control. If X is 2 you assign it to T1, and if X > 2, you assign it to T2.
What does the PMF of X look like? Would you use this randomization technique?
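To judge the scheme, we can push the PMF of X through the assignment rule and see how likely each condition is. A minimal sketch, assuming four fair coins; the function name `condition_for` is made up for illustration.

```python
from fractions import Fraction
from math import comb

# PMF of X = number of heads in four fair coin flips.
pmf = {k: Fraction(comb(4, k), 16) for k in range(5)}


def condition_for(x):
    """Assignment rule from the example: 0 or 1 -> Control, 2 -> T1, >2 -> T2."""
    if x <= 1:
        return "Control"
    if x == 2:
        return "T1"
    return "T2"


# Add up P(X = x) over all x that map to the same condition.
p_condition = {}
for x, p in pmf.items():
    name = condition_for(x)
    p_condition[name] = p_condition.get(name, Fraction(0)) + p

print(p_condition)
```

Control and T2 each get probability 5/16 while T1 gets 6/16, so the three conditions are not equally likely under this rule.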
Social Science Examples
What is the probability of two states going to war if they are both democracies?
What is the probability of a recession in 2025 if the unemployment rate rose in 2024?
What is the probability of a military coup in Brazil if consumer prices double?
Examples of relevance for you?