Key Point: Just as ggplot2 gives us a grammar of graphics built from layers, dplyr (part of the tidyverse) gives us a clear pipeline for data manipulation
# A tibble: 5 × 4
# Groups:   species [3]
  species   island    avg_bill_length count
  <fct>     <fct>               <dbl> <int>
1 Chinstrap Dream                49.7    49
2 Gentoo    Biscoe               47.5   123
3 Adelie    Biscoe               39.7    30
4 Adelie    Dream                39.7    31
5 Adelie    Torgersen            39.6    31
Often, we want to compare subgroups within data
The `group_by()` function is part of the dplyr package and is used to split data into groups before performing operations on each group.
Once we have grouped the data - we can summarize it by group.
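A minimal sketch of such a pipeline, assuming the palmerpenguins package (the exact counts in the output above depend on any filtering applied before grouping):

```r
# A minimal sketch, assuming the palmerpenguins package; counts will differ
# depending on any filtering done first.
library(dplyr)
library(palmerpenguins)

penguins %>%
  group_by(species, island) %>%                       # split into species-island groups
  summarize(avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
            count = n()) %>%                          # one summary row per group
  arrange(desc(avg_bill_length))
```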
`mutate()` is used to create new variables from existing ones in a data frame. Imagine we had separate variables for the first and second exam and wanted to calculate an average.
# Load dplyr for the pipe and mutate()
library(dplyr)

# Example dataset
df <- data.frame(
  Name = c('Alice', 'Bob', 'Charlie'),
  Score1 = c(90, 85, 80),
  Score2 = c(95, 88, 78)
)

# Create a new column with mutate()
df_with_average <- df %>%
  mutate(Average_Score = (Score1 + Score2) / 2)

print(df_with_average)
     Name Score1 Score2 Average_Score
1   Alice     90     95          92.5
2     Bob     85     88          86.5
3 Charlie     80     78          79.0
filter - keep only certain rows (observations) of the data
select - keep only certain columns (variables) of the data
select(data, age, gender, vote) keeps only the age, gender, and vote variables and drops all others
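A quick illustration with a small hypothetical data frame (the variable names and values are made up just for this example):

```r
# A hypothetical survey data frame, invented to illustrate the two verbs
library(dplyr)

survey <- data.frame(
  age    = c(24, 37, 58, 41),
  gender = c("F", "M", "F", "M"),
  vote   = c(1, 0, 1, 1),
  income = c(42000, 61000, 75000, 53000)
)

filter(survey, age > 30)           # keep only the rows (observations) with age > 30
select(survey, age, gender, vote)  # keep only these columns, dropping income
```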
Resource: RStudio Cheat Sheet: Data Transformation with dplyr
Download “SmokeBan.csv” and save it to your computer.
Load in the data and save it as smoking_data.
Argument: smoking bans -> reduced smoking. What does a quick glance at the data show?
Let’s compare the rates of smoking (variable named smoker) among those at workplaces with and without a ban (see the sketch after this list).
What are other factors that might influence smoking - how can we bring these into our analyses?
Practice finding help - use Stack Exchange if you are stuck. Read the documentation for functions. See what GPT recommends!
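One possible starting point is sketched below. The column names (smoker, ban) and their "yes"/"no" coding are assumptions to verify against the actual CSV:

```r
# A rough first pass, not an answer key. The variable names (smoker, ban) and
# their "yes"/"no" coding are assumptions -- check names(smoking_data) and
# table(smoking_data$smoker) against the actual file.
library(dplyr)

smoking_data <- read.csv("SmokeBan.csv")

smoking_data %>%
  group_by(ban) %>%                                  # workplaces with vs. without a ban
  summarize(smoking_rate = mean(smoker == "yes"),    # share of smokers in each group
            n = n())
```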
X ~ Bern(p) with parameter p if P(X = 1) = p and P(X = 0) = 1 - p
Story: Indicator of success in some trial with two outcomes
Any event A has an associated Bernoulli Indicator Variable
An extension of the Bernoulli to multiple trials:
Let X be the number of successes in n independent Bernoulli trials all with success probability p. Then X follows the binomial distribution with parameters n and p, which is written X~Bin(n,p)
p is again the probability of success
n is the number of trials
Thus, Bin(1, p) is the same distribution as Bern(p)
Logically, if X ~ Bin(n,p), n-X ~ Bin(n, 1-p)
If X ~ Bin(n,p), then the PMF of X is\[p_{X}(k) = {n \choose k} p^{k}(1-p)^{n-k}\]
for all k = 0, 1, ..., n
\(p^{k}(1-p)^{n-k}\) is the probability of a specific sequence of successes and failures with k successes
The binomial coefficient \(n \choose k\) counts how many such sequences there are.
A Binomial is an experiment consisting of n independent random trials, with a parameter p specifying the probability of success. We can construct the PMF from things we already know:
\(P(X = k)\) is the probability of k successes. To figure out the probability of k successes, we first need to know how many ways can we get k successes from n trials.
\[ n\choose k \]
Then we need to know how likely a single sequence with k successes from n trials is.
\(p^{k}(1-p)^{n-k}\) (why?)
If the probability of a single such sequence is \(p^{k}(1-p)^{n-k}\), we multiply that by \(n \choose k\) to get the probability across all such sequences
\[P(X = k) = {n\choose k} p^{k}(1-p)^{n-k} \]
Note, if k > n, \(P(X=k)\) must be 0.
Mary Peltola (D) is the incumbent. Assume her true level of support is 40%. If we sample 10 voters, what is the probability that exactly 5 voters favor her over all other candidates?
X ~ Bin(10, 0.4), so what is the P(X = 5)?
What about the probability that at least two voters support her over all others?
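We can check both quantities with R's built-in binomial functions:

```r
# X ~ Bin(10, 0.4)
dbinom(5, size = 10, prob = 0.4)      # P(X = 5), roughly 0.20
1 - pbinom(1, size = 10, prob = 0.4)  # P(X >= 2), roughly 0.95
```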
All Random Variables have some sort of distribution
Many, but not all, conform to some family of distributions
A Probability Mass Function describes a discrete distribution
Often, we want to sample without replacement.
Consider drawing balls without replacement from an urn containing w white balls and b black balls, making n draws. The number of white (or black) balls we draw follows a hypergeometric distribution.
X ~ HGeom(w, b, n) - following the Blitzstein and Hwang formulation.
\[ P(X = k) = \frac{\binom{w}{k} \binom{b}{n-k}}{\binom{w+b}{n}} \]
Note - there is no p parameter. If we add weights to the balls, there is no generic PMF
Consider a company with an equal amount of male and female employees (1000 in total). A group of women claim that the company unfairly promotes men over women, and in particular fails to nominate women for management training.
Imagine that each year for the past 10 years, one employee is selected for the training. Only one woman has been picked in that time period.
Let X be the number of women picked for the training. What is \(P(X \leq 1)\)?
# Load the ggplot2 package
library(ggplot2)

# Set parameters for the hypergeometric distribution
m <- 500   # Number of successes in the population
n <- 500   # Number of failures in the population
k <- 10    # Number of draws (sample size)

# Define the range of possible successes in the sample
x <- 0:k

# Calculate the probability mass function
probabilities <- dhyper(x, m, n, k)

# Create a data frame for plotting
df <- data.frame(Successes = x, Probability = probabilities)

# Plot using ggplot2
ggplot(df, aes(x = Successes, y = Probability)) +
  geom_col(fill = "skyblue", width = 0.7) +
  geom_point(color = "darkblue", size = 2) +
  labs(title = "Hypergeometric Distribution (m=500, n=500, k=10)",
       x = "Number of Successes",
       y = "Probability") +
  theme_minimal()
# Number of trials (number of employees selected)
k <- 10
# Total population size (total employees)
N <- 1000
# Number of successes in the population (number of women)
m <- 500
# Probability of success in the binomial distribution
p <- m / N # Since the population is 50% women
# Hypergeometric Distribution
# Calculate P(X <= 1)
P_hyper <- phyper(1, m, N - m, k)
cat("Hypergeometric P(X ≤ 1):", P_hyper, "\n")
Hypergeometric P(X ≤ 1): 0.01043612
# Binomial Distribution
# Calculate P(X <= 1)
P_binom <- pbinom(1, size = k, prob = p)
cat("Binomial P(X ≤ 1):", P_binom, "\n")
Binomial P(X ≤ 1): 0.01074219
HGeom and Binomial won’t always be similar!
[Comparison plots: a small sample size k relative to the population size, a moderate k, and a moderately large k (50).]
Very simple story: Imagine N different items. Choose one item uniformly at random; what is the probability of choosing any particular item?
\[ P(X = x) = \frac{1}{N} \]
Simple - but the intuition is important and sometimes a useful baseline
CDFs exist for all Random Variables (PMF only discrete). Becomes very important when we move to continuous r.v.
Definition: The CDF of an r.v. X is the function \(F_{X}\) given by \(F_{X}(x) = P(X \leq x)\)
Increasing: If \(x_{1} \leq x_{2}\), \(F(x_{1}) \leq F(x_{2})\)
Converges to 0 and 1 in the limits: \[\lim_{x \to -\infty} F(x) = 0 \text{ and } \lim_{x \to \infty} F(x) = 1\]
Just like probabilities can be independent, so can random variables.
If \(P(X \leq x, Y \leq y) = P(X \leq x)P(Y \leq y)\) then the random variables X and Y are independent.
In the discrete case only, this is equivalent to \(P(X=x, Y=y) = P(X = x)P(Y = y)\)
Variables are independent and identically distributed (iid) if they are both independent and share the same CDF (or PMF/PDF).
Independent and Identically Distributed: Let X be the result of a fair die roll. Let Y be the result of a second, independent, die roll. X and Y are iid.
Independent and not Identically Distributed: Let X be the result of a die roll and Y be whether it rains tomorrow.
Dependent and Identically Distributed: Let X be the number of heads in n tosses and let Y be the number of tails from n tosses of the same fair coin.
Dependent and not Identically Distributed: Let X be the approval rating of the incumbent president and let Y be an indicator of whether the incumbent wins re-election.
Assume that the data \(X_{1}, X_{2},...\) are iid, and apply an appropriate statistical model to estimate the quantity of interest
Example: Support for a Carbon Tax
Sample n respondents from the population with replacement
\(X_{1}, X_{2}, X_{3}...\) are independent Bernoulli trials indicating support/opposition
So the total number of supporters X ~ Bin(n, p), where p is the population approval rate
\(\bar{X} = (\frac{1}{n}) \displaystyle\sum_{i=1}^{n} X_{i}\) is our estimate of p.
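A small simulation sketch of this setup; the true support rate p here is a hypothetical value chosen only for illustration:

```r
# Simulate a poll under the iid Bernoulli model; p is hypothetical
set.seed(123)
p <- 0.45                            # hypothetical population support rate
n <- 1000                            # number of respondents sampled
x <- rbinom(n, size = 1, prob = p)   # n iid Bernoulli(p) responses
mean(x)                              # X-bar, our estimate of p
```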
Consider X~bin(n,p). How does it relate to Bernoulli RVs?
X is the sum of n iid(by definition of Binomial) Bernoulli trials. \[X = X_{1} + X_{2} + ... + X_{n}\]
If X ~ Bin(n, p) and Y ~ Bin(m, p), and X and Y are independent, then X + Y ~ Bin(n + m, p)
Same definition as for probabilities: \[P(X \leq x, Y \leq y \mid Z = z) = P(X \leq x \mid Z = z)P(Y \leq y \mid Z = z)\]
Conditional PMFs and conditional CDFs follow all the same rules as unconditional PMFs and CDFs
As before, independence and conditional independence do not imply each other
Bob and Harriet sometimes call you. Let X and Y be indicators that Bob calls and that Harriet calls, respectively.
Bob’s and Harriet’s phone calls are independent.
Let Z be the event that you receive exactly one phone call from them.
Are X and Y independent conditional on Z - that is, are (X|Z) and (Y|Z) independent?
One of two unknown opponents
P(Win|P1) = .75, P(Win|P2) = .5
Let X and Y be indicators of victory in the first and second set. Let Z be an indicator of which opponent you face.
We’ve covered probability + conditioning
Defined Random Variables, covered a few discrete distributions
Distributions give us complete information about the properties of an r.v.
Next: Key values that summarize distributions, starting with expectation
Imagine we want to know if door-to-door canvassing changes behaviors on climate change, specifically by reducing energy consumption
We can define two potential outcomes
\(Y_{i}(1)\): Whether person i would reduce energy use (1) or not (0) if they received canvassing
\(Y_{i}(0)\): Whether person i would reduce energy use (1) or not (0) if they did not receive canvassing
The individual effect of canvassing is then \[\tau_{i} = Y_{i}(1) - Y_{i}(0)\]
We can think of \(Y_{i}(1)\) and \(Y_{i}(0)\) as RVs - thus so is \(\tau_{i}\).
How should we summarize the distribution of causal effects?
Presumably, you’ve seen before how to calculate an average, or mean, of a set of numbers.
Formally \[\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_{j}\]
When we start talking about probability distributions, a weighted mean becomes useful, as some values are more likely than others. So we weight each value with its probability of occurring:
Formally \[\text{weighted-mean(x)} = \sum_{j=1}^{n}x_{j}p_{j}\]
The expectation, aka mean or expected value, of a discrete random variable X is defined by \[E[X] = \sum_{j = 1}^{\infty} x_{j}P(X = x_{j}) \]
Let X ~ Bern(.7). How do we calculate E(X)?
\[E[X] = 1p + 0(1-p) = .7 + 0 = .7 \]
Let X be the outcome of the roll of a weighted die, where p(1) = .25, p(2) = .25, p(3) = .2, p(4) = .1, p(5) = .1, p(6) = .1, what is the expected value of a single roll?
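One way to compute this in R; the expectation is just the probability-weighted sum of the faces:

```r
# Expected value of the weighted die: probability-weighted sum of the faces
faces <- 1:6
probs <- c(0.25, 0.25, 0.2, 0.1, 0.1, 0.1)
sum(faces * probs)   # 2.85
```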
X ~ Bin(n, p). How can we find E[X]? Plug the PMF into the expectation formula:
\[E[X] = \sum_{k=0}^{n} k \binom{n}{k} p^{k}q^{n-k}\]
We can simplify this - it is covered in the book, but requires theorems and combinatorics beyond the scope of the class. Instead, can we use the story of what the binomial is to reason out what E[X] must be?
If X ~ Bin(n,p), E[X] = np.
We often care about the expectation of some transformation of random variables. For any r.v.s X and Y and any constant c, the following are always true:
\[ E[X + Y] = E[X] + E[Y] \\ E[cX] = cE[X] \]
\[ E[g(X)] \neq g(E[X]) \]
\[ E[XY] \neq E[X]E[Y] \text{ - unless X and Y are independent} \]
Suppose \(Y = g(X)\), where X is an r.v. with PMF f(x). Then Y is a random variable taking values g(x) with the probabilities associated with X = x. \[E[Y] = E[g(X)] = \sum_{x} g(x) f(x) = \sum_{x}g(x) P(X=x) \]
Let X be a random variable with the following probability distribution:
| x | P(X=x) |
|---|---|
| -2 | 0.2 |
| -1 | 0.1 |
| 0 | 0.2 |
| 1 | 0.3 |
| 3 | 0.2 |
What is \(E[X^{2}]\)? What about \((E[X])^{2}\)?
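A quick check in R, using LOTUS for E[X^2] and squaring the mean for the other quantity:

```r
# E[X^2] via LOTUS versus squaring E[X]
x <- c(-2, -1, 0, 1, 3)
p <- c(0.2, 0.1, 0.2, 0.3, 0.2)
sum(x^2 * p)    # E[X^2] = 3.0
sum(x * p)^2    # (E[X])^2 = 0.16
```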
Imagine a 5-card hand dealt from a standard 52-card deck, where X is the number of aces. How can we find E[X]?
We could plug in the PMF, but that seems hard!
Let \(X_{j}\) indicate whether the jth card is an ace. Then \[E[X] = E[X_{1} + ... + X_{5}] = 5E[X_{1}] = 5\,P(\text{1st card is an ace}) = 5 \cdot \tfrac{4}{52} \]
In general, for a hypergeometric with K successes in a population of size N and n draws, \(E[X] = n \cdot \frac{K}{N}\). How does this relate to the Binomial expectation?
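We can verify the ace example against R's hypergeometric PMF:

```r
# m = 4 aces, n = 48 non-aces, k = 5 cards drawn
aces <- 0:4
sum(aces * dhyper(aces, m = 4, n = 48, k = 5))   # expected number of aces
5 * 4 / 52                                       # n * K / N, about 0.385
```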
Geom(p): Independent Bernoulli Trials. How many failures before the first success?
Let X ~ Geom(p), q = 1 - p. Then the PMF must be \[ P(X = k) = pq^{k}\]
Think through why that must be the PMF
From the definition of expectation, where we plug the PMF in for P(X=x):
\[ E[X] = \sum_{k = 0}^{\infty} kpq^{k} \]
We could use calculus to work out how this simplifies to \(\frac{q}{p}\) (see Example 4.3.6 in B&H), or we can use a story proof
Flip a coin that has a probability of heads p repeatedly until the first heads comes up
Let c = E[X]. How can we solve for c?
Consider the first flip. If it is heads, then X = 0; this happens with probability p
If the first flip is tails, then the problem resets (with probability q)
Then, \(c = 0p + (1 + c) q\)
After a little algebra, c = \(\frac{q}{p}\)
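A quick simulation check of this result; note that R's rgeom() counts failures before the first success, matching our definition (the value of p here is arbitrary):

```r
# rgeom() counts failures before the first success, matching this definition
set.seed(42)
p <- 0.3
mean(rgeom(1e5, prob = p))   # simulated average number of failures
(1 - p) / p                  # q/p, about 2.33
```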
In labor market studies, the geometric distribution can model the number of applications an individual submits until they receive a job offer.
During a recession, there are many more applicants than there are jobs. Imagine that an application has a probability of success .015. How many applications would we expect an applicant to send before getting a job. If an applicant can apply to 7 jobs a week, how long will it take them to find work?
X counts failures before the first success, so \(E[X] = \frac{q}{p} = \frac{0.985}{0.015} \approx 65.7\) unsuccessful applications; counting the successful one, that is \(\frac{1}{p} \approx 66.7\), or about 67 applications in total. At 7 applications a week, an average job search would take about 10 weeks.
The Negative Binomial: not negative or binomial, but a generalization of the Geometric
Parameters: r,p
Story: Independent Bernoulli trials with probability of success p. We want to know how many failures before the rth success
Deriving the PMF: Consider the following sequence, where r = 5\[0010001010001001\]
The probability of such a sequence is \(p^{r}(1-p)^{n}\) - and there are \(\binom{n+r-1}{r-1}\) ways to get a sequence with n failures before r successes.
We could plug in the PMF to the expectation formula. But it looks awful to solve!
Instead, note that if r (number of successes) = 1, we already know the expectation. What if r = 2?
For an arbitrary number of successes, wait for the 1st success, then wait again for the 2nd, and so on, until the rth.
\(E[X] = E[X_{1} + ... + X_{r}]\), where \(X_{j}\) is the number of failures between the (j-1)st and jth successes. Each \(X_{j}\) ~ Geom(p), so \[E[X] = \frac{rq}{p}\]
X ~ Poisson(\(\lambda\))
PMF: \(P(X=k) = \frac{e^{-\lambda}\lambda^{k}}{k!}, \; k \in \{0,1,2,...\}, \; \lambda > 0\)
\(\lambda\) is the “rate” parameter
There is a nice derivation of the expectation in B&H, but for our purposes just know: \[E[X] = \lambda\]
Use/Story: Used for applications where we have many trials and a very small probability of success per trial. Often used to model rare events (earthquakes, etc).
Imagine a store that averages 15 customers an hour over the course of a day. With some assumptions, we can model this as a random variable X that represents the number of customers that come into the store, X ~ Poisson. This might be useful to help the store make staffing decisions.
Assumptions: Uniformity of events over the interval, independence of intervals, constant \(\lambda\)
We know the expectation is 15, and that this is also \(\lambda\).
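A sketch of how the store might use the model with R's Poisson functions; the 20-customer threshold below is just an illustrative choice:

```r
# With X ~ Poisson(15) customers per hour, how often is an hour unusually busy?
# The 20-customer threshold is an illustrative assumption, not from the example.
lambda <- 15
1 - ppois(20, lambda)   # P(X > 20), the chance of more than 20 customers in an hour
dpois(0:30, lambda)     # the PMF over a plausible range of customer counts
```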
Is Poisson a realistic model here? What are some potential issues?
Someone offers you the opportunity to play the following game:
They flip a fair coin repeatedly until it lands heads. If the first flip is heads you get $2, if the first heads comes on the second flip you get $4, and in general you get \(2^{X}\) dollars, where X is the number of flips needed to get the first heads. How much should you be willing to pay to play this game?
\[E[\text{payout}] = E[2^{X}] = \sum_{k=1}^{\infty} 2^{k}\left(\tfrac{1}{2}\right)^{k} = \infty \]
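A short R illustration of why the expected payout diverges: every term of the expectation sum contributes exactly one dollar, so the partial sums grow without bound (the cutoff of 50 flips is arbitrary, just to show the pattern):

```r
# Each term of the expected-payout sum is 2^k * (1/2)^k = 1 dollar
k <- 1:50
terms <- 2^k * (1/2)^k   # payout times probability of first heads on flip k
cumsum(terms)            # partial sums: 1, 2, 3, ... with no finite limit
```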