Slides will be online after the lecture
For next week - a few readings have been added to Canvas
Follow textbooks at your own pace
Read (to at least get main argument) articles before Monday
First problem set next week. A mix of working in R and probability theory
Want to measure the relationship between variables in the social world
Assume that relationships between variables are usually not deterministic
Need ways to measure uncertainty about the world
Models of democracy (Przeworski 2000, Boix 2003, Acemoglu and Robinson 2006) suggest democratization is caused by economic conditions
Let i indicate a given country and t indicate a given year:
\[ \text{Dem}_{it} = f(\text{Economic Conditions}_{it}) \]
What are potential problems with this?
\[ \text{Dem}_{it} = f(\text{Economic Conditions}_{it}) + g(\text{Stuff}_{it}) \]
We often rewrite this as
\[ \text{Dem}_{it} = f(\text{Economic Conditions}_{it}) + \epsilon_{it} \]
where \(\epsilon\) is the error term.
Probability gives us a way both to quantify the strength of the relationship (typically through regression) and to measure the uncertainty associated with our estimate.
Conditional Probability is the foundation of statistics
We often want to know whether a group that has been “treated” by a policy intervention has different outcomes than a control group.
To determine whether these expected values are different, we need to understand how they are distributed.
Good social scientists need to understand what (conditional) probability they are estimating (excepting purely descriptive work)
Being confident that we are estimating the right quantity of interest requires
Understanding probability (often skipped in applied stats courses)
Understanding the Data Generating Process (linking theory and expertise to empirical models)
Methods and Tools WILL change during your careers
Text-as-data (Dictionaries -> Bag of Words -> LLMs)
A new DiD estimator every month (slight exaggeration)
The rise and fall (and rise?) of instrumental variables
SPSS -> STATA -> R -> Python (?) -> Machine writes the code (?)
We want to lay a foundation so that you can adapt as tools change
Paul picked 8 consecutive games correctly (all of Germany’s games + the final). What is the probability of correctly picking 8 consecutive games by chance?
The probability of randomly picking any single game correctly is 0.5. Intuitively, the odds of getting all 8 correct are \(0.5^{8}\) or \(\frac{1}{256}\)
Less than a 0.5% chance of getting all 8 right by chance. Usually, our statistical tests look for p < .05; this is p < .005. Thus, Paul can see the future!
Or can he? Next week, Reverend Bayes can help us think through this with rigor.
A sample space is the set of all possible outcomes or states. An event A is a subset of the sample space S
The sample space can be finite, countably infinite or uncountably infinite. If it is finite, we can visualize it as below:
This all might seem quite abstract…what are some sample spaces and events we might care about?
Political Science: Sample space might be decisions made by the electorate. A could be the event of voting and B the event of voting for a third party candidate.
Economics: Sample space might be range of labor market outcomes. A might be those currently employed and B might be those looking for a new job.
Education: The sample space could be the range of educational outcomes, where A is those who have at least some college.
A \(\cup\) B is the union of Events A and B.
A or B (including A and B)
A \(\cap\) B is the intersection of A and B
A and B
A\(^C\) is the complement of A
Everything that is in S but not in A
What is the probability of B occurring?
Looks like 4/9, if we just take the number of outcomes in B and divide by the number of outcomes in S. But… what assumptions are we making?
S might not be finite
Some events may be more likely (have more mass) than others
Imagine a race with 25 runners, where runners are awarded Gold, Silver and Bronze medals. How many combinations of medal winners are there?
There are 25 potential first-place winners. Once we know the gold medalist, there remain 24 potential second-place winners, and then 23 for bronze: 25 × 24 × 23 = 13,800 possible podiums. In general, the count is (n)(n-1)(n-2)…
Consider n objects, from which we make k choices, with replacement. Assume that order matters: {1,2} != {2,1}
Then the number of possible outcomes is \(n^{k}\). Why? Each of the k choices has n possibilities regardless of earlier picks, so the counts multiply: n × n × ⋯ × n = \(n^{k}\) (e.g., a 4-digit PIN drawn from 10 digits has \(10^{4}\) = 10,000 possibilities).
This is what DALL·E 3 imagines a polling center looks like!
Consider n objects from which we make k choices without replacement.
From the multiplication rule, there are n(n-1)(n-2)…(n-k+1) possible outcomes.
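A quick sketch of this multiplication rule in R, using the 25-runner medal example from earlier (the numbers are just those example values):

# Ordered medal outcomes: 25 choices for gold, then 24 for silver, then 23 for bronze
prod(25:23)                        # 13800

# Same count from the general formula n! / (n - k)!
factorial(25) / factorial(25 - 3)  # 13800 (up to floating-point rounding)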
What is the probability that two people in this class share a birthday?
library(ggplot2)

# Function to calculate the probability
birthday_prob <- function(n) {
  if (n > 365) return(1)
  prob <- 1
  for (i in 0:(n - 1)) {
    prob <- prob * (365 - i) / 365
  }
  return(1 - prob)
}

# Number of people in the group
n <- 1:100

# Calculate probabilities
probabilities <- sapply(n, birthday_prob)

# Create a data frame for plotting
birthday_data <- data.frame(GroupSize = n, Probability = probabilities)

# Plotting the birthday problem
ggplot(birthday_data, aes(x = GroupSize, y = Probability)) +
  geom_line(color = "blue", size = 1.5) +
  geom_point(color = "red", size = 2) +
  labs(title = "Birthday Problem: Probability of Shared Birthdays",
       x = "Number of People in the Group",
       y = "Probability of At Least One Shared Birthday") +
  theme_minimal()
How many ways are there to choose a three-person committee from five people?
List them all out (123)(124)(125)(134)(135)(145)(234)(235)(245)(345). So, there are 10 ways to form this committee.
Or, we can use the multiplication rule: there are 5 ways to choose spot 1, 4 ways to choose spot 2, and 3 ways to choose spot 3, giving 5 × 4 × 3 = 60. But this overcounts because order does not matter: each committee appears 3! = 6 times, so there are 60/6 = 10 committees.
This leads to a more general concept - the Binomial Coefficient \(n\choose k\) or “n choose k”. This is the number of subsets of size k for a set of size n.
\[ \frac{(n)(n-1)(n-2)....(n-k+1)}{k!} \]
for k \(\leq\) n
Equivalently
\[\frac{n!}{(n-k)!k!}\]
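R's built-in choose() computes the binomial coefficient directly; here is a quick check against the committee example (a minimal sketch):

# Number of 3-person committees from 5 people
choose(5, 3)                                   # 10
# Same answer from the factorial formula n! / ((n - k)! k!)
factorial(5) / (factorial(2) * factorial(3))   # 10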
A probability space consists of a sample space S and a probability function P, which takes an event A in S as input and returns P(A) (the probability of A occurring), a real number between 0 and 1.
P(\(\emptyset\)) = 0, P(S) = 1
If \(A_{1}\) and \(A_{2}\) are disjoint (non-overlapping), then P(\(A_{1} \cup A_{2}\)) = P(\(A_{1}\)) + P(\(A_{2}\))
The following are important properties
P(\(A^{c}\)) = 1 - P(A)
If A \(\subseteq\) B, then P(A) \(\leq\) P(B)
P(\(A \cup B\)) = P(A) + P(B) - P(\(A \cap B\))
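These properties are easy to illustrate by simulation. A minimal sketch (my own example, not from the slides): roll a fair die many times, let A be "even roll" and B be "roll greater than 3", and compare the simulated P(A \(\cup\) B) to P(A) + P(B) - P(A \(\cap\) B).

set.seed(123)
rolls <- sample(1:6, size = 1e5, replace = TRUE)  # simulate 100,000 fair die rolls

A <- rolls %% 2 == 0   # event A: the roll is even
B <- rolls > 3         # event B: the roll is greater than 3

mean(A | B)                       # simulated P(A union B), roughly 4/6
mean(A) + mean(B) - mean(A & B)   # identical, by inclusion-exclusion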
In R, we often want to call a function that someone else has written
Some functions are in base R; most come from packages
Be careful about name clashes
Functions take a range of prespecified arguments
Some are required, others are optional
Some have defaults if you do not specify them, others don't
Arguments also have a position in the function's syntax, so they can be matched by position or by name (see the sketch below)
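A small sketch of how arguments work, using base R's seq() (any function with defaults would do):

seq(from = 1, to = 10, by = 2)  # named arguments: returns 1 3 5 7 9
seq(1, 10, 2)                   # same result, arguments matched by position
seq(1, 10)                      # 'by' omitted: defaults to steps of 1
seq(10)                         # one unnamed argument: returns 1 through 10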
While it is possible to plot in base R (without loading a package), nearly everyone uses ggplot2 instead
Designed by Hadley Wickham, building on Leland Wilkinson's "Grammar of Graphics"
A plot is composed of three core elements (and other options)
The Data
The Aesthetic Mapping of variables to visual cues
The Geometry used to encode observations (the type of plot)
Variable: Something that you can measure
Value: The state of the variable when you measure it. Usually the state of a variable varies.
Observation: A set of measurements of some object - usually a row in the data.
Tidy Data: Data formatted such that each observation is one row.
library(ggplot2)
library(palmerpenguins)

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = bill_depth_mm,
                     color = species)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +  # Add group-specific OLS regression lines with uncertainty
  ggtitle("Characteristics of Penguins by Species") +  # Add title
  labs(x = "Bill Length (mm)", y = "Bill Depth (mm)") +  # Format axes into plain English
  theme_minimal() +  # Apply a minimal theme
  theme(axis.text = element_text(size = 12),   # Increase axis number size
        axis.title = element_text(size = 14),  # Increase axis label size
        plot.title = element_text(size = 16, hjust = 0.5))  # Center and increase title size
We want to convey information to the reader
clearly
efficiently
as simply as possible
For most plots, we want the outcome or dependent variable on the Y axis
Similarly, the key independent variable should be on the X axis
Think carefully about axes. Use a single scale, with sensible values!
Clearly label plots. Rename variables so reader can understand
Use colorblind friendly color schemes
In recent versions of R, the default color palette has improved.
The palette.colors() function returns even better, colorblind-friendly schemes (see the sketch below)
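For example, a minimal sketch (illustrative, not from the slides) that applies the colorblind-friendly Okabe-Ito palette returned by palette.colors() to the penguin scatter plot used earlier:

library(ggplot2)
library(palmerpenguins)

# Okabe-Ito is the default palette returned by palette.colors() (base R >= 4.0)
okabe_ito <- unname(palette.colors(n = 3, palette = "Okabe-Ito"))

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  scale_color_manual(values = okabe_ito) +  # map the three species to Okabe-Ito colors
  theme_minimal()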
As a general rule (there may be exceptions)
Don’t use pie graphs, they are hard to interpret
Similarly, 3D graphics rarely add anything to plots
If you want to track trends across different groups/subjects, consider faceting.
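A minimal facet_wrap() sketch with the penguins data (illustrative only): the same scatter plot, split into one panel per species.

library(ggplot2)
library(palmerpenguins)

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() +
  facet_wrap(~ species) +  # one panel per species
  labs(x = "Bill Length (mm)", y = "Bill Depth (mm)") +
  theme_minimal()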
Useful for understanding the bivariate relationship (or joint distribution) between two variables
We often add a line of fit (linear or smoothed) to make the relationship more explicit
in ggplot, geom_point() tells R to make a scatter plot
Great to summarize a distribution
Particularly with discrete or categorical data
Can be used with continuous data too (Histogram)
geom_bar() (counts), geom_col() (summary statistics), or geom_histogram() (continuous data with bins; see the histogram sketch after the bar-chart example below)
library(ggplot2)
library(palmerpenguins)
library(tidyverse)

# Pre-compute the mean bill length for each species
penguin_summary <- penguins %>%
  group_by(species) %>%
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))

ggplot(data = penguin_summary, aes(x = species, y = mean_bill_length)) +
  geom_col() +
  ggtitle("Average Bill Length by Species") +             # Add title
  labs(x = "Species", y = "Average Bill Length (mm)") +   # Add axis labels
  theme_minimal()                                         # Minimal theme for a cleaner look
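Since geom_histogram() was mentioned above for binned continuous data, here is a minimal sketch (the 2 mm bin width is an arbitrary choice):

library(ggplot2)
library(palmerpenguins)

ggplot(penguins, aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 2) +              # bin bill lengths into 2 mm bins
  ggtitle("Histogram of Bill Lengths") +
  labs(x = "Bill Length (mm)", y = "Count") +
  theme_minimal()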
Good for summarizing continuous data
geom_density()
X axis is the variable, Y axis is the probability density (the area under the curve is 1; we will define this formally soon)
Can smooth the curve with the adjust argument (see the sketch after the examples below)
library(ggplot2)
library(palmerpenguins)

ggplot(data = penguins, aes(x = bill_length_mm)) +
  geom_density(fill = "lightblue", alpha = 0.6) +  # Add fill color and transparency
  ggtitle("Density Plot of Bill Lengths") +        # Add title
  labs(x = "Bill Length (mm)", y = "Density") +    # Add axis labels
  theme_minimal()                                  # Minimal theme for a cleaner look

ggplot(data = penguins, aes(x = bill_length_mm, fill = species)) +
  geom_density(alpha = 0.5) +                           # Add density curves with transparency
  ggtitle("Density Plot of Bill Lengths by Species") +  # Add title
  labs(x = "Bill Length (mm)", y = "Density") +         # Add axis labels
  theme_minimal()                                       # Minimal theme for a cleaner look
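And a minimal sketch of the adjust argument mentioned above: values above 1 give a smoother curve, values below 1 a wigglier one (the value 2 is arbitrary).

ggplot(data = penguins, aes(x = bill_length_mm)) +
  geom_density(adjust = 2, fill = "lightblue", alpha = 0.6) +  # double the default bandwidth
  ggtitle("Density Plot of Bill Lengths (adjust = 2)") +
  labs(x = "Bill Length (mm)", y = "Density") +
  theme_minimal()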
Another useful way to plot the distribution
Visually
The middle line is the median
The box spans the 25th - 75th percentile (Interquartile range)
The whiskers (usually) extend to the most extreme points within 1.5 × the interquartile range of the box
Points beyond the whiskers are outliers
geom_boxplot()
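A minimal geom_boxplot() sketch with the penguins data (illustrative only):

library(ggplot2)
library(palmerpenguins)

ggplot(penguins, aes(x = species, y = bill_length_mm)) +
  geom_boxplot() +  # median line, IQR box, whiskers, and outlier points
  ggtitle("Bill Length by Species") +
  labs(x = "Species", y = "Bill Length (mm)") +
  theme_minimal()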
library(dplyr)
library(tidytext)
library(wordcloud2)
library(gutenbergr)

# Download "Hamlet" by Shakespeare, using an alternative Project Gutenberg mirror
gutenberg_mirror <- "https://gutenberg.pglaf.org"
hamlet <- gutenberg_download(1787, mirror = gutenberg_mirror)

# Tokenize the text into words and remove stop words
data(stop_words)
hamlet_words <- hamlet %>%
  unnest_tokens(word, text) %>%  # Tokenize into words
  anti_join(stop_words) %>%      # Remove stop words
  count(word, sort = TRUE)       # Count word frequencies

# Generate the word cloud
wordcloud2(data = hamlet_words, size = 0.7, color = "random-light", backgroundColor = "black")
Note - Word clouds look cool, but there are probably better ways to graph text data
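For instance, a minimal sketch of one common alternative, a bar chart of the most frequent words, reusing the hamlet_words counts computed above (the cutoff of 15 words is arbitrary):

library(ggplot2)

hamlet_words %>%
  slice_max(n, n = 15) %>%                 # keep the 15 most frequent words
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Word Frequency", y = NULL,
       title = "Most Frequent Words in Hamlet") +
  theme_minimal()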