Lecture 2: Intro to Probability

Will Horne

Admin Stuff

  • Slides will be online after the lecture

    • I encourage taking some notes, but you don’t need to get everything down
  • For next week - a few readings have been added to Canvas

    • Follow textbooks at your own pace

    • Read (to at least get main argument) articles before Monday

  • First problem set next week. A mix of working in R and probability theory

What is the goal?

  • Want to measure the relationship between variables in the social world

    • Possibly establish causal relationships (mostly an issue for the spring)
  • Assume that relationships between variables are usually not deterministic

    • In this sense, different from some lab sciences
  • Need ways to measure uncertainty about the world

A Deterministic Model

Models of democracy (Przeworski 2000, Boix 2003, Acemoglu and Robinson 2006) suggest democratization is caused by economic conditions.

Let i indicate a given country and t indicate a given year:

\[ \text{Dem}_{it} = \text{f(Economic Conditions}_{it}) \]

What are potential problems with this?

A Probabilistic Model

\[ \text{Dem}_{it} = \text{f(Economic Conditions}_{it}) + g(\text{Stuff}_{it}) \]

We often rewrite this as

\[ \text{Dem}_{it} = \text{f(Economic Conditions}_{it}) + \epsilon_{it} \]

where \(\epsilon_{it}\) is the error term.

Probability gives us a way to both quantify the strength of the relationship (probably through regression) and to measure the uncertainty associated with our estimate.
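
A minimal simulated sketch of a model with an error term. Everything specific here is an assumption made for illustration: the linear form of f, the intercept and slope, and the variable names econ and dem_index. None of it comes from the cited literature.

```r
set.seed(2)  # make the simulated draws reproducible

n <- 200
econ <- runif(n, min = 0, max = 10)  # hypothetical "economic conditions" score
eps  <- rnorm(n, mean = 0, sd = 1)   # error term: everything else ("stuff")

# Assumed f(): a straight line with intercept 1 and slope 0.5
dem_index <- 1 + 0.5 * econ + eps

# Regression estimates the strength of the relationship and its uncertainty
fit <- lm(dem_index ~ econ)
coef(summary(fit))  # estimate and standard error for each coefficient
```

The estimated slope should land near the assumed 0.5, and the standard error is one way to express how uncertain that estimate is.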

A Preview: Conditioning

  • Conditional Probability is the foundation of statistics

  • We often want to know whether a group that has been “treated” by a policy intervention has different outcomes than a control group.

    • But…what do we mean by different?
    • We mean something like \(E[Y_{i}| X_{i} = 1]\) does not equal \(E[Y_{i}|X_{i} = 0]\), where X denotes treatment status.
  • To determine whether these expected values are different, we need to understand how they are distributed.
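
A small simulated sketch of comparing two conditional expectations. The data are made up: the treatment indicator x, the outcome y, and the true difference of 1 are all assumptions chosen for illustration.

```r
set.seed(2)  # reproducible simulated data

n <- 1000
x <- rbinom(n, size = 1, prob = 0.5)  # treatment status (1 = treated, 0 = control)
y <- 2 + 1 * x + rnorm(n)             # assumed true difference in means of 1

mean_treated <- mean(y[x == 1])  # sample analogue of E[Y | X = 1]
mean_control <- mean(y[x == 0])  # sample analogue of E[Y | X = 0]

mean_treated - mean_control      # close to, but not exactly, 1
```

The gap between the estimate and the assumed value of 1 is sampling noise, which is why we need the distribution of the estimate and not just the estimate itself.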

Why Bother with the Math?

  • Good social scientists need to understand what (conditional) probability they are estimating (excepting purely descriptive work)

    • Estimating the wrong (or uninteresting) conditional probability is a frequent problem
  • Being confident that we are estimating the right quantity of interest requires

    • Understanding probability (often skipped in applied stats courses)

    • Understanding the Data Generating Process (linking theory and expertise to empirical models)

Learning new Methods

  • Methods and Tools WILL change during your careers

    • Text-as-data (Dictionaries –> Bag of Words –> LLMs)

    • A new DiD estimator every month (slight exaggeration)

    • The rise and fall (and rise?) of instrumental variables

    • SPSS -> STATA -> R -> Python (?) -> Machine writes the code (?)

  • We want to lay a foundation so that you can adapt as tools change

Paul the Octopus

Paul the Prophet?

Paul picked 8 consecutive games correctly (all of Germany’s games + the final). What is the probability of correctly picking 8 consecutive games by chance?

The probability of randomly picking any single game correctly is 0.5. Intuitively, if the picks are independent, the probability of getting all 8 correct is \(0.5^{8}\) or \(\frac{1}{256}\).

Less than a 0.5% chance of getting all 8 right by chance. Usually, our statistical tests look for p < .05; this is p < .005. Thus, Paul can see the future!

Or can he? Next week, Reverend Bayes can help us think through this with rigor.
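
The arithmetic is easy to check in R. This sketch assumes each pick is an independent guess that is right with probability 0.5:

```r
# Probability of 8 correct picks in a row
p_all_eight <- 0.5^8
p_all_eight   # 0.00390625, i.e. 1/256

# Same number from the binomial distribution: 8 successes in 8 trials
p_binom <- dbinom(8, size = 8, prob = 0.5)
p_binom
```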

Sample Spaces and Events

A sample space is the set of all possible outcomes or states. An event A is a subset of the sample space S

The sample space can be finite, countably infinite or uncountably infinite. If it is finite, we can visualize it as below:

Events in the Social World

This all might seem quite abstract…what are some sample spaces and events we might care about?

Political Science: Sample space might be decisions made by the electorate. A could be the event of voting and B the event of voting for a third party candidate.

Economics: Sample space might be range of labor market outcomes. A might be those currently employed and B might be those looking for a new job.

Education: The sample space could be the range of educational outcomes, where A is those who have at least some college.

Union and Intersection

A \(\cup\) B is the union of Events A and B.

A or B (including A and B)

A \(\cap\) B is the intersection of A and B

A and B

A\(^C\) is the complement of A

Everything that is in S but not in A
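
Base R has set functions that mirror these operations. A minimal sketch, where the sample space S and the events A and B are made-up numbers:

```r
S <- 1:9            # hypothetical sample space with nine outcomes
A <- c(1, 2, 3, 4)  # hypothetical event A
B <- c(3, 4, 5, 6)  # hypothetical event B

a_or_b  <- union(A, B)      # A or B: 1 2 3 4 5 6
a_and_b <- intersect(A, B)  # A and B: 3 4
not_a   <- setdiff(S, A)    # complement of A within S: 5 6 7 8 9

a_or_b
a_and_b
not_a
```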

Naive Definition of Probability

What is the probability of B occurring?

Looks like 4/9, if we just take the number of outcomes in B and divide by the number of outcomes in S. But….what assumptions are we making?

S might not be finite

Some events may be more likely (have more mass) than others
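
A short sketch of the naive definition and of what changes when outcomes are not equally likely. The outcome labels and the mass vector are invented for illustration:

```r
S <- 1:9            # nine outcomes, as in the 4/9 example
B <- c(2, 3, 5, 6)  # hypothetical labels for the four outcomes in B

# Naive definition: count outcomes, assuming each is equally likely
p_naive <- length(B) / length(S)
p_naive             # 4/9

# If outcomes carry unequal mass, add up the mass in B instead
mass <- c(0.30, 0.05, 0.05, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10)  # sums to 1
p_weighted <- sum(mass[B])
p_weighted          # 0.30
```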

Multiplication Rule

Imagine a race with 25 runners, where runners are awarded Gold, Silver and Bronze medals. How many combinations of medal winners are there?

There are 25 potential first-place winners. Once we know the winner, 24 potential second-place finishers remain, and then 23 for third. In general this is n(n-1)(n-2)…, here 25 × 24 × 23.
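
The count for the race can be computed directly (the variable names are just for illustration):

```r
n_runners <- 25

# Gold, then silver, then bronze: 25 * 24 * 23
podiums <- n_runners * (n_runners - 1) * (n_runners - 2)
podiums        # 13800

# Same product written with prod()
prod(25:23)    # 13800
```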

Sampling with Replacement

Consider n objects, from which we make k choices, with replacement. Assume that order matters, so (1,2) != (2,1)

Then the number of possible outcomes is \(n^{k}\). Why?
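
One way to see it is to list every ordered outcome for a small case. This sketch uses made-up values n = 3 and k = 2; each of the k choices has n options, so the multiplication rule gives n × n:

```r
n <- 3  # number of objects
k <- 2  # number of choices

# Every ordered outcome when sampling with replacement
outcomes <- expand.grid(rep(list(seq_len(n)), k))
outcomes

nrow(outcomes)  # 9
n^k             # 9
```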

Sampling Without Replacement

This is what DALL-E 3 imagines a polling center looks like!

Consider n objects from which we make k choices without replacement.

From the multiplication rule, there are n(n-1)(n-2)…(n-k+1) possible ordered outcomes.

Let’s work through an example

What is the probability that two people in this class share a birthday?
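
A sketch of the calculation for a group of k people, under the usual simplifying assumptions (365 equally likely birthdays, no leap years, independent birthdays; these are assumptions, not facts about this class). The numerator counts birthday assignments with no match (sampling without replacement) and the denominator counts all assignments (sampling with replacement):

\[ P(\text{no shared birthday}) = \frac{365 \cdot 364 \cdots (365 - k + 1)}{365^{k}} \]

\[ P(\text{at least one shared birthday}) = 1 - \frac{365 \cdot 364 \cdots (365 - k + 1)}{365^{k}} \]

For k = 23 this is roughly 0.507.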

Plotting the Birthday Problem

Code Behind the Plot

library(ggplot2)

# Function to calculate the probability
birthday_prob <- function(n) {
  if (n > 365) return(1)
  prob <- 1
  for (i in 0:(n-1)) {
    prob <- prob * (365 - i) / 365
  }
  return(1 - prob)
}

# Number of people in the group
n <- 1:100

# Calculate probabilities
probabilities <- sapply(n, birthday_prob)

# Create a data frame for plotting
birthday_data <- data.frame(GroupSize = n, Probability = probabilities)

# Plotting the birthday problem
ggplot(birthday_data, aes(x = GroupSize, y = Probability)) +
  geom_line(color = "blue", linewidth = 1.5) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_point(color = "red", size = 2) +
  labs(title = "Birthday Problem: Probability of Shared Birthdays",
       x = "Number of People in the Group",
       y = "Probability of At Least One Shared Birthday") +
  theme_minimal()

Adjusting for over counting

How many ways are there to choose a three person committee from five people?

List them all out (123)(124)(125)(134)(135)(145)(234)(235)(245)(345). So, there are 10 ways to form this committee.

Or, we can use the multiplication rule. There are 5 ways to choose spot 1, 4 ways to choose spot 2, and 3 ways to choose spot 3, for 5 × 4 × 3 = 60. But this over counts because order does not matter: each committee appears 3! = 6 times, and 60 / 6 = 10.
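
Both routes to the committee count can be checked in R. A minimal sketch using base R's combn():

```r
# Enumerate every 3-person committee from people 1 to 5
committees <- combn(5, 3)  # each column is one committee
committees
ncol(committees)           # 10

# Multiplication rule counts ordered picks; divide by the 3! orderings
adjusted <- (5 * 4 * 3) / factorial(3)
adjusted                   # 10
```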

Binomial Coefficients

This leads to a more general concept - the Binomial Coefficient \(n\choose k\) or “n choose k”. This is the number of subsets of size k for a set of size n.

\[ \binom{n}{k} = \frac{n(n-1)(n-2)\cdots(n-k+1)}{k!} \]

if \(k \leq n\)

Equivalently

\[ \binom{n}{k} = \frac{n!}{(n-k)!\,k!} \]
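
A quick check that the two formulas agree with R's built-in choose(). The example reuses the 25-runner race, but now asks only which three runners get a medal, ignoring which medal each gets:

```r
n <- 25
k <- 3

falling_product <- prod(n:(n - k + 1)) / factorial(k)                # n(n-1)...(n-k+1) / k!
factorial_form  <- factorial(n) / (factorial(n - k) * factorial(k))  # n! / ((n-k)! k!)
built_in        <- choose(n, k)                                      # base R

falling_product  # 2300
factorial_form   # 2300 (up to floating point error)
built_in         # 2300
```

For large n the factorial form overflows (factorial(171) is Inf in R), so choose() is usually the safer option in practice.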

General Definition of Probability

A probability space consists of a sample space S and a probability function P, which takes an event A \(\subseteq\) S as input and returns P(A) (the probability of A occurring), a real number between 0 and 1, as output.

Axioms of Probability

  • P(\(\emptyset\)) = 0, P(S) = 1

  • If \(A_{1}\), \(A_{2}\) are disjoint (non-overlapping), then P(\(A_{1} \cup A_{2}\)) = P(\(A_{1}\)) + P(\(A_{2}\))

  • The following are important properties

    • P(\(A^{c}\)) = 1 - P(A)

    • If A \(\subseteq\) B, then P(A) \(\leq\) P(B)

    • P(\(A \cup B\)) = P(A) + P(B) - P(\(A \cap B\))
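
These properties can be checked numerically on a small example. This sketch assumes a made-up sample space of 12 equally likely outcomes, so that probabilities are just counts divided by 12; the events A, B and C are invented:

```r
S <- 1:12  # hypothetical sample space, equally likely outcomes
A <- 1:6
B <- 4:8
C <- 1:2   # a subset of A

# Probability of an event under the equally-likely assumption
p <- function(event) length(event) / length(S)

complement_ok <- isTRUE(all.equal(p(setdiff(S, A)), 1 - p(A)))
subset_ok     <- p(C) <= p(A)
union_ok      <- isTRUE(all.equal(p(union(A, B)),
                                  p(A) + p(B) - p(intersect(A, B))))

c(complement_ok, subset_ok, union_ok)  # TRUE TRUE TRUE
```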

Switching Gears

Calling Functions in R

  • In R, we often want to call a function that someone else has written

    • Some functions in base R, most from packages

    • Be careful about name clashes

  • Functions take a range of prespecified arguments

    • Some are required, others are options

    • Some have defaults if you do not specify, others don’t

    • May also have an order in the function’s syntax
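
A short sketch of these points using mean() and median() from base R. The vector x is made up:

```r
x <- c(2, 4, NA, 8)

# x is required; na.rm is optional and defaults to FALSE
m_default <- mean(x)                # NA, because the missing value is kept
m_named   <- mean(x, na.rm = TRUE)  # 4.666667

# Named arguments can be given in any order
m_reordered <- mean(na.rm = TRUE, x = x)

# Unnamed arguments are matched by position: x, then trim, then na.rm
m_positional <- mean(x, 0, TRUE)

# pkg::function() states which package's function you mean,
# which avoids surprises when two loaded packages share a name
med <- stats::median(x, na.rm = TRUE)  # 4
```

Naming arguments is usually easier to read than relying on position, especially for options beyond the first one or two.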

GGplot and the Grammar of Graphics

  • While it is possible to plot in base R (without loading a package), nearly everyone uses ggplot2 instead

  • Developed by Hadley Wickham, building on Leland Wilkinson’s “Grammar of Graphics”

  • A plot is composed of three core elements (and other options)

    • The Data

    • The Aesthetic Mapping of variables to visual cues

    • The Geometry used to encode observations (the type of plot)

Data - Defining Some Terms

  • Variable: Something that you can measure

    • Can be continuous, categorical, count, ordered, etc
  • Value: The state of the variable when you measure it. Usually the state of a variable varies.

  • Observation: A set of measurements of some object - usually a row in the data.

  • Tidy Data: Data formatted such that each observation is one row and each variable is one column.

A very simple example

library(ggplot2)
library(palmerpenguins)
ggplot(data = penguins, 
       mapping = aes(x = bill_length_mm,
                     y = bill_depth_mm,
                     color = species)) +
    geom_point()

Making it fancy

library(ggplot2)
library(palmerpenguins)

ggplot(data = penguins, 
       mapping = aes(x = bill_length_mm,
                     y = bill_depth_mm,
                     color = species)) +
    geom_point() +
    geom_smooth(method = "lm", se = TRUE) +  # Add group-specific OLS regression lines with uncertainty
    ggtitle("Characteristics of Penguins by Species") +  # Add title
    labs(x = "Bill Length (mm)", y = "Bill Depth (mm)") +  # Format axes into plain English
    theme_minimal() +  # Apply a minimal theme
    theme(axis.text = element_text(size = 12),  # Increase axis number size
          axis.title = element_text(size = 14),  # Increase axis label size
          plot.title = element_text(size = 16, hjust = 0.5))  # Center and increase title size

What is the Goal?

We want to convey information to the reader

  • clearly

  • efficiently

  • as simply as possible

Some Basic Rules

  • For most plots, we want the outcome or dependent variable on the Y axis

  • Similarly, the key independent variable should be on the X axis

  • Think carefully about axes. Use a single scale, with sensible values!

  • Clearly label plots. Rename variables so reader can understand

More Rules

  • Use colorblind friendly color schemes

    • In R, the default scheme has improved.

    • The palette.colors function returns even better color schemes

  • As a general rule (there may be exceptions)

    • Don’t use pie graphs, they are hard to interpret

    • Similarly, 3D graphics rarely add anything to plots

  • If you want to track trends across different groups/subjects, consider faceting.
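
A minimal sketch of pulling a colorblind-friendly palette with palette.colors(). It assumes R 4.0.0 or later; the choice of the "Okabe-Ito" palette and of three colors is just for illustration:

```r
# Names of the palettes that palette.colors() knows about
palette.pals()

# First three colors of the Okabe-Ito palette, as hex codes
cb_colors <- palette.colors(n = 3, palette = "Okabe-Ito")
cb_colors

# One possible use in a ggplot2 plot (not run here):
#   + scale_color_manual(values = unname(cb_colors))
```

The unname() call matters because scale_color_manual() tries to match a named vector to the levels of the mapped variable.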

Examples of Bad Graphics

Too Fancy

A Better Graphic

Another Decent(?) Graphic

Scatter Plots

  • Useful for understanding the bivariate relationship (or joint distribution) between two variables

  • We often add a line of fit (linear or smoothed) to make the relationship more explicit

  • In ggplot2, geom_point() tells R to make a scatter plot

Bar Plots

  • Great to summarize a distribution

    • Particularly with discrete or categorical data

    • Can be used with continuous data too (Histogram)

      • pay special attention to bin size with continuous data
  • Use geom_bar() (counts), geom_col() (summary statistics) or geom_histogram() (continuous data with bins)
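
A sketch of how bin size changes a histogram. It uses the built-in iris data so it runs without extra packages beyond ggplot2; the two binwidth values are arbitrary choices for illustration:

```r
library(ggplot2)

# Narrow bins: more detail, but also more noise
p_narrow <- ggplot(data = iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.1) +
  labs(x = "Sepal Length (cm)", y = "Count") +
  theme_minimal()

# Wide bins: smoother, but can hide features of the distribution
p_wide <- ggplot(data = iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.5) +
  labs(x = "Sepal Length (cm)", y = "Count") +
  theme_minimal()

p_narrow
p_wide
```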

Example

library(ggplot2)
library(palmerpenguins)

ggplot(data = penguins, aes(x = species)) +
  geom_bar() +
  ggtitle("Count of Penguins by Species") +  # Add title
  labs(x = "Species", y = "Count") +  # Add axis labels
  theme_minimal()  # Minimal theme for a cleaner look

Summary Stat Example

library(ggplot2)
library(palmerpenguins)
library(tidyverse)

# Pre-compute the mean bill length for each species
penguin_summary <- penguins %>%
  group_by(species) %>%
  summarize(mean_bill_length = mean(bill_length_mm, na.rm = TRUE))

ggplot(data = penguin_summary, aes(x = species, y = mean_bill_length)) +
  geom_col() +
  ggtitle("Average Bill Length by Species") +  # Add title
  labs(x = "Species", y = "Average Bill Length (mm)") +  # Add axis labels
  theme_minimal()  # Minimal theme for a cleaner look

Density Plots

  • Good for summarizing continuous data

    • Personally, I prefer these to histograms
  • geom_density()

  • X axis is the variable, Y axis is probability density (non-negative, with the total area under the curve equal to 1 - will define soon)

  • can smooth the curve with adjust

    • don’t overdo it!
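
A sketch of what adjust does, using base R's density() on the built-in faithful data (swapped in here so the example needs no packages). geom_density() has an adjust argument with the same meaning: it multiplies the default bandwidth.

```r
x <- faithful$eruptions  # built-in data: eruption durations in minutes

d_default <- density(x)              # adjust = 1, the default bandwidth
d_smooth  <- density(x, adjust = 4)  # four times the default bandwidth

# Larger adjust means a wider bandwidth and a smoother curve
bw_ratio <- d_smooth$bw / d_default$bw
bw_ratio  # 4

# Rough numerical area under each curve; both are approximately 1
area <- function(d) sum(d$y) * diff(d$x[1:2])
area(d_default)
area(d_smooth)
```

These data have two clear peaks at the default bandwidth; with a large adjust the peaks blur together, which is the sense in which smoothing can be overdone.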

Example

library(ggplot2)
library(palmerpenguins)

ggplot(data = penguins, aes(x = bill_length_mm)) +
  geom_density(fill = "lightblue", alpha = 0.6) +  # Add fill color and transparency
  ggtitle("Density Plot of Bill Lengths") +  # Add title
  labs(x = "Bill Length (mm)", y = "Density") +  # Add axis labels
  theme_minimal()  # Minimal theme for a cleaner look

Also useful for comparing distributions

Code

ggplot(data = penguins, aes(x = bill_length_mm, fill = species)) +
  geom_density(alpha = 0.5) +  # Add density curves with transparency
  ggtitle("Density Plot of Bill Lengths by Species") +  # Add title
  labs(x = "Bill Length (mm)", y = "Density") +  # Add axis labels
  theme_minimal()  # Minimal theme for a cleaner look

Box Plots

  • Another useful way to plot the distribution

  • Visually

    • The middle line is the median

    • The box spans the 25th - 75th percentile (Interquartile range)

    • The whiskers (usually) extend to the most extreme observations within 1.5 X the interquartile range of the box

    • Points beyond the whiskers are plotted individually as potential outliers

  • geom_boxplot()
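
A base R sketch of the arithmetic behind a box plot, using made-up data with one extreme value. Exact quartile conventions can differ slightly between functions, so treat this as an illustration of the idea rather than a description of what any particular plotting function computes:

```r
x <- c(2, 4, 5, 6, 7, 8, 9, 11, 30)  # made-up data

q1  <- unname(quantile(x, 0.25))  # 5, bottom of the box
med <- median(x)                  # 7, middle line
q3  <- unname(quantile(x, 0.75))  # 9, top of the box
iqr <- q3 - q1                    # 4, interquartile range

# Fences sit 1.5 * IQR beyond the edges of the box
lower_fence <- q1 - 1.5 * iqr     # -1
upper_fence <- q3 + 1.5 * iqr     # 15

inside       <- x[x >= lower_fence & x <= upper_fence]
whisker_ends <- range(inside)     # 2 and 11
outliers     <- setdiff(x, inside)  # 30
```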

Example

ggplot(data = penguins, aes(x = species, y = bill_length_mm, fill = sex)) +
  geom_boxplot() +
  ggtitle("Box Plot of Bill Lengths by Species and Sex") +  # Add title
  labs(x = "Species", y = "Bill Length (mm)") +  # Add axis labels
  theme_minimal()  # Minimal theme for a cleaner look

Facets
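
A minimal faceting sketch. It uses the built-in iris data (an assumption made so the example runs with only ggplot2 installed) and facet_wrap() to draw one panel per species:

```r
library(ggplot2)

p_facets <- ggplot(data = iris,
                   mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  facet_wrap(~ Species) +  # one panel per species, shared axes by default
  labs(x = "Sepal Length (cm)", y = "Sepal Width (cm)") +
  theme_minimal()

p_facets
```

Shared axes across panels make it easier to compare groups directly.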

Preview: Text Data

Word Cloud Code

library(dplyr)
library(tidytext)
library(wordcloud2)
library(gutenbergr)

# Download "Hamlet" by Shakespeare (Project Gutenberg ID 1787) from a specific mirror
gutenberg_mirror <- "https://gutenberg.pglaf.org"
hamlet <- gutenberg_download(1787, mirror = gutenberg_mirror)

# Tokenize the text into words and remove stop words
data(stop_words)

hamlet_words <- hamlet %>%
  unnest_tokens(word, text) %>%  # Tokenize into words
  anti_join(stop_words, by = "word") %>%  # Remove stop words
  count(word, sort = TRUE)   # Count word frequencies

# Generate the word cloud
wordcloud2(data = hamlet_words, size = 0.7, color = "random-light", backgroundColor = "black")

Note - Word clouds look cool, but there are probably better ways to graph text data

Your Turn

  • R comes with built-in datasets, including mtcars and iris. If you love penguins, you can instead install and load palmerpenguins
    • useful toy data sets
  • Pick one (or try both) and explore the data:
    • call summary and head on the data
      • for example summary(mtcars) will give you info about the mtcars dataset
  • Plot some interesting relationships + play around with ggplot features