Post 8000: Introduction

Will Horne

Who am I?

  • I am a Political Scientist

    • A comparativist who studies political parties and elections

    • Also interested in research methods and design

  • I just got here…official start date was August 15th

    • PhD Princeton 2022

How did I get here?

What do I study?

Enough about me!

Now let’s talk about you for a minute

Goals of the course

The goals of this course are:

  1. Introduce the foundations of quantitative social science (aka computational social science)

    • Measurement Theory

    • Probability Theory

    • Descriptive and Causal Inference

  2. Link QSS foundations to policy relevant questions

  3. Introduce the R coding language for data analysis

Course Sequence

  • Department is updating graduate methods sequence

    • Part of why I was hired
  • In practice, this is the first course in a sequence

    • We will cover roughly 101 level stats –> OLS Regression. Roughly 2 semesters of (rigorous) UG stats.

    • Spring 2025: advanced regression models + design based techniques for causal inference (DiD, Experiments, etc)

  • Currently no ML/AI or Text courses, but if you have interest stop by

Why Quantitative Social Science?

  • Not Quant vs Qual

    • Quant and qual

    • Mixed methods research often stronger than pure qual or pure quant

      • I’ve done archival work myself, and other courses are specifically devoted to qualitative research methods
    • Commonalities in thinking hard about good research design
  • New techniques –> wider range of research questions

Motivation

Source: United States Bureau of Labor Statistics

Course Outline

  • Early on, the focus is on getting up and running in R

  • Measurement Theory -> Probability Theory -> Regression

  • Will require some math

    • Calculus (derivatives, integrals, limits) + a little matrix algebra

    • We will review! Not a bad idea to check out Khan Academy or similar if rusty/new

  • Grad school –> Learning is your responsibility, be proactive.

Expectations

  • This course will be hard but…

    • Don’t stress, grad school is not about grades!

      • Goal of this course is to give you tools for your future work
  • If you are lost… stop me and ask questions

    • Office hours: By appointment (online or in-person). Please utilize!

    • Please read! Tons of online resources for both statistics and coding. If you don’t like the readings, feel free to supplement w/ something else.

Course Books

  • Joseph Blitzstein and Jessica Hwang, Introduction to Probability

  • Hadley Wickham et al, R for Data Science (second edition)

  • Blackwell (Free) or Gelman et al ($ but more user friendly) for Regression

  • Lauderdale for Measurement

Other Course Readings

My plan is, once we have a solid foundation, to add roughly one to two social science articles a week, so you can see the connection to doing good research

What the readings are will, in part, depend on your interests. I have a few in mind, but the goal is to draw connections to topics you care about

Assignments and Grading

  • 20% Attendance and Participation

  • 25% Midterm Exam (Likely Take Home)

  • 25% Problem sets (3 or 4)

  • 30% Final Project (Research Proposal with Analyses)

Getting Started

  • This course has two main aims

    • Teaching the fundamentals of statistics for social scientists

    • Teaching the fundamentals of coding for data analysis using R

  • We will start in on the statistics side next time; the rest of today will be devoted to getting up and running in R

Why R?

  • Open Source and Free (unlike STATA, SPSS, SAS, etc)

  • Widely used in academic and government research

  • Specifically developed for statistical analysis (unlike Python)

  • Has a friendly IDE, RStudio (unlike Python imo)

Everything we do can be done in Python if you prefer. I use both for my work.

But…

Please no SPSS/STATA/SAS

R and RStudio

R can be used as a calculator

x <- 6 + 5 + 4

x
[1] 15
y <- (9 * 3)/ 27

y
[1] 1
z <- 2^3

z
[1] 8
a <- "the small brown"
b <- "fox"

## paste strings together
c <- paste(a, b, sep = " ")

c
[1] "the small brown fox"

Data Types

Numeric - 0, 1, 2, 2.25, 3.14, -100

Integer - 1,2,3,4,5

Logical - TRUE/FALSE (or T/F)

String - “The small brown fox”

Factor - Categorical (“South Carolina”, “Georgia”, “Florida”) or Ordinal (“Bad”, “Ok”, “Good”)

Date - “9/2/23”, “2024-5-25”, “13/01/01” or “September 1, 2024”

Not Exhaustive, but these are the basics
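
A quick way to see which type R has assigned to a value is class(). The sketch below runs through the types listed above. It assumes "9/2/23" is written US-style as month/day/year, which the slide does not state; the variable names (states, ratings, d) are only for illustration.

```r
# Inspect the class R assigns to a few example values
class(3.14)                    # "numeric"
class(5)                       # "numeric": a bare 5 is stored as a double
class(5L)                      # "integer": the L suffix requests an integer
class(TRUE)                    # "logical"
class("The small brown fox")   # "character" (R's name for strings)

# Factors: unordered (categorical) and ordered (ordinal)
states <- factor(c("South Carolina", "Georgia", "Florida"))
ratings <- factor(c("Bad", "Ok", "Good"),
                  levels = c("Bad", "Ok", "Good"), ordered = TRUE)
class(states)                  # "factor"
class(ratings)                 # "ordered" "factor"
ratings[1] < ratings[3]        # TRUE: "Bad" ranks below "Good"

# Dates: tell R how the text is laid out
# (assumption: month/day/two-digit year)
d <- as.Date("9/2/23", format = "%m/%d/%y")
class(d)                       # "Date"
format(d)                      # "2023-09-02"
```

Note that R calls strings "character", and that whole numbers are stored as numeric unless you ask for an integer explicitly.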

Data Types Matter
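
One small illustration of why types matter, as a sketch with made-up values: numbers that were read in as text sort and behave differently from real numbers.

```r
# Numbers stored as text (a common result of messy data files)
as_text <- c("10", "9", "2")
sort(as_text)                  # "10" "2" "9": compared character by character

# Convert to numeric to get the behavior you probably wanted
as_number <- as.numeric(as_text)
sort(as_number)                # 2 9 10
mean(as_number)                # 7
```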

Packages

We often need to install and load packages to use package specific functions

Sometimes, packages you want may not be in CRAN. There will usually be package specific installation instructions.
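
A minimal sketch of checking for and installing packages. The install lines are commented out so nothing is downloaded by accident, and "someuser/somepackage" is a hypothetical placeholder rather than a real repository.

```r
# Packages this session needs; tidyverse is the one loaded in these slides
needed <- c("tidyverse")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]
missing_pkgs                   # names of packages still to install, if any

# Install from CRAN only what is missing (uncomment to run)
# install.packages(missing_pkgs)

# Packages hosted on GitHub are often installed with the remotes package;
# the repository name below is a hypothetical placeholder
# remotes::install_github("someuser/somepackage")
```

Installing is a one-time step per machine; loading with library() is needed in every new R session.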

Loading Packages

library(tidyverse)

Documentation

Vectors

## create vectors
a <- c(1,2,3,4,5)
b <- c(-1,-1,-1,-1,-1)

We can do operations on vectors

a + b
[1] 0 1 2 3 4
a * b
[1] -1 -2 -3 -4 -5

What about a * 5?

a * 5
[1]  5 10 15 20 25
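
What happens in a * 5 is called recycling: R repeats the shorter vector until it matches the longer one. The sketch below also shows the case where the lengths do not divide evenly; the vector named short is made up for illustration, and the warning is silenced only to keep the demo output clean.

```r
a <- c(1, 2, 3, 4, 5)

# A single number is recycled to match the length of a
a * 5                          # 5 10 15 20 25
a * c(5, 5, 5, 5, 5)           # same result, written out in full

# A shorter vector is recycled too; R warns when lengths do not divide evenly
short <- c(-1, -1, -1, -1)
recycled <- suppressWarnings(a + short)
recycled                       # 0 1 2 3 4

# Checking lengths first catches this kind of mismatch
length(a) == length(short)     # FALSE
```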

Errors

Some data types cannot be combined

## what happens if you add a string to a number
a + "cat"
Error in a + "cat": non-numeric argument to binary operator

And some probably shouldn’t be combined

## What happens if you add or subtract logicals?
a - FALSE
[1] 1 2 3 4 5
a - TRUE
[1] 0 1 2 3 4

Note - ALWAYS verify that your output makes sense and that your code is performing the correct operations. Errors are ok - but sometimes you’ll make mistakes that don’t cause errors.
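
A few ways to check your work, sketched with the same vector a. These are general-purpose base R tools, not a prescribed workflow for this course.

```r
a <- c(1, 2, 3, 4, 5)

# Logicals are silently converted to numbers: TRUE becomes 1, FALSE becomes 0
as.numeric(TRUE)               # 1
sum(c(TRUE, FALSE, TRUE))      # 2

# Check types before doing arithmetic
is.numeric(a)                  # TRUE
is.numeric("cat")              # FALSE

# tryCatch() captures an error message instead of stopping the script
msg <- tryCatch(a + "cat", error = function(e) conditionMessage(e))
msg                            # the error text, stored as a string

# stopifnot() halts with an error if a result is not what you expect
stopifnot(length(a - TRUE) == length(a))
```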

Matrices

Matrices are just multiple vectors of the same length combined, with dimension row x column

## create vectors
a <- c(1,2,3,4,5)
b <- c(-1,-1,-1,-1, -1)
## bind columns
matrix <-cbind(a, b)

In R, every entry of a matrix must have the same data type, and column names are optional. This isn’t great for data analyses.

c <- c("The", "small", "brown", "fox", "was")
## try to bind columns: this runs, but the numbers are converted to strings
matrix_2 <- cbind(matrix, c)
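
To see what binding a string column does to a numeric matrix, inspect the result with typeof(). This is a sketch using fresh names (num_matrix, mixed_matrix, words) so that it runs on its own.

```r
a <- c(1, 2, 3, 4, 5)
b <- c(-1, -1, -1, -1, -1)
words <- c("The", "small", "brown", "fox", "was")

num_matrix <- cbind(a, b)
typeof(num_matrix)             # "double"
dim(num_matrix)                # 5 2 (rows, columns)

# Binding a character column succeeds, but every entry becomes a string
mixed_matrix <- cbind(num_matrix, words)
typeof(mixed_matrix)           # "character"
unname(mixed_matrix[1, "a"])   # "1": text, no longer a number
```

No error is raised here, which is exactly the kind of silent mistake worth checking for.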

Data Frames

Data frames solve the problems with matrices, and are much better for data manipulation.

df <- data.frame(matrix)
## add the column of strings
df <- cbind(df, c)
## rename the columns using the colnames function
colnames(df) <- c("some_numbers", "other_numbers", "strings")

An example of simple data manipulation using tidyverse syntax

## keep rows where some_numbers > 3, then add a column with the row sums
result <- df %>%
  filter(some_numbers > 3) %>%
  mutate(Sum = some_numbers + other_numbers)

result
  some_numbers other_numbers strings Sum
1            4            -1     fox   3
2            5            -1     was   4
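
For comparison, here is a sketch of the same two steps in base R, with no packages loaded. The name result_base is made up for this example; the data frame is rebuilt so the block runs on its own.

```r
# Rebuild the example data frame without any packages
df <- data.frame(
  some_numbers  = c(1, 2, 3, 4, 5),
  other_numbers = c(-1, -1, -1, -1, -1),
  strings       = c("The", "small", "brown", "fox", "was")
)

# Keep rows where some_numbers > 3 (the filter step)
result_base <- df[df$some_numbers > 3, ]

# Add a new column (the mutate step)
result_base$Sum <- result_base$some_numbers + result_base$other_numbers

result_base$Sum                # 3 4
nrow(result_base)              # 2
```

The tidyverse version reads top to bottom as a sequence of verbs, which is a large part of its appeal for data work.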

Data Frames (2)

We can view the data frame either by calling View() in our code or by opening it from the Environment panel in RStudio.

View(df)

Access specific columns with $

df$some_numbers
[1] 1 2 3 4 5
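
A few other common ways to look inside a data frame, sketched with the same example data (rebuilt here so the block runs on its own).

```r
df <- data.frame(
  some_numbers  = c(1, 2, 3, 4, 5),
  other_numbers = c(-1, -1, -1, -1, -1),
  strings       = c("The", "small", "brown", "fox", "was")
)

df[["other_numbers"]]          # same idea as df$other_numbers
df[2, "strings"]               # row 2 of the strings column: "small"
df[df$strings == "fox", ]      # all columns for rows matching a condition

# Quick summaries that print to the console
str(df)                        # column names, types, and first values
head(df, 3)                    # first three rows
mean(df$some_numbers)          # 3
```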

Comment your code!

Real example of some code to create a figure

The Resulting Figure

Resources

Stack Exchange - Active community of R users giving advice. You can also pose questions to the community.

R for Data Science - Free book by the creator of the Tidyverse suite of packages.

GPT - GPT is pretty good at R. Will make mistakes, dangerous if you don’t know what you are doing!