Exploring Relationships between Variables

Will Horne

Quick Review

  • Last class, we saw how to explore one variable at a time

    • Creating frequency tables and proportion tables

    • Creating well formatted histograms

    • Computing descriptive statistics, like means, medians and standard deviations

  • But…most data analyses are about explaining relationships between variables

Today’s Plan

  • Today, we will start exploring how to describe the relationships between variables using

    • Scatterplots, which effectively visualize the relationship between variables

    • Correlation, which gives a numerical summary of how two variables are related

  • The Goal: Learn how to determine if there is a strong positive or negative relationship between two variables.

Scatterplots

A scatterplot lets us visualize the relationship between two variables by plotting the observations in two-dimensional space

Each dot on a scatterplot is a single observation i whose position is determined by the values (\(X_{i}\), \(Y_{i}\))

Understanding Scatterplots

Imagine we have two variables:

i X Y
1 4 2
2 8 5
3 10 3

We can create the scatterplot of \(X\) and \(Y\) by plotting the dot for each observation one at a time.

So, we put the first point at (4,2), the second at (8,5), and so on.

Simple Scatterplot

Scatterplots in R

R functions to create a scatterplot: ggplot() + geom_point(), from the ggplot2 package

How many components are required?

Three main components: data, aesthetic mappings, and geometry layer

Basic structure:

Code
  ggplot(data, aes(x = X, y = Y)) + geom_point()

Components

Code
  ggplot(data, aes(x = X, y = Y)) + geom_point()
  • Component breakdown:
    • First: data (your dataset)
    • Second: aes() (aesthetic mappings defining which variables map to x and y)
    • Third: geom_point() (the geometry layer that creates the points)
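
As a minimal sketch, the three components can be assembled for the small three-observation table shown earlier. The names `toy` and `p` are illustrative choices rather than names from the course materials, and the code assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Component 1: the data (the three observations from the small X/Y table)
toy <- data.frame(X = c(4, 8, 10), Y = c(2, 5, 3))

# Component 2: aesthetic mappings (X on the x-axis, Y on the y-axis)
p <- ggplot(toy, aes(x = X, y = Y)) +
  # Component 3: the geometry layer that draws one point per row
  geom_point()

print(p)
```

Storing the plot in an object before printing is optional; typing the ggplot() call directly at the console draws the same figure.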

Example

Code
ggplot(star, aes(x = reading, y = math)) + geom_point()

Fancier Plot

Code
ggplot(star, aes(x = reading, y = math)) + 
  geom_point(color = "#E74C3C", alpha = 0.7) +
  labs(
    title = "Relationship Between Reading and Math Scores",
    x = "Reading Score",
    y = "Math Score"
  ) +
  theme_minimal()

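Here labs() adds a title and x- and y-axis labels. Inside geom_point(), color sets the point color and alpha makes the points partly transparent; theme_minimal() applies a cleaner theme.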

Adding a Color Aesthetic

It can be informative to color the points according to the value of a third variable

Code
ggplot(star, aes(x = reading, y = math, color = graduated)) + 
  geom_point() +
  labs(
    title = "Relationship Between Reading and Math Scores",
    x = "Reading Score",
    y = "Math Score"
  ) +
  theme_minimal()


STAR Data

We are going to work with the STAR data again, so let’s load that in and look at the data.

  classtype reading math graduated
1     small     578  610         1
2   regular     612  612         1
3   regular     583  606         1
4     small     661  648         1
5     small     614  636         1
6   regular     610  603         0

What is the unit of analysis? How can I interpret the first row?

Your Turn

Create a scatterplot with math score on the X axis and reading score on the Y axis. Add a color aesthetic to make the dots different colors based on classtype.

Give the plot a sensible title and axis labels.

Plot

Code
ggplot(star, aes(x = math, y = reading, color = classtype)) + 
  geom_point() +
  labs(
    title = "Relationship Between Reading and Math Scores",
    x = "Math Score",
    y = "Reading Score"
  ) +
  theme_minimal()

Interpretation

  • What does each dot represent?
  • Would you prefer to be in the lower left corner or the upper right corner?
  • What does this scatterplot reveal about the relationship between reading and math scores?
  • Next: We need a numerical measure to summarize the relationship we are seeing

The Correlation Coefficient

Correlation

The Correlation Coefficient, or Correlation, summarizes the strength of the linear relationship between two variables. We often denote the correlation coefficient as r.

Ranges From -1 to 1

If the correlation coefficient is positive, the variables tend to move in the same direction.

If the correlation coefficient is negative, the variables tend to move in opposite directions.

If the correlation coefficient is near zero, there is a weak or non-existent relationship between the variables.

Graphical View of Correlation

How Correlation is Calculated

The correlation coefficient (r) measures the strength and direction of a linear relationship between two variables.

Formula: \[r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}\]

What does all this mean?

Interpretation

In simple terms:

  • For each observation, see how far X is from its mean and how far Y is from its mean

  • Multiply these deviations together

  • Sum them up and standardize (divide by the square root of the product of the summed squared deviations of X and Y)

Ranges from -1 to +1

  • r = +1: perfect positive relationship
  • r = -1: perfect negative relationship
  • r = 0: no linear relationship
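
The steps above can be traced by hand in R. This sketch uses the three observations from the small X/Y table earlier in the slides; the variable names (`dx`, `dy`, `r_by_hand`, and so on) are illustrative.

```r
# Toy data from the small X/Y table earlier in the slides
x <- c(4, 8, 10)
y <- c(2, 5, 3)

# Step 1: how far each observation is from its mean
dx <- x - mean(x)
dy <- y - mean(y)

# Step 2: multiply the deviations together and sum them (numerator)
numerator <- sum(dx * dy)

# Step 3: standardize by the square root of the product of the
# summed squared deviations (denominator)
denominator <- sqrt(sum(dx^2) * sum(dy^2))

r_by_hand <- numerator / denominator
r_by_hand    # 0.5
cor(x, y)    # the built-in function gives the same value
```

For these three points the numerator is 14/3 and the denominator is 28/3, so r works out to 0.5.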

Interpretation Revisited

The correlation coefficient captures both the direction and the strength of the linear association. How?

Direction (+/-): Do the two variables tend to move in the same direction (+) or in opposite directions (-)?

Strength: How close are the points to the line of best fit (the line that best summarizes the data)? Closer = Stronger

Correlation Decomposition

First, the correlation coefficient can be decomposed into:

cor(X,Y) ∈ [-1,1]

  • Sign:
    • negative: cor(X,Y) ∈ [-1,0)
    • positive: cor(X,Y) ∈ (0,1]
  • Absolute value: |cor(X,Y)| ∈ [0,1]
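
In R, this decomposition might be sketched with the base functions sign() and abs(). The value -0.9 is just an example input.

```r
r <- -0.9   # example value; any correlation in [-1, 1] works

sign(r)     # -1 for negative, 0 for zero, 1 for positive
abs(r)      # 0.9
```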

Examples

Example 1: cor(X,Y) = -0.9

  • sign: negative
  • absolute value: 0.9

Example 2: cor(X,Y) = 0.1

  • sign: ___________
  • absolute value: ___________

Even More Interpretation

  • The sign reflects the direction of the linear association:

    • \(\text{cor}(X,Y) > 0\) when variables tend to move in the same direction; slope of line of best fit is positive

    • \(\text{cor}(X,Y) < 0\) when variables tend to move in opposite directions; slope of line of best fit is negative

  • The absolute value reflects the strength of the linear association

    • \(|\text{cor}(X,Y)| = 0\), no linear association

    • \(|\text{cor}(X,Y)| = 1\), perfect linear association

    • The closer to 1, the stronger the association

Example of change in direction

Which scatterplot shows a positive linear association? ___________

A

B

Imagine the line of best fit:

  • if slope is positive (line moves upwards from left to right), the linear association and the correlation are both positive

  • if slope is negative (line moves downwards), the linear association and the correlation are both negative
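
A small simulation can illustrate the link between the sign of the slope and the sign of the correlation. Everything here is made up for illustration: the seed, sample size, slopes, and noise level are arbitrary assumptions, not values from the STAR data.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

x <- runif(100, min = 0, max = 10)
noise <- rnorm(100, sd = 2)

y_up   <- 3 + 1.5 * x + noise   # built with a positive slope
y_down <- 3 - 1.5 * x + noise   # built with a negative slope

cor(x, y_up)     # positive
cor(x, y_down)   # negative

# The slope of the line of best fit has the same sign as the correlation
slope_up   <- coef(lm(y_up ~ x))["x"]
slope_down <- coef(lm(y_down ~ x))["x"]
slope_up
slope_down
```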

Example of change in strength

Which scatterplot shows a weaker linear association?

A

B

Imagine the line of best fit:

  • the closer the dots are to the line of best fit, the stronger the linear association, the closer |cor(X,Y)| to 1

  • the farther the dots are from the line of best fit, the weaker the linear association, the closer |cor(X,Y)| to 0
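
The same idea can be sketched with simulated data: two variables built around the same line, one with little scatter and one with a lot. The seed, slope, and noise levels are arbitrary assumptions chosen for illustration.

```r
set.seed(2)  # arbitrary seed so the simulation is reproducible

x <- runif(200, min = 0, max = 10)

# Same underlying line (slope 2), different amounts of scatter around it
y_tight <- 2 * x + rnorm(200, sd = 1)    # dots close to the line
y_loose <- 2 * x + rnorm(200, sd = 15)   # dots far from the line

cor(x, y_tight)   # close to 1
cor(x, y_loose)   # noticeably closer to 0
```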

Special cases:

Examples:

cor(X,Y) = -0.9

  • sign: negative → negative linear association
  • absolute value: 0.9 → strong linear association

cor(X,Y) = 0.2

  • sign: __________ → __________ linear association
  • absolute value: __________ → __________ linear association

Note: We use the terms weak and strong here to build intuition, but what counts as a weak correlation in one field may be considered strong in another.

STAR Scatterplot

Let’s return to the scatterplot of reading and math

Is the association positive or negative? Does this seem like a strong association?

Correlation in R

The R function to calculate correlation is cor()

What are the required arguments? The variables:

Code
cor(data$x, data$y)

Order of variables doesn’t matter; cor(X,Y) = cor(Y, X)
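
A quick check of the symmetry, using the three observations from the small X/Y table earlier in the slides. The second half is a hedged aside: if a variable contains missing values, cor() returns NA by default, and the `use = "complete.obs"` argument drops incomplete rows first. The data frame `with_na` and its values are hypothetical.

```r
toy <- data.frame(X = c(4, 8, 10), Y = c(2, 5, 3))

cor(toy$X, toy$Y)   # 0.5
cor(toy$Y, toy$X)   # same value; order does not matter

# Hypothetical data with one missing value
with_na <- data.frame(X = c(4, 8, 10, 12), Y = c(2, NA, 3, 7))

r_default  <- cor(with_na$X, with_na$Y)                        # NA
r_complete <- cor(with_na$X, with_na$Y, use = "complete.obs")  # uses rows 1, 3, 4 only
```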

Go ahead and calculate the correlation between math and reading in the star data.

Common Misconceptions

A steeper line of best fit does not necessarily mean a higher correlation in absolute terms, or vice versa

Which scatterplot has a steeper line of best fit?
Which shows a higher correlation?
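
One way to see this in R is to stretch Y vertically. Multiplying Y by 10 makes the line of best fit ten times steeper but leaves the correlation unchanged. This sketch reuses the three observations from the small X/Y table; `y_scaled` is an illustrative name.

```r
x <- c(4, 8, 10)
y <- c(2, 5, 3)
y_scaled <- 10 * y   # same pattern, stretched vertically

slope_original <- coef(lm(y ~ x))["x"]          # 0.25
slope_scaled   <- coef(lm(y_scaled ~ x))["x"]   # 2.5, ten times steeper

cor(x, y)          # 0.5
cor(x, y_scaled)   # still 0.5
```

The slope depends on the units of the variables, while the correlation does not.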

Common Misconceptions

A correlation of zero does not necessarily mean that there is no relationship between the variables; only that there is no linear relationship

  • cor(X,Y) ≈ 0 but there is a perfect quadratic relationship
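
A minimal sketch of such a case, using made-up values: Y is exactly X squared, so Y is fully determined by X, yet the correlation is zero because the relationship is not linear.

```r
x <- -5:5    # values symmetric around zero
y <- x^2     # perfect quadratic relationship

cor(x, y)    # 0 (up to floating-point rounding)
```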

Correlation and Causation

Correlation != Causation

Correlation does not necessarily imply causation

  • Just because two variables, \(X\) and \(Y\), are highly correlated (that is, have a strong linear association, regardless of the sign) does not necessarily mean that changes in \(X\) cause changes in \(Y\) or vice versa

    • a third variable could be affecting both \(X\) and \(Y\), making them move together or in opposite directions
  • In the STAR data, reading and math are highly correlated, but this does not mean that improving reading scores will necessarily cause math scores to improve

    • it is likely that students who are engaged learners, have good study habits, or receive strong educational support at home tend to perform well in both subjects
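
The third-variable story can be sketched with a simulation. This is simulated data under stated assumptions, not the STAR data: `z` stands in for a hypothetical common factor (such as general engagement), and `x` and `y` each depend on `z` but not on each other.

```r
set.seed(3)  # arbitrary seed so the simulation is reproducible
n <- 1000

# Hypothetical third variable that affects both X and Y
z <- rnorm(n)

# X and Y each depend on Z plus their own noise; neither affects the other
x <- z + rnorm(n, sd = 0.5)
y <- z + rnorm(n, sd = 0.5)

r_xy <- cor(x, y)   # strongly positive, driven entirely by Z
r_xy

# After removing the part of each variable explained by Z,
# the leftover variation is essentially uncorrelated
r_leftover <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))
r_leftover
```

A high correlation between `x` and `y` appears even though, by construction, changing one would do nothing to the other.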

Quick Review

  • Would you expect the correlation between outside temperature and heating bills to be positive or negative?
  • Would you expect the correlation between age of a car and its resale value to be positive or negative?
  • Would you expect the correlation between hours spent studying and exam scores to be positive or negative?