
Last class, we saw how to explore one variable at a time
Creating frequency tables and proportion tables
Creating well formatted histograms
Computing descriptive statistics, like means, medians and standard deviations
But…most data analyses are about explaining relationships between variables
Today, we will start exploring how to describe the relationships between variables using
Scatterplots, which effectively visualize the relationship between variables
Correlation, which gives a numerical summary of how two variables are related
The Goal: Learn how to determine if there is a strong positive or negative relationship between two variables.
A scatterplot lets us visualize the relationship between two variables by plotting in 2-D space
Each dot on a scatterplot is a single observation i whose position is determined by the values (\(X_{i}\), \(Y_{i}\))

Imagine we have two variables:
| i | X | Y |
|---|---|---|
| 1 | 4 | 2 |
| 2 | 8 | 5 |
| 3 | 10 | 3 |
We can create the scatterplot of \(X\) and \(Y\) by plotting the dot for each observation one at a time.
So, we put the first point at (4,2), the second at (8,5), and so on.
R function to create a scatterplot: ggplot() + geom_point()
How many components are required?
Three main components: data, aesthetic mappings, and geometry layer
data (your dataset)
aes() (aesthetic mappings defining which variables map to x and y)
geom_point() (the geometry layer that creates the points)
I add a title and x- and y-axis labels using labs()
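As a minimal sketch of how the three components fit together, the toy table above can be plotted like this. The data frame name `toy`, the object name `p_toy`, and the title text are illustrative choices, not part of the course materials; the code assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Toy data from the table above: three observations of X and Y
toy <- data.frame(X = c(4, 8, 10), Y = c(2, 5, 3))

# data + aesthetic mappings + geometry layer, then labels via labs()
p_toy <- ggplot(data = toy, mapping = aes(x = X, y = Y)) +
  geom_point() +
  labs(title = "Scatterplot of X and Y", x = "X", y = "Y")

p_toy  # printing the object draws the plot
```

Each dot in the result sits at one row's (X, Y) pair, so the first dot lands at (4,2).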
It might be interesting to color in the points with different shades based on some variable
We are going to work with the STAR data again, so let’s load that in and look at the data.
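|   | classtype | reading | math | graduated |
|---|---|---|---|---|
| 1 | small | 578 | 610 | 1 |
| 2 | regular | 612 | 612 | 1 |
| 3 | regular | 583 | 606 | 1 |
| 4 | small | 661 | 648 | 1 |
| 5 | small | 614 | 636 | 1 |
| 6 | regular | 610 | 603 | 0 |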
What is the unit of analysis? How can I interpret the first row?
Create a scatterplot with math score on the X axis and reading score on the Y axis. Add a color aesthetic to make the dots different colors based on classtype.
Give the plot a sensible title and axis labels.
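One possible way to approach this exercise is sketched below. Because the file name and loading step for the full STAR data are not shown here, the sketch types in only the six rows printed above as a hypothetical data frame called `star_head`; with the full dataset you would pass that data frame instead. The title and label text are just suggestions.

```r
library(ggplot2)

# Only the six rows printed above, entered by hand (not the full dataset)
star_head <- data.frame(
  classtype = c("small", "regular", "regular", "small", "small", "regular"),
  reading   = c(578, 612, 583, 661, 614, 610),
  math      = c(610, 612, 606, 648, 636, 603),
  graduated = c(1, 1, 1, 1, 1, 0)
)

# Math on the X axis, reading on the Y axis, dot color set by classtype
p_star <- ggplot(star_head, aes(x = math, y = reading, color = classtype)) +
  geom_point() +
  labs(
    title = "Reading and math scores by class type",
    x = "Math score",
    y = "Reading score",
    color = "Class type"
  )

p_star
```

Putting `color = classtype` inside `aes()` is what ties the color to a variable; ggplot2 then adds the legend automatically.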
Correlation
The Correlation Coefficient, or Correlation, summarizes the strength of the linear relationship between two variables. We often denote the correlation coefficient as r.
Ranges from -1 to 1
If the correlation coefficient is positive, variables tend to move together.
If the correlation coefficient is negative, variables tend to move in opposite directions.
If the correlation coefficient is near zero, there is a weak or non-existent relationship between the variables.
The correlation coefficient (r) measures the strength and direction of a linear relationship between two variables.
Formula: \[r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}\]
What does all this mean?
In simple terms:
For each observation, see how far X is from its mean and how far Y is from its mean
Multiply these deviations together
Sum them up and standardize
Ranges from -1 to +1
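To make the steps concrete, here is a small sketch that follows the formula by hand for the three-observation toy data (X = 4, 8, 10 and Y = 2, 5, 3) and compares the result with R's built-in `cor()`. The variable names are illustrative.

```r
# Toy data from the earlier table
x <- c(4, 8, 10)
y <- c(2, 5, 3)

# Step 1: how far each value is from its mean
dev_x <- x - mean(x)
dev_y <- y - mean(y)

# Step 2 and 3: multiply the deviations, sum them, then standardize
numerator   <- sum(dev_x * dev_y)
denominator <- sqrt(sum(dev_x^2) * sum(dev_y^2))
r_by_hand   <- numerator / denominator

r_by_hand   # 0.5
cor(x, y)   # same value from the built-in function
```

Here the numerator is 14/3 and the denominator is 28/3, so r works out to 0.5: a positive, moderate linear association.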
The correlation coefficient captures both the direction and the strength of the linear association. How?
Direction (+/-): Do two variables tend to move in the same direction (+) or the opposite direction (-)
Strength: How close the points are to the line of best fit (the line that best summarizes the data). Closer = Stronger
First, the correlation coefficient can be decomposed into a sign and an absolute value:
cor(X,Y) ∈ [-1,1]
Example 1: cor(X,Y) = -0.9
Example 2: cor(X,Y) = 0.1
The sign reflects the direction of the linear association:
\(\text{cor}(X,Y) > 0\) when variables tend to move in the same direction; slope of line of best fit is positive
\(\text{cor}(X,Y) < 0\) when variables tend to move in opposite directions; slope of line of best fit is negative
The absolute value reflects the strength of the linear association
\(|\text{cor}(X,Y)| = 0\), no linear association
\(|\text{cor}(X,Y)| = 1\), perfect linear association
The closer to 1, the stronger the association
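As a quick illustration, the two example values above can be split into direction and strength with base R's `sign()` and `abs()`. The vector names are made up for this sketch.

```r
# The two example correlations from above
r_values <- c(example_1 = -0.9, example_2 = 0.1)

direction <- sign(r_values)  # -1 means negative, +1 means positive
strength  <- abs(r_values)   # closer to 1 means stronger

direction
strength
```

Example 1 is negative and strong (absolute value 0.9), while Example 2 is positive and weak (absolute value 0.1).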
Which scatterplot shows a positive linear association? ___________
A

B

If the slope is positive (the line moves upwards from left to right), the linear association and the correlation are both positive
If the slope is negative (the line moves downwards from left to right), the linear association and the correlation are both negative
Which scatterplot shows a weaker linear association?
A

B

The closer the dots are to the line of best fit, the stronger the linear association and the closer |cor(X,Y)| is to 1
The farther the dots are from the line of best fit, the weaker the linear association and the closer |cor(X,Y)| is to 0
cor(X,Y) = -0.9
cor(X,Y) = 0.2
Note: To help you gain an intuition, here we use the terms weak and strong, but what is considered a weak correlation in one field may be considered strong in another
Let’s return to the scatterplot of reading and math
Is the association positive or negative? Does this seem like a strong association?
The R function to calculate correlation is cor()
What are the required arguments? The variables:
Order of variables doesn’t matter; cor(X,Y) = cor(Y, X)
Go ahead and calculate the correlation between math and reading in the star data.
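A hedged sketch of the call is below. It uses only the six STAR rows printed earlier, typed into a hypothetical data frame `star_head`, so the number it returns will differ from the one you get with the full dataset. It also checks that the order of the two variables does not matter.

```r
# Six rows of reading and math scores from the printed STAR output
star_head <- data.frame(
  reading = c(578, 612, 583, 661, 614, 610),
  math    = c(610, 612, 606, 648, 636, 603)
)

# cor() takes the two variables as its arguments
r_math_reading <- cor(star_head$math, star_head$reading)
r_reading_math <- cor(star_head$reading, star_head$math)

r_math_reading
all.equal(r_math_reading, r_reading_math)  # TRUE: order does not matter
```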
A steeper line of best fit does not necessarily mean a higher correlation in absolute terms, or vice versa
Which scatterplot has a steeper line of best fit?
Which shows a higher correlation?
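The sketch below builds two made-up datasets to show the point: one with a steep line and widely scattered dots, one with a shallow line and tightly clustered dots. All numbers and names here are invented for illustration, and `lm()` is used only to read off the slope of the line of best fit.

```r
x <- 1:10
wiggle <- rep(c(1, -1), times = 5)  # alternating up/down deviations

# Steep trend, dots far from the line
y_steep <- 5 * x + 20 * wiggle
# Shallow trend, dots very close to the line
y_flat <- 0.5 * x + 0.1 * wiggle

# Slope of the line of best fit for each dataset
slope_steep <- coef(lm(y_steep ~ x))[["x"]]
slope_flat  <- coef(lm(y_flat ~ x))[["x"]]

# Correlation for each dataset
cor_steep <- cor(x, y_steep)
cor_flat  <- cor(x, y_flat)

c(slope_steep = slope_steep, slope_flat = slope_flat)
c(cor_steep = cor_steep, cor_flat = cor_flat)
```

The first dataset has the steeper line but the lower correlation, because correlation reflects how tightly the dots hug the line, not how steep the line is.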
A correlation of zero does not necessarily mean that there is no relationship between the variables; only that there is no linear relationship
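A small made-up example shows how this can happen: below, Y is completely determined by X (Y is X squared), yet the correlation is zero because the relationship is a curve, not a line. The variable names are illustrative.

```r
# X is symmetric around zero; Y depends perfectly on X, but not linearly
x <- -3:3
y <- x^2

r_curve <- cor(x, y)
r_curve  # 0 (up to floating-point rounding)
```

A scatterplot of these points would show a clear U shape, which is why it helps to look at the plot as well as the number.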
Correlation does not necessarily imply causation
Just because two variables, \(X\) and \(Y\), are highly correlated (that is, have a strong linear association, regardless of the sign) does not necessarily mean that changes in \(X\) cause changes in \(Y\) or vice versa
In STAR data, reading and math are highly correlated, but this does not mean that improving reading scores will necessarily cause math scores to improve