So far, we have focused on how to calculate the probability of events, and the expected value, given a distribution. The expectation is a nice summary statistic, but without a measure of spread it is incomplete.
Think about the binomial and the hypergeometric (sampling with and without replacement). The expectation is the same, but the distributions sometimes look very different!
We don’t just want to know where the center of gravity of the data is; we also want to know how spread out it is
In practical terms, the wider the spread of the data, and the fatter the tails, the less surprised we should be by values far from the expectation
In the second half of the semester, this will be crucial for hypothesis testing. The basic intuition of hypothesis testing is that we measure how surprised we should be by a value if there were no relationship between our independent and dependent variables.
We would like a metric that tells us how far from E[X] the values of X typically fall.
That metric is called variance
If the variance is small, we expect the realizations of X to cluster around E[X], and we would be surprised to see values far from E[X]
If the variance is large, we expect the realizations of X to be quite spread out, and we would not be surprised to see values far from E[X]
The Variance, which measures the spread of the distribution, is defined as:
\[ Var[X] = E[(X -E[X])^{2}] \]
Why not just use \(E[X - E[X]]\)? By linearity, \(E[X - E[X]] = E[X] - E[X] = 0\) for every distribution, so we square the deviations first.
In practice, the variance is a weighted average of squared distance from the mean
A common representation of the variance is
\[ Var[X] = E[X^{2}] - (E[X])^{2} \]
Note that the variance is the expected squared distance from the mean, so it is in squared units. If we want to know, on average, how far a realization of X will be from \(E[X]\), we can calculate the standard deviation
\[ SD(X) = \sqrt{Var[X]} \]
In practice, we usually work with variance rather than standard deviation, but standard deviation is more immediately interpretable
Calculate the Expectation, Variance and Standard Deviation for a weighted dice roll with the following PMF:
| x | P(X = x) |
|---|----------|
| 1 | 0.1      |
| 2 | 0.15     |
| 3 | 0.2      |
| 4 | 0.25     |
| 5 | 0.2      |
| 6 | 0.1      |
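To check the arithmetic, we can code the PMF directly in R (this snippet is our illustration, not part of the exercise):

```r
x <- 1:6
p <- c(0.1, 0.15, 0.2, 0.25, 0.2, 0.1)   # the weighted-die PMF above

ex  <- sum(x * p)       # E[X] = 3.6
ex2 <- sum(x^2 * p)     # E[X^2] = 15.1
vx  <- ex2 - ex^2       # Var[X] = E[X^2] - (E[X])^2 = 2.14
sdx <- sqrt(vx)         # SD(X), roughly 1.46
c(ex, vx, sdx)
```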
Imagine we want to know both the average (expectation) and spread (variance/standard deviation) of incomes in a community. We randomly select 10 households who have the following incomes (in thousands of US Dollars). Let X be the income of each respondent. We have:
\[ X = [45,50,52,47,60,55,120,28,430,73] \]
How can we calculate the expectation, variance and standard deviation?
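One way in R (our sketch; note the distinction between the population and sample formulas):

```r
incomes <- c(45, 50, 52, 47, 60, 55, 120, 28, 430, 73)

mean(incomes)                                  # expectation: 96
pop_var <- mean((incomes - mean(incomes))^2)   # treating the 10 values as the whole distribution
pop_var                                        # 12931.6
sqrt(pop_var)                                  # roughly 113.7
# R's built-in var() and sd() divide by n - 1 (the sample versions),
# so they give somewhat larger values here.
var(incomes)
```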
If c is a constant, \(Var[X + c] = Var[X]\)
If c is a constant, \(Var[cX] = c^{2}Var[X]\)
If X and Y are independent, Var[X + Y] = Var[X] + Var[Y] (independence is sufficient here; strictly, zero covariance is all that is required)
If X is not a constant, Var[X] > 0.
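These properties are easy to see in simulation; a quick sketch (the seed and sample size are arbitrary choices of ours):

```r
set.seed(1)
x <- runif(1e5, 0, 10)
y <- runif(1e5, 0, 10)   # drawn independently of x

var(x + 5) - var(x)      # zero up to floating-point error: shifting does not change spread
var(3 * x) / var(x)      # 9, i.e. c^2 with c = 3
var(x + y)               # approximately var(x) + var(y), since x and y are independent
var(x) + var(y)
```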
Rather than use the formula, we can also use story proofs to find the variance of known distributions
Recall that X ~ Bin(n,p) is the sum of n independent Bernoulli trials
Variance of a Bernoulli is easy
\[ Var[X_{i}] = E[X_{i}^{2}] - E[X_{i}]^{2} = p - p^{2} = p(1-p) \]
Binomials are the sum of independent Bernoulli trials
\[ Var[X] = n(p - p^{2}) = np(1-p) \]
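A simulation check of this story proof (the seed and parameters are arbitrary choices of ours):

```r
set.seed(7)
n <- 10; p <- 0.3
draws <- rbinom(1e5, size = n, prob = p)   # 100,000 binomial realizations

var(draws)        # simulated variance
n * p * (1 - p)   # theoretical np(1-p) = 2.1
```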
So far: Probability Theory and Discrete Random Variables
You should feel comfortable about distributions of data that take on discrete values
You should have a good idea of what the PMF and CDF of discrete variables mean
Now: Same idea for variables that can take on any value
Many variables we care about as social scientists are (approximately) continuous
Income, Time, Tax Rates, Vote Shares
Sample means of variables are also (approximately) continuous
For a discrete RV, \(P(X=x) > 0\) for all values in the support.
Does not hold for a continuous RV. Let’s see why:
Suppose \(P(X = x) = \epsilon\) for every \(x \in (0,1)\), and let \(\epsilon\) be arbitrarily small but positive.
How many real numbers are there between 0 and 1? Uncountably many.
If each has probability \(\epsilon > 0\), then \(P(X \in (0,1))\) is a sum of infinitely many \(\epsilon\)'s, which diverges to \(\infty\). But probabilities can never exceed 1, so the point probability of each value must be 0.
Let X be distributed continuous uniform from 0 to 10.
What is P(X = 3)?
What about P(X = 0.194345111223)?
Generically, for a continuous random variable, P(X = x) = 0.
This does not mean X = 3 cannot happen. It is in the support (defined as 0 to 10).
So…the PMF is pretty useless now. All values for X have zero mass!
Definition: A random variable X is continuous if its CDF \(F(x) = P(X \leq x)\) is a continuous function.
When we analyze discrete distributions, the PMF tells us the point probability of P(X = x). We can easily solve the probability that our random variable takes on any value.
We cannot do this for continuous random variables! But…we can use calculus to find the probability that X lies in an interval of the CDF.
What calculus operator would be useful here?
The probability density function of a continuous random variable X is the function \(f_{x}(x)\) that satisfies
\[ F_{x}(x) = \int_{-\infty}^{x} f_{x}(t)dt \]
By the fundamental theorem of calculus, the pdf is the derivative of the CDF. So, all we are doing is replacing \(\Sigma\) with \(\int\)
So, \(P(a < X < b) = P(X \leq b) - P(X \leq a) = \int_{a}^{b}f_{x}(x)dx\)
Note - continuity means \(P(a < X < b) = P(a \leq X \leq b)\), since the endpoints carry zero probability
The area under the curve of a region is equal to the probability of X falling in that region
All valid PDFs are
Nonnegative: \(f_{x}(x) \geq 0\)
Integrates (rather than sums) to 1: \(\int_{-\infty}^{\infty} f_{x}(x)dx = 1\)
Unlike with a PMF, \(f_{x}(x)\) can be greater than 1!
Let X be a random variable with PDF \(f_{x}(x) = 1\) if x is in the interval (0,1) and \(f_{x}(x) = 0\) otherwise.
Graphically:
Any continuous random variable X where the probability of X is the same over the entirety of the support is distributed Uniform. Can we work out the PDF?
if X is uniform on (a,b), the pdf is
\[ f(x) = \begin{cases} \frac{1}{b - a} & \text{for } x \in [a, b] \\ 0 & \text{otherwise} \end{cases} \]
Relatedly, if (c,d) is a subinterval of (a,b), then \(P(X \in (c,d)) = \frac{d-c}{b-a}\)
dunif computes the density \(f(x)\) of x, where \(f(x)= \frac{1}{b-a}\) for \(a<x<b\).
x: the value of x in f(x)
min: the lower bound of the interval (a). Default is 0.
max: the upper bound of the interval (b). Default is 1.
punif computes the cdf \(F(x)=P(X≤x)\) of X.
q: the value of x in F(x)
min: the lower bound of the interval (a). Default is 0.
max: the upper bound of the interval (b). Default is 1.
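Putting the two together for the Uniform(0, 10) from the earlier example (our illustration):

```r
dunif(3, min = 0, max = 10)   # density f(3) = 1/(10 - 0) = 0.1 (a density, not a probability)
punif(3, min = 0, max = 10)   # F(3) = P(X <= 3) = 0.3
# P(2 < X < 7) via the subinterval rule, (7 - 2)/(10 - 0) = 0.5:
punif(7, 0, 10) - punif(2, 0, 10)
```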
A very useful thing you can do in R is simulate data with a certain distribution.
runif draws random numbers from a uniform distribution
n: the sample size we want
min: the lower bound of the interval
max: the upper bound of the interval
In R (or any statistical program), we often want our results to be reproducible
But, if we run runif over and over again, we get different draws every time.
The solution is to set a seed, using set.seed()
## we set a seed for reproducibility (it will generate the same numbers each time)
set.seed(29631)
random_numbers <- runif(20, -10, 10)
random_numbers
[1] 8.5734087 5.3892572 -7.5406310 -8.4881313 4.2082382 -9.6233854
[7] 1.8259260 8.1478078 0.6631944 5.7339422 8.9260468 -6.3962945
[13] 9.9210618 6.6941542 -0.5974971 3.5098407 -8.7295550 5.4518538
[19] -5.4562228 3.5680572
For any continuous random variable X, the expectation is
\[ E[X] = \int_{-\infty}^{\infty} xf_{x}(x)dx \]
What does this mean? How does it relate to the discrete version?
From the definition of expectation
\[ E[X] = \int_{a}^{b} xf_{x}(x)dx = \int_{a}^{b} x\frac{1}{b-a} dx \]
solving the integral and evaluating for the interval (a,b) gives us
\[ \frac{x^{2}}{2(b-a)} \Big|_{a}^{b} = \frac{b^{2} - a^{2}}{2(b-a)} = \frac{(b+a)(b-a)}{2(b-a)} = \frac{a+b}{2} \]
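We can sanity-check this result by simulating from an arbitrary interval, here (2, 8) (our choice of endpoints and seed):

```r
set.seed(42)
a <- 2; b <- 8
mean(runif(1e5, a, b))   # close to (a + b)/2 = 5
(a + b) / 2
```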
We already know the definition of Variance for any r.v. X.
\[ Var[X] = E[(X - E[X])^{2}] \]
Again, analogous to the discrete case
\[ Var[X] = \int_{-\infty}^{\infty} (x - E[X])^{2}f_{x}(x)dx \]
All the properties of expectation and variance (like linearity) hold in the continuous case. Importantly - Var[X] is still equal to \(E[X^{2}] - E[X]^{2}\)
We know that the variance of any RV X is \(E[X^{2}] - E[X]^{2}\). We can easily get \(E[X]^{2}\), but what about \(E[X^{2}]\)?
Finding the distribution of \(X^{2}\) and then taking its expectation directly is possible but tedious. Fortunately, there is an easier way!
The Law of the Unconscious Statistician (LOTUS) says that we can compute \(E[g(X)]\) directly from the pdf of X, without first deriving the distribution of \(g(X)\):
\[ E[g(x)] = \int_{-\infty}^{\infty} g(x) f_{x}(x)dx \]
LOTUS means we can plug in \(g(x) = x^{2}\) and take the expectation using the pdf of X
\[ E[X^{2}] = \int_{a}^{b} x^{2}f_{x}(x)dx = \int_{a}^{b} \frac{x^{2}}{b-a}dx \]
We can evaluate the definite integral
\[ E[X^{2}] = \frac{1}{b-a}\frac{x^{3}}{3} \Big|_{a}^{b} = \frac{b^{3} - a^{3}}{3(b-a)} \]
So, finally….
\[ Var[X] = E[X^{2}] - E[X]^{2} = \frac{b^{3} - a^{3}}{3(b-a)} - \bigg(\frac{a+b}{2}\bigg)^{2} \]
After an annoying amount of algebra (feel free to do this at home; use the difference of cubes formula!), this simplifies to:
\[ Var[X] = \frac{(b-a)^{2}}{12} \]
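Again, easy to verify by simulation (the interval and seed are our arbitrary choices):

```r
set.seed(314)
a <- 2; b <- 8
x <- runif(1e6, a, b)
var(x)             # simulated variance
(b - a)^2 / 12     # theoretical: 36/12 = 3
```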
A continuous random variable Z follows a standard normal distribution, with E[Z] = 0 and \(Var[Z] = 1\), if its pdf \(\psi\) is:
\[ \psi(z) = \frac{1}{\sqrt{2\pi}}e^{-z^{2}/2}, \text{for } -\infty < z < \infty \\ \text{written as } Z \sim N(0,1) \]
The CDF has no closed-form solution, but is written by convention as
\[ \Phi(z) = \int_{-\infty}^{z}\frac{1}{\sqrt{2\pi}}e^{-t^{2}/2}dt \]
If \(Z \sim N(0,1)\) then
\[ X = \mu + \sigma Z \]
is also distributed normal with mean \(\mu\) and variance \(\sigma^{2}\) .
\[ f_{x}(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}} \]
Importantly, we can get back to the standard normal through standardization:
\(\frac{X - \mu}{\sigma} \sim N(0,1)\)
dnorm, pnorm and rnorm do the same things as their uniform counterparts
Let’s try and plot a couple of normal distributions.
Generate 20 random numbers from a normal(0,1) and plot the distribution (use a histogram with bin width 0.5).
Generate 200 random numbers from a normal(0,1) and plot the distribution.
Finally, generate 2000 random numbers and plot.
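One possible solution sketch (the seed and bin boundaries are our choices):

```r
set.seed(123)
for (n in c(20, 200, 2000)) {
  z <- rnorm(n)   # n draws from N(0,1)
  hist(z,
       breaks = seq(floor(min(z)), ceiling(max(z)), by = 0.5),
       main = paste("n =", n), xlab = "z")
}
```

With n = 20 the histogram is lumpy; by n = 2000 the bell shape is unmistakable.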
An inverse function essentially “reverses” the effect of a given function. If a function f maps an element x to f(x), then its inverse \(f^{-1} (x)\) will map f(x) back to x.
\(f(f^{-1}(x)) = x \quad \text{and} \quad f^{-1}(f(x)) = x\)
Intuition: What is the inverse of \(f(x) = x^{2}\) for \(x > 0\)? It is \(f^{-1}(x) = \sqrt{x}\), since \(\sqrt{x^{2}} = x\) on this domain.
The inverse of the CDF, \(F^{-1}\), is called the quantile function
\(F^{-1}(\alpha)\) is the value of x such that \(P(X \leq x) = \alpha\)
The quantile function takes probabilities as arguments
\(F^{-1}(0.5)\) is the median, \(F^{-1}(0.9)\) is the upper decile
Soon: One way to obtain our confidence intervals is from the quantile function. \(F^{-1}(0.975)\) is the upper bound of a 95% confidence interval
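In R, quantile functions are the q-prefixed counterparts of the d/p/r functions (qunif, qnorm); for the standard normal:

```r
qnorm(0.5)     # median of N(0,1): 0
qnorm(0.9)     # upper decile, about 1.28
qnorm(0.975)   # about 1.96, the upper bound of a 95% confidence interval
```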
Let U ~ Unif(0,1) and let F be a continuous, increasing, CDF. Let \(X = F^{-1}(U)\). Then, X is an r.v. with CDF F.
Proof: for all real x:
\(P(X \leq x) = P(F^{-1}(U) \leq x) = P(U \leq F(x)) = F(x)\)
Imagine that we had a random number generator that gives us numbers between 0 and 1. So, the output is uniform(0,1). Imagine we spin and get U = 0.975
Suppose we wanted instead random numbers that follow a standard normal distribution. We can get there by plugging 0.975 into \(F^{-1}(U)\), which gives us the corresponding X for the standard normal distribution.
For the standard normal distribution, this gives X the value such that \(P(Z \leq X) = 0.975\). In this case, X = 1.96.
If we were to repeat this process many times, we would generate numbers following a standard normal distribution.
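This "spin the uniform, invert the CDF" recipe is inverse-transform sampling; a minimal R sketch (seed is arbitrary):

```r
set.seed(2024)
u <- runif(1e5)   # spins from Uniform(0,1)
x <- qnorm(u)     # F^{-1}(U), with qnorm as the standard normal quantile function

mean(x)           # near 0
var(x)            # near 1
```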
Let X be an r.v. with CDF F. Then F(X) ~ Unif(0,1).
Proof:
Let X have cdf F and find the CDF of Y = F(X). Since Y takes values in (0,1), \(P(Y \leq y)\) is 0 for \(y \leq 0\) and 1 for \(y \geq 1\). For \(y \in (0,1)\)
\(P(Y \leq y) = P(F(X) \leq y) = P(X \leq F^{-1}(y)) = F(F^{-1}(y)) = y\)
Now imagine instead that we have random numbers from a standard normal distribution. We know the CDF, \(F_{x}(X)\), of the normal distribution, which gives the probability that X is less than or equal to some value.
To transform X into a uniform random variable, we compute \(U = F_{x}(X)\)
Suppose X = 1 and let \(U = F_{x}(X)\). To compute U, we just find \(P(Z \leq 1)\), which happens to be 0.84.
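The reverse direction in R (the probability integral transform; seed is our arbitrary choice):

```r
pnorm(1)          # P(Z <= 1), about 0.8413

set.seed(99)
z <- rnorm(1e5)   # standard normal draws
u <- pnorm(z)     # F(X): these should look Uniform(0,1)
mean(u)           # near 0.5
```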