Functions in R are created with the `function` keyword, and they can take inputs (known as arguments) and return an output. Let's create a simple function that calculates the average of a numeric vector.
# Define the function to calculate average
calculate_average <- function(vec) {
  sum_value <- sum(vec)            # Calculate the sum of elements in the vector
  length_value <- length(vec)      # Calculate the number of elements in the vector
  avg <- sum_value / length_value  # Calculate the average
  return(avg)                      # Return the calculated average
}
# Example usage
my_vector <- rnorm(20, 0, 1)
calculate_average(my_vector) # Output (will vary, since rnorm draws are random):
[1] -0.2178085
We just wrapped up distributions of a single random variable
Before we get to hypothesis testing, we need to cover distributions of multiple random variables
The joint probability mass function (p.m.f) of a pair of discrete random variables (X,Y) describes the probability of any pair of values:
\[ p_{x,y}(x,y) = P(X = x, Y = y) \]
It may not be obvious, but this allows us to start to answer interesting questions!
So, now the probability for the sample space looks like:
\[ \sum_{x}\sum_{y}P(X = x, Y = y) = 1 \]
It's easy to get P(X, Y) from P(X) and P(Y) if X and Y are independent (just multiply them) - but usually they aren't!
We are going to be interested in how variables co-vary
Gender | Support Dream Act (Y = 1) | Oppose Dream Act (Y = 0) |
---|---|---|
Male (X = 1) | 0.24 | 0.24 |
Female (X = 0) | 0.34 | 0.18 |
What’s P(Y = 1, X = 1)?
What about the distribution for just one of the variables?
The Marginal PMF of Y is:
\[ P(Y = y) = \sum_{x} P(X=x, Y=y) \]
We are summing the probability that Y = y over all possible values of X = x
Terminology: Often referred to as marginalizing out X. Not the same as marginal change.
Gender | Support Dream Act (Y = 1) | Oppose Dream Act (Y = 0) |
---|---|---|
Male (X = 1) | 0.24 | 0.24 |
Female (X = 0) | 0.34 | 0.18 |
Marginal Distribution of Support | 0.58 | 0.42 |
What’s P(Y = 1)? And P(X = 1)?
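A quick R sketch using the joint probabilities from the table above (the object name `joint` is just illustrative):
# Joint pmf of gender (rows) and Dream Act support (columns)
joint <- matrix(c(0.24, 0.34, 0.24, 0.18), nrow = 2,
                dimnames = list(Gender = c("Male (X=1)", "Female (X=0)"),
                                DreamAct = c("Support (Y=1)", "Oppose (Y=0)")))
sum(joint)      # the joint pmf sums to 1
colSums(joint)  # marginal pmf of Y: P(Y=1) = 0.58, P(Y=0) = 0.42
rowSums(joint)  # marginal pmf of X: P(X=1) = 0.48, P(X=0) = 0.52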
The conditional probability mass function of Y|X is
\[ P(Y = y | X = x) = \frac{P(X = x, Y = y)}{P(X = x)}, \text{ for all } x \text{ such that } P(X = x) > 0 \]
Has all the usual properties: \(P(Y = y | X = x) \geq 0\) and \(\sum_{y} P(Y = y | X = x) = 1\)
and
\[ E[Y | X = x] = \sum_{y} y P(Y = y | X = x) \]
Gender | Support Dream Act (Y = 1) | Oppose Dream Act (Y = 0) |
---|---|---|
Male (X = 1) | 0.24/0.48 = 0.50 | 0.24/0.48 = 0.50 |
Female (X = 0) | 0.34/0.52 ≈ 0.65 | 0.18/0.52 ≈ 0.35 |
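The same calculation as a minimal R sketch (the joint matrix is redefined here so the snippet runs on its own):
joint <- matrix(c(0.24, 0.34, 0.24, 0.18), nrow = 2)  # rows: X = 1, 0; cols: Y = 1, 0
p_x <- rowSums(joint)                    # P(X = x)
cond_y_given_x <- sweep(joint, 1, p_x, "/")
cond_y_given_x                           # P(Y = y | X = x), matches the table above
sum(c(1, 0) * cond_y_given_x[2, ])       # E[Y | X = 0] ≈ 0.65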
Bayes Rule for random variables
\[ P(Y = y| X = x) = \frac{P(X=x|Y = y)P(Y = y)}{P(X=x)} \]
Law of total probability for random variables
\[ P(X = x) = \sum_{y} P(X = x|Y = y)P(Y=y) \]
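A quick check of both identities on the Dream Act table, again as a small R sketch:
joint <- matrix(c(0.24, 0.34, 0.24, 0.18), nrow = 2)  # rows: X = 1, 0; cols: Y = 1, 0
p_x <- rowSums(joint); p_y <- colSums(joint)
p_x_given_y <- sweep(joint, 2, p_y, "/")       # P(X = x | Y = y)
p_x_given_y[1, 1] * p_y[1] / p_x[1]            # Bayes rule: P(Y = 1 | X = 1) = 0.5
sum(p_x_given_y[1, ] * p_y)                    # law of total probability: P(X = 1) = 0.48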
For two rvs, X and Y, the joint cdf \(F_{x,y}(x,y)\) can be written
\[ F_{x,y}(x,y) = P(X \leq x, Y \leq y) \]
For two discrete RVs, that gives us
\[ F_{x,y}(x,y) = \sum_{i \leq x}\sum_{j \leq y} P(X = i, Y = j) \]
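For intuition, a small R sketch evaluating this CDF on the Dream Act table (the helper F_xy is just illustrative):
joint <- matrix(c(0.24, 0.34, 0.24, 0.18), nrow = 2)  # rows: X = 1, 0; cols: Y = 1, 0
x_vals <- c(1, 0); y_vals <- c(1, 0)
F_xy <- function(x, y) sum(joint[x_vals <= x, y_vals <= y, drop = FALSE])
F_xy(0, 1)   # P(X <= 0, Y <= 1) = P(X = 0) = 0.52
F_xy(1, 1)   # the whole sample space: 1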
The multinomial distribution is a generalization of the binomial distribution.
While the binomial distribution describes the number of successes in a fixed number of independent Bernoulli trials (each with two possible outcomes), the multinomial distribution describes the number of occurrences of each of several possible outcomes in a fixed number of independent trials.
Technically it’s an extension of the closely related categorical distribution to n trials.
The multinomial distribution applies when each of n independent trials results in exactly one of k possible outcomes \(O_1, O_2, \ldots, O_k\), and the probability of each outcome is the same on every trial.
If we let \(X_i\) denote the number of times outcome \(O_i\) occurs in n trials, then the vector \(X = (X_1, X_2, \ldots X_k)\) follows a multinomial distribution.
The probability of observing a specific outcome vector \(x = (x_1, x_2, \ldots, x_k)\) is:
\[ P(\mathbf{X} = \mathbf{x}) = \frac{n!}{x_1! x_2! \cdots x_k!} p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k} \]
where \(n\) is the number of trials, \(x_i\) is the number of times outcome \(O_i\) occurs (so \(x_1 + x_2 + \cdots + x_k = n\)), and \(p_i\) is the probability of outcome \(O_i\) on any single trial (with \(p_1 + p_2 + \cdots + p_k = 1\)).
Suppose you are rolling a fair six-sided die 10 times.
If the die is fair, then \(p_1 = p_2 \ldots = p_6 = \frac{1}{6}\). The number of times each side appears follows a multinomial distribution with parameters \(n = 10\) and \(p_1 = p_2 = \cdots = p_6 = \frac{1}{6}\)
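A minimal R sketch of the die example, using base R's rmultinom() and dmultinom(); the particular count vector c(2, 2, 2, 2, 1, 1) is just one arbitrary outcome:
probs <- rep(1/6, 6)           # a fair die
set.seed(123)
rmultinom(1, size = 10, prob = probs)                    # one simulated vector of face counts
dmultinom(c(2, 2, 2, 2, 1, 1), size = 10, prob = probs)  # probability of that specific count vector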
A probabilistic model of language that maps naturally to a bag-of-words approach
An improbable/inaccurate model of language: it assumes that words are drawn from a bag of words following a multinomial distribution
Draws are independent: Any given word depends only on the specified probabilities, not on previously used words
Use the frequency of co-occurrence of words to uncover latent topics in the data.
If immigrant and asylum and refugee and border keep co-occurring together, they probably have some sort of relationship
Use an extension of the multinomial distribution to model both what words make up topics, and how frequently topics occur
Topic models extensively used in social science research, can be run (in a few hours) on a decent laptop
Do political candidates who are from working-class backgrounds talk about different topics than other candidates?
Collected 15,000 campaign advertisements in the UK, covering 1992-2010.
Each topic is assumed to have its own (roughly) multinomial distribution over words, where words related to the topic have higher probabilities
Encode each token [~= each word] with a vector of the length of the vocabulary, J:
\[ \begin{gather} \text{Everybody} = (1,0,0,0) \\ \text{Heard} = (0,1,0,0) \\ \text{About} = (0,0,1,0) \\ \text{Bird} = (0,0,0,1) \\ \end{gather} \]
Nice for representing texts mathematically, and is very computationally efficient.
If a document is one token long, we can think of it as a draw from a categorical distribution
\[ W_{i} \sim Categorical(\mu) \]
Consider our previous vocabulary. Bird is more frequent than Everybody or Heard, so:
\[ \mu = (.125, .125, .125, .625) \]
Then we can write the probability mass function for a single draw (categorical distribution) as:
\[ p(W_{i}|\mu) = \prod_{j=1}^{J} \mu_{j}^{W_{ij}} \]
where J is the size of the vocabulary and \(W_{ij}\) indicates whether token i is word j
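As a sketch, here is a single one-hot draw in R using the toy vocabulary and \(\mu\) above (rmultinom() with size = 1 is one way to draw from a categorical distribution):
vocab <- c("Everybody", "Heard", "About", "Bird")
mu <- c(0.125, 0.125, 0.125, 0.625)
set.seed(1)
w <- rmultinom(1, size = 1, prob = mu)[, 1]  # one draw, as a one-hot vector
setNames(w, vocab)
prod(mu^w)   # the pmf above: picks out mu_j for the word that was drawn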
Consider the probability of seeing a document of more than one word.
We need to add a parameter \(M_{i}\), which is the length of the document. So:
\[ p(W_{i}|\mu) = \frac{M_{i}!}{\prod_{j=1}^{J}W_{ij}!} \prod_{j=1}^{J}\mu_{j}^{W_{ij}} \]
\[ E[W_{ij}] = M_{i}\mu_{j} \]
\[ Var(W_{ij}) = M_{i}\mu_{j}(1 - \mu_{j}) \]
\[ MLE: \widehat{\mu}_{j} = \frac{W_{ij}}{M_{i}} \]
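A small R sketch tying these together; the count vector w_i is a made-up 5-token document over the toy vocabulary:
vocab <- c("Everybody", "Heard", "About", "Bird")
mu <- c(0.125, 0.125, 0.125, 0.625)
w_i <- c(1, 0, 1, 3)                   # hypothetical word counts for a 5-token document
M_i <- sum(w_i)
dmultinom(w_i, size = M_i, prob = mu)  # p(W_i | mu)
M_i * mu                               # expected counts, E[W_ij] = M_i * mu_j
w_i / M_i                              # MLE of mu from this single document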
One application: Determining authorship of uncertain documents.
Consider the three potential authors of unattributed Federalist Papers (which I am led to believe are important?).
Each author has their own distribution over the vocabulary, \(\widehat{\mu_{ij}}\)
We also have the word counts for three words in the disputed Federalist Paper
Author | By | Man | Upon |
---|---|---|---|
Hamilton | 859 | 102 | 374 |
Jay | 82 | 0 | 1 |
Madison | 474 | 17 | 7 |
Disputed | 15 | 2 | 0 |
\[ W_{H} \sim Multinomial(1335, \mu_{H}) \]
\[ W_{J} \sim Multinomial(83, \mu_{J}) \]
\[ W_{M} \sim Multinomial(498, \mu_{M}) \]
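A rough sketch of one way to compare the authors (not the original analysis, and restricted to just the three words in the table): estimate each author's word rates from their known counts, then compute the multinomial likelihood of the disputed counts under each author.
counts <- rbind(Hamilton = c(by = 859, man = 102, upon = 374),
                Jay      = c(by = 82,  man = 0,   upon = 1),
                Madison  = c(by = 474, man = 17,  upon = 7))
disputed <- c(by = 15, man = 2, upon = 0)
mu_hat <- counts / rowSums(counts)   # estimated word rates per author
apply(mu_hat, 1, function(p) dmultinom(disputed, prob = p))
# Jay's estimated rate for "man" is exactly 0, so the disputed counts get likelihood 0 under Jay.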
There was something weird about the prediction for John Jay
We need a way to deal with the model overlearning/overfitting the data.
Regularization: A non-data constraint or addition to a model that pushes our estimate towards a certain value.
Laplace Smoothing Intuition: Add a small amount \(\alpha\) to the count of each word type so the model never guesses 0.
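A minimal sketch of Laplace smoothing on the author counts above, with \(\alpha = 1\) chosen just for illustration:
counts <- rbind(Hamilton = c(by = 859, man = 102, upon = 374),
                Jay      = c(by = 82,  man = 0,   upon = 1),
                Madison  = c(by = 474, man = 17,  upon = 7))
alpha <- 1   # add-one smoothing
mu_smooth <- (counts + alpha) / (rowSums(counts) + alpha * ncol(counts))
mu_smooth["Jay", ]   # no word rate is exactly 0 any more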
Data Generating Process for the Dirichlet Distribution
Sample word probabilities for each author: \(\mu_{k} \sim Dirichlet(\alpha)\) for \(k \in \{1,2,3\}\)
Stack the author-specific word rates in the columns of a matrix: \(\mu = [\mu_{1}, \mu_{2}, \mu_{3}]\)
Sample text using the author's word probabilities: \(W_{i}|\mu,\pi_{i} \sim Multinomial(M_{i}, \mu \pi_{i})\)
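A base R sketch of these generative steps, using normalized Gamma draws to sample from the Dirichlet; the 3-word vocabulary, \(\alpha = 1\), and document length of 20 are arbitrary illustrative choices:
rdirichlet_one <- function(alpha) {   # one Dirichlet draw via normalized Gammas
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}
set.seed(42)
alpha <- rep(1, 3)                                    # symmetric prior over a toy 3-word vocabulary
mu <- sapply(1:3, function(k) rdirichlet_one(alpha))  # one column of word rates per author
pi_i <- c(1, 0, 0)                                    # document i attributed entirely to author 1
rmultinom(1, size = 20, prob = as.vector(mu %*% pi_i))  # sample the document's word counts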
The assumption that word use comes from an underlying, possibly author-specific, distribution is a useful fiction.
Going forward: Multinomial and Dirichlet representations are foundational for Topic Models. Both are also very important in machine learning.
One continuous r.v.: prob. of being in an interval on the real line.
Two continuous rvs: probability of being in some subset of the 2-dimensional plane
If two continuous random variables X and Y have a CDF \(F_{x,y}\), their joint p.d.f. is the derivative of \(F_{x,y}\) with respect to x and y
\[ f_{x,y}(x,y) = \frac{\partial^{2} }{\partial x \partial y} F_{x,y}(x,y) \]
To get the probability of a region A, integrate over both dimensions
\[ P((X,Y) \in A) = \int\int_{A} f_{x,y}(x,y)\,dx\,dy \]
For the joint distribution of two variables, we can visualize the PDF in 3D, where the probability of a region is the volume under the density surface above that region. Generalizes for n variables.
We can get the marginal PDF of one of the variables by integrating over the other
\[ f_{y}(y) = \int_{-\infty}^{\infty} f_{x,y}(x,y)\, dx \]
Works both ways
\[ f_{x}(x) = \int_{-\infty}^{\infty} f_{x,y}(x,y)\, dy \]
By integrating over the other RV, this flattens the joint density into one dimension.
The conditional PDF of Y given X is
\[ f_{y|x}(y|x) = \frac{f_{x,y}(x,y)}{f_{x}(x)} \]
So,
\[ P(a < Y < b| X = x) = \int_{a}^{b}f_{y|x}(y|x)dy \]
This also implies that:
\[ f_{x,y}(x,y) = f_{y|x}(y|x)f_{x}(x) \]
Note - to actually get the conditional PDF we would need to divide by f(x) at the x value of the slice.
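As a sketch, take a made-up joint density \(f_{x,y}(x,y) = x + y\) on the unit square (it integrates to 1) and recover the marginal and a conditional probability numerically with integrate():
f_xy <- function(x, y) (x + y) * (x >= 0 & x <= 1) * (y >= 0 & y <= 1)
f_x  <- function(x) integrate(function(y) f_xy(x, y), 0, 1)$value   # marginal of X at x
f_x(0.3)                                                            # x + 1/2 = 0.8
f_y_given_x <- function(y, x) f_xy(x, y) / f_x(x)                   # conditional density of Y given X = x
integrate(function(y) f_y_given_x(y, x = 0.3), 0.25, 0.75)$value    # P(0.25 < Y < 0.75 | X = 0.3) = 0.5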
We summarize univariate distributions with expectation and variance.
With multivariate distributions, these are still useful but we also care about how the variables depend on each other
For discrete X and Y
\[ E[XY] = \sum_{x}\sum_{y} xy \, f_{x,y}(x,y) \]
For continuous X and Y
\[ E[XY] = \int_{x}\int_{y} xy \, f_{x,y}(x,y)\,dx\,dy \]
Marginal Expectation (discrete)
\[ E[Y] = \sum_{x}\sum_{y} yf_{x,y}(x,y) \]
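Checking these on the Dream Act table in R (a minimal sketch):
joint <- matrix(c(0.24, 0.34, 0.24, 0.18), nrow = 2)  # rows: X = 1, 0; cols: Y = 1, 0
x_vals <- c(1, 0); y_vals <- c(1, 0)
sum(outer(x_vals, y_vals) * joint)   # E[XY] = 1 * 1 * 0.24 = 0.24 (only the X=1, Y=1 cell contributes)
sum(y_vals * colSums(joint))         # marginal expectation E[Y] = 0.58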
Independence assumptions are everywhere
Sampling - each respondent's probability of response is independent
In an RCT, treatment assignment is independent of potential confounders
Lack of independence can also be key
Fundamentally, hypothesis testing is about showing associations or dependencies between variables
In observational analyses, treatment is extremely unlikely to be independent, necessitating controls
How do we measure an association between two RVs?
Covariance
\[ \text{Cov}[X,Y] = E[(X - E[X])(Y - E[Y])] \]
Properties
Cov[X,Y] = E[XY] - E[X]E[Y]
If X and Y are independent, Cov(X,Y) = 0
Cov(X,Y) = 0 does not imply independence. Cov measures linear dependence, can miss non-linear dependence.
Before we saw that Var(X + Y) = Var[X] + Var[Y] if X and Y are independent
But usually X and Y are not independent; in general, Var(X + Y) = Var[X] + Var[Y] + 2Cov[X, Y]
Correlation is a scale-free measure of linear dependence
\[ \rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}} = \text{Cov}\left(\frac{X - E[X]}{\text{SD}(X)}, \frac{Y - E[Y]}{\text{SD}(Y)}\right) \]
Basically, we normalize the covariance such that it is bounded by \(-1 \leq \rho \leq 1\).
\(\rho\) = 1 (or \(-1\)) if and only if there is a perfect deterministic linear relationship between X and Y.
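Putting the pieces together on the Dream Act table, a small R sketch of Cov[X,Y] = E[XY] - E[X]E[Y] and the correlation:
joint <- matrix(c(0.24, 0.34, 0.24, 0.18), nrow = 2)  # rows: X = 1, 0; cols: Y = 1, 0
x_vals <- c(1, 0); y_vals <- c(1, 0)
p_x <- rowSums(joint); p_y <- colSums(joint)
E_x <- sum(x_vals * p_x); E_y <- sum(y_vals * p_y)
E_xy <- sum(outer(x_vals, y_vals) * joint)
cov_xy <- E_xy - E_x * E_y                           # 0.24 - 0.48 * 0.58 = -0.0384
rho <- cov_xy / sqrt((E_x - E_x^2) * (E_y - E_y^2))  # Bernoulli variances are p(1 - p)
c(cov = cov_xy, rho = rho)                           # rho is approximately -0.16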