R Functions and Multivariate Distributions

Introduction to Functions in R

  • Functions are reusable blocks of code designed to perform a specific task.
    • When you have a repeated task, it is better to write the code once as a function than to keep copying the same code over and over
  • In R, functions are defined using the function keyword, and they can take inputs (known as arguments) and return an output.
  • You’ve already used built-in R functions, but you can also write your own

Syntax of a Function

# Basic function structure
function_name <- function(argument1, argument2, ...) {
  # Code to execute
  output <- (argument1 - mean(argument1, na.rm = TRUE)) / sd(argument1, na.rm = TRUE)  # standardize argument1
  return(output)
}
  • Can take any number of arguments (named or unnamed)
  • You don’t have to include a return() statement at the end, but it is good practice, and being explicit about the return value is the convention in some other languages (e.g., Python)

Example Function: Calculate the Average of a Vector

Let’s create a simple function that calculates the average of a numeric vector.

# Define the function to calculate average
calculate_average <- function(vec) {
  sum_value <- sum(vec)  # Calculate the sum of elements in the vector
  length_value <- length(vec)  # Calculate the number of elements in the vector
  avg <- sum_value / length_value  # Calculate the average
  return(avg)  # Return the calculated average
}

# Example usage
my_vector <- rnorm(20, 0, 1)
calculate_average(my_vector)  # Output: 
[1] -0.2178085

Your Turn

  • Generate a vector of 20 values using rnorm (you can use whatever you want for mean and sd)
  • Write a function to calculate the variance
  • What happens if you increase n to 2 million?
    • Note - a 2-million-row column won’t even fit in an Excel worksheet (the row limit is about 1.05 million rows), so this calculation is effectively impossible there!
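
If you want to check your work afterwards, here is one possible sketch of such a function (using the n - 1 denominator, i.e. the sample variance):

# One possible solution sketch (sample variance, dividing by n - 1)
calculate_variance <- function(vec) {
  n <- length(vec)
  mean_value <- sum(vec) / n           # reuse the averaging logic from above
  squared_devs <- (vec - mean_value)^2
  return(sum(squared_devs) / (n - 1))  # sample variance uses n - 1
}

my_vector <- rnorm(20, 0, 1)
calculate_variance(my_vector)    # should match var(my_vector)
big_vector <- rnorm(2e6, 0, 1)   # 2 million draws: still quick in R
calculate_variance(big_vector)   # should be close to 1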

Where are we?

  • We just wrapped up distributions of a single random variable

    • You should feel comfortable with expectation and variance, as well as the various functions that go with the RVs
  • Before we get to hypothesis testing, we need to cover distributions of multiple random variables

    • This will allow us to learn how the variables are jointly distributed, and how one variable influences another variable

Joint Distributions

Adding lines of fit

Discrete Joint Distributions

The joint probability mass function (p.m.f.) of a pair of discrete random variables (X, Y) describes the probability of any pair of values:

\[ p_{x,y}(x,y) = P(X = x, Y = y) \]

It may not be obvious, but this allows us to start to answer interesting questions!

Probability of the sample space

So, now the probability for the sample space looks like:

\[ \sum_{x}\sum_{y}P(X = x, Y = y) = 1 \]

Easy to get P(X,Y) from P(X) and P(Y) if X and Y are independent - but usually they aren’t!

We are going to be interested in how variables co-vary

Example - Support for the Dream Act

Gender Support Dream Act (Y = 1) Oppose Dream Act (Y = 0)
Male (X = 1) 0.24 0.24
Female (X = 0) 0.34 0.18

What’s P(Y = 1,X = 1)?
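
As a quick R sketch (just an illustration, not part of the original table), we can store this joint p.m.f. as a matrix and read the answer off directly:

# joint p.m.f. from the table: rows are X (gender), columns are Y (support)
p_xy <- matrix(c(0.24, 0.24,
                 0.34, 0.18),
               nrow = 2, byrow = TRUE,
               dimnames = list(X = c("Male", "Female"),
                               Y = c("Support", "Oppose")))
sum(p_xy)                # the whole sample space: probabilities sum to 1
p_xy["Male", "Support"]  # P(X = 1, Y = 1) = 0.24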

(Discrete) Marginal Distributions

What about the distribution for just one of the variables?

The Marginal PMF of Y is:

\[ P(Y = y) = \sum_{x} P(X=x, Y=y) \]

We are summing the probability that Y = y over all possible values of X = x

Terminology: Often referred to as marginalizing out X. Not the same as marginal change.

(Discrete) Marginal Distributions

Gender Support Dream Act (Y = 1) Oppose Dream Act (Y = 0)
Male (X = 1) 0.24 0.24
Female (X = 0) 0.34 0.18
Marginal Distribution of Support 0.58 0.42

What’s P(Y = 1)? And P(X = 1)?
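
In R, marginalizing is just summing the joint matrix over the other variable (a quick sketch, re-entering the joint table as a matrix):

# marginal distributions: sum the joint p.m.f. over the other variable
p_xy <- matrix(c(0.24, 0.24, 0.34, 0.18), nrow = 2, byrow = TRUE)
colSums(p_xy)  # marginal of Y: P(Y = 1) = 0.58, P(Y = 0) = 0.42
rowSums(p_xy)  # marginal of X: P(X = 1) = 0.48, P(X = 0) = 0.52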

Conditional PMF

The conditional probability mass function of Y|X is

\[ P(Y = y | X = x) = \frac{P(X = x, Y = y)}{P(X = x)}, \text{ for all } x \text{ s.t. } P(X = x) > 0 \]

It has all the usual properties: \(P(Y = y | X = x) \geq 0\) and \(\sum_{y} P(Y = y| X = x) = 1\)

and

\[ E[Y | X = x] = \sum_{y} y P(Y =y |X = x) \]

Conditional Distribution

Gender Support Dream Act (Y = 1) Oppose Dream Act (Y = 0)
Male (X = 1) 0.24/0.48 = 0.50 0.24/0.48 = 0.50
Female (X = 0) 0.34/0.52 = 0.65 0.18/0.52 = 0.35
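
A quick R sketch of the same calculation: divide each row of the joint p.m.f. by the corresponding marginal P(X = x):

# conditional p.m.f. of Y given X: divide each row of the joint by P(X = x)
p_xy <- matrix(c(0.24, 0.24, 0.34, 0.18), nrow = 2, byrow = TRUE)
p_y_given_x <- sweep(p_xy, 1, rowSums(p_xy), "/")
p_y_given_x           # first row: 0.50, 0.50; second row: ~0.65, ~0.35
rowSums(p_y_given_x)  # each conditional distribution sums to 1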

Plotting the Conditionals

Bayes Rule and LoTP

Bayes Rule for random variables

\[ P(Y = y| X = x) = \frac{P(X=x|Y = y)P(Y = y)}{P(X=x)} \]

Law of total probability for random variables

\[ P(X = x) = \sum_{y} P(X = x|Y = y)P(Y=y) \]
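
A quick sanity check in R, using the Dream Act table from before (just an illustration):

# conditionals of X given Y, read off the joint table
p_x1_given_y1 <- 0.24 / 0.58   # P(X = 1 | Y = 1)
p_x1_given_y0 <- 0.24 / 0.42   # P(X = 1 | Y = 0)
# law of total probability: P(X = 1)
p_x1 <- p_x1_given_y1 * 0.58 + p_x1_given_y0 * 0.42   # 0.48
# Bayes rule: P(Y = 1 | X = 1)
p_x1_given_y1 * 0.58 / p_x1                           # 0.5, matching the conditional table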

Joint CDFs

For two rvs, X and Y, the joint cdf \(F_{x,y}(x,y)\) can be written

\[ F_{x,y}(x,y) = P(X \leq x, Y \leq y) \]

For two discrete RVs, that gives us

\[ F_{x,y}(x,y) = \sum_{i \leq x}\sum_{j \leq y} P(X = i, Y = j) \]

The Multinomial Distribution

The multinomial distribution is a generalization of the binomial distribution.

While the binomial distribution describes the number of successes in a fixed number of independent Bernoulli trials (with two possible outcomes), the multinomial distribution describes the number of occurrences of each of several possible outcomes in a fixed number of independent trials.

Technically it’s an extension of the closely related categorical distribution to n trials.

Definition

The multinomial distribution applies when:

  • There are n independent trials.
  • Each trial results in one of k possible outcomes.
  • The probability of each outcome \(O_i\) on a single trial is \(p_i\), where \(\sum_{i=1}^k p_i = 1\)

If we let \(X_i\) denote the number of times outcome \(O_i\) occurs in n trials, then the vector \(X = (X_1, X_2, \ldots X_k)\) follows a multinomial distribution.

Probability Mass Function (PMF)

The probability of observing a specific outcome vector \(x = (x_1, x_2, \ldots, x_k)\) is:

\(P(\mathbf{X} = \mathbf{x}) = \frac{n!}{x_1! x_2! \cdots x_k!} p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}\)

where:

  • n is the total number of trials.
  • \(x_i\) is the number of times outcome \(O_i\) occurs.
  • \(p_i\) is the probability of outcome \(O_i\).

Properties

  1. Expectation: The expected value of each \(X_i\) is \(E[X_i] = n \cdot p_i\)
  2. Variance: The variance of each \(X_i\) is \(\text{Var}(X_i) = n p_i (1 - p_i)\)

Example

Suppose you are rolling a fair six-sided die 10 times.

If the die is fair, then \(p_1 = p_2 = \ldots = p_6 = \frac{1}{6}\). The number of times each side appears follows a multinomial distribution with parameters \(n = 10\) and \(p_1 = p_2 = \cdots = p_6 = \frac{1}{6}\)
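
Base R has dmultinom() and rmultinom() for exactly this setup; a quick sketch for the die example:

probs <- rep(1 / 6, 6)
# probability of one specific count vector: faces 1-4 twice each, faces 5-6 once
dmultinom(c(2, 2, 2, 2, 1, 1), size = 10, prob = probs)
# simulate 5 sets of 10 rolls (each column is one count vector)
rmultinom(5, size = 10, prob = probs)
# expected count for each face: n * p_i
10 * probs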

Text as Data!

We need to make raw text usable

The Multinomial Language Model

  • A probabilistic model of language that maps naturally to a bag-of-words approach

  • An improbable/inaccurate model of language: it assumes that words are drawn from a bag of words following a multinomial distribution

    • But, very useful computationally! Can run models on a decent laptop.
  • Draws are independent: Any given word depends only on the specified probabilities, not on previously used words

Modelling Topics

  • Use the frequency of co-occurrence of words to uncover latent topics in the data.

    • If immigrant and asylum and refugee and border keep co-occurring together, they probably have some sort of relationship

    • Use an extension of the multinomial distribution to model both what words make up topics, and how frequently topics occur

  • Topic models are extensively used in social science research, and can be run (in a few hours) on a decent laptop

Application of Topic Models

  • Do political candidates who are from working class backgrounds talk about different topics than other candidates?

  • Collected 15,000 campaign advertisements in the UK, covering 1992-2010.

    • Converted to raw text files, then to a bag-of-words format
  • Each topic is assumed to have its own (roughly) multinomial distribution over words, where words related to the topic have higher probabilities

Example Document

Results

The Worst (Best?) Song Ever

Representing a Vocabulary

Encode each token [~= each word] with a vector whose length equals the size of the vocabulary, J:

\[ \begin{gather} \text{Everybody} = (1,0,0,0) \\ \text{Heard} = (0,1,0,0) \\ \text{About} = (0,0,1,0) \\ \text{Bird} = (0,0,0,1) \\ \end{gather} \]

Nice for representing texts mathematically, and is very computationally efficient.

The Categorical Distribution

If a document is one token long, we can think of it as a draw from a categorical distribution

\[ W_{i} \sim Categorical(\mu) \]

Formally: Probability Mass Function

Consider our previous vocabulary. Bird is more frequent than Everybody or Heard, so:

\[ \mu = (.125, .125, .125, .625) \]

Then we can write the probability mass function for a single draw (categorical distribution) as:

\[ p(W_{i}|\mu) = \prod_{j=1}^{J} \mu_{j}^{W_{ij}} \]

where J is the size of the vocabulary

Multinomial Distribution

Consider the probability of seeing a document of more than one word.

We need to add a parameter M which is the length of the document. So:

\[ p(W_{i}|\mu) = \frac{M!}{\prod_{j=1}^{J}W_{ij}!} \prod_{j=1}^{J}\mu_{j}^{W_{ij}} \]

What is the probability of drawing (Everybody, Heard, Bird, Bird, Bird)?

\[ \begin{gather} \text{Everybody} = (1,0,0,0) \\ \text{Heard} = (0,1,0,0) \\ \text{About} = (0,0,1,0) \\ \text{Bird} = (0,0,0,1) \\ \end{gather} \]

\[ \mu = (.125, .125, .125, .625) \]

\[ p(W_{i}|\mu) = \frac{M!}{\prod_{j=1}^{J}W_{ij}!} \prod_{j=1}^{J}\mu_{j}^{W_{ij}} \]
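
One way to answer this is with base R's dmultinom(); a quick sketch (the counts below follow the vocabulary order Everybody, Heard, About, Bird):

mu <- c(0.125, 0.125, 0.125, 0.625)   # (Everybody, Heard, About, Bird)
counts <- c(1, 1, 0, 3)               # M = 5 tokens in total
dmultinom(counts, size = 5, prob = mu)
# by hand: 5!/(1! 1! 0! 3!) * 0.125 * 0.125 * 0.625^3 ≈ 0.076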

Features of the Multinomial Distribution

\[ E[W_{ij}] = M_{i}\mu_{j} \]

\[ Var(W_{ij}) = M_{i}\mu_{j}(1 - \mu_{j}) \]

\[ MLE: \widehat{\mu}_{j} = \frac{W_{ij}}{M_{i}} \]

Multinomial: What is it good for?

  • One application: determining the authorship of disputed documents.

    • Consider the three potential authors of unattributed Federalist Papers (which I am led to believe are important?).

    • Each author has their own distribution over the vocabulary, \(\widehat{\mu_{ij}}\)

      • To be more specific, we observe word frequencies from the Federalist Papers for which we know authorship
    • We also have the word counts for three words in the disputed Federalist paper

Word Counts

Author By Man Upon
Hamilton 859 102 374
Jay 82 0 1
Madison 474 17 7
Disputed 15 2 0

\[ W_{H} \sim Multinomial(1335, \mu_{H}) \]

\[ W_{J} \sim Multinomial(83, \mu_{J}) \]

\[ W_{M} \sim Multinomial(498, \mu_{M}) \]
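
A rough sketch of this in R, restricting attention to just these three words (an illustration, not the full analysis):

# plug-in (MLE) word rates for each author over (by, man, upon), then the
# multinomial likelihood of the disputed counts (15, 2, 0) under each author
counts <- rbind(Hamilton = c(859, 102, 374),
                Jay      = c(82,    0,   1),
                Madison  = c(474,  17,   7))
mu_hat <- counts / rowSums(counts)   # MLE: W_ij / M_i for each author
disputed <- c(15, 2, 0)
apply(mu_hat, 1, function(mu) dmultinom(disputed, size = sum(disputed), prob = mu))
# Jay's likelihood is exactly 0: his estimated rate for "man" is 0,
# but the disputed paper uses "man" twice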

Regularization

  • There was something weird about the prediction for John Jay

    • Why did the model predict \(p(W_{Disputed} |\mu_{J}) = 0\)?
  • We need a way to deal with the model overlearning/overfitting the data.

  • Regularization: A non-data constraint or addition to a model that pushes our estimate towards a certain value.

    • Common estimators with regularization: Ridge and Lasso, which penalize models with too many (or too large) parameters

Laplace Smoothing -> Dirichlet Distribution

  • Laplace Smoothing Intuition: Add a small amount \(\alpha\) to the count of each word type so the model never guesses 0 (see the sketch after this list).

  • Data Generating Process for the Dirichlet Distribution

    • Sample word probabilities for each author: \(\mu_{k} \sim Dirichlet(\alpha)\), \(k \in \{1, 2, 3\}\)

    • Stack the author-specific word rates in the columns of a matrix \(\mu = [\mu_{1}, \mu_{2}, \mu_{3}]\)

    • Sample text using the author's word probabilities \(W_{i}|\mu,\pi_{i} \sim Multinomial(M_{i}, \mu \pi_{i})\)
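
A minimal sketch of Laplace smoothing applied to Jay's counts, assuming \(\alpha = 1\) purely for illustration:

alpha <- 1                      # pseudo-count added to every word type (illustrative choice)
counts_jay <- c(82, 0, 1)       # by, man, upon
mu_smooth <- (counts_jay + alpha) / sum(counts_jay + alpha)
mu_smooth                       # no zero probabilities any more
dmultinom(c(15, 2, 0), size = 17, prob = mu_smooth)   # no longer exactly 0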

Summing up the Multinomial (and Dirichlet) representations

  • The assumption that word use comes from an underlying, possibly author-specific, distribution is a useful fiction.

    • It is possible, but complicated, to build conditional distributions that do take context into account
  • Going forward: Multinomial and Dirichlet representations are foundational for Topic Models. Both are also very important in machine learning.

Continuous Random Variables

One continuous r.v.: prob. of being in an interval on the real line.

Continuous Random Variables

Two continuous rvs: probability of being in some subset of the 2-dimensional plane

Continuous joint PDF

If two continuous random variables X and Y have a CDF \(F_{x,y}\), their joint p.d.f. is the derivative of \(F_{x,y}\) with respect to x and y

\[ f_{x,y}(x,y) = \frac{\partial^{2} }{\partial x \partial y} F_{x,y}(x,y) \]

To get the probability of a region, integrate over both dimensions

\[ P((X,Y) \in A) = \int\int_{(x,y) \in A} f_{x,y}(x,y)\,dx\,dy \]
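
A numerical sketch in R, assuming (only to keep the joint density simple) that X and Y are independent standard normals:

# P(0 < X < 1, 0 < Y < 1), approximated by summing the joint density over a fine grid
step <- 0.001
x <- seq(0, 1, by = step)
y <- seq(0, 1, by = step)
f_xy <- outer(dnorm(x), dnorm(y))  # joint density f(x)f(y) under independence
sum(f_xy) * step^2                 # ≈ (pnorm(1) - pnorm(0))^2 ≈ 0.117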

Joint Densities

For the joint distribution of two variables, we can visualize the PDF in 3D, where the probability is now the volume above a specific region. This generalizes to n variables.

Continuous Marginal Distributions

We can get the marginal PDF of one of the variables by integrating over the other

\[ f_{y}(y) = \int_{-\infty}^{\infty} f_{x,y}(x,y) dx \]

Works both ways

\[ f_{x}(x) = \int_{-\infty}^{\infty} f_{x,y}(x,y)dy \]

Integrating over the other RV flattens the joint density into one dimension.
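
A quick check in R with integrate(), again using independent standard normals as a toy joint density:

# recover the marginal f_X(x) at x = 0.5 by integrating the joint density over y
f_joint <- function(y, x) dnorm(x) * dnorm(y)
integrate(f_joint, lower = -Inf, upper = Inf, x = 0.5)$value  # matches dnorm(0.5) ≈ 0.352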

Continuous Conditional Distributions

The conditional PDF of Y given X is

\[ f_{y|x}(y|x) = \frac{f_{x,y}(x,y)}{f_{x}(x)} \]

So,

\[ P(a < Y < b| X = x) = \int_{a}^{b}f_{y|x}(y|x)dy \]

This also implies that:

\[ f_{x,y}(x,y) = f_{y|x}(y|x)f_{x}(x) \]

Conditional Distributions are slices

Note - to actually get the conditional PDF we would need to divide by f(x) at the x value of the slice.

Summarizing Joint Distributions

  • We summarize univariate distributions with expectation and variance.

  • With multivariate distributions, these are still useful but we also care about how the variables depend on each other

    • Specifically we will care a lot about covariance and correlation

Joint Expectations

For discrete X and Y

\[ E[XY] = \sum_{x}\sum_{y} xy f_{x,y}(x,y) \]

For continuous X and Y

\[ E[XY] = \int_{x}\int_{y} xyf_{x,y}(x,y)dxdy \]

Marginal Expectation (discrete)

\[ E[Y] = \sum_{x}\sum_{y} yf_{x,y}(x,y) \]

Correlation and Covariance

Importance

  • Independence assumptions are everywhere

    • Sampling - each respondent's probability of response is independent

    • In an RCT, treatment assignment is independent of potential confounders

  • Lack of independence can also be key

    • Fundamentally, hypothesis testing is about showing associations or dependencies between variables

    • In observational analyses, treatment is extremely unlikely to be independent, necessitating controls

Covariance

How do we measure an association between two RVs?

Covariance

\[ \text{Cov}[X,Y] = E[(X - E[X])(Y - E[Y])] \]

Properties

Cov[X,Y] = E[XY] - E[X]E[Y]

If X and Y are independent, Cov(X,Y) = 0
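
Applying this to the Dream Act joint p.m.f. (a quick sketch, not from the slides):

# Cov(X, Y) = E[XY] - E[X]E[Y] computed from the joint table
p_xy <- matrix(c(0.24, 0.24, 0.34, 0.18), nrow = 2, byrow = TRUE)
x_vals <- c(1, 0)                          # rows: Male (X = 1), Female (X = 0)
y_vals <- c(1, 0)                          # columns: Support (Y = 1), Oppose (Y = 0)
E_XY <- sum(outer(x_vals, y_vals) * p_xy)  # 0.24
E_X  <- sum(x_vals * rowSums(p_xy))        # 0.48
E_Y  <- sum(y_vals * colSums(p_xy))        # 0.58
E_XY - E_X * E_Y                           # -0.0384: being male is associated with less support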

Graphical Intuition

Covariance Intuition

Stronger Covariance

Negative Covariance

Properties of Covariance

  1. Cov(X, X) = Var(X)
  2. Cov(X, Y) = Cov(Y, X)
  3. Cov(X, c) = 0 for any constant c
  4. Cov(aX, Y) = aCov(X, Y)
  5. Cov(X + Y, Z) = Cov(X, Z) + Cov(Y, Z)

Cov(X,Y) = 0 does not imply independence: covariance measures linear dependence, so it can miss non-linear dependence.

Relation to Variance

  • Before we saw that Var(X + Y) = Var[X] + Var[Y] if X and Y are independent

    • And Var[X - Y] = Var[X] + Var[Y] as well
  • But usually X and Y are not independent

    • Var[X + Y] = Var[X] + Var[Y] + 2Cov[X,Y]

Correlation

Correlation is a scale free measure of linear dependence

\[ \rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}} = \text{Cov}\left(\frac{X - E[X]}{\text{SD}(X)}, \frac{Y - E[Y]}{\text{SD}(Y)}\right) \]

Basically, we normalize the covariance such that it is bounded by \(-1 \leq \rho \leq 1\).

\(|\rho| = 1\) if and only if there is a perfect linear relationship between X and Y (\(\rho = 1\) for a positive slope, \(\rho = -1\) for a negative one).
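
A quick simulation sketch in R showing that correlation is scale-free while covariance is not (values will vary slightly with the seed):

set.seed(1)
x <- rnorm(1000)
y <- 2 * x + rnorm(1000)   # linearly related to x, plus noise
cov(x, y)                  # depends on the units/scale of x and y
cor(x, y)                  # scale-free, always between -1 and 1
cor(3 * x, y)              # rescaling x leaves the correlation unchanged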