One common type of experiment is a field experiment, where the researcher leaves their office and actually goes out “in the field” to conduct research
Field experiments (or Randomized Controlled Trials) involve actual policy interventions on real individuals
We will take a close look at a field experiment by Nobel Prize-winning economist Esther Duflo and co-authors.
Background: Teacher absenteeism is a large problem in rural India, and educational outcomes are poor.
Seva Mandir runs single-teacher schools in rural India
Tries to minimize absenteeism by “berating” absent teachers, yet absenteeism remains high
Question: Can policy makers design an intervention to incentivize higher rates of teacher attendance?
Seva Mandir gave 57 randomly selected teachers cameras, along with instructions to have students take date-stamped pictures with the teacher at the start and end of the day.
Teachers are paid according to a non-linear function of the days of school they attend. They receive Rs. 500 if they attend 10 days or fewer, plus an additional Rs. 50 for each day beyond that.
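This incentive scheme can be sketched as a small function. The Rs. 500 base and Rs. 50 per-day bonus come from the study; treating day 10 as the exact cutoff and omitting any pay cap are simplifying assumptions here:

```python
def teacher_pay(days_attended: int) -> int:
    """Monthly pay (Rs.) under the camera program's non-linear scheme.

    Flat Rs. 500 for up to 10 days of attendance, plus Rs. 50 for each
    day beyond 10 (cutoff treatment is an assumption; no cap modeled).
    """
    base = 500
    bonus_per_day = 50
    return base + bonus_per_day * max(0, days_attended - 10)

# The marginal incentive only kicks in after day 10:
print(teacher_pay(8))   # 500
print(teacher_pay(10))  # 500
print(teacher_pay(21))  # 1050
```

The kink at 10 days is what makes the "in the money" vs. "not in the money" distinction later in the lecture meaningful: only teachers past the kink face a marginal incentive for one more day.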
Is this ethical? What are the considerations?
Teacher attendance increased significantly
Test scores improved in the monitored schools by about 0.2 standard deviations
After 2.5 years, children exposed to the program were 62 percent more likely to transfer to formal primary schools
Duflo and co-authors work with a non-profit, Seva Mandir, which is a very common way to do this kind of work.
Sometimes, it’s also a way to find funding if you can provide free analysis for an NGO
But - don’t outsource the work to the NGO. See recent scandal with GDRI in Bangladesh
120 total schools, randomized. Some attrition; 3 schools in the treatment group and 4 in the control closed.
Initial wage of teachers: Rs. 1,000 ($160 PPP) for a minimum of 20 days of work
Control group gets initial wage
Teachers reminded that they could be dismissed for poor attendance
Study done in collaboration with J-PAL, which is one of the main funders of this type of work. They handled data collection
Data was collected on teacher attendance through two unannounced visits per month
How many students were present? Was anything written on the blackboard? Was the teacher talking to the children?
Seva Mandir provided the camera and payment data to the researchers
Three exams were administered to all students in the program
A pretest (either oral for those who could not write, or written)
Midtest (all students get oral and written exam)
Basic math, vocabulary (and some more complex math on written)
Students who cannot write get 0 on written
Post-test (all students get written and oral)
A similar proportion of students (17% in treatment vs. 19% in control) took the written exam
Control group scored slightly higher on the oral exam, while the treatment group scored slightly higher on the written exam
This is simply the difference in % of schools open when randomly checked. Experiments allow for very simple inference!
Attendance was 79% for teachers in the treatment group against 58% in the control group.
Standard errors are clustered by schools, because treatments are assigned at school level.
Using parametric inference rather than randomization inference.
Randomization inference could also be appropriate here, but again this should just be specified in pre-registration!
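Randomization inference is easy to sketch for the school-level comparison. The school counts below mirror the study (57 treatment and 60 control schools, minus closures), but the open-rates are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical school-level outcomes: share of random checks where the
# school was open. These numbers are made up, not the study's data.
treated = rng.uniform(0.6, 1.0, size=54)   # 57 schools minus 3 closures
control = rng.uniform(0.4, 0.8, size=56)   # 60 schools minus 4 closures
observed = treated.mean() - control.mean()

# Re-randomize treatment labels many times. Under the sharp null of no
# effect, the observed difference should look typical of these permuted
# differences; the p-value is the share of permutations at least as extreme.
pooled = np.concatenate([treated, control])
n_treat = len(treated)
perm_diffs = []
for _ in range(5000):
    perm = rng.permutation(pooled)
    perm_diffs.append(perm[:n_treat].mean() - perm[n_treat:].mean())

p_value = np.mean(np.abs(perm_diffs) >= abs(observed))
print(f"observed diff = {observed:.3f}, permutation p = {p_value:.4f}")
```

Because treatment is assigned at the school level, the school is the unit being permuted here, which is the same logic behind clustering the standard errors by school in the parametric approach.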
Let’s think through what this is telling us. Black line = teachers already ‘in the money’ and red line = teachers not in the money.
What are the precise treatment effects that are measured?
Are there any potential barriers to inference from internal validity?
What is the level of external validity?
To where might this generalize? To where should we be more skeptical?
Benefits:
Very high levels of external validity
Can involve real interventions
Drawbacks:
Less control over treatments
More potential for internal validity issues; reliance on third parties
Cost (funding is available, but competitive).
Phenomenal Resources here: https://www.povertyactionlab.org/research-resources?view=toc and https://www.povertyactionlab.org/page/handbook-field-experiments
Keep an eye out for grants, follow NGOs, Non-profits, funders and government agencies in your area
Work with faculty on these types of projects; if you have an idea, we often have experience applying for grants, etc. May also lend credibility to your applications.
Can either be true “lab” experiments or “lab-in-the-field” experiments
Either way, recruit subjects to participate in some sort of experiment
Often, but not always, this involves having the subjects play some sort of competitive or cooperative game
For more on games, see Fehr on canvas.
Another great example of cooperation games is “Why Does Ethnic Diversity Undermine Public Goods Provision?” by Habyarimana et al
Will focus on a different type of example to show how you can get creative with a lab experiment. Ismail White and co-authors wanted to test how much social group norms explained Black support for the Democratic Party in the US.
Expectations
We expect that blacks are constrained from following strict self-interest by the social costs incurred when other blacks question their commitment to or standing within the group. Moreover, we argue that such social pressure can be internalized, creating an individual belief in black solidarity that is also constraining and works to prevent self-interested behavior.
White et al. run three separate experiments to test these expectations
Experiment 1:
Control Group: Asked to donate $100 to candidate or split
T1: Asked to donate, receive $1 for every $10 donated to Romney
T2: Asked to donate, same payoffs, name will appear in newspaper as a Romney donor
The state that this experiment took place in (Louisiana) had a law against allocating state money to political campaigns, so no money was actually allocated
Students were led to believe these would be real donations with real consequences
Were debriefed immediately after
What exactly is the design testing?
Any concerns about internal validity?
Any concerns about external validity?
What do we think of the HBCU setting?
Treatment Group 1 will donate significantly less to Obama (than control)
Treatment Group 2 will donate significantly more to Obama than treatment group 1.
Subjects: 106 black students at a predominantly white university in the midwest
Each student given $10 in dollar bills to donate, could place in boxes marked Obama or Romney. Told that each dollar would be matched 10 to 1, or they could keep the money instead.
T1: Paired with a black actor (same sex) who pretended to be a participant.
T2: Paired with a white actor (same sex) who pretended to be a participant.
Sample: 56 black students at a predominantly white school
Told that for every $10 donated to Romney, would receive $1.
Always a black actor present, in the control the actor gives money to Obama and in treatment gives it to Romney.
What is this testing?
What are the precise treatment effects we estimate in this study?
Who is the sample - how might this generalize?
What might other barriers to inference be in these studies?
Lab experiments give us greater control over treatment than experiments in the field
Can test a wider range of interventions
But, interventions may have less external validity
Less costly than field experiments, but still need money
In practice, subjects often university students
Gives rise to the WEIRD (Western, Educated, Industrialized, Rich and Democratic) problem
We can also embed experiments in surveys
Obvious limitations; limited to text and image prompts
Cheapest way to run an experiment
Embeddedness within survey allows you to collect lots of background data!
We will look at list experiments and conjoints.
This is just scratching the surface.
Any sort of intervention you can think of; but external validity is harder to justify
Can code your own interventions, randomization, etc via Qualtrics and pay for a sample
Cheapest option, you all have access to Qualtrics through Clemson
Sample cost varies; more reliable = more money
If you are serious about doing a survey experiment, see me.
Can also pay a firm to do all that for you (for $$)
List experiments are useful for eliciting sensitive or embarrassing information
Lessens chance of identification
Allows respondents to indirectly provide the information
Imai et al. have a nice R package, list, for analysis
Imagine you wanted to measure racial prejudice. You can’t (or shouldn’t) just ask someone if they are racist.
Instead, we embed a question about racial attitudes in a list and ask respondents how many of the items they agree with.
How do we get identification? If the list has n items, the control group sees n-1 items (without the race question) and the treatment group sees n items. The difference in means on items agreed to is the treatment effect.
Control:
Now I’m going to read you three things that sometimes make people angry or upset. After I read all three, just tell me HOW MANY of them upset you. (I don’t want to know which ones, just how many.)
the federal government increasing the tax on gasoline
professional athletes getting million-dollar-plus salaries
large corporations polluting the environment
How many, if any, of these things upset you?
Treatment:
Now I’m going to read you four things that sometimes make people angry or upset. After I read all four, just tell me HOW MANY of them upset you. (I don’t want to know which ones, just how many.)
the federal government increasing the tax on gasoline
professional athletes getting million-dollar-plus salaries
large corporations polluting the environment
a black family moving next door to you
How many, if any, of these things upset you?
Don’t use items that are highly correlated with each other or with the treatment item. We don’t want too many people agreeing to all items. Also, don’t use items that everyone or no one will agree with.
Assumption: The inclusion of a sensitive item has no effect on respondents’ answers to control items
Assumption: Respondents don’t lie about sensitive item
Implies that we can identify the following treatment effect
\[ \widehat{\tau} = \frac{1}{N_{1}}\sum_{i = 1}^{N}T_{i}Y_{i} - \frac{1}{N_{0}}\sum_{i = 1}^{N}(1 - T_{i})Y_{i} \]
which is equal to the proportion of people answering affirmatively to the sensitive item
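The estimator is just a difference in means on the item counts. A sketch on made-up list-experiment responses (the counts below are illustrative, not Kuklinski et al.'s data):

```python
import numpy as np

# Hypothetical counts of items that upset each respondent:
# the control group sees 3 items, the treatment group sees 4
# (the 3 control items plus the sensitive item).
control_counts = np.array([1, 2, 0, 3, 1, 2, 2, 1, 0, 2])
treatment_counts = np.array([2, 2, 1, 4, 1, 3, 2, 2, 1, 3])

# tau-hat = mean(treatment) - mean(control): under the two assumptions
# above, this is the proportion answering affirmatively to the
# sensitive item.
tau_hat = treatment_counts.mean() - control_counts.mean()
print(f"Estimated share upset by the sensitive item: {tau_hat:.2f}")
```

With these invented counts the treatment mean is 2.1 and the control mean is 1.4, so the estimate is that 70% of respondents are upset by the sensitive item.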
By the 1990s, survey data suggested that racial attitudes among northern whites and southern whites were converging, and racism among southern whites had massively declined.
Supposedly, this ushered in a “new South”, where Southern political behavior should come to mirror that of the nation as a whole, if we can believe the data.
Kuklinski and co-authors use a list experiment to try to tackle this question, using the list on the previous slides
Anything to be suspicious about here?
Some people (Particularly in the control) will say all of the items made them angry

Note - this would lead to underestimation
We have to make an additional assumption to adjust for potential ceiling effects
Respondents’ truthful answers to the sensitive item are independent of their answers to the control items, conditional on covariates X
Formally, \(P(Y_i(0) = y \mid z^{*}_{i} = 1, X = x) = P(Y_i(0) = y \mid z^{*}_{i} = 0, X = x)\)
What do we make of the plausibility of this assumption?
Under this assumption, we can predict individuals’ responses to the control questions based on pre-treatment covariates to work out the number of “liars” in the treatment condition
Estimation is complicated; it uses an expectation-maximization algorithm that is beyond the scope of the course
The intuition isn’t too different from MLE.
See Blair and Imai (2012) on canvas for a full explanation
Many things we measure in political science are multidimensional
It would be nice to have a design that let us vary a bunch of dimensions simultaneously, and randomly, and recover causal effects for each component
This is what we use the conjoint for
Imagine we wanted to design the “perfect” political candidate
We could vary age (25, 35, 45, 55, 65)
Race/Ethnicity (Black, White, Hispanic, Asian, Middle Eastern)
Ideology (very left, center left, center, center right, very right)
Prior Occupation (Lawyer, Business Person, Activist, etc)
| Attribute | Candidate A | Candidate B |
|---|---|---|
| Ideology | Progressive | Conservative |
| Race | Black | White |
| Age | 45 | 60 |
| Prior Occupation | Teacher | Business Executive |
Respondent “votes” for one of the candidates?
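Randomly generating profile pairs like the one above takes only a few lines. The attribute levels follow the slides; the occupation list is truncated arbitrarily for illustration, and full randomization (no blocked combinations) is assumed:

```python
import random

# Attribute levels from the slides (occupation list is illustrative).
ATTRIBUTES = {
    "Ideology": ["very left", "center left", "center", "center right", "very right"],
    "Race": ["Black", "White", "Hispanic", "Asian", "Middle Eastern"],
    "Age": [25, 35, 45, 55, 65],
    "Prior Occupation": ["Lawyer", "Business Person", "Activist", "Teacher"],
}

def random_profile(rng: random.Random) -> dict:
    """Draw one candidate profile with every level equally likely."""
    return {attr: rng.choice(levels) for attr, levels in ATTRIBUTES.items()}

rng = random.Random(42)
candidate_a, candidate_b = random_profile(rng), random_profile(rng)
for attr in ATTRIBUTES:
    print(f"{attr}: {candidate_a[attr]} vs {candidate_b[attr]}")
```

In practice a platform like Qualtrics handles this randomization inside the survey, but the logic is exactly this: independent uniform draws over each attribute's levels.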
This is pretty cheap to implement
Can show the same individual several profiles
Should we force choice, or let respondents be undecided?
Full randomization or block unrealistic combinations?
External validity?
We can estimate an Average Marginal Component Effect for each component
\[ \pi_{\ell}(t_1, t_0, p(\mathbf{t})) = \mathbb{E}\left[ Y_i(t_1, T_{ijk[-\ell]}, T_{i[-j]k}) - Y_i(t_0, T_{ijk[-\ell]}, T_{i[-j]k}) \,\middle|\, (T_{ijk[-\ell]}, T_{i[-j]k}) \in \tilde{\mathcal{T}} \right] \]
\[ = \sum_{(t, \mathbf{t}) \in \tilde{\mathcal{T}}} \mathbb{E}\left[ Y_i(t_1, t, \mathbf{t}) - Y_i(t_0, t, \mathbf{t}) \right] \cdot p\left( T_{ijk[-\ell]} = t, T_{i[-j]k} = \mathbf{t} \,\middle|\, (T_{ijk[-\ell]}, T_{i[-j]k}) \in \tilde{\mathcal{T}} \right) \]
This is gross notation! It’s the effect of the \(\ell\)th attribute, averaged over all possible profiles. It’s the Law of Total Probability from the fall!
Stability: Faced with two identical profiles, respondents will always choose the same candidate as long as profiles maintain identical attributes.
No profile order effects: Respondents don’t change their behavior between pair 1, pair 2, pair 3 and so on.
Randomization of Profiles: All potential profiles have a non-zero probability of appearing. (can be relaxed with additional assumptions, see Hainmueller et al section 4.1)
Pretty amazing that you can get that many causal effects from one experiment!
But….
Everything is conditional on the profiles presented and their probabilities
Everything is relative to the reference category within an attribute class
What is the precise effect we are estimating?
External Validity?
Since everything is relative to a reference group, subgroup analysis is relative to a reference group within a subgroup
Two ways to resolve this:
Pick the most meaningful subgroup, be clear in interpretation
Present marginal means instead - not relative to anything!
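Marginal means sidestep the reference-category issue: for each level, just average the outcome over all profiles containing that level. A sketch on simulated data (attribute, levels, and support rates are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6000

levels = np.array(["Lawyer", "Teacher", "Activist"])
occupation = rng.choice(levels, size=n)

# Made-up support probabilities by occupation level.
base = {"Lawyer": 0.50, "Teacher": 0.58, "Activist": 0.45}
chosen = rng.random(n) < np.array([base[x] for x in occupation])

# Marginal mean: average outcome among profiles showing each level.
# Interpretable on its own, not relative to any reference category.
for lvl in levels:
    mm = chosen[occupation == lvl].mean()
    print(f"marginal mean ({lvl}): {mm:.3f}")
```

A marginal mean of, say, 0.58 for "Teacher" reads directly as "profiles with a teacher are chosen 58% of the time," which is often easier to interpret (and to compare across subgroups) than an effect relative to a reference level.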