One common type of experiment is a field experiment, where the researcher leaves their office and actually goes out “in the field” to conduct research
Field experiments (or Randomized Controlled Trials) involve actual policy interventions on real individuals
We will take a close look at a field experiment by Nobel Prize-winning economist Esther Duflo and co-authors.
Background: Teacher absenteeism is a large problem in rural India, and educational outcomes are poor.
Seva Mandir runs single-teacher schools in rural India
Tries to minimize absenteeism by “berating” absent teachers, yet absenteeism remains high
Question: Can policy makers design an intervention to incentivize higher rates of teacher attendance?
Seva Mandir gave 57 randomly selected teachers cameras, along with instructions to have students take date-stamped pictures with the teacher at the start and end of the day.
Teachers are paid according to a non-linear function of the days of school they attend. They receive Rs. 500 if they attend 10 days or fewer, plus an additional Rs. 50 for each day beyond that.
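This incentive scheme can be sketched as a small function. The Rs. 500 base and Rs. 50 per-day bonus come from the study; treating day 10 as the exact cutoff and omitting any pay cap are simplifying assumptions here:

```python
def teacher_pay(days_attended: int) -> int:
    """Monthly pay (Rs.) under the camera program's non-linear scheme.

    Flat Rs. 500 for up to 10 days of attendance, plus Rs. 50 for each
    day beyond 10 (cutoff treatment is an assumption; no cap modeled).
    """
    base = 500
    bonus_per_day = 50
    return base + bonus_per_day * max(0, days_attended - 10)

# The marginal incentive only kicks in after day 10:
print(teacher_pay(8))   # 500
print(teacher_pay(10))  # 500
print(teacher_pay(21))  # 1050
```

The kink at 10 days is what makes the "in the money" vs. "not in the money" distinction later in the lecture meaningful: only teachers past the kink face a marginal incentive for one more day.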
Is this ethical? What are the considerations?
Teacher attendance increased significantly
Test scores improved in the monitored schools by about 0.2 standard deviations
After 2.5 years, children exposed to the program were 62 percent more likely to transfer to formal primary schools
Duflo and co-authors work with a non-profit, Seva Mandir, which is a very common way to do this kind of work.
Sometimes, it’s also a way to find funding if you can provide free analysis for an NGO
But - don’t outsource the work to the NGO. See recent scandal with GDRI in Bangladesh
120 total schools, randomized. Some attrition; 3 schools in the treatment group and 4 in the control closed.
Initial wage of teachers: Rs. 1,000 ($160 PPP) for a minimum of 20 days of work
Control group gets initial wage
Teachers reminded that they could be dismissed for poor attendance
Study done in collaboration with J-PAL, which is one of the main funders of this type of work. They handled data collection
Data was collected on teacher attendance through two unannounced visits per month
How many students were present? Was anything written on the blackboard? Was the teacher talking to the children?
Seva Mandir provided the camera and payment data to the researchers
Three exams were administered to all students in the program
A pretest (either oral for those who could not write, or written)
Midtest (all students get oral and written exam)
Basic math, vocabulary (and some more complex math on written)
Students who cannot write get 0 on written
Post-test (all students get written and oral)
A similar proportion of students (17% in treatment vs. 19% in control) took the written exam
Control group scored slightly higher on the oral exam, while the treatment group scored slightly higher on the written exam
This is simply the difference in % of schools open when randomly checked. Experiments allow for very simple inference!
Attendance was 79% for teachers in the treatment group against 58% in the control group.
Standard errors are clustered by schools, because treatments are assigned at school level.
Using parametric inference rather than randomization inference.
Randomization inference could also be appropriate here, but again this should just be specified in pre-registration!
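Randomization inference is easy to sketch for the school-level comparison. The school counts below mirror the study (57 treatment and 60 control schools, minus closures), but the open-rates are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical school-level outcomes: share of random checks where the
# school was open. These numbers are made up, not the study's data.
treated = rng.uniform(0.6, 1.0, size=54)   # 57 schools minus 3 closures
control = rng.uniform(0.4, 0.8, size=56)   # 60 schools minus 4 closures
observed = treated.mean() - control.mean()

# Re-randomize treatment labels many times. Under the sharp null of no
# effect, the observed difference should look typical of these permuted
# differences; the p-value is the share of permutations at least as extreme.
pooled = np.concatenate([treated, control])
n_treat = len(treated)
perm_diffs = []
for _ in range(5000):
    perm = rng.permutation(pooled)
    perm_diffs.append(perm[:n_treat].mean() - perm[n_treat:].mean())

p_value = np.mean(np.abs(perm_diffs) >= abs(observed))
print(f"observed diff = {observed:.3f}, permutation p = {p_value:.4f}")
```

Because treatment is assigned at the school level, the school is the unit being permuted here, which is the same logic behind clustering the standard errors by school in the parametric approach.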
Let’s think through what this is telling us. Black line = teachers already ‘in the money’ and red line = teachers not in the money.
What are the precise treatment effects that are measured?
Are there any potential barriers to inference from internal validity?
What is the level of external validity?
To where might this generalize? To where should we be more skeptical?
Benefits:
Very high levels of external validity
Can involve real interventions
Drawbacks:
Less control over treatments
More potential for internal validity issues; reliance on third parties
Cost (funding is available, but competitive).
Phenomenal Resources here: https://www.povertyactionlab.org/research-resources?view=toc and https://www.povertyactionlab.org/page/handbook-field-experiments
Keep an eye out for grants, follow NGOs, Non-profits, funders and government agencies in your area
Work with faculty on these types of projects; if you have an idea, we often have experience applying for grants, etc. May also lend credibility to your applications.
Can either be true “lab” experiments or “lab-in-the-field” experiments
Either way, recruit subjects to participate in some sort of experiment
Often, but not always, this involves having the subjects play some sort of competitive or cooperative game
For more on games, see Fehr on canvas.
Another great example of cooperation games is “Why Does Ethnic Diversity Undermine Public Goods Provision?” by Habyarimana et al
Will focus on a different type of example to show how you can get creative with a lab experiment. Ismail White and co-authors wanted to test how much social group norms explained Black support for the Democratic Party in the US.
Expectations
We expect that blacks are constrained from following strict self-interest by the social costs incurred when other blacks question their commitment to or standing within the group. Moreover, we argue that such social pressure can be internalized, creating an individual belief in black solidarity that is also constraining and works to prevent self-interested behavior.
White et al. run three separate experiments to test these expectations
Experiment 1:
Control Group: Asked to donate $100 to candidate or split
T1: Asked to donate, receive $1 for every $10 donated to Romney
T2: Asked to donate, same payoffs, name will appear in newspaper as a Romney donor
The state that this experiment took place in (Louisiana) had a law against allocating state money to political campaigns, so no money was actually allocated
Students were led to believe these would be real donations with real consequences
Were debriefed immediately after
What exactly is the design testing?
Any concerns about internal validity?
Any concerns about external validity?
What do we think of the HBCU setting?
Treatment Group 1 will donate significantly less to Obama (than control)
Treatment Group 2 will donate significantly more to Obama than treatment group 1.
Subjects: 106 black students at a predominantly white university in the midwest
Each student given $10 in dollar bills to donate, could place in boxes marked Obama or Romney. Told that each dollar would be matched 10 to 1, or they could keep the money instead.
T1: Paired with a black actor (same sex) who pretended to be a participant.
T2: Paired with a white actor (same sex) who pretended to be a participant.
Sample: 56 black students at a predominantly white school
Told that for every $10 donated to Romney, would receive $1.
Always a black actor present, in the control the actor gives money to Obama and in treatment gives it to Romney.
What is this testing?
What are the precise treatment effects we estimate in this study?
Who is the sample - how might this generalize?
What might other barriers to inference be in these studies?
Lab experiments give us greater control over treatment than experiments in the field
Can test a wider range of interventions
But, interventions may have less external validity
Less costly than field experiments, but still need money
In practice, subjects often university students
Gives rise to the WEIRD (Western, Educated, Industrialized, Rich and Democratic) problem
We can also embed experiments in surveys
Obvious limitations; limited to text and image prompts
Cheapest way to run an experiment
Embeddedness within survey allows you to collect lots of background data!
We will look at list experiments and conjoints.
This is just scratching the surface.
Any sort of intervention you can think of; but external validity is harder to justify
Can code your own interventions, randomization, etc via Qualtrics and pay for a sample
Cheapest option, you all have access to Qualtrics through Clemson
Sample cost varies; more reliable = more money
If you are serious about doing a survey experiment, see me.
Can also pay a firm to do all that for you (for $$)
List experiments are useful for eliciting sensitive or embarrassing information
Lessens chance of identification
Allows respondents to indirectly provide the information
Imai et al. have a nice R package, list, for analysis
Imagine you wanted to measure racial prejudice. You can’t (or shouldn’t) just ask someone if they are racist.
Instead, we embed a question about racial attitudes in a list and ask respondents how many of the items they agree with.
How do we get identification? If the list has n items, the control group sees n-1 items (without the race question) and the treatment group sees n items. The difference in means on items agreed to is the treatment effect.
Control:
Now I’m going to read you three things that sometimes make people angry or upset. After I read all three, just tell me HOW MANY of them upset you. (I don’t want to know which ones, just how many.)
the federal government increasing the tax on gasoline
professional athletes getting million-dollar-plus salaries
large corporations polluting the environment
How many, if any, of these things upset you?
Treatment:
Now I’m going to read you four things that sometimes make people angry or upset. After I read all four, just tell me HOW MANY of them upset you. (I don’t want to know which ones, just how many.)
the federal government increasing the tax on gasoline
professional athletes getting million-dollar-plus salaries
large corporations polluting the environment
a black family moving next door to you
How many, if any, of these things upset you?
Don’t use items that are highly correlated with each other or with the treatment item. We don’t want too many people agreeing to all items. Also, don’t use items that everyone or no one will agree with.
Assumption: The inclusion of a sensitive item has no effect on respondents’ answers to control items
Assumption: Respondents don’t lie about sensitive item
Implies that we can identify the following treatment effect
\[ \widehat{\tau} = \frac{1}{N_{1}}\sum_{i = 1}^{N}T_{i}Y_{i} - \frac{1}{N_{0}}\sum_{i = 1}^{N}(1 - T_{i})Y_{i} \]
which is equal to the proportion of people answering affirmatively to the sensitive item
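The estimator is just a difference in means on the item counts. A sketch on made-up list-experiment responses (the counts below are illustrative, not Kuklinski et al.'s data):

```python
import numpy as np

# Hypothetical counts of items that upset each respondent:
# the control group sees 3 items, the treatment group sees 4
# (the 3 control items plus the sensitive item).
control_counts = np.array([1, 2, 0, 3, 1, 2, 2, 1, 0, 2])
treatment_counts = np.array([2, 2, 1, 4, 1, 3, 2, 2, 1, 3])

# tau-hat = mean(treatment) - mean(control): under the two assumptions
# above, this is the proportion answering affirmatively to the
# sensitive item.
tau_hat = treatment_counts.mean() - control_counts.mean()
print(f"Estimated share upset by the sensitive item: {tau_hat:.2f}")
```

With these invented counts the treatment mean is 2.1 and the control mean is 1.4, so the estimate is that 70% of respondents are upset by the sensitive item.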
By the 1990s, survey data suggested that racial attitudes among northern whites and southern whites were converging, and racism among southern whites had massively declined.
Supposedly, this ushered in a “new South”, where Southern political behavior should come to mirror that of the nation as a whole, if we can believe the data.
Kuklinski and co-authors use a list experiment to try to tackle this question, using the list on the previous slides
Anything to be suspicious about here?
Some people (Particularly in the control) will say all of the items made them angry

Note - this would lead to underestimation
We have to make an additional assumption to adjust for potential ceiling effects
Respondents’ truthful answers to the sensitive item are independent of their answers to the control items, conditional on covariates X
Formally, \(P(Y_i(0) = y \mid z^{*}_{i} = 1, X = x) = P(Y_i(0) = y \mid z^{*}_{i} = 0, X = x)\)
What do we make of the plausibility of this assumption?
Under this assumption, we can predict individuals’ responses to the control questions based on pre-treatment covariates to work out the number of “liars” in the treatment condition
Estimation is complicated; it uses an expectation-maximization algorithm that is beyond the scope of the course
The intuition isn’t too different from MLE.
See Blair and Imai (2012) on canvas for a full explanation
Many things we measure in political science are multidimensional
It would be nice to have a design that let us vary a bunch of dimensions simultaneously, and randomly, and recover causal effects for each component
This is what we use the conjoint for
Imagine we wanted to design the “perfect” political candidate
We could vary age (25, 35, 45, 55, 65)
Race/Ethnicity (Black, White, Hispanic, Asian, Middle Eastern)
Ideology (very left, center left, center, center right, very right)
Prior Occupation (Lawyer, Business Person, Activist, etc)
| Attribute | Candidate A | Candidate B |
|---|---|---|
| Ideology | Progressive | Conservative |
| Race | Black | White |
| Age | 45 | 60 |
| Prior Occupation | Teacher | Business Executive |
Respondent “votes” for one of the candidates?
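Randomly generating profile pairs like the one above takes only a few lines. The attribute levels follow the slides; the occupation list is truncated arbitrarily for illustration, and full randomization (no blocked combinations) is assumed:

```python
import random

# Attribute levels from the slides (occupation list is illustrative).
ATTRIBUTES = {
    "Ideology": ["very left", "center left", "center", "center right", "very right"],
    "Race": ["Black", "White", "Hispanic", "Asian", "Middle Eastern"],
    "Age": [25, 35, 45, 55, 65],
    "Prior Occupation": ["Lawyer", "Business Person", "Activist", "Teacher"],
}

def random_profile(rng: random.Random) -> dict:
    """Draw one candidate profile with every level equally likely."""
    return {attr: rng.choice(levels) for attr, levels in ATTRIBUTES.items()}

rng = random.Random(42)
candidate_a, candidate_b = random_profile(rng), random_profile(rng)
for attr in ATTRIBUTES:
    print(f"{attr}: {candidate_a[attr]} vs {candidate_b[attr]}")
```

In practice a platform like Qualtrics handles this randomization inside the survey, but the logic is exactly this: independent uniform draws over each attribute's levels.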
This is pretty cheap to implement
Can show the same individual several profiles
Should we force choice, or let respondents be undecided?
Full randomization or block unrealistic combinations?
External validity?
We can estimate an Average Marginal Component Effect for each component
\[ \pi_{\ell}(t_1, t_0, p(\mathbf{t})) = \mathbb{E}\left[ Y_i(t_1, T_{ijk[-\ell]}, T_{i[-j]k}) - Y_i(t_0, T_{ijk[-\ell]}, T_{i[-j]k}) \,\middle|\, (T_{ijk[-\ell]}, T_{i[-j]k}) \in \tilde{\mathcal{T}} \right] \]
\[ = \sum_{(t, \mathbf{t}) \in \tilde{\mathcal{T}}} \mathbb{E}\left[ Y_i(t_1, t, \mathbf{t}) - Y_i(t_0, t, \mathbf{t}) \right] \cdot p\left( T_{ijk[-\ell]} = t, T_{i[-j]k} = \mathbf{t} \,\middle|\, (T_{ijk[-\ell]}, T_{i[-j]k}) \in \tilde{\mathcal{T}} \right) \]
This is gross notation! It’s the effect of the \(\ell\)th attribute, averaged over all possible profiles. It’s the Law of Total Probability from the fall!
Stability: Faced with two identical profiles, respondents will always choose the same candidate as long as profiles maintain identical attributes.
No profile order effects: Respondents don’t change their behavior between pair 1, pair 2, pair 3 and so on.
Randomization of Profiles: All potential profiles have a non-zero probability of appearing. (can be relaxed with additional assumptions, see Hainmueller et al section 4.1)
Pretty amazing that you can get that many causal effects from one experiment!
But….
Everything is conditional on the profiles presented and their probabilities
Everything is relative to the reference category within an attribute class
What is the precise effect we are estimating?
External Validity?
Since everything is relative to a reference group, subgroup analysis is relative to a reference group within a subgroup
Two ways to resolve this:
Pick the most meaningful subgroup, be clear in interpretation
Present marginal means instead - not relative to anything!
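Marginal means sidestep the reference-category issue: for each level, just average the outcome over all profiles containing that level. A sketch on simulated data (attribute, levels, and support rates are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6000

levels = np.array(["Lawyer", "Teacher", "Activist"])
occupation = rng.choice(levels, size=n)

# Made-up support probabilities by occupation level.
base = {"Lawyer": 0.50, "Teacher": 0.58, "Activist": 0.45}
chosen = rng.random(n) < np.array([base[x] for x in occupation])

# Marginal mean: average outcome among profiles showing each level.
# Interpretable on its own, not relative to any reference category.
for lvl in levels:
    mm = chosen[occupation == lvl].mean()
    print(f"marginal mean ({lvl}): {mm:.3f}")
```

A marginal mean of, say, 0.58 for "Teacher" reads directly as "profiles with a teacher are chosen 58% of the time," which is often easier to interpret (and to compare across subgroups) than an effect relative to a reference level.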