This assignment is due Friday October 6, 2023 by 11:59pm. Note that late submissions will not be
accepted. This problem set contains four questions and is worth a total of 100 points. You may work in
groups of up to four people total, but you must write up (and understand) your own answers. Do not copy
and paste answers or parts of answers (including code) from one another.
Please submit your completed assignment by uploading to bCourses. Your submission should include three
components: (1) a list of any other students you worked with, (2) the text of your response (along with any
equations, diagrams, etc.), (3) a "do file" or "log file" that shows your work for any coding exercises. Feel
free to attach your output file separately. However, I’d prefer it to be incorporated into your text file.
Question 1: Omitted Variable Bias (17 points)
A microfinance institution (MFI) currently operates in about 50 villages in Uganda, and wants to learn more
about the impact of their microfinance loans on small business owners. Specifically, the MFI is interested in
learning whether having a loan allows a business to expand their operations, increase sales, and earn more
revenue. The MFI has conducted a survey of 250 business owners across the 50 villages where it operates.
Some of these business owners have a microfinance loan and some don’t. The results of this survey include
whether the business owner currently has a microfinance loan (LOAN_i) and the value of business revenues
over the past month (REVENUE_i).
The MFI asks you to run the following short regression:
$$\text{REVENUE}_i = \alpha_0 + \alpha_1 \text{LOAN}_i + \varepsilon_i$$
(a) (2 pts) In words, how do you interpret α0? How do you interpret α1?
(Hint: What is E[REVENUE | LOAN = 0]? What is E[REVENUE | LOAN = 1]?)
(b) (2 pts) Explain why α1 would be a biased estimate of the true impact of loans on business revenues.
Using only your intuition, would you expect α1 to be an overestimate or an underestimate of the true
causal impact of microfinance loans on revenues, and why?
You learn that the MFI also collected educational attainment, measured as the number of years of education
completed (EDU_i), as part of the survey. Now you are able to run the following long regression:
$$\text{REVENUE}_i = \beta_0 + \beta_1 \text{LOAN}_i + \beta_2 \text{EDU}_i + \nu_i$$
(c) (2 pts) Write down the auxiliary regression (use γ0 and γ1 as parameters). Interpret each of γ0 and γ1.
(d) (3 pts) Plug the auxiliary regression into the long regression, and derive the expression showing how α1 and β1
relate. In your own words, interpret this expression.
(e) (3 pts) Would you expect β2 to be positive, negative, or zero? Would you expect γ1 to be positive,
negative, or zero?
(f) (3 pts) Once you include educational attainment as a control in the long regression, can you interpret
β1 as the causal impact of microfinance loans on business revenues? Why or why not?
(g) (2 pts) Give an example of a bad control. What kind of problem does it create?
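If it helps to build intuition for Question 1, the contrast between a short and a long regression can be explored on simulated data. The Python sketch below is only an illustration and is not part of the assignment: the data are made up, the names treated, confounder, and outcome and the coefficients 2.0 and 3.0 are assumptions chosen for the example, and it uses a small hand-written OLS routine rather than Stata.

```python
import random


def ols(y, columns):
    """Return OLS coefficients [intercept, slope_1, ...] from the normal equations."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Build X'X and X'y.
    xtx = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    aug = [xtx[a] + [xty[a]] for a in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[a][k] / aug[a][a] for a in range(k)]


def simulate(n=5000, seed=0):
    """Simulated (not real) data in which confounder drives both treated and outcome."""
    rng = random.Random(seed)
    confounder = [rng.gauss(0, 1) for _ in range(n)]
    # Binary regressor that is more likely to equal 1 when confounder is high (assumption).
    treated = [1.0 if c + rng.gauss(0, 1) > 0 else 0.0 for c in confounder]
    # Assumed data-generating process: true slopes are 2.0 and 3.0.
    outcome = [1.0 + 2.0 * t + 3.0 * c + rng.gauss(0, 1)
               for t, c in zip(treated, confounder)]
    return outcome, treated, confounder


outcome, treated, confounder = simulate()
short = ols(outcome, [treated])               # outcome on treated only
long_ = ols(outcome, [treated, confounder])   # outcome on treated and confounder
print("short-regression slope on treated:", round(short[1], 3))
print("long-regression slope on treated: ", round(long_[1], 3))
```

Changing how strongly confounder drives treated or outcome inside simulate shows how the gap between the two slopes moves.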
Economics 174: Fall 2023 Problem Set 1
Question 2: Potential Outcomes (18 points)
Suppose there is an NGO working in rural areas of Kenya that provides farmers with a year-long agricultural
training program led by agronomists. As part of this program, farmers are trained on various seed
technologies and agricultural practices that can improve crop yields and reduce the incidence of pests and
disease. In the first year, farmers can choose whether or not they would like to participate in the program.
You and your research partner are collaborating with the NGO to learn more about the impact of the training
program. The NGO provides you with a list of all farmers in several villages, whether or not they chose to
participate in the training program (TRAINING_i), and crop yields for the past 3 months (YIELD_i).
(a) (1 pt) What is the treatment variable and what is the outcome variable of interest in this example?
(b) (4 pts) Write down the four relevant potential outcomes using notation similar to what we saw in
lecture and section. In words, explain what each of these captures. Which of these potential outcomes
do you actually observe?
Based on your data, you learn that crop yields are on average 1.2 tons per hectare for farmers who did not
participate in the training program, and 1.5 tons per hectare for farmers who did participate.
(c) (3 pts) Your research partner concludes that the impact of the training program was to boost yields
by 0.3 tons per hectare. Do you agree with your research partner? In four sentences or less, explain
why you agree with their conclusion, or explain why their conclusion is incorrect.
(d) (1 pt) Write down the mathematical expression that captures the observational difference in yields
between farmers who chose to participate and farmers who chose not to participate.
(e) (1 pt) Write down an expression that captures the true impact of the program on yields.
(Hint: use some of the potential outcomes you wrote down in part (b)).
(f) (4 pts) Use the selection bias framework to relate/compare the expression from part (d) (the observed
difference in yields between participants and non-participants) to the expression from part (e) (the true
impact of the program on yields). In words, interpret the difference between the observed difference
in yields and the true impact of the program on yields. (Hint: start with the expression from part (d),
rewrite an equivalent expression using potential outcomes notation, and proceed from there.)
Suppose for the next year, the NGO has more farmers signing up to participate than there are available slots.
So, the NGO takes the full list of farmers who have signed up, and randomly selects half to participate.
(g) (4 pts) Explain mathematically and intuitively what randomization achieves. Do you expect the
observed difference in yields between participants and non-participants in the second year (with randomization)
to be larger or smaller than the observed difference in yields in the first year (without
randomization)? Why? Be clear about any assumptions critical to your reasoning.
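To experiment with the ideas in Question 2, the sketch below simulates potential outcomes under two assignment rules: self-selection and a coin flip. It is a toy illustration in Python and not part of the assignment. Everything in it is an assumption made for the example, including the unobserved trait, the true effect of 0.5, and the direction of selection (units with a high trait are assumed more likely to opt in). A different selection pattern would give a different picture.

```python
import random


def mean(values):
    return sum(values) / len(values)


def simulate(n=20000, true_effect=0.5, seed=1):
    """Return the observed treated-vs-untreated gap under two assignment rules."""
    rng = random.Random(seed)
    # Hypothetical unobserved trait that raises the untreated outcome.
    trait = [rng.gauss(0, 1) for _ in range(n)]
    y0 = [1.0 + t + rng.gauss(0, 0.5) for t in trait]   # outcome without treatment
    y1 = [v + true_effect for v in y0]                   # outcome with treatment

    # Rule 1 (assumption): units with a high trait are more likely to opt in.
    chose = [t + rng.gauss(0, 1) > 0 for t in trait]
    # Rule 2: a fair coin flip decides who is treated.
    coin = [rng.random() < 0.5 for _ in range(n)]

    def observed_gap(assigned):
        # Only y1 is observed for treated units and only y0 for untreated units.
        treated = [a for a, d in zip(y1, assigned) if d]
        untreated = [b for b, d in zip(y0, assigned) if not d]
        return mean(treated) - mean(untreated)

    return observed_gap(chose), observed_gap(coin)


gap_self_selected, gap_randomized = simulate()
print("true effect (assumed):       0.5")
print("observed gap, self-selection:", round(gap_self_selected, 3))
print("observed gap, randomization: ", round(gap_randomized, 3))
```

Flipping the sign of the trait in the opt-in rule is a quick way to see how the comparison depends on who chooses to participate.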
Question 3: Intent to Treat & Local Average Treatment Effects (30 pts)
The Supplemental Nutrition Assistance Program (SNAP) is a federal program designed to improve nutrition
and reduce food insecurity among low-income households in the US by providing funds that can be used
for purchasing food. Households that earn below a certain threshold level of income are eligible for SNAP.
However, not all households that meet these eligibility criteria make use of this potentially beneficial program.
Suppose a state agency is interested in increasing take-up of SNAP and evaluating the impact of SNAP
on food security among low-income households. They partner with an organization that provides eligible
low-income households with more information about SNAP and with help going through the administrative
process required to sign up.
The organization is expanding to a new geographic area. In the process, the organization identifies a sample of
eligible, low-income households in that area. Approximately half of the households are assigned to treatment,
and so receive direct outreach (consisting of information about SNAP and the offer of sign up assistance if
the household has not already signed up for SNAP). The other half of households are assigned to control,
and so don’t receive direct outreach with information and sign up assistance.
For all households in treatment and control, the organization collects (1) basic information about the household
(including income, number of household members, etc.), (2) whether or not the household makes use
of SNAP, (3) information related to food consumption that indicates whether a household is food insecure
(defined as being in the condition where lack of income and other resources prevent access to adequate food).
After the first year, your research assistant produces the following table of results for you to interpret:
                       (1)                           (2)
                       Makes use of SNAP benefits    Food Insecure
                       (=1 if yes, =0 if no)         (=1 if insecure, =0 if secure)
Treatment              0.085                         -0.062
                       (0.0302)                      (0.0259)
Observations           926                           926
Control Group Mean     0.648                         0.225
(a) (1 pt) Write down the expression representing the regression presented in column (1). Use π0 to
represent the intercept and π1 to represent the slope coefficient.
(b) (4 pts) In words, what does the parameter π0 represent? What is the estimate for π0? In words, what
does π1 represent? What is the estimate for π1?
(c) (4 pts) Construct a confidence interval for the difference in SNAP sign up rates between the two groups.
Is the estimate presented in the table statistically significant at the 5% level?
(d) (1 pt) Write down the expression representing the regression presented in column (2). Use β0 to
represent the intercept and β1 to represent the slope coefficient.
(e) (4 pts) In words, what does the parameter β0 represent? What is the estimate for β0? In words, what
does β1 represent? What is the estimate for β1?
(f) (4 pts) Now identify the fraction of always-takers, never-takers, and compliers in the sample. Are the
fractions of always-takers, never-takers, and compliers the same across treatment and control groups?
(g) (4 pts) The organization suggests disregarding those households assigned to the treatment group who
did not sign up for SNAP and those assigned to the control group who did sign up for SNAP. Instead,
the organization suggests comparing food insecurity across those in the treatment group who have
signed up for SNAP to those in the control group who have not signed up for SNAP. Is this a good
approach? In four sentences or less, explain why or why not.
(h) (3 pts) What is the ITT estimate of the impact of assignment to treatment on food insecurity?
Calculate the t-statistic associated with the estimate. Is this estimate statistically significant? Explain.
(i) (2 pts) What is the LATE estimate of the impact of signing up for the program on food insecurity?
(j) (3 pts) In four sentences or less, explain the difference between the ITT and LATE estimates.
(Consider: What does the ITT capture? The LATE? What is the distinction between the two?)
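The arithmetic behind a confidence interval, a t-statistic, and a ratio-style LATE estimate can be sketched in a few lines of Python. This is a hedged illustration rather than a required method: it assumes the large-sample normal approximation (critical value 1.96 for a 95% interval) and assumes the LATE is estimated as the ITT divided by the first stage, which relies on the standard instrumental-variables assumptions. The function names are hypothetical, and the numbers are made up and deliberately differ from those in the table.

```python
def confidence_interval(estimate, std_error, critical_value=1.96):
    """Symmetric interval: estimate plus or minus critical_value * std_error."""
    half_width = critical_value * std_error
    return estimate - half_width, estimate + half_width


def t_statistic(estimate, std_error, null_value=0.0):
    """Distance between the estimate and the null value, in standard errors."""
    return (estimate - null_value) / std_error


def wald_ratio(itt, first_stage):
    """Ratio of the effect of assignment on the outcome to its effect on take-up."""
    return itt / first_stage


# Hypothetical inputs (not the values from the table in this question).
first_stage, first_stage_se = 0.20, 0.05   # effect of assignment on take-up
itt, itt_se = -0.10, 0.04                  # effect of assignment on the outcome

low, high = confidence_interval(first_stage, first_stage_se)
print("95% CI for first stage:", (round(low, 3), round(high, 3)))
print("t-statistic for ITT:   ", round(t_statistic(itt, itt_se), 2))
print("ratio estimate:        ", round(wald_ratio(itt, first_stage), 3))
```

An estimate is conventionally called significant at the 5% level when the absolute t-statistic exceeds the critical value, which is the same as the 95% interval excluding zero.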
Question 4: Randomized Evaluations (35 points)
In this question, you’ll perform some data analysis and interpret your findings. To receive full credit, you’ll
need to turn in a well-commented "do file" or "log file" containing all the code you used to complete this
question. You can submit your code separately, or attach your code to the end of your problem set and
submit in one file.
This question uses an adapted dataset based on Muralidharan, Singh, and Ganimian’s 2019 paper “Disrupting
Education? Experimental Evidence on Technology-Aided Instruction in India.” The paper is available online
and the replication data is available on ICPSR. Download the adapted dataset from bCourses.
This project evaluated the impact of a center-based and technology-aided after-school educational program
on math and Hindi performance among middle schoolers living in low-income neighborhoods in urban India.
The technology-based curriculum was designed to be high-quality, adaptive, and engaging. Approximately
600 middle schoolers were recruited to participate in the study. Half of these recruited students were randomly
allocated by lottery to receive a voucher to participate in the program (treatment group), and half were not
(control group). For the purposes of this question, you can assume that 100% of those assigned to treatment
participated in the program and that 0% of those assigned to control participated in the program.
Below is a list of the variables included in the dataset, with a brief description of each. Note that BL refers
to versions of variables collected at baseline (collected before the program began) while EL refers to variables
collected at endline (collected at the conclusion of the program).
Variable                               Description
student_id                             Identification numbers that uniquely identify students
student_age                            Age of the student (collected at baseline)
student_female                         Indicator (0/1) for whether the student is female
student_grade                          Grade of the student (collected at baseline)
treatment                              Indicator (0/1) for treatment status of the student
BL_math_percent, EL_math_percent       Math score in percent correct
BL_hindi_percent, EL_hindi_percent     Hindi score in percent correct
BL_ses_index, EL_ses_index             Household wealth index
(a) (5 pts) Import the data and generate a table of summary statistics. What is the range of ages and
grade levels within this sample? What are average scores for math and Hindi at baseline? At endline?
(b) First, you’ll check whether the treatment and control groups are balanced in terms of each of the
following variables: (1) age, (2) sex, (3) household wealth index at baseline, (4) math scores at baseline,
and (5) Hindi scores at baseline. (Hint: You’ll run a separate regression for each of these five variables.
Use outreg2 or another command of your choice to output the tables.)
(i) (5 pts) First (before you code up any regressions), write down the regression you plan to run for
at least one of the variables. Which parameter represents the coefficient of interest? What do you
expect the estimate of this parameter to be (positive, negative, zero) and why?
(ii) (6 pts) Use the reg command to run the regressions in Stata. Are there significant differences
between the treatment and control groups for any of these five variables? What is the purpose of
this exercise, and what are you able to conclude?
(c) Next, you’ll estimate the impact of the treatment on math and Hindi scores at endline.
(i) (5 pts) First, write the two regressions you will run to estimate these treatment effects. In words,
what will each parameter in these regressions capture?
(ii) (6 pts) Run the two regressions using the reg command and use the outreg2 command to produce
tables containing the results of these two regressions. Interpret your results. What is the effect of
the treatment on each of math and Hindi scores? Are these estimated treatment effects statistically
significant? Explain.
(d) (6 pts) Next, replicate your analysis, including each of the five variables listed in part (b) as controls.
Use outreg2 to produce tables containing the results of these two regressions (or, you may create one
table with results from the four regressions from parts (c) and (d) side by side). How do your estimated
treatment effects compare to those from part (c)? Explain.
(e) (2 pts) What do you conclude about the impact of the program on math and Hindi scores at endline?
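Every regression in Question 4 without controls has a single 0/1 regressor, so it may help to see what such a regression computes. The Python sketch below is an illustration on simulated data and does not use the course dataset: the variable baseline_score, the seed, and the score distribution are assumptions, the sample of 600 with half assigned by lottery only mirrors the design described above, and the standard error is the conventional homoskedastic one. Your actual analysis should still be run in Stata as instructed.

```python
import math
import random


def regress_on_dummy(y, d):
    """OLS of y on a constant and a 0/1 indicator d.

    Returns (intercept, slope, slope_se) with the conventional
    homoskedastic standard error and n - 2 degrees of freedom.
    """
    n = len(y)
    d_bar = sum(d) / n
    y_bar = sum(y) / n
    sxx = sum((di - d_bar) ** 2 for di in d)
    sxy = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
    slope = sxy / sxx
    intercept = y_bar - slope * d_bar
    rss = sum((yi - intercept - slope * di) ** 2 for di, yi in zip(d, y))
    slope_se = math.sqrt(rss / (n - 2) / sxx)
    return intercept, slope, slope_se


# Simulated lottery: exactly half of 600 hypothetical students are treated.
rng = random.Random(2)
n = 600
treatment = [1] * (n // 2) + [0] * (n // 2)
rng.shuffle(treatment)
# Hypothetical baseline covariate drawn independently of treatment status.
baseline_score = [rng.gauss(50, 10) for _ in range(n)]

intercept, slope, se = regress_on_dummy(baseline_score, treatment)
print("intercept (control-group mean):  ", round(intercept, 3))
print("slope (treated minus control):   ", round(slope, 3))
print("t-statistic on the slope:        ", round(slope / se, 2))
```

The intercept equals the control-group mean and the slope equals the difference in group means, which is why a regression of a baseline variable on the treatment indicator works as a balance check.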