RD Designs (in R)#
This is a jupyter notebook that can be executed with an R kernel running in the background to execute R code.
The content below draws heavily upon these two excellent souces:
Cattaneo, Idrobo and Titiunik (2008) “A Practical Introduction to Regression Discontinuity Designs: Part I,” in Cambridge Elements: Quantitative and Computational Methods for Social Science, Cambridge University Press. See also “Part II” paper.
Meyersson (2014): Islamic Rule and the Empowerment of the Poor and Pious, Econometrica 82(1): 229-269.
Links to these papers and data and both Stata and R code for replication at RD Software Packages site.
R Code#
Show code cell source
#install.packages('rdrobust') #install rbrobust if needed
options(repr.plot.width=6, repr.plot.height=5) #control the size of R plots in jupyter
# Load R libraries
library(foreign)
library(ggplot2)
library(rdrobust)
Show code cell output
Registered S3 methods overwritten by 'ggplot2':
method from
[.quosures rlang
c.quosures rlang
print.quosures rlang
data = read.dta("data/CIT_2018_Cambridge_polecon.dta")
Y = data$Y
X = data$X
T = data$T
T_X = T*X
Raw Comparison of means (Figure 2.3a)
If we simply compare outcomes in treated (Islamic major) and non-treated areas (secular major), we find lower female educational attainment in areas where Islamic parties won.
rdplot(Y, X, nbins = c(2500, 500), p = 0, col.lines = "red", col.dots = "gray",
title = "", x.label = "Islamic Margin of Victory",
y.label = "Female High School Share", y.lim = c(0,70))

Local comparison of means: Narrowing the bandwidth#
In last plot we compared mean value in treated and control group. But these two groups differ considerably.
Below we focus on more ‘local’ effects (closer to just above and below the cutoff).
In next plot we fit a 4th degree polynomial on each side but now limit bandwidth to ‘closer’ contests where absolute victory margin was within 50 points.
rdplot(Y[abs(X) <= 50], X[abs(X) <= 50], nbins = c(2500, 500), p = 4, col.lines = "red",
col.dots = "gray", title = "", x.label = "Islamic Margin of Victory",
y.label = "Female High School Share", y.lim = c(0,70))

rdplot(Y[abs(X) <= 10], X[abs(X) <= 10], nbins = c(2500, 500), p = 4, col.lines = "red", col.dots = "gray", title = "",
x.label = "Islamic Margin of Victory", y.label = "Female High School Share", y.lim = c(0,70))

RD plots#
scatter plots (as above) make it hard to discern ‘jumps’
RD plots aggregate or ‘smooth’ data before plotting.
local sample means (from bins) represented by dots
polynomial (or local linear) fits
Plenty of variation in bin and bandwidth selection and curve fitting based on these concepts.
rdplot(Y, X, nbins = c(20,20), binselect = 'esmv', x.label = 'Running Variable',
y.label = 'Outcome', title = '', y.lim = c(0,25))

The estimating equation:
All in one: $\( Y_i = f(X_i) +\beta D_i + \epsilon_i \)$
The function \(f(X_i)\) (e.g. a polynomial) must be continuous at \(X_0\)
The RD estimate is difference between weighted average of outcomes on the either side of the discontinuity. Fitting a high order polynomial can mean these weighted averages are driven by observations far from the threshold.
Parametric (polynomial on full data) or non-parametric (local regression closer to cutoff) approaches?
Andrew Gelman’s blog post from 8/13 discusses this issue. See also Development Impact Blog
Chen et al (2013) “Evidence on the impact of sustained exposure to air pollution on life expectancy from China’s Huai River policy,” PNAS.
Policy discontinuity: North of China’s Huai river free coal for heating is distributed in winter. None ti the south. Using RDD method find total suspended particulates (TSPs) air pollution 55% higher just North of river compared to just south. Estimates China’s coal-burning was reducing lifespan by 5 years for half a billion people.

Are observations far from threshold affecting polynomial fit, driving results? Smaller estimates with linear fits?
Fuzzy RD#
A fuzzy RD is like an RD with imperfect compliance or non-compliance. Assignment to Treatment is as before summarized by indicator \(I({X_i \ge X_0})\) but compliance is now imperfect. Like an Intent-to-Treat (ITT) estimator, the measured jump at the cutoff in the outcome needs to be rescaled.
Structural equation: \(Y_i = f(X_i)+\beta D_i +\epsilon_i\)
2nd stage reduced form: \(Y_i = f(X_i)+\pi_2 I({X_i \ge X_0} +\xi_{2i}\)
First stage: \(D_i = g(X_i)+\pi_1 I({X_i \ge X_0} +\xi_{1i}\)
Standard IV: \( \beta=\frac{\pi_2}{\pi_1} \)
Multi-Score RD design#
Example: Third graders have to go to summer school if score below \(X_{10}\) in reading test OR below \(X_{20}\) in math test. Does summer school raise test performance next year?
From Zamosc 2012
Polynomial Fit in 2 dimensions#
With two forcing variables
The cutoff is now a boundary
Heterogenous effects, depending on what segment of the boundary
with right assumptions (lots of data), can identify one treatment effect for every point b on boundary. Treatment effect curve \(\tau(b)\)
Average Treatment Effect (ATE) would average over boundary
Geographic RD designs#
Latitude, Longitude forcing variables. Collapse to one-dimensional distance (e.g. ‘distance to border’) problematic. The following image from Keele & Titiunik (2014).
See also
Dell, Melissa. 2010. “The Persistent Effects of Peru’s Mining ‘Mita.’” Econometrica 78 (6): 1863–1903.