Regression discontinuity with sci-kit learn models

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ExpSineSquared, WhiteKernel
from sklearn.linear_model import LinearRegression

import causalpy as cp

WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.

%config InlineBackend.figure_format = 'retina'

Load data

data = cp.load_data("rd")
data.head()

	x	y	treated
0	-0.932739	-0.091919	False
1	-0.930778	-0.382663	False
2	-0.929110	-0.181786	False
3	-0.907419	-0.288245	False
4	-0.882469	-0.420811	False

Linear, main-effects model

result = cp.skl_experiments.RegressionDiscontinuity(
    data,
    formula="y ~ 1 + x + treated",
    model=LinearRegression(),
    treatment_threshold=0.5,
)

fig, ax = result.plot();

../_images/c889a6c1734c2c76790a7d3c9e3720fc296677807aa51199edde6b79ef8269ff.png

Linear, main-effects, and interaction model

result = cp.skl_experiments.RegressionDiscontinuity(
    data,
    formula="y ~ 1 + x + treated + x:treated",
    model=LinearRegression(),
    treatment_threshold=0.5,
)

result.plot();

../_images/80588536471fb11495e5da019a8f5386dd286d28966bfc52942c4deab512883e.png

Though we can see that this does not give a good fit of the data almost certainly overestimates the discontinuity at threshold.

Using a bandwidth

One way how we could deal with this is to use the bandwidth kwarg. This will only fit the model to data within a certain bandwidth of the threshold. If \(x\) is the running variable, then the model will only be fitted to data where \(threshold - bandwidth \le x \le threshold + bandwidth\).

result = cp.skl_experiments.RegressionDiscontinuity(
    data,
    formula="y ~ 1 + x + treated + x:treated",
    model=LinearRegression(),
    treatment_threshold=0.5,
    bandwidth=0.3,
)

result.plot();

../_images/88a3f61e66189b4cf8d6726394ab0b71b5920a859b005b45a368613b57ad28a3.png

We could even go crazy and just fit intercepts for the data close to the threshold. But clearly this will involve more estimation error as we are using less data.

result = cp.skl_experiments.RegressionDiscontinuity(
    data,
    formula="y ~ 1 + treated",
    model=LinearRegression(),
    treatment_threshold=0.5,
    bandwidth=0.3,
)

result.plot();

../_images/c0a6c04908bab4782a8b751c81a062cefa6d20b72f8dfe87382665b517ac95b6.png

Using Gaussian Processes

kernel = 1.0 * ExpSineSquared(1.0, 5.0) + WhiteKernel(1e-1)
result = cp.skl_experiments.RegressionDiscontinuity(
    data,
    formula="y ~ 1 + x + treated",
    model=GaussianProcessRegressor(kernel=kernel),
    treatment_threshold=0.5,
)

fig, ax = result.plot();

../_images/38ea53e263f7ddb2192fcaa96c50436c53bb6552aee4afadd2a62fcd67d97044.png

Using basis splines

result = cp.skl_experiments.RegressionDiscontinuity(
    data,
    formula="y ~ 1 + bs(x, df=6) + treated",
    model=LinearRegression(),
    treatment_threshold=0.5,
)

fig, ax = result.plot()

../_images/b7c87272e1a60630f3d16a5d4011a911fdc0d419f1c71cd9a48191ef5b93db38.png