Banking dataset with a pymc model

This notebook analyses historic data on banking closures from Richardson and Troost [2009] and used as a case study for a difference in differences analysis in the excellent book Mastering Metrics [Angrist and Pischke, 2014]. Here, we replicate this analysis, but using Bayesian inference.

import arviz as az
import pandas as pd

import causalpy as cp
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'
seed = 42

Load data

The raw dataset has a date column which is just some uninterpretable number - all we need for our analysis is the year column. We also have columns bib6, bio6, bib8, bio8. We know that the 6 and 8 represent the 6th and 8th Federal Reserve districts, respectively. I assume bib means “banks in business”, so we’ll keep those columns. The data is at daily resolution, but we will convert this to yearly resolution. And from what I can tell from Figure 5.2 of Angrist and Pischke [2014], they seem to present the median number of banks open per year. Let’s load the data up and do those steps.

df = (
    cp.load_data("banks")
    # just keep columns we want
    .filter(items=["bib6", "bib8", "year"])
    # rename to meaningful variables
    .rename(columns={"bib6": "Sixth District", "bib8": "Eighth District"})
    # reduce from daily resolution to examine median banks open by year
    .groupby("year")
    .median()
)

treatment_time = 1930.5

# set treatment time to zero
df.index = df.index - treatment_time
treatment_time = 0

Let’s visualise what we have. This matches up exactly with Figure 5.2 of the Angrist and Pischke [2014].

ax = df[["Sixth District", "Eighth District"]].plot(style="o-")
ax.set(ylabel="Number of banks in business")
ax.axvline(x=treatment_time, c="r", lw=3, label="intervention")
ax.legend();
../_images/94fd4fda3142b503a1369b96d13fda17edc7aa6962bdc84546866a23e22eeef0.png

Just a few more final processing steps to make the data amenable to analysis. We will convert from wide form into long form. Then we will add a new column treated which indicates the observations where treatment has taken place.

df.reset_index(level=0, inplace=True)
df_long = pd.melt(
    df,
    id_vars=["year"],
    value_vars=["Sixth District", "Eighth District"],
    var_name="district",
    value_name="bib",
).sort_values("year")

# We also need to create a column called `unit` which labels each distinct
# unit. In our case we just have two treatment units (each district). So
# we can build a `unit` column from `district`.
df_long["unit"] = df_long["district"]

# We also need to create a `post_treatment` column to define times after
# the intervention.
df_long["post_treatment"] = df_long.year >= treatment_time
df_long

# Dummy coding for district
df_long = df_long.replace({"district": {"Sixth District": 1, "Eighth District": 0}})
df_long
/var/folders/pd/p2qnky2x3xl4w3mgc4lct2200000gn/T/ipykernel_17493/4155710090.py:21: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df_long = df_long.replace({"district": {"Sixth District": 1, "Eighth District": 0}})
year district bib unit post_treatment
0 -1.5 1 141.0 Sixth District False
6 -1.5 0 170.0 Eighth District False
1 -0.5 1 135.0 Sixth District False
7 -0.5 0 165.0 Eighth District False
2 0.5 1 121.0 Sixth District True
8 0.5 0 132.0 Eighth District True
3 1.5 1 113.0 Sixth District True
9 1.5 0 120.0 Eighth District True
4 2.5 1 102.0 Sixth District True
10 2.5 0 111.0 Eighth District True
5 3.5 1 102.0 Sixth District True
11 3.5 0 110.0 Eighth District True

Analysis 1 - classic 2\(\times\)2 DiD

First we’ll do an analysis just looking at data from 1930 and 1931. This way we just have a single pre-intervention and a single post-intervention measurement.

We will use the formula: bib ~ 1 + district * post_treatment which equates to the following model of expected values:

\[\begin{split} \begin{aligned} \mu_i & = \beta_0 \\ & \quad + \beta_{d} \cdot district_i \\ & \quad + \beta_{p} \cdot post~treatment_i \\ & \quad + \beta_{\Delta} \cdot district_i \cdot post~treatment_i \end{aligned} \end{split}\]

Let’s just run through this to make sure we understand:

  • \(\mu_i\) is the expected value of the outcome (number of banks in business) for the \(i^{th}\) observation.

  • \(\beta_0\) is an intercept term to capture the basiline number of banks in business of the control group, in the pre-intervention period.

  • district is a dummy variable, so \(\beta_{d}\) will represent a main effect of district, that is any offset of the treatment group relative to the control group.

  • post_treatment is also a dummy variable which captures any shift in the outcome after the treatment time, regardless of the recieving treatment or not.

  • the interaction of the two dummary variables district:post_treatment will only take on values of 1 for the treatment group after the intervention. Therefore \(\beta_{\Delta}\) will represent our estimated causal effect.

result1 = cp.pymc_experiments.DifferenceInDifferences(
    df_long[df_long.year.isin([-0.5, 0.5])],
    formula="bib ~ 1 + district * post_treatment",
    time_variable_name="post_treatment",
    group_variable_name="district",
    model=cp.pymc_models.LinearRegression(
        sample_kwargs={"target_accept": 0.98, "random_seed": seed}
    ),
)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta, sigma]
100.00% [8000/8000 00:05<00:00 Sampling 4 chains, 7 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 6 seconds.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
There were 7 divergences after tuning. Increase `target_accept` or reparameterize.
Sampling: [beta, sigma, y_hat]
Sampling: [y_hat]
Sampling: [y_hat]
Sampling: [y_hat]
Sampling: [y_hat]

Note

We have some divergences here, which is a bad sign. This is likely related to the fact that we only have 4 data points but 5 parameters. This is not always a dealbreaker for a Bayesian analysis (because we have priors), nevertheless when we get divergent samples, this is a warning sign.

Using the following code, we can see that we have a classic “funnel problem” as the divergences occur when the sampler is exploring near zero values of the measurement error (the \(\sigma\) parameter).

az.plot_pair(result1.idata, var_names="~mu", divergences=True);

For ‘proper’ work, we’d want to fix things to avoid divergences by, for example, exploring different prior distributions for \(\sigma\).

fig, ax = result1.plot(round_to=3)
../_images/a7b9922fac871b72bf8b1fe6a7f78878dcac1b0dd57a2b2d821054a8418a3820.png
result1.summary()
===========================Difference in Differences============================
Formula: bib ~ 1 + district * post_treatment

Results:
Causal impact = 19, $CI_{94\%}$[15, 23]
Model coefficients:
Intercept                     165, 94% HDI [163, 167]
post_treatment[T.True]        -33, 94% HDI [-36, -30]
district                      -30, 94% HDI [-33, -27]
district:post_treatment[T.True]19, 94% HDI [15, 23]
sigma                         0.84, 94% HDI [0.11, 2.1]
ax = az.plot_posterior(result1.causal_impact, ref_val=0, round_to=3)
ax.set(title="Posterior estimate of causal impact");
../_images/d465b965eca6a49e1a84e14cd1959d4cc71363c18f2041435fe3d12d52e59093.png

Analysis 2 - DiD with multiple pre/post observations

Now we’ll do a difference in differences analysis of the full dataset. This approach has similarities to CITS (Comparative Interrupted Time-Series) with a single control over time. Although slightly abitrary, we distinguish between the two techniques on whether there is enough time series data for CITS to capture the time series patterns.

We will use the formula: bib ~ 1 + year + district*post_treatment which equates to the following model of expected values:

\[\begin{split} \begin{aligned} \mu_i & = \beta_0 \\ & \quad + \beta_{y} \cdot year_i \\ & \quad + \beta_{d} \cdot district_i \\ & \quad + \beta_{p} \cdot post~treatment_i \\ & \quad + \beta_{\Delta} \cdot district_i \cdot post~treatment_i \end{aligned} \end{split}\]

The only change here compared to the classic 2\(\times\)2 DiD model above is the addition of a main effect of year. Because this is coded numerically (not categorically) this can capture any linear changes in the outcome variable over time.

result2 = cp.pymc_experiments.DifferenceInDifferences(
    df_long,
    formula="bib ~ 1 + year + district*post_treatment",
    time_variable_name="year",
    group_variable_name="district",
    model=cp.pymc_models.LinearRegression(
        sample_kwargs={"target_accept": 0.95, "random_seed": seed}
    ),
)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta, sigma]
100.00% [8000/8000 00:02<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 2 seconds.
Sampling: [beta, sigma, y_hat]
Sampling: [y_hat]
Sampling: [y_hat]
Sampling: [y_hat]
Sampling: [y_hat]
fig, ax = result2.plot(round_to=3)
../_images/4ef4a59b7f4ebc8433a975235a91298209f2f1403a1e449ef9391670d539e44a.png
result2.summary()
===========================Difference in Differences============================
Formula: bib ~ 1 + year + district*post_treatment

Results:
Causal impact = 20, $CI_{94\%}$[15, 26]
Model coefficients:
Intercept                     160, 94% HDI [157, 164]
post_treatment[T.True]        -28, 94% HDI [-33, -22]
year                          -7.1, 94% HDI [-8.5, -5.7]
district                      -29, 94% HDI [-34, -24]
district:post_treatment[T.True]20, 94% HDI [15, 26]
sigma                         2.4, 94% HDI [1.7, 3.2]

By looking at the interaction term, which captures the causal impact of the intervention, we can see that it looks like about 20 banks were saved by the intervention. Though there is some uncertainty around this, we can see the full posterior estimate of this impact below.

ax = az.plot_posterior(result2.causal_impact, ref_val=0, round_to=3)
ax.set(title="Posterior estimate of causal impact");
../_images/1dc7bb482e53023f237ae99cf30036094d75c1c5c230c08bb52cfb5edfa62531.png

References

[1]

Gary Richardson and William Troost. Monetary intervention mitigated banking panics during the great depression: quasi-experimental evidence from a federal reserve district border, 1929–1933. Journal of Political Economy, 117(6):1031–1073, 2009.

[2] (1,2,3)

Joshua D Angrist and Jörn-Steffen Pischke. Mastering 'Metrics: The path from cause to effect. Princeton University Press, 2014.