import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import causalpy as cp

Inverse Propensity Score Weighting with pymc#

In this notebook we will briefly demonstrate how to use propensity score weighting schemes to recover treatment effects in the analysis of observational data. We will first showcase the method with a simulated data example drawn from Lucy D’Agostino McGowan’s excellent blog on inverse propensity score weighting. Then we shall apply the same techniques to the NHEFS data set discussed in Miguel Hernán and James Robins’ book Causal Inference: What If. This data set follows participants between 1971 and 1982: at each of these two points in time the participant’s weight was recorded, and we seek to estimate the effect of quitting smoking in the intervening years on the weight recorded in 1982.

We will use inverse propensity score weighting techniques to estimate the average treatment effect. There are a range of weighting techniques available: we have implemented raw, robust, doubly robust and overlap weighting schemes, all of which aim to estimate the average treatment effect. The idea of a propensity score (very broadly) is to derive a one-number summary of an individual’s probability of adopting a particular treatment. This score is typically calculated by fitting a logit model which predicts treatment status from all of an individual’s observed attributes, asking whether those attributes drive the individual towards the treatment. In the case of the NHEFS data we want a model of each individual’s propensity to quit smoking.
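
For orientation, the model involved is just a Bayesian logistic regression. The sketch below is in the spirit of what cp.pymc_models.PropensityScore fits under the hood, though the exact implementation may differ; the covariate matrix X and treatment vector t are illustrative stand-ins, not CausalPy internals.

import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))       # illustrative covariate matrix
t = rng.binomial(1, 0.5, size=200)  # illustrative binary treatment status

# Minimal propensity model sketch: P(treatment | covariates) via a logit link
with pm.Model() as ps_model:
    b = pm.Normal("b", 0, 1, shape=X.shape[1])               # coefficients
    p = pm.Deterministic("p", pm.math.invlogit(pm.math.dot(X, b)))  # propensity scores
    pm.Bernoulli("t_obs", p=p, observed=t)                   # observed treatment
    idata = pm.sample(progressbar=False)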

The reason we want this propensity score is that with observational data we often have an imbalance in covariate profiles across treatment groups, meaning our data might be unrepresentative in some crucial respect. This prevents us from cleanly reading off treatment effects by looking at simple group differences. These “imbalances” can be driven by selection effects into the treatment status, so if we want to estimate the average treatment effect in the population as a whole we need to be wary that our sample might not give us generalisable insight into the treatment differences. Using propensity scores as a measure of the propensity to adopt the treatment status in the population, we can cleverly weight the observed data to privilege observations of “rare” occurrence in each group. For example, if smoking is the treatment status and regular running is generally uncommon among smokers, then on the occasion we see a smoker who runs marathons we should heavily weight their outcome measure to overcome their low prevalence in the treated group but real presence in the broader population. Inverse propensity weighting defines weighting schemes that are inversely proportional to an individual’s propensity score, so as to recover an estimate which mitigates (somewhat) the risk of selection effect bias. For more details and illustration of these themes see the PyMC examples write-up on Non-Parametric Bayesian methods. [Forde, 2024]
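
To make the weighting idea concrete, here is a minimal numpy sketch of classical inverse propensity weights and a weighted group contrast. This is our own illustration, not the CausalPy internals; e, t and y are synthetic stand-ins for estimated propensity scores, treatment indicators and outcomes.

import numpy as np

rng = np.random.default_rng(1)
e = rng.uniform(0.2, 0.8, size=500)  # illustrative estimated propensity scores
t = rng.binomial(1, e)               # treatment taken with probability e
y = t * 2.0 + rng.normal(size=500)   # illustrative outcomes

# Inverse propensity weights: treated units with low propensity (and controls
# with high propensity) are "rare" in their group and get up-weighted
w = np.where(t == 1, 1 / e, 1 / (1 - e))

# Weighted group means and their contrast as an ATE estimate
mu1 = np.sum(w * y * (t == 1)) / np.sum(w * (t == 1))
mu0 = np.sum(w * y * (t == 0)) / np.sum(w * (t == 0))
ate = mu1 - mu0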

Simulated Data#

First we simulate some data for treatment and outcome variables.

# Two correlated covariates which drive both treatment uptake and the outcome
df1 = pd.DataFrame(
    np.random.multivariate_normal([0.5, 1], [[2, 1], [1, 1]], size=10000),
    columns=["x1", "x2"],
)
# Treatment assignment from a noisy latent score: larger x1 and x2 values
# make treatment more likely, inducing confounding
df1["trt"] = np.where(
    -0.5 + 0.25 * df1["x1"] + 0.75 * df1["x2"] + np.random.normal(0, 1, size=10000) > 0,
    1,
    0,
)
# Outcome depends on the treatment (true effect = 2) and both covariates
TREATMENT_EFFECT = 2
df1["outcome"] = (
    TREATMENT_EFFECT * df1["trt"]
    + df1["x1"]
    + df1["x2"]
    + np.random.normal(0, 1, size=10000)
)
df1.head()
x1 x2 trt outcome
0 2.134696 1.245262 1 4.829071
1 2.353206 1.414617 1 6.417505
2 0.352728 1.946989 1 1.884942
3 1.173141 1.718525 1 4.902149
4 2.870237 1.089666 1 5.857759

Note how we have specified the treatment effect of interest to be exactly 2.
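
Because x1 and x2 drive both treatment uptake and the outcome, a naive difference in group means will overstate this effect. A quick check of the raw contrast (our own snippet, not part of the experiment class):

# Naive difference in group means: biased upwards here, since the covariates
# that push individuals into treatment also raise the outcome
naive_ate = (
    df1.loc[df1["trt"] == 1, "outcome"].mean()
    - df1.loc[df1["trt"] == 0, "outcome"].mean()
)
print(naive_ate)  # noticeably larger than the true effect of 2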

Now we invoke the InversePropensityWeighting experiment class, with the PropensityScore model. This will by default fit a simple logistic regression model and store the idata under result1.idata.

seed = 42
result1 = cp.InversePropensityWeighting(
    df1,
    formula="trt ~ 1 + x1 + x2",
    outcome_variable="outcome",
    weighting_scheme="robust",
    model=cp.pymc_models.PropensityScore(
        sample_kwargs={
            "draws": 1000,
            "target_accept": 0.95,
            "random_seed": seed,
            "progressbar": False,
        },
    ),
)

result1
/Users/benjamv/mambaforge/envs/CausalPy/lib/python3.11/site-packages/pymc/data.py:304: FutureWarning: MutableData is deprecated. All Data variables are now mutable. Use Data instead.
  warnings.warn(
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [b]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 12 seconds.
Sampling: [b, t_pred]
Sampling: [t_pred]
<causalpy.exp_inverse_propensity_weighting.InversePropensityWeighting at 0x1784c9810>

We can interrogate this inference data object in the usual fashion to assess the fit of the propensity score model, looking here at the parameters of the logistic regression.

az.plot_posterior(result1.idata, var_names=["b"]);
../_images/5d41f56a9ecf577a8c4c6846018a6a745ced3233d444e256149e39e56a35461d.png

We can also check the health of the sampling trace.

az.plot_trace(result1.idata, var_names=["b"]);
../_images/184cc29fe80256bb70b23d312fcc64d653feaeb66c9753162cf6ac7e217a605b.png

The posterior predictive checks of our propensity model show the predicted binary outcomes (i.e. treated yes/no) drawn from the posterior predictive distribution in blue. This is contrasted against the observed outcomes in our data set in black. In short, this plot shows how well the model is able to recover the observed data.

fig, ax = plt.subplots(figsize=(20, 6))
az.plot_ppc(result1.idata, ax=ax);
../_images/f2fb0a2cff7e09637f97e8bd30db5462ddb5498dcd4b0ca07300220f254b79b1.png

But our primary focus when conducting an inverse propensity weighting experiment is not the accuracy of the propensity score model itself. Instead we want to use the propensity scores latent in the logistic regression to re-weight the outcomes of interest, giving a better, more representative measure of the treatment effect and mitigating some of the risk that selection effects drive the data generating process.

result1.idata["posterior"]["p"].mean(dim=("chain", "draw"))
<xarray.DataArray 'p' (p_dim_0: 10000)> Size: 80kB
array([0.83735902, 0.87570042, 0.85111462, ..., 0.54100343, 0.4645717 ,
       0.68424256])
Coordinates:
  * p_dim_0  (p_dim_0) int64 80kB 0 1 2 3 4 5 ... 9994 9995 9996 9997 9998 9999

It is these propensity scores which we will use to determine how to weight the contribution of each individual in our sample when calculating the causal contrast of interest.
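
To illustrate what the schemes do with these scores, here is a hedged sketch of the raw and overlap weights one could build from the posterior-mean propensities. The variable names here are ours; the experiment class computes its weights internally.

# Posterior-mean propensity scores from the fitted model
ps = result1.idata["posterior"]["p"].mean(dim=("chain", "draw")).values
t = df1["trt"].values

# Raw inverse propensity weights: 1/p for treated units, 1/(1-p) for controls
raw_w = np.where(t == 1, 1 / ps, 1 / (1 - ps))

# Overlap (ATO) weights: (1-p) for treated units, p for controls; these
# down-weight individuals with extreme propensity scores
overlap_w = np.where(t == 1, 1 - ps, ps)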

Evaluating Balance#

One of the main criteria for success with the estimation of propensity scores is to check how balanced the covariate profiles of our data are across the treatment status under different re-weighting schemes. A good balance of the covariate values across the treatment status is suggestive of the requirement that assignment to treatment is as good as random conditional on the covariate profile \(X\). More formally, the propensity score has the balancing property that the treatment status \(T\) is independent of the covariates \(X\) conditional on the propensity score \(p(X)\); and if strong ignorability holds conditional on \(X\), it also holds conditional on \(p(X)\).

One visual way to analyse this balance is to look at the empirical cumulative distribution function (ECDF) of each covariate, conditional on the different realisations of the propensity scores under the different weighting schemes. We can contrast the differences in the shapes of the ECDFs using the plot_balance_ecdf method shown below.
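
Conceptually, a weighted ECDF just accumulates normalised weights over the sorted covariate values. A minimal sketch of such a helper (our own, not the CausalPy implementation):

def weighted_ecdf(x, w):
    """Sorted covariate values and their weighted cumulative proportions."""
    order = np.argsort(x)
    x_sorted = np.asarray(x)[order]
    cum_w = np.cumsum(np.asarray(w)[order])
    return x_sorted, cum_w / cum_w[-1]  # proportions rise from ~0 to 1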

result1.plot_balance_ecdf("x1", weighting_scheme="raw");
../_images/3e46871b35645439796f202338a52bddecb441f5767bc26a3a8280b1606ca2fd.png

On the left we see the unweighted ECDF of the x1 variable for both treatment and control groups. On the right we show the same, but have used weighting to better align the two ECDF profiles. Note here how the re-weighting of the variable using the raw scheme has served to align the shapes of the distributions in both treatment groups, apart from a slight gap at the upper quantiles of the re-weighted ECDF. What happens if we use a different weighting scheme?

result1.plot_balance_ecdf("x1", weighting_scheme="overlap");
../_images/bd048c92992523072702a385ee5f7cb097ea3c02c0f37d7760546b0156ec35bb.png

Here we see an even tighter alignment. This weighting scheme targets the ATO: the Average Treatment Effect Among the Overlap Population described in Lucy D’Agostino McGowan’s linked blog. In both cases we can be reasonably happy that, conditional on the propensity, the weighting mechanism serves to balance the covariate distribution across the treatment groups.

Reminder - by using these propensity scores we are trying to reduce the bias due to selection effects. When the covariate distributions in these plots are balanced, or close to it, conditional on our propensity score weighting, we argue that this justifies treating the re-weighted causal contrast as akin to one derived in an experiment where the treatment assignment mechanism was genuinely random. We are trying to weight our data to recreate the conditions of as-if-random allocation to the treatment, and we appeal to this to license our causal conclusions. Next we’ll estimate the average treatment effect under these schemes and see if we can discern differences in achieved accuracy.

Estimating the Average Treatment Effect#

Again the InversePropensityWeighting class bundles the functionality to inspect the propensity score distribution and evaluate the average treatment effect under different weighting schemes.

result1.plot_ate(method="raw", prop_draws=10, ate_draws=500);
../_images/bb714698a57652ab63f43811d9569b5498dc8bfc6403c89e81195fe85926fcdf.png

Here we have plotted in three panels:

  • mirrored draws from the propensity score distribution, split by treated and control groups in yellow and blue, with grey showing the pseudo-population created by the weighting.

  • the expected outcomes in those groups under re-weighting, for each draw.

  • the derived estimates for the average treatment effect.

Note here how the expected value of the ATE is pulled slightly away from the true value under this weighting scheme. This is likely due to the high number of individuals with extreme propensity scores, denoted in the first panel as individuals with propensity scores in excess of 0.9 or below 0.1.
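
One quick way to quantify this, reusing the illustrative ps array built in our earlier snippet, is simply to count the share of the sample in those tails:

# Share of individuals with extreme posterior-mean propensity scores
extreme = (ps > 0.9) | (ps < 0.1)
print(f"{extreme.mean():.1%} of units fall outside the [0.1, 0.9] range")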

Let’s check what happens using the overlap weighting scheme.

result1.plot_ate(method="overlap", prop_draws=10, ate_draws=500);
../_images/dbba85b577fac7b174cf9592df3836a32948afc8f006fd45aca1120b3b9c0038.png

We see here how this particular weighting scheme was able to recover the true treatment effect by defining a contrast in a different pseudo-population. This is a useful reminder that, while propensity score weighting methods are aids to inference with observational data, not all weighting schemes are created equal and we need to be careful in assessing when each is applied appropriately. Fundamentally, the weighting scheme of choice should be tied to the question of what you are trying to estimate. Aronow and Miller’s Foundations of Agnostic Statistics [Aronow and Miller, 2019] has a good explanation of the differences between the raw, robust and doubly robust weighting schemes. In some sense these offer an escalating series of refined estimators, each trying to reduce the variance of the ATE estimate. The doubly robust approach also tries to offer some guarantees against model misspecification. The overlap estimator represents an attempt to calculate the ATE among the population with overlapping propensity scores. This can be used to guard against poor inference in cases where the propensity score distributions have large non-overlapping regions.
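
To fix ideas, here is one common textbook formulation of the four estimators, following the general presentation in Aronow and Miller rather than CausalPy’s exact internals, which may differ in detail. The doubly robust version additionally requires outcome-model predictions m1 and m0 for every unit under treatment and control, which we take as given here.

def ate_raw(y, t, e):
    # Horvitz-Thompson style: unnormalised inverse weights
    return np.mean(t * y / e) - np.mean((1 - t) * y / (1 - e))


def ate_robust(y, t, e):
    # Hajek style: weights normalised within each group, typically
    # reducing the variance of the estimate relative to the raw version
    mu1 = np.sum(t * y / e) / np.sum(t / e)
    mu0 = np.sum((1 - t) * y / (1 - e)) / np.sum((1 - t) / (1 - e))
    return mu1 - mu0


def ate_doubly_robust(y, t, e, m1, m0):
    # AIPW: outcome-model predictions plus inverse-weighted residuals;
    # consistent if either the propensity or the outcome model is correct
    return np.mean(m1 + t * (y - m1) / e) - np.mean(m0 + (1 - t) * (y - m0) / (1 - e))


def ate_overlap(y, t, e):
    # ATO: weights (1-e) for treated and e for controls concentrate the
    # contrast on the region where the two groups genuinely overlap
    mu1 = np.sum(t * (1 - e) * y) / np.sum(t * (1 - e))
    mu0 = np.sum((1 - t) * e * y) / np.sum((1 - t) * e)
    return mu1 - mu0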


NHEFS Data#

Now we’ll apply the same techniques to real data.

This data set, from the National Health and Nutrition Examination Survey, records details of the weight, activity and smoking habits of around 1500 individuals over two periods: a first period establishing a baseline, and a follow-up period 10 years later. The treatment indicator (trt == 1) records whether the individual quit smoking before the follow-up visit. Each individual’s outcome represents the relative weight gain/loss between the two periods.

df = cp.load_data("nhefs")
df[["age", "race", "trt", "smokeintensity", "smokeyrs", "outcome"]].head()
age race trt smokeintensity smokeyrs outcome
0 42 1 0 30 29 -10.093960
1 36 0 0 20 24 2.604970
2 56 1 0 20 26 9.414486
3 68 1 0 3 53 4.990117
4 40 0 0 20 19 4.989251
formula = """trt ~ 1 + age + race + sex + smokeintensity + smokeyrs + wt71 + active_1 + active_2 + 
         education_2 + education_3 + education_4 + education_5 + exercise_1 + exercise_2"""

result = cp.InversePropensityWeighting(
    df,
    formula=formula,
    outcome_variable="outcome",
    weighting_scheme="robust",  ## Will be used by plots after estimation if no other scheme is specified.
    model=cp.pymc_models.PropensityScore(
        sample_kwargs={
            "draws": 1000,
            "target_accept": 0.95,
            "random_seed": seed,
            "progressbar": False,
        },
    ),
)

result
/Users/benjamv/mambaforge/envs/CausalPy/lib/python3.11/site-packages/pymc/data.py:304: FutureWarning: MutableData is deprecated. All Data variables are now mutable. Use Data instead.
  warnings.warn(
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [b]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 15 seconds.
There were 1000 divergences after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
Sampling: [b, t_pred]
Sampling: [t_pred]
<causalpy.exp_inverse_propensity_weighting.InversePropensityWeighting at 0x3a4abefd0>

Evaluating Balance#

result.plot_balance_ecdf("age");
../_images/e7fd7ee179faf3eb84db873b202e4c2dc06da3dc1bbea7924f30faaff032022a.png
result.plot_balance_ecdf("smokeyrs");
../_images/b52f5d2a99dd28ed74eb0792cdb0dd14116f9bc0175d1eb71a64f263d181e160.png
result.plot_balance_ecdf("smokeintensity");
../_images/7e6adfc5b36d4324310549e22461f686a30ea227e9c145e31f9f4b4bdd23d743.png

In all cases re-weighting seems to have a positive effect and helps achieve conditional balance.

Estimating the Average Treatment Effect#

Here we’ll use two different weighting schemes to highlight the functionality of the robust and doubly robust weighting. We can use these two approaches by passing in different kwargs to the plotting functions. We are still re-weighting the same propensity score distribution derived with our logit model above.

result.plot_ate(method="robust", prop_draws=10, ate_draws=500);
../_images/91a38d4bf83e26622529ec38a79ad6419b8699e1aad988f764f45230b54b4441.png
result.plot_ate(method="doubly robust", prop_draws=10, ate_draws=500);
../_images/3776bd78b59c499da2457af98dc7a379b1ed45da081a9443ebe51932f12d7a77.png

The thing to note here is that while (a) the propensity distributions for the control and treatment groups seem broadly overlapping and (b) both weighting schemes recover substantially similar effects, the variance of the doubly robust estimator is much tighter. This aspect of the doubly robust estimator is by design, and matters where precise estimation of treatment effects is required. It will not always be true that the robust and doubly robust weighting schemes yield similar results, and as such differences between these methods can point to issues with the propensity score model.

Note:
We have limited the implementation of inverse propensity score weighting in CausalPy to a simple logistic regression model of the propensity score. However, the analysis routines of the InversePropensityWeighting experiment class will run on any arviz inference data object where the propensity score posterior distribution can be identified as p. This frees up the possibility of using non-parametric propensity score designs, as discussed in more depth here. [Forde, 2024]

Conclusion#

This concludes our brief tour of inverse propensity weighting experiments. Propensity modelling and propensity weighting are powerful tools in causal inference, and their potential is by no means limited to the use-cases implemented here. Thinking through the propensity score mechanism and what drives different selection effects is always a good first step in causal modelling. If the drivers of treatment choice can be modelled well, propensity score adjustment is often a good way to recover the causal quantity of interest.

References#

[1] Nathaniel Forde. Bayesian non-parametric causal inference. In PyMC Team, editor, PyMC examples. 2024. doi:10.5281/zenodo.5654871.

[2] P. Aronow and B. Miller. Foundations of Agnostic Statistics. Cambridge University Press, 2019.