An Application of Causal Inference
In this article, we will apply causal inference techniques to a dataset collected for the Infant Health and Development Program (IHDP).
Researchers instructed trained personnel to provide comprehensive, high-quality childcare to remedy problems of low-birth-weight, premature infants. By providing this early intervention, researchers hoped that there would be a causal effect on children’s cognitive test scores.
The goal of this article is to estimate that effect and thus determine whether the intervention actually improved children’s cognitive test scores.
This dataset does not come from a randomized controlled trial in which treatments were randomly assigned, so there may be confounders between the treatment and the outcome. Fortunately, the dataset provides 26 features that may act as confounders. For the sake of this application, we will assume these are the only confounders in the experiment.
The dataset comprises a column X (the treatment variable), a column Y (the outcome), and columns Z-0, Z-1, …, Z-25 (feature variables). X is a binary random variable labeled 1 if the patient was treated, and 0 otherwise. Y is a continuous random variable that indicates the child’s cognitive test score. Z-0, Z-1, …, Z-25 are a mix of continuous and binary random variables that may confound X and Y, and encode features such as the mother’s age, whether or not she smokes, etc.
Recall that when our feature variables (confounders) are continuous, we can use Inverse Propensity Score Weighting (IPSW) to estimate the causal effect of the treatment variable on the outcome variable. We can estimate this as:
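The quantity we are after is the average treatment effect (ATE). As a sketch, under the standard backdoor-adjustment assumptions (the observed Z blocks all confounding paths), this can be written as:

```latex
\mathrm{ATE}
  = \mathbb{E}[Y \mid do(X=1)] - \mathbb{E}[Y \mid do(X=0)]
  = \mathbb{E}_{Z}\big[\,\mathbb{E}[Y \mid X=1, Z]\,\big]
  - \mathbb{E}_{Z}\big[\,\mathbb{E}[Y \mid X=0, Z]\,\big]
```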
With mathematical manipulation, this estimate can be expressed as:
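Writing e(Z) = P(X = 1 | Z) for the propensity score, the standard inverse-propensity-weighted form of the estimand is:

```latex
\mathrm{ATE}
  = \mathbb{E}\!\left[\frac{X\,Y}{e(Z)}\right]
  - \mathbb{E}\!\left[\frac{(1-X)\,Y}{1 - e(Z)}\right]
```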
This identity holds only if 0 < e(z) < 1 for every z — the positivity (overlap) assumption. We can learn a model for e(Z) with a machine learning technique such as logistic regression, fitting the features Z to the treatment variable X.
First, we want to import relevant libraries and get the data. I highly recommend Jupyter Notebook in Python for this task.
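A minimal setup sketch follows. The file name `ihdp.csv` is an assumption (your copy of the IHDP data may live elsewhere under a different name), so the snippet falls back to a small synthetic stand-in with the same column layout when the file is absent:

```python
import numpy as np
import pandas as pd

try:
    # Hypothetical local path; replace with wherever your copy of the IHDP data lives.
    data = pd.read_csv("ihdp.csv")
except FileNotFoundError:
    # Synthetic stand-in with the same column layout, so the snippet runs end to end.
    rng = np.random.default_rng(0)
    n = 500
    Z_syn = rng.normal(size=(n, 26))
    data = pd.DataFrame(Z_syn, columns=[f"Z-{i}" for i in range(26)])
    data["X"] = rng.binomial(1, 0.5, size=n)                  # binary treatment
    data["Y"] = 3.5 * data["X"] + Z_syn[:, 0] + rng.normal(size=n)  # outcome

X = data["X"].to_numpy()                            # treatment indicator
Y = data["Y"].to_numpy()                            # cognitive test score
Z = data[[f"Z-{i}" for i in range(26)]].to_numpy()  # candidate confounders
```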
Now, let’s estimate e(Z) by fitting the feature variables to the treatment variable X, using a logistic regression model.
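A sketch of the propensity model with scikit-learn; here `Z` and `X` stand in for the arrays loaded from the dataset (synthetic values are generated below so the snippet is self-contained):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for the feature matrix and treatment vector loaded from the dataset.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 26))
X = rng.binomial(1, 1 / (1 + np.exp(-Z[:, 0])))

# e(Z) = P(X = 1 | Z), approximated with logistic regression.
propensity_model = LogisticRegression(max_iter=1000)
propensity_model.fit(Z, X)
e = propensity_model.predict_proba(Z)[:, 1]  # column 1 is P(X = 1 | Z)
```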
We’re almost done! Now we need to write code to estimate the treatment effect using our data samples.
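One way to sketch this estimator: replace each expectation in the weighted form with a sample mean. The demonstration data below is synthetic, built with a known treatment effect of 3.5, so we can check that the estimator recovers it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipsw_estimate(x, y, e):
    """IPSW estimate of the average treatment effect.

    x: binary treatment indicators, y: outcomes,
    e: estimated propensity scores e(z) = P(X = 1 | Z = z).
    """
    treated = np.mean(x * y / e)          # sample mean of X*Y / e(Z)
    control = np.mean((1 - x) * y / (1 - e))  # sample mean of (1-X)*Y / (1-e(Z))
    return treated - control

# Synthetic demonstration: treatment assignment is confounded by z[:, 0],
# and the true treatment effect is 3.5.
rng = np.random.default_rng(0)
n = 50_000
z = rng.normal(size=(n, 5))
p = 1 / (1 + np.exp(-z[:, 0]))            # true propensity
x = rng.binomial(1, p)
y = 3.5 * x + z[:, 0] + rng.normal(size=n)

e_hat = LogisticRegression(max_iter=1000).fit(z, x).predict_proba(z)[:, 1]
ate = ipsw_estimate(x, y, e_hat)           # close to 3.5
```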
Calling this function on the IHDP samples yields an estimated treatment effect of 3.5. What does this mean? Because the estimate is positive, it suggests that the treatment had a beneficial effect on cognitive test scores. In other words, under our assumptions, the treatment has a positive causal effect on children’s cognitive test scores.
Recall that we assumed the only confounders were those listed in the data. Our inferences are not guaranteed: there may be confounders that were not directly measured, or unobserved confounders the researchers are not aware of, that influence the total causal effect of the treatment on the outcome.
However, this result still provides useful insights to researchers in the study and is aligned with the study’s assumptions. One can plausibly infer that the treatment will help improve low-birth-weight, premature infants’ cognitive test scores compared to no treatment.
In this article, we applied an IPSW estimator to understand causality in a study. Many causal inference studies are not as simple as this one — there may be many variables that researchers want to explore to understand which of them have causal effects on an outcome.
For an appropriate study, we could carefully extend this application, together with other proper causal inference techniques, to build a causal story. After reading my previous article and this one, try applying the technical and conceptual framework of causality to datasets here!