Reproduce Section 8.10 in Imbens and Rubin (2005) with Stan
Overview
Chapters 1-11 of "Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction" deal with causal effect estimation in the case of random assignment. I think this book is very educational in that it often uses concrete data as a starting point. This book is also characterized by the facts "DAGs do not appear at all", "the causal effect is estimated through the problem of missing value imputation of potential outcomes" and "the assignment mechanism is expressed by a random variable". In this blog post, I implemented the case in Section 8.10 with Stan. Some notations have been modified.
Target Readers
People who have read the above book and know a little about Stan's grammar.
Data
I will explain the data. Here I estimate the effect of a job training program (Natinal Supported Work program, NSW program) on how much earnings were increased. Assignment to the program is completely randomized. The data are included in the Matching
package in R or can be obtained from the web site (direct link).
- Outcome: income in 1978 after completing the training program (
re78
) - Covariate used: age (
age
), years of education (education
), whether they were now or ever before married (married
), whether they were high school dropouts (nodegree
), ethnicity (black
), income in 1974 (re74
), whether income in 1974 is greater than zero (re74pos
), income in 1975 (re75
), whether income in 1975 is greater than zero (re75pos
)
The histogram of the outcomes, which is not plotted here, has frequent zero-valued observations and a heavy tail.
Causal Effect
In this book, the analysis unit
Model
Four different models are used in Section 8.10:
- Model 1: no covariates, correlations between potential outcomes, common variance
- Model 2: no covariates, potential outcomes are independent, different variances
- Model 3: with covariates, potential outcomes are independent, different variances
- Model 4: with covariates, potential outcomes are independent, different variances, using a zero-inflated lognormal distribution
We will examine each one in detail below.
Model 1
Assume that the potential outcomes at the unit level follows the following bivariate normal distribution:
Here, the unknown parameters are
The correlation coefficient of the variance-covariance matrix is set to 1 for a reason: it has been shown in Chapter 6 that the variance of the estimated causal effect is largest (a conservative evaluation) when the correlation is 1.
The causal effect at the population level is estimated by imputing the missing potential outcomes according to this model equation. The code is as follows:
Lines 44-45 and 47-48: The missing potential outcomes that were not observed are imputed. Note that one of the bivariate variables in the bivariate normal distribution is observed, so it is generated from a conditional distribution. If you do not know what it is, search for "bivariate normal distribution conditional distribution" or something similar.
Line 36: This corresponds to the mean causal effect on the super population.
Line 37: This corresponds to the mean causal effect on the population corresponding to the data you have (finite sample).
Lines 38-40: Instead of the quantiles of the causal effect, the differences between the quantiles of the control group and the quantiles of the treatment group are calculated (see the book for details).
The R code to be executed is as follows:
Line 5 gives the correlation coefficient, which works fine in R for a normal distribution with 0 standard deviation, but will give an error in Stan, so I subtract a small value.The results are coincident with the book.
Model 2
Assume that the potential outcomes at the unit level follows the following bivariate normal distribution:
The prior distributions for
The code is almost the same as in Model 1, so it is omitted. The results are almost consistent with the book.
Model 3
Assume that the potential outcomes at the unit level follows the following bivariate normal distribution:
The prior distributions for
The code is as follows:
Lines 22-23: The coefficients of the covariates are defined as vector
s.
Lines 29 and 31: The mean of the normal distribution is a linear combination of the covariates and coefficients.
The results are almost consistent with the book.
Model 4
In Models 1-3 above, a normal distribution has been fitted without special care for the large number of zeros in the observed outcomes, but here a zero-inflated lognormal distribution is used for the fitting (although the book does not use the term "zero-inflated lognormal distribution," I think it is clearer to assume that the data follow that distribution.)
Another zero-inflated lognormal distribution is fitted to the treatment group data in the same way.
The code is as follows:
Lines 10-17: Random number generation according to the ZILN distribution is defined.
The R code to be executed is as follows:Stan Forum. However, the regression coefficients for the covariates are almost consistent. I don't know the details because the code for the official implementation has not been published. If you write a book or paper, be sure to publish the code for reproducibility.
The results are...a bit inconsistent! This issue of inconsistency was discussed in the past in
Discussion