
Difference-in-differences in Binary Two-by-two Factorial Design

Published on 2022/12/26

Disclaimer: the information in this note is written to the best of my knowledge, but it may nevertheless contain errors or omit key information. The content should not be taken as professional advice regarding health, finances, or any other field.
In short, there could be silly mistakes in this article, so please use the content at your own risk.

Let's consider a two-by-two factorial experiment and the difference-in-differences (DiD) estimator for the average treatment effect.

Setup

Assume we have the group indicator g \in \{0, 1\} (with g = 1 for the treatment group) and the time indicator t \in \{0, 1\} (with t = 1 for the post-intervention period). Also assume y \in \{0, 1\} is the target random variable (the "success" label).

As the causal quantity of interest, we want to estimate

\mathrm{DiD} := (P(Y = 1|g = 1, t = 1) - P(Y = 1|g = 1, t = 0)) \\ \quad - (P(Y = 1|g = 0, t = 1) - P(Y = 1|g = 0, t = 0)).

Estimating the DiD estimand with ordinary least squares

To estimate this quantity, we model the conditional expectation of Y given the covariates (g, t)
with a linear model fitted by ordinary least squares (OLS).
That is, we estimate the coefficients of the following linear regression:

\mathbb{E}[Y|x] \simeq \alpha + \beta g + \gamma t + \delta g \cdot t

using the squared error as the loss function.

The idea behind this is as follows: OLS estimates the conditional expectation \mathbb{E}[Y | x], and in this case, since Y is binary, \mathbb{E}[Y | x] = P(Y = 1|x).

Then, the DiD estimand coincides with the interaction coefficient \delta:

\begin{split} &\mathrm{DiD} \\ &=(\mathbb{E}[Y|g = 1, t = 1] - \mathbb{E}[Y|g = 1, t = 0]) \\ &\qquad - (\mathbb{E}[Y|g = 0, t = 1] - \mathbb{E}[Y|g = 0, t = 0]) \\ &= \{(\alpha + \beta + \gamma + \delta) - (\alpha + \beta)\} - \{(\alpha + \gamma) - \alpha\} \\ &= \delta. \end{split}

Thus, this regression trick allows us to estimate the DiD estimand as the coefficient of an ordinary linear regression.
See [1].
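
As a quick sanity check with made-up numbers: suppose the cell probabilities are P(Y = 1|g=0, t=0) = 0.10, P(Y = 1|g=0, t=1) = 0.12, P(Y = 1|g=1, t=0) = 0.11, and P(Y = 1|g=1, t=1) = 0.18. Then

\alpha = 0.10, \quad \beta = 0.11 - 0.10 = 0.01, \quad \gamma = 0.12 - 0.10 = 0.02, \quad \delta = 0.18 - (0.10 + 0.01 + 0.02) = 0.05,

which coincides with \mathrm{DiD} = (0.18 - 0.11) - (0.12 - 0.10) = 0.05.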

Notation

For convenience, let us define the following:

x = \begin{pmatrix} x^0 \\ x^1 \\ x^2 \\ x^3 \end{pmatrix} = \begin{pmatrix} 1 \\ g \\ t \\ g \cdot t \end{pmatrix}.

We compile the data into the following matrix formats.

\begin{split} X &= \begin{pmatrix} \quad x_1^\top \quad \\ \quad \vdots \quad \\ \quad x_n^\top \quad \end{pmatrix} = \begin{pmatrix} 1 & x_1^1 & x_1^2 & x_1^3 \\ 1 & x_2^1 & x_2^2 & x_2^3 \\ \vdots \\ 1 & x_n^1 & x_n^2 & x_n^3 \\ \end{pmatrix}, \\ Y &= \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}. \end{split}
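
For instance, if the first four data points are (g, t, y) = (0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 0, 1) (as in the raw data table shown later), the compiled matrices start as

X = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 \\ & \vdots & & \end{pmatrix}, \qquad Y = \begin{pmatrix} 0 \\ 1 \\ 1 \\ 1 \\ \vdots \end{pmatrix}.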

Outline

The logic will be as follows.

  1. We consider a linear probability model [2]:
    P(Y=1|x) = \theta^\top x.

    This is a saturated model, i.e., this model can express P(Y=1|x) exactly without approximation. With the above notation, the output of the model (the predicted values for the samples) can be concisely expressed as
\begin{split} X \theta &= \begin{pmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{pmatrix} \theta \\ &= \begin{pmatrix} x_1^\top \theta \\ \vdots \\ x_n^\top \theta \end{pmatrix} = \begin{pmatrix} \theta^\top x_1 \\ \vdots \\ \theta^\top x_n \end{pmatrix}. \end{split}
  2. We use OLS to estimate the parameter \theta := (\alpha, \beta, \gamma, \delta)^\top.
    The resulting estimator is
\hat \theta := (X^\top X)^{-1} X^\top Y.
  3. The quality of \hat \theta is analyzed theoretically by defining the following quantity:
\epsilon := Y - X \theta.

Then, we have

\begin{split} \hat \theta &= (X^\top X)^{-1} X^\top Y \\ &= (X^\top X)^{-1} X^\top \epsilon + (X^\top X)^{-1} X^\top X \theta \\ &= \theta + (X^\top X)^{-1} X^\top \epsilon. \end{split}
  4. We want to construct a confidence interval (CI) for \delta. We can generate an asymptotic confidence interval based on the behavior of \epsilon.
    We have
\hat \theta \sim \mathrm{Dist}(\theta, ((X^\top X)^{-1}X^\top)\mathbb{V}(\epsilon)(X(X^\top X)^{-1})).

For the sake of explanation, let us consider for a moment the case where we can assume \mathbb{E}\epsilon = 0 and

\mathbb{V}(\epsilon) = \sigma^2 I_n \quad (\forall n).

Then (with just that assumption!) we have
\hat \theta \sim \mathrm{Dist}(\theta, \sigma^2 (X^\top X)^{-1})
and we can apply the central limit theorem to get an asymptotic CI [3].
In our case, we have \mathbb{E}\epsilon = 0 by definition, and if we additionally assume that the samples are independent, the covariance matrix becomes diagonal, with each diagonal entry determined solely by the group g and the time t of the corresponding data point:

\mathbb{V}(\epsilon) = \begin{pmatrix} \sigma^2_1 & & & & \\ & \ddots & & & \\ & & \sigma^2_i & & \\ & & & \ddots & \\ & & & & \sigma^2_n\\ \end{pmatrix}.

In this matrix, \sigma^2_i := \mathbb{V}(Y_i - \theta^\top x_i) = \mathbb{V}(Y_i) depends only on the (g, t) values of data point i. Thus, \sigma^2_i takes on at most four distinct values, namely \{\sigma^2_{00}, \sigma^2_{01}, \sigma^2_{10}, \sigma^2_{11}\}.

  5. For each (g, t), x is constant. Thus, the variance of \epsilon_i := y_i - \theta^\top x_i is simply the variance of y_i. We can bound this variance based on our knowledge of the range of the success probability P(Y_i=1); in particular, since Y is a Bernoulli random variable, \mathbb{V}(Y) = P(Y=1)(1 - P(Y=1)) \in [0, \frac{1}{4}]. This yields a lower bound on the sample size that guarantees the CI half-width is at most the desired value.

Dataset

To perform the regression analysis in this setting, instead of dealing with the raw dataset of the form

g t y
0 0 0
0 0 1
0 1 1
1 0 1
\vdots \vdots \vdots

we can just deal with the following two-by-two count matrices.

Count data table for all data points (calculated by just looking at g and t):

t=0 t=1 (total)
g=0 n_{00} n_{01} n_{g=0}
g=1 n_{10} n_{11} n_{g=1}
(total) n_{t=0} n_{t=1} n

For example, n_{11} := \sum_i g_i t_i.

Count data table for all data points with y = 1 (calculated by looking at both g, t and y):

t=0 t=1 (total)
g=0 m_{00} m_{01} m_{g=0}
g=1 m_{10} m_{11} m_{g=1}
(total) m_{t=0} m_{t=1} m

For example, m_{g=1} := \sum_i g_i y_i.
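
If the data are stored as one row per data point, the four cells of both count tables can be obtained with a single aggregation. Below is a minimal sketch in the same style as the SQL examples later in this article; the table name raw_data and its columns g, t, y are hypothetical placeholders for your own schema.

/* Aggregate raw (g, t, y) rows into the two-by-two count tables. */
select
  g
  , t
  , count(*) as n_gt /* cell count n_{gt} */
  , sum(y) as m_gt   /* success count m_{gt}, i.e., rows with y = 1 */
from
  raw_data
group by
  g
  , t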

Point Estimator

We have

\begin{split} X^\top X &= (x_1 \cdots x_n) \begin{pmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{pmatrix} \\ &= \sum_i x_i x_i^\top \\ &= \begin{pmatrix} \sum_i x_i^0 x_i^0 & \sum_i x_i^0 x_i^1 & \sum_i x_i^0 x_i^2 & \sum_i x_i^0 x_i^3 \\ \sum_i x_i^1 x_i^0 & \sum_i x_i^1 x_i^1 & \sum_i x_i^1 x_i^2 & \sum_i x_i^1 x_i^3 \\ \sum_i x_i^2 x_i^0 & \sum_i x_i^2 x_i^1 & \sum_i x_i^2 x_i^2 & \sum_i x_i^2 x_i^3 \\ \sum_i x_i^3 x_i^0 & \sum_i x_i^3 x_i^1 & \sum_i x_i^3 x_i^2 & \sum_i x_i^3 x_i^3 \\ \end{pmatrix} \\ &= \begin{pmatrix} \sum_i 1 & \sum_i g_i & \sum_i t_i & \sum_i g_i t_i \\ \sum_i g_i & \sum_i g_i & \sum_i g_i t_i & \sum_i g_i t_i \\ \sum_i t_i & \sum_i g_i t_i & \sum_i t_i & \sum_i g_i t_i \\ \sum_i g_i t_i & \sum_i g_i t_i & \sum_i g_i t_i & \sum_i g_i t_i \\ \end{pmatrix} \\ &= \begin{pmatrix} n & n_{g=1} & n_{t=1} & n_{11} \\ n_{g=1} & n_{g=1} & n_{11} & n_{11} \\ n_{t=1} & n_{11} & n_{t=1} & n_{11} \\ n_{11} & n_{11} & n_{11} & n_{11} \\ \end{pmatrix} \end{split}

Using Wolfram Alpha, we get

\begin{pmatrix} a & b & c & d \\ b & b & d & d \\ c & d & c & d \\ d & d & d & d \\ \end{pmatrix}^{-1} = \frac{1}{a - b - c + d} \begin{pmatrix} 1 & -1 & -1 & 1 \\ \\ -1 & - \frac{a - c}{d - b} & 1 & \frac{a - c}{d - b} \\ \\ -1 & 1 & -\frac{a - b}{d - c} & \frac{a - b}{d - c} \\ \\ 1 & \frac{a - c}{d - b} & \frac{a - b}{d - c} & \frac{abc - ad^2 -b^2c - bc^2 + 2bcd}{bcd - bd^2 - cd^2 + d^3} \\ \end{pmatrix}

Thus,

\begin{split} &(X^\top X)^{-1} \\ &= \frac{1}{n - n_{g=1} - n_{t=1} + n_{11}} \cdot\\ &\quad \begin{pmatrix} 1 & -1 & -1 & 1 \\ \\ -1 & - \frac{n - n_{t=1}}{n_{11} - n_{g=1}} & 1 & \frac{n - n_{t=1}}{n_{11} - n_{g=1}} \\ \\ -1 & 1 & -\frac{n - n_{g=1}}{n_{11} - n_{t=1}} & \frac{n - n_{g=1}}{n_{11} - n_{t=1}} \\ \\ 1 & \frac{n - n_{t=1}}{n_{11} - n_{g=1}} & \frac{n - n_{g=1}}{n_{11} - n_{t=1}} & \frac{nn_{g=1}n_{t=1} - nn_{11}^2 - n_{g=1}^2n_{t=1} - n_{g=1}n_{t=1}^2 + 2n_{g=1}n_{t=1}n_{11}}{n_{g=1}n_{t=1}n_{11} - n_{g=1}n_{11}^2 - n_{t=1}n_{11}^2 + n_{11}^3} \\ \end{pmatrix} \\ &= \frac{1}{n_{00}} \begin{pmatrix} 1 & -1 & -1 & 1 \\ \\ -1 & \frac{n_{t=0}}{n_{10}} & 1 & -\frac{n_{t=0}}{n_{10}} \\ \\ -1 & 1 & \frac{n_{g=0}}{n_{01}} & -\frac{n_{g=0}}{n_{01}} \\ \\ 1 & - \frac{n_{t=0}}{n_{10}} & - \frac{n_{g=0}}{n_{01}} & \frac{n_{00}n_{01}n_{10}+n_{00}n_{01}n_{11}+n_{00}n_{10}n_{11}+n_{01}n_{10}n_{11}}{n_{01}n_{10}n_{11}} \end{pmatrix} \\ &= \frac{1}{n_{00}} \begin{pmatrix} 1 & -1 & -1 & 1 \\ \\ -1 & \frac{n_{t=0}}{n_{10}} & 1 & -\frac{n_{t=0}}{n_{10}} \\ \\ -1 & 1 & \frac{n_{g=0}}{n_{01}} & -\frac{n_{g=0}}{n_{01}} \\ \\ 1 & - \frac{n_{t=0}}{n_{10}} & - \frac{n_{g=0}}{n_{01}} & \frac{n_{00}}{n_{11}} + \frac{n_{00}}{n_{10}} + \frac{n_{00}}{n_{01}} + 1 \end{pmatrix} \end{split}

On the other hand,

\begin{split} X^\top Y &= (x_1 \cdots x_n) \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \\ &= \sum_i x_i y_i \\ &= \begin{pmatrix} \sum_i x_i^0 y_i \\ \vdots \\ \sum_i x_i^3 y_i \end{pmatrix} \\ &= \begin{pmatrix} \sum_i y_i \\ \sum_i g_i y_i \\ \sum_i t_i y_i \\ \sum_i g_i t_i y_i \end{pmatrix} \\ &= \begin{pmatrix} m \\ m_{g=1} \\ m_{t=1} \\ m_{11} \end{pmatrix}. \end{split}

Therefore, we get

\hat \theta = (X^\top X)^{-1} X^\top Y.

In particular,

\begin{split} \hat \delta &= \frac{m}{n_{00}} - \frac{m_{g=1}n_{t=0}}{n_{00}n_{10}} - \frac{m_{t=1}n_{g=0}}{n_{01}n_{00}} + \left(\frac{1}{n_{11}} + \frac{1}{n_{10}} + \frac{1}{n_{01}} + \frac{1}{n_{00}}\right)m_{11} \\ &= m \cdot \frac{1}{n_{00}} - m_{g=1} \cdot \left(\frac{1}{n_{00}} + \frac{1}{n_{10}}\right) - m_{t=1} \cdot \left(\frac{1}{n_{00}} + \frac{1}{n_{01}}\right) + m_{11} \cdot \left(\frac{1}{n_{11}} + \frac{1}{n_{10}} + \frac{1}{n_{01}} + \frac{1}{n_{00}}\right) \\ &= \frac{m_{11}}{n_{11}} - \frac{m_{10}}{n_{10}} - \frac{m_{01}}{n_{01}} + \frac{m_{00}}{n_{00}} \\ &= \left(\frac{m_{11}}{n_{11}} - \frac{m_{10}}{n_{10}}\right) - \left(\frac{m_{01}}{n_{01}} - \frac{m_{00}}{n_{00}}\right). \end{split}

This is a natural result: \hat \delta is exactly the sample analogue of the DiD estimand, i.e., the difference-in-differences of the empirical success rates of the four cells.
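
As a concrete check, take the counts used in the SQL example at the end of this article: n_{00} = n_{01} = n_{10} = n_{11} = 10 and (m_{00}, m_{01}, m_{10}, m_{11}) = (0, 5, 2, 3). Then

\hat \delta = \left(\frac{3}{10} - \frac{2}{10}\right) - \left(\frac{5}{10} - \frac{0}{10}\right) = 0.1 - 0.5 = -0.4.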

Interval Estimator (Confidence Interval)

Since we are dealing with a binary outcome Y, we can derive a conservative confidence interval whose width does not depend on the unknown parameters \theta.
First, we have

\hat \theta \sim \mathrm{Dist}(\theta, ((X^\top X)^{-1}X^\top)\mathbb{V}(\epsilon)(X(X^\top X)^{-1})).

Assuming the samples are independent,

\mathbb{V}(\epsilon) = \begin{pmatrix} \sigma^2_1 & & & & \\ & \ddots & & & \\ & & \sigma^2_i & & \\ & & & \ddots & \\ & & & & \sigma^2_n\\ \end{pmatrix},

where \sigma^2_i = \sigma^2_{g_i, t_i}. That is, each \sigma^2_i is one of \{\sigma^2_{00}, \sigma^2_{01}, \sigma^2_{10}, \sigma^2_{11}\} that depends only on (g_i, t_i).

Since we only need a confidence interval for the last parameter \theta_3 = \delta, we compute the (4, 4)-element of ((X^\top X)^{-1}X^\top)\mathbb{V}(\epsilon)(X(X^\top X)^{-1}). Based on the expression of (X^\top X)^{-1} above, we get

\begin{aligned} &((X^\top X)^{-1}X^\top)\mathbb{V}(\epsilon)(X(X^\top X)^{-1})_{4,4} \\ &= (0, 0, 0, 1) (X^\top X)^{-1} \ \ (x_1 \cdots x_n) \begin{pmatrix} \ddots && \\ & \sigma^2_i & \\ &&\ddots \end{pmatrix} \begin{pmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{pmatrix} \ \ (X^\top X)^{-1} \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix} \\ &= (\text{4th row of }(X^\top X)^{-1}) \left(\sum_{i=1}^n \sigma_i^2 x_i x_i^\top\right) \begin{pmatrix} \\ \text{4th column of }(X^\top X)^{-1} \\ \ \end{pmatrix} \\ &= \begin{pmatrix} \frac{1}{n_{00}} & - \frac{n_{t=0}}{n_{00}n_{10}} & - \frac{n_{g=0}}{n_{00}n_{01}} & \frac{1}{n_{00}} + \frac{1}{n_{01}} + \frac{1}{n_{10}} + \frac{1}{n_{11}} \end{pmatrix} \\ &\qquad \cdot \left( \sigma^2_{00}n_{00}\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} + \sigma^2_{01}n_{01}\begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} + \sigma^2_{10}n_{10}\begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} + \sigma^2_{11}n_{11}\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{pmatrix} \right) \\ &\qquad\qquad \cdot \begin{pmatrix} \frac{1}{n_{00}} \\ - \frac{n_{t=0}}{n_{00}n_{10}} \\ - \frac{n_{g=0}}{n_{00}n_{01}} \\ \frac{1}{n_{00}} + \frac{1}{n_{01}} + \frac{1}{n_{10}} + \frac{1}{n_{11}} \end{pmatrix} \\ &= \left( \sigma^2_{00} \begin{pmatrix} 1 & 0 & 0 & 0 \end{pmatrix} + \sigma^2_{01} \begin{pmatrix} -1 & 0 & -1 & 0 \end{pmatrix} + \sigma^2_{10} \begin{pmatrix} -1 & -1 & 0 & 0 \end{pmatrix} + \sigma^2_{11} \begin{pmatrix} 1 & 1 & 1 & 1 \end{pmatrix} \right) \\ &\qquad \cdot \begin{pmatrix} \frac{1}{n_{00}} \\ - \frac{n_{t=0}}{n_{00}n_{10}} \\ - \frac{n_{g=0}}{n_{00}n_{01}} \\ \frac{1}{n_{00}} + \frac{1}{n_{01}} + \frac{1}{n_{10}} + \frac{1}{n_{11}} \end{pmatrix} \\ &= \frac{\sigma^2_{00}}{n_{00}} + \frac{\sigma^2_{01}}{n_{01}} + \frac{\sigma^2_{10}}{n_{10}} + \frac{\sigma^2_{11}}{n_{11}} \end{aligned}

That is, we have

\hat \theta_3 \sim \mathrm{Dist}\left(\theta_3, \frac{\sigma^2_{00}}{n_{00}} + \frac{\sigma^2_{01}}{n_{01}} + \frac{\sigma^2_{10}}{n_{10}} + \frac{\sigma^2_{11}}{n_{11}}\right).

Asymptotically, we get

\hat \theta_3 \overset{\text{approx.}}{\sim} \mathcal{N}\left(\theta_3, \frac{\sigma^2_{00}}{n_{00}} + \frac{\sigma^2_{01}}{n_{01}} + \frac{\sigma^2_{10}}{n_{10}} + \frac{\sigma^2_{11}}{n_{11}}\right).

Therefore, an asymptotic approximate 95% confidence interval would be

\hat \delta \ \pm\ 1.96 \sqrt{\frac{\sigma^2_{00}}{n_{00}} + \frac{\sigma^2_{01}}{n_{01}} + \frac{\sigma^2_{10}}{n_{10}} + \frac{\sigma^2_{11}}{n_{11}}}.

Since Y is a Bernoulli random variable, we have \sigma^2_{00}, \sigma^2_{01}, \sigma^2_{10}, \sigma^2_{11} \in [0, \frac{1}{4}].[4]
Therefore, the conservative estimate

\hat \delta \ \pm\ \frac{1.96}{2} \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{10}} + \frac{1}{n_{01}} + \frac{1}{n_{00}}}

would be an asymptotic approximate 95% confidence interval.

For general confidence levels, an asymptotic approximate (1-\alpha)-confidence interval for \delta is given by

\hat \delta \ \pm\ \frac{z_{(1-\frac{\alpha}{2})}}{2} \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{10}} + \frac{1}{n_{01}} + \frac{1}{n_{00}}}

where z_{(1-\frac{\alpha}{2})} is the (1-\frac{\alpha}{2})-quantile of the standard normal distribution (for \alpha = 0.05, we have z_{(1-\frac{\alpha}{2})} \simeq 1.96).
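
Continuing the small numerical example above (ten data points and the given success counts in each cell), the conservative 95% interval is

\hat \delta \ \pm\ \frac{1.96}{2} \sqrt{\frac{4}{10}} \simeq -0.4 \pm 0.62,

which is very wide, as one would expect from only ten data points per cell.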

Sample Size Selection

Finally, we select the sample size. For any desired half-width A of the CI, it is sufficient to have

\sqrt{\frac{1}{n_{11}} + \frac{1}{n_{10}} + \frac{1}{n_{01}} + \frac{1}{n_{00}}} \leq \frac{2A}{1.96}

for the conservative CI to have a half-width of at most A. For other confidence levels, we use z_{(1-\frac{\alpha}{2})} instead of 1.96.

Practical case where t=0 data are already collected

In practice, we might already have baseline measurements (i.e., measurements under a baseline policy that will remain in place if we do not apply the intervention), so n_{00} and n_{10} are probably large. Then, if we want a lower bound on the number of new data points n := n_{01} = n_{11} to be collected per group, we can use

\frac{2}{\left(\frac{2A}{1.96}\right)^2 - \frac{1}{n_{00}} - \frac{1}{n_{10}}} \leq n

and collect samples of size n from each group g \in \{0, 1\} (i.e., 2n data points in total).

For other confidence levels, use z_{(1-\frac{\alpha}{2})} instead of 1.96.

For example, if we have n_{00}, n_{10} \geq 20000, then

select
  /* per-group sample size: 2 / ((2A/z)^2 - 1/n_00 - 1/n_10), with n_00 = n_10 = 20000 */
  ceil(2 / ((2 * effect_size / quantile) * (2 * effect_size / quantile) - 2.0 / 20000)) as sample_size
  , effect_size
  , confidence
from
  (select 0.03 as effect_size
  union all select 0.05
  union all select 0.07
  union all select 0.10
  )
  cross join
  (select 2.58 as quantile, "99%" as confidence
  union all select 1.96, "95%"
  union all select 1.64, "90%"
  union all select 1.28, "80%"
  )
order by
  sample_size asc

gives you

sample_size effect_size confidence
83 0.1 80%
136 0.1 90%
169 0.07 80%
194 0.1 95%
279 0.07 90%
334 0.05 80%
339 0.1 99%
400 0.07 95%
553 0.05 90%
704 0.07 99%
800 0.05 95%
954 0.03 80%
1427 0.05 99%
1615 0.03 90%
2390 0.03 95%
4537 0.03 99%

Thus, e.g., for A = 0.07 at the 99% confidence level, we may design n \geq 704 and collect 2n data points (i.e., n_{01}=n_{11}=n) in total.

Case when t=0 data are also to be collected

If we also need to collect the t=0 data and want a common per-cell sample size n := n_{00} = n_{10} = n_{01} = n_{11}, we can use

\frac{4}{\left(\frac{2A}{1.96}\right)^2} \leq n

and collect samples of size n from each group g \in \{0, 1\} at each time point t \in \{0, 1\} (i.e., 4n data points in total).

For other confidence levels, use z_{(1-\frac{\alpha}{2})} instead of 1.96.
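
For example, for A = 0.07 at the 95% level, the bound is

\frac{4}{\left(\frac{2 \times 0.07}{1.96}\right)^2} = 4 \times 196 = 784 \leq n,

so collecting n = 784 data points per cell (3,136 in total) suffices.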

Example Implementation in SQL

Sample size

/* Count data for t=0, assuming they already exist. */
with example_all_counts as (
  select 0 as g, 0 as t, 20000 as _count
  union all select 1, 0, 20000
)
/* Turn the count data into columns. */
, aux_counts as (
  select * from
    (select max(_count) as n_00 from example_all_counts where g = 0 and t = 0)
    cross join
    (select max(_count) as n_10 from example_all_counts where g = 1 and t = 0)
)
/* Constants */
, confidence_quantile as (
  select "95%" as confidence, 1.96 as quantile
  union all select "90%", 1.64
  union all select "80%", 1.28
)
/* Parameters */
, desired_halfwidths as (
  select 0.05 as halfwidth
  union all select 0.07
  union all select 0.10
)
/* Use the above equation to select the (half) sample size. */
select
  confidence
  , halfwidth
  /* 1.0/... forces floating-point division in dialects that truncate integer division */
  , 2 / ((2 * halfwidth / quantile) * (2 * halfwidth / quantile) - 1.0/n_00 - 1.0/n_10) as half_sample_size
from
  confidence_quantile
  cross join desired_halfwidths
  cross join aux_counts
order by
  half_sample_size

Estimates

/* Count data resulting from the experiment and the historical data. */
with example_all_counts as (
  select 0 as g, 0 as t, 10 as _count
  union all select 0, 1, 10
  union all select 1, 0, 10
  union all select 1, 1, 10
)
/* Count data conditioned on y=1. */
, example_success_counts as (
  select 0 as g, 0 as t, 0 as _count
  union all select 0, 1, 5
  union all select 1, 0, 2
  union all select 1, 1, 3
)
/* Turn the count data into columns. */
, _counts as (
  select * from
    /* nullif() avoids zero-division and generates null instead */
    (select nullif(max(_count), 0) as n_00 from example_all_counts where g = 0 and t = 0)
    cross join (select nullif(max(_count), 0) as n_10 from example_all_counts where g = 1 and t = 0)
    cross join (select nullif(max(_count), 0) as n_01 from example_all_counts where g = 0 and t = 1)
    cross join (select nullif(max(_count), 0) as n_11 from example_all_counts where g = 1 and t = 1)
    cross join (select max(_count) as m_00 from example_success_counts where g = 0 and t = 0)
    cross join (select max(_count) as m_10 from example_success_counts where g = 1 and t = 0)
    cross join (select max(_count) as m_01 from example_success_counts where g = 0 and t = 1)
    cross join (select max(_count) as m_11 from example_success_counts where g = 1 and t = 1)
)
/* Constants */
, confidence_quantile as (
  select "80%" as confidence, 1.28 as quantile
  union all select "99%", 2.58
  union all select "95%", 1.96
  union all select "90%", 1.64
)
/* Use the above equation to obtain the half-width. */
, _halfwidth as (
  select
    confidence
    /* 1.0/... forces floating-point division in dialects that truncate integer division */
    , (quantile / 2) * sqrt(1.0/n_11 + 1.0/n_10 + 1.0/n_01 + 1.0/n_00) as halfwidth
  from
    confidence_quantile
    cross join _counts
)
/* Use the above equation to obtain the center of the CI. */
, point_estimate as (
  select
    /* multiply by 1.0 to force floating-point division */
    (1.0 * m_11 / n_11 - 1.0 * m_10 / n_10) - (1.0 * m_01 / n_01 - 1.0 * m_00 / n_00) as center
  from
    _counts
)
/* Return the results. */
select
  confidence
  , center - halfwidth as lower_bound
  , center
  , center + halfwidth as upper_bound
from
  _halfwidth
  cross join point_estimate

Notes

  • A very good introduction to robust standard errors can be found in [5].
  • A different account on a relevant setting may be found in Section 6.4.1 of [6].
  • Confidence interval for the geometric distribution: [7]

References

  1. Angrist, J. D., & Pischke, J.-S. (2008). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press. Section 5.2.1.

  2. https://stats.stackexchange.com/questions/89513/difference-in-differences-estimator-for-logistic-regressions

  3. https://math.stackexchange.com/questions/3107991/how-to-compute-asymptotic-confidence-interval-for-linear-regression

  4. https://en.wikipedia.org/wiki/Popoviciu's_inequality_on_variances

  5. https://data.library.virginia.edu/understanding-robust-standard-errors/

  6. Dean, A., Voss, D., & Draguljić, D. (2017). Design and Analysis of Experiments. Springer International Publishing. https://doi.org/10.1007/978-3-319-52250-0

  7. https://stats.stackexchange.com/questions/310413/confidence-interval-using-central-limit-theorem
