🐬

Rによる媒介分析

2022/04/16に公開

はじめに

本記事では基本的な媒介分析のRでの実装を行う。媒介変数が1つの場合で、アウトカムが連続量もしくは2値での直接効果、間接効果の推定を行う。SASのプログラムが付録に記載された総説（矢田, 2020）の結果をRのpackage "CMAverse"で再現することを目標とする。

私は媒介分析を勉強するにあたり主にVanderWeele先生のセミナー動画を利用した。6時間ほどあるがかなりわかりやすくYouTubeで見られるためおすすめである。ただしこのセミナーでは直接効果の分解方法については1種類しか扱わないため、その後に総説（矢田, 2020）を読み直接効果の分解方法を勉強した。

【参考文献】

【参考動画】

Program on Integrative Knowledge At Harvard

Package
結果変数が連続量の媒介分析
- Notation
- Cognitive dataset
- Variables
- 媒介分析で想定されたDAG
- データの読み込み
- データの要約
- 媒介分析
結果変数が2値の媒介分析
- データの読み込み
- データの要約
- 媒介分析
まとめ
- （メモ）今後、追加する可能性のある解析

Package

利用するPackageは以下の通り。

#package
library(dplyr)
library(tidyverse)
library(gtsummary)
library(ggplot2)

library(CMAverse)
library(haven)

結果変数が連続量の媒介分析

Notation

矢田(2020)に倣い、共変量 $C=c$ で条件付けたときの平均的な効果を以下のように定義する。反応変数等の詳細はNested Counterfactualの記法になるので元の論文を参考にして頂きたい。 $a^*$ のように $*$ が上についている場合は参照レベルであることを意味する。

\begin{align*} &E[TE(a, a^*)|c] = E[Y(a)-Y(a^*)|c] \\ &E[PDE(a, a^*)|c] = E[Y(a, M(a^*))-Y(a^*, M(a^*))|c] \\ &E[TDE(a, a^*)|c] = E[Y(a, M(a))-Y(a^*, M(a))|c] \\ &E[CDE(a, a^*; m^*)|c] = E[Y(a, m^*)-Y(a^*, m^*)|c] \\ &E[TIE(a, a^*)|c] = E[Y(a, M(a))-Y(a, M(a^*))|c] \\ &E[PIE(a, a^*)|c] = E[Y(a^*, M(a))-Y(a^*, M(a^*))|c] \\ &E[IRF(a, a^*; m^*)|c] = E[Y(a, M(a^*))-Y(a^*, M(a^*))-Y(a, m^*)+Y(a^*, m^*)|c] \\ &E[IMD(a, a^*; m^*)|c] = E[Y(a, M(a))-Y(a^*, M(a))-Y(a, M(a^*))+Y(a^*, M(a^*))|c] \\ \end{align*}

媒介分析ではTotal Effect(TE)を直接効果と間接効果に分解することを考える。ExpoaureとMediatorの交互作用がないとすればシンプルになるが、交互作用を考慮するとより複雑になる。

例えば、2つの効果に分類する場合でも以下の2種類の場合が考えられる。

\begin{align*} &TE = PDE + TIE \\ &TE = TDE + PIE \\ \end{align*}

1つ目の分解でのPDEとTIEは、NDE(Natural direct effect)とNIE(Natural indirect effect)と呼ばれることもある。

例えば、4つの効果に分類するのであれば以下の式になる。

\begin{align*} &TE = CDE + IRF + IMD + PIE \\ \end{align*}

ほかにも3種類の効果に分類する場合もある。

Cognitive dataset

矢田(2020)で取り上げられたアウトカム、媒介変数ともに連続量であるMarjoribanks(1974)による家庭環境が子どもの認知能力に与える影響の媒介分析の結果を再現する。Marjoribanks(1974)を参考に作成されたデータセット（以下、Cognitive dataset）は、ここから取得できる。

(補足)データはSASのプログラムの形式で書かれているが、以下のプログラムを用いればRでも読み込むことができる。

library(DescTools)

Input = ("
data Cognitive;
   input SubjectID FamSize  SocStatus  Encourage  Motivation  CogPerform;
   datalines;
  1 7 31 36 40 103
    …
  300 5　30 37 41 111
;
run;
;
")

Data = ParseSASDatalines(Input)
save(Cognitive, file = "path/cognitive.rda")

Variables

・データに含まれている変数

変数名	概要
SubjectID	参加者コード
FamSize	家族数
SocStatus	社会的身分スコア
Encourage	親のサポート（Exposure）
Motivation	子どもの学習意欲（Mediator）
CogPerform	子どもの認知度テストスコア（Outcome）

媒介分析で想定されたDAG

"親のサポート"→"子どもの学習意欲"→"子どもの認知度テストスコア"というパスを考え、親のサポートの認知度テストスコアへの直接効果と間接効果を推定することを考える。考慮する共変量として、"親のサポート", "子どもの学習意欲", "子どもの認知度テストスコア"に影響を与える"家族数"と、これら4変数に影響を与える"社会的身分スコア"が用いられた。

※矢田(2020)を参考に作成

データの読み込み

# cognitive
# read_data
load("path/cognitive.rda")

データの要約

データの要約として、家族数は各人数での割合を、その他の変数では中央値と四分位範囲を算出した。またExposureとOutcome、ExposureとMediator、MediatorとOutcomeの関連として散布図を作成した。共変量は特に考慮していない。

# summary
table1 <- tbl_summary(
  Cognitive) %>% 
  add_n %>% 
  modify_header(label = "**Variable**")

# Scattter Plot: Encourage and CogPerform
ggplot(Cognitive, aes(x=Encourage, y=CogPerform)) + geom_point()

# Scattter Plot: Encourage and Motivation
ggplot(Cognitive, aes(x=Encourage, y=Motivation)) + geom_point()

# Scattter Plot: Motivation and Cogperform
ggplot(Cognitive, aes(x=Motivation, y=CogPerform)) + geom_point()

媒介分析

CMAverseのcmestを用いて媒介分析を行った。Exposureの参照値と効果が知りたい値をastar、aとして指定する必要があり、今回はExposureが連続量のため"平均-1/2"および"平均+1/2"とした。今回は1/2を設定したが、SDを用いたり、RQによって自由に決めることができる。またMediatorで条件付けするときの値をmvalで指定するが、今回は平均値とした。

basecは共変量、EMintはExposure-Mediatorの交互作用効果の有無、mreg、yregは回帰分析のリンク関数（2値の
変数であれば"logistic"とする）である。今回は信頼区間をブートストラップ法で推定するためinference = "bootstrap"とした。他にはデルタ法でも推定できる。

# Mediation Analysis
a_0 <- mean(Cognitive$Encourage) - 1/2
a_1 <- mean(Cognitive$Encourage) + 1/2
m_mean <- mean(Cognitive$Motivation)

est <- cmest(data = Cognitive, model = "rb", 
             outcome = "CogPerform", 
             exposure = "Encourage",
             mediator = "Motivation", 
             basec = c("FamSize", "SocStatus"), 
             EMint = TRUE,
             mreg = list("linear"), 
             yreg = "linear",
             astar = a_0, a = a_1, mval = list(m_mean),
             estimation = "paramfunc", inference = "bootstrap", nboot = 1000)

summary(est)

結果の見方を少し説明する。

まず、はじめにOutcome regressionの結果が出力される。今回では以下の平均構造の式（リンク関数は恒等関数）である。

CogPerform = Encourage + Motivation + Encourage * Motivation + FamSize + SocStatus

## Causal Mediation Analysis
## 
## # Outcome regression:
## 
## Call:
## glm(formula = CogPerform ~ Encourage + Motivation + Encourage * 
##     Motivation + FamSize + SocStatus, family = gaussian(), data = getCall(x$reg.output$yreg)$data, 
##     weights = getCall(x$reg.output$yreg)$weights)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.64890  -0.20373   0.00770   0.09149   0.75357  
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -91.957931   3.113717 -29.533   <2e-16 ***
## Encourage              0.967975   0.103132   9.386   <2e-16 ***
## Motivation             0.997369   0.090255  11.051   <2e-16 ***
## FamSize               -0.024709   0.012747  -1.938   0.0535 .  
## SocStatus              0.012118   0.017808   0.680   0.4967    
## Encourage:Motivation   0.083485   0.002298  36.323   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.03885623)
## 
##     Null deviance: 70634.587  on 299  degrees of freedom
## Residual deviance:    11.424  on 294  degrees of freedom
## AIC: -115.06
## 
## Number of Fisher Scoring iterations: 2
##

次にMediator regression の結果が出力される。今回では以下の平均構造の式（リンク関数は恒等関数）である。

Motivation = Encourage + FamSize + SocStatus

## # Mediator regressions: 
## 
## Call:
## glm(formula = Motivation ~ Encourage + FamSize + SocStatus, family = gaussian(), 
##     data = getCall(x$reg.output$mreg[[1L]])$data, weights = getCall(x$reg.output$mreg[[1L]])$weights)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.83892  -0.10523  -0.04000   0.05182   0.80365  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.29930    0.84684  13.343  < 2e-16 ***
## Encourage    0.68786    0.03628  18.959  < 2e-16 ***
## FamSize     -0.11037    0.01303  -8.470 1.16e-15 ***
## SocStatus    0.15621    0.01817   8.597 4.81e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.05055445)
## 
##     Null deviance: 1106.730  on 299  degrees of freedom
## Residual deviance:   14.964  on 296  degrees of freedom
## AIC: -38.075
## 
## Number of Fisher Scoring iterations: 2
##

最後に媒介分析の結果が出力される。はじめに述べたNotaionとの対応を記載する。

TE: te
PDE: pnde
TDE: tnde
CDE: cde
TIE: tnie
PIE: pnie
IRF: intref
IMD: intmed

これらの変数はTEを2つもしくは4つの効果に分解した場合に対応しているが、3つの効果に分解したときの効果に直接対応している項目はない。

## # Effect decomposition on the mean difference scale via the regression-based approach
##  
## Closed-form parameter function estimation with 
##  bootstrap standard errors, percentile confidence intervals and p-values 
##  
##                Estimate  Std.error    95% CIL 95% CIU  P.val    
## cde           4.1796618  0.0564562  4.0776042   4.298 <2e-16 ***
## pnde          4.1509488  0.0565246  4.0443984   4.268 <2e-16 ***
## tnde          4.2083749  0.0562074  4.1073115   4.328 <2e-16 ***
## pnie          2.6337510  0.2241863  2.1913642   3.063 <2e-16 ***
## tnie          2.6911771  0.2291226  2.2388853   3.130 <2e-16 ***
## te            6.8421259  0.2228087  6.3921778   7.287 <2e-16 ***
## intref       -0.0287131  0.0033370 -0.0349215  -0.022 <2e-16 ***
## intmed        0.0574261  0.0052928  0.0475197   0.068 <2e-16 ***
## cde(prop)     0.6108718  0.0211050  0.5720838   0.653 <2e-16 ***
## intref(prop) -0.0041965  0.0003665 -0.0048325  -0.003 <2e-16 ***
## intmed(prop)  0.0083930  0.0005290  0.0073571   0.009 <2e-16 ***
## pnie(prop)    0.3849317  0.0209557  0.3429331   0.424 <2e-16 ***
## pm            0.3933247  0.0214059  0.3500156   0.433 <2e-16 ***
## int           0.0041965  0.0002300  0.0037239   0.005 <2e-16 ***
## pe            0.3891282  0.0211050  0.3468638   0.428 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (cde: controlled direct effect; pnde: pure natural direct effect; tnde: total natural direct effect; pnie: pure natural indirect effect; tnie: total natural indirect effect; te: total effect; intref: reference interaction; intmed: mediated interaction; cde(prop): proportion cde; intref(prop): proportion intref; intmed(prop): proportion intmed; pnie(prop): proportion pnie; pm: overall proportion mediated; int: overall proportion attributable to interaction; pe: overall proportion eliminated)
## 
## Relevant variable values: 
## $a
## [1] 34.91667
## 
## $astar
## [1] 33.91667
## 
## $mval
## $mval[[1]]
## [1] 38.47
## 
## 
## $basecval
## $basecval[[1]]
## [1] 3.803333
## 
## $basecval[[2]]
## [1] 25.07333

結果変数が2値の媒介分析

結果変数が2値だとTEや直接効果、間接効果の定義や分解方法が連続量の場合とは異なるが基本的な考え方は同様である。また、矢野(2020)で用いられたデータセットがSASを利用できない場合にはアクセスできないと思われるため、本記事ではプログラムを記載することに留める。

数点だけ述べておくと、媒介分析のプログラムではmregとyregで"logistic"を指定している、また矢野(2020)では媒介変数は $m^*=1$ で制御したとあるが $m^*=1$ では再現することができず $m^*=0$ ではないかと思われる。

補足：SASプログラム

proc causalmed data=birthwgt2 decomp pmedmod poutcomemod;
class LowBirthWgt2 Smoking2(ref=first) Death2(ref=first) AgeGroup Married2 Drinking2 SomeCollege2;
model Death2=LowBirthWgt2 | Smoking2;
mediator LowBirthWgt2=Smoking2;
covar AgeGroup Married2 Drinking2 SomeCollege2;
bootstrap  bootci(bc) nsamples=1000 seed=4989;
evaluate 'Conditional on M=1' LowBirthWgt2="1";
evaluate 'Conditional on M=0' LowBirthWgt2="0";
run;

データの読み込み

# birthwgt
birthwgt <- read_sas("path/birthwgt2.sas7bdat")
head(birthwgt)

## # A tibble: 6 x 14
##   LowBirthWgt Married AgeGroup Race   Drinking Death Smoking SomeCollege Death2
##   <chr>       <chr>      <dbl> <chr>  <chr>    <chr> <chr>   <chr>        <dbl>
## 1 No          No             3 Asian  No       No    No      "Yes"            0
## 2 No          No             2 White  No       No    No      "No"             0
## 3 Yes         Yes            2 Native No       Yes   No      "No"             1
## 4 No          No             2 White  No       No    No      "No"             0
## 5 No          No             2 White  No       No    No      "Yes"            0
## 6 No          No             2 White  No       No    No      ""               0
## # ... with 5 more variables: Drinking2 <dbl>, Smoking2 <dbl>,
## #   SomeCollege2 <dbl>, LowBirthWgt2 <dbl>, Married2 <dbl>

データの要約

birthwgt <- birthwgt %>% 
  filter(!is.na(Smoking2)) %>% 
  dplyr::select(Smoking2,LowBirthWgt2,Death2,AgeGroup,Married2,Drinking2,SomeCollege2)

i <- 1:ncol(birthwgt)
birthwgt <- as.data.frame(lapply(birthwgt[i],as.factor))

# summary
table2 <- tbl_summary(
  birthwgt,
  by = Smoking2,
  missing = "ifany") %>% 
  add_n() %>% 
  modify_header(label = "**Variable**")

# 2*2 table
table3 <- birthwgt %>% 
  tbl_cross(row = Smoking2, col = Death2, percent = "row") %>%

table4 <- birthwgt %>% 
  tbl_cross(row = Smoking2, col = LowBirthWgt2, percent = "row") %>%

table5 <- birthwgt %>% 
  tbl_cross(row = LowBirthWgt2, col = Death2, percent = "row") %>%

媒介分析

# mediation analysis
est2 <- cmest(data = birthwgt, model = "rb", 
             outcome = "Death2", 
             exposure = "Smoking2",
             mediator = "LowBirthWgt2", 
             basec = c("AgeGroup", "Married2", "Drinking2", "SomeCollege2"), 
             EMint = TRUE,
             mreg = list("logistic"), 
             yreg = "logistic",
             astar = 0, a = 1, mval = list(0), yval=1,
             estimation = "paramfunc", inference = "bootstrap", nboot = 1000)

summary(est2)

## Causal Mediation Analysis
## 
## # Outcome regression:
## 
## Call:
## glm(formula = Death2 ~ Smoking2 + LowBirthWgt2 + Smoking2 * LowBirthWgt2 + 
##     AgeGroup + Married2 + Drinking2 + SomeCollege2, family = binomial(), 
##     data = getCall(x$reg.output$yreg)$data, weights = getCall(x$reg.output$yreg)$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3845  -0.0694  -0.0608  -0.0525   3.7112  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -6.33652    0.16531 -38.331  < 2e-16 ***
## Smoking21                0.68108    0.18456   3.690 0.000224 ***
## LowBirthWgt21            3.28595    0.11285  29.117  < 2e-16 ***
## AgeGroup2               -0.03394    0.13540  -0.251 0.802062    
## AgeGroup3                0.05351    0.17870   0.299 0.764591    
## Married21                0.30848    0.09975   3.093 0.001985 ** 
## Drinking21              -0.38793    0.16932  -2.291 0.021953 *  
## SomeCollege21           -0.21444    0.10233  -2.096 0.036111 *  
## Smoking21:LowBirthWgt21 -0.56027    0.20577  -2.723 0.006473 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6506.8  on 93291  degrees of freedom
## Residual deviance: 5310.0  on 93283  degrees of freedom
##   ( 1096 個の観測値が欠損のため削除されました )
## AIC: 5328
## 
## Number of Fisher Scoring iterations: 9
## 
## 
## # Mediator regressions: 
## 
## Call:
## glm(formula = LowBirthWgt2 ~ Smoking2 + AgeGroup + Married2 + 
##     Drinking2 + SomeCollege2, family = binomial(), data = getCall(x$reg.output$mreg[[1L]])$data, 
##     weights = getCall(x$reg.output$mreg[[1L]])$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6248  -0.4387  -0.3736  -0.3616   2.5444  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.56776    0.04194 -61.219  < 2e-16 ***
## Smoking21      0.37376    0.03721  10.044  < 2e-16 ***
## AgeGroup2     -0.05914    0.03945  -1.499   0.1339    
## AgeGroup3      0.25724    0.04989   5.156 2.52e-07 ***
## Married21      0.40223    0.02753  14.611  < 2e-16 ***
## Drinking21    -0.50193    0.04792 -10.475  < 2e-16 ***
## SomeCollege21 -0.06805    0.02720  -2.502   0.0124 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 52494  on 93291  degrees of freedom
## Residual deviance: 51962  on 93285  degrees of freedom
##   ( 1096 個の観測値が欠損のため削除されました )
## AIC: 51976
## 
## Number of Fisher Scoring iterations: 5
## 
## 
## # Effect decomposition on the odds ratio scale via the regression-based approach
##  
## Closed-form parameter function estimation with 
##  bootstrap standard errors, percentile confidence intervals and p-values 
##  
##                Estimate Std.error  95% CIL 95% CIU  P.val    
## Rcde            1.97602   0.40054  1.31584   2.875  0.002 ** 
## Rpnde           1.40163   0.18638  1.06235   1.776  0.016 *  
## Rtnde           1.33743   0.17409  1.01778   1.688  0.038 *  
## Rpnie           1.26531   0.03312  1.20830   1.341 <2e-16 ***
## Rtnie           1.20736   0.03129  1.15373   1.274 <2e-16 ***
## Rte             1.69226   0.22229  1.28248   2.153 <2e-16 ***
## ERcde           0.33934   0.12910  0.11225   0.621  0.002 ** 
## ERintref        0.06228   0.10561 -0.13172   0.286  0.572    
## ERintmed        0.02533   0.04335 -0.05567   0.116  0.572    
## ERpnie          0.26531   0.03312  0.20830   0.341 <2e-16 ***
## ERcde(prop)     0.49019   0.14847  0.24511   0.792  0.002 ** 
## ERintref(prop)  0.08997   0.21337 -0.38991   0.284  0.572    
## ERintmed(prop)  0.03659   0.08468 -0.15743   0.118  0.572    
## ERpnie(prop)    0.38325   0.21927  0.22869   0.925 <2e-16 ***
## pm              0.41984   0.15041  0.29507   0.794 <2e-16 ***
## int             0.12656   0.29755 -0.55035   0.398  0.572    
## pe              0.50981   0.14847  0.20791   0.755  0.006 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Rcde: controlled direct effect odds ratio; Rpnde: pure natural direct effect odds ratio; Rtnde: total natural direct effect odds ratio; Rpnie: pure natural indirect effect odds ratio; Rtnie: total natural indirect effect odds ratio; Rte: total effect odds ratio; ERcde: excess relative risk due to controlled direct effect; ERintref: excess relative risk due to reference interaction; ERintmed: excess relative risk due to mediated interaction; ERpnie: excess relative risk due to pure natural indirect effect; ERcde(prop): proportion ERcde; ERintref(prop): proportion ERintref; ERintmed(prop): proportion ERintmed; ERpnie(prop): proportion ERpnie; pm: overall proportion mediated; int: overall proportion attributable to interaction; pe: overall proportion eliminated)
## 
## Relevant variable values: 
## $a
## [1] 1
## 
## $astar
## [1] 0
## 
## $yval
## [1] 1
## 
## $mval
## $mval[[1]]
## [1] 0
## 
## 
## $basecval
## $basecval[[1]]
## [1] 0.7564839 0.1402191
## 
## $basecval[[2]]
## [1] 0.3430415
## 
## $basecval[[3]]
## [1] 0.1427512
## 
## $basecval[[4]]
## [1] 0.4818527

まとめ

媒介分析について、媒介変数が1つの比較的シンプルな状況でTotal Effectを4つに分解する場合での効果推定を行った。本記事ではSASのプログラムが付録に記載された総説（矢田, 2020）の結果をRのpackage "CMAverse"で再現することを目標としたが、今後より発展的な媒介分析についても勉強したいと考えており追加するかもしれない。

（メモ）今後、追加する可能性のある解析

感度分析
回帰分析以外の媒介分析
複数の媒介変数が存在するときの媒介分析

サポートして頂けるとモチベーションに繋がりますのでぜひ宜しくお願いします。データ解析や臨床研究でのご相談があれば、お気軽にTwiiterもしくはメールにてご連絡下さい。

作成者：Masahiro Kondo
作成日：2022/4/16
連絡先：m.kondo1042(at)gmail.com

Rによる媒介分析

はじめに

目次

Package

結果変数が連続量の媒介分析

Notation

Cognitive dataset

Variables

媒介分析で想定されたDAG

データの読み込み

データの要約

媒介分析

結果変数が2値の媒介分析

データの読み込み

データの要約

媒介分析

まとめ

（メモ）今後、追加する可能性のある解析

Discussion