二値の時間依存性変数（一度変化したら一定）を考慮した生存時間の生成

ことの始まり

生存時間解析だけでなく、longitudinal dataの解析や因果推論など、いろいろな場面で時間依存性の変数を持つデータセットを用いる機会は増えている。

数理統計苦手マンとしては、サンプルデータを生成できたら事前に色々な評価ができて便利なので、時間依存性の変数を持つデータの生成ができるようになりたい。

また、そのような生成関数を作っておいて、諸条件を変えていけるような便利コードを作りたい。

生存時間の生成

下記の要領で、シンプルな事例からトライした。

指数分布を使う。指数分布だとrexp()があるので、ハザード関数の逆関数を用いて生存時間を得る必要はない。
時間依存性変数X3は二値変数で、どこかで1となったらずっと1。
簡単のため時間依存変数以外の変数は独立した2つの標準正規分布。
マトリックス表記でないので変数増やしたい時は行列形式にする。
365日目で打ち切りとしている。ランダムな打ち切りに変える／追加することもできると思う。
実際のデータだと生存時間はintegerなのでfloor関数を使っている。
- ChatGPTに見せるとこの点は不要ではないかといつも指摘される。
X3の変化点までの時間とそこからイベントが起きるまでの時間を足しているのは、指数分布のmemoryless propertyによるものかもしれない？
X3が変化する時点の候補から365日目を抜かないとstarttime = endtime = 365 となってしまうことがあるので調整している。max(1, hoge)なども同様の理由で、integerにしている関係で生成された時間が0になってstartime = endtime = 0となることを避けるための条件。
X3は時間依存性変数としてCoxモデルに入れたいので計数過程形式のデータにしている。
ベースとなるコードはChatGPTに言って書いてもらった。
- なので自分ならこうは書かないというものもある。逆に参考になる書き方もあるのでいい勉強になる。

library(survival)
library(tidyverse)

# Prep --------------------------------------------------
# Set parameters
set.seed(123)  # For reproducibility
N <- 500        # Number of subjects
T_max <- 365   # Maximum follow-up time in days
true_beta <- c(0.2, -0.3, 0.5)  # True coefficients for the covariates (log hazard)

# Approach using rexp() - binary static TD variable ----------------------------
## data generating function -------
gen_data <-
  function() {
    # Generate static covariates
    X1 <- rnorm(N)  # Covariate 1
    X2 <- rnorm(N)  # Covariate 2
    
    # Initialize dataset
    data <- data.frame(id=integer(0), starttime=double(0), endtime=double(0),
                       X1=double(0), X2=double(0), X3=integer(0), status=integer(0))
    
    # Hazard function and survival time simulation
    lambda_base <- 0.01  # Base hazard rate
    
    for (i in 1:N) {
      # Random time point to switch the time-varying covariate from 0 to 1
      # T_max - 1 is to ensure after the switch there is at least one day follow-up,
      # preventing starttime = endtime
      switch_time <- sample(1:(T_max-1), 1)
      
      # Hazard rate for this subject before the switch
      hazard_rate_pre_switch <- lambda_base * exp(true_beta[1] * X1[i] + true_beta[2] * X2[i])
      
      # Simulate survival time before the switch
      survival_time_pre_switch <- max(1,floor(rexp(1, hazard_rate_pre_switch)))
      
      if (survival_time_pre_switch < switch_time) {
        # Subject experiences the event before the switch
        data <- bind_rows(data, c(id=i, starttime=0, endtime=survival_time_pre_switch, X1=X1[i], X2=X2[i], X3=0, status=1))
      } else {
        # Subject does not experience the event before the switch
        data <- bind_rows(data, c(id=i, starttime=0, endtime=switch_time, X1=X1[i], X2=X2[i], X3=0, status=0))
        
        # Hazard rate for this subject after the switch
        hazard_rate_post_switch <- lambda_base * exp(true_beta[1] * X1[i] + true_beta[2] * X2[i] + true_beta[3])
        
        # Simulate survival time after the switch
        survival_time_post_switch <- max(1,floor(rexp(1, hazard_rate_post_switch)))
        
        # Determine end time and status after the switch
        end_time_post_switch <- min(switch_time + survival_time_post_switch, T_max) 
        status_post_switch <- as.integer(switch_time + survival_time_post_switch <= T_max)
        
        # Add data for the period after the switch
        data <- bind_rows(data, c(id=i, starttime=switch_time, endtime=end_time_post_switch, X1=X1[i], X2=X2[i], X3=1, status=status_post_switch))
      }
      
    }
    
    return(data)
    
  }

データの中身はこんな感じ。

> sample_data <- gen_data()
> head(sample_data)
  id starttime endtime         X1          X2 X3 status
1  1         0      14 -1.1998415 -0.08959434  0      1
2  2         0      79  1.8164182 -0.28290468  0      1
3  3         0      71  0.2776409  2.16411666  0      1
4  4         0      62 -1.1545589 -0.90907695  0      0
5  4        62      90 -1.1545589 -0.90907695  1      1
6  5         0      46  0.4880977 -0.56249520  0      1

データ生成関数の検証

バイアスとカバレッジを確認する。
このやり方は{simsurv}のvignetteを参考にした。

sim <-
  function() {

    data <- gen_data()
    
    fit <- coxph(Surv(starttime, endtime, status) ~ X1+X2+X3, data)
    summary(fit)
    
    est_cox <- fit$coefficients
    ses_cox <- sqrt(diag(fit$var))
    cil_cox <- est_cox + qnorm(.025) * ses_cox
    ciu_cox <- est_cox + qnorm(.975) * ses_cox
    
    c(bias_cox = est_cox - true_beta, 
      coverage_cox = ((true_beta > cil_cox) & (true_beta < ciu_cox))
    )
    
  }

# Get the means of replications
sim_res <-rowMeans(replicate(100, sim()))

実行結果はこんな感じ。多分うまくできている。

> sim_res
    bias_cox.X1     bias_cox.X2     bias_cox.X3 coverage_cox.X1 coverage_cox.X2 coverage_cox.X3 
  -0.0015067926   -0.0003750131    0.0058340024    0.9400000000    0.9800000000    0.9500000000

ironman

逆変換を使って前と同じことをやってみる

やること

次は同じことをAustin (2012)の式1から生成する。

この文献では乱数生成の一つである逆変換を用いている。「生存時間を表す確率変数T、その累積密度関数を $S$ としたときに、 $S(T)$ が一様分布に従い、 $F(T)=1-S(T)$ も一様分布に従うこと」を利用する。ここで $F(T)$ がイベント発生時間の関数であるので、これからデータを生成する。

式で書くと：

U = exp(−H_{0}(T) \cdot e^{\beta' X}) \sim U[0,1]

T = H_{0}^{−1} (\cfrac {−log(U )} {e^{\beta' X}})

となる。

仮定している生存時間によってハザードの逆関数の形は変わってくるが、指数分布の場合：

T = \cfrac {−log(U)} {\lambda \cdot e^{\beta' X} }

となる。時間依存性共変量が1になる前はこの式からのサンプリングになる。

なお、ある確率変数にその分布関数を適用すると一様分布に変換されることをProbability integral transformと言い、帰無仮説が成り立つときのP値が一様分布であることや、回帰モデルの診断プロットであるQQプロットなどにも関連する。

データ生成

上のコードとは違ってfloorではなくceilingを使ったり、ちょっと変わっているところがあるが気にしない。

# Set parameters
set.seed(123)  # For reproducibility
N <- 500       # Number of subjects
T_max <- 365  # Maximum time in days
lambda <- 0.01  # Base hazard rate
true_beta <- c(0.2, -0.3, 0.5)  # Coefficients for covariates

## data generating function -------
gen_data2 <-
  function() {
    
    # Generate static covariates
    X1 <- rnorm(N)  # Covariate 1 (standard normal)
    X2 <- rnorm(N)  # Covariate 2 (standard normal)
    
    # Initialize dataset
    data <- data.frame(id=integer(0), starttime=double(0), endtime=double(0),
                       X1=double(0), X2=double(0), X3=integer(0), status=integer(0))
    
    # Simulate survival times
    for (i in 1:N) {
      t0 <- sample(1:(T_max-1), 1)  # Random time point for covariate change
      
      # Calculate hazard before and after t0
      hazard_pre_t0 <- lambda * exp(true_beta[1] * X1[i] + true_beta[2] * X2[i])
      hazard_post_t0 <- lambda * exp(true_beta[1] * X1[i] + true_beta[2] * X2[i] + true_beta[3])
      
      # Simulate survival time
      u <- runif(1)
      if (-log(u) < hazard_pre_t0 * t0) {
        # Event occurs before t0
        survival_time <- ceiling(-log(u) / hazard_pre_t0)
        data <- bind_rows(data, c(id=i, starttime=0, endtime=survival_time, X1=X1[i], X2=X2[i], X3=0, status=1))
      } else {
        # Event occurs after t0
        survival_time_post_t0 <- ceiling(
          ( (-log(u) - hazard_pre_t0*t0 + hazard_post_t0*t0) / hazard_post_t0 )
                                         )
        survival_time_post_t0 <- min(survival_time_post_t0, T_max)
        status_post_t0 <- as.integer(survival_time_post_t0 <= T_max)
        data <- bind_rows(data, c(id=i, starttime=0, endtime=t0, X1=X1[i], X2=X2[i], X3=0, status=0))
        data <- bind_rows(data, c(id=i, starttime=t0, endtime=survival_time_post_t0, X1=X1[i], X2=X2[i], X3=1, status=status_post_t0))
      }
    }
    
    return(data)
  }

生成コードの検証

sim2 <-
  function() {
    
    data <- gen_data2()
    
    fit <- coxph(Surv(starttime, endtime, status) ~ X1+X2+X3, data)
    summary(fit)
    
    est_cox <- fit$coefficients
    ses_cox <- sqrt(diag(fit$var))
    cil_cox <- est_cox + qnorm(.025) * ses_cox
    ciu_cox <- est_cox + qnorm(.975) * ses_cox
    
    c(bias_cox = est_cox - true_beta, 
      coverage_cox = ((true_beta > cil_cox) & (true_beta < ciu_cox))
    )
    
  }

# Get the means of replications
sim_res2 <- rowMeans(replicate(100, sim2()))

実行結果。時間依存変数だけ少しバイアスが大きいような気がしないでもない・・・。

> sim_res2
    bias_cox.X1     bias_cox.X2     bias_cox.X3 coverage_cox.X1 coverage_cox.X2 
  -0.0003315594    0.0052254085    0.0178312514    0.9700000000    0.9700000000 
coverage_cox.X3 
   0.9300000000

ironman

Austin (2012) 再現の続き

論文リンク
3.2 Continuous time-varying covariateの式(4)を再現しようとしてうまくいっていない。
これは、時間依存変数の値が時間tに比例するという仮定を置いた場合に導出される式を用いているとのこと。
途中、計数過程形式に変換するときにchatGPTが教えてくれたdata$counting <- with(data, ave(time, id, FUN = function(x) c(0, head(x, -1))))は使ったことのないave()という関数を使っていて勉強になった。

データ生成関数の作成

# Set parameters
N <- 500       # Number of subjects
T_max <- 365  # Maximum time in days
lambda <- 0.01  # Base hazard rate
true_beta <- c(0.2, -0.3, 0.5)  # Coefficients for covariates

## data generating function -------
gen_data3 <-
  function() {
    
    # Generate static covariates
    X1 <- rnorm(N)  # Covariate 1 (standard normal)
    X2 <- rnorm(N)  # Covariate 2 (standard normal)
    

    # Initialize dataset
    data <- data.frame(id=integer(0), 
                       time=double(0),
                       X1=double(0), 
                       X2=double(0), 
                       X3=integer(0), 
                       status=integer(0)
                       )
    
    
    # Simulate survival times
    for (i in 1:N) {
      k <- runif(1,min=0.01,max=1)
      u <- runif(1)
      
      baselinehazard <- lambda * exp(true_beta[1] * X1[i] + true_beta[2] * X2[i])
      
        # Event occurs 
        survival_time <- ceiling((1/(true_beta[3]*k)) * log( 1 + (true_beta[3] * k * (-log(u))) / baselinehazard))
    
        survival_time <- min(survival_time, T_max)
        status <- as.integer(survival_time <= T_max)
        
 
        data <- bind_rows(data, 
                          data.frame(
                          id=rep(i,survival_time), 
                          time=1:survival_time,
                          X1=rep(X1[i],survival_time), 
                          X2=rep(X2[i],survival_time),
                          X3=k*(1:survival_time), 
                          status=c(rep(0, survival_time-1),1)
                          )
        )

              }
    
    data$counting <- with(data, ave(time, id, FUN = function(x) c(0, head(x, -1))))

    # Rearranging and renaming columns
    data_counting_process <- data.frame(id = data$id, 
                                        starttime = data$counting, 
                                        endtime = data$time, 
                                        X1 = data$X1, 
                                        X2 = data$X2, 
                                        X3 = data$X3, 
                                        status = data$status)
    
    # ave(time, id, FUN = function(x) ...) applies a function to each subset of the time vector, where subsets are defined by unique values of the id column. In this case, since all ids are the same, it considers the whole time vector as one group.
    # function(x) c(0, head(x, -1)) defines a function that is applied to each group. Here, x represents a grouped subset of the time vector. The function does two things:
    # head(x, -1) takes all but the last element of x. For your data, it would return c(1, 2) from the original c(1, 2, 3).
    # c(0, ...) concatenates 0 at the beginning of the vector returned by head(x, -1). So, you get c(0, 1, 2).
    
    return(data_counting_process)
  }

検証のための関数の作成

sim3 <-
  function() {
    
    data <- gen_data3()
    
    fit <- coxph(Surv(starttime, endtime, status) ~ X1+X2+X3, data)
    summary(fit)
    
    est_cox <- fit$coefficients
    ses_cox <- sqrt(diag(fit$var))
    cil_cox <- est_cox + qnorm(.025) * ses_cox
    ciu_cox <- est_cox + qnorm(.975) * ses_cox
    
    c(bias_cox = est_cox - true_beta, 
      coverage_cox = ((true_beta > cil_cox) & (true_beta < ciu_cox))
    )
    
  }

検証

data <- gen_data3()

# Get the means of replications
system.time(
  sim_res3 <-
    rowMeans(replicate(100, sim3()))
)

sim_res3

肝心の時間依存変数X3のカバレッジが95%前後にならず、修正方法募集中。

> sim_res3
    bias_cox.X1     bias_cox.X2     bias_cox.X3 coverage_cox.X1 coverage_cox.X2 
   0.0004148894    0.0013051439   -0.0250350783    0.9600000000    0.9400000000 
coverage_cox.X3 
   0.7700000000