iTranslated by AI

The content below is an AI-generated translation. This is an experimental feature, and may contain errors. View original article
📖

Implementing Score Tests Using SymPy Symbolic Computation: A Case Study on Gamma Distribution Shape Parameter Testing

に公開

Introduction

This article was inspired by Sibley A, Li Z, Jiang Y, Li YJ, Chan C, Allen A, Owzar K. Facilitating the Calculation of the Efficient Score Using Symbolic Computing. Am Stat. 2018;72(2):199-205.

The probability density function of the Gamma distribution is given by the following equation:

f(x;\alpha,\theta)= \frac{1}{\Gamma(\alpha)\theta^\alpha} x^{\alpha-1}e^{-x/\theta}, \qquad x>0.

Here, \alpha is the shape parameter, and \theta is the scale parameter.

The properties of the Gamma distribution vary significantly depending on the shape parameter. When the shape parameter is greater than 1, the distribution is concentrated around the mode; when it is 1 or less, it becomes a heavy-tailed distribution that diverges at 0.


Probability density function of the Gamma distribution

If we consider this as a distribution of wealth, one might say that a society with a shape parameter of 1 or less has a large wealth gap.

On the other hand, a scale parameter increasing by a factor of k is equivalent to multiplying the random variable following the Gamma distribution by k. The scale parameter can literally change depending on how the scale is set.

Therefore, in Testing the Shape Parameter of the Gamma Distribution (or an example of a score test with nuisance parameters), I considered testing the shape parameter while treating the scale parameter as a nuisance parameter.

Specifically, let X_1, \dots, X_n be independent random variables following a Gamma distribution with shape parameter \alpha > 0 and scale parameter \theta > 0. In this context, I derived a score test to test the null hypothesis \alpha = \alpha_0 with:

  • Parameter of interest: Shape parameter \alpha
  • Nuisance parameter: Scale parameter \theta

While such hand calculations are manageable, mistakes tend to increase as models become more complex, even if they can be solved analytically. Thus, this time, with an eye toward applications to more complex models, I will perform the same task using the symbolic computing capabilities of SymPy. Since I am accustomed to R, I will use SymPy via R using reticulate.

The Main Topic

First, install SymPy.

library(reticulate)
virtualenv_install("r-reticulate", "sympy")

If it is already installed, you can import it as follows:

library(reticulate)
reticulate::py_require("sympy")
sympy <- import("sympy")
stats <- import("sympy.stats")

Declare the symbols. In particular, we declare X as follows because we want to calculate the expected value later.

alpha <- sympy$Symbol("alpha", positive = TRUE)
theta <- sympy$Symbol("theta", positive = TRUE)
X <- stats$Gamma("X", alpha, theta)

Declare the likelihood per sample (i.e., the logarithm of the density function) and calculate the score and Fisher information.

logpdf <- sympy$simplify(sympy$log(stats$density(X)(X)))

# score
S_alpha <- sympy$simplify(sympy$diff(logpdf, alpha))
S_theta <- sympy$simplify(sympy$diff(logpdf, theta))

# Fisher information
I_11 <- sympy$simplify(-stats$Expectation(sympy$diff(S_alpha, alpha)))
I_12 <- sympy$simplify(-stats$Expectation(sympy$diff(S_alpha, theta)))
I_21 <- sympy$simplify(-stats$Expectation(sympy$diff(S_theta, alpha)))
I_22 <- sympy$simplify(-stats$Expectation(sympy$diff(S_theta, theta)))

Calculate the efficient information.

I_eff <- sympy$simplify((I_11 - I_12 * I_21 / I_22))

If you print this, you should find that it matches the result derived in Testing the Shape Parameter of the Gamma Distribution (or an example of a score test with nuisance parameters). Note that here n=1.

> print(I_eff)
polygamma(1, alpha) - 1/alpha

We provide the data using random numbers. Let the null hypothesis be \alpha = 1.

# Data
set.seed(1234)
x_data <- rgamma(5, shape = 2)

# alpha under the null hypothesis
alpha0 <- 1

Sum the scores and the efficient information over the number of samples.

# Sum of scores
S_alpha0 <- sympy$Integer(0)
S_theta0 <- sympy$Integer(0)
I_eff0  <- sympy$Integer(0)
for(val in x_data) {
  S_alpha0 <- S_alpha0 + S_alpha$subs(dict(alpha = alpha0, X = val))
  S_theta0 <- S_theta0 + S_theta$subs(dict(alpha = alpha0, X = val))
  I_eff0 <- I_eff0 + I_eff$subs(dict(alpha=alpha0))
}

Solve for \theta by setting the score with respect to \theta to 0.

sol_theta <- sympy$solve(sympy$Eq(S_theta0, 0), theta)

Substitute \hat \theta into the efficient score to obtain the test statistic.

U_eff0 <- sympy$simplify(
  S_alpha0$subs(dict(theta=sol_theta[[1]]))^2/I_eff0$subs(dict(theta=sol_theta[[1]]))
)

Convert it to an R numeric value.

builtins <- import_builtins()
U_eff0_val <- py_to_r(builtins$float(U_eff0$evalf()))

Let's compare it with the following function derived in Testing the Shape Parameter of the Gamma Distribution (or an example of a score test with nuisance parameters).

gamma_shape_score_test <- function(x, alpha0) {
  n <- length(x)
  xbar <- mean(x)
  logxbar <- mean(log(x))
  
  U <- n * (logxbar - log(xbar) + log(alpha0) - digamma(alpha0))
  Ieff <- n * (trigamma(alpha0) - 1 / alpha0)
  stat <- (U^2) / Ieff
  pval <- pchisq(stat, df = 1, lower.tail = FALSE)
  
  list(
    statistic = stat,
    p.value = pval,
    null.value = alpha0
  )
}

res_r <- gamma_shape_score_test(x = x_data, 1)

You should see that almost identical values were obtained.

> print(U_eff0_val)
[1] 1.145261
> print(res_r$statistic)
[1] 1.145261
> print(all.equal(res_r$statistic, U_eff0_val))
[1] TRUE

The full R code can be found here:
https://github.com/abikoushi/Zenn_content/blob/main/R/sympy_score_gam.R

That is all.

GitHubで編集を提案

Discussion