🦥

【Stats Method】How to Interpret the Pearson Correlation Coefficient

Published on 2024/08/15

Machine Learning

statistics

tech

0. What it is used for

Pearson Correlation Coefficient measures the strength and direction of the linear relationship between two continuous variables.

There are two use cases in machine learning.

To evaluate the relationship between features and target variables, which can inform feature selection.
To evaluate the relationship between the outputs of two models, which can help decide whether a model should be added to an ensemble (ensembling highly similar models often brings little improvement).
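Both use cases can be sketched in a few lines. The data below is made up purely for illustration, and the names (feature_a, model_1_pred, and so on) are hypothetical; np.corrcoef is used here only as one convenient way to get r.

```python
import numpy as np

# Hypothetical toy data: three features and one target (values made up for illustration).
target = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
features = {
    "feature_a": np.array([1.1, 1.9, 3.2, 3.8, 5.1, 6.0]),  # tracks the target closely
    "feature_b": np.array([3.0, 1.0, 4.0, 1.0, 5.0, 2.0]),  # mostly unrelated
    "feature_c": np.array([6.0, 5.0, 4.0, 3.0, 2.0, 1.5]),  # moves opposite to the target
}

# Use case 1: correlation of each feature with the target.
# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is r between the two inputs.
feature_r = {name: np.corrcoef(values, target)[0, 1] for name, values in features.items()}

# Rank by |r|: a strong negative correlation is as informative as a strong positive one.
ranked = sorted(feature_r, key=lambda name: abs(feature_r[name]), reverse=True)
for name in ranked:
    print(f"{name}: r = {feature_r[name]:+.2f}")

# Use case 2: hypothetical predictions from two models on the same samples.
model_1_pred = np.array([0.10, 0.40, 0.35, 0.80, 0.90])
model_2_pred = np.array([0.15, 0.35, 0.40, 0.75, 0.95])
model_r = np.corrcoef(model_1_pred, model_2_pred)[0, 1]
print(f"model_1 vs model_2: r = {model_r:.2f}")
```

A feature with |r| near 0 (feature_b here) is a candidate to drop, at least as far as linear relationships go. Two models whose predictions correlate very strongly are unlikely to add much diversity to an ensemble.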

1. What is the Pearson Correlation Coefficient

1.1 The value r

The Pearson Correlation Coefficient, often denoted as r, quantifies the linear relationship between two continuous variables.

Its sign gives the direction of the relationship and its magnitude gives the strength. Its value ranges from -1 to 1.

・r=1 indicates a perfect positive linear relationship. As one variable increases, the other increases along a straight line. A value close to 1 means a strong positive relationship.
・r=0 indicates no linear relationship. The variables do not exhibit any linear trend with each other.
・r=-1 indicates a perfect negative linear relationship. As one variable increases, the other decreases along a straight line. A value close to -1 means a strong negative relationship.
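One point that is easy to miss: r describes how tightly the points follow a straight line, not how steep that line is. Here is a minimal sketch with made-up data to check this; the slopes 100, 0.01 and -3 are arbitrary choices.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Perfect positive linear relationships with very different slopes
r_steep = np.corrcoef(x, 100.0 * x + 1.0)[0, 1]
r_shallow = np.corrcoef(x, 0.01 * x + 1.0)[0, 1]

# Perfect negative linear relationship
r_negative = np.corrcoef(x, -3.0 * x + 10.0)[0, 1]

print(f"steep slope:    r = {r_steep:.2f}")
print(f"shallow slope:  r = {r_shallow:.2f}")
print(f"negative slope: r = {r_negative:.2f}")
```

Both positive examples give r = 1.00 even though the slopes differ by a factor of 10,000, and the negative example gives r = -1.00.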

1.2 The value p

The analysis also returns a p-value. It tests the null hypothesis that there is no linear correlation between the two variables, and so indicates how statistically reliable the observed r is.

・Low P-value (typically < 0.05)
Indicates that you can reject the null hypothesis, suggesting that the correlation between the variables is statistically significant. In other words, a correlation this strong would be unlikely to appear by chance if the variables were truly uncorrelated, so there is likely a genuine linear relationship between them.

・High P-value (typically > 0.05)
Indicates that you cannot reject the null hypothesis, suggesting that the observed correlation could be due to random chance. This does not prove that there is no relationship; it only means the data do not provide enough evidence for one.
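The p-value depends on the sample size as well as on r. The sketch below uses made-up data and simply repeats the same five points four times to get a larger sample with the identical r. Repeated points are not independent observations, so this is only a device to isolate the effect of n, not something to do with real data.

```python
import numpy as np
from scipy.stats import pearsonr

# Small made-up sample: 5 points
x_small = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_small = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
r_small, p_small = pearsonr(x_small, y_small)

# Same points repeated 4 times: r is unchanged, only n grows (illustration only)
x_large = np.tile(x_small, 4)
y_large = np.tile(y_small, 4)
r_large, p_large = pearsonr(x_large, y_large)

print(f"n = {len(x_small):2d}: r = {r_small:.2f}, p = {p_small:.4f}")
print(f"n = {len(x_large):2d}: r = {r_large:.2f}, p = {p_large:.4f}")
```

With 5 points, r = 0.80 is not significant at the 0.05 level, while the same r with 20 points is. A high r from a very small sample should therefore be read with care.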

2. Formula

$r = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \cdot \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$

Where:

$X_i$ and $Y_i$ are the individual data points.
$\bar{X}$ and $\bar{Y}$ are the mean values of $X$ and $Y$ .
$n$ is the number of data points.

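The numerator is the (unnormalized) covariance of $X$ and $Y$, and the denominator is the product of their (unnormalized) standard deviations, i.e. the dispersion of each variable. Dividing by it scales $r$ into the range from -1 to 1.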
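To make the formula concrete, here is a minimal sketch that computes r term by term with NumPy and compares it to scipy.stats.pearsonr. The helper name pearson_r is my own and not part of any library; the data is the same study-hours example used in the Code section.

```python
import numpy as np
from scipy.stats import pearsonr


def pearson_r(x, y):
    """Compute Pearson's r directly from the formula (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # X_i - X_bar
    dy = y - y.mean()  # Y_i - Y_bar
    numerator = np.sum(dx * dy)
    denominator = np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2))
    return numerator / denominator


hours_studied = np.array([1, 2, 3, 4, 3.5])
exam_scores = np.array([60, 70, 75, 85, 90])

manual_r = pearson_r(hours_studied, exam_scores)
scipy_r, _ = pearsonr(hours_studied, exam_scores)

print(f"manual: {manual_r:.4f}")
print(f"scipy:  {scipy_r:.4f}")
```

The two values should agree up to floating-point error.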

3. Limitations

Linearity
Pearson correlation only captures linear relationships. It won't detect non-linear relationships effectively.
Outliers
Outliers can significantly affect the value of $r$ , potentially leading to misleading conclusions.
Distribution Assumptions
The value of $r$ itself does not depend on the scale (units) of the variables, but the p-value of the standard test assumes that the data are roughly normally distributed. If they are not, the p-value might be less meaningful.
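The first two limitations are easy to reproduce. The following is a sketch with made-up data: a parabola for the linearity case, and an almost perfectly linear series plus one extreme point (11, -20) for the outlier case.

```python
import numpy as np
from scipy.stats import pearsonr

# Linearity: y is fully determined by x, but the relationship is not linear
x_sym = np.arange(-5, 6)
y_sq = x_sym ** 2
r_nonlinear, _ = pearsonr(x_sym, y_sq)

# Outliers: an almost perfectly linear series ...
x_clean = np.arange(1, 11, dtype=float)
y_clean = x_clean + np.tile([0.5, -0.5], 5)  # small alternating wiggle
r_clean, _ = pearsonr(x_clean, y_clean)

# ... and the same series with a single extreme point appended
x_out = np.append(x_clean, 11.0)
y_out = np.append(y_clean, -20.0)
r_outlier, _ = pearsonr(x_out, y_out)

print(f"parabola:        r = {r_nonlinear:.2f}")
print(f"without outlier: r = {r_clean:.2f}")
print(f"with outlier:    r = {r_outlier:.2f}")
```

The parabola gives r of about 0 even though y depends entirely on x, and one extreme point is enough to pull r from close to 1 down to below zero in this toy data. Plotting the data before trusting r helps catch both situations.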

4. Code

4.1 Example

Here is the example code.

import numpy as np
from scipy.stats import pearsonr

# Example data: hours studied vs. exam scores
hours_studied = np.array([1, 2, 3, 4, 3.5])
exam_scores = np.array([60, 70, 75, 85, 90])

# Calculate Pearson Correlation Coefficient
correlation_coefficient, p_value = pearsonr(hours_studied, exam_scores)

print(f"Pearson Correlation Coefficient: {correlation_coefficient:.2f}")
print(f"P-value: {p_value:.2f}")

・output

Pearson Correlation Coefficient: 0.94
P-value: 0.02

These two numbers summarize the relationship between the two variables. The coefficient is high (close to 1) and the p-value is low (< 0.05), which suggests a strong positive linear relationship between hours studied and exam scores.

4.2 Non-Linear

If the relationship is monotonic but non-linear, Spearman's rank correlation can be a better choice than Pearson's correlation.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr

# Example data: x vs. y (non-linear, monotonic relationship)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2, 4, 8, 16, 32, 64, 128, 256, 512, 1024])

# Plot the data
plt.scatter(x, y)
plt.title("Original Data")
plt.xlabel("x")
plt.ylabel("y")
plt.show()

# Calculate both correlations, keeping their p-values in separate variables
pearson_corr, pearson_p = pearsonr(x, y)
spearman_corr, spearman_p = spearmanr(x, y)

print(f"Pearson Correlation Coefficient: {pearson_corr:.2f}")
print(f"P-value: {pearson_p:.2f}")

print(f"Spearman's Rank Correlation: {spearman_corr:.2f}")
print(f"P-value: {spearman_p:.2f}")

・Results

Pearson Correlation Coefficient: 0.80
P-value: 0.01
Spearman's Rank Correlation: 1.00
P-value: 0.00

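Spearman's rank correlation is 1.00 because y always increases as x increases, while Pearson's r is only 0.80 because the relationship is exponential rather than linear. For monotonic but non-linear data, Spearman captures the relationship better.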
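Two related checks on the same data may help explain the result. Spearman's correlation is Pearson's correlation computed on the ranks of the values, and when the shape of the relationship is known (here it is exponential), transforming the data first makes Pearson applicable again. The log2 transform below is my own choice for this particular example, not a general recipe.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

x = np.arange(1, 11)
y = 2.0 ** x  # same values as the example data: 2, 4, ..., 1024

# Spearman's rank correlation ...
spearman_corr, _ = spearmanr(x, y)

# ... equals Pearson's r computed on the ranks of x and y
pearson_on_ranks, _ = pearsonr(rankdata(x), rankdata(y))

# Pearson's r after a log transform that makes the relationship linear
pearson_on_log, _ = pearsonr(x, np.log2(y))

print(f"Spearman:            {spearman_corr:.2f}")
print(f"Pearson on ranks:    {pearson_on_ranks:.2f}")
print(f"Pearson on log2(y):  {pearson_on_log:.2f}")
```

All three values come out as 1.00 for this data, since log2(y) is exactly x and the ranks of x and y are identical.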

5. Summary

In this article, I explained the Pearson Correlation Coefficient.
It is useful for checking which features are related to the target and how similar the outputs of two models are.
Please give it a try.