Correlation & Covariance
| !uv pip install -q \
pandas==2.3.3 \
numpy==2.3.3 \
scipy==1.16.2
|
| import math
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr
|
Covariance and correlation are statistical measures of the relationship between two variables. Both describe how changes in one variable are associated with changes in the other.
Covariance
Covariance is a measure of how much two random variables change together. If the variables tend to increase and decrease together, the covariance is positive. If one tends to increase when the other decreases, the covariance is negative.
Covariance of \((X, Y)\)
\(s_{xy} = \text{Cov}(X, Y) = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x}) (y_i - \bar{y})\)
Covariance of \((X, X)\)
\(s^2_x = \text{Cov}(X, X) = \text{Var}(X) = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2\)
Advantages
- Quantify the relationship between \(X\) and \(Y\)
Disadvantages
- Covariance is unbounded and depends on the units of the variables, so it is not possible to compare two covariances to decide which relationship is stronger.
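The unit dependence can be seen directly: rescaling one variable rescales the covariance by the same factor, while the correlation coefficient stays put. A minimal sketch with NumPy (the scaling factor of 100 is arbitrary, e.g. switching from meters to centimeters):

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Y = np.array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

# Covariance changes with the units of measurement...
cov_original = np.cov(X, Y)[0, 1]
cov_scaled = np.cov(X * 100, Y)[0, 1]  # 100x the original covariance

# ...while the correlation coefficient does not.
corr_original = np.corrcoef(X, Y)[0, 1]
corr_scaled = np.corrcoef(X * 100, Y)[0, 1]

print(cov_original, cov_scaled)
print(corr_original, corr_scaled)  # both numerically 1
```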
| X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
N = len(X)
N
|
Calculate the means
| x_bar = sum(X) / N
y_bar = sum(Y) / N
print(x_bar)
print(y_bar)
|
Calculate the sum of the products of the deviations
| sum_of_products = sum((x_i - x_bar) * (y_i - y_bar) for x_i, y_i in zip(X, Y))
sum_of_products
|
Calculate the sample covariance
| sample_covariance = sum_of_products / (N - 1)
sample_covariance
|
Using NumPy
| X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
data = np.array([X, Y])
covariance_matrix = np.cov(data)
print(f"Cov(X, Y) check: {covariance_matrix[0, 1]:.4f}")
|
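The matrix returned by `np.cov` also contains the variances on its diagonal, matching the formula \(\text{Cov}(X, X) = \text{Var}(X)\) above (`np.cov` uses the sample denominator \(N-1\) by default). A quick check:

```python
import numpy as np

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

covariance_matrix = np.cov(np.array([X, Y]))

# Diagonal entries are the sample variances: 82.5 / 9 and 330.0 / 9
print(f"Var(X): {covariance_matrix[0, 0]:.4f}")
print(f"Var(Y): {covariance_matrix[1, 1]:.4f}")
```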
Correlation
Pearson Correlation Coefficient
- Its values are bounded between \(-1\) and \(+1\)
- Use when the relationship between the variables is linear
\(r = \frac{\sum_{i=1}^{N} (x_i - \bar{x}) (y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}\)
- \(N\) is the number of observations in the sample
- \(x_i\) and \(y_i\) are individual data points
- \(\bar{x}\) and \(\bar{y}\) are the sample means of \(X\) and \(Y\)
| X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
N = len(X)
N
|
Calculate the sample means
| x_bar = sum(X) / N
y_bar = sum(Y) / N
print(x_bar)
print(y_bar)
|
Calculate the numerator: Sum of products of deviations
| numerator = sum((x_i - x_bar) * (y_i - y_bar) for x_i, y_i in zip(X, Y))
print(f"Numerator (Covariance numerator): {numerator}")
|
Numerator (Covariance numerator): 165.0
Calculate the components of the denominator
Sum of squared deviations for \(X\): \(\sum_{i=1}^{N} (x_i - \bar{x})^2\)
| sum_sq_dev_x = sum((x_i - x_bar) ** 2 for x_i in X)
print(f"Sum of squared deviations for X: {sum_sq_dev_x}")
|
Sum of squared deviations for X: 82.5
Sum of squared deviations for \(Y\): \(\sum_{i=1}^{N} (y_i - \bar{y})^2\)
| sum_sq_dev_y = sum((y_i - y_bar) ** 2 for y_i in Y)
print(f"Sum of squared deviations for Y: {sum_sq_dev_y}")
|
Sum of squared deviations for Y: 330.0
Calculate the denominator: \(\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}\)
| denominator = math.sqrt(sum_sq_dev_x) * math.sqrt(sum_sq_dev_y)
print(f"Denominator: {denominator}")
|
Calculate the Pearson correlation coefficient (r)
| pearson_correlation = numerator / denominator
print(f"Pearson Correlation Coefficient (r): {pearson_correlation}")
|
Pearson Correlation Coefficient (r): 1.0
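The numerator and denominator above are the same sums that appear in the covariance and variance formulas, so \(r\) can equivalently be computed as the sample covariance divided by the product of the sample standard deviations. A sketch reusing the same data:

```python
import math

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
N = len(X)
x_bar, y_bar = sum(X) / N, sum(Y) / N

# Sample covariance and sample standard deviations (all use N - 1,
# which cancels out in the ratio)
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / (N - 1)
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in X) / (N - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in Y) / (N - 1))

r = cov_xy / (s_x * s_y)
print(f"r = Cov(X, Y) / (s_x * s_y) = {r:.4f}")  # 1.0000
```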
Using SciPy
| X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
correlation, p_value = pearsonr(X, Y)
print(f"Pearson Correlation (r): {correlation:.4f}")
print(f"P-value: {p_value:.2e}")
|
Pearson Correlation (r): 1.0000
P-value: 1.70e-61
Spearman Rank Correlation
- It's a non-parametric measure of the strength and direction of the association between two ranked variables.
- Use for monotonic relationships, including nonlinear ones
\(\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}\)
- \(\rho\) (rho) is the Spearman rank correlation coefficient
- \(d_i\) is the difference between the ranks of the \(i\)-th observations of \(X\) and \(Y\)
- \(n\) is the number of observations (or data points) in the sample
| X = [10, 2, 8, 1, 5]
Y = [90, 50, 85, 40, 70]
n = len(X)
n
|
Create a list of (value, index) pairs for X, then sort by value
| sorted_x_indexed = sorted([(val, i) for i, val in enumerate(X)])
sorted_x_indexed
|
[(1, 3), (2, 1), (5, 4), (8, 2), (10, 0)]
Create a list of (value, index) pairs for Y, then sort by value
| sorted_y_indexed = sorted([(val, i) for i, val in enumerate(Y)])
sorted_y_indexed
|
[(40, 3), (50, 1), (70, 4), (85, 2), (90, 0)]
Rank_X: The rank for each element in the original X list
| sorted_x = sorted(X)
# Look up each value's position in the sorted list (assumes no tied values)
rank_x = [sorted_x.index(x) + 1 for x in X]
rank_x
|
Rank_Y: The rank for each element in the original Y list
| sorted_y = sorted(Y)
# Look up each value's position in the sorted list (assumes no tied values)
rank_y = [sorted_y.index(y) + 1 for y in Y]
rank_y
|
| print(f"Original X: {X}")
print(f"Ranked X: {rank_x}")
print(f"Original Y: {Y}")
print(f"Ranked Y: {rank_y}")
|
Original X: [10, 2, 8, 1, 5]
Ranked X: [5, 2, 4, 1, 3]
Original Y: [90, 50, 85, 40, 70]
Ranked Y: [5, 2, 4, 1, 3]
Calculate the Sum of Squared Differences \(\sum_{i=1}^{N} d_i^2\)
| sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
sum_d2
|
Calculate Spearman's Rho (rho)
| numerator = 6 * sum_d2
denominator = n * (n**2 - 1)
rho = 1 - (numerator / denominator)
print(f"Numerator (6 * sum_d2): {numerator}")
print(f"Denominator (n * (n^2 - 1)): {denominator}")
print(f"Spearman's Rho (ρ): {rho:.4f}")
|
Numerator (6 * sum_d2): 0
Denominator (n * (n^2 - 1)): 120
Spearman's Rho (ρ): 1.0000
Using SciPy
| X = [10, 2, 8, 1, 5]
Y = [90, 50, 85, 40, 70]
correlation, p_value = spearmanr(X, Y)
print(f"Spearman Correlation (rho): {correlation:.4f}")
print(f"P-value: {p_value:.4f}")
|
Spearman Correlation (rho): 1.0000
P-value: 0.0000
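Note that the formula \(\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}\) assumes all ranks are distinct. With tied values, each tie group receives the average of its ranks, and `scipy.stats.spearmanr` handles this automatically. A sketch with hypothetical tied data (two 5s in `X`):

```python
from scipy.stats import rankdata, spearmanr

# Hypothetical data with a tie in X
X = [10, 2, 8, 5, 5]
Y = [90, 50, 85, 40, 70]

# rankdata assigns average ranks to ties: the two 5s share rank (2 + 3) / 2 = 2.5
ranks = rankdata(X)
print(ranks)  # [5.  1.  4.  2.5 2.5]

rho, p_value = spearmanr(X, Y)
print(f"Spearman's rho with ties: {rho:.4f}")
```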
Using Pandas
| import pandas as pd
df = pd.DataFrame(
{
"X": [1, 2, 3, 4, 5],
"Y": [90, 50, 85, 40, 70],
"Z": [1, 5, 3, 2, 4],
}
)
pearson_matrix = df.corr(method="pearson")
spearman_matrix = df.corr(method="spearman")
print("Pearson Correlation Matrix:\n", pearson_matrix)
print("\nSpearman Correlation Matrix:\n", spearman_matrix)
|
Pearson Correlation Matrix:
X Y Z
X 1.000000 -0.364662 0.300000
Y -0.364662 1.000000 -0.364662
Z 0.300000 -0.364662 1.000000
Spearman Correlation Matrix:
X Y Z
X 1.0 -0.5 0.3
Y -0.5 1.0 -0.4
Z 0.3 -0.4 1.0
Usage
Correlation can be applied in the feature-selection step of machine learning: the closer a feature's correlation with the target is to 0, the less relevant that feature is, at least for the kind of relationship the coefficient measures.
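A minimal sketch of correlation-based feature filtering with pandas; the feature names, data, and the 0.5 threshold are all hypothetical:

```python
import pandas as pd

# Hypothetical dataset: two informative features and one noisy one
df = pd.DataFrame(
    {
        "feature_a": [1, 2, 3, 4, 5, 6, 7, 8],
        "feature_b": [8, 6, 7, 5, 4, 3, 2, 1],
        "feature_c": [3, 1, 4, 1, 5, 9, 2, 6],
        "target": [2, 4, 6, 8, 10, 12, 14, 16],
    }
)

# Absolute Pearson correlation of each feature with the target
correlations = df.corr(method="pearson")["target"].drop("target").abs()
print(correlations)

# Keep features whose |r| exceeds a chosen threshold (0.5 is arbitrary)
selected = correlations[correlations > 0.5].index.tolist()
print(f"Selected features: {selected}")  # ['feature_a', 'feature_b']
```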