Correlation & Covariance

!uv pip install -q \
    pandas==2.3.3 \
    numpy==2.3.3 \
    scipy==1.16.2
import math

import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr

Covariance and correlation are statistical measures of the relationship between two variables. Both describe how changes in one variable are associated with changes in the other.

Covariance

Covariance is a measure of how much two random variables change together. If the variables tend to increase and decrease together, the covariance is positive. If one tends to increase when the other decreases, the covariance is negative.

Covariance of \((x, y)\)

\(s_{xy} = \text{Cov}(X, Y) = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x}) (y_i - \bar{y})\)

Covariance of \((x, x)\)

\(s^2_x = \text{Cov}(X, X) = \text{Var}(X) = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2\)
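The second identity can be checked numerically. A minimal sketch (with arbitrary values): the sample variance from `np.var` with `ddof=1` matches `np.cov` of a variable with itself, since both use the \(N-1\) denominator.

```python
import numpy as np

X = [1, 2, 3, 4, 5]

# Cov(X, X) equals the sample variance; ddof=1 gives the N-1 denominator
print(np.var(X, ddof=1))  # 2.5
print(np.cov(X))          # 2.5 — the covariance of X with itself
```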

Advantages

  • Quantifies the direction of the relationship between \(X\) and \(Y\)

Disadvantages

  • Covariance is unbounded and depends on the units of the variables, so it is not possible to compare two covariances to decide which relationship is stronger.
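To illustrate the comparability problem, the sketch below (with made-up data) rescales one variable: the covariance scales with the units, while the correlation (`np.corrcoef` computes the Pearson coefficient covered later) does not.

```python
import numpy as np

X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
Y_cm = [100 * y for y in Y]  # same quantity measured in smaller units

print(np.cov(X, Y)[0, 1])          # 1.5
print(np.cov(X, Y_cm)[0, 1])       # 150.0 — 100x larger, same relationship
print(np.corrcoef(X, Y)[0, 1])     # unitless
print(np.corrcoef(X, Y_cm)[0, 1])  # identical to the line above
```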
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
N = len(X)
N
10

Calculate the means

x_bar = sum(X) / N
y_bar = sum(Y) / N

print(x_bar)
print(y_bar)
5.5
19.0

Calculate the sum of the products of the deviations

sum_of_products = sum((x_i - x_bar) * (y_i - y_bar) for x_i, y_i in zip(X, Y))
sum_of_products
165.0

Calculate the sample covariance

sample_covariance = sum_of_products / (N - 1)
sample_covariance
18.333333333333332

Using NumPy

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

data = np.array([X, Y])
covariance_matrix = np.cov(data)

print(f"Cov(X, Y) check: {covariance_matrix[0, 1]:.4f}")
Cov(X, Y) check: 18.3333

Correlation

Pearson Correlation Coefficient

  • Its values are bounded between \(-1\) and \(+1\)
  • Assumes a linear (straight-line) relationship between the variables

\(r = \frac{\sum_{i=1}^{N} (x_i - \bar{x}) (y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}\)

  • \(N\) is the number of observations in the sample
  • \(x_i\) and \(y_i\) are individual data points
  • \(\bar{x}\) and \(\bar{y}\) are the sample means of \(X\) and \(Y\)
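Dividing the numerator and denominator by \(N - 1\) shows that \(r\) is just the covariance from the previous section rescaled by the two standard deviations, which is what bounds it to \([-1, +1]\):

\(r = \frac{s_{xy}}{s_x s_y}\)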
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
N = len(X)
N
10

Calculate the sample means

x_bar = sum(X) / N
y_bar = sum(Y) / N

print(x_bar)
print(y_bar)
5.5
19.0

Calculate the numerator: Sum of products of deviations

numerator = sum((x_i - x_bar) * (y_i - y_bar) for x_i, y_i in zip(X, Y))
print(f"Numerator (Covariance numerator): {numerator}")
Numerator (Covariance numerator): 165.0

Calculate the components of the denominator

Sum of squared deviations for \(X\): \(\sum_{i=1}^{N} (x_i - \bar{x})^2\)

sum_sq_dev_x = sum((x_i - x_bar) ** 2 for x_i in X)
print(f"Sum of squared deviations for X: {sum_sq_dev_x}")
Sum of squared deviations for X: 82.5

Sum of squared deviations for \(Y\): \(\sum_{i=1}^{N} (y_i - \bar{y})^2\)

sum_sq_dev_y = sum((y_i - y_bar) ** 2 for y_i in Y)
print(f"Sum of squared deviations for Y: {sum_sq_dev_y}")
Sum of squared deviations for Y: 330.0

Calculate the denominator: \(\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}\)

denominator = math.sqrt(sum_sq_dev_x) * math.sqrt(sum_sq_dev_y)
print(f"Denominator: {denominator}")
Denominator: 165.0

Calculate the Pearson correlation coefficient (r)

pearson_correlation = numerator / denominator
print(f"Pearson Correlation Coefficient (r): {pearson_correlation}")
Pearson Correlation Coefficient (r): 1.0
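The coefficient comes out exactly \(1\) because \(Y\) is an exact linear function of \(X\), here \(Y = 2X + 8\), which the snippet below verifies:

```python
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

# every point lies on the line y = 2x + 8, so Pearson's r is exactly 1
assert all(y == 2 * x + 8 for x, y in zip(X, Y))
```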

Using SciPy

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

correlation, p_value = pearsonr(X, Y)

print(f"Pearson Correlation (r): {correlation:.4f}")
print(f"P-value: {p_value:.2e}")
Pearson Correlation (r): 1.0000
P-value: 1.70e-61

Spearman Rank Correlation

  • It's a non-parametric measure of the strength and direction of the association between two ranked variables.
  • Use when the relationship is monotonic but not necessarily linear (e.g., curved)

\(\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}\)

  • \(\rho\) (rho) is the Spearman rank correlation coefficient
  • \(d_i\) is the difference between the ranks of the \(i\)-th observations of \(X\) and \(Y\)
  • \(n\) is the number of observations (or data points) in the sample
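One caveat: the closed-form \(\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}\) is only exact when there are no tied values. With ties, Spearman's \(\rho\) is defined as the Pearson correlation of the ranks, which is what `scipy.stats.spearmanr` computes. The sketch below (using made-up data that includes a tie) checks that the two approaches agree; `rankdata` assigns tied values their average rank.

```python
from scipy.stats import pearsonr, rankdata, spearmanr

X = [10, 2, 8, 1, 5, 5]        # note the tie: 5 appears twice
Y = [90, 50, 85, 40, 70, 60]

# rankdata gives both 5s the average rank 3.5
rho_via_ranks, _ = pearsonr(rankdata(X), rankdata(Y))
rho_scipy, _ = spearmanr(X, Y)

print(rho_via_ranks)  # matches spearmanr
print(rho_scipy)
```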
X = [10, 2, 8, 1, 5]
Y = [90, 50, 85, 40, 70]
n = len(X)
n
5

Create a list of (value, index) pairs for X, then sort by value

sorted_x_indexed = sorted([(val, i) for i, val in enumerate(X)])
sorted_x_indexed
[(1, 3), (2, 1), (5, 4), (8, 2), (10, 0)]

Create a list of (value, index) pairs for Y, then sort by value

sorted_y_indexed = sorted([(val, i) for i, val in enumerate(Y)])
sorted_y_indexed
[(40, 3), (50, 1), (70, 4), (85, 2), (90, 0)]

Rank_X: The rank for each element in the original X list

sorted_x = sorted(X)
# list.index returns the first match, so tied values would all share the
# minimum rank; this simple approach works here only because X has no ties
rank_x = [sorted_x.index(x) + 1 for x in X]
rank_x
[5, 2, 4, 1, 3]

Rank_Y: The rank for each element in the original Y list

sorted_y = sorted(Y)
rank_y = [sorted_y.index(y) + 1 for y in Y]
rank_y
[5, 2, 4, 1, 3]
print(f"Original X: {X}")
print(f"Ranked X:   {rank_x}")
print(f"Original Y: {Y}")
print(f"Ranked Y:   {rank_y}")
Original X: [10, 2, 8, 1, 5]
Ranked X:   [5, 2, 4, 1, 3]
Original Y: [90, 50, 85, 40, 70]
Ranked Y:   [5, 2, 4, 1, 3]

Calculate the Sum of Squared Differences \(\sum_{i=1}^{n} d_i^2\)

sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
sum_d2
0

Calculate Spearman's Rho (rho)

numerator = 6 * sum_d2
denominator = n * (n**2 - 1)
rho = 1 - (numerator / denominator)

print(f"Numerator (6 * sum_d2): {numerator}")
print(f"Denominator (n * (n^2 - 1)): {denominator}")
print(f"Spearman's Rho (ρ): {rho:.4f}")
Numerator (6 * sum_d2): 0
Denominator (n * (n^2 - 1)): 120
Spearman's Rho (ρ): 1.0000

Using SciPy

X = [10, 2, 8, 1, 5]
Y = [90, 50, 85, 40, 70]

correlation, p_value = spearmanr(X, Y)

print(f"Spearman Correlation (rho): {correlation:.4f}")
print(f"P-value: {p_value:.4f}")
Spearman Correlation (rho): 1.0000

P-value: 0.0000

Using Pandas

import pandas as pd

df = pd.DataFrame(
    {
        "X": [1, 2, 3, 4, 5],
        "Y": [90, 50, 85, 40, 70],
        "Z": [1, 5, 3, 2, 4],
    }
)

pearson_matrix = df.corr(method="pearson")

spearman_matrix = df.corr(method="spearman")

print("Pearson Correlation Matrix:\n", pearson_matrix)
print("\nSpearman Correlation Matrix:\n", spearman_matrix)
Pearson Correlation Matrix:
           X         Y         Z
X  1.000000 -0.364662  0.300000
Y -0.364662  1.000000 -0.364662
Z  0.300000 -0.364662  1.000000

Spearman Correlation Matrix:
      X    Y    Z
X  1.0 -0.5  0.3
Y -0.5  1.0 -0.4
Z  0.3 -0.4  1.0

Usage

Correlation can be applied in machine learning during the feature selection step: the closer a feature's correlation with the target is to 0, the less relevant that feature is likely to be (keeping in mind that Pearson correlation only captures linear relationships).
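A minimal sketch of that idea, using a hypothetical DataFrame and threshold: `DataFrame.corrwith` computes each column's Pearson correlation with a Series, and features whose absolute correlation with the target falls below the threshold are screened out.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "feature_a": [1, 2, 3, 4, 5],
        "feature_b": [2, 1, 4, 3, 5],
        "noise":     [7, 1, 5, 2, 4],
        "target":    [10, 12, 14, 16, 18],
    }
)

# absolute Pearson correlation of each feature with the target
correlations = df.drop(columns="target").corrwith(df["target"]).abs()

# keep only features above an (arbitrary) threshold of 0.5
selected = correlations[correlations > 0.5].index.tolist()

print(correlations)
print(selected)
```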