Programming for beginners: Quick guide to correlation coefficient

The correlation coefficient is a statistical measure that quantifies the strength and direction of the linear relationship between two variables. It is usually denoted by the symbol 'r' and takes values between -1 and +1.

There are three types of correlation.

Positive correlation (r > 0): Suppose there are two features, f1 and f2. When f1 and f2 are positively correlated, f2 tends to increase as f1 increases, and vice versa.

Negative correlation (r < 0): Suppose there are two features, f1 and f2. When f1 and f2 are negatively correlated, f2 tends to decrease as f1 increases, and vice versa.

Zero correlation (r = 0): If two features f1 and f2 have zero correlation, then there is no linear relationship between the features.
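The three cases above can be illustrated with a small sketch. The helper name pearson_r and the sample values are made up for illustration; the function follows the standard Pearson formula (covariance divided by the product of the spreads) and uses only the Python standard library.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Sum of products of deviations (numerator of r)
    covariance_sum = sum(a * b for a, b in zip(dx, dy))
    # Square root of the product of the summed squared deviations (denominator of r)
    spread = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return covariance_sum / spread

f1 = [-2, -1, 0, 1, 2]
r_positive = pearson_r(f1, [1, 3, 5, 7, 9])  # f2 rises as f1 rises
r_negative = pearson_r(f1, [9, 7, 5, 3, 1])  # f2 falls as f1 rises
r_zero = pearson_r(f1, [4, 1, 0, 1, 4])      # f2 = f1 ** 2: no linear trend

print(r_positive, r_negative, r_zero)
```

In the last case f2 is completely determined by f1, yet r is zero. A zero coefficient only rules out a linear relationship, not a relationship of any kind.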

Example

Suppose you are working with employee salary data that has the following features available for evaluation.

name
designation
city
height
salary
weight, etc.

In the above example,

Employee designation and experience might be positively correlated with employee salary.
Employee name and height might have zero correlation with employee salary.

How to calculate correlation between two variables?

There are many measures available to find the correlation between two variables. Which one fits depends on the data types involved and on the kind of relationship you expect.

Pearson Correlation Coefficient
Spearman Rank Correlation Coefficient
Kendall Tau Rank Correlation Coefficient
Point-Biserial Correlation Coefficient
Cramer's V
Biserial Correlation Coefficient
Distance Correlation
Hoeffding's D
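As a rough illustration of how two of these measures differ, the sketch below compares Pearson and Spearman on a relationship that is monotonic but not linear. The sample values are invented. Spearman is computed here as the Pearson correlation of the ranks, which is its definition when there are no ties; this is one way to obtain it, not the only one.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 6])
y = x ** 3  # always increasing, but not along a straight line

# Pearson measures how well a straight line describes the relationship
pearson = x.corr(y)

# Spearman: replace the values by their ranks, then apply Pearson to the ranks
spearman = x.rank().corr(y.rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Because y always increases with x, the ranks agree perfectly and Spearman comes out at 1, while Pearson stays below 1 because the points do not lie on a line.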

How to calculate correlation values in Pandas?

Using the DataFrame 'corrwith' method, we can compute pairwise correlation. The correlation is computed between the rows or columns of a DataFrame and the rows or columns of a Series or another DataFrame.

Using this method we can

Find the correlations between variables in two different DataFrames or
Find the correlation between a DataFrame's columns and a Series.
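Both uses can be sketched as follows. The column names and numbers are hypothetical and only serve to show the shape of the result.

```python
import pandas as pd

features = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "errors": [9, 7, 6, 3, 1],
})
score = pd.Series([52, 58, 65, 71, 80], name="score")

# DataFrame vs. Series: every column of 'features' is correlated with 'score'
by_series = features.corrwith(score)

# DataFrame vs. DataFrame: columns with the same label are paired with each other
other = pd.DataFrame({
    "hours": [2, 4, 6, 8, 10],
    "errors": [1, 3, 6, 7, 9],
})
by_frame = features.corrwith(other)

print(by_series)
print(by_frame)
```

In both cases the result is a Series with one correlation value per column label. Rows are matched by index label, so the two objects should share an index.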

Signature

DataFrame.corrwith(other, axis=0, drop=False, method='pearson')

other: The Series or DataFrame with which you want to compute the correlation.

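axis: Decides what gets paired. With axis=0 (the default), each column of the DataFrame is correlated with 'other', and the result has one value per column. With axis=1, each row is correlated instead, and the result has one value per row.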

drop: If set to True, labels that are not present in both objects are dropped from the result. With the default value False, such labels stay in the result with the value NaN.

method: The correlation method to use. By default, it calculates the Pearson coefficient. You can also use

1. 'kendall' for Kendall Tau rank correlation

2. 'spearman' for Spearman rank correlation.
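The effect of 'drop' and 'axis' is easier to see on a tiny example. This is a minimal sketch; the frames, column names, and row labels are invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 3, 2, 1]})
right = pd.DataFrame({"a": [2, 4, 6, 8], "c": [1, 0, 1, 0]})

# Only column "a" exists in both frames, so "b" and "c" come back as NaN
kept = left.corrwith(right)

# drop=True removes the labels that could not be matched
dropped = left.corrwith(right, drop=True)

# axis=1 pairs rows instead of columns; both frames share the same columns here
first = pd.DataFrame({"q1": [1, 4], "q2": [2, 3], "q3": [3, 1]}, index=["ann", "bob"])
second = pd.DataFrame({"q1": [2, 1], "q2": [4, 2], "q3": [6, 5]}, index=["ann", "bob"])
per_row = first.corrwith(second, axis=1)

print(kept)
print(dropped)
print(per_row)
```

With axis=1 the result is indexed by the row labels ("ann" and "bob" here) rather than by the column names.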

Example

correlations = X.corrwith(y)

The following application calculates the correlation of each of the features below with the employee's salary.

YearsExperience
Age
City
Designation
Height

Salary_Data.csv

YearsExperience,Age,Salary,City,Designation,Height
1.1,21,39343,Chennai,tester,5.1
1.3,21.5,46205,Chennai,tester,5.2
1.5,21.7,37731,Chennai,tester,5.3
2,22,43525,Chennai,tester,5.4
2.2,22.2,39891,Chennai,tester,5.5
2.9,23,56642,Hyderabad,Developer,5.6
3,23,60150,Hyderabad,Developer,5.7
3.2,23.3,54445,Hyderabad,Developer,5.8
3.2,23.3,64445,Hyderabad,Developer,5.9
3.7,23.6,57189,Hyderabad,Developer,4.9
3.9,23.9,63218,Hyderabad,Developer,4.10
4,24,55794,Hyderabad,Developer,5.0
4,24,56957,Hyderabad,Developer,5.1
4.1,24,57081,Hyderabad,Developer,5.1
4.5,25,61111,Hyderabad,Developer,5.1
4.9,25,67938,Hyderabad,Developer,5.1
5.1,26,66029,Hyderabad,Developer,5.2
5.3,27,83088,Hyderabad,Developer,5.2
5.9,28,81363,Hyderabad,Developer,5.2
6,29,93940,Bangalore,Developer,5.3
6.8,30,91738,Bangalore,Developer,5.3
7.1,30,98273,Bangalore,Developer,5.3
7.9,31,101302,Bangalore,Architect,5.4
8.2,32,113812,Bangalore,Architect,5.4
8.7,33,109431,Bangalore,Architect,4.9
9,34,105582,Bangalore,Architect,4.9
9.5,35,116969,Bangalore,Architect,5.5
9.6,36,112635,Bangalore,Architect,5.8
10.3,37,122391,Bangalore,Architect,5.10
10.5,38,121872,Bangalore,Architect,6.1

feature_selection.py

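import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the salary data; read_csv already returns a DataFrame
df = pd.read_csv('Salary_Data.csv')

label_encoder = LabelEncoder()

# Encode the categorical columns as integers.
# Note: the codes follow the sorted label order, which is arbitrary for
# nominal data, so read the correlations of these columns with caution.
df['City'] = label_encoder.fit_transform(df['City'])
df['Designation'] = label_encoder.fit_transform(df['Designation'])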

# The target variable is 'Salary'; we want to select the features that help predict it.
target_variable = 'Salary'

# Create a DataFrame of features (excluding the target variable)
X = df.drop(target_variable, axis=1)

# Create a Series for the target variable
y = df[target_variable]

# Perform feature selection using correlation
# For simplicity, select the features whose absolute correlation with the target variable is greater than 0.2.
correlation_threshold = 0.2
correlations = X.corrwith(y)
print(correlations)
selected_features = X.columns[correlations.abs() > correlation_threshold]

# You can also perform feature selection using other techniques like statistical tests or domain knowledge.

# Print the selected features
print("\nSelected Features:")
print(selected_features)

Output

YearsExperience    0.978242
Age                0.974530
City              -0.723039
Designation       -0.891993
Height             0.152711
dtype: float64

Selected Features:
Index(['YearsExperience', 'Age', 'City', 'Designation'], dtype='object')
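The negative values for City and Designation depend on the integer codes that LabelEncoder happens to assign. As a possible alternative (not what the script above does), a categorical column can be one-hot encoded with pandas get_dummies, so that each category gets its own 0/1 column and its own correlation. The sketch below uses six rows taken from Salary_Data.csv; the variable names are illustrative.

```python
import pandas as pd

# Six rows copied from Salary_Data.csv (City and Salary columns only)
sample = pd.DataFrame({
    "City": ["Chennai", "Chennai", "Hyderabad", "Hyderabad", "Bangalore", "Bangalore"],
    "Salary": [39343, 46205, 56642, 60150, 93940, 91738],
})

# One 0/1 indicator column per city; dtype=float keeps the columns numeric
city_dummies = pd.get_dummies(sample["City"], dtype=float)

# Correlate every indicator column with the salary
city_correlations = city_dummies.corrwith(sample["Salary"])
print(city_correlations)
```

Each value now answers a simpler question, namely whether being in that city goes along with a higher or lower salary in this sample, without imposing an order on the cities.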

Sunday, 23 February 2025
