A correlation coefficient is a statistical measure that quantifies the strength and direction of the linear relationship between two variables. It is usually denoted by the symbol 'r' and ranges from -1 to 1.
There are three types of correlation.
Positive correlation (r > 0): Suppose there are two features, f1 and f2. When f1 and f2 are positively correlated, f2 tends to increase as f1 increases, and vice versa.
Negative correlation (r < 0): Suppose there are two features, f1 and f2. When f1 and f2 are negatively correlated, f2 tends to decrease as f1 increases, and vice versa.
Zero correlation (r = 0): If two features f1 and f2 have zero correlation, there is no linear relationship between them. A nonlinear relationship may still exist.
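The three cases can be sketched with a few lines of pandas. The values below are hypothetical toy data chosen only for illustration, and the variable names are not from any real dataset. Note that the last series depends entirely on f1 (it is f1 squared), yet its linear correlation with f1 is zero.

```python
import pandas as pd

# Hypothetical toy data, chosen only to illustrate the three cases.
f1 = pd.Series([-2, -1, 0, 1, 2])
f2_up = pd.Series([1, 3, 4, 6, 9])    # rises as f1 rises
f2_down = pd.Series([9, 6, 4, 3, 1])  # falls as f1 rises
f2_curve = f1 ** 2                    # depends on f1, but not linearly

# Series.corr uses the Pearson coefficient by default.
r_positive = f1.corr(f2_up)
r_negative = f1.corr(f2_down)
r_zero = f1.corr(f2_curve)

print("positive:", r_positive)
print("negative:", r_negative)
print("zero:", r_zero)
```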
Example
Suppose you are working with employee salary data that has the following features available for evaluation.
- name
- designation
- experience
- city
- height
- salary
- weight, etc.
In the above example,
- Employee designation and experience might be positively correlated with employee salary.
- Employee name and height probably have close to zero correlation with employee salary.
How to calculate correlation between two variables?
Several measures are available for computing the correlation between two variables, including:
- Pearson Correlation Coefficient
- Spearman Rank Correlation Coefficient
- Kendall Tau Rank Correlation Coefficient
- Point-Biserial Correlation Coefficient
- Cramer's V
- Biserial Correlation Coefficient
- Distance Correlation
- Hoeffding's D
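As a rough illustration of how these measures can differ, the sketch below compares the first three on the same data, using the 'method' argument of pandas Series.corr. The data is hypothetical, and the rank-based methods may require SciPy to be installed alongside pandas. The other measures in the list are not built into pandas and are not shown here.

```python
import pandas as pd

# Hypothetical data: y always grows with x, but not in a straight line.
x = pd.Series([1, 2, 3, 4, 5])
y = x ** 3

pearson = x.corr(y, method="pearson")    # measures linear association
spearman = x.corr(y, method="spearman")  # rank-based, measures monotonic association
kendall = x.corr(y, method="kendall")    # rank-based, counts concordant pairs

print("pearson:", pearson)
print("spearman:", spearman)
print("kendall:", kendall)
```

Because y increases whenever x does, both rank-based coefficients reach 1, while the Pearson coefficient stays below 1 since the relationship is not linear.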
How to calculate correlation values in Pandas?
Using the DataFrame 'corrwith' method, we can compute pairwise correlation between the rows or columns of a DataFrame and the rows or columns of a Series or another DataFrame.
Using this method, we can
- Find the correlations between variables in two different DataFrames or
- Find the correlation between a DataFrame's columns and a Series.
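Both uses can be sketched as follows. The frames, the Series, and their values are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical frames that share the column labels 'a' and 'b'.
df1 = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 3, 2, 1]})
df2 = pd.DataFrame({"a": [2, 4, 6, 8], "b": [1, 2, 3, 4]})
s = pd.Series([10, 20, 30, 40])

# DataFrame vs DataFrame: columns are matched by label ('a' with 'a', 'b' with 'b').
frame_vs_frame = df1.corrwith(df2)

# DataFrame vs Series: every column of df1 is correlated with the same Series.
frame_vs_series = df1.corrwith(s)

print(frame_vs_frame)
print(frame_vs_series)
```

In both cases the result is a Series with one correlation value per column of df1.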
Signature
DataFrame.corrwith(other, axis=0, drop=False, method='pearson')
other: A Series or DataFrame with which you want to compute the correlation.
axis: With axis=0 (the default), each column of the DataFrame is correlated with 'other', giving one value per column. With axis=1, the correlation is computed for each row instead.
drop: If set to True, labels that are not present in both objects are dropped from the result. With the default value False, those labels are kept in the result with NaN as their value.
method: By default, it calculates the Pearson coefficient. You can also use
1. 'kendall' for Kendall Tau rank correlation
2. 'spearman' for Spearman rank correlation
3. a callable that takes two 1-D arrays and returns a float.
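The effect of 'drop' is easiest to see when the two DataFrames do not share all of their column labels. The following is a minimal sketch with hypothetical frames named left and right.

```python
import pandas as pd

# Hypothetical frames: only the column 'a' exists in both.
left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 3, 2, 1]})
right = pd.DataFrame({"a": [2, 4, 6, 8], "c": [1, 0, 1, 0]})

# Default (drop=False): unmatched labels 'b' and 'c' appear with NaN.
with_nan = left.corrwith(right)

# drop=True: only labels present in both frames are kept.
only_shared = left.corrwith(right, drop=True)

print(with_nan)
print(only_shared)
```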
Example
correlations = X.corrwith(y)
The following application calculates the correlation of each of these features with employee salary.
- YearsExperience
- Age
- City
- Designation
- Height
Salary_Data.csv
YearsExperience,Age,Salary,City,Designation,Height
1.1,21,39343,Chennai,tester,5.1
1.3,21.5,46205,Chennai,tester,5.2
1.5,21.7,37731,Chennai,tester,5.3
2,22,43525,Chennai,tester,5.4
2.2,22.2,39891,Chennai,tester,5.5
2.9,23,56642,Hyderabad,Developer,5.6
3,23,60150,Hyderabad,Developer,5.7
3.2,23.3,54445,Hyderabad,Developer,5.8
3.2,23.3,64445,Hyderabad,Developer,5.9
3.7,23.6,57189,Hyderabad,Developer,4.9
3.9,23.9,63218,Hyderabad,Developer,4.10
4,24,55794,Hyderabad,Developer,5.0
4,24,56957,Hyderabad,Developer,5.1
4.1,24,57081,Hyderabad,Developer,5.1
4.5,25,61111,Hyderabad,Developer,5.1
4.9,25,67938,Hyderabad,Developer,5.1
5.1,26,66029,Hyderabad,Developer,5.2
5.3,27,83088,Hyderabad,Developer,5.2
5.9,28,81363,Hyderabad,Developer,5.2
6,29,93940,Bangalore,Developer,5.3
6.8,30,91738,Bangalore,Developer,5.3
7.1,30,98273,Bangalore,Developer,5.3
7.9,31,101302,Bangalore,Architect,5.4
8.2,32,113812,Bangalore,Architect,5.4
8.7,33,109431,Bangalore,Architect,4.9
9,34,105582,Bangalore,Architect,4.9
9.5,35,116969,Bangalore,Architect,5.5
9.6,36,112635,Bangalore,Architect,5.8
10.3,37,122391,Bangalore,Architect,5.10
10.5,38,121872,Bangalore,Architect,6.1
feature_slection.py
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the data (read_csv already returns a DataFrame)
df = pd.read_csv('Salary_Data.csv')

# Encode the categorical columns as integers.
# LabelEncoder assigns codes in sorted label order, so the sign of the
# resulting correlations depends on that arbitrary ordering.
label_encoder = LabelEncoder()
df['City'] = label_encoder.fit_transform(df['City'])
df['Designation'] = label_encoder.fit_transform(df['Designation'])

# The target variable we want to predict
target_variable = 'Salary'

# Create a DataFrame of features (excluding the target variable)
X = df.drop(target_variable, axis=1)

# Create a Series for the target variable
y = df[target_variable]

# Perform feature selection using correlation.
# For simplicity, select features whose absolute correlation with the
# target variable is greater than 0.2.
correlation_threshold = 0.2
correlations = X.corrwith(y)
print(correlations)

selected_features = X.columns[correlations.abs() > correlation_threshold]

# You can also perform feature selection using other techniques like
# statistical tests or domain knowledge.

# Print the selected features
print("\nSelected Features:")
print(selected_features)
Output
YearsExperience    0.978242
Age                0.974530
City              -0.723039
Designation       -0.891993
Height             0.152711
dtype: float64

Selected Features:
Index(['YearsExperience', 'Age', 'City', 'Designation'], dtype='object')