Pearson Correlation vs Spearman Correlation

Correlation: say we have two variables, X and Y. If a change in X tends to come with a change in Y, the two variables are said to be correlated. If Y increases as X increases, they have a positive (+ve) correlation. If Y decreases as X increases, they have a negative (-ve) correlation.

Correlation can help in estimating the value of one variable when the value of the other is known.
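
To make the sign of a correlation concrete, here is a minimal sketch with pandas. The numbers and the names x, y_up and y_down are made up for illustration; they are not from the cars data set used later.

```python
import pandas as pd

# Made-up values, for illustration only.
x = pd.Series([1, 2, 3, 4, 5])
y_up = pd.Series([2, 4, 5, 8, 11])     # rises as x rises
y_down = pd.Series([10, 8, 7, 3, 1])   # falls as x rises

pos = x.corr(y_up)      # Series.corr uses Pearson by default
neg = x.corr(y_down)

print("x vs y_up:   %f" % pos)   # positive, close to +1
print("x vs y_down: %f" % neg)   # negative, close to -1
```

The coefficient always lies between -1 and +1; the closer it is to either end, the stronger the relationship.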

Let's take a data set and calculate the correlation between two of its variables.
Download it using the link here
OR
#wget https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data

This is the cars data set, which has the following columns:
    1. mpg:           continuous
    2. cylinders:     multi-valued discrete
    3. displacement:  continuous
    4. horsepower:    continuous
    5. weight:        continuous
    6. acceleration:  continuous
    7. model year:    multi-valued discrete
    8. origin:        multi-valued discrete
    9. car name:      string (unique for each instance)
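
The nine columns listed above can be handed to pandas as column names when reading the file, since the file itself has no header row. Here is a minimal sketch, assuming the whitespace-separated layout with the car name in double quotes. The two rows below are invented for illustration and are not real records from the data set.

```python
import io
import pandas as pd

COLUMNS = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
           'acceleration', 'model year', 'origin', 'car name']

# Two invented rows shaped like the file: whitespace-separated fields,
# with the car name in double quotes. These are NOT real records.
sample = io.StringIO(
    '20.0  4  120.0  90.0  2500.  15.0  75  1\t"example car one"\n'
    '15.0  8  350.0  150.0  4000.  12.0  72  1\t"example car two"\n'
)

# header=None: the first line is data, not column names.
df = pd.read_csv(sample, sep=r'\s+', header=None, names=COLUMNS)
print(df.dtypes)
```

Printing df.dtypes is a quick way to check that the continuous columns were read as numbers and the car name as a string.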


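Let's discuss two types of correlation here. Pearson correlation measures the strength of a linear relationship between the variables. Spearman correlation is computed on the ranks of the values, so it also captures non-linear relationships as long as they are monotonic (Y consistently rises or consistently falls as X increases).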

A linear relationship means that each unit increase in X changes Y by the same fixed amount (up or down), so the points fall along a straight line.
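
The difference shows up clearly on data that always rises but not along a straight line. Here is a small sketch with made-up values (y is simply x cubed). Spearman is computed by hand as the Pearson correlation of the ranks, which is how the Spearman coefficient is defined.

```python
import pandas as pd

# Made-up data: y grows with x, but not along a straight line.
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
y = x ** 3

pearson = x.corr(y)                  # Pearson on the raw values
spearman = x.rank().corr(y.rank())   # Pearson on the ranks = Spearman

print("pearson  %f" % pearson)    # below 1: the points do not lie on a line
print("spearman %f" % spearman)   # 1: x and y are ordered identically
```

Pearson is penalised because the points curve away from a line, while Spearman only looks at the ordering and so reports a perfect relationship.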

Here is the program to measure how strongly weight is related to mpg.

pandas has a DataFrame class that can compute the correlation between two columns.
Say df is a DataFrame object; then df['col1'].corr(df['col2']) gives the correlation between col1 and col2. So df['weight'].corr(df['mpg']) tells how strongly weight and mpg are related.
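
Before running it on the cars data, the call can be tried on a tiny frame. This is only a sketch: the frame below is hypothetical, and col1 and col2 are placeholder names.

```python
import pandas as pd

# Hypothetical frame; 'col1' and 'col2' are placeholder column names.
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': [5, 3, 4, 1, 2]})

# Series.corr takes the other Series as an argument
# (a call with parentheses, not square brackets).
coeff = df['col1'].corr(df['col2'])

# DataFrame.corr returns the whole correlation matrix;
# the method argument selects the type of correlation.
matrix = df.corr(method='spearman')

print(coeff)
print(matrix)
```

The diagonal of the matrix is always 1, since every column is perfectly correlated with itself. The full program for the cars data follows.
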
import pandas as pd

COLUMNS = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
           'acceleration', 'model year', 'origin', 'car name']

def correlation(df, col1, col2, method='pearson'):
    # method can be 'pearson', 'spearman' or 'kendall'
    return df[col1].corr(df[col2], method=method)

def main():
    # The file has no header row, so pass header=None; otherwise the first
    # record is consumed as column names and lost.
    # sep=r'\s+' splits on runs of whitespace (same effect as
    # delim_whitespace=True, which newer pandas releases deprecate).
    df = pd.read_csv('auto-mpg.data', sep=r'\s+', header=None, names=COLUMNS)
    # Correlation between weight and mpg: as weight increases, how is mpg affected?
    coeff = correlation(df, 'weight', 'mpg', 'spearman')
    print("spearman correlation is %f" % coeff)
    # Now calculate the Pearson correlation
    coeff = correlation(df, 'weight', 'mpg', 'pearson')
    print("pearson correlation is %f" % coeff)


if __name__ == '__main__':
    main()