Pearson Correlation vs Spearman Correlation

Correlation: Say we have two variables, X and Y. If a change in X tends to come with a change in Y, the two variables are said to be correlated. If Y increases as X increases, they have a +ve (positive) correlation. If Y decreases as X increases, they have a -ve (negative) correlation. The correlation coefficient ranges from -1 to +1, and a value near 0 means there is little or no relationship of the kind being measured.
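
To make the sign of the coefficient concrete, here is a minimal sketch. The numbers are made up for illustration and are not taken from any data set; Series.corr uses Pearson unless told otherwise.

```python
import pandas as pd

# Made-up numbers, for illustration only.
x = pd.Series([1, 2, 3, 4, 5])
y_up = pd.Series([2, 4, 5, 4, 6])     # tends to rise as x rises
y_down = pd.Series([9, 7, 6, 4, 1])   # tends to fall as x rises

pos = x.corr(y_up)      # Pearson is the default method
neg = x.corr(y_down)

print("x vs y_up:   %f" % pos)
print("x vs y_down: %f" % neg)
```

The first coefficient should come out positive and the second negative, both within the -1 to +1 range.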

Correlation can help in estimating the value of one variable when the value of the other is known. The stronger the correlation, the better the estimate.

Let's take a data set and calculate the correlation between two of its variables.
Download it using the link here
OR
#wget https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data

This is the Auto MPG (cars) data set, which has the following columns:
    1. mpg:           continuous
    2. cylinders:     multi-valued discrete
    3. displacement:  continuous
    4. horsepower:    continuous
    5. weight:        continuous
    6. acceleration:  continuous
    7. model year:    multi-valued discrete
    8. origin:        multi-valued discrete
    9. car name:      string (unique for each instance)


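Let's discuss two types of correlation here. Pearson correlation measures how strong the linear relationship between two variables is. Spearman correlation works on the ranks of the values instead of the values themselves, so it also handles non-linear relationships, as long as they are monotonic (Y consistently goes up, or consistently goes down, as X goes up).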

A linear relationship means that every increase in X increases (or decreases) Y at the same constant rate, so the points fall along a straight line.
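
The difference between the two methods is easiest to see on data that always rises but not along a straight line. The sketch below uses a made-up frame (the column names x, linear and curved are only for illustration) and DataFrame.corr, which returns the coefficient for every pair of columns.

```python
import pandas as pd

x = list(range(1, 11))
df = pd.DataFrame({
    "x": x,
    "linear": [3 * v + 2 for v in x],   # straight line
    "curved": [v ** 3 for v in x],      # always rising, but not a straight line
})

pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

for col in ["linear", "curved"]:
    print("%s: pearson=%f spearman=%f"
          % (col, pearson.loc["x", col], spearman.loc["x", col]))
```

On the straight line both coefficients should come out as 1. On the curved column Spearman should still be 1, because the order of the values is preserved, while Pearson should drop below 1.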

Here is a program to measure how strongly weight is related to mpg.

pandas provides the DataFrame class, whose columns can give the correlation between two variables.
If df is a DataFrame object, then df['col1'].corr(df['col2']) gives the correlation between col1 and col2. For example, df['weight'].corr(df['mpg']) shows how strongly weight and mpg are related.
import pandas as pd

COLUMNS = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
           'acceleration', 'model year', 'origin', 'car name']

def correlation(df, col1, col2, method='pearson'):
    '''Return the correlation coefficient between two columns of df.'''
    return df[col1].corr(df[col2], method=method)

def main():
    # The file has no header row, so pass header=None and the names here;
    # otherwise the first car is read as the header and lost.
    # Fields are separated by runs of spaces or tabs, and '?' marks
    # missing values.
    df = pd.read_csv('auto-mpg.data', sep=r'\s+', header=None,
                     names=COLUMNS, na_values='?')
    # Correlation between weight and mpg:
    # as weight increases, how does mpg change?
    coeff = correlation(df, 'weight', 'mpg', 'spearman')
    print("spearman correlation is %f" % coeff)
    # Now calculate the Pearson correlation.
    coeff = correlation(df, 'weight', 'mpg', 'pearson')
    print("pearson correlation is %f" % coeff)


if __name__ == '__main__':
    main()
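
One way to see what the spearman option does: Spearman correlation is Pearson correlation computed on the ranks of the values. The sketch below checks that on a small frame. The values are made up and only mimic the shape of the weight and mpg columns; they are not rows from the data set.

```python
import pandas as pd

# Made-up values, for illustration only (not rows from auto-mpg.data).
df = pd.DataFrame({
    "weight": [1800, 2100, 2500, 3000, 3400, 3900, 4400],
    "mpg":    [36.0, 33.5, 27.0, 28.0, 19.0, 15.5, 13.0],
})

# Spearman as computed by pandas.
spearman = df.corr(method="spearman").loc["weight", "mpg"]

# The same thing by hand: rank each column, then take Pearson on the ranks.
ranks = df.rank()
by_hand = ranks["weight"].corr(ranks["mpg"])   # Pearson is the default

# Pearson on the raw values, for comparison.
pearson = df["weight"].corr(df["mpg"])

print("spearman:         %f" % spearman)
print("pearson on ranks: %f" % by_hand)
print("pearson on raw:   %f" % pearson)
```

The first two numbers should match, and all three should be negative, since heavier cars in this toy frame get lower mpg.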

How to become a Data Scientist

I have been playing with data science for a few years. Here are the steps to become a data scientist.

1. Choose Python or R. (I have been a Python developer for 7 years, so I am going to recommend a few ebooks here.)
Python: Dive Into Python (PDF).
I would also recommend the Python class from Google, linked here.


2. Once you are good at programming in Python, learn statistics.
ebook: Think Stats

3. Learn NumPy, SciPy, pandas and machine learning.
ebook: Practical Machine Learning with Python.
You can pick the chapters related to NumPy, pandas, visualization techniques with pandas or seaborn, and machine learning from the above ebook.


Once you are done with these 3 ebooks, you should be able to work on some data sets. Try signing up at www.kaggle.com and competing in data science problems. You can also download some data sets and practice.