Introduction

To get started with Python, I suggest installing Anaconda, which bundles most of the packages you will need. You can use either Jupyter Notebook or the Spyder IDE for Python development.

This chapter focuses on the essential Python constructs that you will use most often while performing data analysis. Python for data science includes many complex statements; this chapter distills the most commonly used ones. Once you are adept at these, you can steadily build your Python vocabulary as you become more conversant with the language.

1.1 Basic directory operations

# To perform file or directory operations, import the 'os' module
import os
# Change to a specific directory
os.chdir("C:\\software\\RandPythonBook\\EssentialR\\EssentialR-master")

# Get current working directory (the name 'dir' would shadow a built-in, so use 'cwd')
cwd=os.getcwd()
print(cwd)

## C:\software\RandPythonBook\EssentialR\EssentialR-master

import os
# List all files in a directory
files=os.listdir("C:\\software\\RandPythonBook\\EssentialR\\EssentialR-master")
print(files)

## ['EssentialR-1.html', 'EssentialR-1.Rmd', 'EssentialR-1_cache', 'EssentialR-1_files', 'EssentialR.pptx', 'essentialR.R', 'mytest2.R', 'README.md', 'RMarkdown.Rmd', 'tendulkar.csv', 'testdir', 'testRpackage', 'testShinyApp']
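
The paths above are specific to my Windows machine. As a minimal sketch of the same operations that should run anywhere, the example below works inside a temporary directory and builds paths with os.path.join. The file name sample.txt is just a placeholder I made up for illustration.

```python
import os
import tempfile

# Create a throwaway directory so the example runs on any machine
with tempfile.TemporaryDirectory() as tmpdir:
    # Build the path with os.path.join instead of hard-coding separators
    sample_path = os.path.join(tmpdir, "sample.txt")  # hypothetical file name
    with open(sample_path, "w") as fh:
        fh.write("hello")

    # List the directory contents and check that the file exists
    listing = os.listdir(tmpdir)
    exists = os.path.isfile(sample_path)

print(listing)
print(exists)
```

Using os.path.join keeps the code independent of whether the operating system uses '\' or '/' as the path separator.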

1.2. Tuples, lists and dictionaries

Three of the most commonly used built-in collection types in Python are

Tuples
Lists
Dictionaries

Tuples: Tuples are immutable Python objects enclosed in parentheses. Immutability means that elements cannot be added to, removed from, or changed in a tuple once it is created. However, an entire tuple can be deleted using the del statement.

List: Lists are sequences of objects, possibly of different types, enclosed within square brackets. Objects can be added to a list using append() and deleted using remove().

Dictionary: Dictionaries are collections of name (key)-value pairs enclosed within curly braces. Each key is separated from its value by a ':'. The keys must be unique within the dictionary.

The length of tuples, lists and dictionaries can be obtained with len().

# Tuples are enclosed in parentheses
mytuple=(1,3,7,6,"test")
print(mytuple)

# Lists are enclosed in square brackets
mylist = [1, 2, 7, 4, 12 ]

#Dictionary - These are similar to name-value pairs
mydict={'Name':'Ganesh','Age':54,'Occupation':'Engineer'}
print(mydict)
print(mydict['Age'])

# No of elements in tuples, lists and dictionaries can be got with len()
print("Length of tuple=",len(mytuple))
print("Length of list =", len(mylist))
print("Length of dictionary =",len(mydict))

## (1, 3, 7, 6, 'test')
## {'Age': 54, 'Name': 'Ganesh', 'Occupation': 'Engineer'}
## 54
## ('Length of tuple=', 5)
## ('Length of list =', 5)
## ('Length of dictionary =', 3)
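
To make the difference between the three types concrete, here is a small sketch that tries to modify each of them. It reuses the variable names from above; the new values (10, 20, 55) are arbitrary ones chosen for illustration.

```python
# Tuples are immutable: assigning to an element raises a TypeError
mytuple = (1, 3, 7, 6, "test")
tuple_error = None
try:
    mytuple[0] = 10
except TypeError as err:
    tuple_error = str(err)
    print("Tuples are immutable:", tuple_error)

# Lists are mutable: elements can be appended and removed
mylist = [1, 2, 7, 4, 12]
mylist.append(20)   # add 20 at the end
mylist.remove(7)    # remove the first occurrence of the value 7
print(mylist)

# Dictionary keys are unique: assigning to an existing key overwrites its value
mydict = {'Name': 'Ganesh', 'Age': 54}
mydict['Age'] = 55                 # updates the existing key
mydict['Occupation'] = 'Engineer'  # adds a new key
print(mydict)
```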

1.3. Accessing elements in tuples, lists and dictionaries

To access elements in tuples and lists, use an integer index; indices start at 0. Elements in a dictionary are accessed by key rather than by position.

1.3.1 Accessing tuples

# Accessing tuples
mytuple=(1,3,7,6,"test")
mytuple[0]
# Slice from index 2 up to, but not including, index 4 (the 3rd and 4th elements)
print(mytuple[2:4])

## (7, 6)

1.3.2 Accessing lists

# Accessing Lists
mylist = [1, 2, 7, 4, 12 ]
# Add an object to a list
mylist.append(20)
print(mylist)
# Print 3rd element . Index starts from 0
print(mylist[2])
# Print a slice from the 4th to 6th
print(mylist[3:6])
#Print the 2nd last object
print(mylist[-2])
print(mylist[-5:-2])

## [1, 2, 7, 4, 12, 20]
## 7
## [4, 12, 20]
## 12
## [2, 7, 4]

1.3.3 Accessing dictionaries

# Accessing Dictionaries
mydict={'Name':'Ganesh','Age':54,'Occupation':'Engineer','Education':'Masters'}
#Print all objects of mydict
print(mydict.items())
# Print the keys
print(mydict.keys())
#Print the value with key 'Age'
print(mydict['Age'])

## [('Age', 54), ('Education', 'Masters'), ('Name', 'Ganesh'), ('Occupation', 'Engineer')]
## ['Age', 'Education', 'Name', 'Occupation']
## 54
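
Looking up a key that does not exist with square brackets raises a KeyError. As a small sketch, the get() method and the in operator let you handle missing keys gracefully. The key 'City' below is a hypothetical key that is deliberately absent from the dictionary.

```python
mydict = {'Name': 'Ganesh', 'Age': 54, 'Occupation': 'Engineer', 'Education': 'Masters'}

# Direct lookup works when the key exists
age = mydict['Age']

# get() returns a default instead of raising KeyError for a missing key
city = mydict.get('City', 'unknown')  # 'City' is a hypothetical missing key

# The 'in' operator tests whether a key is present
has_name = 'Name' in mydict

# Loop over the key-value pairs
for key, value in mydict.items():
    print(key, value)

print(age, city, has_name)
```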

1.4. Type of a variable

To check a variable type use type()


#Create a real valued variable
a=5.4
# Print the type of 'a'
print(type(a))
# Create a string variable
b='A string'
# Print the type of b
print(type(b))
# Create a tuple
mytuple=(1,3,7,6,"test")
# Print the type of mytuple
print(type(mytuple))
#Create list
mylist = [1, 2, 7, 4, 12 ]
# Print the type of list 
print(type(mylist))
#Create mydict
mydict={'Name':'Ganesh','Age':54,'Occupation':'Engineer'}
# Print type
print(type(mydict))

## <type 'float'>
## <type 'str'>
## <type 'tuple'>
## <type 'list'>
## <type 'dict'>

1.5. Accessing help

To get help on any Python function or object use help()

#Help
import pandas as pd
help(len)

#help(dict)

## Help on built-in function len in module __builtin__:
## 
## len(...)
##     len(object) -> integer
##     
##     Return the number of items of a sequence or collection.

1.6. Numpy

NumPy is one of the most fundamental packages for scientific computing with Python. NumPy supports large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

import numpy as np
#Create a 1d numpy array
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
print(arr1)

## [ 6.   7.5  8.   0.   1. ]

1.6.1 1D array

#Create numpy array in a single line
import numpy as np
arr1= np.array([6, 7.5, 8, 0, 1])
#Print the array
print(arr1)

## [ 6.   7.5  8.   0.   1. ]

1.6.2 2D array

#Create a 2d numpy array
import numpy as np
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
# Print the 2d array
print(arr2)

## [[1 2 3 4]
##  [5 6 7 8]]

1.6.3 Dimension and shape of an array

import numpy as np
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
# Get the dimension of the array
print(arr2.ndim)

#Display the shape of the array
print(arr2.shape)

## 2
## (2L, 4L)

1.6.4 Create a matrix of zeros

import numpy as np

# Create a 3 x 6 matrix of zeros
print(np.zeros((3, 6)))

## [[ 0.  0.  0.  0.  0.  0.]
##  [ 0.  0.  0.  0.  0.  0.]
##  [ 0.  0.  0.  0.  0.  0.]]

1.6.5 Create a matrix of ones

import numpy as np
# Create a 4 x 2 matrix of ones
print(np.ones((4,2)))

## [[ 1.  1.]
##  [ 1.  1.]
##  [ 1.  1.]
##  [ 1.  1.]]

1.6.6 Some operations on numpy arrays

import numpy as np
G=np.random.randn(2,3)
print(G)
# Print the mean of the array
print(G.mean())
#Print the variance
print(G.var())

## [[ 1.14456629  0.96770677  0.16422922]
##  [-0.27179377 -0.55363259  2.51734607]]
## 0.661403664579
## 1.06102381793

1.6.7 More operations on numpy arrays

import numpy as np
#Operations between numpy arrays
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
print(arr)
# Add  arrays
print(arr+arr)
# Subtract an array from another
print(arr - arr)
# Perform element wise multiplication of arrays
print(arr * arr)

## [[ 1.  2.  3.]
##  [ 4.  5.  6.]]
## [[  2.   4.   6.]
##  [  8.  10.  12.]]
## [[ 0.  0.  0.]
##  [ 0.  0.  0.]]
## [[  1.   4.   9.]
##  [ 16.  25.  36.]]

1.6.8 Slicing numpy arrays

import numpy as np
# Create an array of the integers 0 to 9 using arange (the stop value 10 is excluded)
arr = np.arange(10)
print(arr)

# Display the 6th element. Index starts at 0
print(arr[5])
# Display from the 6th up to the 8th element
print(arr[5:8])

## [0 1 2 3 4 5 6 7 8 9]
## 5
## [5 6 7]

1.6.9 Math operations on numpy arrays

import numpy as np
# Create an array of the integers 0 to 9 using arange (the stop value 10 is excluded)
arr = np.arange(10)
arr[5:8] = 12
print(arr)
# You can apply operations over the entire array in a single command
print(np.sqrt(arr))
print(np.sin(arr))

## [ 0  1  2  3  4 12 12 12  8  9]
## [ 0.          1.          1.41421356  1.73205081  2.          3.46410162
##   3.46410162  3.46410162  2.82842712  3.        ]
## [ 0.          0.84147098  0.90929743  0.14112001 -0.7568025  -0.53657292
##  -0.53657292 -0.53657292  0.98935825  0.41211849]

1.6.10 Creating sequences with numpy arrays

import numpy as np
# Generate sequences from start to stop and increase by step
seq1=np.arange(2,12,3)
print(seq1)
# Generate a sequence between a start and stop value with 5 equally spaced values
seq2=np.linspace(start=2,stop=12,num=5)
print(seq2)

## [ 2  5  8 11]
## [  2.    4.5   7.    9.5  12. ]

1.6.11 Creating random arrays

import numpy as np
# This is very useful when trying to simulate certain conditions

# Generating random arrays
# Generate random numbers from the uniform distribution
print(np.random.rand(2,4))

# Generate random numbers from the standard normal distribution (mean 0, variance 1)
print(np.random.randn(2,4))

#Generate random integers
print(np.random.randint(3,5,size=6).reshape(2,3))

## [[ 0.29324042  0.8596953   0.71186787  0.32192382]
##  [ 0.51826054  0.55103282  0.20060528  0.47663985]]
## [[-0.41485532 -1.25092742 -1.00347057 -0.6682005 ]
##  [-1.50118065 -0.8756469  -0.58597376 -1.13507102]]
## [[4 3 4]
##  [3 4 4]]
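
The random arrays above change on every run. As a sketch of how to make a simulation repeatable, the example below uses NumPy's newer Generator interface with a fixed seed. The seed value 42 is an arbitrary choice of mine, and the Generator API is an alternative to the np.random.rand style calls used above, not what the original code uses.

```python
import numpy as np

# A seeded generator produces the same numbers on every run
rng = np.random.default_rng(42)  # 42 is an arbitrary seed

# Uniform random numbers in the half-open interval [0, 1)
uniform = rng.random((2, 4))

# Draws from the standard normal distribution (mean 0, variance 1)
normal = rng.standard_normal((2, 4))

# Random integers from 3 (inclusive) to 5 (exclusive), i.e. 3 or 4
ints = rng.integers(3, 5, size=6).reshape(2, 3)

# A second generator with the same seed reproduces the same first draw
rng2 = np.random.default_rng(42)
same = np.array_equal(rng2.random((2, 4)), uniform)

print(uniform)
print(normal)
print(ints)
print(same)
```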

import numpy as np
# Reshape as a 5 x 4 matrix
arr2d = np.arange(20)
print(arr2d)
arr2d = np.arange(20).reshape(5,4)
print(arr2d)
#Reshape same array as a 2 x 10 matrix
arr2d=arr2d.reshape(2,10)
print(arr2d.shape)
print(arr2d)

## [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
## [[ 0  1  2  3]
##  [ 4  5  6  7]
##  [ 8  9 10 11]
##  [12 13 14 15]
##  [16 17 18 19]]
## (2L, 10L)
## [[ 0  1  2  3  4  5  6  7  8  9]
##  [10 11 12 13 14 15 16 17 18 19]]

1.6.12 Indexing and slicing arrays

import numpy as np
arr2d = np.arange(20)
print(arr2d)
arr2d = np.arange(20).reshape(5,4)
print(arr2d)
# Print the element from 2nd row and 3rd column
print(arr2d[1,2])

# Slicing arr[startRow:endRow,startColumn:endColumn] 

#Slice an array
print(arr2d[2:4,1:4])

# Slice the first 3 rows (indices 0 to 2) of the column with index 2
print(arr2d[:3,2])

# Slice all rows but only columns 2 & 3
# Note: if the row or column range is omitted, it implies all rows or all columns
print(arr2d[:,1:3])

# Display all rows and all columns
print(arr2d[:,:])

## [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
## [[ 0  1  2  3]
##  [ 4  5  6  7]
##  [ 8  9 10 11]
##  [12 13 14 15]
##  [16 17 18 19]]
## 6
## [[ 9 10 11]
##  [13 14 15]]
## [ 2  6 10]
## [[ 1  2]
##  [ 5  6]
##  [ 9 10]
##  [13 14]
##  [17 18]]
## [[ 0  1  2  3]
##  [ 4  5  6  7]
##  [ 8  9 10 11]
##  [12 13 14 15]
##  [16 17 18 19]]

1.6.13 Computing sum, mean of arrays

import numpy as np
arr = np.random.randn(4, 8) # normally-distributed data
# Print the mean of the array
print(arr.mean())
print(np.mean(arr))
print(arr.sum())

## -0.176506745027
## -0.176506745027
## -5.64821584087
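
The calls above reduce the whole array to a single number. As a short sketch, the axis argument computes the same statistics per column or per row instead. I use a small fixed array here rather than random data so the results are easy to check by hand.

```python
import numpy as np

# A fixed 3 x 4 array: [[0 1 2 3], [4 5 6 7], [8 9 10 11]]
arr = np.arange(12).reshape(3, 4)

# Mean over the whole array
overall_mean = arr.mean()

# axis=0 collapses the rows, giving one mean per column
col_means = arr.mean(axis=0)

# axis=1 collapses the columns, giving one sum per row
row_sums = arr.sum(axis=1)

print(overall_mean)
print(col_means)
print(row_sums)
```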

1.7 Pandas

Pandas is a Python package for working with labeled and tabular data, such as CSV files and tables.

1.7.1. Pandas Series

# Import the pandas module
import pandas as pd
obj = pd.Series([4, 7, -5, 3])
print(obj)
# Create a series and also set the indices
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)
# Print the indices
print(obj2.index)
# Print the values
print(obj2.values)
print(obj2)

## 0    4
## 1    7
## 2   -5
## 3    3
## dtype: int64
## d    4
## b    7
## a   -5
## c    3
## dtype: int64
## Index([u'd', u'b', u'a', u'c'], dtype='object')
## [ 4  7 -5  3]
## d    4
## b    7
## a   -5
## c    3
## dtype: int64

1.7.2 Pandas dataframes

import numpy as np
import pandas as pd

# Create a dictionary of 3 lists holding state, year and population
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
print(data)
# Create a dataframe
frame = pd.DataFrame(data)
# The dataframe has 3 columns state, year and pop
print(frame)

#Create frame2
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                   index=['one', 'two', 'three', 'four', 'five'])
print(frame2)

## {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9], 'year': [2000, 2001, 2002, 2001, 2002]}
##    pop   state  year
## 0  1.5    Ohio  2000
## 1  1.7    Ohio  2001
## 2  3.6    Ohio  2002
## 3  2.4  Nevada  2001
## 4  2.9  Nevada  2002
##        year   state  pop debt
## one    2000    Ohio  1.5  NaN
## two    2001    Ohio  1.7  NaN
## three  2002    Ohio  3.6  NaN
## four   2001  Nevada  2.4  NaN
## five   2002  Nevada  2.9  NaN

1.7.3 Pandas dataframes from arrays

import numpy as np
import pandas as pd
# Create a dataframe from an array
arr2d = np.arange(20)
# Reshape array
arr = np.arange(20).reshape(4,5)
print(arr)
# Create dataframe
df=pd.DataFrame(arr)
print(df)
print(df.shape)

## [[ 0  1  2  3  4]
##  [ 5  6  7  8  9]
##  [10 11 12 13 14]
##  [15 16 17 18 19]]
##     0   1   2   3   4
## 0   0   1   2   3   4
## 1   5   6   7   8   9
## 2  10  11  12  13  14
## 3  15  16  17  18  19
## (4, 5)

1.7.4 Important commands on Pandas dataframes

There are 3 important commands which are used very often on dataframes

shape – Get the shape of the dataframe (an attribute, so it takes no parentheses)
info() – Get the details of the dataframe
columns – Get the names of the columns

import pandas as pd
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                   index=['one', 'two', 'three', 'four', 'five'])
print(frame2)
# Important commands on pandas
print(frame2.shape)
print(frame2.info())
print(frame2.columns)

##        year   state  pop debt
## one    2000    Ohio  1.5  NaN
## two    2001    Ohio  1.7  NaN
## three  2002    Ohio  3.6  NaN
## four   2001  Nevada  2.4  NaN
## five   2002  Nevada  2.9  NaN
## (5, 4)
## <class 'pandas.core.frame.DataFrame'>
## Index: 5 entries, one to five
## Data columns (total 4 columns):
## year     5 non-null int64
## state    5 non-null object
## pop      5 non-null float64
## debt     0 non-null object
## dtypes: float64(1), int64(1), object(2)
## memory usage: 200.0+ bytes
## None
## Index([u'year', u'state', u'pop', u'debt'], dtype='object')

1.7.5 Indexing and slicing dataframes

import pandas as pd
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
print(data)
frame = pd.DataFrame(data)
# The iloc indexer allows you to select by integer position, much like an array
# Display the whole frame, then the 2nd and 3rd rows of the 2nd and 3rd columns
print(frame)
print(frame.iloc[1:3,1:3])
print(frame.shape)

# Display rows 2nd to 4th and column 3
print(frame.iloc[1:4,2:3])

# Display the rows with index labels 1 to 4 (loc slices include the end label)
print(frame.loc[1:4])

## {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9], 'year': [2000, 2001, 2002, 2001, 2002]}
##    pop   state  year
## 0  1.5    Ohio  2000
## 1  1.7    Ohio  2001
## 2  3.6    Ohio  2002
## 3  2.4  Nevada  2001
## 4  2.9  Nevada  2002
##   state  year
## 1  Ohio  2001
## 2  Ohio  2002
## (5, 3)
##    year
## 1  2001
## 2  2002
## 3  2001
##    pop   state  year
## 1  1.7    Ohio  2001
## 2  3.6    Ohio  2002
## 3  2.4  Nevada  2001
## 4  2.9  Nevada  2002
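
A detail that is easy to miss in the output above is that iloc and loc treat the end of a slice differently. The sketch below rebuilds the same frame and compares the two. The variable names by_position, by_label and subset are my own.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

# iloc slices by position and excludes the end: positions 1, 2 and 3
by_position = frame.iloc[1:4]

# loc slices by label and includes the end: labels 1, 2, 3 and 4
by_label = frame.loc[1:4]

# loc can also select columns by name
subset = frame.loc[1:2, ['state', 'year']]

print(len(by_position), len(by_label))
print(subset)
```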

1.7.6 Read a CSV file

# Read csv
import os
import pandas as pd
# Escape the backslashes in Windows paths
os.chdir("C:\\software\\RandPythonBook\\python-final")
# Read a CSV file
tendulkar=pd.read_csv('tendulkar.csv',encoding = "ISO-8859-1")
# Display the top 5 rows
print(tendulkar.head())

##    Unnamed: 0 Runs Mins   BF 4s 6s     SR Pos Dismissal Inns  Opposition  \
## 0           1   15   28   24  2  0  62.50   6    bowled    2  v Pakistan   
## 1           2  DNB    -    -  -  -      -   -         -    4  v Pakistan   
## 2           3   59  254  172  4  0  34.30   6       lbw    1  v Pakistan   
## 3           4    8   24   16  1  0  50.00   6   run out    3  v Pakistan   
## 4           5   41  124   90  5  0  45.55   7    bowled    1  v Pakistan   
## 
##        Ground   Start Date  
## 0     Karachi  15 Nov 1989  
## 1     Karachi  15 Nov 1989  
## 2  Faisalabad  23 Nov 1989  
## 3  Faisalabad  23 Nov 1989  
## 4      Lahore   1 Dec 1989

1.7.7 Read an Excel file

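# Read an Excel file
import pandas as pd
# Recent pandas versions name this parameter 'sheet_name' (older versions used 'sheetname')
car=pd.read_excel('gascar.xls',sheet_name='cardata')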
print(car.head())

##    CTY    YR  LN_Gas_Car  LN_Y_Pop  LN_Pmg_Pgdp  LN_Car_Pop
## 0    1  1960    4.173244 -6.474277    -0.334548   -9.766840
## 1    1  1961    4.100989 -6.426006    -0.351328   -9.608622
## 2    1  1962    4.073177 -6.407308    -0.379518   -9.457257
## 3    1  1963    4.059509 -6.370679    -0.414251   -9.343155
## 4    1  1964    4.037689 -6.322247    -0.445335   -9.237739

1.7.8 Common operations on dataframes

Included below are some of the most common operations on dataframes

head()
tail()
shape
columns
info()

import os
import pandas as pd
tendulkar=pd.read_csv('tendulkar.csv',encoding = "ISO-8859-1")


# Display the shape of the dataframe - no of rows and no of columns
print(tendulkar.shape)

#Display the column names
print(tendulkar.columns)

# Describe the data frame. The columns and the data types of the columns
print(tendulkar.info())

## (347, 13)
## Index([u'Unnamed: 0',       u'Runs',       u'Mins',         u'BF',
##                u'4s',         u'6s',         u'SR',        u'Pos',
##         u'Dismissal',       u'Inns', u'Opposition',     u'Ground',
##        u'Start Date'],
##       dtype='object')
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 347 entries, 0 to 346
## Data columns (total 13 columns):
## Unnamed: 0    347 non-null int64
## Runs          347 non-null object
## Mins          347 non-null object
## BF            347 non-null object
## 4s            347 non-null object
## 6s            347 non-null object
## SR            347 non-null object
## Pos           347 non-null object
## Dismissal     347 non-null object
## Inns          347 non-null object
## Opposition    347 non-null object
## Ground        347 non-null object
## Start Date    347 non-null object
## dtypes: int64(1), object(12)
## memory usage: 35.3+ KB
## None

1.7.9 Renaming columns of a dataframe

import pandas as pd
tendulkar=pd.read_csv('tendulkar.csv',encoding = "ISO-8859-1")
#Rename columns as you find appropriate
tendulkar.columns=['No', 'Runs', 'Mins', 'BF', '4s', '6s', 'SR', 'Pos',
       'Dismissal', 'Inns', 'Opposition', 'Ground', 'Start Date']
print(tendulkar.head(5))

##    No Runs Mins   BF 4s 6s     SR Pos Dismissal Inns  Opposition      Ground  \
## 0   1   15   28   24  2  0  62.50   6    bowled    2  v Pakistan     Karachi   
## 1   2  DNB    -    -  -  -      -   -         -    4  v Pakistan     Karachi   
## 2   3   59  254  172  4  0  34.30   6       lbw    1  v Pakistan  Faisalabad   
## 3   4    8   24   16  1  0  50.00   6   run out    3  v Pakistan  Faisalabad   
## 4   5   41  124   90  5  0  45.55   7    bowled    1  v Pakistan      Lahore   
## 
##     Start Date  
## 0  15 Nov 1989  
## 1  15 Nov 1989  
## 2  23 Nov 1989  
## 3  23 Nov 1989  
## 4   1 Dec 1989

1.8 Cleaning dataframes

import pandas as pd
tendulkar=pd.read_csv('tendulkar.csv',encoding = "ISO-8859-1")
# Cleanup of Runs column
# Remove rows which have DNB
tendulkar.Runs
print(tendulkar.shape)
# Check all rows in Runs which do not have 'DNB'
a=tendulkar.Runs !="DNB"
# Remove rows which have 'DNB'
tendulkar=tendulkar[a]
print(tendulkar.shape)
# Remove rows which have TDNB
b=tendulkar.Runs !="TDNB"
tendulkar=tendulkar[b]
print(tendulkar.shape)

# Remove rows where balls faced (BF) is '-'
c= tendulkar.BF != "-"
tendulkar=tendulkar[c]
# Remove the '*' character from Runs (matched literally, not as a regular expression)
tendulkar.Runs= tendulkar.Runs.str.replace("*","",regex=False)
print(tendulkar.shape)
# Write to csv file
tendulkar.to_csv("tendulkar1.csv")

## (347, 13)
## (330, 13)
## (329, 13)
## (328, 13)
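
Since the steps above depend on tendulkar.csv, here is a self-contained sketch of the same cleaning pattern on a tiny dataframe. The five rows are made up for illustration and only mimic the kinds of values ('DNB', 'TDNB', '-', a trailing '*') that appear in the real file.

```python
import pandas as pd

# Made-up sample that mimics the messy values in the Runs and BF columns
sample = pd.DataFrame({'Runs': ['15', 'DNB', '59*', 'TDNB', '8'],
                       'BF': ['24', '-', '172', '-', '16']})

# Keep only rows where Runs holds an actual score
mask = ~sample['Runs'].isin(['DNB', 'TDNB'])
cleaned = sample[mask].copy()  # copy so later assignments do not touch 'sample'

# Strip the literal '*' character; regex=False treats it as plain text
cleaned['Runs'] = cleaned['Runs'].str.replace('*', '', regex=False)

# Convert the cleaned strings to numbers
cleaned['Runs'] = pd.to_numeric(cleaned['Runs'])
cleaned['BF'] = pd.to_numeric(cleaned['BF'])

print(cleaned)
print(cleaned.shape)
```

Combining the two conditions with isin() is a design choice that saves one filtering step; filtering twice as in the code above gives the same result.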

1.9 Filtering based on row values

import pandas as pd
tendulkar=pd.read_csv('tendulkar1.csv',encoding = "ISO-8859-1")
print(tendulkar.shape)

# Select specific columns from tendulkar dataframe
df1=tendulkar[['Runs','BF','Ground']]
print(df1.head())

# Select rows that meet some condition
a=tendulkar['Ground']=='Karachi'
df2=tendulkar[a]
print(df2.head())

# Filter rows where Ground is Karachi
df2=tendulkar[tendulkar['Ground']=='Karachi']
print(df2.head())

# This comparison works only when 'Runs' is numeric; if the column holds strings it raises a TypeError
b = tendulkar['Runs'] >50
tendulkar3 = tendulkar[b]
print(tendulkar3.head())

## (328, 14)
##    Runs   BF      Ground
## 0    15   24     Karachi
## 1    59  172  Faisalabad
## 2     8   16  Faisalabad
## 3    41   90      Lahore
## 4    35   51     Sialkot
##      Unnamed: 0  Unnamed: 0.1  Runs  Mins  BF  4s  6s     SR  Pos Dismissal  \
## 0             0             1    15    28  24   2   0  62.50    6    bowled   
## 203         216           217    23    49  29   5   0  79.31    4    bowled   
## 204         217           218    26    74  47   5   0  55.31    4    bowled   
## 
##      Inns  Opposition   Ground   Start Date  
## 0       2  v Pakistan  Karachi  15 Nov 1989  
## 203     2  v Pakistan  Karachi  29 Jan 2006  
## 204     4  v Pakistan  Karachi  29 Jan 2006  
##      Unnamed: 0  Unnamed: 0.1  Runs  Mins  BF  4s  6s     SR  Pos Dismissal  \
## 0             0             1    15    28  24   2   0  62.50    6    bowled   
## 203         216           217    23    49  29   5   0  79.31    4    bowled   
## 204         217           218    26    74  47   5   0  55.31    4    bowled   
## 
##      Inns  Opposition   Ground   Start Date  
## 0       2  v Pakistan  Karachi  15 Nov 1989  
## 203     2  v Pakistan  Karachi  29 Jan 2006  
## 204     4  v Pakistan  Karachi  29 Jan 2006  
##     Unnamed: 0  Unnamed: 0.1  Runs  Mins   BF  4s  6s     SR  Pos Dismissal  \
## 1            2             3    59   254  172   4   0  34.30    6       lbw   
## 5            6             7    57   193  134   6   0  42.53    6    caught   
## 8            9            10    88   324  266   5   0  33.08    6    caught   
## 12          14            15    68   216  136   8   0  50.00    6    caught   
## 13          15            16   119   225  189  17   0  62.96    6   not out   
## 
##     Inns     Opposition      Ground   Start Date  
## 1      1     v Pakistan  Faisalabad  23 Nov 1989  
## 5      3     v Pakistan     Sialkot   9 Dec 1989  
## 8      1  v New Zealand      Napier   9 Feb 1990  
## 12     2      v England  Manchester   9 Aug 1990  
## 13     4      v England  Manchester   9 Aug 1990

import pandas as pd
tendulkar=pd.read_csv('tendulkar1.csv',encoding = "ISO-8859-1")

# Filtering rows and selecting columns
# Check the type of 'Runs' column (element 0)
print(type(tendulkar['Runs'][0]))
#Convert to numeric. Use tab to see options
tendulkar['Runs']=pd.to_numeric(tendulkar['Runs'])
tendulkar['BF']=pd.to_numeric(tendulkar['BF'])

# Check the type of 'Runs' column
print(type(tendulkar['Runs'][0]))
b=tendulkar['Runs']>50

# Select only rows where Tendulkar scored more than 50
df3=tendulkar[tendulkar['Runs']>50]
df3.head()
print(tendulkar.head(3))

## <type 'numpy.int64'>
## <type 'numpy.int64'>
##    Unnamed: 0  Unnamed: 0.1  Runs  Mins   BF  4s  6s     SR  Pos Dismissal  \
## 0           0             1    15    28   24   2   0  62.50    6    bowled   
## 1           2             3    59   254  172   4   0  34.30    6       lbw   
## 2           3             4     8    24   16   1   0  50.00    6   run out   
## 
##    Inns  Opposition      Ground   Start Date  
## 0     2  v Pakistan     Karachi  15 Nov 1989  
## 1     1  v Pakistan  Faisalabad  23 Nov 1989  
## 2     3  v Pakistan  Faisalabad  23 Nov 1989
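
pd.to_numeric raises an error by default if a value cannot be parsed. As a hedged sketch, passing errors='coerce' turns unparseable entries into NaN instead, which can be handy on a column that has not been cleaned yet. The four values below are made up for illustration.

```python
import pandas as pd

# Made-up column in which one entry is not a number
runs = pd.Series(['15', 'DNB', '59', '8'])

# errors='coerce' replaces values that cannot be parsed with NaN
numeric = pd.to_numeric(runs, errors='coerce')

# Comparisons with NaN are False, so the 'DNB' row drops out of the filter
over_50 = numeric[numeric > 50]

print(numeric)
print(over_50)
```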

2.1 Operating on dates in Pandas

import pandas as pd
tendulkar=pd.read_csv('tendulkar1.csv',encoding = "ISO-8859-1")

# Operations on dates
tendulkar['Start Date']=pd.to_datetime(tendulkar['Start Date'])
tendulkar.head()
a=tendulkar['Start Date'] > '01-01-2005'
tendulkar5K=tendulkar[tendulkar['Start Date'] > '01-01-2005']
print(tendulkar5K.head())

# iloc can be used for slicing. Similar to handling numpy arrays
print(tendulkar.iloc[1:4,2:6])
# .loc is used to select rows by index
print(tendulkar.loc[[2,5]])

##      Unnamed: 0  Unnamed: 0.1  Runs  Mins   BF  4s  6s     SR  Pos Dismissal  \
## 192         202           203    94   301  202  11   0  46.53    4    caught   
## 193         204           205    52   147  102   9   0  50.98    4    caught   
## 194         205           206    52   117   91   9   0  57.14    4    caught   
## 195         206           207    41    78   71   7   0  57.74    4    caught   
## 196         207           208    16   140   98   2   0  16.32    4    caught   
## 
##      Inns  Opposition     Ground Start Date  
## 192     2  v Pakistan     Mohali 2005-03-08  
## 193     1  v Pakistan    Kolkata 2005-03-16  
## 194     3  v Pakistan    Kolkata 2005-03-16  
## 195     2  v Pakistan  Bangalore 2005-03-24  
## 196     4  v Pakistan  Bangalore 2005-03-24  
##    Runs  Mins   BF  4s
## 1    59   254  172   4
## 2     8    24   16   1
## 3    41   124   90   5
##    Unnamed: 0  Unnamed: 0.1  Runs  Mins   BF  4s  6s     SR  Pos Dismissal  \
## 2           3             4     8    24   16   1   0  50.00    6   run out   
## 5           6             7    57   193  134   6   0  42.53    6    caught   
## 
##    Inns  Opposition      Ground Start Date  
## 2     3  v Pakistan  Faisalabad 1989-11-23  
## 5     3  v Pakistan     Sialkot 1989-12-09
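
Comparing a date column against a string such as '01-01-2005' relies on pandas guessing whether the day or the month comes first. As a small sketch, stating the format when parsing and comparing against an explicit pd.Timestamp removes that ambiguity. The three dates below are sample values in the same style as the Start Date column.

```python
import pandas as pd

# Sample dates written like the Start Date column
dates = pd.Series(['15 Nov 1989', '23 Nov 1989', '8 Mar 2005'])

# State the format explicitly: day, abbreviated month name, 4-digit year
parsed = pd.to_datetime(dates, format='%d %b %Y')

# Compare against an unambiguous timestamp (year, month, day)
cutoff = pd.Timestamp(2005, 1, 1)
recent = parsed[parsed > cutoff]

# The .dt accessor exposes date parts such as the year
years = parsed.dt.year

print(recent)
print(years.tolist())
```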

2.2 Remove NA values in a dataframe

import pandas as pd
tendulkar=pd.read_csv('tendulkar1.csv',encoding = "ISO-8859-1")

#Further clean up
tendulkar2=tendulkar.dropna()
print(tendulkar2.shape)

## (328, 14)

2.3 Compute the mean of a column

import pandas as pd
tendulkar=pd.read_csv('tendulkar1.csv',encoding = "ISO-8859-1")
#Compute mean of column
print(tendulkar['Runs'].mean())

## 48.506097561

2.4 Group rows by condition, compute mean and then sort

import pandas as pd
tendulkar=pd.read_csv('tendulkar1.csv',encoding = "ISO-8859-1")
# Group by ground and compute mean
a=tendulkar[['Runs','BF','Ground']].groupby('Ground').mean()

# Sort by Runs in descending order
b=a.sort_values('Runs',ascending=False)
print(b.head(3))
tendulkar[['Runs','BF','Ground']].groupby('Ground').mean().sort_values('Runs',ascending=False)

print(tendulkar.head(3))

##                 Runs     BF
## Ground                     
## Multan         194.0  348.0
## Leeds          193.0  330.0
## Colombo (RPS)  143.0  247.0
##    Unnamed: 0  Unnamed: 0.1  Runs  Mins   BF  4s  6s     SR  Pos Dismissal  \
## 0           0             1    15    28   24   2   0  62.50    6    bowled   
## 1           2             3    59   254  172   4   0  34.30    6       lbw   
## 2           3             4     8    24   16   1   0  50.00    6   run out   
## 
##    Inns  Opposition      Ground   Start Date  
## 0     2  v Pakistan     Karachi  15 Nov 1989  
## 1     1  v Pakistan  Faisalabad  23 Nov 1989  
## 2     3  v Pakistan  Faisalabad  23 Nov 1989

2.5 Group rows by condition, then sort and aggregate

import pandas as pd
tendulkar=pd.read_csv('tendulkar1.csv',encoding = "ISO-8859-1")
# Group rows by some criterion with groupby and perform an operation on each group
a=tendulkar[['Runs','BF','Ground']].groupby('Ground').mean()
b=a.sort_values('Runs',ascending=False)
print(b.head(4))

# Group by Ground, compute mean and sort in descending order
c=tendulkar[['Runs','BF','Ground']].groupby('Ground').mean().sort_values('Runs',ascending=False)
print(c.head(3))

# Group by Opposition, compute mean and sort in descending order
d=tendulkar[['Runs','BF','Opposition']].groupby('Opposition').mean().sort_values('Runs',ascending=False)
print(d.head(3))
# You can add all the commands in a single line 
f=tendulkar[['Runs','BF','Ground']].groupby('Ground').mean().sort_values('Runs',ascending=False)
print(f.head(3))

# Compute the sum, mean and count of Runs and Balls faced
g=tendulkar[['Runs','BF','Ground']].groupby('Ground').agg(['sum','mean','count'])
print(g.head(3))

##                 Runs     BF
## Ground                     
## Multan         194.0  348.0
## Leeds          193.0  330.0
## Colombo (RPS)  143.0  247.0
## Lucknow        142.0  224.0
##                 Runs     BF
## Ground                     
## Multan         194.0  348.0
## Leeds          193.0  330.0
## Colombo (RPS)  143.0  247.0
##                    Runs          BF
## Opposition                         
## v Bangladesh  91.111111  143.777778
## v Zimbabwe    65.571429  114.071429
## v Sri Lanka   56.685714  104.971429
##                 Runs     BF
## Ground                     
## Multan         194.0  348.0
## Leeds          193.0  330.0
## Colombo (RPS)  143.0  247.0
##           Runs                  BF               
##            sum    mean count   sum     mean count
## Ground                                           
## Adelaide   326  32.600    10   584  58.4000    10
## Ahmedabad  642  40.125    16  1281  80.0625    16
## Auckland     5   5.000     1    13  13.0000     1
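
To see exactly what groupby and agg compute, here is a sketch on a five-row dataframe that is small enough to verify by hand. The numbers are made up and are not Tendulkar's actual scores.

```python
import pandas as pd

# Made-up scores at two grounds
sample = pd.DataFrame({'Ground': ['Karachi', 'Karachi', 'Lahore', 'Lahore', 'Lahore'],
                       'Runs': [15, 23, 41, 35, 14],
                       'BF': [24, 29, 90, 51, 30]})

# One row per ground, with the sum, mean and count of Runs
summary = sample.groupby('Ground')['Runs'].agg(['sum', 'mean', 'count'])

# ascending=False puts the ground with the highest mean first
ranked = summary.sort_values('mean', ascending=False)

print(ranked)
```

Karachi has 2 innings totalling 38 runs (mean 19) and Lahore has 3 innings totalling 90 runs (mean 30), so Lahore is listed first.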

2.6 Lambda operations

Lambda expressions allow you to create small anonymous functions that compute a value. We can then apply these 'lambda' functions to a series or to the columns of a dataframe.

# Python - Operations on list
a =[5,2,3,1,7]
b =[1,5,4,6,8]

# Create a lambda function to add 2 numbers
add=lambda x,y:x+y
# Add all elements of lists a and b
print(list(map(add,a,b)))

#or
#Element wise addition with map & lambda
print(list(map(lambda x,y: x+y,a,b)))
#Element wise subtraction
print(list(map(lambda x,y: x-y,a,b)))
#Element wise product
print(list(map(lambda x,y: x*y,a,b)))
# Exponentiating the elements of a list
print(list(map(lambda x: x**2,a)))

# Name the lambda 'total' rather than 'sum' to avoid shadowing the built-in sum()
total = lambda x, y : x + y
print(total(3,4))

# Using a lambda to compute squares
items = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x**2, items))   
print(squared)

## [6, 7, 7, 7, 15]
## [6, 7, 7, 7, 15]
## [4, -3, -1, -5, -1]
## [5, 10, 12, 6, 56]
## [25, 4, 9, 1, 49]
## 7
## [1, 4, 9, 16, 25]

2.7 Lambda operations on an entire column of a data frame

import pandas as pd
tendulkar=pd.read_csv('tendulkar1.csv',encoding = "ISO-8859-1")
tendulkar['4s']=pd.to_numeric(tendulkar['4s'])
tendulkar['4s'].apply(lambda x:4*x)
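
The apply() call above returns a new series and leaves the dataframe unchanged. As a sketch on a small made-up series, the example below shows the result and also the equivalent vectorized expression, which is usually the simpler choice for plain arithmetic.

```python
import pandas as pd

# Made-up counts of boundaries (4s) in four innings
fours = pd.Series([2, 4, 1, 5], name='4s')

# Apply a lambda to every element: runs scored from 4s
runs_apply = fours.apply(lambda x: 4 * x)

# The same computation written as vectorized arithmetic
runs_vector = fours * 4

print(runs_apply.tolist())
print(runs_vector.tolist())
```

apply() becomes more useful when the function is too complex to express as simple arithmetic on the whole column.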

2.8 Lambda operations to convert from Celsius to Fahrenheit

# Convert Celsius to Fahrenheit
Celsius = [39.2, 36.5, 37.3, 37.8]
Fahrenheit = list(map(lambda x: (float(9)/5)*x + 32, Celsius))
print(Fahrenheit)

## [102.56, 97.7, 99.14, 100.03999999999999]

2.9 Python functions

# Use the def keyword to define a function
def product(x,y):
    value=x*y
    return value

# Invoke the function
product(8,9)
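
Called from a script, product(8,9) computes the value but displays nothing unless the result is printed or stored. The sketch below stores the result and also shows a default argument and a docstring. The default y=1 is my own addition for illustration and is not part of the function above.

```python
def product(x, y=1):
    """Return the product of x and y; y defaults to 1 (an illustrative choice)."""
    return x * y

# Store the returned value and print it
result = product(8, 9)
print(result)

# Omit y to use the default, or pass arguments by name
single = product(5)
named = product(x=2, y=3)
print(single, named)
```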

3.1 Plotting a scatterplot

import matplotlib.pyplot as plt
# Scatter plot
import pandas as pd
tendulkar=pd.read_csv('tendulkar1.csv',encoding = "ISO-8859-1")

plt.scatter(tendulkar.BF,tendulkar.Runs)
# Set the title of plot
plt.suptitle("Tendulkar's Runs vs Balls faced", fontsize=20)
# Set x and y axis labels
plt.xlabel('Balls faced', fontsize=18)
plt.ylabel('Runs', fontsize=16)
plt.savefig('fig1.png', bbox_inches='tight')
#plt.show()

3.2 Plotting a histogram

# Histogram
import matplotlib.pyplot as plt
import pandas as pd
tendulkar=pd.read_csv('tendulkar1.csv',encoding = "ISO-8859-1")

plt.hist(tendulkar['Runs'])
plt.suptitle("Histogram of Tendulkar's Runs", fontsize=20)
# In a histogram the x axis shows the values (Runs) and the y axis shows their frequency
plt.xlabel('Runs', fontsize=18)
plt.ylabel('Frequency', fontsize=16)
plt.savefig('fig2.png', bbox_inches='tight')
#plt.show()