Exercise 3

Pandas library

  • Python library for data manipulation and analysis
  • Built on NumPy; plotting is delegated to matplotlib
  • Makes some common data analysis tasks very easy
  • Documentation: pandas.pydata.org
In [2]:
import pandas as pd

pd._version.get_versions()
Out[2]:
{'dirty': False,
 'error': None,
 'full-revisionid': '171c71611886aab8549a8620c5b0071a129ad685',
 'version': '0.25.1'}
In [3]:
import urllib.request
print(urllib.request.urlopen(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data").read().decode('ascii')[:100], '...')
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2, ...
In [49]:
print('...', urllib.request.urlopen(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names").read().decode('ascii')
      [2410:2620], '...', sep='\n')
...

7. Attribute Information:
   1. sepal length in cm
   2. sepal width in cm
   3. petal length in cm
   4. petal width in cm
   5. class: 
      -- Iris Setosa
      -- Iris Versicolour
      -- Iris Virginica

...
In [3]:
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                   header=None,
                   names=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Species'])   
In [4]:
type(df)
Out[4]:
pandas.core.frame.DataFrame

Pandas DataFrames

  • Two-dimensional data type organized in rows and columns
  • Essentially a matrix with labels and indexes
  • Many useful operations
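As a minimal sketch of what "a matrix with labels and indexes" means, the following builds a tiny DataFrame by hand. The column names, row labels, and values are made up for illustration and are not part of the Iris data.

```python
import pandas as pd

# Hypothetical toy data: two labeled columns, three labeled rows
toy = pd.DataFrame(
    {"length": [1.0, 2.0, 3.0], "width": [0.5, 0.7, 0.9]},
    index=["a", "b", "c"],
)

print(toy.shape)               # (number of rows, number of columns)
print(list(toy.columns))       # column labels
print(list(toy.index))         # row labels
print(toy.loc["b", "width"])   # look up one cell by row and column label
```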
In [5]:
df.head(5)
Out[5]:
Sepal Length Sepal Width Petal Length Petal Width Species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
In [6]:
df.head(5).values  ## extract numpy array
Out[6]:
array([[5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
       [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
       [4.7, 3.2, 1.3, 0.2, 'Iris-setosa'],
       [4.6, 3.1, 1.5, 0.2, 'Iris-setosa'],
       [5.0, 3.6, 1.4, 0.2, 'Iris-setosa']], dtype=object)
In [7]:
df['Petal Length']  ## extract entire columns by name
Out[7]:
0      1.4
1      1.4
2      1.3
3      1.5
4      1.4
      ... 
145    5.2
146    5.0
147    5.2
148    5.4
149    5.1
Name: Petal Length, Length: 150, dtype: float64
In [9]:
df.head(5).Species
Out[9]:
0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: Species, dtype: object
In [10]:
df.head(5)[['Species', 'Petal Length']]  ## extract more than one column
Out[10]:
Species Petal Length
0 Iris-setosa 1.4
1 Iris-setosa 1.4
2 Iris-setosa 1.3
3 Iris-setosa 1.5
4 Iris-setosa 1.4
In [11]:
df.T  ## Transpose with labels
Out[11]:
0 1 2 3 4 5 6 7 8 9 ... 140 141 142 143 144 145 146 147 148 149
Sepal Length 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
Sepal Width 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... 3.1 3.1 2.7 3.2 3.3 3 2.5 3 3.4 3
Petal Length 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... 5.6 5.1 5.1 5.9 5.7 5.2 5 5.2 5.4 5.1
Petal Width 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... 2.4 2.3 1.9 2.3 2.5 2.3 1.9 2 2.3 1.8
Species Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa ... Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica

5 rows × 150 columns

In [12]:
df.index  ## DataFrame rows are indexed
Out[12]:
RangeIndex(start=0, stop=150, step=1)
In [13]:
df.iloc[4]  ## Extract individual rows by integer position
Out[13]:
Sepal Length              5
Sepal Width             3.6
Petal Length            1.4
Petal Width             0.2
Species         Iris-setosa
Name: 4, dtype: object
In [14]:
df.iloc[[4, 17, 23]]
Out[14]:
Sepal Length Sepal Width Petal Length Petal Width Species
4 5.0 3.6 1.4 0.2 Iris-setosa
17 5.1 3.5 1.4 0.3 Iris-setosa
23 5.1 3.3 1.7 0.5 Iris-setosa
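With the default RangeIndex, row positions and index labels coincide, so the difference between positional and label-based access is easy to miss. The sketch below uses a hypothetical toy frame with a shuffled index to show that `iloc` selects by position while `loc` selects by label.

```python
import pandas as pd

# Hypothetical toy frame whose index labels differ from the row positions
toy = pd.DataFrame({"value": [10, 20, 30]}, index=[2, 0, 1])

by_position = toy.iloc[0]["value"]  # first row in storage order
by_label = toy.loc[0]["value"]      # row whose index label is 0

print(by_position, by_label)
```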
In [9]:
df.head(8)
Out[9]:
Sepal Length Sepal Width Petal Length Petal Width Species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
In [19]:
## Element-wise comparison yields a boolean Series
(df.head(8)['Sepal Length'] > 4.6)
Out[19]:
0     True
1     True
2     True
3    False
4     True
5     True
6    False
7     True
Name: Sepal Length, dtype: bool
In [20]:
df[~(df['Sepal Length'] < 5)].head(3) ## Boolean indexing
Out[20]:
Sepal Length Sepal Width Petal Length Petal Width Species
0 5.1 3.5 1.4 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
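To make the mechanics of boolean indexing explicit, here is a small sketch on made-up data (the frame `toy` and its values are assumptions for illustration). It shows that negating a mask with `~` is the same as writing the opposite comparison, and that each comparison needs its own parentheses when masks are combined.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Iris set
toy = pd.DataFrame({"length": [4.6, 5.0, 5.4, 4.9],
                    "width":  [3.1, 3.4, 3.9, 3.0]})

not_short = ~(toy["length"] < 5)   # negated mask
at_least_5 = toy["length"] >= 5    # equivalent direct comparison

# Parentheses are required: & binds more tightly than the comparisons
both = (toy["length"] >= 5) & (toy["width"] > 3.5)
selected = toy[both]               # keeps only rows where the mask is True

print(not_short.equals(at_least_5))
print(selected)
```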
In [24]:
df2 = df.copy()  ## Cloning DataFrames

all_species = list(df.Species.unique())
print(all_species)

## Adding columns         ## Transforming existing columns
df2['Class'] = df.Species.map(lambda s: all_species.index(s))
df2.tail(5)
['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
Out[24]:
Sepal Length Sepal Width Petal Length Petal Width Species Class
145 6.7 3.0 5.2 2.3 Iris-virginica 2
146 6.3 2.5 5.0 1.9 Iris-virginica 2
147 6.5 3.0 5.2 2.0 Iris-virginica 2
148 6.2 3.4 5.4 2.3 Iris-virginica 2
149 5.9 3.0 5.1 1.8 Iris-virginica 2
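The `list.index` lookup inside the lambda searches the list once per row. As an alternative sketch (not the approach used in this notebook), `pd.factorize` produces the same kind of integer encoding in one call, numbering labels in order of first appearance. The shortened label strings below are made up for illustration.

```python
import pandas as pd

# Hypothetical label column
species = pd.Series(["setosa", "virginica", "setosa", "versicolor"])

codes, uniques = pd.factorize(species)

print(codes.tolist())   # integer code per row
print(list(uniques))    # label belonging to each code
```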
In [25]:
df.describe().round(2) ## Quick summary statistics
Out[25]:
Sepal Length Sepal Width Petal Length Petal Width
count 150.00 150.00 150.00 150.00
mean 5.84 3.05 3.76 1.20
std 0.83 0.43 1.76 0.76
min 4.30 2.00 1.00 0.10
25% 5.10 2.80 1.60 0.30
50% 5.80 3.00 4.35 1.30
75% 6.40 3.30 5.10 1.80
max 7.90 4.40 6.90 2.50

Note: the 25th percentile is the value x such that (approximately) 25% of the data are $\leq$ x; the 50% and 75% rows are read the same way.
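The percentile note can be checked on a small example. This sketch uses a made-up Series and `Series.quantile`, which interpolates linearly between data points by default, so in general the share of data at or below the reported value is only approximately 25%.

```python
import pandas as pd

# Hypothetical values, chosen so the arithmetic is easy to follow
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])

q25 = values.quantile(0.25)             # value reported as "25%" by describe()
share_below = (values <= q25).mean()    # fraction of entries at or below it

print(q25, share_below)
```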

In [29]:
df.groupby("Species").mean().T
Out[29]:
Species Iris-setosa Iris-versicolor Iris-virginica
Sepal Length 5.006 5.936 6.588
Sepal Width 3.418 2.770 2.974
Petal Length 1.464 4.260 5.552
Petal Width 0.244 1.326 2.026
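As a sketch of what `groupby(...).mean()` computes, the toy example below (group names and values are made up) splits the rows by a key column and averages each group separately.

```python
import pandas as pd

# Hypothetical toy data with two groups
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0],
})

means = toy.groupby("group")["value"].mean()  # one mean per group label
print(means)
```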
In [41]:
df2['dot_color'] = df2.Species.map(lambda s: {'Iris-setosa': 'red', 
                                           'Iris-virginica': 'green',
                                           'Iris-versicolor': 'blue'}[s])
## Quick plotting
df2.plot('Petal Width', 'Sepal Length', kind='scatter', c=df2.dot_color, legend=True);

More plots: "Box-and-whisker" plot

In [42]:
df.groupby('Species').plot.box();

LMS with Iris data

Create subset for 1d-regression

In [45]:
## combine multiple boolean Series with | (or), & (and)
subset = df2[(df2.Species == 'Iris-setosa') | (df2.Species == 'Iris-virginica')]
## alternative: negate with ~
# subset = df2[~(df2.Species == 'Iris-versicolor')]
subset = subset[['Sepal Length', 'Class']]
subset
Out[45]:
Sepal Length Class
0 5.1 0
1 4.9 0
2 4.7 0
3 4.6 0
4 5.0 0
... ... ...
145 6.7 2
146 6.3 2
147 6.5 2
148 6.2 2
149 5.9 2

100 rows × 2 columns

Prepare data for regression

In [46]:
## make separate vectors for x and y, and shift the class from {0, 2} to {-1, 1}
import matplotlib.pyplot as plt  ## required for plt.scatter below
x = subset['Sepal Length']
y = subset['Class'] - 1
plt.scatter(x, y);
In [48]:
## Put everything in one matrix, with constant column prepended
import numpy as np
examples = np.vstack([np.ones_like(x), x, y])
examples = examples.T

print(examples.shape)
(100, 3)
In [51]:
examples[:5]
Out[51]:
array([[ 1. ,  5.1, -1. ],
       [ 1. ,  4.9, -1. ],
       [ 1. ,  4.7, -1. ],
       [ 1. ,  4.6, -1. ],
       [ 1. ,  5. , -1. ]])

The LMS Algorithm

We implement the basic LMS (least mean squares) algorithm, using the residual sum of squares (RSS) to measure the error.

Design decisions:

  • Store examples in a (n, p+1)-matrix
  • Last column is the response variable (class)
  • Omit convergence check
$$ \mathrm{RSS}(w) = \sum_{i=1}^n (y_i - \hat{y}(x_i))^2 = \sum_{i=1}^n (y_i - w^T x_i)^2 $$
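The per-example update used by LMS can be motivated from this formula. As a sketch, assuming the common convention of scaling the single-example squared error by 1/2 (a convention not stated in this notebook) and writing $\eta$ for the learning rate, one gradient step on example $(x_i, y_i)$ is:

$$ E_i(w) = \tfrac{1}{2}\,(y_i - w^T x_i)^2, \qquad \nabla_w E_i(w) = -(y_i - w^T x_i)\,x_i $$

$$ w \leftarrow w - \eta\,\nabla_w E_i(w) = w + \eta\,(y_i - w^T x_i)\,x_i $$

This has the same form as the update `w += eta * error * x`, with `error` playing the role of $y_i - w^T x_i$.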
In [54]:
## using numpy:
y_true = np.array([1,   -1,   1,  1])
y_pred = np.array([.8, -.5, -.4, .6])
y_true - y_pred
Out[54]:
array([ 0.2, -0.5,  1.4,  0.4])
In [55]:
(y_true - y_pred)**2
Out[55]:
array([0.04, 0.25, 1.96, 0.16])
In [56]:
((y_true - y_pred)**2).sum()
Out[56]:
2.4099999999999997
In [57]:
def rss(examples, w):
    """Compute the residual sum of squares for a linear model.
       
       Arguments:
       examples -- (n, p + 1)-matrix of predictors and response
       w        -- p-vector of linear model weights
    """
    x = examples[:,:-1]    ### First p columns: shape = (n, p)
    y = examples[:,-1]     ### Last column: shape = (n,)
    y_pred = w.dot(x.T)    ### Note: w.dot(x.T) has shape = (n,)
    squared_residuals = (y - y_pred)**2
    return squared_residuals.sum()
In [59]:
def lms(examples, eta, iterations, print_every=1000):
    np.random.seed(2)
    rows, columns = examples.shape
    p = columns - 1 ### last column is the response variable
    w = np.random.uniform(low=-1.0, high=1.0, size=p)
    
    for iteration in range(iterations):
        rand = np.random.randint(0, rows)  ### select random index
        x = examples[rand,:-1]   ### Everything but the last column
        c = examples[rand,-1]    ### The response value of the chosen example
        
        y = w.dot(x)
        error = c - y            ### Error in the single chosen example
        w += (eta * error * x) 
        
        if iteration % print_every == 0 or iteration == (iterations-1):
            err = rss(examples, w)
            print(f"Iteration: {iteration} RSS: {err:.2f}")
    return w

Fitting the Model

In [60]:
%%time
w = lms(examples, eta=0.05, iterations=10000)
Iteration: 0 RSS: 188.48
Iteration: 1000 RSS: 5033.94
Iteration: 2000 RSS: 212.52
Iteration: 3000 RSS: 856.83
Iteration: 4000 RSS: 170.41
Iteration: 5000 RSS: 583.72
Iteration: 6000 RSS: 732.37
Iteration: 7000 RSS: 468.00
Iteration: 8000 RSS: 45.65
Iteration: 9000 RSS: 45.88
Iteration: 9999 RSS: 42.48
CPU times: user 514 ms, sys: 30.1 ms, total: 544 ms
Wall time: 503 ms
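The following is a minimal, self-contained sketch of the same update rule on synthetic, noise-free data, where the correct weights are known in advance. The function name `lms_sketch`, the data, and the parameter values are assumptions chosen for illustration; it is not a replacement for `lms()` above.

```python
import numpy as np

def lms_sketch(X, y, eta, iterations, seed=0):
    """Stochastic LMS updates on rows of X; returns the final weights."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(iterations):
        i = rng.integers(0, X.shape[0])   # pick one example at random
        error = y[i] - w.dot(X[i])        # residual on that example
        w += eta * error * X[i]           # same update rule as lms() above
    return w

# Synthetic, noise-free data: y = 1 + 2 * x with x in [0, 1]
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=200)
X = np.column_stack([np.ones_like(x), x])  # constant column prepended
y = 1.0 + 2.0 * x

w = lms_sketch(X, y, eta=0.1, iterations=5000)
print(w)   # close to [1, 2] for this well-scaled problem
```

One general property of this update rule, stated here without diagnosing the run above: a single update multiplies the residual on the chosen example by (1 - eta * ||x||^2), so a step overshoots when eta * ||x||^2 exceeds 2. In the sketch, ||x||^2 is at most 2, so eta=0.1 is comfortably small.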

Examining our regression model

In [61]:
print(w, rss(examples, w))
[-5.64992028  1.03311639] 42.48252281358642
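Because LMS is stochastic, its RSS depends on where the iteration happens to stop. As a reference point, the RSS-minimizing weights of a linear model can also be computed in closed form. The sketch below uses `np.linalg.lstsq`, which this notebook does not otherwise use, on a synthetic stand-in for the `examples` matrix so that it runs without downloading data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the (n, p+1) matrix: constant, predictor, response
x = rng.uniform(4.0, 8.0, size=50)
y = np.where(x > 6.0, 1.0, -1.0)
examples = np.column_stack([np.ones_like(x), x, y])

X, target = examples[:, :-1], examples[:, -1]

# Least-squares weights: the minimizer of the RSS for this linear model
w_exact, *_ = np.linalg.lstsq(X, target, rcond=None)
rss_exact = ((target - X @ w_exact) ** 2).sum()

# Any other weight vector has an RSS at least as large
w_other = w_exact + np.array([0.1, -0.05])
rss_other = ((target - X @ w_other) ** 2).sum()

print(w_exact, rss_exact, rss_other)
```

Applied to the real `examples` matrix, the same call would give a lower bound against which the RSS reported by `lms()` can be compared.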
In [182]:
line_x = np.array([min(examples[:,1]), max(examples[:,1])])
line_y = w.dot(np.array([np.ones_like(line_x), line_x]))

print('w =', w)
print('line_x =', line_x)
print('line_y =', line_y)
w = [-5.64992028  1.03311639]
line_x = [4.3 7.9]
line_y = [-1.20751981  2.51169918]

Visualizing the line of best fit

In [171]:
plt.scatter(examples[:,1], examples[:,2])
plt.xlabel("Sepal Length")
plt.ylabel("Species");
plt.ylim((-1.1,1.1))
plt.axhline(0, c='k', ls='--')
plt.axvline(-w[0]/w[1], c='g', ls='--')
plt.plot(line_x, line_y, 'r');
In [172]:
(np.sign(w.dot(examples[:,:2].T)) == examples[:,2]).sum()
Out[172]:
94
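The count above treats the sign of the regression output as the predicted class. A small sketch with hypothetical weights and points shows the pieces involved: the decision boundary at `-w[0]/w[1]`, the sign-based prediction, and the resulting number of correct classifications.

```python
import numpy as np

# Hypothetical weights (intercept, slope) and a few labeled points
w = np.array([-6.0, 1.0])
x = np.array([5.0, 5.5, 6.5, 7.0])
labels = np.array([-1.0, 1.0, 1.0, 1.0])

X = np.column_stack([np.ones_like(x), x])  # constant column prepended
predicted = np.sign(X @ w)                 # -1 below the boundary, +1 above
boundary = -w[0] / w[1]                    # x where w[0] + w[1] * x == 0

n_correct = int((predicted == labels).sum())
accuracy = n_correct / len(labels)
print(boundary, n_correct, accuracy)
```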

Seaborn: Statistical Data Visualization

In [66]:
import seaborn as sb
In [154]:
## Just put in a DataFrame object,
## seaborn does the right thing automatically:
sb.pairplot(df, hue='Species', height=3, diag_kind='hist');
In [174]:
## lmplot() fits and plots a linear regression automatically
sb.lmplot(x='Sepal Width', y='Petal Length', data=df[~(df.Species == 'Iris-setosa')]);