# Back to basics with Pandas

In the spirit of our last post, let’s take things back to basics with a simple end-to-end data analysis example.

# Back to basics with `pandas`, `plotnine`, and `sklearn`

In this colab we do a quick review of the ‘data science basics’:

- Download a (small) dataset from an online repository.
- Use `pandas` to store and manipulate the data.
- Use `plotnine` to generate plots/visualizations.
- Use `scikit-learn` to fit a simple logistic regression model.

```
# @title Imports
import numpy as np
import pandas as pd
import plotnine as gg
import requests
import sklearn.linear_model  # Import the submodule explicitly so it is accessible below.
```

## Let there be data

Let’s fetch the famous Iris dataset from the UCI machine learning datasets repository.

```
# The URL for the raw data in text form.
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
# Column names/metadata.
feature_columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
target_column = 'target'
```

```
# Fetch the raw data from the website.
response = requests.get(url=data_url)
rows = response.text.split('\n')
data = [r.split(',') for r in rows if r]  # Skip any blank lines at the end of the file.
# Turn it into a dataframe and drop any null values.
df = pd.DataFrame(data, columns=feature_columns + [target_column])
df.dropna(inplace=True)
# Fix up the data types: features are numeric, targets are categorical.
df[feature_columns] = df[feature_columns].astype('float')
df[target_column] = df[target_column].astype('category')
```
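As an aside, `pandas` can fetch and parse the raw CSV in a single step; here's a roughly equivalent sketch (not the original notebook's code), reusing the URL and column names defined above:

```
# read_csv accepts a URL directly; the file has no header row, so we supply names.
df = pd.read_csv(data_url, header=None,
                 names=feature_columns + [target_column]).dropna()
df[target_column] = df[target_column].astype('category')
```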

## Take a quick look at the data

```
# Look at the raw data to check it looks sensible.
df.head()
```

|   | sepal_length | sepal_width | petal_length | petal_width | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |

```
# Look at some summary statistics.
df.describe()
```

|       | sepal_length | sepal_width | petal_length | petal_width |
|-------|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean  | 5.843333 | 3.054000 | 3.758667 | 1.198667 |
| std   | 0.828066 | 0.433594 | 1.764420 | 0.763161 |
| min   | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25%   | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50%   | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75%   | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max   | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
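The pooled statistics above mix all three species together; a quick per-class summary via `groupby` (our own addition, not a cell from the original notebook) makes the class structure more visible:

```
# Mean of each feature, broken out by species.
df.groupby('target')[feature_columns].mean()
```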

```
# 6.1 KB ... now that's some #bigdata ;)
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length 150 non-null float64
sepal_width 150 non-null float64
petal_length 150 non-null float64
petal_width 150 non-null float64
target 150 non-null category
dtypes: category(1), float64(4)
memory usage: 6.1 KB
```

```
# Let's take a visual look at the data now, projecting down to 2 dimensions.
gg.theme_set(gg.theme_bw())
p = (gg.ggplot(df)
     + gg.aes(x='sepal_length', y='petal_length', color='target')
     + gg.geom_point(size=2)
     + gg.ggtitle('Iris dataset is not linearly separable')
)
p
```

## Fitting a simple linear model

Let’s get ‘old school’ and fit a simple logistic regression classifier using the raw features.

In the binary case, this is a linear probabilistic model with Bernoulli likelihood given by

$$p(y = 1 \mid x) = \sigma(w^\top x + b),$$

where $\sigma(z) = 1 / (1 + e^{-z})$ is the logistic (sigmoid) function.

Assuming the dataset is iid, the negative log-likelihood yields the familiar cross-entropy loss:

$$\mathcal{L}(w, b) = -\sum_{i=1}^{N} \left[ y_i \log \sigma(w^\top x_i + b) + (1 - y_i) \log\bigl(1 - \sigma(w^\top x_i + b)\bigr) \right].$$
This is a convex, differentiable objective that we can optimize with gradient-based methods. We'll defer the optimization to `scikit-learn` (which also does some other tricks, plus regularization).
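To make the objective concrete, here is a minimal NumPy sketch of the binary cross-entropy loss (the function and variable names are our own, not part of the original notebook; it reuses the `np` import from above):

```
def sigmoid(z):
    # Logistic function; maps real-valued logits to probabilities in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(w, b, X, y):
    # Negative log-likelihood of a Bernoulli model with p_i = sigmoid(w . x_i + b).
    p = sigmoid(X @ w + b)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```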

```
# Create a logistic regression model.
model = sklearn.linear_model.LogisticRegression(solver='lbfgs',
                                                multi_class='multinomial')
# Extract the features and targets as numpy arrays.
X = df[feature_columns].values
y = df[target_column].cat.codes  # Note that we 'code' the categories as integers.
# Fit the model.
model.fit(X, y)
```
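Since logistic regression is a probabilistic model, the fitted estimator can also report per-class probabilities; a quick illustrative check (our own addition):

```
# Class probabilities for the first training example; each row sums to 1.
print(model.predict_proba(X[:1]))
```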

```
# Compute the model accuracy over the training set (we don't have a test set).
pred_df = df.copy()
pred_df['prediction'] = model.predict(X).astype('int8')
pred_df['correct'] = pred_df.prediction == pred_df.target.cat.codes
accuracy = pred_df.correct.sum() / len(pred_df)
print('Accuracy: {:.3f}'.format(accuracy))
```

```
Accuracy: 0.973
```
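For reference, `scikit-learn` estimators expose a `score` method that computes the same mean training accuracy in one line:

```
# Should print the same number as the manual computation above.
print('Accuracy: {:.3f}'.format(model.score(X, y)))
```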

```
# Let's look again at the data, highlighting the examples which we mis-classified.
p = (gg.ggplot(pred_df)
     + gg.aes(x='sepal_length', y='petal_length', shape='target', color='correct')
     + gg.geom_point(size=2)
)
p
```