# Lab 7: Data Preparation – Part 1

## Working with Missing Values, Noisy Data and Detecting Outliers

For this Tutorial, you need to install the following libraries:

```
* matplotlib
* sklearn
```

You will also need to download the oil price data from here. We start by importing the most important libraries.

### Understanding the data

Before you can work on the data, you will need to upload the file on Google Colab (if you are using it). After that, you can read the csv as follows:

```
# TODO: read the csv file pid
df_oil = pd.read_csv('oil.csv', header = 0,
quotechar = '"',sep = ",",
na_values = ['na', '-', '.', ''])
```

The table contains two columns (date and dcoilwtico). ‘dcoilwtico’ contains missing values. Let us display the dataframe

We can also visualize the data using the matplotlib library

```
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
'''
TODO: define a variable X to hold the value {0, 1, 2, ..., len(df_oil) - 1}
Y = the values in the dcoilwtico
Plot Y against X as a scatter plot.
'''
```

Recall that summary statistics and arithmetic with missing data is natively supported by pandas.The mean is computed normally and missing values are ignored:

```
'''
TODO: Compute and print the mean value of dcoilwtico using a function from the numpy library
'''
```

Min, Max, STD and Variance all work even when values are missing:

### Handling the missing values

First, deleting the records that contain missing values

Pandas has a function that allows deleting the rows in a dataframe (or elements in a series) that contain missing values.

However, very often you may wish to fill in those missing values rather than simply dropping them. Of course, pandas also has that functionality. For example, we could fill missing values with a constant. **We do this for exercise only but you will learn later about more sophisticated techniques for data imputation.**

That actually works with any data type.

Moreover, we can use this functionality to fill in the gaps with the average value computed across the non-missing values.

Since the values of dcoilwtico represent a time series reading, we can fill the missing values with the closest non-missing value before or after. We can do that in Python using the flags *pad* / *ffill* and *bfill* / *backfill*

```
'''
TODO: fill the missing values in dcoilwtico with the values in the previous/furture records (no limit)
'''
```

We can set a limit if we only want to replace consecutive gaps.

#### Arithmetic operations on data with missing values

We can also perform element-wise arithmetic operations between series with missing data.

Remember: by definition, the result of any operation that involves missing values is NaN.

### Handling the noisy data

In this exercise, we will apply the techniques that we learned about handling noisy data using the benning technique. First, we sort the data and partition it into (equal-frequency) bins. Then data can be smoothed by bin means, bin median, or bin boundaries.

### Outlier detection

For this exercise, you will use the Pima Indians Diabetes Database.

```
df_pid = pd.read_csv('diabetes.csv', header = 0,
quotechar = '"',sep = ",",
na_values = ['na', '-', '.', ''])
df_pid
```

We start by detecting the outliers using a statistical method. We assume that the values are extracted from a normal distribution. We estimate the parameters of the normal distribution from the available data and use the condition that a value is considered outlier if the distance from the value to the mean is greater than 3 times the standard deviation.

Now, we use the k-nearest neighbours method to detect the outliers

Now, we use the LOF method to detect the outliers