Lab 7: Data Preparation – Part 1

Working with Missing Values, Noisy Data and Detecting Outliers

For this tutorial, you need to install the following libraries:

* matplotlib
* scikit-learn (imported as sklearn)

You will also need to download the oil price data from here. We start by importing the most important libraries.

Understanding the data

import pandas as pd
import numpy as np

Before you can work on the data, you will need to upload the file on Google Colab (if you are using it). After that, you can read the csv as follows:

# read the csv file into a dataframe
df_oil = pd.read_csv('oil.csv', header = 0,
                 quotechar = '"',sep = ",",
                 na_values = ['na', '-', '.', ''])

The table contains two columns (date and dcoilwtico); dcoilwtico contains missing values. Let us display the dataframe:

'''
TODO: display the dataframe
'''

We can also visualize the data using the matplotlib library

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
'''
TODO: define a variable X to hold the value {0, 1, 2, ..., len(df_oil) - 1}
Y = the values in the dcoilwtico
Plot Y against X as a scatter plot. 
'''
'''
TODO: 
Plot the first 200 values of Y as a scatter plot. 
'''
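As a sketch of what the plotting step might look like, here is one possibility using a tiny synthetic frame in place of the real df_oil (the Agg backend is selected only so the sketch runs without a display):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# tiny synthetic stand-in for df_oil; the real exercise uses oil.csv
df_oil = pd.DataFrame({"dcoilwtico": [93.1, np.nan, 92.9, 93.5, 92.6]})

x = np.arange(len(df_oil))       # 0, 1, ..., len(df_oil) - 1
y = df_oil["dcoilwtico"]

plt.scatter(x, y)                # missing values are simply not drawn
plt.xlabel("record index")
plt.ylabel("dcoilwtico")

# for the second TODO, restrict both series to the first 200 records:
# plt.scatter(x[:200], y[:200])
```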

Recall that summary statistics and arithmetic with missing data are natively supported by pandas. The mean is computed normally, and missing values are ignored:

'''
TODO: Compute and print the mean value of dcoilwtico
''' 
'''
TODO: Compute and print the mean value of dcoilwtico using a function from the numpy library
''' 
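For example (a tiny synthetic series in place of dcoilwtico), pandas skips NaN by default while plain numpy arrays propagate it unless you use the NaN-aware variant:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(s.mean())                   # pandas skips NaN: 2.0
print(np.mean(s.to_numpy()))      # plain numpy propagates NaN: nan
print(np.nanmean(s.to_numpy()))   # numpy's NaN-aware variant: 2.0
```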

Min, Max, STD and Variance all work even when values are missing:

'''
TODO: Compute and display the min, max, standard deviation and variance of dcoilwtico
''' 
'''
TODO: Compute and display the min, max, standard deviation and variance of dcoilwtico
    using the functions from numpy. Compare the values with those from the previous 
    exercise. What is different and why?
''' 
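The comparison in the second TODO hinges on the degrees of freedom: pandas std and var use ddof=1 (sample statistics), while numpy's nanstd and nanvar default to ddof=0 (population statistics). A sketch with a synthetic series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0, 3.0, 4.0])
vals = s.to_numpy()

print(s.min(), s.max())                  # NaN is ignored: 1.0 4.0
print(s.std(), s.var())                  # pandas: ddof=1 (sample)
print(np.nanstd(vals), np.nanvar(vals))  # numpy: ddof=0 (population)
print(np.nanvar(vals, ddof=1))           # matches pandas when ddof=1
```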

Handling the missing values

First, deleting the records that contain missing values

Pandas has a function that allows deleting the rows in a dataframe (or elements in a series) that contain missing values.

'''
TODO: create a new dataframe from the df_oil after deleting the records with missing values
'''
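The function in question is dropna; a minimal sketch on a synthetic frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"date": ["d1", "d2", "d3"],
                   "dcoilwtico": [93.1, np.nan, 92.9]})

df_clean = df.dropna()   # drops every row that contains a missing value
print(df_clean)
```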

However, very often you may wish to fill in those missing values rather than simply dropping them. Of course, pandas also has that functionality. For example, we could fill missing values with a constant. We do this for exercise only but you will learn later about more sophisticated techniques for data imputation.

'''
TODO: replace the missing value with -99 in attribute dcoilwtico
'''
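With fillna, a constant fill might look like this (synthetic series again):

```python
import numpy as np
import pandas as pd

s = pd.Series([93.1, np.nan, 92.9])
filled = s.fillna(-99)   # replace every missing value with the constant -99
print(filled)
```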

That actually works with any data type.

'''
TODO: fill the missing values with the string 'unknown' in attribute dcoilwtico
'''

Moreover, we can use this functionality to fill in the gaps with the average value computed across the non-missing values.

'''
TODO: replace the missing values with the average of the non-missing values
'''
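Combining fillna with the mean, which pandas computes over the non-missing values only:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
filled = s.fillna(s.mean())   # mean of the non-missing values is 2.0
print(filled)
```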

Since the values of dcoilwtico represent a time series, we can fill each missing value with the closest non-missing value before or after it. We can do that in pandas using the methods ffill (also known as pad) and bfill (also known as backfill).

'''
TODO: fill the missing values in dcoilwtico with the values in the previous/future records (no limit)
'''
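A sketch of both fill directions; recent pandas versions prefer the ffill/bfill methods over fillna(method=...):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

print(s.ffill())   # propagate the last valid value forward:  1, 1, 1, 4
print(s.bfill())   # propagate the next valid value backward: 1, 4, 4, 4
```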

We can set a limit if we only want to fill a bounded number of consecutive gaps.

'''
TODO: fill the missing values in dcoilwtico with the value in the next record 
(the value of a record can be used in the previous record only)
'''
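With limit=1, each value is propagated at most one step, so only the gap adjacent to a valid record is filled:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
limited = s.bfill(limit=1)   # only the NaN next to 4.0 is filled
print(limited)
```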

Arithmetic operations on data with missing values

We can also perform element-wise arithmetic operations between series with missing data.

Remember: by definition, the result of any operation that involves missing values is NaN.

'''
TODO: Perform element-wise addition between the values in the original dcoilwtico 
and the series filled with the values from previous records. Use the addition operator (+)
'''
'''
TODO: Find the function in the numpy library that performs 
element-wise addition between two arrays and use it. 
Compare the results with the results from the previous exercise. 
'''
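A sketch of both additions on a synthetic series; the + operator and numpy's element-wise np.add behave the same way here, and NaN propagates wherever either operand is missing:

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = a.ffill()        # forward-filled copy: 1.0, 1.0, 3.0

print(a + b)         # 2.0, NaN, 6.0 -- the NaN in a propagates
print(np.add(a, b))  # same result with numpy's element-wise addition
```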

Handling the noisy data

In this exercise, we will apply the techniques that we learned for handling noisy data using the binning technique. First, we sort the data and partition it into (equal-frequency) bins. Then the data can be smoothed by bin means, bin medians, or bin boundaries.

'''
TODO: 
    1. extract the first 200 values from the attribute dcoilwtico and sort the values
    2. partition the data into different numbers of bins (10, 20, 25 and 40)
    3. for every partitioning, smooth the values using 
        the three different techniques and plot the smoothed curve
'''
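The smoothing step could be sketched as below; the function name and the small example data are illustrative, not part of the lab (the exercise uses the first 200 dcoilwtico values):

```python
import numpy as np

def smooth_by_bins(values, n_bins, method="mean"):
    """Sort, split into equal-frequency bins, and smooth each bin."""
    vals = np.sort(np.asarray(values, dtype=float))
    smoothed = []
    for b in np.array_split(vals, n_bins):
        if method == "mean":
            smoothed.extend([b.mean()] * len(b))
        elif method == "median":
            smoothed.extend([np.median(b)] * len(b))
        elif method == "boundaries":
            # snap each value to the nearer of the two bin boundaries
            lo, hi = b[0], b[-1]
            smoothed.extend([lo if v - lo <= hi - v else hi for v in b])
    return np.array(smoothed)

# textbook-style example: 12 sorted values, 3 bins of 4 values each
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bins(data, 3, "mean"))        # smoothing by bin means
print(smooth_by_bins(data, 3, "boundaries"))  # smoothing by bin boundaries
```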

Outlier detection

For this exercise, you will use the Pima Indians Diabetes Database.

df_pid = pd.read_csv('diabetes.csv', header = 0,
                 quotechar = '"',sep = ",",
                 na_values = ['na', '-', '.', ''])
df_pid
The statistical approach

We start by detecting the outliers using a statistical method. We assume that the values are drawn from a normal distribution, estimate its parameters from the available data, and consider a value an outlier if its distance from the mean is greater than 3 times the standard deviation.

'''
TODO: Ignore the label column. 
    1. assume that the values in each column are drawn from a normal distribution, then estimate the parameters
    2. report the outliers in each column using the statistical method
    3. plot the data after highlighting the outliers (normal in blue, outliers in red)
'''
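The 3-sigma rule on a single column could be sketched like this, using synthetic data with one injected outlier in place of the real diabetes columns:

```python
import numpy as np

rng = np.random.default_rng(0)
col = rng.normal(loc=100, scale=10, size=500)
col = np.append(col, 250.0)           # inject one obvious outlier

mu, sigma = col.mean(), col.std()     # estimate the normal parameters
is_outlier = np.abs(col - mu) > 3 * sigma
print(col[is_outlier])

# for the plot: draw points where is_outlier is True in red, the rest in blue
```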
Outlier detection using KNN

Now, we use the k-nearest neighbours method to detect the outliers

from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
%matplotlib inline
'''
TODO: 
    1. define a function to find the outliers and 
    2. plot the data after highlighting the outliers
'''
'''
TODO: 
    use different values for the parameters and compare the results 
'''
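One common variant scores each point by its mean distance to the k nearest neighbours and flags large scores; the function name, threshold value, and synthetic data below are illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X = np.vstack([X, [[8.0, 8.0]]])      # inject one far-away point

def knn_outliers(X, k=5, threshold=2.0):
    # k + 1 because each point is returned as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    score = dist[:, 1:].mean(axis=1)  # mean distance to the k real neighbours
    return score > threshold

mask = knn_outliers(X, k=5, threshold=2.0)
print(np.where(mask)[0])
```

Varying k and the threshold (the second TODO) changes how local or global the detected outliers are.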
Outlier detection using Local Outlier Factor

Now, we use the LOF method to detect the outliers

'''
TODO: Apply LOF to find outliers in the data 
'''
from sklearn.neighbors import LocalOutlierFactor
%matplotlib inline
'''
TODO: 
    use different values for the parameters and compare the results. 
    Which set of parameters gives reasonable results? 
'''
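A sketch of LOF on the same kind of synthetic data; n_neighbors and contamination are the parameters to vary in the TODO:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X = np.vstack([X, [[8.0, 8.0]]])     # inject one far-away point

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)          # -1 marks outliers, 1 marks inliers
print(np.where(labels == -1)[0])
```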