Data preparation (1)
Working with missing values, and noisy data
For this Tutorial, you need to install the following libraries:
* matplotlib
* sklearn
You will also need to download the oil price data from here. We start by importing the most important libraries.
Understanding the data
Before you can work on the data, you will need to upload the file on Google Colab (if you are using it). After that, you can read the csv as follows:
# TODO: read the csv file pid
df_oil = pd.read_csv('oil.csv', header = 0,
quotechar = '"',sep = ",",
na_values = ['na', '-', '.', ''])
The table contains two columns (date and dcoilwtico). ‘dcoilwtico’ contains missing values. Let us display the dataframe
We can also visualize the data using the matplotlib library
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
'''
TODO: define a variable X to hold the value {0, 1, 2, ..., len(df_oil) - 1}
Y = the values in the dcoilwtico
Plot Y against X as a scatter plot.
'''
Recall that summary statistics and arithmetic with missing data is natively supported by pandas.The mean is computed normally and missing values are ignored:
'''
TODO: Compute and print the mean value of dcoilwtico using a function from the numpy library
'''
Min, Max, STD and Variance all work even when values are missing:
Incomplete data
First, deleting the records that contain missing values
Pandas has a function that allows deleting the rows in a dataframe (or elements in a series) that contain missing values.
However, very often you may wish to fill in those missing values rather than simply dropping them. Of course, pandas also has that functionality. For example, we could fill missing values with a constant. We do this for exercise only but you will learn later about more sophisticated techniques for data imputation.
That actually works with any data type.
Moreover, we can use this functionality to fill in the gaps with the average value computed across the non-missing values.
Since the values of dcoilwtico represent a time series reading, we can fill the missing values with the closest non-missing value before or after. We can do that in Python using the flags pad / ffill and bfill / backfill
'''
TODO: fill the missing values in dcoilwtico with the values in the previous/furture records (no limit)
'''
We can set a limit if we only want to replace consecutive gaps.
We can also perform element-wise arithmetic operations between series with missing data.
Remember: by definition, the result of any operation that involves missing values is NaN.
Noisy data
In this exercise, we will apply the techniques that we learned about handling noisy data using the binning technique. First, we sort the data and partition it into (equal-frequency) bins. Then data can be smoothed by bin means, bin median, or bin boundaries.
Outlier detection
For this exercise, you will use the Pima Indians Diabetes Database.
df_pid = pd.read_csv('diabetes.csv', header = 0,
quotechar = '"',sep = ",",
na_values = ['na', '-', '.', ''])
df_pid
We start by detecting the outliers using a statistical method. We assume that the values are extracted from a normal distribution. We estimate the parameters of the normal distribution from the available data and use the condition that a value is considered outlier if the distance from the value to the mean is greater than 3 times the standard deviation.
Now, we use the k-nearest neighbours method to detect the outliers
Now, we use the LOF method to detect the outliers