Lab 8: Data Preparation – Part 2
Data Transformation, Data Integration and Data Reduction
It is highly recommended to install py_stringmatching to compute the similarities for the data integration part. The source code is available on GitHub. We assume that you work on Google Colab. If you are using a different IDE, pay attention to the parts that need modification before you write your code.
For most of the parts, we are going to use the Pima Indians Diabetes Database. We will also use the preprocessing module from the sklearn package to perform most of the tasks.
The solution assumes that you downloaded the data from Kaggle and uploaded your data to Google Colab. If you are running the code on your own machine, then you should change the location of the data accordingly.
!pip install py_stringmatching
Now, we import the required libraries that will be used to solve the exercises.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler as ss
from sklearn.preprocessing import MinMaxScaler as mms
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
Data Transformation (Normalization or Standardization)
We begin by defining a pandas dataframe that contains some cells with missing values. Note that pandas, in addition to allowing us to create dataframes from a variety of files, also supports explicit declaration.
Read the data into a dataframe.
df_pid = pd.read_csv('sample_data/diabetes.csv', header = 0,
                     quotechar = '"', sep = ",",
                     na_values = ['na', '-', '.', ''])
df_pid
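To make the difference between standardization and min-max normalization concrete, here is a minimal sketch using the two scalers from sklearn.preprocessing. The toy values and the df_toy name are invented for illustration and are not taken from diabetes.csv; the same calls should work on the numeric columns of df_pid once missing values are handled.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy values, invented for illustration (not read from diabetes.csv)
df_toy = pd.DataFrame({
    "Glucose": [85.0, 120.0, 150.0, 99.0, 183.0],
    "BMI": [26.6, 33.6, 23.3, 28.1, 43.1],
})

# Standardization: each column is rescaled to zero mean and unit variance
standardized = pd.DataFrame(
    StandardScaler().fit_transform(df_toy), columns=df_toy.columns
)

# Min-max normalization: each column is rescaled to the range [0, 1]
normalized = pd.DataFrame(
    MinMaxScaler().fit_transform(df_toy), columns=df_toy.columns
)

print(standardized.round(3))
print(normalized.round(3))
```

Standardization keeps the shape of each distribution but expresses values in units of standard deviation, while min-max normalization pins the smallest value to 0 and the largest to 1, which makes it more sensitive to outliers.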
Data integration
We will use the py_stringmatching library because it implements a set of similarity measures. Since we have not used the GAP similarity in the lab before, we start with a simple exercise that uses it from the py_stringmatching library.
# After installing the library, you can import the similarity measures
from py_stringmatching import similarity_measure as sm
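Assuming that "GAP similarity" refers to the affine gap measure, the sketch below shows what such a measure computes, written in plain Python so it runs without the library. The function name affine_gap_score and the default parameters (a cost of 1 for opening a gap, 0.5 for extending it, and a score of 1 for matching characters) are assumptions made for illustration; this is not the library's own code. In py_stringmatching the corresponding measure is, to our knowledge, exposed as the Affine class with a get_raw_score method.

```python
def affine_gap_score(s1, s2, gap_start=1.0, gap_cont=0.5):
    """Illustrative affine gap similarity via dynamic programming."""
    neg_inf = float("-inf")
    n, m = len(s1), len(s2)
    # M: s1[i-1] is aligned with s2[j-1]
    # X: s1[i-1] is aligned with a gap
    # Y: s2[j-1] is aligned with a gap
    M = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    X = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    Y = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -gap_start - (i - 1) * gap_cont
    for j in range(1, m + 1):
        Y[0][j] = -gap_start - (j - 1) * gap_cont

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1.0 if s1[i - 1] == s2[j - 1] else 0.0
            M[i][j] = match + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a gap costs gap_start; extending one costs gap_cont
            X[i][j] = max(M[i - 1][j] - gap_start, X[i - 1][j] - gap_cont)
            Y[i][j] = max(M[i][j - 1] - gap_start, Y[i][j - 1] - gap_cont)
    return max(M[n][m], X[n][m], Y[n][m])


score = affine_gap_score("dva", "deeva")
print(score)
```

For "dva" and "deeva", the three matching characters contribute 3, and the single gap of length two ("ee") costs 1 + 0.5, so the score is 1.5. Because a long gap is cheaper than several short ones, the measure suits abbreviations and shortened names.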
Data reduction
For the following exercises, we will use the df_pid dataframe of the Pima Indians Diabetes Database. We start by re-reading the dataframe.
df_pid = pd.read_csv('data/diabetes.csv', header = 0,
                     quotechar = '"', sep = ",",
                     na_values = ['na', '-', '.', ''])
df_pid
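One common way to reduce the number of attributes is principal component analysis, using the PCA class imported earlier. The sketch below runs on synthetic data generated with NumPy, since it is only meant to show the sequence of calls; the array X and the choice of two components are assumptions for illustration. On the real data, X would be replaced by the numeric feature columns of df_pid.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a feature matrix: 100 rows, 4 columns,
# where the last two columns are noisy copies of the first two
base = rng.normal(size=(100, 2))
X = np.hstack([base, base + 0.1 * rng.normal(size=(100, 2))])

# PCA is sensitive to scale, so standardize the features first
X_std = StandardScaler().fit_transform(X)

# Keep the two directions that explain the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)
```

The explained_variance_ratio_ attribute reports the share of total variance captured by each retained component, which helps when deciding how many components to keep.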
Data discretization
We take the BloodPressure attribute in the df_pid dataframe and create a histogram to discretize the data.
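As a minimal sketch of histogram-based discretization, the code below computes equal-width bin edges with np.histogram and then assigns each value to its bin with pd.cut. The blood pressure readings, the choice of four bins, and the bin labels are invented for illustration; with the real data, bp would be df_pid['BloodPressure'] after missing values are handled.

```python
import numpy as np
import pandas as pd

# Toy readings, invented for illustration (not read from diabetes.csv)
bp = pd.Series(
    [62, 64, 66, 70, 72, 72, 74, 78, 80, 82, 88, 90], name="BloodPressure"
)

# Equal-width histogram: counts per bin and the bin edges
counts, edges = np.histogram(bp, bins=4)

# Replace each value by the label of the bin it falls into;
# include_lowest=True keeps the minimum value inside the first bin
labels = ["bin_1", "bin_2", "bin_3", "bin_4"]
bp_binned = pd.cut(bp, bins=edges, labels=labels, include_lowest=True)

print(edges)
print(counts)
print(bp_binned.value_counts(sort=False))
```

To see the histogram itself, plt.hist(bp, bins=4) from matplotlib draws the same bins. Note that np.histogram uses left-closed bins while pd.cut uses right-closed ones by default, so counts can differ when a value lies exactly on an inner edge.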