Data Extraction in Python

During this lab, you will practice on working Pandas dataframes and how to extract part of the data using Python APIs. You will also practice on connecting to an SQLite database, send SQL queries, and extract the results to a Pandas dataframe. There are parts where the code is available so you need to run the code, understand and comment on the output. There are also a set of exercises where you need to write the code yourself.

First, we need to import the libraries that are required for our code. You can run the code on Google Colab or using your favorite IDE for Python.

import pandas as pd
import csv

Run the code and comment on the output. After that, solve the exercises with the TODO.

Creating dataframes

creating a dataframe from pandas series
data = {'State': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
 'Year': [2000, 2001, 2002, 2001, 2002, 2003],
 'Population': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
df = pd.DataFrame(data)
df
creating a dataframe from data matrix
data = [['Ohio', 2000, 1.5], ['Ohio', 2001, 1.7], ['Ohio', 2002, 3.6],
        ['Nevada', 2001, 2.4],['Nevada', 2002, 2.9], ['Nevada', 2003, 3.2]]
cols = ['State', 'Year', 'Population']
df = pd.DataFrame(data, columns = cols)
df
printing the number of columns and the number of rows in a dataframe
print("number of columns = ", len(df.columns))
print('number of rows = ', len(df))
creating a dataframe using data stored in a csv file
df = pd.read_csv(filepath_or_buffer = 'sample_data/california_housing_train.csv', 
                 delimiter=',', doublequote=True, quotechar='"',
                 na_values = ['na', '-', '.', ''], 
                 quoting=csv.QUOTE_ALL, encoding = "ISO-8859-1")
df

You may also use your own dataset or the movies dataset.

Extracting data from

  • Extracting rows and blocks
df.info() # index & data types
n = 4
dfh = df.head(n)             # get first n rows
dft = df.tail(n)             # get last n rows
top_left_corner_df = df.iloc[:5, :5]
# TODO: display the content of dfh, dft, top_left_corner_df
  • Extracting columns
col_set = df.iloc[:, 5:9]
col_set

This code extracts columns 6, 7, 8, 9 assuming the first column is numbered 1 not as the python index 0.

For columns that are not in the same range, we separate the columns indexes with a comma.

cols_set2 = df[df.columns[[5, 6, 8]]][:] 
cols_set2
  • Extracting rows
# TODO: write the code to extract ROWS 11, 13, and 15 from the dataframe.  
  • Extracting rows that satisfy a specific condition
df.loc[(df['total_rooms'] > 5000).values, ['longitude', 'latitude', 'total_rooms',  'median_house_value']]

Another way

df.iloc[(df['total_rooms'] > 5000).values, [0, 1, 3, 8]]

Profiling the dataframes

  • Displaying the names of the attributes
# TODO: print the names of the columns (attributes) of the dataframe.
  • Displaying the datatype of each column.
dataTypeSeries = df.dtypes
for col_idx in range(len(df.columns)):
    print(df.columns[col_idx], 'has type (', dataTypeSeries[col_idx], ')')
  • Displaying the names of the columns and the number of unique values in each column
for col in  df.columns:
    print(col, ' has (', len(df[col].unique()), ') unique values')
  • displaying quantities of interest such as min, max, …
# TODO: print the min, max, average and standard deviation of the attributes with numerical datatype
  • counting the missing values
s = df['color'].isnull()
s.sum()
# TODO: print the name of each column in the dataframe and 
#   the number of missing values in that column.  

You may try another dataset such as the movies dataset (you can find in blackboard) as the california_housing_train dataset is complete.

Connecting to an SQLite DB

First, we need to import the required library. The sqlite3 library is installed by default on Google Colab.

import sqlite3
from sqlite3 import Error
  • Creating connection to the databse. The function receives the name of the database file as input and returns a connector to the database.
def create_connection(db_file):
    """ create a database connection to the SQLite database
        specified by the db_file
    :param db_file: database file
    :return: Connection object or None
    """
    conn = None
    try:
        conn = sqlite3.connect(db_file)
    except Error as e:
        print(e)

    return conn
  • Sending the query to the database and receiving the resulting relation. The function receives a connector and a query and returns the results of running the query.
def run_query(conn, query):
    """
    Query all rows in the tasks table
    :param conn: the Connection object
    :return:
    """
    # Create a cursor
    cur = conn.cursor()    
    # Send the query to the database
    cur.execute(query)
    # Extract the results of the query
    results = cur.fetchall()
    # Return the results
    return results
  • Creating a dataframe from the resulting relation.

The function receives a connector and table name, then it sends two queries to get the names of the columns and the data. After receiving the contents, it creates a Pandas dataframe and returns it.

def convert_db_table_to_DF(conn, table):
    # get the names of the attributes in the database table
    header_query = "SELECT name FROM pragma_table_info('" + table + "') ORDER BY cid;"
#     print (header_query)
    cols_init = run_query(conn, header_query)
    cols = [cols_init[i][0] for i in range(len(cols_init))]
#     print(cols)
    # get the records of the table
    content_query = "Select * from " + table
    data = run_query(conn, content_query)
    df = pd.DataFrame(data, columns = cols)
    return df
  • Example for querying the database and getting the names of the tables in the database.
database = "sample_data/univdb-sqlite.db"

# create a database connection
conn = create_connection(database)    
with conn:
    query = 'SELECT name FROM sqlite_schema WHERE type="table"'
    data = run_query(conn, query)
    print(data)
  • Writing your own query
'''
TODO: Write the code to read the data in the `student` table, store it in 
a dataframe and display the contents of the dataframe.
''' 

Merging dataframes

This is equivalent to the union and join operator in RA and SQL. ::: callout-note - Finding the union of two tables

'''
 TODO: read the contents of the files `california_housing_train.csv` and 
 `california_housing_test.csv` and store them in `df_train` and  `df_test`. 
 Find the union the two dataframes and store the results in a dataframe `df_cal_housing`.
''' 
  • Joining two tables
'''
TODO: connect to the University database and read the data in the `instructor` 
and `department` relations and store their contents in two dataframes `df_inst` 
and `df_dept` and join them into `df_in_dep` dataframe
'''
'''
TODO: join `df_inst` and `df_dept` into `df_in_dep2` 
dataframe but specify the joining attribute this time.
'''