During this lab, you will practice on working Pandas dataframes and how to extract part of the data using Python APIs. You will also practice on connecting to an SQLite database, send SQL queries, and extract the results to a Pandas dataframe. There are parts where the code is available so you need to run the code, understand and comment on the output. There are also a set of exercises where you need to write the code yourself.
First, we need to import the libraries that are required for our code. You can run the code on Google Colab or using your favorite IDE for Python.
import pandas as pdimport csv
Run the code and comment on the output. After that, solve the exercises with the TODO.
# TODO: print the names of the columns (attributes) of the dataframe.
Displaying the datatype of each column.
dataTypeSeries = df.dtypesfor col_idx inrange(len(df.columns)):print(df.columns[col_idx], 'has type (', dataTypeSeries[col_idx], ')')
Displaying the names of the columns and the number of unique values in each column
for col in df.columns:print(col, ' has (', len(df[col].unique()), ') unique values')
displaying quantities of interest such as min, max, …
# TODO: print the min, max, average and standard deviation of the attributes with numerical datatype
counting the missing values
s = df['color'].isnull()s.sum()
# TODO: print the name of each column in the dataframe and # the number of missing values in that column.
You may try another dataset such as the movies dataset (you can find in blackboard) as the california_housing_train dataset is complete.
Connecting to an SQLite DB
First, we need to import the required library. The sqlite3 library is installed by default on Google Colab.
import sqlite3from sqlite3 import Error
Creating connection to the databse. The function receives the name of the database file as input and returns a connector to the database.
def create_connection(db_file):""" create a database connection to the SQLite database specified by the db_file :param db_file: database file :return: Connection object or None """ conn =Nonetry: conn = sqlite3.connect(db_file)except Error as e:print(e)return conn
Sending the query to the database and receiving the resulting relation. The function receives a connector and a query and returns the results of running the query.
def run_query(conn, query):""" Query all rows in the tasks table :param conn: the Connection object :return: """# Create a cursor cur = conn.cursor() # Send the query to the database cur.execute(query)# Extract the results of the query results = cur.fetchall()# Return the resultsreturn results
Creating a dataframe from the resulting relation.
The function receives a connector and table name, then it sends two queries to get the names of the columns and the data. After receiving the contents, it creates a Pandas dataframe and returns it.
def convert_db_table_to_DF(conn, table):# get the names of the attributes in the database table header_query ="SELECT name FROM pragma_table_info('"+ table +"') ORDER BY cid;"# print (header_query) cols_init = run_query(conn, header_query) cols = [cols_init[i][0] for i inrange(len(cols_init))]# print(cols)# get the records of the table content_query ="Select * from "+ table data = run_query(conn, content_query) df = pd.DataFrame(data, columns = cols)return df
Example for querying the database and getting the names of the tables in the database.
database ="sample_data/univdb-sqlite.db"# create a database connectionconn = create_connection(database) with conn: query ='SELECT name FROM sqlite_schema WHERE type="table"' data = run_query(conn, query)print(data)
Writing your own query
'''TODO: Write the code to read the data in the `student` table, store it in a dataframe and display the contents of the dataframe.'''
Merging dataframes
This is equivalent to the union and join operator in RA and SQL. ::: callout-note - Finding the union of two tables
'''TODO: read the contents of the files `california_housing_train.csv` and `california_housing_test.csv` and store them in `df_train` and `df_test`. Find the union the two dataframes and store the results in a dataframe `df_cal_housing`.'''
Joining two tables
'''TODO: connect to the University database and read the data in the `instructor` and `department` relations and store their contents in two dataframes `df_inst` and `df_dept` and join them into `df_in_dep` dataframe'''
'''TODO: join `df_inst` and `df_dept` into `df_in_dep2` dataframe but specify the joining attribute this time.'''