Data Extraction in Python
During this lab, you will practice working with Pandas dataframes and extracting parts of the data using Python APIs. You will also practice connecting to an SQLite database, sending SQL queries, and loading the results into a Pandas dataframe. In some parts the code is provided: run it, understand it, and comment on the output. There is also a set of exercises where you need to write the code yourself.
First, we need to import the libraries that are required for our code. You can run the code on Google Colab or using your favorite IDE for Python.
Run the code and comment on the output. After that, solve the exercises marked with TODO.
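A typical import cell for this lab might look like the following; the exact set of imports is an assumption based on the code used later in the handout:

```python
import csv        # csv.QUOTE_ALL, used when reading the CSV files
import sqlite3    # SQLite driver
from sqlite3 import Error

import pandas as pd
```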
Creating dataframes
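A dataframe can be created from a Python dictionary of columns or by reading a CSV file. A minimal sketch (the values here are illustrative, not taken from the lab dataset):

```python
import pandas as pd

# Create a dataframe from a dictionary of columns (illustrative data)
df_demo = pd.DataFrame({
    'longitude': [-122.05, -118.30, -117.81],
    'latitude': [37.37, 34.26, 33.78],
    'total_rooms': [3885.0, 1510.0, 3589.0],
})

# Alternatively, read a CSV file into a dataframe:
# df = pd.read_csv('sample_data/california_housing_train.csv')
```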
Extracting data from dataframes
- Extracting rows and blocks
df.info() # index & data types
n = 4
dfh = df.head(n) # get first n rows
dft = df.tail(n) # get last n rows
top_left_corner_df = df.iloc[:5, :5]
You can display the contents of dfh, dft, and top_left_corner_df by placing each variable on the last line of a cell. If you display them all from a single cell, however, the output will be hard to read. It is better to display each dataframe alone: use multiple cells (one cell per dataframe), so that dfh, dft, and the others are each rendered as a separate formatted table.
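The pattern can be sketched as follows, using a toy dataframe so the snippet is self-contained; in a notebook, `display(dfh)` or placing `dfh` as the last line of a cell renders the table:

```python
import pandas as pd

df = pd.DataFrame({'a': range(10), 'b': range(10, 20)})
dfh = df.head(4)   # first 4 rows
dft = df.tail(4)   # last 4 rows

# In a plain script, print a readable version instead of relying on display()
print(dfh.to_string())
print(dft.to_string())
```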
- Extracting columns
This code extracts columns 6, 7, 8, and 9, counting the first column as 1 rather than using the Python index 0. For columns that are not in one contiguous range, we separate the column indexes with commas (i.e., pass a list of indexes).
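A sketch of both cases with `iloc`: columns 6 to 9 in 1-based counting correspond to the 0-based positional slice `5:9` (end exclusive), and non-contiguous columns are passed as a list. The toy dataframe below is illustrative:

```python
import pandas as pd

# Toy dataframe with 9 columns named c1..c9 (illustrative)
df = pd.DataFrame([[i * 10 + j for j in range(9)] for i in range(4)],
                  columns=[f'c{j + 1}' for j in range(9)])

# Columns 6..9 (1-based) -> positional slice 5:9 (0-based, end exclusive)
block = df.iloc[:, 5:9]

# Non-contiguous columns: list the 0-based indexes, separated by commas
picked = df.iloc[:, [0, 3, 7]]
```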
- Extracting rows
You can use the code:
Note that Python starts the index from 0.
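A sketch of positional row slicing with `iloc` (0-based, end exclusive), on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({'x': range(10)})

rows = df.iloc[2:5]                          # rows at positions 2, 3, 4
first_and_last = df.iloc[[0, len(df) - 1]]   # non-contiguous positions
```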
- Extracting rows that satisfy a specific condition
df.loc[(df['total_rooms'] > 5000).values, ['longitude', 'latitude', 'total_rooms', 'median_house_value']]
Another way
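One common alternative is boolean-mask indexing followed by column selection, or `query()`, which reads closer to SQL; whether either matches the lab's intended alternative is an assumption, and the toy data below is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'longitude': [-122.0, -118.3, -117.8],
    'latitude': [37.4, 34.3, 33.8],
    'total_rooms': [6200, 1510, 5100],
    'median_house_value': [350000, 180000, 270000],
})

# Boolean mask, then select the columns of interest
alt = df[df['total_rooms'] > 5000][['longitude', 'latitude',
                                    'total_rooms', 'median_house_value']]

# Or with query(), passing the condition as a string
alt_q = df.query('total_rooms > 5000')
```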
- Displaying the names of the attributes
You can access the name of the attributes using df.columns
OR
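The alternative form intended here is an assumption, but two common equivalents are iterating the dataframe directly with `list(df)` (which yields column names) and `df.columns.tolist()`:

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2]})

names_index = df.columns            # an Index object
names_list = list(df)               # iterating a dataframe yields column names
names_list2 = df.columns.tolist()   # explicit conversion to a list
```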
- Displaying the datatype of each column.
dataTypeSeries = df.dtypes
# Iterate over (column, dtype) pairs; indexing the series positionally,
# e.g. dataTypeSeries[col_idx], is deprecated for label-indexed Series
for col, dtype in dataTypeSeries.items():
    print(col, 'has type (', dtype, ')')
- Displaying the names of the columns and the number of unique values in each column
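This can be sketched with `nunique()`, which returns a Series mapping each column name to its count of unique values (toy data shown):

```python
import pandas as pd

df = pd.DataFrame({'city': ['A', 'B', 'A'], 'rooms': [3, 3, 5]})

# Series mapping each column name to its count of unique values
unique_counts = df.nunique()
for col, n in unique_counts.items():
    print(col, 'has (', n, ') unique values')
```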
- displaying quantities of interest such as min, max, …
# TODO: print the min, max, average and standard deviation of the attributes with numerical datatype
dataTypeSeries = df.dtypes
for col in df.columns:
    if dataTypeSeries[col] != 'object':
        print(col, 'has Min = ', df[col].min(),
              'Max = ', df[col].max(),
              'Average = ', df[col].mean(),
              'Standard Deviation = ', df[col].std())
- counting the missing values
# TODO: print the name of each column in the dataframe and
# the number of missing values in that column.
s = df['longitude'].isnull()
s.sum()
You can use the following code:
for column in df.columns:
    s = df[column].isnull().sum()
    print(column, 'has (', s, ') missing values')
You may try another dataset, such as the movies dataset, since this dataset is complete (it has no missing values).
Connecting to an SQLite DB
- Creating a connection to the database. The function receives the name of the database file as input and returns a connector to the database.
import sqlite3
from sqlite3 import Error

def create_connection(db_file):
    """ create a database connection to the SQLite database
        specified by the db_file
    :param db_file: database file
    :return: Connection object or None
    """
    conn = None
    try:
        conn = sqlite3.connect(db_file)
    except Error as e:
        print(e)
    return conn
- Sending the query to the database and receiving the resulting relation. The function receives a connector and a query and returns the results of running the query.
def run_query(conn, query):
    """
    Run a query against the database and return all resulting rows
    :param conn: the Connection object
    :param query: the query that should be sent to the database
    :return: The results of executing the query
    """
    # Create a cursor
    cur = conn.cursor()
    # Send the query to the database
    cur.execute(query)
    # Extract the results of the query
    results = cur.fetchall()
    # Return the results
    return results
- Creating a dataframe from the resulting relation. The function receives a connector and table name, then it sends two queries to get the names of the columns and the data. After receiving the contents, it creates a Pandas dataframe and returns it.
def convert_db_table_to_DF(conn, table):
    # get the names of the attributes in the database table
    header_query = "SELECT name FROM pragma_table_info('" + table + "') ORDER BY cid;"
    cols_init = run_query(conn, header_query)
    cols = [cols_init[i][0] for i in range(len(cols_init))]
    # get the records of the table
    content_query = "SELECT * FROM " + table
    data = run_query(conn, content_query)
    df = pd.DataFrame(data, columns=cols)
    return df
- Example for querying the database and getting the names of the tables in the database.
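Listing the tables can be done by querying the `sqlite_master` catalog. In the sketch below, the in-memory database and its `student` table are stand-ins so the snippet runs on its own:

```python
import sqlite3

# In-memory database with one illustrative table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE student (id INTEGER, name TEXT)')

# sqlite_master holds one row per schema object; filter to tables
cur = conn.cursor()
cur.execute("SELECT name FROM sqlite_master WHERE type = 'table';")
tables = [row[0] for row in cur.fetchall()]
print(tables)
conn.close()
```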
# TODO: Write the code to read the data in the `student` table, store it in a dataframe and display the contents of the dataframe.
We will use the code:
database = "sample_data/univdb-sqlite.db"
# create a database connection
conn_uni = create_connection(database)
with conn_uni:
    tab_name = 'student'
    student = convert_db_table_to_DF(conn_uni, tab_name)
# display the contents of the dataframe
student
In the next figure, you can see how we wrote the code and the results it produced.
The next operations are equivalent to the union and join operators in RA and SQL.
- Finding the union of two tables
'''
TODO: read the contents of the files `california_housing_train.csv` and
`california_housing_test.csv` and store them in `df_train` and `df_test`.
Find the union the two dataframes and store the results in a dataframe `df_cal_housing`.
'''
import csv  # needed for csv.QUOTE_ALL

df_train = pd.read_csv(filepath_or_buffer='sample_data/california_housing_train.csv',
                       delimiter=',', doublequote=True, quotechar='"',
                       na_values=['na', '-', '.', ''],
                       quoting=csv.QUOTE_ALL, encoding="ISO-8859-1")
df_test = pd.read_csv(filepath_or_buffer='sample_data/california_housing_test.csv',
                      delimiter=',', doublequote=True, quotechar='"',
                      na_values=['na', '-', '.', ''],
                      quoting=csv.QUOTE_ALL, encoding="ISO-8859-1")
# Union appends rows, so concatenate along axis=0 (axis=1 would place the
# dataframes side by side); ignore_index renumbers the combined rows.
# Note that SQL UNION also removes duplicates; add .drop_duplicates() for
# that behavior.
df_cal_housing = pd.concat([df_train, df_test], axis=0, ignore_index=True)
df_cal_housing
- Joining two tables
'''
TODO: connect to the University database and read the data in the `instructor`
and `department` relations and store their contents in two dataframes `df_inst`
and `df_dept` and join them into `df_in_dep` dataframe
'''
database = "sample_data/univdb-sqlite.db"
# create a database connection
conn = create_connection(database)
with conn:
    inst_tab = 'instructor'
    df_inst = convert_db_table_to_DF(conn, inst_tab)
    dept_tab = 'department'
    df_dept = convert_db_table_to_DF(conn, dept_tab)
# merge joins on the columns shared by both dataframes (a natural join)
df_in_dep = pd.merge(df_inst, df_dept)
df_in_dep
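With no arguments, `pd.merge` joins on all columns the two dataframes share, which mirrors SQL's natural join. The join key and join type can also be made explicit with `on` and `how`; the toy instructor/department data below is illustrative:

```python
import pandas as pd

df_inst = pd.DataFrame({'ID': [1, 2], 'name': ['Wu', 'Mozart'],
                        'dept_name': ['Finance', 'Music']})
df_dept = pd.DataFrame({'dept_name': ['Finance', 'Music'],
                        'building': ['Painter', 'Packard']})

# Inner join on the shared column `dept_name` (inner is the default)
df_in_dep = pd.merge(df_inst, df_dept, on='dept_name', how='inner')
```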