INFOMDWR – Assignment 1: Data Extraction & Integration

Introduction

You are working in a bank. The manager approached you to redesign the bank database. They have already 3 relations that represent 3 main entities:

  branch(branch_id, branch_name, street, building_number, city, ...)
  customer(customer_id, street, house_number, city, country, ...)
  loan(loan_number, branch_id, amount, ...)

Task 1: Database (re)design

  1. Add more attributes to the given relations (branch, customer, loan).
  2. Add a minimum of two relations that represent entities.
  3. Draw the schema chart for the database.
  4. Identify the cardinalities of the relationships between the entities and explain how you can represent the relationships in your design.
  5. List all the primary-key-foreign-key constraints in the database.
  6. Write the SQL code that creates the whole database in SQLite and save the file as .sql that can be run directly from the sqlite3 bash using the command .read my_file.sql
  7. Pick one of the relations (tables) and explain if it is in the BCNF or not and why.

Task 2: Querying the database

Using the database from Task 1, write queries in natural language, relational algebra and SQL that contains:

  1. Left outer join.
  2. Aggregate function.
  3. Nested query.

To make sure that you queries can be run on the database, you need to enter a few records in the tables that are involved in your queries.

Task 3: Data extraction and entity resolution using Python

Using the database from Task 1, perform the following tasks using Python code:

  1. Write a Python code that allows you to connect to the database file, send SQL queries to the database and extract the results.
  2. Read the content of the customer relation (table) into Pandas DataFrame.
  3. Using a similarity function that compares two records (similar to the one in the tutorial), report the customers with similarity > 0.7.

Submission

You need to submit a zip file that includes:

  1. A pdf file that reports and explains how you solved the questions.
  2. An SQL (with extension .sql) file that contains all the code for creating the database, adding records in the selected relations and querying the database.
  3. A Python notebook that include the code for solving Task 3. The explanation of the code should be included in the report (the pdf file).

Submission should be done on blackboard and only one submission per group is needed. Each group should have 3 - 4 members. You can submit as many versions as you like but only the last submission before the deadline will be graded.