Assignment 1: Data extraction & integration
Introduction
You are working in a bank. The manager approached you to redesign the database. They have already 3 relations that represent 3 main entities:
branch(branch_id, branch_name, street, building_number, city, ...)
customer(customer_id, street, house_number, city, country, ...)
loan(loan_number, branch_name, amount, ...)
Task 1: Database (re)design
- Add more attributes to the given relations (branch, customer, loan).
- Add a minimum of two relations that represent entities.
- Draw the schema chart for the database.
- Identify the cardinalities of the relationships between the entities and explain how you can represent the relationships in your design.
- List all the primary-key-foreign-key constraints in the database.
- Write the SQL code that creates the whole database in SQLite and save the file as
.sql
that can be run directly from the sqlite3 bash using the command.read my_file.sql
- Pick one of the relations (tables) and explain if it is in the BCNF or not and why.
Task 2: Querying the database
Using the database from Task 1, write queries in natural language, relational algebra and SQL that contains:
- Left outer join
- Aggregate function
- Nested query
To make sure that you queries can be run on the database, you need to enter a few records in the tables that are involved in your query.
Task 3: Data extraction and entity resolution using Python
Using the database from Task 1, perform the following tasks using Python code:
- Write a Python code that allows you to connect to the database file, send SQL queries to the database and extract the results.
- Read the content of the customer relation (table) into Pandas DataFrame.
- Using a similarity function that compares two records (similar to the one in the tutorial), report the customers with similarity > 0.7.
Submission
You need to submit a zip file that includes:
- A pdf file that reports and explains how you solved the questions.
- An SQL (with extension
.sql
) file that contains all the code for creating the database, adding records in the selected relations and querying the database. - A Python notebook that includes the code for solving Task 3. The explanation of the code should be included in the report (the pdf file).
Submission should be done on BrightSpace and only one submission per group is needed. You can submit as many versions as you like but only the last submission before the deadline will be graded.
Deadline: as specified in the course website (2024-09-23 10:00AM).