Assignment 2: Data integration & preparation

Introduction

In this assignment, you will work on different tasks that are related to data preparation including data profiling, finding records that refer to same entity, and computing the correlation between different attributes. The datasets that you need to work on are available online (links are provided).

Task 1: Profiling relational data

For this task, download and read the paper about profiling relational data, select a set of summary statistics about the data (minimum of 10 different values) and write Python code to compute these quantities for a dataset of your choice. Preferably, you can use one of the csv files from the road safety dataset. Explain the importance of each summary statistic that you selected in understanding the characteristics of the dataset.

Note: Computing the same statistical quantity on multiple columns of the dataset will be counted only once.

Task 2: Entity resolution

For this task, use the DBLP-ACM.

Part 1:

Write a Python code to compare every single record in the dataset (ACM.csv) with all the records in (DBLP2.csv) and find the similar records (records that represent the same publication). To compare two records, follow the steps:

Ignore the pub_id.
Change all alphabetical characters into lowercase.
Convert multiple spaces to one.
Use Levenshtein similarity (\(L_{sim}(S_1, S_2) = 1 - \frac{MED(S_1, S_2)}{MAX(|S_1|,|S_2|)})\) for comparing the values in the title attribute and compute the score (\(s_t\)). (MED refers to the minimum edit distance and \(|S_i|\) is the number of characters in string \(S_i\)).
Use Jaro similarity to compare the values in the authors field and compute (\(s_a\)).
Use a modified version of the affine similarity that is scaled to the interval [0, 1] for the venue attribute (\(s_c\)).
Use Match (1) / Mismatch (0) for the year (\(s_y\)).
Use the formula \(rec\_sim = \omega_1 * s_t + \omega_2 * s_a + \omega_3 * s_c + \omega_4 * s_y\) to combine the scores and compute the final score, where \(\sum_{i=1}^4 \omega_i = 1\).
Report the records with rec_sim > 0.7 as duplicate records by storing the ids of both records in a list.
In the table DBLP-ACM_perfectMapping.csv, you can find the actual mappings (the ids of the correct duplicate records). Compute the precision of this method by counting the number of duplicate records that you discovered correctly. That is, among all the reported similar records by your method, how many pairs exist in the file DBLP-ACM_perfectMapping.csv.
Record the running time of the method. You can observe that the program takes a long time to get the results. What can you do to reduce the running time? (Just provide clear discussion – no need for implementing the ideas.)

Part 2:

For this part, we will use the code about LSH from the tutorial.

Concatenate the values in each record into one single string.
Change all alphabetical characters into lowercase.
Convert multiple spaces to one.
Combine the records from both tables into one big list as we did during the lab.
Use the functions in the tutorials from lab 5 to compute the shingles, the minhash signature and the similarity.
Extract the top \(2224\) candidates from the LSH algorithm, compare them to the actual mappings in the file DBLP-ACM_perfectMapping.csv and compute the precision of the method.
Record the running time of the method.
Compare the precision and the running time in Parts 1 and 2.

Task 3: Data preparation

For this task, use the Pima Indians Diabetes Database.

Compute the correlation between the different columns after removing the outcome column.
Remove the disguised values from the table. We need to remove the values that equal to 0 from columns BloodPressure, SkinThickness and BMI as these are missing values but they have been replaced by the value 0. Remove the value but keep the record (i.e.) change the value to null.
Fill the cells with null using the mean values of the records that have the same class label.
Compute the correlation between the different columns.
Compare the values from this step with the values in the first step (just mention the most important changes (if any)) and comment on your findings.

Submission

You need to submit a zip file that includes:

A pdf file that reports and explains how you answered the questions.
A Python notebook that includes the code with markdown boxes to specify the question under consideration. Use comments to explain the important steps in your code. The explanation of the methods in general with discussion about the most important steps in the code should be included in the report (the pdf file).

Submission should be done on BrightSpace and only one submission per group is needed. You can submit as many versions as you like but only the last submission before the deadline will be graded.

Deadline: as specified in the course website (2024-10-07 10:00AM).