Lab 5: Heterogeneous Data Analysis & String Similarity

In this lab, you will practice computing the similarity between different strings and sets using several atomic similarity measures. You will also practice finding the similarity between documents using min-hash and locality sensitive hashing, in addition to the traditional set-based measures (Jaccard similarity, Sørensen Coefficient, Tversky Index, and the Overlap Coefficient).

Atomic String Similarity

In the following exercises, you will learn how to find the similarity between atomic values using Edit Distance, Jaro Similarity, Jaro-Winkler Similarity, and Soundex. We will start by importing the required libraries.

import pandas as pd
import csv
import numpy as np
Minimum Edit Distance (Levenshtein)
'''
TODO: find an implementation of the Levenshtein minimum edit distance and use it 
to find the similarity between INTENTION and EXECUTION
'''
def ldist(s, t):
    
    return None

print("Minimum edit disctance = ", ldist("INTENTION", "EXECUTION"))
'''
TODO: Modify the ldist function to accept a new parameter which specifies the substitution cost
Test your implementation with the call mod_ldist("Intention", "execution", 1) -- You should get 5
and mod_ldist("Intention", "execution", 2) -- You should get 8
'''
def mod_ldist(s, t, sub_cost):
    
    return None
  

print("Minimum edit disctance = ", mod_ldist("INTENTION", "EXECUTION", 2))
Jaro Similarity
'''
TODO: find an implementation of the Jaro similarity algorithm and use it 
to find the similarity between the following pairs of strings:
('nichlseon','niechlson'),
('MARTHA', 'MARHTA'),
('DIXON', 'DICKSONX'),
('JELLYFISH', 'SMELLYFISH'),
('cunningham', 'cunnigham'), 
('shackleford', 'shackelford'),
('nichleson', 'nichulson'), 
('arnab', 'urban'),
('DATASCIENCE','SCIENCEDATA')
'''
def jaro(s, t):
    
    return None
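One possible sketch of the Jaro similarity: matches are searched within a window of max(|s|,|t|)/2 - 1, and transpositions are counted among the matched characters. The name jaro_sketch and the pairs list below are just placeholders for your own solution.

# Sketch of Jaro similarity: count matches within a sliding window,
# then count transpositions among the matched characters.
def jaro_sketch(s, t):
    if s == t:
        return 1.0
    ls, lt = len(s), len(t)
    if ls == 0 or lt == 0:
        return 0.0
    window = max(ls, lt) // 2 - 1
    s_matched = [False] * ls
    t_matched = [False] * lt
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(lt, i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # transpositions = matched characters that appear out of order
    transpositions = 0
    j = 0
    for i in range(ls):
        if s_matched[i]:
            while not t_matched[j]:
                j += 1
            if s[i] != t[j]:
                transpositions += 1
            j += 1
    transpositions //= 2
    return (matches / ls + matches / lt + (matches - transpositions) / matches) / 3

pairs = [('nichlseon', 'niechlson'), ('MARTHA', 'MARHTA'), ('DIXON', 'DICKSONX'),
         ('JELLYFISH', 'SMELLYFISH'), ('cunningham', 'cunnigham'),
         ('shackleford', 'shackelford'), ('nichleson', 'nichulson'),
         ('arnab', 'urban'), ('DATASCIENCE', 'SCIENCEDATA')]
for s, t in pairs:
    print(s, t, round(jaro_sketch(s, t), 4))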
Jaro-Winkler Similarity
'''
TODO: find an implementation of the Jaro-Winkler similarity algorithm and use it 
to find the similarity between the pairs of strings in the previous exercise. 
Compare the results and comment on them.
'''
def jaro_winkler(s, t, p = 0.1):
    
    return None
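A sketch of Jaro-Winkler, reusing jaro_sketch and pairs from the previous sketch: the Jaro score is boosted according to the length of the common prefix (capped at 4 characters), scaled by the prefix weight p.

# Sketch of Jaro-Winkler: Jaro score plus a bonus for a shared prefix
# (at most 4 characters), scaled by the prefix weight p (commonly 0.1).
def jaro_winkler_sketch(s, t, p=0.1):
    jaro_score = jaro_sketch(s, t)   # reuses the sketch above
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == 4:
            break
        prefix += 1
    return jaro_score + prefix * p * (1 - jaro_score)

for s, t in pairs:  # same pairs as in the Jaro exercise
    print(s, t, round(jaro_winkler_sketch(s, t), 4))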
Soundex
'''
TODO: find an implementation of the soundex algorithm and use it 
to find the representation of EDUCATE and EVALUATE
'''
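A simplified sketch of the classic Soundex encoding (H and W are treated like vowels here, which differs slightly from the strictest Soundex rules):

# Sketch of a Soundex code: keep the first letter, map the remaining letters
# to digits, drop unmapped letters and adjacent duplicates, pad/truncate to 4.
def soundex_sketch(word):
    codes = {'B': '1', 'F': '1', 'P': '1', 'V': '1',
             'C': '2', 'G': '2', 'J': '2', 'K': '2', 'Q': '2',
             'S': '2', 'X': '2', 'Z': '2',
             'D': '3', 'T': '3',
             'L': '4',
             'M': '5', 'N': '5',
             'R': '6'}
    word = word.upper()
    encoded = word[0]
    prev = codes.get(word[0], '')
    for ch in word[1:]:
        digit = codes.get(ch, '')   # vowels (and H, W, Y here) map to ''
        if digit and digit != prev:
            encoded += digit
        prev = digit
    return (encoded + '000')[:4]

print("EDUCATE ->", soundex_sketch("EDUCATE"))
print("EVALUATE ->", soundex_sketch("EVALUATE"))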

Similarity for Sets

For this exercise, you may use a tokenizing function that converts the following sentences into sets of words (words are separated by spaces). After that, apply the Jaccard similarity, Sørensen Coefficient, Tversky Index, and Overlap Coefficient.

D1 = People wearing costumes are gathering in a forest and are looking in the same direction    
D2 = People wearing costumes are scattering in a forest and are looking in different directions
D3 = People wearing costumes are gathering in a forest and are looking in the same direction    
D4 = Masked people are looking in the same direction in a forest

Compute the pairwise similarity between the documents using the four measures.

'''
TODO: the first step is to tokenize the documents 
'''
D1 = 'People wearing costumes are gathering in a forest and are looking in the same direction'
D2 = 'People wearing costumes are scattering in a forest and are looking in different directions'
D3 = 'People wearing costumes are gathering in a forest and are looking in the same direction'
D4 = 'Masked people are looking in the same direction in a forest'
def tokenize(d):
    # words are separated by spaces, per the exercise text;
    # lowercase d first if case differences should be ignored
    return set(d.split())

S1 = tokenize(D1)
S2 = tokenize(D2)
S3 = tokenize(D3)
S4 = tokenize(D4)
Jaccard Similarity
'''
TODO: Compute the pairwise Jaccard similarity for the sets S1, S2, S3, and S4. 
'''
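One possible sketch, assuming the token sets S1, S2, S3, and S4 defined above:

from itertools import combinations

# Sketch: Jaccard similarity = |A ∩ B| / |A ∪ B|
def jaccard_sketch(a, b):
    return len(a & b) / len(a | b)

sets = {'S1': S1, 'S2': S2, 'S3': S3, 'S4': S4}
for (n1, x), (n2, y) in combinations(sets.items(), 2):
    print(n1, n2, round(jaccard_sketch(x, y), 3))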
Sørensen Coefficient
'''
TODO: Compute the pairwise Sørensen Coefficient for the sets S1, S2, S3, and S4. 
'''
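A sketch reusing sets and combinations from the Jaccard cell above:

# Sketch: Sørensen (Dice) coefficient = 2 |A ∩ B| / (|A| + |B|)
def sorensen_sketch(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

for (n1, x), (n2, y) in combinations(sets.items(), 2):
    print(n1, n2, round(sorensen_sketch(x, y), 3))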
Tversky Index
'''
TODO: Compute the pairwise Tversky Index for the sets S1, S2, S3, and S4. 
'''
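A sketch with symmetric weights alpha = beta = 0.5 (an arbitrary choice; alpha = beta = 1 recovers the Jaccard similarity, alpha = beta = 0.5 the Sørensen coefficient), again reusing sets and combinations:

# Sketch: Tversky index = |A ∩ B| / (|A ∩ B| + alpha|A - B| + beta|B - A|)
def tversky_sketch(a, b, alpha=0.5, beta=0.5):
    inter = len(a & b)
    return inter / (inter + alpha * len(a - b) + beta * len(b - a))

for (n1, x), (n2, y) in combinations(sets.items(), 2):
    print(n1, n2, round(tversky_sketch(x, y), 3))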
Overlap Coefficient
'''
TODO: Compute the pairwise Overlap Coefficient for the sets S1, S2, S3, and S4. 
'''
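A sketch of the overlap coefficient, again reusing sets and combinations:

# Sketch: overlap coefficient = |A ∩ B| / min(|A|, |B|)
def overlap_sketch(a, b):
    return len(a & b) / min(len(a), len(b))

for (n1, x), (n2, y) in combinations(sets.items(), 2):
    print(n1, n2, round(overlap_sketch(x, y), 3))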

Locality Sensitive Hashing (LSH)

In the realm of large datasets and high-dimensional spaces, searching for items can be computationally challenging. Specifically, if you want to find items that are similar to a given item (known as the “query item”), conventional search methods might take an impractical amount of time. This is especially noticeable in tasks such as finding similar documents in a vast corpus or similar images in a massive image database.

At its heart, LSH is a method for hashing data points so that similar data points will, with high probability, end up in the same “bucket” or hash bin, while dissimilar items will, with high probability, end up in different bins. Applying LSH to find similar documents therefore requires comparing the query only with the documents in its bucket instead of with every single document in the collection.
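As an illustration of the idea (not a tuned implementation), the following sketch builds MinHash signatures for the token sets S1 to S4 from the earlier exercise, splits each signature into bands, and hashes each band so that documents agreeing on an entire band land in the same bucket. The parameter choices (20 hash functions, 5 bands of 4 rows) are arbitrary.

import random
from collections import defaultdict

# Illustrative MinHash + LSH banding over the token sets from the set exercise.
random.seed(42)
num_hashes = 20
num_bands = 5
rows_per_band = num_hashes // num_bands
prime = (1 << 31) - 1

# Random hash functions of the form h(x) = (a*x + b) mod prime
hash_funcs = [(random.randint(1, prime - 1), random.randint(0, prime - 1))
              for _ in range(num_hashes)]

def minhash_signature(tokens):
    # Signature = one minimum per hash function over all tokens in the set.
    # Python's built-in hash() of strings varies between runs, which is fine
    # for an illustration but not for reproducible experiments.
    return [min((a * hash(tok) + b) % prime for tok in tokens)
            for a, b in hash_funcs]

docs = {'D1': S1, 'D2': S2, 'D3': S3, 'D4': S4}
signatures = {name: minhash_signature(toks) for name, toks in docs.items()}

# Banding: documents whose signatures agree on an entire band share a bucket.
buckets = defaultdict(set)
for name, sig in signatures.items():
    for band in range(num_bands):
        chunk = tuple(sig[band * rows_per_band:(band + 1) * rows_per_band])
        buckets[(band, chunk)].add(name)

# Candidate groups = documents that co-occur in at least one bucket.
candidates = {tuple(sorted(group)) for group in buckets.values() if len(group) > 1}
print("Candidate groups of similar documents:", candidates)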