Indexing and data integration
Introduction
In databases, indexes are used to speed up the retrieval of rows by creating a data structure that allows the database engine to quickly locate the required data. There are several types of indexe:
Index that is automatically created by the Database Management System (DBMS) when you define a PRIMARY KEY or UNIQUE constraint on a table.
Creating and deleting indexes
In this exercise, you will practice on creating and deleting indexes in a database. We can use the university database univdb-sqlite.db
from here.
Query Performance with and without Indexes
For this exercise, we will create a database with only one big table of synthetic data about employees.
Copy this code to Google colab and run it (You don’t have to fully understand how it works).
import sqlite3
import random
import datetime
NUM_ROWS = 5_000_000
conn = sqlite3.connect(f'employees_{NUM_ROWS}.db')
cur = conn.cursor()
cur.execute('''
CREATE TABLE IF NOT EXISTS Employees (
EmployeeID INTEGER PRIMARY KEY,
FirstName TEXT,
LastName TEXT,
Email TEXT UNIQUE,
DepartmentID char(6),
Salary REAL,
HireDate DATE
);
''')
def random_date(start, end):
return start + datetime.timedelta(
days=random.randint(0, int((end - start).days)))
first_names = [
"John", "Jane", "Michael", "Emily", "James", "Patricia", "Robert", "Linda", "William", "Barbara",
"David", "Elizabeth", "Joseph", "Susan", "Charles", "Margaret", "Thomas", "Dorothy", "Christopher", "Nancy",
"Daniel", "Karen", "Matthew", "Betty", "Anthony", "Helen", "Mark", "Sandra", "Paul", "Ashley",
"Steven", "Deborah", "Andrew", "Jessica", "Kenneth", "Sarah", "Joshua", "Laura", "Kevin", "Anna",
"Brian", "Kimberly", "George", "Melissa", "Edward", "Michelle", "Ronald", "Emily", "Timothy", "Amanda",
"Jason", "Angela", "Jeffrey", "Rebecca", "Ryan", "Cynthia", "Jacob", "Stephanie", "Gary", "Katherine",
"Nicholas", "Shirley", "Eric", "Brenda", "Jonathan", "Catherine", "Stephen", "Christine", "Larry", "Marie",
"Justin", "Janet", "Scott", "Samantha", "Brandon", "Carolyn", "Frank", "Rachel", "Gregory", "Heather",
"Raymond", "Diane", "Benjamin", "Virginia", "Patrick", "Julie", "Alexander", "Joyce", "Jack", "Victoria",
"Dennis", "Olivia", "Jerry", "Christina", "Tyler", "Kelly", "Aaron", "Emma", "Jose", "Lauren"
]
last_names = [
"Smith", "Johnson", "Williams", "Brown", "Jones", "Garcia", "Miller", "Davis", "Rodriguez", "Martinez",
"Hernandez", "Lopez", "Gonzalez", "Wilson", "Anderson", "Thomas", "Taylor", "Moore", "Jackson", "Martin",
"Lee", "Perez", "Thompson", "White", "Harris", "Sanchez", "Clark", "Ramirez", "Lewis", "Robinson",
"Walker", "Young", "Allen", "King", "Wright", "Scott", "Torres", "Nguyen", "Hill", "Flores", "Green",
"Adams", "Nelson", "Baker", "Hall", "Rivera", "Campbell", "Mitchell", "Carter", "Roberts", "Gomez",
"Phillips", "Evans", "Turner", "Diaz", "Parker", "Cruz", "Edwards", "Collins", "Reyes", "Stewart",
"Morris", "Morales", "Murphy", "Cook", "Rogers", "Gutierrez", "Ortiz", "Morgan", "Cooper", "Peterson",
"Bailey", "Reed", "Kelly", "Howard", "Ramos", "Kim", "Cox", "Ward", "Richardson", "Watson",
"Brooks", "Chavez", "Wood", "James", "Bennett", "Gray", "Mendoza", "Ruiz", "Hughes", "Price",
"Alvarez", "Castillo", "Sanders", "Patel", "Myers", "Long", "Ross", "Foster", "Jimenez", "Powell"
]
domains = ['example.com', 'test.com', 'demo.com']
start_date = datetime.date(2000, 1, 1)
end_date = datetime.date(2023, 12, 31)
locations = ['AM', 'UT', 'RT', 'EN']
for i in range(NUM_ROWS):
first_name = f'{random.choice(first_names)}_{i%10000}'
last_name = f'{random.choice(last_names)}_{i%10000}'
email = f"{first_name.lower()}.{last_name.lower()}{i}@{random.choice(domains)}"
department_id = locations[random.randint(0, 3)] + str(random.randint(1, 100))
salary = round(random.uniform(30000, 150000), 2)
hire_date = random_date(start_date, end_date).isoformat()
cur.execute('''
INSERT INTO Employees (FirstName, LastName, Email, DepartmentID, Salary, HireDate)
VALUES (?, ?, ?, ?, ?, ?);
''', (first_name, last_name, email, department_id, salary, hire_date))
if i % 100000 == 0:
print(f"inserted numberd of rows {i}")
conn.commit()
conn.commit()
conn.close()
print(f"Database with {NUM_ROWS} rows created successfully!")
Download the generated database and upload it to DB-Browser.
For each of the following queries perform the following task:
- Run the query and record the running time.
- Create an index on the column that is specified in the query.
- Run the query again with the indexed column and record the running time.
The queries:
Since the salary is randomly generated, this value might not be in your database. You may start by displaying a few records from the Employees
table and pick one of the values in the salary column.
Similar to the previous queries, you may display a few records first and select values that are in the same record for the DepartmentID
and the Salary
.
It can be clearly observed that indexing improves the query time. For example:
For the first query, running time before indexing = 602 ms and after indexing 15 ms.
For the second query, running time before indexing = 613 ms and after indexing 16 ms.
For the third query, running time before indexing = 666 ms and after indexing 14.3 ms.
For the fourth query, running time before indexing = 1809 ms and after indexing 345 ms.