Indexing and data integration
Introduction
In databases, indexes are used to speed up the retrieval of rows by creating a data structure that allows the database engine to quickly locate the required data. There are several types of indexes:
Index that is automatically created by the Database Management System (DBMS) when you define a PRIMARY KEY or UNIQUE constraint on a table.
Creating and deleting indexes
In this exercise, you will practice on creating and deleting indexes in a database. We can use the university database univdb-sqlite.db
from here.
Query Performance with and without indexes
For this exercise, we will create a database with only one big table of synthetic data about employees.
Copy this code to Google colab and run it (You don’t have to fully understand how it works).
import sqlite3
import random
import datetime
NUM_ROWS = 5_000_000
conn = sqlite3.connect(f'employees_{NUM_ROWS}.db')
cur = conn.cursor()
cur.execute('''
CREATE TABLE IF NOT EXISTS Employees (
EmployeeID INTEGER PRIMARY KEY,
FirstName TEXT,
LastName TEXT,
Email TEXT UNIQUE,
DepartmentID char(6),
Salary REAL,
HireDate DATE
);
''')
def random_date(start, end):
return start + datetime.timedelta(
days=random.randint(0, int((end - start).days)))
first_names = [
"John", "Jane", "Michael", "Emily", "James", "Patricia", "Robert", "Linda", "William", "Barbara",
"David", "Elizabeth", "Joseph", "Susan", "Charles", "Margaret", "Thomas", "Dorothy", "Christopher", "Nancy",
"Daniel", "Karen", "Matthew", "Betty", "Anthony", "Helen", "Mark", "Sandra", "Paul", "Ashley",
"Steven", "Deborah", "Andrew", "Jessica", "Kenneth", "Sarah", "Joshua", "Laura", "Kevin", "Anna",
"Brian", "Kimberly", "George", "Melissa", "Edward", "Michelle", "Ronald", "Emily", "Timothy", "Amanda",
"Jason", "Angela", "Jeffrey", "Rebecca", "Ryan", "Cynthia", "Jacob", "Stephanie", "Gary", "Katherine",
"Nicholas", "Shirley", "Eric", "Brenda", "Jonathan", "Catherine", "Stephen", "Christine", "Larry", "Marie",
"Justin", "Janet", "Scott", "Samantha", "Brandon", "Carolyn", "Frank", "Rachel", "Gregory", "Heather",
"Raymond", "Diane", "Benjamin", "Virginia", "Patrick", "Julie", "Alexander", "Joyce", "Jack", "Victoria",
"Dennis", "Olivia", "Jerry", "Christina", "Tyler", "Kelly", "Aaron", "Emma", "Jose", "Lauren"
]
last_names = [
"Smith", "Johnson", "Williams", "Brown", "Jones", "Garcia", "Miller", "Davis", "Rodriguez", "Martinez",
"Hernandez", "Lopez", "Gonzalez", "Wilson", "Anderson", "Thomas", "Taylor", "Moore", "Jackson", "Martin",
"Lee", "Perez", "Thompson", "White", "Harris", "Sanchez", "Clark", "Ramirez", "Lewis", "Robinson",
"Walker", "Young", "Allen", "King", "Wright", "Scott", "Torres", "Nguyen", "Hill", "Flores", "Green",
"Adams", "Nelson", "Baker", "Hall", "Rivera", "Campbell", "Mitchell", "Carter", "Roberts", "Gomez",
"Phillips", "Evans", "Turner", "Diaz", "Parker", "Cruz", "Edwards", "Collins", "Reyes", "Stewart",
"Morris", "Morales", "Murphy", "Cook", "Rogers", "Gutierrez", "Ortiz", "Morgan", "Cooper", "Peterson",
"Bailey", "Reed", "Kelly", "Howard", "Ramos", "Kim", "Cox", "Ward", "Richardson", "Watson",
"Brooks", "Chavez", "Wood", "James", "Bennett", "Gray", "Mendoza", "Ruiz", "Hughes", "Price",
"Alvarez", "Castillo", "Sanders", "Patel", "Myers", "Long", "Ross", "Foster", "Jimenez", "Powell"
]
domains = ['example.com', 'test.com', 'demo.com']
start_date = datetime.date(2000, 1, 1)
end_date = datetime.date(2023, 12, 31)
locations = ['AM', 'UT', 'RT', 'EN']
for i in range(NUM_ROWS):
first_name = f'{random.choice(first_names)}_{i%10000}'
last_name = f'{random.choice(last_names)}_{i%10000}'
email = f"{first_name.lower()}.{last_name.lower()}{i}@{random.choice(domains)}"
department_id = locations[random.randint(0, 3)] + str(random.randint(1, 100))
salary = round(random.uniform(30000, 150000), 2)
hire_date = random_date(start_date, end_date).isoformat()
cur.execute('''
INSERT INTO Employees (FirstName, LastName, Email, DepartmentID, Salary, HireDate)
VALUES (?, ?, ?, ?, ?, ?);
''', (first_name, last_name, email, department_id, salary, hire_date))
if i % 100000 == 0:
print(f"inserted numberd of rows {i}")
conn.commit()
conn.commit()
conn.close()
print(f"Database with {NUM_ROWS} rows created successfully!")
Download the generated database and upload it to DB-Browser.
For each of the following queries perform the following list of tasks:
- Run the query and record its running time.
- Create an index on the column that is specified in the query.
- Run the query again with the indexed column and record the running time.
The queries:
Since the salary is randomly generated, this value might not be in your database. You may start by displaying a few records from the Employees
table and pick one of the values in the salary column. You can use the code SELECT * FROM Employees LIMIT 100
to display the frst 100 records.
Similar to the previous query, you may display a few records first and select values that are in the same record for the DepartmentID
and the Salary
.
Additionally, try more queries in your own.