INFOMDWR: Course Syllabus 2024

Authors

Erik-Jan van Kesteren

Hakim Qahtan

Ayoub Bagheri

Published

September 5, 2024

1 Introduction

Data do not fall from heaven; they are created, manipulated, transformed, and cleaned. In any data analysis, therefore, the treatment of the data is just as important as the modeling techniques applied to them. In this course, you will get acquainted with and implement a variety of techniques to go from raw data to analyses, visualizations, and insights for science and business applications. This is an overview course designed to give you the tools and skills to use and evaluate data science methods.

The course consists of two parts, data wrangling and data analysis, which are intertwined.

Prerequisites

We assume that students joining the course have knowledge of statistics up to regression and analysis of variance, as well as some programming experience in languages such as R and Python.

Objectives

By the end of this course, students will have attained the following objectives:

  • Know, explain, and apply data retrieval from existing relational and non-relational databases, including text, using queries built from primitives such as select, subset, and join, both directly in a query language such as SQL and through the Python and R programming languages.
  • Know, explain, and apply common data clean-up procedures, including the handling of missing data with appropriate imputation methods, and feature selection.
  • Know, explain, and apply methodology to properly set up data analysis experiments, such as train/validation/test splits and the bias-variance trade-off.
  • Know, explain, and apply supervised machine learning algorithms for both classification and regression, as well as their related quality measures, such as AUC and the Brier score.
  • Know, explain, and apply unsupervised learning algorithms, such as clustering and (other) matrix factorization techniques that may or may not result in lower-dimensional data representations.
  • Be able to choose between the different techniques learned in the course and be able to explain why the chosen technique fits both the data and the research question best.

2 Course Policy

This course is worth 14 ECTS, which means it is designed as a full-time workload.

Weekly course flow

A regular week in this course consists of three lectures (Monday-Wednesday mornings) and three lab sessions (Monday-Wednesday afternoons). The material is introduced at a theoretical level in the lectures and then put into practice in the lab sessions. The practical work done in these labs is drawn from real-life situations, letting students experience how to solve data science problems.

In addition, students will spend time each week on the biweekly take-home group assignments.

  • The lectures are in person. The required readings should be read before the lecture; they are not optional.
  • The lab sessions are in-person interactive sessions in which you apply the methods you learn about in the lectures. The answers to the exercises in the labs are discussed at the end of each session.
  • The skills acquired in the lectures and the labs provide the basis for the biweekly take-home assignment. This assignment is completed in groups of 3-5 students and handed in via Brightspace.

Synchronous course policy

  • INFOMDWR is an offline-first course, with mostly in-person lectures and lab sessions.
  • We consider this offline-first setup important for interactive and collaborative learning.
  • If you miss a session, e.g., due to sickness, you should catch up in the regular way:
    • Do the required readings
    • Go through the lecture slides
    • Do the practicals
    • Ask your peers if you have questions
    • (after the above) ask the lab teacher for further explanation

Who to ask what

There are many teachers in this course. If you have questions, first ensure the answer isn’t in this syllabus and then consult the overview below:

  • Course proceedings: email the course coordinators
  • Content - general: email or ask the lab teachers
  • Practical content: email or ask the lab teachers
  • Assignment content: email or ask the lab teachers
  • Lecture content: email the lecturer (Hakim Qahtan / Daniel Oberski)

Grading policy

Your final grade in the course consists of the following grading components:

  • Biweekly assignments (20%): every two weeks, there is a group assignment. There are four assignments in total, each graded and worth 5% of the final grade.
  • Midterm exam (40%): Halfway through the course, there is a midterm exam. The content of this exam pertains to the first half of the course.
  • Final exam (40%): At the end of the course, there is a final exam. This exam emphasizes, but is not limited to, the material from the second half of the course.

To pass the course:

  • The weighted final grade of all grading components should be greater than or equal to 5.5.
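
To make the weighting concrete, here is a small sketch (in Python, with made-up grades) of how the final grade follows from the weights above:

    # Hypothetical grades, for illustration only
    assignments = [7.0, 8.0, 6.5, 7.5]   # four assignments, each worth 5%
    midterm, final = 6.0, 7.5            # each worth 40%

    weighted = 0.05 * sum(assignments) + 0.40 * midterm + 0.40 * final
    print(round(weighted, 2))            # 6.85, which is >= 5.5, so a pass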

Resit:

  • If you obtain a final grade between 4.0 and 5.4, you are eligible for the resit.
  • You can only retake one of the two exams. By default, this will be the one with the lowest grade. If you obtained the same grade for both, you will redo the final exam.
  • The grade attained in the resit will replace the grade from the selected exam.

3 Course Materials

Required Software

In this course, we will use a variety of software, but mainly SQLite, Python, and R. Try to install these on your computer by the start of the course; we will also have a set-up computer lab on the first day to help you with this process.

Installing DB Browser for SQLite

For the SQL parts, we recommend installing DB Browser for SQLite. Installation instructions for macOS, Windows, and Linux can be found here.
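
If you want to check that SQLite itself works on your machine independently of the DB Browser interface, the short sketch below uses Python's built-in sqlite3 module (for illustration only; the example table and values are made up, and the labs themselves use DB Browser and plain SQL):

    import sqlite3

    # Open an in-memory SQLite database (no file needed) and run a tiny query
    con = sqlite3.connect(":memory:")
    cur = con.cursor()
    cur.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT)")
    cur.execute("INSERT INTO student (name) VALUES (?)", ("Ada",))
    con.commit()
    print(cur.execute("SELECT id, name FROM student").fetchall())  # [(1, 'Ada')]
    con.close()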

Installing Python & Jupyter

For the Python parts of the course, we will use Google Colab, which is an interactive online notebook environment; this means no installation is necessary! However, you do need a Google account, so make sure you have one (or make one specifically for the course).
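
Once you have a Google account, a minimal first cell like the sketch below is enough to check that your Colab notebook works (pandas and numpy come preinstalled on Colab; the example data are made up):

    import numpy as np
    import pandas as pd

    # Build a tiny data frame and summarize it as a quick sanity check
    df = pd.DataFrame({"x": np.arange(5), "y": np.arange(5) ** 2})
    print(df.describe())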

Installing R & RStudio

First, install the latest version of R for your system (see https://cran.r-project.org/). Then, install the latest (desktop open source) version of the RStudio integrated development environment (link). We will make extensive use of the tidyverse suite of packages, which can be installed from within R using the command install.packages("tidyverse").

Required readings

Freely available sections from the following books (the abbreviations are used in the Reading column of the class schedule):

  • DBSC: Database System Concepts (Silberschatz, Korth, Sudarshan), db-book.com
  • MMDS: Mining Massive Datasets (Leskovec, Rajaraman, Ullman), mmds.org
  • PDA: Python for Data Analysis, 3E (Wes McKinney), wesmckinney.com/book/
  • DMCT: Data Mining: Concepts and Techniques (Han, Kamber, Pei), find it online for free
  • R4DS: R for Data Science (Grolemund & Wickham), r4ds.hadley.nz
  • ISLR: Introduction to Statistical Learning (James et al.), statlearning.com
  • DLBK: Deep Learning (Goodfellow, Bengio, Courville), deeplearningbook.org
  • FIMD: Flexible Imputation of Missing Data (van Buuren), stefvanbuuren.name/fimd
  • SLP3: Speech and Language Processing (Jurafsky & Martin), stanford.edu/~jurafsky/slp3
  • TTMR: Text Mining with R: A Tidy Approach (Silge & Robinson), tidytextmining.com/
  • OMNG: Operations Management, 4th ed. (Reid & Sanders), find it online for free

And (parts of) the following references:

  • Oberski, D.L. (2016). Mixture models: Latent profile and latent class analysis. Modern statistical methods for HCI, 275-287. URL
  • Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2021). A survey on bias and fairness in machine learning. ACM computing surveys (CSUR), 54(6), 1-35. URL
  • Hennig, C. (2015). Clustering strategy and method selection. arXiv preprint arXiv:1503.02059. URL
  • Bouveyron, C., Celeux, G., Murphy, T. B., & Raftery, A. E. (2019). Model-based clustering and classification for data science: with applications in R (Vol. 50). Cambridge University Press.
  • Some other freely available articles & chapters

4 Class Schedule

You can find the up-to-date class schedule with locations on mytimetable.uu.nl.

Week Date Topic Type Reading
1 2024-09-05 Introduction to the course Lecture Syllabus
1 2024-09-05 Introduction lab: setting up your computer Lab R4DS 3, 5, 7
1 2024-09-06 Data models Lecture DBSC 1, 3.2, 3.9
1 2024-09-06 Data models Lab
2 2024-09-09 Data extraction with SQL Lecture DBSC 2.6, 3.3 - 3.8
2 2024-09-09 Data extraction with SQL Lab
2 2024-09-10 Integrity constraints in databases Lecture DBSC 3.2.1, 4.4, 6.1, 6.2
2 2024-09-10 Integrity constraints in databases Lab
2 2024-09-11 Functional dependency Lecture DBSC 7.1-7.4.1
2 2024-09-11 Functional dependency Lab
3 2024-09-16 Indexing & data integration Lecture DBSC 14.1 - 14.7
3 2024-09-16 Indexing & data integration Lab
3 2024-09-17 Hetero. data analysis & string similarity Lecture MMDS 3.1-3.5
3 2024-09-17 Hetero. data analysis & string similarity Lab
3 2024-09-17 Social Event (De vagant) Social
3 2024-09-18 Data extraction in Python Lecture PDA 5, 6.1
3 2024-09-18 Data extraction in Python Lab
4 2024-09-23 Data preparation 1 Lecture DMCT 3.1, 3.2, 12.1-12.4, PDA 7.1
4 2024-09-23 Data preparation 1 Lab
4 2024-09-23 10:00AM assignment 1 Deadline
4 2024-09-24 Data preparation 2 Lecture DMCT 3.3-3.5, PDA 7.2
4 2024-09-24 Data preparation 2 Lab
4 2024-09-25 Data visualization Lecture R4DS 3, 5, 7 (optionally 2-8)
4 2024-09-25 Data visualization using ggplot Lab
5 2024-09-30 Exploratory data analysis Lecture R4DS 3, 5, 7 (optionally 2-8)
5 2024-09-30 Exploratory data analysis in R Lab
5 2024-10-01 Supervised learning: Regression Lecture ISLR 1, 2.1, 2.2, 3.1, 3.2, 3.3, 3.5
5 2024-10-01 Supervised learning: Regression models in R Lab
5 2024-10-02 Q&A Lecture
5 2024-10-02 No lab, time to study Lab
5 2024-10-04 Midterm exam Exam
6 2024-10-07 10:00AM assignment 2 Deadline
6 2024-10-07 Supervised learning: model evaluation Lecture ISLR 5.1, 8.1, 8.2.1, 8.2.2, 8.2.3
6 2024-10-07 Supervised learning: model evaluation Lab
6 2024-10-08 Supervised learning: classification Lecture ISLR 4.1, 4.2, 4.3, 4.4.1, 4.4.2
6 2024-10-08 Supervised learning: classification Lab
6 2024-10-09 Deep learning Lecture ISLR 10, DLBK 11, (optionally 6)
6 2024-10-09 Deep learning Lab
7 2024-10-14 Missing data 1: Mechanisms Lecture FIMD 1.1, 1.2, 1.3, 1.4
7 2024-10-14 Missing data mechanisms Lab
7 2024-10-15 Missing data 2: Solutions Lecture FIMD 1.1, 1.2, 1.3, 1.4
7 2024-10-15 Imputation methods Lab
7 2024-10-16 Clustering Lecture ISLR 12.1, 12.4
7 2024-10-16 Clustering Lab
8 2024-10-21 10:00AM assignment 3 Deadline
8 2024-10-21 Model-based clustering Lecture Oberski (2016); optional: Hennig (2015), Bouveyron et al. (2019)
8 2024-10-21 Model-based clustering using MClust Lab
8 2024-10-22 Text mining 1 Lecture SLP3 2.1, 2.4, 6.2, 6.3, 6.5, 6.8; TTMR 3
8 2024-10-22 Text mining 1 Lab
8 2024-10-23 Text mining 2 Lecture SLP3 2.1, 2.4, 6.2, 6.3, 6.5, 6.8; TTMR 3
8 2024-10-23 Text mining 2 Lab
8 2024-10-23 Inspecting the mid-term exam Exam Inspection
9 2024-10-28 Time series Lecture OMNG pages 265-294 (forecasting)
9 2024-10-28 Time series Lab
9 2024-10-29 Data streams Lecture MMDS 4.1 – 4.4
9 2024-10-29 Data streams Lab
9 2024-10-30 Algorithmic fairness Lecture Mehrabi et al. (2021)
9 2024-10-30 Algorithmic fairness Lab
10 2024-11-04 10:00AM assignment 4 Deadline
10 2024-11-05 No lecture: study time Lecture
10 2024-11-05 No lab: study time Lab
10 2024-11-06 Q&A Lecture
10 2024-11-06 No lab: study time Lab
10 2024-11-08 Final exam Exam
10 2025-01-06 Resit exam Exam