Week | Date | Topic | Type | Reading |
---|---|---|---|---|
1 | 2023-09-07 | Introduction to the course | Lecture | Syllabus |
1 | 2023-09-07 | Introduction lab: setting up your computer | Lab | PDSH 1 and R4DS 3, 5, 7 |
1 | 2023-09-08 | Data models | Lecture | DBSC 1, 3.2, 3.9 |
1 | 2023-09-08 | Data models | Lab | |
2 | 2023-09-11 | Data extraction with SQL | Lecture | DBSC 2.6, 3.3-3.8 |
2 | 2023-09-11 | Data extraction with SQL | Lab | |
2 | 2023-09-12 | Integrity constraints in databases | Lecture | DBSC 3.2.1, 4.4, 6.1, 6.2 |
2 | 2023-09-12 | Integrity constraints in databases | Lab | |
2 | 2023-09-13 | Functional dependency & data integration | Lecture | DBSC 7.1-7.4.1 |
2 | 2023-09-13 | Functional dependency & data integration | Lab | |
3 | 2023-09-18 | Heterogeneous data analysis & string similarity | Lecture | MMDS 3.1-3.5 |
3 | 2023-09-18 | Heterogeneous data analysis & string similarity | Lab | |
3 | 2023-09-19 | Data extraction in Python | Lecture | PDSH 3 |
3 | 2023-09-19 | Data extraction in Python | Lab | |
3 | 2023-09-20 | Data preparation 1 | Lecture | DMCT 3.1, 3.2, 12.1-12.4 |
3 | 2023-09-20 | Data preparation 1 | Lab | |
4 | 2023-09-25 | 10:00 AM assignment 1 | Deadline | |
4 | 2023-09-25 | Data preparation 2 | Lecture | DMCT 3.3-3.5 |
4 | 2023-09-25 | Data preparation 2 | Lab | |
4 | 2023-09-26 | Guest lecture on cloud computing | Lecture | TBD |
4 | 2023-09-26 | Cloud computing | Lab | |
4 | 2023-09-27 | Data visualization | Lecture | R4DS 3, 5, 7 (optionally 2-8) |
4 | 2023-09-27 | Data visualization using ggplot | Lab | |
5 | 2023-10-02 | Exploratory data analysis | Lecture | R4DS 3, 5, 7 (optionally 2-8) |
5 | 2023-10-02 | Exploratory data analysis in R | Lab | |
5 | 2023-10-03 | Supervised learning: regression | Lecture | ISLR 1, 2.1, 2.2, 3.1, 3.2, 3.3, 3.5 |
5 | 2023-10-03 | Supervised learning: regression models in R | Lab | |
5 | 2023-10-04 | Q&A | Lecture | |
5 | 2023-10-04 | No lab: time to study | Lab | |
5 | 2023-10-06 | Midterm exam | Exam | |
6 | 2023-10-09 | 10:00 AM assignment 2 | Deadline | |
6 | 2023-10-09 | Supervised learning: model evaluation | Lecture | ISLR 5.1, 8.1, 8.2.1, 8.2.2, 8.2.3 |
6 | 2023-10-09 | Supervised learning: model evaluation | Lab | |
6 | 2023-10-10 | Supervised learning: classification | Lecture | ISLR 4.1, 4.2, 4.3, 4.4.1, 4.4.2 |
6 | 2023-10-10 | Supervised learning: classification | Lab | |
6 | 2023-10-11 | Deep learning | Lecture | ISLR 10, DLBK 11 (optionally 6) |
6 | 2023-10-11 | Deep learning | Lab | |
7 | 2023-10-16 | Missing data 1: mechanisms | Lecture | FIMD 1.1-1.4 |
7 | 2023-10-16 | Missing data mechanisms | Lab | |
7 | 2023-10-17 | Missing data 2: solutions | Lecture | FIMD 1.1-1.4 |
7 | 2023-10-17 | Imputation methods | Lab | |
7 | 2023-10-18 | Clustering | Lecture | ISLR 12.1, 12.4 |
7 | 2023-10-18 | Clustering | Lab | |
8 | 2023-10-23 | 10:00 AM assignment 3 | Deadline | |
8 | 2023-10-23 | Model-based clustering | Lecture | Oberski (2016); optionally Hennig (2016), Bouveyron et al. (2019) |
8 | 2023-10-23 | Model-based clustering using MClust | Lab | |
8 | 2023-10-24 | Text mining 1 | Lecture | SLP3 2.1, 2.4, 6.2, 6.3, 6.5, 6.8; TTMR 3 |
8 | 2023-10-24 | Text mining 1 | Lab | |
8 | 2023-10-25 | Text mining 2 | Lecture | SLP3 2.1, 2.4, 6.2, 6.3, 6.5, 6.8; TTMR 3 |
8 | 2023-10-25 | Text mining 2 | Lab | |
9 | 2023-10-30 | Time series | Lecture | OMNG pages 265-294 (forecasting) |
9 | 2023-10-30 | Time series | Lab | |
9 | 2023-10-31 | Data streams | Lecture | MMDS 4.1-4.4 |
9 | 2023-10-31 | Data streams | Lab | |
9 | 2023-11-01 | Algorithmic fairness | Lecture | Mehrabi et al. (2021) |
9 | 2023-11-01 | Algorithmic fairness | Lab | |
10 | 2023-11-06 | 10:00 AM assignment 4 | Deadline | |
10 | 2023-11-07 | No lecture: study time | Lecture | |
10 | 2023-11-07 | No lab: study time | Lab | |
10 | 2023-11-08 | Q&A | Lecture | |
10 | 2023-11-08 | No lab: study time | Lab | |
10 | 2023-11-10 | Final exam | Exam | |
10 | 2024-01-15 | Resit exam | Exam | |

# INFOMDWR: Course Syllabus 2023

# 1 Introduction

Data do not fall from heaven: they are created, manipulated, transformed, and cleaned. In any data analysis, therefore, the treatment of the data itself is just as important as the modeling techniques applied to it. In this course, you will become acquainted with, and implement, a variety of techniques for going from raw data to analyses, visualizations, and insights for science and business applications. This is an overview course designed to give you the tools and skills to use and evaluate data science methods.

The course consists of two parts, data wrangling and data analysis, which are intertwined.

## Prerequisites

We assume that students joining the course have knowledge of statistics up to regression and analysis of variance, as well as some experience programming in languages such as R and Python.

## Objectives

By the end of this course, students will have attained the following objectives:

- Know, explain, and apply data retrieval from existing relational and non-relational databases, including text, using queries built from primitives such as select, subset, and join, both directly in, e.g., SQL and through the Python and R programming languages.
- Know, explain, and apply common data clean-up procedures, including handling missing data with appropriate imputation methods, and feature selection.
- Know, explain, and apply methodology to properly set up data analysis experiments, such as train/validate/test splits and the bias-variance trade-off.
- Know, explain, and apply supervised machine learning algorithms, both for classification and regression purposes, as well as their related quality measures, such as AUC and Brier scores.
- Know, explain, and apply unsupervised learning algorithms, such as clustering and (other) matrix factorization techniques that may or may not result in lower-dimensional data representations.
- Be able to choose between the different techniques learned in the course and to explain why the chosen technique best fits both the data and the research question.
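As a taste of the first objective, here is a minimal sketch of select, subset, and join through Python's built-in `sqlite3` module. The table and column names are invented for illustration; the labs use their own datasets.

```python
import sqlite3

# Throwaway in-memory database with two tiny example tables.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE student (id INTEGER, name TEXT)")
cur.execute("CREATE TABLE grade (student_id INTEGER, score REAL)")
cur.executemany("INSERT INTO student VALUES (?, ?)",
                [(1, "Ada"), (2, "Grace")])
cur.executemany("INSERT INTO grade VALUES (?, ?)",
                [(1, 8.5), (2, 6.0)])

# select (SELECT), subset (WHERE), and join (JOIN) in one query.
rows = cur.execute(
    """SELECT s.name, g.score
       FROM student AS s
       JOIN grade AS g ON s.id = g.student_id
       WHERE g.score >= 7.0"""
).fetchall()
print(rows)  # [('Ada', 8.5)]
con.close()
```

The same query can be issued from R (e.g., via DBI) or against a database opened in DB Browser for SQLite.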

# 2 Course Policy

This course is worth 14 ECTS, which means it is designed to give a full-time workload.

## Weekly course flow

A regular week in this course consists of three lectures (Monday-Wednesday mornings) and three lab sessions (Monday-Wednesday afternoons). The material is introduced at a theoretical level in the lectures and then put into practice in the lab sessions. The practical work in these labs draws on real-life situations, letting students experience how to solve data science problems.

In addition, students will spend time each week on the biweekly take-home group assignments.

- The **lectures** are in-person. The required readings should be read before the lecture; they are *not* optional.
- The **lab sessions** are in-person interactive sessions in which you apply the methods you learn about in the lectures. The answers to the exercises are discussed at the end of each session.
- The skills acquired in the lectures and the labs provide the basis for doing the biweekly **take-home assignment**. This assignment is made in groups of 3-5 students and handed in via Blackboard.

## Synchronous course policy

- INFOMDWR is an offline-first course, with mostly in-person lectures and lab sessions.
- We find it important for interactive and collaborative learning that the course is offline-first.
- Due to a lack of space at UU, some lectures will be hybrid, with a maximum of 50-60 in-person spots. You will receive a Teams meeting invitation for these sessions so you can follow online if you prefer. Note: these lectures are still synchronous, so no recordings will be made available, as per university policy.
- If you miss a session, e.g., due to sickness, you should catch up in the regular way:
- Read the readings
- Go through the lecture slides
- Do the practicals
- Ask your peers if you have questions
- (after the above) ask the lab teacher for further explanation

## Who to ask what

There are many teachers in this course. If you have questions, first **ensure the answer isn’t in this syllabus** and then follow the table below:

Question type | How to ask |
---|---|
Course proceedings | Email course coordinators |
Content - general | Email / ask lab teachers |
Practical content | Email / ask lab teachers |
Assignment content | Email / ask lab teachers |
Lecture content | Email lecturer (Hakim Qahtan / Daniel Oberski) |

## Grading policy

Your final grade in the course consists of the following parts:

- Biweekly assignments (20%): every two weeks there is a group assignment. Each assignment is graded and worth 5%.
- Midterm exam (40%): halfway through the course there is a midterm exam, covering everything in the first half of the course.
- Final exam (40%): at the end of the course there is a final exam, focused on the second half of the course (but note that the material builds on the first half).
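The weighting above can be illustrated with a short worked example; the grades below are invented for illustration only.

```python
# Hypothetical grades: four biweekly assignments (5% each, 20% total),
# a midterm (40%), and a final exam (40%).
assignments = [7.0, 8.0, 6.5, 9.0]
midterm = 6.0
final = 7.5

assignment_part = sum(0.05 * g for g in assignments)  # contributes 20%
final_grade = assignment_part + 0.40 * midterm + 0.40 * final
passed = final_grade >= 5.5

print(f"final grade: {final_grade:.2f}, passed: {passed}")
```

With these numbers the weighted grade works out to roughly 6.9, which is above the 5.5 passing threshold.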

To pass the course:

- The final grade should be greater than or equal to 5.5.

Resit:

- If you obtain a grade between 4.0 and 5.4 on the midterm or the final exam, you are eligible for the resit. The grade attained in the resit will replace the grade from the exam.
- You can only retake one of the two exams. By default, this will be the exam with the lowest grade (between 4.0 and 5.4). If you obtained the same grade on both, you will retake the final exam.

# 3 Course materials

## Required Software

In this course, we will use a variety of software, but mainly SQLite, Python, and R. Try to install these on your computer by the start of the course; we will also have a set-up computer lab on the first day to help you with this process.

**Installing DB Browser for SQLite** For the SQL parts, we recommend installing DB Browser for SQLite. Installation instructions for macOS, Windows, and Linux can be found here.

**Installing Python & Jupyter** For the Python parts of the course, we will use Google Colab, which is an interactive online notebook environment; this means no installation is necessary! However, you do need a Google account, so make sure you have one (or make one specifically for the course).

**Installing R & RStudio** First, install the latest version of R for your system (see `https://cran.r-project.org/`). Then, install the latest (desktop open source) version of the RStudio integrated development environment (`link`). We will make extensive use of the `tidyverse` suite of packages, which can be installed from within `R` using the command `install.packages("tidyverse")`.

## Required readings

Freely available sections from the following books:

Book | Title (Authors) | URL |
---|---|---|
`DBSC` | Database System Concepts (Silberschatz, Korth, Sudarshan) | db-book.com |
`MMDS` | Mining Massive Datasets (Leskovec, Rajaraman, Ullman) | mmds.org |
`PDSH` | Python Data Science Handbook (VanderPlas) | jakevdp.github.io/PythonDataScienceHandbook |
`DMCT` | Data Mining: Concepts and Techniques (Han, Kamber, Pei) | Find it online for free. |
`R4DS` | R for Data Science (Grolemund & Wickham) | r4ds.hadley.nz |
`ISLR` | Introduction to Statistical Learning (James et al.) | statlearning.com |
`DLBK` | Deep Learning (Goodfellow, Bengio, Courville) | deeplearningbook.org |
`FIMD` | Flexible Imputation of Missing Data (van Buuren) | stefvanbuuren.name/fimd |
`SLP3` | Speech and Language Processing (Jurafsky & Martin) | stanford.edu/~jurafsky/slp3 |
`TTMR` | Text Mining with R: A Tidy Approach (Silge & Robinson) | tidytextmining.com |
`OMNG` | Operations Management, 4th ed. (Reid & Sanders) | Find it online for free. |

And (parts of) the following references:

- Oberski, D.L. (2016). Mixture models: Latent profile and latent class analysis. Modern statistical methods for HCI, 275-287. URL
- Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2021). A survey on bias and fairness in machine learning. ACM computing surveys (CSUR), 54(6), 1-35. URL
- Hennig, C. (2015). Clustering strategy and method selection. arXiv preprint arXiv:1503.02059. URL
- Bouveyron, C., Celeux, G., Murphy, T. B., & Raftery, A. E. (2019). Model-based clustering and classification for data science: with applications in R (Vol. 50). Cambridge University Press.
- Some other freely available articles & chapters

# 4 Class Schedule

You can find the up-to-date class schedule with locations on mytimetable.uu.nl.