library(tidyverse)
library(keras3)
Supervised learning: deep learning
October 9, 2025
Introduction
In this practical, we will create a feed-forward neural network as well as an optional convolutional neural network to analyze the famous MNIST dataset.
Let's set the seed value and use the same number as below to reproduce the same results.
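A minimal sketch of how this could be done (the seed value 45 is illustrative, not necessarily the number used for the output below; tensorflow::set_random_seed() is assumed to be available alongside keras3):
set.seed(45)                      # seed for R's own random number generator
tensorflow::set_random_seed(45)   # seed for the TensorFlow backend used by keras3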
In this section, we will develop a deep feed-forward neural network for MNIST.
Data preparation
Load the MNIST data and inspect the structure of the resulting mnist object.
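The loading code is not echoed here; a minimal sketch using the built-in dataset_mnist() loader from keras3, which would produce the output below:
mnist <- dataset_mnist()  # list with $train and $test, each holding $x (images) and $y (labels)
str(mnist)                # structure of the object
range(mnist$train$x)      # pixel intensities run from 0 to 255
table(mnist$train$y)      # roughly 6000 training examples per digit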
List of 2
$ train:List of 2
..$ x: int [1:60000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
..$ y: int [1:60000(1d)] 5 0 4 1 9 2 1 3 1 4 ...
$ test :List of 2
..$ x: int [1:10000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
..$ y: int [1:10000(1d)] 7 2 1 0 4 1 4 9 5 9 ...
[1] 0 255
0 1 2 3 4 5 6 7 8 9
5923 6742 5958 6131 5842 5421 5918 6265 5851 5949
Use the plot_img() function below to plot the first training image. The img parameter has to be a matrix with dimensions (28, 28). Indexing in 3-dimensional arrays works the same as indexing in matrices, but you need an extra comma: x[1, , ], for example, selects the first 28 × 28 image.
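The definition of plot_img() is not reproduced above; a minimal sketch of such a helper (the grey-scale palette is an assumption), followed by a call that plots the first training image:
plot_img <- function(img, col = gray(seq(1, 0, length.out = 256)), ...) {
  # image() draws rows from the bottom up, so flip and transpose the matrix
  # to show the digit upright
  image(t(img[nrow(img):1, ]), col = col, axes = FALSE, ...)
}
plot_img(mnist$train$x[1, , ])  # the extra commas select the first 28 x 28 slice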
It is usually a good idea to normalize your features to a manageable, standard range before feeding them into a neural network.
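A common way to do this for MNIST (assumed in the sketches that follow) is to divide by the maximum pixel value, scaling everything into [0, 1]:
mnist$train$x <- mnist$train$x / 255
mnist$test$x <- mnist$test$x / 255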
Multi-layer perceptron: multinomial logistic regression
The simplest neural network model is a multi-layer perceptron with no hidden layers, only an input and an output layer. For the MNIST data this amounts to a multinomial logistic regression model with 10 output units (one probability, between 0 and 1, per digit). That model is shown below.
multinom <-
# initialize a sequential model
keras_model_sequential(input_shape = c(28, 28)) |>
# flatten 28*28 matrix into single vector
layer_flatten() |>
# softmax outcome == probability for each of 10 outputs
layer_dense(10, activation = "softmax")
multinom$compile(
loss = "sparse_categorical_crossentropy", # loss function for multinomial outcome
optimizer = "adam", # we use this optimizer because it works well
metrics = list("accuracy") # we want to know training accuracy in the end
)
Inspect the model using the summary() function. Describe why this model has 7850 parameters.
Model: "sequential"
┌───────────────────────────────┬─────────────────────┬────────────┐
│ Layer (type)                  │ Output Shape        │    Param # │
├───────────────────────────────┼─────────────────────┼────────────┤
│ flatten (Flatten)             │ (None, 784)         │          0 │
├───────────────────────────────┼─────────────────────┼────────────┤
│ dense (Dense)                 │ (None, 10)          │      7,850 │
└───────────────────────────────┴─────────────────────┴────────────┘
Total params: 7,850 (30.66 KB)
Trainable params: 7,850 (30.66 KB)
Non-trainable params: 0 (0.00 B)
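The flatten layer turns each 28 × 28 image into 784 inputs, and every input is connected to each of the 10 output units: 784 × 10 = 7840 weights, plus one bias per output unit, gives 7850 parameters.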
Deep feed-forward neural networks
Create a feed-forward neural network with the following elements:
- sequential model
- flatten layer
- dense layer with 64 hidden units and "relu" activation function
- dense output layer with 10 units and softmax activation function
You may reuse code from the multinomial model.
ffnn <-
# initialize a sequential model
keras_model_sequential(input_shape = c(28, 28)) |>
# flatten 28*28 matrix into single vector
layer_flatten() |>
# this is the hidden layer!
layer_dense(64, activation = "relu") |>
# softmax outcome == probability for each of 10 outputs
layer_dense(10, activation = "softmax")
ffnn$compile(
loss = "sparse_categorical_crossentropy", # loss function for multinomial outcome
optimizer = "adam", # we use this optimizer because it works well
metrics = list("accuracy") # we want to know training accuracy in the end
)
summary(ffnn)
Model: "sequential_1"
┌───────────────────────────────┬─────────────────────┬────────────┐
│ Layer (type)                  │ Output Shape        │    Param # │
├───────────────────────────────┼─────────────────────┼────────────┤
│ flatten_1 (Flatten)           │ (None, 784)         │          0 │
├───────────────────────────────┼─────────────────────┼────────────┤
│ dense_1 (Dense)               │ (None, 64)          │     50,240 │
├───────────────────────────────┼─────────────────────┼────────────┤
│ dense_2 (Dense)               │ (None, 10)          │        650 │
└───────────────────────────────┴─────────────────────┴────────────┘
Total params: 50,890 (198.79 KB)
Trainable params: 50,890 (198.79 KB)
Non-trainable params: 0 (0.00 B)
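The calls that fit the two models and produce the confusion matrices below are not echoed here. A sketch of how they might look, assuming the normalized data from earlier, ten epochs with a 20% validation split (mirroring the later fit() calls), and a hypothetical class_predict() helper:
multinom |> fit(x = mnist$train$x, y = mnist$train$y, epochs = 10, validation_split = 0.2, verbose = 1)
ffnn |> fit(x = mnist$train$x, y = mnist$train$y, epochs = 10, validation_split = 0.2, verbose = 1)
# hypothetical helper: the predicted class is the column with the highest
# probability, minus 1 because the digit labels run from 0 to 9
class_predict <- function(model, x) {
  probs <- predict(model, x)        # one row of 10 class probabilities per image
  apply(probs, 1, which.max) - 1    # index of the largest probability, shifted to 0-9
}
pred_multinom <- class_predict(multinom, mnist$test$x)
pred_ffnn <- class_predict(ffnn, mnist$test$x)
table(pred = pred_multinom, true = mnist$test$y)  # confusion matrix, multinom
table(pred = pred_ffnn, true = mnist$test$y)      # confusion matrix, ffnn
mean(pred_multinom == mnist$test$y)               # test accuracy, multinom
mean(pred_ffnn == mnist$test$y)                   # test accuracy, ffnn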
313/313 - 0s - 739us/step
313/313 - 0s - 858us/step
true
pred 0 1 2 3 4 5 6 7 8 9
0 949 0 3 1 1 6 7 1 7 10
1 0 1115 8 0 1 2 3 6 7 7
2 3 3 935 20 6 4 9 19 9 1
3 1 2 17 925 1 38 1 8 25 13
4 0 0 10 0 916 10 7 5 9 23
5 7 1 4 18 0 770 10 0 26 5
6 14 4 13 2 9 16 917 0 9 0
7 5 2 12 14 7 10 2 962 15 34
8 1 8 27 21 6 29 2 2 857 3
9 0 0 3 9 35 7 0 25 10 913
true
pred 0 1 2 3 4 5 6 7 8 9
0 967 1 4 1 2 2 4 1 7 3
1 1 1118 0 0 0 0 3 3 1 4
2 0 5 1015 8 1 1 4 14 9 1
3 1 4 0 986 1 16 3 1 6 5
4 1 0 1 0 960 1 3 0 4 9
5 2 1 0 5 0 864 8 1 6 6
6 4 1 2 0 4 2 931 0 2 0
7 1 2 6 6 2 0 0 1004 3 6
8 3 3 4 4 1 6 2 0 932 11
9 0 0 0 0 11 0 0 4 4 964
[1] 0.9259
[1] 0.9741
Now create a deeper feed-forward neural network with the following elements:
- sequential model
- flatten layer
- dense layer with 128 hidden units and "relu" activation function
- dense layer with 64 hidden units and "relu" activation function
- dense output layer with 10 units and softmax activation function
dffnn <-
keras_model_sequential(input_shape = c(28, 28)) |> # initialize a sequential model
layer_flatten() |> # flatten 28*28 matrix into single vector
layer_dense(128, activation = "relu") |> # this is the hidden layer!
layer_dense(64, activation = "relu") |> # this is the hidden layer!
layer_dense(10, activation = "softmax") # softmax outcome == logistic regression for each of 10 outputs
dffnn$compile(
loss = "sparse_categorical_crossentropy", # loss function for multinomial outcome
optimizer = "adam", # we use this optimizer because it works well
metrics = list("accuracy") # we want to know training accuracy in the end
)
summary(dffnn)
dffnn |> fit(x = mnist$train$x, y = mnist$train$y, epochs = 10, validation_split = 0.2, verbose = 1)
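The test-set performance below can be computed by reusing the hypothetical class_predict() helper sketched earlier:
pred_dffnn <- class_predict(dffnn, mnist$test$x)
table(pred = pred_dffnn, true = mnist$test$y)  # confusion matrix
mean(pred_dffnn == mnist$test$y)               # test accuracy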
313/313 - 0s - 950us/step
true
pred 0 1 2 3 4 5 6 7 8 9
0 969 0 2 0 0 3 3 0 3 0
1 1 1124 0 1 1 0 3 6 3 4
2 0 2 999 3 1 0 0 8 2 0
3 0 2 7 966 0 2 0 7 2 5
4 0 0 4 1 967 2 3 4 1 11
5 0 0 0 17 0 866 2 0 0 7
6 3 5 8 0 6 12 945 1 5 0
7 1 0 4 1 0 1 0 986 2 3
8 4 2 8 15 1 4 1 4 951 2
9 2 0 0 6 6 2 1 12 5 977
[1] 0.975
OPTIONAL: convolutional neural network
Convolution layers in Keras need a specific form of data input.
For each example, they need a (width, height, channels) array (tensor). For a 28 × 28 colour image, that shape is usually (28, 28, 3), where the channels indicate red, green, and blue. MNIST has no colour information, but we still need the channel dimension to enter the data into a convolution layer, so each image gets shape (28, 28, 1). The training dataset x_train should thus have shape (60000, 28, 28, 1).
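The reshaping code is not shown; one possible sketch adds a trailing axis of length 1 in place (because the new dimension has size 1, the pixel values keep their original layout). Working directly on mnist$train$x and mnist$test$x keeps this consistent with the later fit() call for cnn_2, which passes mnist$train$x directly:
dim(mnist$train$x) <- c(dim(mnist$train$x), 1)  # (60000, 28, 28) -> (60000, 28, 28, 1)
dim(mnist$test$x) <- c(dim(mnist$test$x), 1)    # (10000, 28, 28) -> (10000, 28, 28, 1)
dim(mnist$train$x)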
cnn <-
keras_model_sequential(input_shape = c(28, 28, 1)) |>
layer_conv_2d(filters = 6, kernel_size = c(5, 5)) |>
layer_max_pooling_2d(pool_size = c(4, 4)) |>
layer_flatten() |>
layer_dense(units = 32, activation = "relu") |>
layer_dense(10, activation = "softmax")
cnn |>
compile(
loss = "sparse_categorical_crossentropy",
optimizer = "adam",
metrics = c("accuracy")
)
Model: "sequential_3"
┌───────────────────────────────┬─────────────────────┬────────────┐
│ Layer (type)                  │ Output Shape        │    Param # │
├───────────────────────────────┼─────────────────────┼────────────┤
│ conv2d (Conv2D)               │ (None, 24, 24, 6)   │        156 │
├───────────────────────────────┼─────────────────────┼────────────┤
│ max_pooling2d (MaxPooling2D)  │ (None, 6, 6, 6)     │          0 │
├───────────────────────────────┼─────────────────────┼────────────┤
│ flatten_3 (Flatten)           │ (None, 216)         │          0 │
├───────────────────────────────┼─────────────────────┼────────────┤
│ dense_6 (Dense)               │ (None, 32)          │      6,944 │
├───────────────────────────────┼─────────────────────┼────────────┤
│ dense_7 (Dense)               │ (None, 10)          │        330 │
└───────────────────────────────┴─────────────────────┴────────────┘
Total params: 7,430 (29.02 KB)
Trainable params: 7,430 (29.02 KB)
Non-trainable params: 0 (0.00 B)
# First, the input layer receives the images with a single channel: shape (28, 28, 1).
# Then there is a 2d convolution layer with 6 filters and a kernel size of 5 (in each direction),
# followed by max-pooling, which shrinks each of the 6 resulting feature maps by a factor of 4 in each direction.
# Afterwards, we flatten the feature maps into a single vector.
# Then comes a dense hidden layer with 32 units and a relu activation function.
# Lastly, the output layer is the same as before: 10 units with softmax activation.
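The training and evaluation calls for the cnn model are again not echoed; a sketch, assuming the reshaped arrays and the hypothetical class_predict() helper from above:
cnn |> fit(x = mnist$train$x, y = mnist$train$y, epochs = 10, validation_split = 0.2, verbose = 1)
pred_cnn <- class_predict(cnn, mnist$test$x)
table(pred = pred_cnn, true = mnist$test$y)  # confusion matrix
mean(pred_cnn == mnist$test$y)               # test accuracy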
313/313 - 0s - 1ms/step
true
pred 0 1 2 3 4 5 6 7 8 9
0 978 0 2 1 1 2 6 0 12 3
1 0 1128 4 0 0 0 5 2 1 3
2 0 2 1007 1 1 1 0 8 4 1
3 0 1 9 1005 0 12 0 3 10 3
4 0 2 2 0 973 0 5 4 7 10
5 0 0 0 0 0 865 1 0 6 2
6 0 2 0 0 1 8 941 0 4 0
7 1 0 4 1 1 1 0 1005 6 5
8 1 0 3 2 0 1 0 2 916 1
9 0 0 1 0 5 2 0 4 8 981
[1] 0.9799
Here are some things you could do:
- Reduce the convolution filter size & the pooling size and add a second convolutional & pooling layer with double the number of filters
- Add a dropout layer after the flatten layer
- Look up on the internet what works well and implement it!
cnn_2 <-
keras_model_sequential(input_shape = c(28, 28, 1)) |>
layer_conv_2d(filters = 6, kernel_size = c(3, 3)) |>
layer_max_pooling_2d(pool_size = c(2, 2)) |>
layer_conv_2d(filters = 12, kernel_size = c(3, 3)) |>
layer_max_pooling_2d(pool_size = c(2, 2)) |>
layer_flatten() |>
layer_dropout(rate = 0.2) |>
layer_dense(units = 32, activation = "relu") |>
layer_dense(10, activation = "softmax")
cnn_2 |>
compile(
loss = "sparse_categorical_crossentropy",
optimizer = "adam",
metrics = c("accuracy")
)
cnn_2 |> fit(x = mnist$train$x, y = mnist$train$y, epochs = 10, validation_split = 0.2, verbose = 1)
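Evaluated in the same way (again using the hypothetical class_predict() helper):
pred_cnn_2 <- class_predict(cnn_2, mnist$test$x)
table(pred = pred_cnn_2, true = mnist$test$y)  # confusion matrix
mean(pred_cnn_2 == mnist$test$y)               # test accuracy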
313/313 - 0s - 1ms/step
true
pred 0 1 2 3 4 5 6 7 8 9
0 973 1 4 2 0 2 6 1 2 3
1 0 1104 1 0 0 0 1 1 0 1
2 1 2 1009 1 4 1 0 14 0 0
3 0 5 5 1000 0 3 1 3 2 1
4 0 0 1 0 972 0 5 1 1 8
5 0 1 0 3 0 879 4 0 1 2
6 1 4 0 0 0 2 941 0 1 0
7 0 1 4 1 0 1 0 999 2 1
8 4 17 8 3 0 4 0 3 963 4
9 1 0 0 0 6 0 0 6 2 989
[1] 0.9829