9.10. What is a convolutional neural network?

Notes by Alberto Garcia (2021)

A convolutional neural network (CNN) is a machine learning algorithm that is useful for classifying images. It takes an image as input, assigns importance (through trainable weights) to various features in the image, such as objects, and learns how to differentiate between classes of images.

[Figure: example CNN architecture. Image taken from useful links #1.]

Why are traditional neural networks not enough?

Although neural networks (NNs) are a great tool for making predictions, there are a few reasons why plain NNs are not well suited to images:

  • Multi-layer perceptrons use one input per pixel. If the image is 224x224, that is ~50,000 inputs. If the image is in color, there are three values per pixel, so 224x224x3 ≈ 150,000 inputs. Every neuron in the first hidden layer then needs ~150,000 weights to be trained, which can lead to overfitting and a slow training process.

  • NNs are not translationally invariant. If an important feature changes location from one picture to another, the NN tries to correct for that change separately at each location, so the weights are not trained efficiently.

Why use convolutional neural networks?

  • The influence of neighboring pixels is analyzed by something called a filter. This reduces the number of weights by picking out the important overall feature in each section of the image.

  • The pooling layer also serves as a way to reduce the dimensions.

  • CNNs are translationally invariant: they do not care exactly where a feature is in the image, only whether the feature exists.

Different layers of a CNN

Convolutional layer:

This layer is used to extract features from the input image by use of filters whose elements are weights that undergo a training process. These filters are typically 3x3 or 5x5 matrices that get convolved with the image pixels. Even-sized filters are avoided since we want the filter to have a center cell; without one, there can be alignment problems moving to the next layer. We will restrict the discussion to greyscale images for now.

  • The size of the filter is much smaller than the size of the input image. A feature map is obtained by taking the scalar (dot) product between the filter and each patch of the image. Because the same filter is applied across the whole image, the filter can discover a feature anywhere in that image (the translation invariance mentioned above). We can conclude that the filter tells us whether a feature is present, as opposed to where it is in the image.

  • The filters themselves are initialized with random numbers and get updated as the network is trained. There are particular filters that correspond to operations such as edge detection and image sharpening. The output of a filter is a feature map; these are used to predict the class of each image. The number of filters used can be chosen. An example of the convolution operation is

\[\begin{split} \mbox{image} = \begin{pmatrix} 3 & 0 & 1 & 5 \\ 2 & 6 & 2 & 4 \\ 2 & 4 & 1 & 0 \\ 2 & 3 & 1 & 4 \end{pmatrix}, \quad \quad \mbox{filter} = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix} \end{split}\]
  • The first element of the feature map is:

\[ (3)(-1) + (0)(0) + (1)(1) + (2)(-2) + (6)(0) + (2)(2) + (2)(-1) +(4)(0) + (1)(1) = -3 \]
  • The resulting feature map is (a quick NumPy check follows the example):

\[\begin{split} \mbox{feature map} = \mbox{image} * \mbox{filter} = \begin{pmatrix} -3 & -3 \\ -3 & -9 \end{pmatrix} \end{split}\]
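
As a sanity check, this convolution can be reproduced with a few lines of NumPy. This is a minimal sketch assuming the 4x4 image and 3x3 filter above, stride 1 and no padding:

```python
import numpy as np

image = np.array([[3, 0, 1, 5],
                  [2, 6, 2, 4],
                  [2, 4, 1, 0],
                  [2, 3, 1, 4]])

filt = np.array([[-1, 0, 1],
                 [-2, 0, 2],
                 [-1, 0, 1]])

# Slide the 3x3 filter over the 4x4 image (stride 1, no padding):
# each output element is the sum of an elementwise product.
feature_map = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        feature_map[i, j] = np.sum(image[i:i+3, j:j+3] * filt)

print(feature_map)   # [[-3. -3.]
                     #  [-3. -9.]]
```
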
  • Once the convolution happens, the feature maps from all the filters are stacked up; this stack represents the new image. The feature maps are then passed through an activation function, which decides whether a certain feature is present anywhere in the image.

  • If we think about traditional neural networks and the transition from input to hidden layer, the operation is

\[ \vec{X}_{\mbox{train}} \cdot \vec{W} + b_0 w_0. \]
  • We have a similar operation in the convolutional layer of a CNN. In this case, the elements of \(\vec{X}_{\mbox{train}}\) are the pixels of the input image and the elements of \(\vec{W}\) are the values of the filter. These get multiplied in the form of a scalar product, summed up, and passed into the activation function. In turn, each element of the feature map goes through the activation function. This amounts to the operation (a small NumPy illustration follows the equation)

\[ h_{\mbox{out}} = f \big( \vec{X}_{\mbox{train}} \cdot \vec{W} + b_0 w_0 \big). \]
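
Continuing the NumPy example above, here is a sketch of this step: each feature-map value is passed elementwise through an activation function \(f\) (ReLU, introduced below, is used here; the bias term \(b_0 w_0\) is omitted for simplicity):

```python
import numpy as np

feature_map = np.array([[-3, -3],
                        [-3, -9]])

# Elementwise activation f applied to every feature-map value
# (ReLU keeps positive values and throws out negative ones).
h_out = np.maximum(0, feature_map)
print(h_out)   # [[0 0]
               #  [0 0]]
```
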
  • As mentioned above, certain filters pick out certain features. We do not concern ourselves with picking any specific filter: the point of the convolutional layer is to train the filters to recognize features. The network is forced to learn how to properly extract features in order to minimize the loss. This is similar to ordinary neural networks, where we compare the prediction to the true value and the error has to drop below some threshold for the training to be complete. The completed “product” is a CNN whose many filters are built in such a way that, in tandem, they pick out the most important features of an image to properly classify it.

How do we choose parameters for the convolutional layer?

Examples (in Keras, the first argument of Conv2D is the number of filters, the second is the filter size, and input_shape gives the pixel dimensions and number of color channels of the input image):

```python
from tensorflow.keras import layers, models

network = models.Sequential()
network.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)))
network.add(layers.Conv2D(64, (5, 5), activation='relu'))
```

Activation function

  • Once we have a stack of feature maps, each value of the maps is passed through a function known as the activation function. It is important to choose a non-linear activation function: without one, stacked layers would collapse into a single linear transformation, while the data one passes into the network is usually non-linear. The non-linearity allows the CNN to generalize better.

  • A function is chosen in order to set boundaries on the values passed through. This forces the values into a certain range and reduces the chances of the weights blowing up. The most used activation function in CNNs (according to multiple articles) is the ReLU (rectified linear unit). The main reasons for its use are that it is computationally cheap and it throws out negative numbers. Its gradient is either \(0\) or \(1\), which makes the training portion a lot faster compared to using sigmoid or tanh.

\[\begin{split} \mbox{ReLU function} \; \longrightarrow \; f(x) = \begin{cases} 0, \quad \mbox{for} \quad x<0 \\ x, \quad \mbox{for} \quad x \ge 0 \end{cases} \end{split}\]
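
A minimal NumPy sketch of ReLU and the 0-or-1 gradient mentioned above:

```python
import numpy as np

def relu(x):
    # Negative values are thrown out, positive values pass through.
    return np.maximum(0, x)

def relu_grad(x):
    # The gradient is 0 for x < 0 and 1 for x > 0, which keeps
    # backpropagation cheap compared to sigmoid or tanh.
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))   # [0. 0. 0. 1. 1.]
```
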

Pooling layer

  • The function of the pooling layer is to reduce the size of the feature maps in order to reduce the number of parameters needed, thus decreasing the computational time. These layers operate on each feature map independently. Values from the feature maps are selected and used as inputs for the subsequent layers. There are different ways to select these values; the most common is max pooling, which grabs the largest value inside a window of pre-determined size. In addition, you can pick a stride, which sets how far the window moves across the feature map at each step. For example, let's consider a pooling layer with size 2x2 and stride 2:

\[\begin{split} \mbox{feature map} = \begin{pmatrix} 1 & 1 & 2 & 4 \\ 5 & 6 & 7 & 8 \\ 3 & 2 & 1 & 0 \\ 1 & 2 & 3 & 4 \end{pmatrix}, \quad \longrightarrow 2x2 \; \mbox{pooling filter} \longrightarrow \quad \mbox{output} = \begin{pmatrix} 6 & 8 \\ 3 & 4 \end{pmatrix} \end{split}\]
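
The same 2x2, stride-2 max pooling can be checked in a few lines of NumPy (a sketch using the feature map above):

```python
import numpy as np

feature_map = np.array([[1, 1, 2, 4],
                        [5, 6, 7, 8],
                        [3, 2, 1, 0],
                        [1, 2, 3, 4]])

# 2x2 window, stride 2: take the max of each non-overlapping block.
pooled = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        pooled[i, j] = feature_map[2*i:2*i+2, 2*j:2*j+2].max()

print(pooled)   # [[6. 8.]
                #  [3. 4.]]
```
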

How do we choose the size of the pooling layer?

  • The size of the pooling layer is usually chosen to be 2x2 or 3x3. The point of the pooling layer is to reduce the size of the feature maps, but as the pooling size increases, the resolution decreases. Sticking with 2x2 or 3x3 windows keeps the reduction from losing too many details.

Different types of pooling
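
  • The most common choices are max pooling (keep the largest value in each window, as in the example above) and average pooling (keep the mean of each window). Max pooling tends to retain the strongest feature responses, while average pooling smooths them out.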

Multiple color channels

  • So far we have only discussed greyscale pictures. All pictures are three-dimensional inputs, since they have pixels along the image plus a color dimension. For greyscale the dimensions would be (for example) 32x32x1, which is effectively two-dimensional. When the image is in color, the input is 32x32x3, so the CNN operates over a volume. This means that the filters must also be three-dimensional (e.g. 3x3x3), and the same applies in the pooling layer.
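
A minimal sketch of convolution over a volume, using a hypothetical 5x5 color image and a single 3x3x3 filter (illustrative random values): the window product is summed over the color channels as well, so each filter still produces a single two-dimensional feature map.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((5, 5, 3))   # toy 5x5 color image: height x width x channels
filt = rng.random((3, 3, 3))    # one filter spans all 3 color channels

# Each output element sums over the 3x3 window *and* the 3 channels,
# so a 3-D filter still yields a 2-D feature map.
out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(image[i:i+3, j:j+3, :] * filt)

print(out.shape)   # (3, 3)
```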

Flattening layer

  • The flattening layer prepares the input for the fully connected layers by transforming the (pooled) feature maps into a one-dimensional vector that is then fed into the dense layers. For example,

\[\begin{split} \mbox{output} = \begin{pmatrix} 6 & 8 \\ 3 & 4 \end{pmatrix} \quad \longrightarrow \mbox{flattening layer} \longrightarrow \quad \mbox{vector} = \begin{pmatrix} 6 \\ 8 \\ 3 \\ 4 \end{pmatrix} \end{split}\]
  • This vector now serves as the input to a neural network.
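
In NumPy this is just a reshape (a sketch using the 2x2 output above):

```python
import numpy as np

output = np.array([[6, 8],
                   [3, 4]])

# Row-major flattening turns the 2x2 map into a length-4 vector.
print(output.flatten())   # [6 8 3 4]
```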

Dense layer

  • A dense layer is another phrase for a fully connected layer; this part of the CNN is a fully connected neural network. Each neuron receives an input from all the neurons present in the previous layer (hence why they are called dense). A dense layer learns features from all the combinations of the features of the previous layer.
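
Continuing the Keras examples from above, here is a sketch of how the pooling, flattening, dense, and output layers are typically stacked after the convolutional layers. The layer sizes and the choice of 3 output classes are illustrative, not prescriptions:

```python
from tensorflow.keras import layers, models

network = models.Sequential()
network.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)))
network.add(layers.MaxPooling2D((2, 2)))            # 2x2 max pooling, stride 2
network.add(layers.Flatten())                       # feature maps -> vector
network.add(layers.Dense(64, activation='relu'))    # fully connected layer
network.add(layers.Dense(3, activation='softmax'))  # e.g. dog / cat / monkey
network.compile(optimizer='adam', loss='categorical_crossentropy')
```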

Output layer

  • This layer outputs the prediction. For classification problems the function used is usually the softmax, which normalizes the outputs so they sum to one and can be interpreted as probabilities.

\[ \sigma(z)_i = \frac{e^{z_i}}{\sum^{K}_{j=1} e^{z_j}} \]
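
A minimal NumPy sketch of the softmax (subtracting the max before exponentiating is a standard numerical-stability trick that does not change the result):

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the output is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p)         # [0.659 0.242 0.099] (approximately)
print(p.sum())   # 1.0
```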

Categorical crossentropy loss

  • Defined as

\[ \mbox{Loss} = - \sum^{\mbox{output size}}_{i = 1} y_i \ln(\hat{y}_i) \]

where \(y_i\) is the target value and \(\hat{y}_i\) is the corresponding model output (the predicted probability for class \(i\)).

  • For example

\[\begin{split} \mbox{target} = \begin{pmatrix} 1 \\ 0 \\ 0 \\ \end{pmatrix}, \quad \quad \mbox{prediction} = \begin{pmatrix} 0.5 \\ 0.3 \\ 0.2 \\ \end{pmatrix} \quad \rightarrow \mbox{corresponding} \rightarrow \quad \begin{pmatrix} \mbox{dog} \\ \mbox{cat} \\ \mbox{monkey} \\ \end{pmatrix} \end{split}\]
  • Total loss = loss for dog + loss for cat + loss for monkey:

\[ \mbox{Loss} = -1\cdot\ln(0.5) - 0\cdot\ln(0.3) - 0\cdot\ln(0.2) = -\ln(0.5) \approx 0.69 \]
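
The same number in NumPy, using the target and prediction above:

```python
import numpy as np

target = np.array([1.0, 0.0, 0.0])        # one-hot: dog
prediction = np.array([0.5, 0.3, 0.2])    # softmax output

# Only the term for the true class survives the sum.
loss = -np.sum(target * np.log(prediction))
print(loss)   # 0.6931... = -ln(0.5)
```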