Deep Learning Part 2
Second part of the Deep Learning Series
In Deep Learning Part 1, we introduced deep learning and the
fundamentals of neural networks. In this article, we will learn more
about how neural networks work.
In a neural network, data is passed from layer to layer in several
steps:
- Input is multiplied by corresponding weights
- Weighted inputs are summed
- The sum is converted to another value using a mathematical transformation (an activation function)
There are several functions that can be applied in step 3.
Activation Functions
Activation functions define how a weighted sum of the input is
transformed into an output from a node. Think of it as an additional
transformation performed on the data. Below are three main examples
of activation functions: Sigmoid, Tanh, and ReLU.
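As a rough sketch, here is how these three functions might be written with NumPy (the function names are our own, not taken from any particular library):

```python
import numpy as np

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-x))

def tanh(x):
    # Squashes any real number into the range (-1, 1)
    return np.tanh(x)

def relu(x):
    # Passes positive values through unchanged; clips negatives to 0
    return np.maximum(0, x)

print(sigmoid(0.0))  # 0.5
print(tanh(0.0))     # 0.0
print(relu(-2.0))    # 0.0
```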
The most basic form of a neural network is the perceptron: a linear machine learning algorithm for binary classification (classifying input into one of two classes).
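As a minimal sketch of a perceptron making a prediction (the weights, bias, and input below are made up for illustration):

```python
import numpy as np

def perceptron_predict(x, w, b):
    # Classify into one of two classes based on the sign of w·x + b
    return 1 if np.dot(w, x) + b > 0 else 0

# Hypothetical weights and bias for a 2-feature input
w = np.array([0.5, -0.25])
b = 0.1
print(perceptron_predict(np.array([1.0, 2.0]), w, b))  # 0.5 - 0.5 + 0.1 > 0, so class 1
```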
You might be wondering why activation functions are necessary. The short answer is that they introduce nonlinearity into our model while also regulating the input. Notice that the range of tanh is from -1 to 1: anything plugged into tanh gets scaled into that range, which keeps the data on a consistent scale. Similarly, sigmoid scales values to between 0 and 1. As for introducing nonlinearity: without activation functions, our data would only ever be multiplied by weights and added to biases, and stacking such layers collapses into a single linear transformation, basically mimicking linear regression. Activation functions introduce the nonlinearity that helps the model learn more complex patterns in our data.
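To see why, here is a small demonstration (with random, made-up weights) that two stacked layers without activation functions collapse into one linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3)

# Two "layers" with no activation function...
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
two_layers = W2 @ (W1 @ x + b1) + b2

# ...are equivalent to one linear layer with W = W2 @ W1 and b = W2 @ b1 + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(two_layers, one_layer))  # True
```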
Forward Propagation (Forward Pass)
Now, let’s discuss how the neural network passes information from the input layer through the hidden layers to the output layer.
At each layer, the data is transformed in three steps:
- The weighted inputs (input × weight) are summed
- An activation function is applied to this sum
- The result is passed to the neurons in the next layer
When the output layer is reached, the final prediction of the neural
network is made. This entire process is called
Forward Propagation (Forward Pass).
Example: assume we had an input X1 that went through a neuron and came out as X2. To transform X1 into X2, we multiply it by a weight and add a bias. This can be represented by the equation X2 = W*X1 + B, where W is our weight and B is our bias.
In the diagram below, we are taking the sum of each input multiplied
by its weight and then applying an activation function to give our
output. This might be represented by
tanh(W1X1 + W2X2 + … + WNXN), where tanh is our activation
function and each input XN corresponds to a weight WN.
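A hedged sketch of that computation in code (the layer sizes and values here are arbitrary):

```python
import numpy as np

def forward_layer(x, W, b, activation=np.tanh):
    # Steps 1-2: weighted inputs summed (plus bias); step 3: activation applied
    return activation(W @ x + b)

rng = np.random.default_rng(42)
x = rng.standard_normal(3)       # input layer with 3 values
W = rng.standard_normal((2, 3))  # 2 neurons, each with 3 weights
b = rng.standard_normal(2)       # one bias per neuron

hidden = forward_layer(x, W, b)  # this output is passed to the next layer
print(hidden)
```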
Loss/Cost Functions
After the forward pass is run on the network, we want to test how
well the model is doing. To do this, we can use a loss function, a
function that calculates the error of a model. Loss is defined as
the difference between the model’s predicted output and the actual
output. A common loss function is mean squared error (MSE). Our goal
is to minimize the loss, leading us to the next topic.
MSE = (1/n) * Σ (y_i − ŷ_i)², with the sum running from i = 1 to n

Let’s try to understand this equation. Imagine we have a set S = {(x_1, y_1), (x_2, y_2), … (x_n, y_n)} of length n. We go through each element and find the model’s predicted y-value for that x-value; that prediction is ŷ_i, the y_i with the little hat on top. We then subtract it from the expected y-value and square the result. After adding up all of these squared errors, we divide by the set length to get the average. You might wonder why we square the terms: it gives bigger errors more weight and ensures every error is positive. Other formulas, such as mean absolute error (MAE), take the absolute value instead of squaring, but MSE tends to be the most common.
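A direct translation of the MSE formula might look like this (a sketch, not any library’s implementation):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared differences between expected and predicted values
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.5])
print(mse(y_true, y_pred))  # (0.01 + 0.01 + 0.25) / 3 = 0.09
```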
Backpropagation (Backward Pass)
Backpropagation (Backward Pass) is the process of learning that the
neural network uses to adjust the network weights and minimize
error.
Initially, the weights in a network are set to random numbers, so the initial model is inaccurate, with high error. Backpropagation is needed to progressively improve this accuracy.
We need to find out how much each node in a layer contributes to the
overall error. This can actually be done through backpropagation.
The idea is to adjust the weights in proportion to their error
contribution in order to eventually reach an optimal set of weights.
Bigger errors get adjusted more drastically than smaller errors.
This is typically done through a learning rate, which will be covered later in the article. The gradient descent algorithm is used to make these adjustments.
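To make this concrete, here is a minimal sketch of one weight being repeatedly adjusted in proportion to its error contribution; the single-weight network, the squared-error loss, and all the numbers are chosen purely for illustration:

```python
# One training example through a single linear "neuron": y_pred = w * x
x, y_true = 2.0, 8.0     # this target implies the ideal weight is 4.0
w = 0.5                  # random-ish initial weight
learning_rate = 0.05

for step in range(50):
    y_pred = w * x
    loss = (y_pred - y_true) ** 2
    # Chain rule: d(loss)/dw = 2 * (y_pred - y_true) * x
    grad = 2 * (y_pred - y_true) * x
    w -= learning_rate * grad  # bigger errors produce bigger adjustments

print(w)  # approaches 4.0
```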
Gradient Descent
The goal of gradient descent is to find the set of weights that minimizes the loss function. Optimization functions calculate a gradient to do this. A gradient is the partial derivative of the loss function with respect to the weights; in other words, it tells us how much a small increase in a weight will affect the loss function. Weights are adjusted in the opposite direction of the calculated gradient (w_new = w_old − learning_rate * gradient). Think of this as taking a step towards the local minimum of a function f(x). This cycle is repeated until we reach a minimum of the loss function.
A common analogy used to visualize gradient descent is picturing a ball rolling down a hill. However, the steps gradient descent takes aren’t always smooth or in the same direction, so a better picture might be a drunk man stumbling down said hill.
Note: Gradient descent is not guaranteed to find the absolute (global) minimum of a function; it can sometimes land on a relative (local) minimum instead.
The learning rate is a hyperparameter that determines the step size (the amount by which the weights are updated each time). A high learning rate can jump over minima, which is not ideal. A low learning rate approaches the minimum very slowly, requiring many training iterations. We can try out different learning rates through trial and error to improve results.
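As a small trial-and-error experiment (the loss function and rates below are made up), we can watch how different learning rates behave when minimizing a simple loss f(w) = (w − 3)²:

```python
def gradient_descent(lr, steps=20, w=0.0):
    # Minimize f(w) = (w - 3)^2; its gradient is f'(w) = 2 * (w - 3)
    for _ in range(steps):
        w -= lr * 2 * (w - 3)
    return w

for lr in (0.01, 0.1, 1.1):
    print(lr, gradient_descent(lr))
# A tiny rate (0.01) creeps toward 3 and needs many more steps, a moderate
# one (0.1) gets close, and a too-large one (1.1) jumps past the minimum
# on every step and diverges.
```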
Overfitting
Unfortunately, deep neural networks are prone to overfitting the training data and producing poor accuracy on the test data. One technique to handle overfitting is batch normalization, which normalizes the inputs to each layer across a mini-batch of data. This can improve the performance and stability of neural networks.
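A bare-bones sketch of the normalization step (real batch normalization layers also learn a scale and a shift parameter, which this omits):

```python
import numpy as np

def batch_normalize(x, eps=1e-5):
    # Normalize each feature to zero mean and unit variance across the batch
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

batch = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])
print(batch_normalize(batch))  # both columns end up on the same scale
```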
Another technique is dropout: randomly shutting down a fraction of a layer’s neurons at each training step (setting their outputs to 0). This prevents the neurons from becoming too dependent on each other. Below is a visual of the dropout method.
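As a rough code sketch of the same idea (the drop probability and the rescaling of the surviving neurons, known as "inverted dropout", are common conventions assumed here):

```python
import numpy as np

def dropout(layer_output, drop_prob=0.5):
    # Randomly zero out a fraction of the neurons for this training step
    mask = np.random.rand(*layer_output.shape) >= drop_prob
    # Scale the survivors so the expected total activation stays the same
    return layer_output * mask / (1 - drop_prob)

activations = np.array([0.2, 0.9, 0.4, 0.7, 0.1, 0.5])
print(dropout(activations))  # roughly half the values are zeroed each call
```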