Deep Learning Part 2
Second part of the Deep Learning Series
In Deep Learning Part 1, we introduced deep learning and the
fundamentals of neural networks. In this article, we will learn more
about how neural networks work.
In a neural network, data is passed from layer to layer in several
steps:
- Input is multiplied by corresponding weights
- Weighted inputs are summed
- The sum is converted to another value using a mathematical transformation (an activation function)
There are several functions that can be applied in step 3.
Activation Functions
Activation functions define how a weighted sum of the input is
transformed into an output from a node. Think of it as an additional
transformation performed on the data. Below are three main examples
of activation functions: Sigmoid, Tanh, and ReLU.
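As a rough sketch, here is how these three functions might be written with NumPy (the function names are our own, not taken from any particular library):

```python
import numpy as np

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-x))

def tanh(x):
    # Squashes any real number into the range (-1, 1)
    return np.tanh(x)

def relu(x):
    # Passes positive values through unchanged; clips negatives to 0
    return np.maximum(0, x)

print(sigmoid(0.0))  # 0.5
print(tanh(0.0))     # 0.0
print(relu(-2.0))    # 0.0
```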
The most basic form of a neural network is the perceptron: a linear machine learning algorithm for binary classification (classifying input into one of two classes).
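As a minimal sketch of a perceptron making a prediction (the weights, bias, and input below are made up for illustration):

```python
import numpy as np

def perceptron_predict(x, w, b):
    # Classify into one of two classes based on the sign of w·x + b
    return 1 if np.dot(w, x) + b > 0 else 0

# Hypothetical weights and bias for a 2-feature input
w = np.array([0.5, -0.25])
b = 0.1
print(perceptron_predict(np.array([1.0, 2.0]), w, b))  # 0.5 - 0.5 + 0.1 > 0, so class 1
```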
You might be wondering why activation functions are necessary. The short answer is that they introduce nonlinearity into our model while also regulating the input. Notice that the range of tanh is from -1 to 1: anything plugged into tanh gets scaled into that range, which keeps the data on a consistent scale. Similarly, sigmoid scales values to between 0 and 1. As for introducing nonlinearity: without activation functions, our data would only ever be multiplied by weights and added to biases, and stacking such layers collapses into a single linear transformation, basically mimicking linear regression. Activation functions introduce the nonlinearity that helps the model learn more complex patterns in our data.
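To see why, here is a small demonstration (with random, made-up weights) that two stacked layers without activation functions collapse into one linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3)

# Two "layers" with no activation function...
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
two_layers = W2 @ (W1 @ x + b1) + b2

# ...are equivalent to one linear layer with W = W2 @ W1 and b = W2 @ b1 + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(two_layers, one_layer))  # True
```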
Forward Propagation (Forward Pass)
Now, let’s discuss how the neural network passes information from the input layer through the hidden layers to the output layer.
At each layer, the data is transformed in three steps:
- The weighted inputs (input × weight) are summed
- An activation function is applied to this sum
- The result is passed to the neurons in the next layer
When the output layer is reached, the final prediction of the neural
network is made. This entire process is called
Forward Propagation (Forward Pass).
Example: assume we had an input X1 that went through a neuron and came out as X2. To transform X1 into X2, we multiply it by a weight and add a bias. This can be represented by the equation X2 = W*X1 + B, where W is our weight and B is our bias.
In the diagram below, we are taking the sum of each input multiplied
by its weight and then applying an activation function to give our
output. This might be represented by
tanh(W1X1 + W2X2 + … + WNXN), where tanh is our activation
function and each input XN corresponds to a weight WN.
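A hedged sketch of that computation in code (the layer sizes and values here are arbitrary):

```python
import numpy as np

def forward_layer(x, W, b, activation=np.tanh):
    # Steps 1-2: weighted inputs summed (plus bias); step 3: activation applied
    return activation(W @ x + b)

rng = np.random.default_rng(42)
x = rng.standard_normal(3)       # input layer with 3 values
W = rng.standard_normal((2, 3))  # 2 neurons, each with 3 weights
b = rng.standard_normal(2)       # one bias per neuron

hidden = forward_layer(x, W, b)  # this output is passed to the next layer
print(hidden)
```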
Loss/Cost Functions
After the forward pass is run on the network, we want to test how
well the model is doing. To do this, we can use a loss function, a
function that calculates the error of a model. Loss is defined as
the difference between the model’s predicted output and the actual
output. A common loss function is mean squared error (MSE). Our goal
is to minimize the loss, leading us to the next topic.
MSE = (1/n) * Σ (y_i − ŷ_i)², with the sum running from i = 1 to n

Let’s try to understand this equation. Imagine we have a set S = {(x_1, y_1), (x_2, y_2), … (x_n, y_n)} of length n. We go through each element and find the model’s predicted y-value for that x-value; that prediction is ŷ_i, the y_i with the little hat on top. We then subtract it from the expected y-value and square the result. After adding up all of these squared errors, we divide by the set length to get the average. You might wonder why we square the terms: it gives bigger errors more weight and ensures every error is positive. Other formulas, such as mean absolute error (MAE), take the absolute value instead of squaring, but MSE tends to be the most common.
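A direct translation of the MSE formula might look like this (a sketch, not any library’s implementation):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared differences between expected and predicted values
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.5])
print(mse(y_true, y_pred))  # (0.01 + 0.01 + 0.25) / 3 = 0.09
```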
Backpropagation (Backward Pass)
Backpropagation (Backward Pass) is the process of learning that the
neural network uses to adjust the network weights and minimize
error.
Initially, the weights in a network are set to random numbers, so the initial model is inaccurate, with high error. Backpropagation is needed to progressively improve this accuracy.
We need to find out how much each node in a layer contributes to the
overall error. This can actually be done through backpropagation.
The idea is to adjust the weights in proportion to their error
contribution in order to eventually reach an optimal set of weights.
Bigger errors get adjusted more drastically than smaller errors.
This is typically done through a learning rate, which will be covered later in the article. The gradient descent algorithm is used to make these adjustments.
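To make this concrete, here is a minimal sketch of one weight being repeatedly adjusted in proportion to its error contribution; the single-weight network, the squared-error loss, and all the numbers are chosen purely for illustration:

```python
# One training example through a single linear "neuron": y_pred = w * x
x, y_true = 2.0, 8.0     # this target implies the ideal weight is 4.0
w = 0.5                  # random-ish initial weight
learning_rate = 0.05

for step in range(50):
    y_pred = w * x
    loss = (y_pred - y_true) ** 2
    # Chain rule: d(loss)/dw = 2 * (y_pred - y_true) * x
    grad = 2 * (y_pred - y_true) * x
    w -= learning_rate * grad  # bigger errors produce bigger adjustments

print(w)  # approaches 4.0
```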
Gradient Descent
The goal of gradient descent is to find the set of weights that minimizes the loss function. Optimization functions calculate a gradient to do this. A gradient is the partial derivative of the loss function with respect to the weights; in other words, it tells us how much a small increase in a weight will affect the loss function. Weights are adjusted in the opposite direction of the calculated gradient (w_new = w_old − learning_rate * gradient). Think of this as taking a step towards the local minimum of a function f(x). This cycle is repeated until we reach a minimum of the loss function.
A common analogy used to visualize gradient descent is picturing a ball rolling down a hill. However, the steps gradient descent takes aren’t always smooth or in the same direction, so a better picture might be a drunk man stumbling down said hill.
Note: Gradient descent is not guaranteed to find the absolute (global) minimum of a function; it can sometimes land on a relative (local) minimum instead.
The learning rate is a hyperparameter that determines the step size (the amount by which the weights are updated each time). A high learning rate can jump over minima, which is not ideal. A low learning rate approaches the minimum very slowly, requiring many training iterations. We can try out different learning rates through trial and error to improve results.
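As a small trial-and-error experiment (the loss function and rates below are made up), we can watch how different learning rates behave when minimizing a simple loss f(w) = (w − 3)²:

```python
def gradient_descent(lr, steps=20, w=0.0):
    # Minimize f(w) = (w - 3)^2; its gradient is f'(w) = 2 * (w - 3)
    for _ in range(steps):
        w -= lr * 2 * (w - 3)
    return w

for lr in (0.01, 0.1, 1.1):
    print(lr, gradient_descent(lr))
# A tiny rate (0.01) creeps toward 3 and needs many more steps, a moderate
# one (0.1) gets close, and a too-large one (1.1) jumps past the minimum
# on every step and diverges.
```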
Overfitting
Unfortunately, deep neural networks are prone to overfitting the training data and producing poor accuracy on the test data. One technique to handle overfitting is batch normalization, which normalizes the inputs to each layer across a mini-batch of data. This can improve the performance and stability of neural networks.
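A bare-bones sketch of the normalization step (real batch normalization layers also learn a scale and a shift parameter, which this omits):

```python
import numpy as np

def batch_normalize(x, eps=1e-5):
    # Normalize each feature to zero mean and unit variance across the batch
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

batch = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])
print(batch_normalize(batch))  # both columns end up on the same scale
```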
Another technique is dropout: randomly shutting down a fraction of a layer’s neurons at each training step (setting their outputs to 0). This prevents the neurons from becoming too dependent on each other. Below is a visual of the dropout method.
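As a rough code sketch of the same idea (the drop probability and the rescaling of the surviving neurons, known as "inverted dropout", are common conventions assumed here):

```python
import numpy as np

def dropout(layer_output, drop_prob=0.5):
    # Randomly zero out a fraction of the neurons for this training step
    mask = np.random.rand(*layer_output.shape) >= drop_prob
    # Scale the survivors so the expected total activation stays the same
    return layer_output * mask / (1 - drop_prob)

activations = np.array([0.2, 0.9, 0.4, 0.7, 0.1, 0.5])
print(dropout(activations))  # roughly half the values are zeroed each call
```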