ANN: The ‘Crux’ of Deep Learning

K. Sai Chaitanya
14 min read · Aug 1, 2020


Getting to know what back propagation is and to what extent it empowers ANNs.

In this article, I’ve tried to cover every crucial detail of Artificial Neural Networks from scratch, so that readers get a comprehensive view of the concept in simple terms.

What is a Neuron?

I believe most people are perplexed when they first come across the word ‘neuron’, but let me assure you, it’s pretty interesting and not as confusing as you might assume.

A ‘neuron’ in deep learning is analogous to a ‘neuron’ in biology. Here is a typical biological neuron.

A typical biological neuron

Why Neuron?

When we touch a hot cup with our index finger, we pull back by reflex, right? We feel something at the tip of the finger, and our entire hand involuntarily recoils. So how do we know it is hot, and what causes this reflex action?

When we touch the hot cup, the neurons at the tip of our finger get activated and start sending electrical signals to the brain through a chain of neurons, and the brain finally makes our body react with a reflex.

Biological Neural Network

So now we can consider a neuron as a source of communication to our brain through electrical pulses.

What is a Neuron in Deep Learning?

A neuron in deep learning is called an Artificial Neuron, and correspondingly a network of such neurons is referred to as an Artificial Neural Network. The structure of a Neural Network in deep learning is given below.

Yes, it is pretty complicated. Understanding the process happening in the background is similarly involved, so let’s break it apart and analyze the structure of a single neuron. Now look:

A single neuron or a node

x1, x2, x3 and so on up to xn are the features that act as inputs to the machine. The node in the middle sums the products of the weights and the input features and applies an activation function to the result. Finally, the third layer is the output layer, which predicts the outcome.

Layers in a Neural Network:

Input Layer: In biological terms, this layer can be compared to the eyes. Just as we discern the features of a subject by sight (say, classifying an animal as a dog or a cat: we see and distinguish them based on attributes such as ears, mouth, tail, fur, eyes, teeth and nose), the input layer of the Artificial Neural Network is responsible for taking in the features of a subject. So the input layer must contain all the attributes or features used for classification.

Here, xi represents the individual features

Hidden Layer: This layer can be compared to the nucleus of a neuron in biological terms. Just as a biological neuron activates and transfers electrical signals to the brain, each artificial neuron sums the products of the weights and the input features and then passes the result through an activation function, which decides whether that neuron is activated. Weights play a major role here; let’s put them aside for now and come back to them with a detailed explanation.

Function of each node in a hidden layer:

  1. Multiplying each feature by the corresponding weight of that branch.
  2. Summing up all the products of weights and features.
  3. Applying the activation function to the resulting sum.
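
The three steps above can be sketched in a few lines of Python. This is only an illustration, not the code from the author's repository: the function names, the example feature and weight values, and the optional `bias` term are assumptions made for the sketch, and sigmoid is used simply as one possible activation.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(features, weights, bias=0.0, activation=sigmoid):
    # Steps 1 and 2: multiply each feature by its weight and sum the products.
    weighted_sum = sum(x * w for x, w in zip(features, weights)) + bias
    # Step 3: apply the activation function to the weighted sum.
    return activation(weighted_sum)

# Hypothetical example values, chosen only for illustration.
features = [1.0, 2.0, 3.0]
weights = [0.5, -0.25, 0.1]
print(neuron_output(features, weights))
```

Passing a different function as `activation` lets the same node be reused with any of the activation functions discussed later.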

Output Layer: This layer is analogous to our brain in biological terms. Just as we classify subjects in the real world, the output layer of the network is responsible for separating subjects by their ‘unique’ features and giving the output in binary format (for a two-class problem).

Activation functions:

An activation function decides whether or not to activate a neuron. For instance, if the input to the activation function is greater than the threshold value, the neuron is activated; otherwise it remains dormant.

  1. Binary Step Function

Here, if the input to the activation function is greater than or equal to 0, the neuron is activated; otherwise it is deactivated, i.e. its output is not considered for the next hidden layer. Let us look at it mathematically:

f(x) = 1, x>=0
= 0, x<0
Binary step

Drawback of Binary Step Function:

The gradient of the step function is zero, which brings back propagation to an impasse. That is, the derivative of f(x) with respect to x is 0 (everywhere except at x = 0, where it is undefined), so the weights receive no update during backward propagation and the new weights always equal the old ones.
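
As a quick check of this drawback, the sketch below defines the step function and estimates its slope numerically at a few points. The helper name `numerical_gradient` and the sample points are assumptions for illustration; x = 0 is deliberately left out because the function jumps there.

```python
def binary_step(x):
    # Output 1 when the input is at or above the threshold 0, otherwise 0.
    return 1 if x >= 0 else 0

def numerical_gradient(f, x, h=1e-6):
    # Central-difference estimate of the slope of f at x.
    return (f(x + h) - f(x - h)) / (2 * h)

# Away from the jump at x = 0 the function is flat, so the slope is zero.
for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, binary_step(x), numerical_gradient(binary_step, x))
```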

2. Linear Function

The problem with the binary step function leads us to the linear function. We can define the linear function as:

f(x)=ax
Linear activation function

Here the activation varies proportionally to the input. When we differentiate the function with respect to x, the result is the coefficient of x, which is a constant.

f'(x) = a

The derivative of a linear function does not become zero. Nevertheless, it still has a negative impact on back propagation: the weight-updating factor remains constant, which prevents the updates from varying across iterations.

Drawback of Linear Function:

In this scenario the gradient of the activation is the same constant at every iteration and does not depend on the input, so the network will not train well. Moreover, a stack of layers with linear activations is itself just a linear function of the input, so the network cannot capture complex, non-linear patterns in the data.
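
A small sketch can make the drawback concrete. The coefficients 2.0 and 3.0 and the function names are assumptions for illustration. The second function goes one step beyond the text above: it shows that two linear activations applied one after the other are still a single linear function.

```python
def linear(x, a=2.0):
    # Linear activation f(x) = a * x.
    return a * x

def linear_gradient(x, a=2.0):
    # The derivative is the constant a, whatever the input x is.
    return a

def two_linear_layers(x, a1=2.0, a2=3.0):
    # Composing two linear activations gives (a1 * a2) * x, which is again linear.
    return linear(linear(x, a1), a2)

print(linear_gradient(-5.0), linear_gradient(100.0))
print(two_linear_layers(1.5), 6.0 * 1.5)
```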

3. Sigmoid

The most renowned activation function is ‘Sigmoid’. It is one of the most widely used non-linear activation functions. Sigmoid maps its input to a value between 0 and 1. For classification, the output is usually thresholded at 0.5: an input less than 0 gives an output below 0.5, which is read as 0, whereas an input greater than 0 gives an output above 0.5, which is read as high (1).

f(x) = 1/(1+e^-x)
Sigmoid activation function

One point to be highlighted is that the sigmoid function is non-linear. This essentially means that when I have multiple neurons with sigmoid as their activation function, the output is non-linear as well.

Moreover, as you can see in the graph above, this is a smooth S-shaped function and is continuously differentiable. The derivative of this function is:

f'(x) = sigmoid(x)*(1-sigmoid(x))
Derivative of Sigmoid Function

The derivative of the sigmoid function varies between 0 and 0.25.
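
The sigmoid function and its derivative can be checked numerically with the sketch below. The function names and sample inputs are assumptions for illustration. The derivative peaks at x = 0, where it equals 0.25, and shrinks towards 0 on both sides.

```python
import math

def sigmoid(x):
    # f(x) = 1 / (1 + e^-x), always between 0 and 1.
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    # f'(x) = sigmoid(x) * (1 - sigmoid(x)), never larger than 0.25.
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  derivative={sigmoid_derivative(x):.4f}")
```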

Drawback of Sigmoid Function:

The sigmoid function is not symmetric around zero, so the outputs of all the neurons will be of the same sign. This can be addressed by scaling and shifting the sigmoid function, which is exactly what happens in the tanh function.

There is also another prominent drawback of the sigmoid function, referred to as the ‘Vanishing Gradient problem’.

Vanishing Gradient Problem: The gradient of sigmoid is significant only for inputs roughly between -4 and 4. Beyond that range the curve flattens out and the gradient approaches 0. This implies that inputs greater than 4 or less than -4 will have very small gradients. As the gradient approaches zero, the network is not really learning.
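
To see how quickly the gradient shrinks, here is a rough sketch. It makes a simplifying assumption that is not in the article: the weights are ignored and only the sigmoid derivatives of successive layers are multiplied together, as the chain rule would do during back propagation. The function names are hypothetical.

```python
import math

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def chained_gradient(pre_activations):
    # Multiply the sigmoid derivatives layer by layer (weights ignored for simplicity).
    product = 1.0
    for z in pre_activations:
        product *= sigmoid_derivative(z)
    return product

print(sigmoid_derivative(0.0))       # the largest value the derivative can take
print(sigmoid_derivative(6.0))       # far from zero the derivative is tiny
print(chained_gradient([0.0] * 10))  # even the best case shrinks fast over 10 layers
```

Even in the most favourable case, where every layer sits at the peak of the derivative, ten layers scale the gradient by 0.25 ten times over.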

4. Tanh

The tanh function is very similar to the sigmoid function. The only difference is that it is symmetric around the origin. The range of values in this case is from -1 to 1. Thus the inputs to the next layers will not always be of the same sign. The tanh function is defined as-

tanh(x)=2sigmoid(2x)-1
Tanh function

As you can see, the range of values is between -1 and 1. Apart from that, all other properties of the tanh function are the same as those of the sigmoid function. Similar to sigmoid, the tanh function is continuous and differentiable at all points.

Let’s have a look at the gradient of the tanh function.

Tanh and its derivative

The gradient of the tanh function is steeper as compared to the sigmoid function. Usually tanh is preferred over the sigmoid function since it is zero centered and the gradients are not restricted to move in a certain direction.
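
The relation tanh(x) = 2sigmoid(2x) - 1 and the steeper gradient can both be verified with a short sketch. The function names are assumptions for illustration, and the derivative used is the standard identity 1 - tanh(x)^2.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    # tanh written as a scaled and shifted sigmoid.
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_derivative(x):
    # Standard identity: d/dx tanh(x) = 1 - tanh(x)^2.
    return 1.0 - math.tanh(x) ** 2

for x in (-2.0, 0.0, 0.7, 2.0):
    print(x, tanh_from_sigmoid(x), math.tanh(x), tanh_derivative(x))
```

At x = 0 the tanh derivative is 1, compared with 0.25 for sigmoid.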

Drawbacks of Tanh Function:

Apart from being symmetric around the origin, it is more or less similar to the sigmoid function, so it also suffers from the vanishing gradient problem.

5. ReLU

The ReLU function is another non-linear activation function that has gained popularity in the deep learning domain. ReLU stands for Rectified Linear Unit. The main advantage of using the ReLU function over other activation functions is that it does not activate all the neurons at the same time.

This means that the neurons will only be deactivated if the output of the linear transformation is less than 0. The plot below will help you understand this better-

f(x)=max(0,x)
ReLU activation function

For the negative input values, the result is zero, that means the neuron does not get activated. Since only a certain number of neurons are activated, the ReLU function is far more computationally efficient when compared to the sigmoid and tanh function.

Let’s look at the gradient of the ReLU function.

f'(x) = 1, x>=0
= 0, x<0
Gradient of ReLU function

If you look at the negative side of the graph, you will notice that the gradient value is zero. Due to this reason, during the back propagation process, the weights and biases for some neurons are not updated. This can create dead neurons which never get activated. This is taken care of by the ‘Leaky’ ReLU function.
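
ReLU and its gradient are short enough to sketch directly. The function names are assumptions for illustration, and the gradient at exactly x = 0 is taken to be 1, following the convention in the formula above.

```python
def relu(x):
    # f(x) = max(0, x): negative inputs are clipped to zero.
    return max(0.0, x)

def relu_gradient(x):
    # 1 for x >= 0 (by the convention above), 0 for x < 0.
    return 1.0 if x >= 0 else 0.0

for x in (-3.0, -0.1, 0.0, 0.1, 3.0):
    print(x, relu(x), relu_gradient(x))
```

A neuron whose input stays negative for every training record always sees a gradient of 0, which is how it ends up "dead".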

6. Leaky ReLU

The Leaky ReLU function is nothing but an improved version of the ReLU function. As we saw, for the ReLU function the gradient is 0 for x<0, which would deactivate the neurons in that region.

Leaky ReLU is defined to address this problem. Instead of defining the ReLU function as 0 for negative values of x, we define it as an extremely small linear component of x. Here is the mathematical expression-

f(x)= 0.01x, x<0
= x, x>=0
Leaky ReLU

By making this small modification, the gradient of the left side of the graph comes out to be a non zero value. Hence we would no longer encounter dead neurons in that region. Here is the derivative of the Leaky ReLU function

f'(x) = 1, x>=0
=0.01, x<0
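
A minimal sketch of Leaky ReLU, assuming a negative-side slope of 0.01 as in the derivative above. The parameter name `slope` and the function names are hypothetical.

```python
def leaky_relu(x, slope=0.01):
    # Pass positive inputs through unchanged; scale negative inputs by a small slope.
    return x if x >= 0 else slope * x

def leaky_relu_gradient(x, slope=0.01):
    # The gradient is small but non-zero on the negative side.
    return 1.0 if x >= 0 else slope

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, leaky_relu(x), leaky_relu_gradient(x))
```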

Apart from Leaky ReLU, there are a few other variants of ReLU. The two most popular are the Parameterised ReLU (PReLU) function and the Exponential Linear Unit (ELU).

Note: For binary classification, the sigmoid activation function is the usual choice in the output layer, so that the output can be thresholded to either 0 or 1. In the hidden layers, ReLU or Leaky ReLU generally works well.

What are weights?

A weight is a parameter within a neural network that transforms input data within the network’s hidden layers. A neural network is a series of nodes, or neurons. Each node has a set of inputs, a weight for each input, and a bias value. As an input enters the node, it gets multiplied by a weight value, and the resulting output is either observed or passed to the next layer in the neural network. Often the weights of a neural network are contained within the hidden layers of the network.

“Weights are simply defined as the extent to which the input features affect the output.”

What is a Loss Function?

When the output layer classifies the subject as positive or negative (for instance, 1 for dog and 0 for not a dog), there will inevitably be errors for some observations, as machines are not always perfect in prediction. So our job is to minimize the error function, which is called the Loss function, so that the predictions classify the subject as precisely as possible.

The loss function is defined as the sum, over all n records, of the squared differences between y actual and y predicted: Loss = sum over the n records of (y_actual - y_predicted)^2

Loss function
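
The sum-of-squares loss described above can be sketched as follows. The function name and the example labels and predictions are made up for illustration.

```python
def sum_of_squared_errors(y_actual, y_predicted):
    # Add up the squared difference between actual and predicted value for every record.
    return sum((a - p) ** 2 for a, p in zip(y_actual, y_predicted))

# Hypothetical labels (1 = dog, 0 = not a dog) and model outputs.
y_actual = [1, 0, 1, 1]
y_predicted = [0.9, 0.2, 0.6, 0.8]
print(sum_of_squared_errors(y_actual, y_predicted))
```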

For this to happen (that is, for the loss function to decrease), we have to introduce an indispensable concept underlying ANNs: “Back Propagation”.

Back Propagation:

During back propagation, the weights are adjusted automatically by traversing backwards from the output layer to the input layer. The main purpose of back propagation is to adjust the weights so that the loss function is reduced.

forward and backward propagation

Batch Gradient descent:

The weights are adjusted according to the following formula:

*Wx = Wx - a × (∂Loss/∂Wx)

updating weights during back propagation
  1. *Wx in the above formula is the new weight.
  2. Wx is the old weight.
  3. a is the learning rate, which decides by how much the weight should be changed at each step in order to reach the global minimum. The learning rate should be neither too small nor too large. If the learning rate is too small, the weights take a very long time to reach the global minimum. If the learning rate is too large, the weights oscillate between the side walls of the ‘U’-shaped curve and never settle at the global minimum.
  4. Finally, the derivative represents the slope of the loss curve at that weight. The main intuition is this: when the slope is negative, the old weight is increased by a small amount; conversely, when the slope is positive, the old weight is reduced by a small amount.
weights reaching the global minimum
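
The update rule and the effect of the learning rate can be tried on a toy problem. Everything in this sketch is an assumption for illustration: the ‘U’-shaped loss (w - 3)^2 with its minimum at w = 3, the starting weight 0, and the three learning rates.

```python
def loss_gradient(w):
    # Slope of the toy loss (w - 3)^2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate, steps):
    # Repeatedly apply: new weight = old weight - learning rate * slope.
    for _ in range(steps):
        w = w - learning_rate * loss_gradient(w)
    return w

print(gradient_descent(0.0, learning_rate=0.1, steps=50))    # ends close to 3
print(gradient_descent(0.0, learning_rate=0.001, steps=50))  # too small: still far from 3
print(gradient_descent(0.0, learning_rate=1.0, steps=50))    # too large: bounces between 0 and 6
```

With the slope negative at w = 0, the first update increases the weight, exactly as point 4 describes.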

Batch gradient descent requires all the records of the data set in order to compute the loss function and then update the weights so that the loss function decreases.

Stochastic gradient descent:

batch gradient descent
  1. Unlike batch gradient descent, which uses all the samples in one shot, SGD uses only one record at a time.
  2. Because SGD uses only one example at a time, its path to the global minimum is noisier than that of batch gradient descent. However, this is acceptable, since we are indifferent to the path as long as it reaches the minimum with a shorter training time.

Mini Batch Gradient Descent:

Mini-batch gradient descent uses a small batch of n data points (more than the 1 sample used in SGD, but fewer than the full data set) at each iteration.

Optimizing over mini-batches helps the weights approach the global minimum at a faster rate.

Now that we understand the conceptual differences between the three types of gradient descent, let us look at the distinction using a graph.

distinction between the three

In terms of how directly each update moves towards the global minimum: Batch gradient descent > Mini-batch gradient descent > Stochastic gradient descent.
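
The three variants differ only in how many records feed each weight update, which the sketch below expresses as a `batch_size` argument. The toy data (y = 2x), the single weight `w`, the learning rate and the use of the mean rather than the sum of squared errors inside each batch are all assumptions made for illustration.

```python
import random

def gradient(w, batch):
    # Derivative of the mean squared error of the model y = w * x over the batch.
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, learning_rate=0.01, epochs=20, seed=0):
    rng = random.Random(seed)
    w, updates = 0.0, 0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        # Walk through the shuffled records one batch at a time.
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            w -= learning_rate * gradient(w, batch)
            updates += 1
    return w, updates

# Toy records following y = 2x, so the best weight is 2.
data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0)]

print("batch:     ", train(data, batch_size=len(data)))
print("mini-batch:", train(data, batch_size=4))
print("stochastic:", train(data, batch_size=1))
```

Each call returns the learned weight and the number of updates performed, so the trade-off between updates per pass and records per update is visible directly.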

Epoch?

An epoch is one full pass of the training data through the network, that is, one forward and one backward propagation over all the records. When we increase the number of epochs, the error is more likely to diminish.

1 epoch = 1 forward propagation + 1 backward propagation (over the entire training set)
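
Assuming an epoch means one full pass over the training records, the number of weight updates in one epoch depends on the batch size. The helper name and the record counts below are hypothetical.

```python
import math

def updates_per_epoch(num_records, batch_size):
    # Each batch triggers one forward and one backward propagation.
    return math.ceil(num_records / batch_size)

print(updates_per_epoch(1000, 1000))  # batch gradient descent
print(updates_per_epoch(1000, 32))    # mini-batch gradient descent
print(updates_per_epoch(1000, 1))     # stochastic gradient descent
```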

Dropout regularization:

To avoid over-fitting the model to the training data, we employ dropout. It is a simple and powerful regularization technique that reduces over-fitting in neural networks by preventing complex co-adaptations on training data.

It is a very efficient way of performing model averaging with neural networks. The term “dropout” refers to dropping out units (both hidden and visible) in a neural network.

How does the drop out technique work?

random nodes being activated using dropout regularization.

Dropout is a technique where randomly selected neurons are ignored during training. They are “dropped out” randomly. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to those neurons on the backward pass.

As a neural network learns, neuron weights settle into their context within the network. Weights of neurons are tuned for specific features, providing some specialization. Neighboring neurons come to rely on this specialization, which, if taken too far, can result in a fragile model that is too specialized to the training data. This reliance on context for a neuron during training is referred to as complex co-adaptation.

dropout when p=0.5

p is the dropout factor, which can be determined by hyperparameter tuning. It is important to also note that 0≤p≤1.

You can imagine that if neurons are randomly dropped out of the network during training, other neurons will have to step in and handle the representation required to make predictions for the missing neurons. This is believed to result in multiple independent internal representations being learned by the network.

The effect is that the network becomes less sensitive to the specific weights of neurons. This in turn results in a network that is capable of better generalization and is less likely to over-fit the training data.
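
A rough sketch of dropout on one layer's activations is shown below. It assumes that p is the probability of dropping a neuron, and it rescales the surviving activations by 1 / (1 - p), a common implementation choice often called inverted dropout that the article itself does not describe. The function and parameter names are hypothetical.

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    # At prediction time every neuron is kept and nothing is rescaled.
    if not training or p == 0.0:
        return list(activations)
    keep_prob = 1.0 - p
    # Keep each neuron with probability 1 - p; zero it out otherwise.
    # Survivors are scaled up so the expected total activation stays the same.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

layer_output = [0.2, 0.9, 0.5, 0.7, 0.1, 0.4]
print(dropout(layer_output, p=0.5, rng=random.Random(0)))
print(dropout(layer_output, training=False))
```

A fresh random mask is drawn on every call, so a different subset of neurons is dropped on each training step.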

What is the need for optimization Algorithms?

As we’ve discussed earlier in this article, to train a neural network model we must define a loss function that measures the difference between the model’s predictions and the actual labels it was meant to predict. What we are looking for is a set of weights with which the neural network can make accurate predictions, which automatically leads to a lower value of the loss function.

As you must know by now, the mathematical method behind this is called gradient descent.

By repeatedly applying gradient descent to the weights, we will eventually arrive at the optimal weights that minimize the loss function and allow the neural network to make better predictions.

In practice, this technique may run into certain obstacles during training that can slow down the learning process or, in the worst case, even prevent the algorithm from finding the optimal weights.

These problems are, on the one hand, saddle points and local minima of the loss function, where the loss surface becomes flat and the gradient goes to zero:

1. Saddle points 2. Local minima

As the gradient (the slope) becomes zero on flat surfaces, the weights are no longer updated. This creates a deadlock and prevents the model from learning.

On the other hand, even if we have gradients that are not close to zero, the gradients calculated for different data samples from the training set may vary in value and direction. We say that the gradients are noisy or have a lot of variance. This leads to a zigzag movement towards the optimal weights and can make learning much slower. To overcome these difficulties, we turn to optimization algorithms, which help the weights move smoothly and swiftly towards the global minimum.
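
The deadlock at a saddle point can be reproduced on a toy surface. The loss w1^2 - w2^2, which has a saddle point at (0, 0), and the function names are assumptions chosen only for illustration.

```python
def gradient(w1, w2):
    # Partial derivatives of the toy loss w1^2 - w2^2.
    return 2.0 * w1, -2.0 * w2

def step(w1, w2, learning_rate=0.1):
    # One plain gradient descent update on both weights.
    g1, g2 = gradient(w1, w2)
    return w1 - learning_rate * g1, w2 - learning_rate * g2

print(step(0.0, 0.0))   # exactly at the saddle point: zero gradient, no movement
print(step(0.0, 0.01))  # a tiny nudge away is enough for the update to move again
```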

There are different optimization algorithms. I recommend clicking on the hyperlinks below to get a comprehensive understanding of each optimizer.

“I’ve posted the code for applying Artificial Neural Networks to Churn-Modelling on my GitHub. I would recommend downloading the code to your local system and running it for a better understanding of how each layer functions.”

My github link to that code is given below:
