Understanding Activation Functions in Depth
Last Updated: 21 Nov, 2024
In artificial neural networks, the activation function of a neuron determines its output for a given input. This output serves as the input for neurons in the next layer, and the process repeats until the final layer produces the network’s prediction.
Consider a binary classification problem, where the goal is to classify an input, such as an image, into one of two categories:
- 1: The image contains the correct object.
- 0: The image does not contain the correct object.
Here, the activation function helps decide between the two outputs (1 or 0).
The sigmoid function is commonly used in binary classification tasks to output probabilities that guide the classification decision.
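As a rough illustration of this decision step, the sketch below passes a raw score through a sigmoid and compares the result with a cutoff. The 0.5 threshold and the example scores are assumptions made for illustration; they do not come from a specific model.

```python
import math

def sigmoid(z):
    """Map a raw score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Return (label, probability) for a raw score z.

    The 0.5 threshold is an assumed default, not a fixed rule.
    """
    p = sigmoid(z)
    label = 1 if p >= threshold else 0
    return label, p

# Hypothetical raw scores from the last layer of a network.
for score in (-2.0, 0.3, 2.0):
    print(score, classify(score))
```

A positive score maps to a probability above 0.5 and is labelled 1; a negative score is labelled 0.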

Why Do Activation Functions Matter in Deep Learning?
Without activation functions, a neural network would behave like a simple linear model, unable to capture the complexity required for solving problems like image recognition or natural language processing. Activation functions introduce non-linearity, enabling the network to learn intricate patterns.
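The collapse to a linear model can be checked directly. The sketch below uses scalar layers with hypothetical weights: two stacked layers without an activation are exactly equivalent to one linear layer, while placing a non-linearity (ReLU is used here as an example) between them breaks that equivalence.

```python
def linear_layer(x, w, b):
    """Affine transform w * x + b (scalar case for readability)."""
    return w * x + b

def relu(z):
    return max(0.0, z)

# Hypothetical weights and biases, chosen only for illustration.
w1, b1 = 2.0, -1.0
w2, b2 = -3.0, 0.5

def two_linear_layers(x):
    """Two stacked layers with no activation in between."""
    return linear_layer(linear_layer(x, w1, b1), w2, b2)

def collapsed_layer(x):
    """The same mapping written as a single linear layer."""
    return linear_layer(x, w2 * w1, w2 * b1 + b2)

def two_layers_with_relu(x):
    """Two layers with a ReLU between them."""
    return linear_layer(relu(linear_layer(x, w1, b1)), w2, b2)

for x in (-2.0, 0.0, 1.5):
    print(x, two_linear_layers(x), collapsed_layer(x), two_layers_with_relu(x))
```

For every input the first two columns agree, so depth adds nothing without an activation. The ReLU version differs wherever the first layer's output is negative.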
How Do Activation Functions Work?
Activation functions transform a neuron’s raw input into its output. Mathematically, they operate on the weighted sum of inputs, adding non-linearity and enabling the network to solve complex problems.
1. Weighted Sum of Inputs
Each neuron computes a weighted sum of the inputs as:
[Tex]z = \sum_{i=1}^{m} w_i x_i + b[/Tex]
Here:
- [Tex]x_i[/Tex]: Input features
- [Tex]w_i[/Tex]: Weights associated with each input
- [Tex]b[/Tex]: Bias term
This weighted sum [Tex]z[/Tex] becomes the input for the activation function.
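A minimal sketch of this computation for a single neuron is shown below. The function name and the example values are hypothetical.

```python
def weighted_sum(inputs, weights, bias):
    """Compute z = sum_i w_i * x_i + b for one neuron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(w * x for w, x in zip(weights, inputs)) + bias

# Hypothetical input features, weights, and bias.
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 2.0]
b = 0.5

z = weighted_sum(x, w, b)
print(z)  # 0.5 - 2.0 + 6.0 + 0.5 = 5.0
```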
2. Activation Function
The activation function transforms the weighted sum [Tex]z[/Tex] into the neuron’s output. Mathematically:
[Tex]y=f(z)[/Tex]
Where [Tex]f[/Tex] is the activation function, for example ReLU, sigmoid, or tanh.
These activation functions are discussed in detail in the sections below.
3. Gradient and Optimization
Activation functions play a vital role in backpropagation by contributing to the gradient calculation. The gradient of the activation function determines how weights are updated during training:
[Tex]\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial w}[/Tex]
Here:
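- [Tex]\frac{\partial y}{\partial z}[/Tex]: Derivative of the activation function
- [Tex]\frac{\partial L}{\partial y}[/Tex]: Gradient of the loss with respect to the neuron’s output
- [Tex]\frac{\partial z}{\partial w}[/Tex]: Gradient of the weighted sum with respect to the weight, which equals the corresponding input [Tex]x[/Tex]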
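The three factors can be multiplied out for one neuron and checked against a numerical estimate. The sketch below assumes a sigmoid activation, a single scalar input, and a squared-error loss [Tex]L = \frac{1}{2}(y - t)^2[/Tex]. None of these choices are fixed by the formula above, and all values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x, b):
    """Return (z, y) for a single-input neuron with a sigmoid activation."""
    z = w * x + b
    return z, sigmoid(z)

def loss(y, target):
    """Squared-error loss (an assumption made for this sketch)."""
    return 0.5 * (y - target) ** 2

def grad_w(w, x, b, target):
    """Chain rule: dL/dw = dL/dy * dy/dz * dz/dw."""
    _, y = forward(w, x, b)
    dL_dy = y - target        # derivative of the squared-error loss
    dy_dz = y * (1.0 - y)     # derivative of the sigmoid
    dz_dw = x                 # derivative of w * x + b with respect to w
    return dL_dy * dy_dz * dz_dw

# Hypothetical values.
w, x, b, target = 0.8, 1.5, -0.2, 1.0

analytic = grad_w(w, x, b, target)

# Central finite difference as an independent check.
eps = 1e-6
loss_plus = loss(forward(w + eps, x, b)[1], target)
loss_minus = loss(forward(w - eps, x, b)[1], target)
numeric = (loss_plus - loss_minus) / (2 * eps)

print(analytic, numeric)
```

The two printed values should agree to several decimal places, which confirms that the product of the three partial derivatives is the gradient of the loss with respect to the weight.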
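Saturating activation functions like sigmoid and tanh can cause vanishing gradients because their derivatives approach zero for inputs of large magnitude. ReLU mitigates this issue for positive inputs, where its derivative is 1, making it popular in deep networks.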
Types of Activation Functions
Activation functions are classified into two main categories: Linear Activation Functions and Non-linear Activation Functions.
1. Linear Activation Function
A linear activation function calculates the output as a linear combination of inputs. While simple, linear activation functions lack the non-linearity required for learning complex patterns, limiting their use in modern neural networks.
2. Non-linear Activation Functions
Non-linear activation functions enable the network to learn complex patterns by introducing non-linearity. This allows the model to fit relationships that a linear model cannot represent and to distinguish between outputs. Non-linear means the output cannot be expressed as a linear combination of the inputs.
Key Terms for Non-linear Functions:
- Derivative: Represents the rate of change of the output (y-axis) with respect to the input (x-axis).
- Monotonic Function: A function that is consistently non-increasing or non-decreasing.
Non-linear activation functions are further divided based on their range and curves. Let’s examine each function in detail:
1. Sigmoid
The sigmoid function is defined as:
[Tex]\sigma(x) = \frac{1}{1 + e^{-x}}[/Tex]
Derivative:
[Tex]\frac{d}{dx} \sigma(x) = \sigma(x) (1 - \sigma(x))[/Tex]
The derivative peaks at 0.25 when [Tex]x = 0[/Tex] and approaches 0 as [Tex]x \to -\infty[/Tex] or [Tex]x \to \infty[/Tex], where the function saturates. This can lead to the vanishing gradient problem.
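The sigmoid and its derivative can be sketched in a few lines. The version below branches on the sign of the input so that the exponential never receives a large positive argument. That is an implementation choice made here to avoid overflow, not part of the definition.

```python
import math

def sigmoid(x):
    """Numerically stable sigmoid: 1 / (1 + e^(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form e^x / (1 + e^x).
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_derivative(x):
    """sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(x, sigmoid(x), sigmoid_derivative(x))
```

The printed derivative is largest at 0 and shrinks rapidly toward both ends of the range, which is the saturation described above.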
2. ReLU (Rectified Linear Unit)
The ReLU function is defined as:
[Tex]f(x) = \max(0, x)[/Tex]
Derivative:
[Tex]f'(x) = \begin{cases} 1, & \text{if } x > 0 \\ 0, & \text{if } x \leq 0 \end{cases}[/Tex]
The derivative is simple but undefined at [Tex]x = 0[/Tex]. By convention, it is often set to 0 at this point.
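A minimal sketch of ReLU and its derivative follows, using the convention stated above that the derivative at 0 is taken to be 0. The function names are hypothetical.

```python
def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)

def relu_derivative(x):
    """1 for x > 0, otherwise 0 (including the conventional value at x == 0)."""
    return 1.0 if x > 0 else 0.0

for x in (-3.0, 0.0, 2.5):
    print(x, relu(x), relu_derivative(x))
```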
3. Leaky ReLU
The Leaky ReLU function is an extension of ReLU, defined as:
[Tex]f(x) = \begin{cases} x, & \text{if } x > 0 \\ \alpha x, & \text{if } x \leq 0 \end{cases}[/Tex]
Where [Tex]\alpha[/Tex] is a small constant (e.g., 0.01).
Derivative:
[Tex]f'(x) = \begin{cases} 1, & \text{if } x > 0 \\ \alpha, & \text{if } x \leq 0 \end{cases}[/Tex]
The small non-zero gradient for [Tex]x \leq 0[/Tex] helps avoid the Dying ReLU problem, in which a neuron whose input stays negative receives zero gradient and stops learning.
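The piecewise definition translates directly into code. The sketch below uses the example value 0.01 for the slope on the negative side. The function names are hypothetical.

```python
def leaky_relu(x, alpha=0.01):
    """x for x > 0, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_derivative(x, alpha=0.01):
    """1 for x > 0, alpha otherwise, so the gradient is never exactly zero."""
    return 1.0 if x > 0 else alpha

for x in (-5.0, 0.0, 3.0):
    print(x, leaky_relu(x), leaky_relu_derivative(x))
```

For a negative input, plain ReLU would pass back a gradient of 0, whereas here the gradient is alpha, so the weights feeding the neuron can still change.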
4. Tanh (Hyperbolic Tangent)
The tanh function is defined as:
[Tex]\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}[/Tex]
Derivative:
[Tex]\frac{d}{dx} \tanh(x) = 1 - \tanh^2(x)[/Tex]
Like the sigmoid function, the tanh derivative also saturates for extreme values of x, leading to the vanishing gradient problem.
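The sketch below writes out the definition once and compares it with Python's built-in `math.tanh`, then evaluates the derivative at a few points. The direct formula is shown only to mirror the equation; it overflows for very large inputs, so the derivative uses `math.tanh`.

```python
import math

def tanh_from_definition(x):
    """(e^x - e^-x) / (e^x + e^-x); overflows for very large |x|."""
    ep, en = math.exp(x), math.exp(-x)
    return (ep - en) / (ep + en)

def tanh_derivative(x):
    """d/dx tanh(x) = 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(x, math.tanh(x), tanh_derivative(x))
```

The derivative equals 1 at the origin and falls toward 0 at both ends, the same saturation pattern seen with the sigmoid.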
5. Softmax
The softmax function is defined as:
[Tex]\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}[/Tex]
Derivative:
For an individual output [Tex]y_i = \text{Softmax}(z_i)[/Tex]:
[Tex]\frac{\partial y_i}{\partial z_j} = y_i (1 - y_i), \text{ if } i = j[/Tex]
[Tex]\frac{\partial y_i}{\partial z_j} = -y_i y_j, \text{ if } i \neq j[/Tex]
Each output depends on every input, and the outputs are positive and sum to 1, so they can be read as class probabilities. This makes softmax suitable for the output layer in multi-class classification problems.
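The softmax and its matrix of partial derivatives can be sketched as follows. Subtracting the largest input before exponentiating is a common stabilisation step added here as an implementation choice; it does not change the result. The input scores are hypothetical.

```python
import math

def softmax(z):
    """Softmax over a list of scores."""
    m = max(z)  # shift by the maximum to avoid overflow; the output is unchanged
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """J[i][j] = dy_i/dz_j: y_i * (1 - y_i) if i == j, else -y_i * y_j."""
    y = softmax(z)
    n = len(y)
    return [
        [y[i] * (1.0 - y[i]) if i == j else -y[i] * y[j] for j in range(n)]
        for i in range(n)
    ]

# Hypothetical scores for three classes.
z = [2.0, 1.0, 0.1]
y = softmax(z)
J = softmax_jacobian(z)

print(y, sum(y))
for row in J:
    print(row)
```

The outputs sum to 1, and each row of the matrix sums to approximately 0: raising one score increases its own probability and lowers the others by the same total amount.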
Why is the Derivative of Activation Functions Important?
The previous section listed the derivatives of the activation functions. This section explains why those derivatives matter.
The derivative of an activation function measures how sensitive its output is to changes in its input. This sensitivity, expressed mathematically as the gradient, is essential for optimization during training.
1. Enabling Backpropagation
The backpropagation algorithm relies on the chain rule of calculus to compute the gradient of the loss function with respect to each weight in the network. This process requires the derivative of the activation function at each layer to adjust the weights effectively. The derivative ensures that the gradient flows backward through the network.
For example, the gradient computation for a neuron involves:
[Tex]\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial w}[/Tex]
Here, [Tex]\frac{\partial y}{\partial z}[/Tex] represents the derivative of the activation function, which determines how the neuron’s output responds to changes in its input.
2. Guiding Weight Updates
During training, the gradient of the activation function helps in determining the magnitude and direction of weight updates. A small or zero derivative can slow down or even halt learning (e.g., vanishing gradients with sigmoid or tanh), while an appropriate derivative (like ReLU’s) allows efficient optimization.
3. Handling Vanishing and Exploding Gradients
The choice of activation function and its derivative directly impacts gradient stability:
- Vanishing Gradient Problem: Functions like sigmoid and tanh have derivatives close to zero for large positive or large negative inputs, which can cause gradients to shrink as they propagate back through the layers. This leads to slow or ineffective learning.
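- Exploding Gradients: If the per-layer factors in the chain rule are consistently larger than 1, gradients grow as they propagate backward and the weights can update too drastically, destabilizing the model. The derivatives of sigmoid, tanh, ReLU, and Leaky ReLU never exceed 1, so with these functions the activation term itself does not amplify the gradient, although large weights still can.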
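The shrinking effect can be illustrated by multiplying the activation derivatives along one path through a stack of layers. The sketch below deliberately ignores the weight terms that also appear in the chain rule, and the pre-activation values are hypothetical, so it shows only the contribution of the activation function.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0

def activation_gradient_product(derivative, pre_activations):
    """Multiply the activation derivatives along one path, one factor per layer."""
    product = 1.0
    for z in pre_activations:
        product *= derivative(z)
    return product

# Hypothetical pre-activation values for a 10-layer path, all positive.
pre_activations = [0.5] * 10

sigmoid_factor = activation_gradient_product(sigmoid_derivative, pre_activations)
relu_factor = activation_gradient_product(relu_derivative, pre_activations)

print("sigmoid:", sigmoid_factor)
print("relu:   ", relu_factor)
```

Because the sigmoid derivative is at most 0.25, ten layers scale the gradient by no more than 0.25 to the power of 10, which is below one millionth. The ReLU factor stays at 1 as long as every pre-activation on the path is positive.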
4. Improving Convergence Speed
Simple, well-behaved derivatives allow for faster convergence during training. Activation functions like ReLU, whose gradient is a constant 1 for positive inputs, speed up learning by maintaining meaningful gradients across layers.
5. Enabling Complex Learning
Non-linear activation functions owe their ability to learn complex patterns to their derivatives. The shape of the derivative determines the function’s behavior, enabling the network to model intricate relationships between inputs and outputs.