
Understanding Activation Functions in Depth

Last Updated : 21 Nov, 2024

In artificial neural networks, the activation function of a neuron determines its output for a given input. This output serves as the input for the neurons in the next layer, and the process repeats until the final layer produces the network's prediction.

Consider a binary classification problem, where the goal is to classify an input, such as an image, into one of two categories:

  • 1: The image contains the correct object.
  • 0: The image does not contain the correct object.

Here, the activation function helps decide between the two outputs (1 or 0).

The sigmoid function is commonly used in binary classification tasks to output probabilities that guide the classification decision.
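As a minimal sketch of that decision step, the snippet below squashes a raw score with the sigmoid and compares the result to a cut-off. The function name `classify` and the 0.5 threshold are assumptions made for illustration; the article does not prescribe a particular threshold.

```python
import math

def sigmoid(z):
    """Map a real-valued score z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Return 1 if the sigmoid output reaches the (assumed) threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

print(classify(2.0))   # positive score -> class 1
print(classify(-2.0))  # negative score -> class 0
```

With a threshold of 0.5, the decision reduces to checking the sign of the raw score, since the sigmoid of 0 is exactly 0.5.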

Why Do Activation Functions Matter in Deep Learning?

Without activation functions, a neural network would behave like a simple linear model, unable to capture the complexity required for solving problems like image recognition or natural language processing. Activation functions introduce non-linearity, enabling the network to learn intricate patterns.

How Do Activation Functions Work?

Activation functions transform a neuron's raw weighted input into its output. Mathematically, they operate on the weighted sum of inputs, adding non-linearity and enabling the network to solve complex problems.

1. Weighted Sum of Inputs

Each neuron computes a weighted sum of the inputs as:

[Tex]z = \sum_{i=1}^{m} w_i x_i + b[/Tex]

Here:

  • [Tex]x_i[/Tex]: Input features
  • [Tex]w_i[/Tex]: Weights associated with each input
  • [Tex]b[/Tex]: Bias term

This weighted sum [Tex]z[/Tex] becomes the input for the activation function.

2. Activation Function

The activation function transforms the weighted sum [Tex]z[/Tex] into the neuron’s output. Mathematically:

[Tex]y=f(z)[/Tex]

where [Tex]f[/Tex] is the activation function, for example ReLU, sigmoid, or tanh.

These activation functions are discussed in detail in the following sections.
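The two steps above (weighted sum, then activation) can be sketched for a single neuron as follows. The helper names and the example values for the inputs, weights, and bias are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math

def weighted_sum(x, w, b):
    """Compute z = sum_i w_i * x_i + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)

def neuron_output(x, w, b, f):
    """Apply the activation function f to the weighted sum: y = f(z)."""
    return f(weighted_sum(x, w, b))

# Illustrative values (assumptions, not taken from the article)
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
b = 0.1

z = weighted_sum(x, w, b)  # 0.5 - 2.0 + 0.75 + 0.1 = -0.65
print(z)
print(neuron_output(x, w, b, relu))       # negative z is clipped to 0.0
print(neuron_output(x, w, b, math.tanh))  # negative z stays negative
```

Passing the activation in as an argument keeps the weighted sum independent of the choice of [Tex]f[/Tex], which mirrors the separation between the two steps in the text.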

3. Gradient and Optimization

Activation functions play a vital role in backpropagation by contributing to the gradient calculation. The gradient of the activation function determines how weights are updated during training:

[Tex]\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial w}[/Tex]

Here:

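  • [Tex]\frac{\partial y}{\partial z}[/Tex]: Derivative of the activation function
  • [Tex]\frac{\partial L}{\partial y}[/Tex]: Gradient of the loss with respect to the neuron's output
  • [Tex]\frac{\partial z}{\partial w}[/Tex]: Derivative of the weighted sum with respect to the weight, which equals the corresponding input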

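Saturating activation functions like sigmoid and tanh can cause vanishing gradients because their derivatives approach zero for inputs of large magnitude. ReLU mitigates this issue for positive inputs, where its derivative is exactly 1, which makes it popular in deep networks.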
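To make the chain rule concrete, here is a small sketch for one sigmoid neuron. It assumes a squared-error loss [Tex]L = \frac{1}{2}(y - t)^2[/Tex] with target [Tex]t[/Tex], which the article does not specify, and it checks the analytic gradient against a finite-difference estimate. All names and numeric values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x, b):
    """Return the neuron output y = sigmoid(sum_i w_i * x_i + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

def loss(w, x, b, target):
    """Assumed loss: L = 0.5 * (y - target)^2."""
    return 0.5 * (forward(w, x, b) - target) ** 2

def analytic_grad(w, x, b, target):
    """dL/dw_i = dL/dy * dy/dz * dz/dw_i."""
    y = forward(w, x, b)
    dL_dy = y - target        # derivative of the assumed loss
    dy_dz = y * (1.0 - y)     # derivative of the sigmoid
    return [dL_dy * dy_dz * xi for xi in x]  # dz/dw_i = x_i

def numeric_grad(w, x, b, target, eps=1e-6):
    """Central finite-difference estimate of dL/dw_i."""
    grads = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        diff = loss(w_plus, x, b, target) - loss(w_minus, x, b, target)
        grads.append(diff / (2.0 * eps))
    return grads

w, x, b, target = [0.3, -0.2], [1.0, 2.0], 0.1, 1.0
analytic = analytic_grad(w, x, b, target)
numeric = numeric_grad(w, x, b, target)
print(analytic)
print(numeric)
```

The two printed lists should agree to several decimal places, which is a common sanity check when implementing gradients by hand.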

Types of Activation Functions

Activation functions are classified into two main categories: Linear Activation Functions and Non-linear Activation Functions.

1. Linear Activation Function

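A linear activation function returns its input unchanged or scaled by a constant, so the neuron's output is a linear combination of its inputs. While simple, linear activation functions lack the non-linearity required for learning complex patterns: any stack of layers with linear activations collapses into a single linear transformation, which limits their use in modern neural networks.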
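The limitation can be seen in a short sketch. It uses scalar weights for readability (an assumption; real layers use matrices, but the same algebra applies) and shows that two stacked linear layers compute exactly what one linear layer with combined parameters computes. The specific weights and biases are hypothetical.

```python
def linear_layer(x, w, b):
    """A layer with an identity (linear) activation: output = w * x + b."""
    return w * x + b

# Hypothetical parameters for two stacked layers
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_linear_layers(x):
    return linear_layer(linear_layer(x, w1, b1), w2, b2)

# w2 * (w1 * x + b1) + b2 = (w2 * w1) * x + (w2 * b1 + b2)
w_eq = w2 * w1
b_eq = w2 * b1 + b2

def single_layer(x):
    return linear_layer(x, w_eq, b_eq)

for x in (-1.0, 0.0, 1.5):
    print(x, two_linear_layers(x), single_layer(x))
```

Each printed row shows the same value twice, so the extra layer adds no expressive power without a non-linear activation in between.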

2. Non-linear Activation Functions

Non-linear activation functions enable the network to learn complex patterns by introducing non-linearity. This allows the model to represent relationships between inputs and outputs that no linear model can capture. Non-linear means the output cannot be expressed as a linear combination of the inputs.

Key Terms for Non-linear Functions:

  • Derivative: The rate of change of the output (y-axis) with respect to the input (x-axis).
  • Monotonic Function: A function that is consistently non-increasing or non-decreasing.

Non-linear activation functions are further divided based on their range and curves. Let’s examine each function in detail:

1. Sigmoid Function

The sigmoid function is defined as:

[Tex]\sigma(x) = \frac{1}{1 + e^{-x}}[/Tex]

Derivative:

[Tex]\frac{d}{dx} \sigma(x) = \sigma(x) (1 - \sigma(x))[/Tex]

The function saturates for extreme values of [Tex]x[/Tex]: its derivative peaks at 0.25 when [Tex]x = 0[/Tex] and approaches 0 as [Tex]x \to -\infty \text{ or } x \to \infty[/Tex], which can lead to the vanishing gradient problem.
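A minimal sketch of the sigmoid and its derivative follows. The branch on the sign of `x` is an implementation choice made here to avoid overflow in `math.exp` for very negative inputs; it is not part of the definition above.

```python
import math

def sigmoid(x):
    """Sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # safe: x < 0, so exp(x) is in (0, 1)
    return e / (1.0 + e)

def sigmoid_prime(x):
    """Derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_prime(x))
```

The derivative is largest at 0 and already very small at inputs of magnitude 10, which is the saturation behaviour described above.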

2. ReLU (Rectified Linear Unit)

The ReLU function is defined as:

[Tex]f(x) = \max(0, x)[/Tex]

Derivative:

[Tex]f'(x) = \begin{cases} 1, & \text{if } x > 0 \\ 0, & \text{if } x \leq 0 \end{cases}[/Tex]

The derivative is simple but undefined at [Tex]x = 0[/Tex]. By convention, it is often set to 0 at this point.
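As a short sketch, ReLU and its derivative can be written as below, following the convention just mentioned of using 0 for the derivative at [Tex]x = 0[/Tex]. The function names are chosen here for illustration.

```python
def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)

def relu_prime(x):
    """Derivative of ReLU, taking the value 0 at x = 0 by convention."""
    return 1.0 if x > 0 else 0.0

for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), relu_prime(x))
```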

3. Leaky ReLU

The Leaky ReLU function is an extension of ReLU, defined as:

[Tex]f(x) = \begin{cases} x, & \text{if } x > 0 \\ \alpha x, & \text{if } x \leq 0 \end{cases}[/Tex]

Where [Tex]\alpha[/Tex] is a small constant (e.g., 0.01).

Derivative:

[Tex]f'(x) = \begin{cases} 1, & \text{if } x > 0 \\ \alpha, & \text{if } x \leq 0 \end{cases}[/Tex]

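The small gradient for [Tex]x \leq 0[/Tex] helps avoid the Dying ReLU problem, in which a neuron that only receives negative inputs outputs zero, gets a zero gradient, and therefore stops updating its weights.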
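The difference from plain ReLU shows up only for non-positive inputs, as this sketch illustrates. The default `alpha=0.01` matches the example value given above; the function names are illustrative.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for x > 0, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_prime(x, alpha=0.01):
    """Derivative: 1 for x > 0, alpha otherwise."""
    return 1.0 if x > 0 else alpha

for x in (-2.0, 0.0, 3.0):
    print(x, leaky_relu(x), leaky_relu_prime(x))
```

For a negative input the gradient is `alpha` rather than 0, so a weight update is still possible.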

4. Tanh (Hyperbolic Tangent)

The tanh function is defined as:

[Tex]\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}[/Tex]

Derivative:

[Tex]\frac{d}{dx} \tanh(x) = 1 - \tanh^2(x)[/Tex]

Like the sigmoid function, tanh saturates for extreme values of [Tex]x[/Tex], so its derivative approaches 0 there, leading to the vanishing gradient problem.
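A brief sketch of the tanh derivative, using the standard library's `math.tanh`; the helper name `tanh_prime` is an assumption for illustration.

```python
import math

def tanh_prime(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

for x in (-5.0, 0.0, 5.0):
    print(x, math.tanh(x), tanh_prime(x))
```

The derivative equals 1 at the origin, four times the sigmoid's peak of 0.25, but it still shrinks toward 0 as the input grows in magnitude.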

5. Softmax

The softmax function is defined as:

[Tex]\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}[/Tex]

Derivative:

For an individual output [Tex]y_i = \text{Softmax}(z_i)[/Tex]:

[Tex]\frac{\partial y_i}{\partial z_j} = \begin{cases} y_i (1 - y_i), & \text{if } i = j \\ -y_i y_j, & \text{if } i \neq j \end{cases}[/Tex]

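Each output, and therefore each derivative, depends on all of the inputs. The outputs are positive and sum to 1, so they can be read as a probability distribution over classes, which makes softmax suitable for multi-class classification problems.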
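The following sketch computes softmax and its full matrix of partial derivatives in plain Python. Subtracting the maximum input before exponentiating is a common numerical-stability step added here as an assumption; it does not change the result. The example scores are hypothetical.

```python
import math

def softmax(z):
    """Softmax over a list of scores, shifted by max(z) for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """Matrix J with J[i][j] = dy_i/dz_j = y_i * (delta_ij - y_j)."""
    y = softmax(z)
    n = len(y)
    return [[y[i] * ((1.0 if i == j else 0.0) - y[j]) for j in range(n)]
            for i in range(n)]

scores = [2.0, 1.0, 0.1]  # illustrative values
probs = softmax(scores)
jac = softmax_jacobian(scores)
print(probs, sum(probs))
for row in jac:
    print(row, sum(row))
```

Each row of the matrix sums to approximately zero, reflecting that the outputs always sum to 1: raising one probability must lower the others.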

Why is the Derivative of Activation Functions Important?

The previous section covered the derivatives of the activation functions. This section explains why those derivatives matter.

The derivative of an activation function measures how sensitive the function's output is to changes in its input. This sensitivity, expressed mathematically as the gradient, is essential for optimization during the training process.

1. Enabling Backpropagation

The backpropagation algorithm relies on the chain rule of calculus to compute the gradient of the loss function with respect to each weight in the network. This process requires the derivative of the activation function at each layer to adjust the weights effectively. The derivative ensures that the gradient flows backward through the network.

For example, the gradient computation for a neuron involves:

[Tex]\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial w}[/Tex]

Here, [Tex]\frac{\partial y}{\partial z}[/Tex] represents the derivative of the activation function, which determines how the neuron’s output responds to changes in its input.

2. Guiding Weight Updates

During training, the gradient of the activation function helps in determining the magnitude and direction of weight updates. A small or zero derivative can slow down or even halt learning (e.g., vanishing gradients with sigmoid or tanh), while an appropriate derivative (like ReLU’s) allows efficient optimization.
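A minimal sketch of how the gradient sets the size and direction of an update, assuming plain gradient descent with a fixed learning rate (the article does not name an optimizer; `sgd_step` and `lr` are illustrative names):

```python
def sgd_step(weights, grads, lr=0.1):
    """One gradient-descent update: w <- w - lr * dL/dw."""
    return [w - lr * g for w, g in zip(weights, grads)]

weights = [1.0, -2.0]

# A non-zero gradient moves each weight against the gradient's sign.
updated = sgd_step(weights, [0.5, -0.5])
print(updated)

# A zero gradient (for example, from a saturated activation) changes nothing.
stalled = sgd_step(weights, [0.0, 0.0])
print(stalled)
```

Because the activation's derivative is one factor of the gradient, a derivative near zero makes the whole update near zero, regardless of how large the error is.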

3. Handling Vanishing and Exploding Gradients

The choice of activation function and its derivative directly impacts gradient stability:

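  • Vanishing Gradient Problem: Functions like sigmoid and tanh have derivatives close to zero for inputs of large magnitude, positive or negative. Backpropagation multiplies these factors layer by layer, so gradients shrink as they propagate back through the network, which leads to slow or ineffective learning in the early layers.
  • Exploding Gradients: If the factors multiplied during backpropagation become excessively large, the weights can update too drastically, destabilizing the model. The derivatives of ReLU and Leaky ReLU never exceed 1, so these activations do not amplify gradients themselves, but large weights can still cause explosion. It is commonly controlled with careful weight initialization and gradient clipping.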
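The vanishing effect can be estimated with a rough sketch that multiplies one activation-derivative factor per layer. It deliberately ignores the weight terms that also appear in the real chain-rule product, so it isolates the contribution of the activation alone; the depth of 10 is an arbitrary choice.

```python
def gradient_scale(derivative, depth):
    """Product of `depth` identical activation-derivative factors."""
    scale = 1.0
    for _ in range(depth):
        scale *= derivative
    return scale

depth = 10
# Sigmoid's derivative is at most 0.25, so this is its best case.
sigmoid_best_case = gradient_scale(0.25, depth)
# ReLU's derivative is exactly 1 for positive inputs.
relu_active = gradient_scale(1.0, depth)

print(sigmoid_best_case)
print(relu_active)
```

Even in the sigmoid's best case, ten layers scale the gradient by less than one millionth, while active ReLU units leave it unchanged.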

4. Improving Convergence Speed

Simple, well-behaved derivatives allow for faster convergence during training. Activation functions like ReLU, whose gradient is a constant 1 for positive inputs, speed up learning by maintaining meaningful gradients across layers.

5. Enabling Complex Learning

Non-linear activation functions give the network its ability to represent complex patterns, and their derivatives determine how well that ability can be trained. The shape of the derivative controls how strongly each neuron's weights respond to the error signal, enabling the network to model intricate relationships between inputs and outputs.



