Activation Functions
Characteristics of a good activation function
Nonlinear
When the activation function is nonlinear, a two-layer neural network can be proven to be a universal function approximator.
A universal function is a computable function capable of calculating any other computable function. If we use linear functions throughout the network, then the whole network is equivalent to a single-layer network (a perceptron), because a composition of linear functions is itself a linear function, as the sketch below shows.
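Here is a minimal NumPy sketch (with arbitrary shapes and random weights, purely for illustration) of two stacked linear layers collapsing into one linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))        # input vector
W1 = rng.normal(size=(5, 4))     # first "layer" weights
W2 = rng.normal(size=(3, 5))     # second "layer" weights

# Two linear layers applied in sequence...
two_layer = W2 @ (W1 @ x)
# ...are exactly equivalent to one linear layer with weights W2 @ W1.
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layer, one_layer))  # True
```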
Range
When the range is finite, gradient-based optimization methods tend to be more stable, because a bounded activation keeps the activations, and hence the weight updates, limited. When the range is infinite, gradient-based optimization can be more efficient, but smaller learning rates are generally needed, because the weight updates are no longer limited by the activation function. You can refer to the article in the above list; you will find that the weight update also depends on the activation function, as the sketch below shows.
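As a quick sketch of why this is the case, writing the pre-activation as $z = w^{\top}x + b$, the output as $a = f(z)$, and the loss as $L$ (generic symbols that may differ from the notation used in the series), the chain rule shows that the update for each weight carries a factor of the activation's derivative:

$$
\frac{\partial L}{\partial w_i}
= \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z}\cdot\frac{\partial z}{\partial w_i}
= \frac{\partial L}{\partial a}\cdot f'(z)\cdot x_i
$$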
Range and Domain: The domain of a function is the set of all values for which the function is defined, and the range of the function is the set of all values that the function takes.
For example, take $f(x) = \sin(x)$. Its range is $[-1, 1]$ and its domain is $(-\infty, \infty)$.
Continuously differentiable
A continuously differentiable function is a function whose derivative is also continuous on its domain. YouTube: Continuity Basic Introduction, Point, Infinite, & Jump Discontinuity, Removable & Nonremovable
In the image below, the function is a binary step function; it is discontinuous at $x = 0$, and the discontinuity is a jump discontinuity. Since it is not differentiable at $x = 0$, gradient-based methods can make no progress with it.
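For reference, the binary step function and its derivative can be written as follows; the derivative is zero everywhere except at $x = 0$, where it is undefined, so backpropagation receives no useful signal:

$$
f(x) = \begin{cases} 0 & x < 0 \\ 1 & x \ge 0 \end{cases},
\qquad
f'(x) = \begin{cases} 0 & x \neq 0 \\ \text{undefined} & x = 0 \end{cases}
$$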
Monotonic
In calculus, a function defined on a subset of the real numbers with real values is called monotonic if and only if it is either entirely non-increasing, or entirely non-decreasing.
The identity function $f(x) = x$ is a monotonic function, while a function such as $f(x) = \sin(x)$ is non-monotonic.
When the activation function is monotonic, the error surface associated with a single-layer model is guaranteed to be convex.
Monotonic Derivative
Smooth functions with a monotonic derivative have been shown to generalize better in some cases. I think this is because of the local minima problem: while training, networks sometimes get stuck in a local minimum instead of the global minimum.
Approximates identity near the origin
Usually, the weights and biases are initialized with values close to zero before training with the gradient descent method. Consequently, the pre-activation $w^{\top}x + b$ (check the series above) will be close to zero.
If the activation function approximates the identity near zero, its output will be approximately equal to its input, and its gradient will be close to 1.
In other words, $f(x) \approx x$ for $x \approx 0$. In terms of gradient descent, this is a strong gradient that helps the training algorithm converge faster, as shown below.
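In symbols, assuming the activation satisfies $f(0) = 0$ and $f'(0) = 1$, a first-order Taylor expansion around the origin gives:

$$
f(x) \approx f(0) + f'(0)\,x = x \quad \text{for } x \approx 0,
\qquad \text{and hence} \quad f'(x) \approx 1 \text{ near the origin.}
$$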
Activation Functions
From here onward, $f(x)$ denotes the equation of the activation function and $f'(x)$ its derivative, which is required during backpropagation. We will look at the most commonly used activation functions; you can find others on the Wikipedia page (Link). All function graphs are taken from the book Guide to Convolutional Neural Networks: A Practical Application to Traffic-Sign Detection and Classification by Aghdam, Hamed Habibi and Heravi, Elnaz Jahani.
1. Sigmoid
Pros:
- It is nonlinear, so it can be used to activate hidden layers in a neural network
- It is differentiable everywhere, so gradient-based backpropagation can be used with it
Cons:
- The gradient for inputs far from the origin is near zero, so gradient-based learning is slow for saturated neurons using sigmoid, i.e. the vanishing gradient problem (see the sketch after the table below)
- When used as the final activation in a classifier, the outputs over all classes don't necessarily sum to 1
- For these reasons, the sigmoid activation function is rarely used in the hidden layers of deep architectures, since training the network becomes nearly impossible due to vanishing gradients
Characteristic | Value |
---|---|
Range | $(0, 1)$ |
Order of Continuity | $C^{\infty}$ |
Monotonic | Yes |
Monotonic Derivative | No |
Approximates Identity near origin | No |
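A minimal NumPy sketch (function names are mine, not from a library) of the sigmoid $\sigma(x) = \frac{1}{1 + e^{-x}}$ and its derivative $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$, illustrating how the gradient vanishes for inputs far from the origin:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # peaks at 0.25 when x = 0

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x))
# The gradient shrinks from 0.25 at x = 0 to roughly 4.5e-5 at x = 10 (vanishing gradient).
```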
2. Hyperbolic Tangent
The hyperbolic tangent activation function is in fact a rescaled version of the sigmoid function.
Pros:
- It is nonlinear, so it can be used to activate hidden layers in a neural network
- It is differentiable everywhere, so gradient-based backpropagation can be used with it
- It is preferred over the sigmoid function because it approximates the identity function near the origin
Cons:
- As $|x|$ increases, it may suffer from the vanishing gradient problem, just like sigmoid (see the sketch after the table below).
- When used as the final activation in a classifier, the outputs over all classes don't necessarily sum to 1.
Characteristic | Value |
---|---|
Range | $(-1, 1)$ |
Order of Continuity | $C^{\infty}$ |
Monotonic | Yes |
Monotonic Derivative | No |
Approximates Identity near origin | Yes |
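A short sketch (the numbers in the comments are approximate) confirming that tanh is a rescaled sigmoid, $\tanh(x) = 2\sigma(2x) - 1$, and that its derivative $1 - \tanh^2(x)$ also vanishes for large $|x|$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)

# tanh is a shifted and rescaled sigmoid.
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True

# Derivative 1 - tanh(x)^2: close to 1 near the origin, close to 0 for large |x|.
tanh_grad = 1.0 - np.tanh(x) ** 2
print(tanh_grad[len(x) // 2], tanh_grad[0])  # ~1.0 at x = 0, ~1.8e-4 at x = -5
```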
3. Rectified Linear Unit
Pros:
- Computationally very efficient
- Its derivative in R+ is always 1 and it does not saturate in R+ (No vanishing gradient problem)
- Good choice for deep networks
- The problem of dead neurons may affect learning, but it also makes the network more efficient at inference time, because dead neurons can be removed.
Cons:
- The function does not approximate the identity function near the origin.
- It may produce dead neurons. A dead neuron always returns 0 for every sample in the dataset. This affects the accuracy of the model.
This happens because the weights of a dead neuron have been adjusted such that its pre-activation $w^{\top}x + b$ is always negative for every sample (see the toy example after the table below).
Characteristic | Value |
---|---|
Range | $[0, \infty)$ |
Order of Continuity | $C^{0}$ |
Monotonic | Yes |
Monotonic Derivative | Yes |
Approximates Identity near origin | No |
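A minimal sketch of ReLU, $f(x) = \max(0, x)$, its (sub)gradient, and a toy dead neuron; the inputs, weights, and bias here are made up purely for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Subgradient: 1 for x > 0, 0 for x < 0 (the value at x = 0 is a convention).
    return (x > 0).astype(float)

# Toy "dead neuron": weights/bias chosen so that w.x + b < 0 for every sample,
# hence the output and the gradient are 0 on the whole dataset.
X = np.array([[0.5, 1.0], [1.5, 0.2], [0.3, 0.7]])  # hypothetical positive inputs
w, b = np.array([-1.0, -1.0]), -0.1
z = X @ w + b
print(relu(z))       # [0. 0. 0.]  -> no signal flows forward
print(relu_grad(z))  # [0. 0. 0.]  -> no gradient flows backward, so the neuron never recovers
```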
4. Leaky Rectified Linear Unit
In practice, leaky ReLU and ReLU may produce similar results. This might be due to the fact that the positive region of these functions is identical. The slope 0.01 in the negative region can be replaced with other values between 0 and 1.
Pros:
- Unlike ReLU, its gradient does not vanish in the negative region, so it solves the dead neuron problem (see the sketch after the table below).
Cons:
Characteristic | Value |
---|---|
Range | $(-\infty, \infty)$ |
Order of Continuity | $C^{0}$ |
Monotonic | Yes |
Monotonic Derivative | Yes |
Approximates Identity near origin | No |
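A short sketch of Leaky ReLU with the common default slope of 0.01 in the negative region, showing that its gradient there is small but never exactly zero:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(leaky_relu(x))       # [-0.03  -0.005  0.5  3. ]
print(leaky_relu_grad(x))  # [0.01 0.01 1.   1.  ]  -> never exactly zero, so neurons don't die
```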
5. Parameterized Rectified Linear Unit
It is the same as Leaky ReLU, but the slope $\alpha$ of the negative region is a learnable parameter that is learned from the data.
Parameter Update
To update $\alpha$, we need the gradient of the activation function with respect to $\alpha$.
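Writing PReLU as $f(x) = \max(0, x) + \alpha \min(0, x)$, the required gradient with respect to $\alpha$ is:

$$
\frac{\partial f(x)}{\partial \alpha} =
\begin{cases}
0 & x \ge 0 \\
x & x < 0
\end{cases}
$$

So during backpropagation, $\alpha$ receives a gradient only from samples whose pre-activation is negative.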
Characteristic | Value |
---|---|
Range | $(-\infty, \infty)$ |
Order of Continuity | $C^{0}$ |
Monotonic | Yes |
Monotonic Derivative | Yes, if $\alpha \geq 0$ |
Approximates Identity near origin | Yes, if $\alpha = 1$ |
6. Softsign
Pros:
- The function is equal to zero at the origin and its derivative at the origin is equal to 1. Therefore, it approximates the identity function at the origin.
- Comparing the function and its derivative with the hyperbolic tangent, we observe that it also saturates as $|x|$ increases. However, the softsign function saturates at a slower rate than the hyperbolic tangent, which is a desirable property (see the comparison after the table below).
- On the other hand, the gradient of the softsign function near the origin drops at a greater rate than that of the hyperbolic tangent.
- In terms of computational complexity, softsign requires less computation than the hyperbolic tangent function.
Cons:
- It saturates as $|x|$ increases
Characteristic | Value |
---|---|
Range | $(-1, 1)$ |
Order of Continuity | $C^{1}$ |
Monotonic | Yes |
Monotonic Derivative | No |
Approximates Identity near origin | Yes |
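A small numerical sketch (not a benchmark) comparing the derivatives of softsign, $f(x) = \frac{x}{1 + |x|}$, and tanh, illustrating both points above: the softsign gradient drops faster near the origin, but it saturates far more slowly for large $|x|$:

```python
import numpy as np

def softsign_grad(x):
    # derivative of x / (1 + |x|)
    return 1.0 / (1.0 + np.abs(x)) ** 2

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

for x in [1.0, 3.0, 10.0]:
    print(x, softsign_grad(x), tanh_grad(x))
# At x = 1:  softsign ~0.25  vs tanh ~0.42  (softsign gradient drops faster near the origin)
# At x = 10: softsign ~0.008 vs tanh ~8e-9  (softsign saturates much more slowly)
```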
7. Softplus
Pros:
- In contrast to ReLU, which is not differentiable at the origin, the softplus function is differentiable everywhere
- The derivative of the softplus function is the sigmoid function, which means the range of the derivative is $(0, 1)$ (see the sketch after the table below)
Cons:
- The derivative of softplus is also a smooth function that saturates as $|x|$ increases
Characteristic | Value |
---|---|
Range | $(0, \infty)$ |
Order of Continuity | $C^{\infty}$ |
Monotonic | Yes |
Monotonic Derivative | Yes |
Approximates Identity near origin | No |
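A minimal sketch of softplus, $f(x) = \ln(1 + e^{x})$, written in a numerically stable form, together with a finite-difference check that its derivative matches the sigmoid:

```python
import numpy as np

def softplus(x):
    # Numerically stable: ln(1 + e^x) = max(x, 0) + ln(1 + e^{-|x|})
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 101)
numeric_grad = np.gradient(softplus(x), x)   # finite-difference derivative
print(np.allclose(numeric_grad, sigmoid(x), atol=1e-2))  # True: the derivative is the sigmoid
```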