Exploding Gradients in Neural Networks

In this article, we discuss exploding gradients in neural networks in depth.


Mansoor Ahmed

a year ago | 4 min read


An error gradient is the direction and magnitude calculated during the training of a neural network. It is used to update the network weights in the right direction and by the right amount.

In deep networks or recurrent neural networks, error gradients can accumulate during an update and result in very large gradients. These in turn result in large updates to the network weights. At an extreme, the weight values can become so large that they overflow and result in NaN values.

The explosion occurs through exponential growth, by repeatedly multiplying gradients through network layers that have values larger than 1.0.
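The exponential growth described above can be illustrated with a minimal sketch. It assumes a deliberately simplified picture in which a scalar gradient is scaled by the same factor at every layer; the function name and the factors are hypothetical and chosen only for illustration.

```python
# Simplified illustration: a scalar gradient scaled by one factor per layer.
def backpropagated_magnitude(per_layer_factor, num_layers, start=1.0):
    """Return the gradient magnitude after passing back through num_layers layers."""
    magnitude = start
    for _ in range(num_layers):
        magnitude *= per_layer_factor
    return magnitude

# Factors below 1.0 shrink the gradient, factors above 1.0 blow it up.
for factor in (0.9, 1.0, 1.5):
    print(factor, backpropagated_magnitude(factor, num_layers=50))
```

With a factor of 1.5 and 50 layers, the magnitude grows to hundreds of millions, while a factor of 0.9 shrinks it toward zero.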


The exploding gradients problem refers to a large increase in the norm of the gradient during training. Such events are caused by an explosion of the long-term components, which can grow exponentially more than the short-term ones.

This results in an unstable network that, at best, cannot learn from the training data. It can make the gradient descent step impossible to perform.

  • The objective function for highly nonlinear deep neural networks often contains sharp nonlinearities in parameter space.
  • These result from the multiplication of several parameters.
  • These nonlinearities give rise to very high derivatives in some places.
  • When the parameters get close to such a cliff region, a gradient descent update can throw the parameters very far.
  • This may undo most of the optimization work that had been done.

Cliffs and Exploding Gradients

Neural networks with many layers often have very steep regions resembling cliffs. These result from the multiplication of several large weights together. On the face of an extremely steep cliff structure, the gradient update step can move the parameters very far, usually jumping off the cliff structure altogether.

The cliff can be dangerous whether we approach it from above or from below. However, its most serious consequences can be avoided using the gradient clipping heuristic. The basic idea is to recall that the gradient specifies only a direction, not the optimal step size. When gradient descent proposes a very large step, the gradient clipping heuristic intervenes to reduce the step size so that it is small enough. This makes it less likely to go outside the region where the gradient indicates the direction of approximately steepest descent.
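A single update near a cliff can be sketched as follows. This is a toy illustration under stated assumptions: the gradient value of 1000.0, the learning rate, and the helper name clipped_step are all hypothetical, and the parameter is a single scalar.

```python
def clipped_step(gradient, learning_rate, max_norm):
    """Return the parameter update after limiting the gradient magnitude to max_norm."""
    magnitude = abs(gradient)
    if magnitude > max_norm:
        # Keep the direction (sign) but shrink the magnitude to max_norm.
        gradient = gradient * (max_norm / magnitude)
    return -learning_rate * gradient

w = 1.0
cliff_gradient = 1000.0  # hypothetical very steep slope on the cliff face
learning_rate = 0.1

unclipped_w = w - learning_rate * cliff_gradient
clipped_w = w + clipped_step(cliff_gradient, learning_rate, max_norm=1.0)
print("unclipped:", unclipped_w)
print("clipped:", clipped_w)
```

The unclipped update throws the parameter far from where it started, while the clipped update moves it a short distance in the same direction.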

Cliff structures are most common in the cost functions for recurrent neural networks, because such models involve the multiplication of many factors, with one factor for each time step. Long temporal sequences thus incur an extreme amount of multiplication.

Identifying and Detecting Exploding Gradients

These gradient problems are difficult to identify before the training process has even started. When the network is a deep recurrent one, we have to continually monitor the logs and record unexpected jumps in the cost function.

This tells us whether these jumps are frequent and whether the norm of the gradient is growing exponentially. A convenient way to do this is to monitor the logs in a visualization dashboard.
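As a rough sketch of this kind of monitoring, the check below scans a list of logged gradient norms and flags steps where the norm is not finite or jumps sharply relative to the previous step. The function name, the ratio threshold of 10.0, and the logged values are assumptions made for illustration, not part of any library.

```python
import math

def find_gradient_spikes(grad_norms, ratio_threshold=10.0):
    """Return step indices where the norm is non-finite or grows by more
    than ratio_threshold compared with the previous step."""
    flagged = []
    for step, norm in enumerate(grad_norms):
        if not math.isfinite(norm):
            flagged.append(step)
            continue
        if step > 0:
            previous = grad_norms[step - 1]
            if math.isfinite(previous) and previous > 0 and norm / previous > ratio_threshold:
                flagged.append(step)
    return flagged

logged_norms = [0.8, 1.1, 0.9, 45.0, 2300.0, float("nan")]  # made-up values
print(find_gradient_spikes(logged_norms))
```

In practice the threshold would need tuning for the model at hand. A dashboard plot of the same norms on a logarithmic axis makes exponential growth easy to spot by eye.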

Fixing Exploding Gradients

There are various methods to address exploding gradients. Below are some best-practice methods that we can use.

Redesign the Network Model

In deep neural networks, exploding gradients can be addressed by redesigning the network to have fewer layers. There may also be some benefit in using a smaller batch size while training the network.

In recurrent neural networks, updating across fewer prior time steps during training, called truncated backpropagation through time, may reduce the exploding gradient problem.

Use of Long Short-Term Memory (LSTM)

In recurrent neural networks, exploding gradients can occur because of the inherent instability in the training of this type of network. For example, backpropagation through time essentially transforms the recurrent network into a deep multilayer perceptron neural network.

Exploding gradients can be reduced by using Long Short-Term Memory (LSTM) memory units, and perhaps related gated-type neuron structures. Adopting LSTM memory units is a newer best practice for recurrent neural networks for sequence prediction.

Use of Gradient Clipping

Exploding gradients can still occur in very deep multilayer perceptron networks with large batch sizes, and in LSTMs with very long input sequence lengths.

If exploding gradients are still occurring, we can check for and limit the size of the gradients during the training of the network. This is called gradient clipping.

Usage of Weight Regularization

Weight regularization is another method to try if exploding gradients are still occurring. It checks the size of the network weights and applies a penalty to the network’s loss function for large weight values.
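One common form of this penalty is L2 regularization, sketched below in plain Python. The helper names, the strength of 0.01, and the example weights are assumptions chosen for illustration; real frameworks apply the penalty to weight tensors rather than flat lists.

```python
def l2_penalty(weights, strength):
    """Sum of squared weights scaled by a regularization strength."""
    return strength * sum(w * w for w in weights)

def regularized_loss(data_loss, weights, strength=0.01):
    """Add the L2 weight penalty to the loss computed on the data."""
    return data_loss + l2_penalty(weights, strength)

small_weights = [0.5, -0.3, 0.2]
large_weights = [50.0, -30.0, 20.0]

# Same data loss, but large weights are penalized far more heavily.
print(regularized_loss(1.0, small_weights))
print(regularized_loss(1.0, large_weights))
```

Because the penalty grows with the square of each weight, the optimizer is pushed toward smaller weights, which in turn keeps the per-layer factors in the gradient smaller.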

Gradient Clipping in Keras

Keras supports gradient clipping on each optimization algorithm, with the same scheme applied to all layers in the model. Gradient clipping can be used with an optimization algorithm, such as stochastic gradient descent, by passing an additional argument when configuring the optimization algorithm.

We can use two types of gradient clipping.

  • Gradient norm scaling
  • Gradient value clipping

Gradient Norm Scaling

Gradient norm scaling involves changing the derivatives of the loss function to have a given vector norm when the L2 vector norm of the gradient vector exceeds a threshold value.


For example, we could specify a norm of 1.0. This means that if the vector norm for a gradient exceeds 1.0, the values in the vector are rescaled so that the norm of the vector equals 1.0. This can be used in Keras by specifying the clipnorm argument on the optimizer.


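# configure sgd with gradient norm clipping
# (assumes tensorflow.keras; recent releases use learning_rate in place of lr)
from tensorflow.keras.optimizers import SGD

opt = SGD(learning_rate=0.01, momentum=0.9, clipnorm=1.0)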
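To make the rescaling concrete, here is a minimal plain-Python sketch of norm scaling on a single gradient vector. It is an illustration of the idea only, not the Keras implementation; the function name clip_by_norm and the example vector are assumptions.

```python
import math

def clip_by_norm(gradient, max_norm):
    """Rescale the gradient so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)  # already within the threshold, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in gradient]

gradient = [3.0, 4.0]  # L2 norm is 5.0
clipped = clip_by_norm(gradient, max_norm=1.0)
print(clipped)
```

Every component is multiplied by the same factor, so the direction of the gradient is preserved and only its length changes.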

Gradient Value Clipping

Gradient value clipping involves clipping the derivatives of the loss function to a given value if a gradient value is less than a negative threshold or more than the positive threshold.


For example, we could specify a threshold of 0.5. This means that a gradient value less than -0.5 is set to -0.5, and a gradient value more than 0.5 is set to 0.5. This can be used in Keras by specifying the clipvalue argument on the optimizer.


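# configure sgd with gradient value clipping
# (assumes tensorflow.keras; recent releases use learning_rate in place of lr)
from tensorflow.keras.optimizers import SGD

opt = SGD(learning_rate=0.01, momentum=0.9, clipvalue=0.5)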
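The element-wise behavior can be sketched in plain Python as follows. This is an illustration of the idea rather than the Keras implementation; the function name clip_by_value and the example gradient are assumptions.

```python
def clip_by_value(gradient, clip_value):
    """Clamp each gradient component to the range [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in gradient]

gradient = [-2.0, -0.3, 0.1, 0.9]
clipped = clip_by_value(gradient, 0.5)
print(clipped)  # [-0.5, -0.3, 0.1, 0.5]
```

Unlike norm scaling, each component is clamped independently, so value clipping can change the direction of the gradient vector as well as its length.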
