Why do we use dropout in artificial neural networks?

This article explains what Dropout is, how it works, and why we use it in artificial neural networks.

What is Dropout in artificial neural networks?

Dropout is a technique that prevents overfitting in artificial neural networks by randomly dropping units during training. As a result, the trained model behaves like an ensemble of multiple neural networks that share weights. At test time, the average prediction of those ensembled networks is approximated by scaling the weights in every layer to get the final model prediction.

Because some parts of the network are switched off during training, we can think of Dropout as training multiple neural networks and averaging their predictions during testing.

How does Dropout work during training?

When we use Dropout, we specify the probability \(p\) of keeping each unit's output during a single mini-batch of training. The implementation calculates an element-wise product (denoted \(*\) in the equations below) of the vector of layer outputs and a vector of independent Bernoulli random variables, each of which is \(1\) with probability \(p\).

Assuming that:

  • \(l \in \{1, ..., L\}\) is the index of hidden layers in the neural network
  • \(l = 0\) denotes the input layer index
  • \(z^{(l)}\) is the input vector in the layer \(l\)
  • \(y^{(l)}\) is the output vector of the layer \(l\), so \(y^{(0)}\) is the input layer
  • \(W^{(l)}\) is the weight matrix at layer \(l\)
  • \(b^{(l)}\) is a vector of biases at layer \(l\)
  • \(r^{(l)}\) is a vector of independent Bernoulli random variables with probability \(p\) of being \(1\) and probability \(1 - p\) of being \(0\)
  • \(f\) denotes an activation function

we can calculate the output of a layer like this:

\[\begin{align} \widetilde{y}^{(l)} &= r^{(l)} * y^{(l)} \\ z^{(l+1)} &= W^{(l+1)}\widetilde{y}^{(l)}+b^{(l+1)} \\ y^{(l+1)} &= f(z^{(l+1)}) \end{align}\]

When we write down the equations that describe the neural network, we see that the dropped-out neurons don't disappear from the network; their outputs are set to zero.
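The equations above can be sketched in a few lines of NumPy. The layer sizes, the weight values, and the choice of ReLU as \(f\) are illustrative assumptions, not part of the method itself:

```python
import numpy as np

rng = np.random.default_rng(42)

p = 0.8                                   # probability of keeping a unit
y = np.array([0.5, -1.2, 3.0, 0.7])       # y^(l): output of layer l
W = rng.standard_normal((2, 4))           # W^(l+1): weight matrix (example sizes)
b = np.zeros(2)                           # b^(l+1): bias vector

r = rng.binomial(1, p, size=y.shape)      # r^(l): Bernoulli mask, 1 with probability p
y_tilde = r * y                           # element-wise product: dropped units output 0
z = W @ y_tilde + b                       # z^(l+1) = W^(l+1) * y~^(l) + b^(l+1)
y_next = np.maximum(z, 0)                 # y^(l+1) = f(z^(l+1)), with ReLU as f
```

Note that `y_tilde` still has the same length as `y`: the dropped units are still there, they just contribute zeros to the next layer's pre-activation.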

Similarly, during backpropagation the gradient flows only through the units whose mask value \(r\) is \(1\), so the switched-off units are not updated by that mini-batch.
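A tiny sketch of the chain rule through the mask makes this visible. The gradient values below are made up for illustration:

```python
import numpy as np

y = np.array([0.5, -1.2, 3.0])
r = np.array([1, 0, 1])                  # the middle unit was dropped this mini-batch
y_tilde = r * y

# Gradient of the loss w.r.t. y_tilde, as received from the layers above
# (illustrative values):
grad_y_tilde = np.array([0.3, -0.7, 1.1])

# Chain rule through y_tilde = r * y: the same mask multiplies the gradient,
# so the dropped unit receives a zero gradient.
grad_y = grad_y_tilde * r
```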

How does Dropout work during testing?

When the model is ready and we want to use it (or test it), we no longer switch off network units. Instead, the outgoing weights of each unit are multiplied by \(p\):

\[W^{(l)}_{test} = pW^{(l)}\]

This operation approximates the average of the predictions produced by all sub-networks generated by switching off parts of the neural network.
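We can check this numerically for a single layer: averaging the pre-activations over many random masks converges to the pre-activation computed once with the weights scaled by \(p\). The layer sizes and values below are made-up assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8
y = np.array([0.5, -1.2, 3.0, 0.7])       # output of the previous layer
W = rng.standard_normal((2, 4))           # example weight matrix

# Monte Carlo estimate: average the pre-activations over many random masks.
masked_avg = np.mean(
    [W @ (rng.binomial(1, p, size=y.shape) * y) for _ in range(100_000)],
    axis=0)

# Test-time rule: scale the weights by p once, no masks needed.
scaled = (p * W) @ y
```

The two results agree up to Monte Carlo noise because \(\mathbb{E}[r * y] = p\,y\). For the full network with nonlinear activations the weight scaling is only an approximation of the ensemble average, which is why it is described as approximate above.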

Dropout in TensorFlow

When we use TensorFlow, adding dropout requires putting the Dropout layer after the layer whose output we want to drop. Additionally, we can use keras-tuner to tune the dropout rate:

import tensorflow as tf
import kerastuner as kt

def build_model(hp):
    # The input shape and layer sizes are only an example.
    inputs = tf.keras.Input(shape=(10,))
    hidden = tf.keras.layers.Dense(32, activation='relu')(inputs)
    # Note: the Keras "rate" is the fraction of units to drop,
    # not the keep probability p from the equations above.
    dropped = tf.keras.layers.Dropout(
        hp.Float('dropout', 0, 0.8, step=0.1, default=0.5))(hidden)
    outputs = tf.keras.layers.Dense(1)(dropped)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='mse')
    return model

tuner = kt.Hyperband(build_model, objective='val_loss', max_epochs=10)

You can find more about using keras-tuner in this article: Using keras-tuner to tune hyperparameters of a TensorFlow model.


Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group
