Why do we use dropout in artificial neural networks?

This article explains what Dropout is, how it works, and why we use it in artificial neural networks.

What is Dropout in artificial neural networks?

Dropout is a technique that reduces overfitting in artificial neural networks by randomly dropping units during training. As a result, the trained model behaves like an ensemble of many smaller neural networks that share their weights. At test time, a single pass through the full network with rescaled weights approximates the averaged prediction of that ensemble.

Because some parts of the network are switched off during training, we can think of Dropout as training multiple neural networks and averaging their predictions during testing.

How does Dropout work during training?

When we use Dropout, we specify the probability \(p\) of keeping each of the layer’s input values during a single training step. The implementation calculates an element-wise product (marked as \(*\) in the equations below) of the vector of inputs into a layer and a vector of independent Bernoulli random variables, each of which has probability \(p\) of being \(1\).

Assuming that:

  • \(l \in \{1, ..., L\}\) is the index of hidden layers in the neural network
  • \(l = 0\) denotes the index of the input layer
  • \(z^{(l)}\) is the vector of inputs into layer \(l\)
  • \(y^{(l)}\) is the output vector of layer \(l\), so \(y^{(0)}\) is the network input
  • \(W^{(l)}\) is the matrix of weights at layer \(l\)
  • \(b^{(l)}\) is a vector of biases at layer \(l\)
  • \(r^{(l)}\) is a vector of independent Bernoulli random variables with probability \(p\) of being \(1\) and probability \(1 - p\) of being \(0\)
  • \(f\) denotes an activation function

we can calculate the output of a layer like this:

\[\begin{align} \widetilde{y}^{(l)} &= r^{(l)} * y^{(l)} \\ z^{(l+1)} &= W^{(l+1)}\widetilde{y}^{(l)}+b^{(l+1)} \\ y^{(l+1)} &= f(z^{(l+1)}) \end{align}\]

When we write down the equations that describe the neural network, we see that the dropped-out neurons don’t disappear from the network. Instead, their outputs are set to zero, so they contribute nothing to the next layer.
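The three equations can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name `dropout_forward`, the ReLU activation, and all numeric values are assumptions made for the example.

```python
import random

def dropout_forward(y, W_next, b_next, p, activation, rng):
    """One training-time step: mask y, then compute the next layer's output."""
    # r holds independent Bernoulli(p) variables: 1 keeps a unit, 0 drops it.
    r = [1 if rng.random() < p else 0 for _ in y]
    # Element-wise product: dropped units contribute zero to the next layer.
    y_tilde = [r_j * y_j for r_j, y_j in zip(r, y)]
    # z = W * y_tilde + b, computed row by row.
    z_next = [sum(w * v for w, v in zip(row, y_tilde)) + b
              for row, b in zip(W_next, b_next)]
    y_next = [activation(z) for z in z_next]
    return r, y_tilde, y_next

def relu(z):
    return max(0.0, z)

# Hypothetical toy values, chosen only for illustration.
rng = random.Random(0)
y = [0.5, -1.2, 3.0, 0.7]
W_next = [[0.1, 0.2, 0.3, 0.4], [-0.5, 0.5, -0.5, 0.5]]
b_next = [0.0, 0.1]

r, y_tilde, y_next = dropout_forward(y, W_next, b_next, p=0.5,
                                     activation=relu, rng=rng)
print(r, y_tilde, y_next)
```

Note that the layer keeps its full size: the mask only zeroes some entries of the vector passed on to the next layer.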

Similarly, only the units whose mask value \(r^{(l)}_j\) equals \(1\) are affected during backpropagation, so the switched-off units ignore the current input.
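The effect of the mask on the gradients follows from the chain rule applied to the equations above. The sketch below assumes a scalar loss \(\mathcal{L}\), a symbol that does not appear in the original equations:

\[\begin{align} \frac{\partial \mathcal{L}}{\partial y^{(l)}} &= r^{(l)} * \frac{\partial \mathcal{L}}{\partial \widetilde{y}^{(l)}} \\ \frac{\partial \mathcal{L}}{\partial W^{(l+1)}} &= \frac{\partial \mathcal{L}}{\partial z^{(l+1)}} \left(\widetilde{y}^{(l)}\right)^{T} \end{align}\]

When \(r^{(l)}_j = 0\), the first line gives a zero gradient for unit \(j\), and the second line gives a zero gradient for column \(j\) of \(W^{(l+1)}\). Neither the incoming nor the outgoing weights of that unit receive a gradient from that input.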

How does dropout work during testing?

When the model is ready, and we want to use it (or test it), we no longer switch off network units. Instead, the outgoing weights of each unit are multiplied by \(p\), the probability that the unit was kept during training:

\[W^{(l)}_{test} = pW^{(l)}\]

This operation approximates averaging the predictions produced by all sub-networks generated by switching off parts of the neural network.
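For a single layer before the activation function, the match is exact, which can be checked by enumerating every possible mask. The sketch below does that for a toy layer. The helper names and all numbers are hypothetical, and once a non-linear activation is applied the scaled weights give only an approximation of the average.

```python
from itertools import product

def layer_preactivation(weights, inputs, bias):
    """Compute z = W y + b for one layer, using plain lists."""
    return [sum(w * y for w, y in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def expected_preactivation(weights, inputs, bias, p):
    """Average z over every possible dropout mask, weighted by its probability."""
    expected = [0.0] * len(weights)
    for mask in product([0, 1], repeat=len(inputs)):
        kept = sum(mask)
        # Probability of this exact mask: each unit is kept independently with p.
        probability = p ** kept * (1 - p) ** (len(inputs) - kept)
        masked = [r * y for r, y in zip(mask, inputs)]
        z = layer_preactivation(weights, masked, bias)
        expected = [e + probability * z_i for e, z_i in zip(expected, z)]
    return expected

# Hypothetical toy values, chosen only for illustration.
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
b = [0.1, -0.2]
y = [1.0, 2.0, 3.0]
p = 0.8

averaged = expected_preactivation(W, y, b, p)

# Test-time rule: use all units, but multiply the weights by p.
scaled_W = [[p * w for w in row] for row in W]
test_time = layer_preactivation(scaled_W, y, b)

print(averaged)
print(test_time)
```

Both printed vectors agree up to floating-point rounding, because each masked input has the expected value \(p\) times its original value.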

Dropout in TensorFlow

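When we use TensorFlow, adding dropout to a layer requires putting the Dropout layer after the layer definition. Note that the Keras Dropout layer takes the fraction of units to drop, which corresponds to \(1 - p\) in the notation above. Additionally, we can use kerastuner to tune the dropout rate parameter: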

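import tensorflow as tf
import kerastuner as kt  # newer releases of the package are imported as keras_tuner

def build_model(hp):
    # The input shape, layer sizes, and loss are placeholders; the original snippet omits them.
    inputs = tf.keras.Input(shape=(20,))
    previous_layer = tf.keras.layers.Dense(64, activation='relu')(inputs)
    # Keras takes the fraction of units to drop, so this tunes 1 - p.
    input_to_the_next_layer = tf.keras.layers.Dropout(
      hp.Float('dropout', 0, 0.8, step=0.1, default=0.5))(previous_layer)
    model = tf.keras.Model(inputs, tf.keras.layers.Dense(1)(input_to_the_next_layer))
    model.compile(optimizer='adam', loss='mse')
    return model

# The original call is truncated here; these arguments are illustrative assumptions.
tuner = kt.Hyperband(build_model, objective='val_loss', max_epochs=10)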

You can find more about using kerastuner in this article: Using keras-tuner to tune hyperparameters of a TensorFlow model.
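One detail worth knowing when comparing the formulas above with Keras: its Dropout layer rescales the kept values by \(1 / (1 - rate)\) during training and passes inputs through unchanged at inference, so no weight scaling is needed at test time. The plain-Python sketch below imitates that behavior. The function `inverted_dropout` is a hypothetical illustration, not the Keras source code.

```python
import random

def inverted_dropout(values, rate, training, rng):
    """Sketch of inverted dropout: drop with probability `rate`, rescale the rest."""
    if not training or rate == 0:
        # Inference: inputs pass through unchanged.
        return list(values)
    keep_probability = 1.0 - rate
    # Kept values are scaled up so that their expected value stays the same.
    return [v / keep_probability if rng.random() < keep_probability else 0.0
            for v in values]

# Hypothetical toy values, chosen only for illustration.
rng = random.Random(42)
activations = [1.0, 2.0, 3.0, 4.0]

train_out = inverted_dropout(activations, rate=0.5, training=True, rng=rng)
test_out = inverted_dropout(activations, rate=0.5, training=False, rng=rng)

print(train_out)
print(test_out)
```

Scaling during training and scaling the weights at test time lead to the same expected activations. The inverted variant simply moves the correction into the training step.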

Bartosz Mikulski

  • Data/MLOps engineer by day
  • DevRel/copywriter by night
  • Python and data engineering trainer
  • Conference speaker
  • Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
  • Twitter: @mikulskibartosz
