In this article, I summarize what I have learned from Andrew Ng’s course about dealing with underfitting and overfitting in deep learning.

Ideal model

Before we start, we must estimate the best performance a deep learning model can realistically achieve on the task. A practical estimate is the accuracy achieved by human experts working on the same test dataset.

We need that information because it lets us set reasonable goals and achievable expectations. After all, if human experts make mistakes in 5% of the cases and we train the model on a dataset prepared by those experts, we can’t expect 100% accuracy, and even 95% accuracy may be out of reach.


Underfitting

The first kind of problem is underfitting. It occurs when the model’s training set error is significantly larger than the expected error of an ideal model. In machine learning, when the model performs poorly even on the training set, we say that the model has high bias.

According to Andrew Ng, the best methods of dealing with an underfitting model are trying a bigger neural network (adding new layers or increasing the number of neurons in existing layers) and training the model longer.

Why does he give such advice? An underfitting neural network may not be powerful enough to capture the patterns in the training dataset. In this case, adding more layers or increasing the number of neurons may help overcome the problem.
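To make the capacity argument concrete, here is a minimal sketch in plain Python. It is not a neural network and it is not taken from the course; it is only an analogy I am assuming carries over: a model with one parameter (a constant) cannot fit a linear pattern even on its own training data, while a model with two parameters (slope and intercept) can. The data and all names are made up for illustration.

```python
def mean_squared_error(predictions, targets):
    """Average squared difference between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Toy training set following the pattern y = 2x + 1 (illustrative data).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Low-capacity model: a single parameter, the best constant prediction.
constant_error = mean_squared_error([mean_y] * len(xs), ys)

# Higher-capacity model: slope and intercept from closed-form least squares.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
linear_error = mean_squared_error([slope * x + intercept for x in xs], ys)

print(constant_error, linear_error)  # prints 8.0 0.0
```

The constant model keeps a large training error no matter how it is fitted, which is the same symptom an undersized network shows.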

Increasing the training time is advised because the underperforming model may not have found the optimal values of its parameters yet.
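The effect of training time can be sketched with plain gradient descent on a one-parameter model. This is a toy example under my own assumptions (the data, the learning rate, and the function name `train` are all hypothetical), but it shows the same model ending with a high training error after a few steps and a negligible one after many.

```python
def train(steps, learning_rate=0.01):
    """Fit y = w * x with gradient descent and return (w, training_loss)."""
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [3.0 * x for x in xs]  # the true weight is 3.0
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return w, loss


w_short, loss_short = train(steps=5)
w_long, loss_long = train(steps=200)
print(f"5 steps:   w={w_short:.3f}, loss={loss_short:.3f}")
print(f"200 steps: w={w_long:.3f}, loss={loss_long:.3g}")
```

After 5 steps the weight is still far from 3.0, so the model looks like it underfits even though its capacity is sufficient.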


Overfitting

Overfitting occurs when the model performs well on the training set but cannot achieve good accuracy on the test dataset. This kind of problem is called “high variance,” and it usually means that the model cannot generalize the insights from the training dataset.
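The two definitions can be turned into a simple check. The sketch below compares the training error with the expert-level error (the gap that signals high bias) and the test error with the training error (the gap that signals high variance). The function name, the parameter names, and the 2% threshold are my own assumptions, not values from the course.

```python
def diagnose(human_error, train_error, test_error, threshold=0.02):
    """Return a list of likely problems based on the error gaps.

    All errors are fractions between 0 and 1. The threshold is an
    arbitrary, illustrative cut-off for a "significant" gap.
    """
    bias_gap = train_error - human_error      # training set vs. ideal model
    variance_gap = test_error - train_error   # test set vs. training set

    problems = []
    if bias_gap > threshold:
        problems.append("underfitting (high bias)")
    if variance_gap > threshold:
        problems.append("overfitting (high variance)")
    return problems or ["no obvious problem"]


print(diagnose(human_error=0.05, train_error=0.15, test_error=0.16))
print(diagnose(human_error=0.05, train_error=0.06, test_error=0.20))
```

A model can show both problems at once, which is why the function returns a list instead of a single label.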

Andrew Ng suggests that the best solutions to overfitting are getting more data and using regularization.

More data is suggested because the model may not have seen enough training examples to learn the patterns properly, so adding new observations to the training dataset may increase the chance of getting a better model.

On the other hand, the neural network may be too complex, and because of that it handles only the training set examples adequately. If it were a student writing an exam at school, we would say that they memorized the homework but did not learn the concept. The same may happen to a machine learning model.

Because of that, we use regularization techniques to force the model to generalize the insights. In deep learning, we can choose from a few such techniques:

  • L1 and L2 regularization modify the cost function so that it penalizes large weights, which favors simpler models.

  • Dropout randomly disables units during training, which prevents the model from relying too much on any single input because that input may disappear.

  • Early stopping ends training before the model starts “memorizing” the training examples.

  • Data augmentation makes the training dataset more diverse.
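Three of the techniques above are short enough to sketch in plain Python. This is a simplified illustration under my own assumptions, not the way any deep learning library implements them: the function names are hypothetical, the L2 penalty uses the lambda / (2 * m) scaling convention, dropout is shown in its "inverted" form (survivors are rescaled), and early stopping uses a `patience` counter on validation losses. L1 regularization and data augmentation are not shown.

```python
import random


def l2_regularized_cost(data_loss, weights, lam, num_examples):
    """Add the L2 penalty (lam / (2 * m)) * sum(w^2) to the plain loss."""
    penalty = lam / (2 * num_examples) * sum(w * w for w in weights)
    return data_loss + penalty


def inverted_dropout(activations, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob.

    Surviving activations are divided by keep_prob so that the expected
    value of the layer output stays the same.
    """
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


def best_epoch_with_early_stopping(validation_losses, patience):
    """Return the epoch with the lowest validation loss seen before stopping.

    Training stops once `patience` epochs pass without an improvement.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch


rng = random.Random(0)  # fixed seed so the dropout mask is reproducible
print(l2_regularized_cost(1.0, [1.0, 2.0], lam=0.5, num_examples=5))
print(inverted_dropout([1.0, 1.0, 1.0, 1.0], keep_prob=0.8, rng=rng))
print(best_epoch_with_early_stopping([1.0, 0.8, 0.6, 0.65, 0.7, 0.5], patience=2))
```

In the last call the loop stops at epoch 4 and never sees the lower loss at epoch 5, which is the usual trade-off of early stopping: a larger `patience` costs more training time but is less likely to stop too soon.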

Andrew Ng also makes an important remark about using dropout. He says that we should first focus on finding proper hyperparameters and not bother with dropout. Once we have good hyperparameters, we may start adding and tuning dropout. The two tasks can be done separately, and to avoid complexity we should not try to do them at the same time.
