Which hyperparameters of a deep learning model are important and how to find them

This is another of my blog posts based on the notes I made while watching Andrew Ng’s deep learning course.

According to Andrew Ng, the critical hyperparameters (in the order of their importance) are:

  • learning rate

  • momentum (if you are using gradient descent with momentum), mini-batch size, and the number of hidden units in layers

  • the number of layers and learning rate decay

When we look at the list, we may notice that Andrew Ng considers a hyperparameter more significant when it has a stronger influence on the weights produced during training.

Andrew Ng recommends replacing grid search with a random search of hyperparameters. This advice is based on the observation that it is better to try a more diverse set of values than to keep an important parameter (for example, the learning rate) fixed for a while in order to tweak less crucial parameters (like the mini-batch size or the number of layers).
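The difference can be illustrated with a minimal sketch. The ranges and candidate values below are assumptions chosen only for illustration; they do not come from the course. With the same budget of nine trials, the grid tries only three distinct learning rates, while random search tries nine.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Grid search: 3 learning rates x 3 mini-batch sizes = 9 trials.
# (Illustrative values, not recommendations.)
grid_learning_rates = [0.001, 0.01, 0.1]
grid_batch_sizes = [32, 64, 128]
grid_trials = list(itertools.product(grid_learning_rates, grid_batch_sizes))

# Random search: also 9 trials, but every trial draws a fresh learning rate.
# The exponent is sampled uniformly, so the values are spread on a log scale.
random_trials = [
    (10 ** rng.uniform(-3, -1), int(rng.choice([32, 64, 128])))
    for _ in range(9)
]

# Count how many different learning rates each strategy actually tested.
distinct_grid_lr = len({lr for lr, _ in grid_trials})
distinct_random_lr = len({lr for lr, _ in random_trials})
print(distinct_grid_lr, distinct_random_lr)
```

Because the learning rate is the most important hyperparameter on the list, testing nine values of it instead of three is where random search gets its advantage.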

Coarse to fine hyperparameter search

This approach consists of two steps. First, we look for the best hyperparameters using random search. Then, we repeat the random search, but limit the range to values close to the best hyperparameters from the previous step.

We do it because it allows us to quickly find hyperparameters that are good enough and then continue tweaking them.
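The two steps can be sketched as follows. This is only an illustration under stated assumptions: validation_loss is a hypothetical stand-in for training a model and measuring its validation loss, and the search ranges and the number of trials are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def validation_loss(learning_rate):
    # Hypothetical stand-in for "train the model and evaluate it".
    # This toy function has its minimum at learning_rate = 0.01.
    return (np.log10(learning_rate) + 2.0) ** 2

def random_search(low_exp, high_exp, trials):
    # Draw exponents uniformly from [low_exp, high_exp) and return
    # the candidate learning rate with the lowest loss.
    candidates = 10 ** rng.uniform(low_exp, high_exp, size=trials)
    losses = validation_loss(candidates)
    return candidates[np.argmin(losses)]

# Coarse step: search a wide range, from 1e-5 to 1.
coarse_best = random_search(-5, 0, trials=20)

# Fine step: search half an order of magnitude around the coarse result.
center = np.log10(coarse_best)
fine_best = random_search(center - 0.5, center + 0.5, trials=20)

# The fine step is random too, so keep whichever candidate is better.
best = min(coarse_best, fine_best, key=validation_loss)
print(coarse_best, fine_best, best)
```

In a real project, every call to validation_loss would be a full training run, which is why narrowing the range before the second round of trials matters.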

Use log scale for searching hyperparameters

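Andrew Ng's final piece of advice is to use a logarithmic scale when randomly choosing hyperparameters. We do it because uniformly distributed random numbers on a linear scale cluster in the largest order of magnitude. If we draw values uniformly between 0.0001 and 1, about 90% of them land between 0.1 and 1, so the small values are barely explored. Sampling the exponent uniformly spreads the candidates evenly across orders of magnitude, which gives us a more diverse set of values.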

For example, if we want values between 0.0001 and 1, we can randomly generate the exponent and use a constant base:

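import numpy as np
n = -4 * np.random.rand(4)  # four exponents drawn uniformly from (-4, 0]
parameter = 10**n  # four values between 0.0001 and 1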
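To see the effect, we can compare both sampling methods on the same range. This is a minimal sketch; the sample count and the 0.1 threshold are arbitrary choices made for the comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = 10_000

# Linear scale: draw the values directly from [0.0001, 1).
linear = rng.uniform(0.0001, 1.0, size=samples)

# Log scale: draw the exponent from [-4, 0) and raise 10 to that power.
log_scale = 10 ** rng.uniform(-4, 0, size=samples)

# Share of samples below 0.1, i.e. in the three smallest orders of magnitude.
linear_share = np.mean(linear < 0.1)
log_share = np.mean(log_scale < 0.1)
print(linear_share, log_share)
```

On the linear scale, only about 10% of the samples fall below 0.1. On the log scale, about 75% do, because each of the four orders of magnitude receives roughly a quarter of the samples.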

Bartosz Mikulski
Bartosz Mikulski * MLOps Engineer / data engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group
