The optimal learning rate during fine-tuning of an artificial neural network

I used a pre-learned resnet50 network. I had just finished transfer learning (training only the last layer of a pre-learned neural network on a new dataset) and the only thing I needed to do was fine-tuning the network so it could fit better to the categories of my dataset.

Fortunately, I decided to use fast.ai, so the task was relatively easy. Just set the learning rates, give the network new examples and let it learn. It turned out that the learning rate was a little bit problematic.

1
2
learn.unfreeze()
learn.fit_one_cycle(number_of_epochs, max_lr=[first_layers, middle_layers, last_layers])

Usually, when we train a neural network, we determine the optimal learning rate using the method described in Leslie N. Smith’s paper: “Cyclical Learning Rates for Training Neural Networks.”

That was precisely what I have done. I used the lr_find function available in fast.ai to get the chart of loss values and their corresponding learning rates.

Remember that 1e-02 = 1 * 10^(-2) and 1e-01 = 1 * 10^(-1)
Remember that 1e-02 = 1 * 10^(-2) and 1e-01 = 1 * 10^(-1)

The optimal learning rate is supposed to be the value that gives us the fastest decrease in loss. It seemed that something between 1e-2 and 1e-1 would do the job.

To be sure I plotted a chart of loss derivatives with respect to the learning rate using Pavel Surmenok’s code.

Now, I was sure. The optimal value was right in between of 1e-2 and 1e-1, so I set the learning rate of the last layers to 0.055. For the first and middle layers, I set 1e-5 and 1e-4 respectively, because I did not want to change them a lot. It was wrong. The results were terrible. In every epoch of the training, both training loss and validation loss were getting worse and worse.

What have I forgotten?

That is the method of finding the optimal learning rate for training a new network. The task I was doing was just fine-tuning. I should not use the value that change the loss a lot. My optimal values were the ones that make only small differences. Let’s look at the derivatives chart again. What makes the smallest difference? The loss values between 1e-6 and 1e-4!

Why?

Remember that I used a pre-learned model, trained it on my dataset and them just had to unfreeze all layers and do some fine tuning. At this stage, it is too late to make massive changes. It is the last tweak. Like adding salt to a dish while cooking. Just like the “Salt bae” ;)

Did you enjoy reading this article?
Would you like to learn more about software craft in data engineering and MLOps?

Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!

Newsletter

Do you enjoy reading my articles?
Subscribe to the newsletter if you don't want to miss the new content, business offers, and free training materials.

Bartosz Mikulski

Bartosz Mikulski

  • Data/MLOps engineer by day
  • DevRel/copywriter by night
  • Python and data engineering trainer
  • Conference speaker
  • Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
  • Twitter: @mikulskibartosz
Newsletter

Do you enjoy reading my articles?
Subscribe to the newsletter if you don't want to miss the new content, business offers, and free training materials.