Understanding the softmax activation function

The softmax activation function is used in neural networks to build a multi-class classifier: a model that assigns an instance to exactly one class when there are more than two possible classes.

In this article, I am going to explain why we use softmax and how it works.


We use softmax as the output function of the last layer of a neural network (if the network has n layers, the n-th layer is the softmax function). This matters because the purpose of the last layer is to turn the scores produced by the network into values that humans can interpret.

To understand the softmax function, we must look at the output of the (n-1)th layer. In this layer, the values get multiplied by some weights, passed through an activation function, and aggregated into a vector which contains one value per class of the model.

For the sake of this example, we may interpret the scores produced by layer n-1 as the number of votes in a weighted voting system in which some neurons are more important than others.
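As a rough illustration of how the (n-1)th layer could produce one score per class, here is a minimal NumPy sketch. The feature values, weights, and biases are made up for this example, and the activation function is assumed to be the identity to keep the code short.

```python
import numpy as np

# Hypothetical values arriving at the (n-1)th layer for one instance (4 inputs).
features = np.array([0.5, -1.2, 3.0, 0.7])

# Hypothetical weights: one row per class (3 classes), one column per input.
weights = np.array([
    [0.2, -0.5, 1.0, 0.3],
    [1.5, 0.1, -0.3, 0.8],
    [-0.7, 0.4, 0.6, -0.2],
])
biases = np.array([0.1, 0.0, -0.1])

# Multiply the inputs by the weights and aggregate them into one score per class.
# These scores are the "weighted votes" that softmax receives.
scores = weights @ features + biases

print(scores.shape)  # (3,)
```

Inputs with larger weights contribute more to a class score, which is the "some neurons are more important than others" part of the voting analogy.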

Arg max

Here comes the tricky part. The softmax function is, in fact, a smooth ("soft") version of the arg max function. Arg max does not return the largest value from the input, but the position of the largest value.
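The difference between "the largest value" and "the position of the largest value" can be seen in a short NumPy sketch. The input vector is a made-up example.

```python
import numpy as np

# Hypothetical scores, one per class.
scores = np.array([2.0, 7.0, 4.0])

largest_value = np.max(scores)        # the value itself: 7.0
largest_position = np.argmax(scores)  # its zero-based position: 1

# A "hard" arg max written as a vector: 1 at the winning position, 0 elsewhere.
one_hot = np.zeros(len(scores))
one_hot[largest_position] = 1.0

print(largest_value, largest_position, one_hot)
```

The one-hot vector is the all-or-nothing answer. Softmax produces a vector of the same shape, but with values between 0 and 1 instead of a single hard 1.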

We interpret the result of the softmax function as the probability of the class, so the softmax function works in the following way:

Given a vector of numbers (the scores from the (n-1)th layer), it returns, for every position i, the probability that the largest value is the i-th element of the vector. For example, if I have an input:

X = [13, 31, 5]

and pass it to the softmax function, I am going to get:

array([1.52299795e-08, 9.99999985e-01, 5.10908895e-12])

That is a 99.9999985% probability that the largest value in the array is in the second position.
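The article does not show which softmax implementation produced these numbers, so the following is only a sketch of the standard definition: exponentiate every score and divide by the sum of the exponentials. The function name `softmax` and the max-subtraction step (a common trick to avoid overflow) are choices made for this example.

```python
import numpy as np

def softmax(x):
    """Exponentiate each score and normalize so the results sum to 1."""
    x = np.asarray(x, dtype=float)
    # Subtracting the maximum leaves the result unchanged mathematically
    # but keeps np.exp from overflowing on large scores.
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

probabilities = softmax([13, 31, 5])

print(probabilities)
print(probabilities.sum())
```

With this input, the second element dominates the output, and the three values add up to 1 (up to floating-point rounding), which is what lets us read them as probabilities.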

Why not 100%?

That may look strange. I mean, we see what the largest value is. Why didn’t it return 100%?

To understand it, we must remember that neural networks are universal function approximators. We can build a neural network that approximates the value of any mathematical function, but that is just an approximation, not an exact result. We use softmax to embrace that uncertainty and turn it into a probability interpretable by people.
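There is also a purely mathematical reason, which follows from the standard definition of softmax (the formula itself is not given in this article; K denotes the number of classes and x the vector of scores):

```latex
\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}},
\qquad
0 < \mathrm{softmax}(x)_i < 1 \quad \text{for } K \ge 2
```

Every exponential is strictly positive, so with at least two classes the denominator is always larger than the numerator. No class can receive exactly 100%, and no class can receive exactly 0%.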

If you ever see exactly 1 as the result of the softmax function, it is most likely caused by the limitations of floating-point arithmetic: the exact value is so close to 1 that it cannot be represented in computer memory and gets rounded.

softmax([0, 100, 0])
# array([3.72007598e-44, 1.00000000e+00, 3.72007598e-44])
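To see why the middle value is displayed as exactly 1, here is a small sketch that assumes the result is stored as 64-bit floats (NumPy's default). The variable names are made up for this example.

```python
import numpy as np

# The two small probabilities from the output above.
tiny = 3.72007598e-44

# Mathematically, the middle probability is 1 - 2 * tiny, which is below 1.
# float64 cannot represent a difference that small next to 1, so it rounds to 1.0.
middle = 1.0 - 2 * tiny
print(middle == 1.0)  # True

# Machine epsilon: the gap between 1.0 and the next larger float64.
# Anything far smaller than this is lost when added to or subtracted from 1.0.
print(np.finfo(np.float64).eps)
```

The small values themselves are still representable because floating-point numbers are much denser near 0 than near 1.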


Bartosz Mikulski


