Understanding the softmax activation function

The softmax activation function is used in neural networks to build multi-class classifiers, that is, models that assign an instance to exactly one class when there are more than two possible classes.

In this article, I am going to explain why we use softmax and how it works.


We use softmax as the output function of the last layer of a neural network (if the network has n layers, the n-th layer is the softmax function). This matters because the purpose of the last layer is to turn the scores produced by the network into values that humans can interpret.

To understand the softmax function, we must look at the output of the (n-1)th layer. In this layer, the values get multiplied by some weights, passed through an activation function, and aggregated into a vector that contains one value for every class of the model.

For the sake of this example, we may interpret the scores produced by layer n-1 as the numbers of votes in a weighted voting system in which some neurons are more important than others.
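
The weighted voting idea can be sketched in a few lines of NumPy. This is only an illustration: the input values, the weight matrix, and the choice of ReLU as the activation function are assumptions made up for this example, not values from a real network.

```python
import numpy as np

# Hypothetical values: 4 inputs arriving at layer n-1 and 3 output classes.
inputs = np.array([0.5, -1.2, 3.0, 0.7])
weights = np.array([
    [0.2, -0.5, 1.0, 0.3],   # "voting weights" for class 0
    [1.5, 0.1, 2.0, -0.4],   # "voting weights" for class 1
    [-0.3, 0.8, 0.1, 0.9],   # "voting weights" for class 2
])

# Multiply by the weights, then apply an activation (ReLU, chosen as an example).
scores = np.maximum(0.0, weights @ inputs)
print(scores)  # a vector with one score per class
```

Each row of the hypothetical weight matrix decides how much every input "votes" for one class, and the resulting vector is what the softmax layer receives.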


Arg max

Here comes the tricky part. The softmax function is, in fact, a smooth ("soft") version of the arg max function. Arg max does not return the largest value from the input, but the position of the largest value.
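
To make the difference between the largest value and its position concrete, here is a small sketch using NumPy. The variable names are mine, and the one-hot vector is just one possible way to write down the answer of a "hard" arg max.

```python
import numpy as np

scores = np.array([13, 31, 5])

largest_value = np.max(scores)        # the largest value itself
largest_position = np.argmax(scores)  # zero-based position of the largest value

# A "hard" arg max written as a vector: 1 at the winning position, 0 elsewhere.
one_hot = np.zeros(len(scores))
one_hot[largest_position] = 1.0

print(largest_value, largest_position, one_hot)  # 31 1 [0. 1. 0.]
```

Softmax returns a vector of the same shape, but with values between 0 and 1 instead of a hard 0 or 1.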

We interpret the result of the softmax function as the probability of each class, so the softmax function works in the following way:

Given a vector of numbers (the scores from the (n-1)th layer), it returns, for every position i, the probability that the largest value is the i-th element of the vector. For example, if I have an input:

X = [13, 31, 5]

and pass it to the softmax function, I am going to get:

array([1.52299795e-08, 9.99999985e-01, 5.10908895e-12])

That is a 99.9999985% probability that the largest value in the array is in the second position.
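
For reference, the standard definition of softmax is softmax(x)_i = exp(x_i) / (exp(x_1) + ... + exp(x_k)), where k is the number of classes. Below is a minimal sketch of that formula in NumPy. It is not necessarily the implementation used to produce the numbers above, and subtracting the maximum before exponentiating is a common numerical-stability trick I added, not something the formula requires.

```python
import numpy as np

def softmax(x):
    """Softmax of a 1-D sequence of scores (a sketch of the standard formula)."""
    x = np.asarray(x, dtype=float)
    # Subtracting the maximum does not change the result but avoids overflow in exp.
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

probabilities = softmax([13, 31, 5])
print(probabilities)        # approximately [1.52e-08, 1.00e+00, 5.11e-12]
print(probabilities.sum())  # 1.0 up to floating-point rounding
```

Because every output is divided by the same sum, the values always add up to 1, which is what lets us read them as probabilities.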

Why not 100%?

That may look strange. I mean, we see what the largest value is. Why didn’t it return 100%?

To understand it, we must remember that neural networks are universal function approximators. We can build a neural network that approximates a very wide class of mathematical functions, but that is just an approximation, not an exact result. We use softmax to embrace that uncertainty and turn it into a probability interpretable by people.
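
There is also a purely arithmetic way to see it. Softmax divides the exponential of each score by the sum of the exponentials of all scores, and an exponential is always positive, so every class keeps a small non-zero share. The sketch below uses made-up scores in which one class leads the other two by a growing margin; the `softmax` helper is my own minimal version of the standard formula.

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    exps = np.exp(x - np.max(x))  # shifted for numerical stability
    return exps / np.sum(exps)

# Hypothetical scores: the middle class leads the other two by `margin`.
for margin in [1, 5, 10, 20]:
    winner = softmax([0, margin, 0])[1]
    # The winner's probability approaches 1 but stays below it.
    print(margin, winner, 1.0 - winner)
```

The gap to 1 shrinks quickly as the margin grows, but for these margins it stays visibly above zero.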

If you ever see exactly 1 as the result of the softmax function, it is most likely caused by the limitations of floating-point arithmetic, which cannot represent some values exactly in computer memory, so the result gets rounded to 1.

softmax([0, 100, 0])
# array([3.72007598e-44, 1.00000000e+00, 3.72007598e-44])
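
The rounding can be checked directly. This sketch assumes the usual 64-bit floats (NumPy's float64): the tiny share of each losing class is representable on its own, but it is far too small to change a sum that is close to 1.

```python
import numpy as np

# Unnormalized weight of a losing class relative to the winner: exp(0 - 100).
tiny = np.exp(-100.0)

print(tiny > 0.0)                # True: the value itself is representable
print(1.0 + 2 * tiny == 1.0)     # True: the sum rounds to exactly 1.0
print(np.finfo(np.float64).eps)  # about 2.22e-16, the gap between 1.0 and the next float64
```

Anything much smaller than that gap disappears when it is added to 1, which is why the winning class shows up as exactly 1.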

Bartosz Mikulski
Bartosz Mikulski * data/machine learning engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group
