Understanding the softmax activation function
The softmax activation function is used in neural networks when we want to build a multi-class classifier, that is, a model that assigns an instance to exactly one class when there are more than two possible classes.
In this article, I am going to explain the reason why we use softmax and how it works.
We use softmax as the output function of the last layer of a neural network (if the network has n layers, the n-th layer is the softmax function). This matters because the purpose of the last layer is to turn the scores produced by the network into values that humans can interpret.
To understand the softmax function, we must look at the output of the (n-1)th layer. In this layer, the values get multiplied by some weights, passed through an activation function, and aggregated into a vector that contains one value per class of the model.
For the sake of this example, we may interpret the scores produced by layer n-1 as the number of votes in a weighted voting system in which some neurons are more important than others.
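As a rough illustration of how such a score vector could be produced, here is a minimal sketch. The layer sizes, the random weights, the zero bias term, and the identity activation are all assumptions made for this example; the article does not specify any of them.

```python
import numpy as np

# Hypothetical sizes: 4 values arrive at layer n-1, the model has 3 classes.
rng = np.random.default_rng(seed=0)
inputs = rng.normal(size=4)        # values arriving at layer n-1
weights = rng.normal(size=(3, 4))  # one row of weights per class
bias = np.zeros(3)                 # assumed bias term, not mentioned in the text


def identity(z):
    # Placeholder activation; which activation is used is an assumption.
    return z


# Multiply by the weights, apply the activation, and collect one score per class.
scores = identity(weights @ inputs + bias)
print(scores.shape)
```

The resulting vector has one entry per class, and that vector is what the softmax function receives.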
Here comes the tricky part. The softmax function is closely related to the arg max function, which does not return the largest value from the input but the position of the largest value. Softmax is a smooth ("soft") version of it: instead of a single position, it returns one value between 0 and 1 for every position, and those values sum to 1.
We interpret the result of the softmax function as the probability of each class, so the softmax function works in the following way:
Given a vector of numbers (the scores from the (n-1)th layer), it returns, for every i, the probability that the largest value is the i-th element of the vector. For example, if I have an input:
X = [13, 31, 5]
and pass it to the softmax function, I am going to get:
array([1.52299795e-08, 9.99999985e-01, 5.10908895e-12])
That is a 99.9999985% probability that the largest value in the array is in the second position.
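The numbers above can be reproduced with a few lines of NumPy. This is a minimal sketch, not necessarily the implementation used to produce the output in this article; the function name `softmax` and the subtraction of the maximum (a common trick that prevents overflow without changing the result) are choices made for this example.

```python
import numpy as np


def softmax(x):
    x = np.asarray(x, dtype=float)
    # Subtracting the maximum leaves the result unchanged
    # but keeps np.exp from overflowing on large scores.
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)


X = [13, 31, 5]
probabilities = softmax(X)

print(probabilities)        # one probability per position
print(probabilities.sum())  # the probabilities add up to 1 (up to rounding)
print(np.argmax(X))         # 1, the zero-based index of the second position
```

Note that `np.argmax` returns only the position of the largest score, while `softmax` returns a value for every position.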
Why not 100%?
That may look strange. I mean, we see what the largest value is. Why didn’t it return 100%?
To understand it, we must remember that neural networks are universal function approximators. We can build a neural network that approximates the value of any mathematical function, but that is just an approximation, not an exact result. We use softmax to embrace that uncertainty and turn it into a probability interpretable by people.
If you ever see 1 as the result of the softmax function, it is most likely caused by the limitations of floating-point arithmetic, which cannot represent some values exactly in computer memory.
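Writing the function out also shows why the result cannot reach exactly 100%. The formula below is the standard definition of softmax; the symbols K (the number of classes) and x_i (the score of the i-th class) are notation introduced here, not taken from the article.

```latex
\operatorname{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}, \qquad i = 1, \dots, K
```

Because e^{x_j} is strictly positive for every real score x_j, the denominator is always larger than the numerator when there are at least two classes. Every output is therefore strictly greater than 0 and strictly less than 1, no matter how large the winning score is.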
softmax([0, 100, 0])
# array([3.72007598e-44, 1.00000000e+00, 3.72007598e-44])
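The rounding can be checked directly. The following sketch assumes 64-bit floats (the NumPy default) and uses a hypothetical `softmax` helper defined only for this example.

```python
import numpy as np


def softmax(x):
    x = np.asarray(x, dtype=float)
    exps = np.exp(x - np.max(x))  # shift by the maximum for numerical stability
    return exps / np.sum(exps)


result = softmax([0, 100, 0])

# The two small probabilities are tiny but still greater than zero.
print(result[0] > 0.0)                  # True

# Adding them to 1.0 changes nothing, because they are far below
# the spacing between 1.0 and the next representable float.
print(1.0 + 2 * np.exp(-100.0) == 1.0)  # True
print(result[1] == 1.0)                 # True

# Machine epsilon: the gap between 1.0 and the next larger 64-bit float.
print(np.finfo(float).eps)
```

The exact value of the middle element is slightly below 1, but the difference is many orders of magnitude smaller than what a 64-bit float can resolve near 1, so it is stored as exactly 1.0.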
- Data/MLOps engineer by day
- DevRel/copywriter by night
- Python and data engineering trainer
- Conference speaker
- Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
- Twitter: @mikulskibartosz