Using Boltzmann distribution as the exploration policy in TensorFlow-agent reinforcement learning models

In this article, I am going to show you how to use Boltzmann policy in TensorFlow-Agent, how to configure the policy, and what is the expected result of various configuration options.

Use Boltzmann policy with DQN Agent

While using the deep Q-network agent as our reinforcement learning model, we can easily configure Boltzmann policy by specifying the boltzmann_temperature parameter in the DQNAgent constructor.

1
2
3
4
5
6
7
8
9
10
from tf_agents.agents.dqn import dqn_agent

#tf_env is the environment implementation, q_network is the neural network used as the model

agent = dqn_agent.DqnAgent(
    tf_env.time_step_spec(),
    tf_env.action_spec(),
    q_network=q_net,
    boltzmann_temperature = 0.8, #<-- this parameter configures Boltzmann policy
    optimizer=tf.train.AdamOptimizer(0.001))

It is important to remember that we cannot use both epsilon_greedy and boltzmann_temperature parameters at the same time because those are two different exploration methods and cannot be used at the same time.

In the DQNAgent code, there is the following if statement:

1
2
3
4
5
6
7
8
# DQNAgent implementation in Tensorflow-Agents
# https://github.com/tensorflow/agents/blob/a155216ded2ad151359c6f719149aacc9503b5f5/tf_agents/agents/dqn/dqn_agent.py#L285
if boltzmann_temperature is not None:
      collect_policy = boltzmann_policy.BoltzmannPolicy(
          policy, temperature=self._boltzmann_temperature)
else:
    collect_policy = epsilon_greedy_policy.EpsilonGreedyPolicy(
        policy, epsilon=self._epsilon_greedy)

We see that the boltzmann_temperature is used to create the proper exploration policy object (called collect_policy in Tensorflow-Agent code).

How does it work

While exploring, the agent creates an action distribution. This distribution describes how optimal an action is according to the data gathered by the agent. If you want, you can say that the action distribution describes the agent’s belief about the optimal action.

In the Boltzmann policy implementation, the original action distribution gets divided by the temperature parameter. Because of that, Boltzmann policy turns the agent’s exploration behavior into a spectrum between picking the action randomly (random policy) and always picking the most optimal action (greedy policy).

1
2
3
4
5
6
# BoltzmannPolicy implementation in Tensorflow-Agents
# https://github.com/tensorflow/agents/blob/a155216ded2ad151359c6f719149aacc9503b5f5/tf_agents/policies/boltzmann_policy.py#L67
def _apply_temperature(self, dist):
    """Change the action distribution to incorporate the temperature."""
    logits = dist.logits / self._get_temperature_value()
    return dist.copy(logits=logits)

If we specify a very small temperature value, the differences between original action probabilities become more substantial, so the action with the highest probability is even more likely to be selected. If the temperature parameter is very close to zero, it turns the Boltzmann policy into a greedy policy because the most probable action gets selected all the time.

On the other hand, a huge value of the temperature parameter dominates the original action distribution. As a result, there are almost no differences between probabilities, and we end up with a random policy.


Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you watch programming live streams, check out my YouTube channel.
You can also follow me on Twitter: @mikulskibartosz

If you want to hire me, send me a message on LinkedIn or Twitter.


Bartosz Mikulski
Bartosz Mikulski * big data engineer * conference speaker * co-founder of Software Craftsmanship Poznan & Poznan Scala User Group