Using the Boltzmann distribution as the exploration policy in TensorFlow-Agents reinforcement learning models

In this article, I am going to show you how to use the Boltzmann policy in TensorFlow-Agents, how to configure the policy, and what results to expect from the various configuration options.

Use the Boltzmann policy with the DQN agent

When using the deep Q-network agent as our reinforcement learning model, we can configure the Boltzmann policy by specifying the boltzmann_temperature parameter in the DqnAgent constructor.

import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent

# tf_env is the environment implementation, q_net is the neural network used as the model

agent = dqn_agent.DqnAgent(
    tf_env.time_step_spec(),
    tf_env.action_spec(),
    q_network=q_net,
    epsilon_greedy=None,  # disable the default epsilon-greedy exploration
    boltzmann_temperature=0.8,  # <-- this parameter configures the Boltzmann policy
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=0.001))

It is important to remember that we cannot use both the epsilon_greedy and boltzmann_temperature parameters at the same time because they configure two mutually exclusive exploration methods. Since epsilon_greedy has a non-None default value, newer versions of TensorFlow-Agents raise a ValueError when both are set, which is why the snippet above passes epsilon_greedy=None explicitly.

In the DqnAgent code, there is the following if statement:

# DqnAgent implementation in TensorFlow-Agents
# https://github.com/tensorflow/agents/blob/a155216ded2ad151359c6f719149aacc9503b5f5/tf_agents/agents/dqn/dqn_agent.py#L285
if boltzmann_temperature is not None:
    collect_policy = boltzmann_policy.BoltzmannPolicy(
        policy, temperature=self._boltzmann_temperature)
else:
    collect_policy = epsilon_greedy_policy.EpsilonGreedyPolicy(
        policy, epsilon=self._epsilon_greedy)

We see that boltzmann_temperature is used to create the proper exploration policy object (called collect_policy in the TensorFlow-Agents code).
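To make the distinction concrete, here is a minimal sketch (assuming the agent and tf_env from the first snippet) of how the collect policy samples exploratory actions, while agent.policy remains the greedy policy used for evaluation:

# agent.collect_policy is the BoltzmannPolicy created by the constructor;
# agent.policy stays greedy and is meant for evaluation.
time_step = tf_env.reset()
action_step = agent.collect_policy.action(time_step)  # samples from the tempered softmax
time_step = tf_env.step(action_step.action)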

How does it work?

While exploring, the agent creates an action distribution. This distribution describes how optimal an action is according to the data gathered by the agent. If you want, you can say that the action distribution describes the agent’s belief about the optimal action.
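To see what such a distribution looks like, here is a minimal sketch (the Q-values are made up for illustration; in the agent, they come from the Q-network):

import tensorflow as tf
import tensorflow_probability as tfp

# Hypothetical Q-values for three actions; in the agent, they come from q_net.
q_values = tf.constant([2.0, 1.0, 0.5])

# A categorical distribution with the Q-values as logits: the higher the
# Q-value, the stronger the agent's belief that the action is optimal.
action_distribution = tfp.distributions.Categorical(logits=q_values)
print(action_distribution.probs_parameter().numpy())  # ~[0.63 0.23 0.14]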

In the Boltzmann policy implementation, the logits of the original action distribution get divided by the temperature parameter. Because of that, the Boltzmann policy turns the agent's exploration behavior into a spectrum between picking an action uniformly at random (random policy) and always picking the most optimal action (greedy policy).

# BoltzmannPolicy implementation in TensorFlow-Agents
# https://github.com/tensorflow/agents/blob/a155216ded2ad151359c6f719149aacc9503b5f5/tf_agents/policies/boltzmann_policy.py#L67
def _apply_temperature(self, dist):
    """Change the action distribution to incorporate the temperature."""
    logits = dist.logits / self._get_temperature_value()
    return dist.copy(logits=logits)

If we specify a very small temperature value, dividing by it amplifies the differences between the logits, so the action with the highest probability becomes even more likely to be selected. When the temperature is very close to zero, the Boltzmann policy effectively turns into a greedy policy because the most probable action gets selected virtually all the time.

On the other hand, a huge temperature value shrinks all the logits towards zero and flattens the original action distribution. As a result, there are almost no differences between the action probabilities, and we end up with a nearly random policy.
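A quick numeric sketch (reusing the made-up Q-values from above) shows both extremes of the spectrum:

import tensorflow as tf

q_values = tf.constant([2.0, 1.0, 0.5])

for temperature in [0.1, 1.0, 100.0]:
    probs = tf.nn.softmax(q_values / temperature).numpy()
    print(f'temperature={temperature}: {probs}')

# temperature=0.1:   [~1.00 ~0.00 ~0.00]  -> almost greedy
# temperature=1.0:   [ 0.63  0.23  0.14]  -> the original distribution
# temperature=100.0: [ 0.33  0.33  0.33]  -> almost uniform (random policy)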
